A New, Transparent AI Tool May Help Detect Blood Poisoning


Ten years ago, 12-year-old Rory Staunton dove for a ball in gym class and scraped his arm. He woke up the next day with a 104 F fever, so his parents took him to the pediatrician and eventually the emergency room. It was just the stomach flu, they were told. Three days later, Rory died of sepsis after bacteria from the scrape infiltrated his blood and triggered organ failure.

“How does that happen in a modern society?” his father, Ciaran Staunton, said in a recent interview with Undark.

Each year in the United States, sepsis kills over a quarter million people — more than stroke, diabetes, or lung cancer. One reason for all this carnage is that sepsis isn’t well understood, and if not detected in time, it’s essentially a death sentence. Consequently, much research has focused on catching sepsis early, but the disease’s complexity has plagued existing clinical support systems — electronic tools that use pop-up alerts to improve patient care — with low accuracy and high rates of false alarm.

That may soon change. Back in July, Johns Hopkins researchers published a trio of studies in Nature Medicine and npj Digital Medicine, showcasing an early warning system that uses artificial intelligence. The system caught 82 percent of sepsis cases and reduced deaths by nearly 20 percent. While AI — in this case, machine learning — has long promised to improve health care, most studies demonstrating its benefits have been conducted on historical datasets. Sources told Undark that, to the best of their knowledge, when used on patients in real-time, no AI algorithm has shown success at scale. Suchi Saria, director of the Machine Learning and Health Care Lab at Johns Hopkins University and senior author of the studies, said the novelty of this research is how “AI is implemented at the bedside, used by thousands of providers, and where we’re seeing lives saved.”

The Targeted Real-time Early Warning System, or TREWS, scans through hospitals’ electronic health records — digital versions of patients’ medical histories — to identify clinical signs that predict sepsis, alert providers about at-risk patients, and facilitate early treatment. Leveraging vast amounts of data, TREWS provides real-time patient insights and a unique level of transparency into its reasoning, according to study co-author and Johns Hopkins internal medicine physician Albert Wu.

Wu said that this system also offers a glimpse into a new age of medical electronization. Since their introduction in the 1960s, electronic health records have reshaped how physicians document clinical information, but decades later, these systems primarily serve as “an electronic notepad,” he added. With a series of machine learning projects on the horizon, both from Johns Hopkins and other groups, Saria said that using electronic records in new ways could transform health care delivery, providing physicians with an extra set of eyes and ears and helping them make better decisions.

It’s an enticing vision, but one in which Saria, as CEO of the company developing TREWS, has a financial stake. This vision also discounts the difficulties of implementing any new medical technology: Providers might be reluctant to trust machine learning tools, and these systems might not work as well outside controlled research settings. Electronic health records also come with many existing problems, from burying providers under administrative work to risking patient safety because of software glitches.

Saria is nonetheless optimistic. “The technology exists, the data is there,” she said. “We really need high quality care augmentation tools that will allow providers to do more with less.”

Currently, there’s no single test for sepsis, so health care providers have to piece together their diagnoses by reviewing a patient’s medical history, conducting a physical exam, running tests, and relying on their own clinical impressions. Given such complexity, over the past decade doctors have increasingly leaned on electronic health records to help diagnose sepsis, mostly by employing rules-based criteria: if this, then that.

One such example, known as the SIRS criteria, says a patient is at risk of sepsis if two of four clinical signs — body temperature, heart rate, breathing rate, white blood cell count — are abnormal. This broadness, while helpful for catching the various ways sepsis might present itself, triggers countless false positives. Take a patient with a broken arm. “A computerized system might say, ‘Hey look, fast heart rate, breathing fast.’ It might throw an alert,” said Cyrus Shariat, an ICU physician at Washington Hospital in California. The patient almost certainly doesn’t have sepsis but would nonetheless trip the alarm.
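
To make the two-of-four rule concrete, here is a minimal Python sketch. The numeric cutoffs are the commonly cited SIRS thresholds, not figures from this article, and the `Vitals` class, the function names, and the example patient's readings are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Vitals:
    temp_c: float        # body temperature, degrees Celsius
    heart_rate: int      # beats per minute
    resp_rate: int       # breaths per minute
    wbc_k_per_ul: float  # white blood cells, thousands per microliter


def sirs_flags(v: Vitals) -> dict:
    """Mark each of the four signs as abnormal or not.

    Cutoffs are the commonly cited SIRS thresholds (an assumption here;
    the article itself does not list them).
    """
    return {
        "temperature": v.temp_c > 38.0 or v.temp_c < 36.0,
        "heart_rate": v.heart_rate > 90,
        "resp_rate": v.resp_rate > 20,
        "white_cells": v.wbc_k_per_ul > 12.0 or v.wbc_k_per_ul < 4.0,
    }


def sirs_alert(v: Vitals) -> bool:
    """Fire an alert when at least two of the four signs are abnormal."""
    return sum(sirs_flags(v).values()) >= 2


# Hypothetical patient in pain from a broken arm: normal temperature and
# white cell count, but a fast heart rate and fast breathing.
broken_arm = Vitals(temp_c=37.0, heart_rate=110, resp_rate=24, wbc_k_per_ul=8.0)
print(sirs_alert(broken_arm))  # True
```

The rule only counts abnormal signs and knows nothing about why they are abnormal, which is how a patient like the one Shariat describes can trip the alarm.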

These alerts also appear on providers’ computer screens as a pop-up, which forces them to stop whatever they’re doing to respond. So, despite these rules-based systems occasionally reducing mortality, there’s a risk of alert fatigue, where health care workers start ignoring the flood of irritating reminders. According to M. Michael Shabot, a trauma surgeon and former chief clinical officer of Memorial Hermann Health System, “it’s like a fire alarm going off all the time. You tend to be desensitized. You don’t pay attention to it.”

Already, electronic records aren’t particularly popular among doctors. In a 2018 survey, 71 percent of physicians said that the records greatly contribute to burnout and 69 percent that they take valuable time away from patients. Another 2016 study found that, for every hour spent on patient care, physicians have to devote two extra hours to electronic health records and desk work. James Adams, chair of the Department of Emergency Medicine at Northwestern University, called electronic health records a “congested morass of information.”

But Adams also said that the health care industry is at an inflection point to transform the files. An electronic record doesn’t have to simply involve a doctor or nurse putting data in, he said, but instead “needs to transform to be a clinical care delivery tool.” With their universal deployment and real-time patient data, electronic records could warn providers about sepsis and various other conditions, but that’ll require more than a rules-based approach.

What doctors need, according to Shabot, is an algorithm that can integrate various streams of clinical information to offer a clearer, more accurate picture when something’s wrong.

Machine learning algorithms work by looking for patterns in data to predict a particular outcome, like a patient’s risk of sepsis. Researchers train the algorithms on existing datasets, which helps the algorithms create a model for how that world works and then make predictions on new datasets. The algorithms can also actively adapt and improve over time, without the interference of humans.

TREWS follows this general mold. The researchers first trained the algorithm on historical electronic records data of 175,000 patient encounters, so it could recognize early signs of sepsis. After this testing showed that TREWS could have identified patients with sepsis hours before they actually got treatment, the algorithm was deployed inside hospitals to influence patient care in real-time.

Saria and Wu published three studies around TREWS. The first tried to determine how accurate the system was, whether providers would actually use it, and if use led to earlier sepsis treatment. The second went a step further to see if using TREWS actually reduced patients’ mortality. And the third described what 20 providers who tested the tool thought about machine learning, including what factors facilitate versus hinder trust.

In these studies, TREWS monitored patients in the emergency department and inpatient wards, scanning through their data, including vital signs, lab results, medications, clinical histories, and provider notes, for early signals of sepsis. (Providers could do this themselves, Saria said, but it might take them about 20 to 40 minutes.) If the system suspected organ dysfunction, based on its analysis of millions of other data points, it flagged the patient and prompted providers to confirm sepsis, dismiss the alert, or temporarily pause it.

“This is a colleague telling you, based upon data and having reviewed all this person’s chart, why they believe there’s reason for concern,” Saria said. “We very much want our frontline providers to disagree because they have ultimately their eyes on the patient.” And TREWS continuously learns from these providers’ feedback. Such real-time improvements, as well as the diversity of data TREWS considers, are what distinguish it from other electronic records tools for sepsis.

In addition to these functional differences, TREWS doesn’t alert providers with incessant pop-up boxes. Instead, the system uses a more passive approach, with alerts arriving as icons on the patient list that providers can click on later. Initially, Saria was worried that this might be too passive: “Providers aren’t going to listen. They’re not going to agree. You’re mostly going to get ignored.” Instead, clinicians responded to 89 percent of the system’s alerts. As the third study revealed via in-depth interviews, TREWS was seen as less “irritating” than the previous rules-based system.

Saria said that TREWS’ high adoption rate shows that providers will trust AI tools, but Fei Wang, an associate professor of health informatics at Cornell University, is more skeptical about how these findings will hold up if TREWS is deployed more broadly. Although he called these studies first-of-a-kind and said he thinks their results are encouraging, he noted that providers can be conservative and reluctant to change: “It’s just not easy to convince physicians to use another tool they are not familiar with,” Wang said. Any new system is a burden until proven otherwise. Trust takes time.

TREWS is further limited because it only knows what’s been inputted into the electronic health record — the system is not actually at the patient’s bedside. As one emergency department physician put it in an interview for the third study, the system “can’t help you with what it can’t see.” And even what it can see is filled with missing, faulty, and out-of-date data, according to Wang.

But Saria said that TREWS’ strengths and limitations are complementary to those of health care providers. While the algorithm can analyze massive amounts of clinical data in real-time, it will always be limited by the quality and comprehensiveness of the electronic health record. The goal, Saria added, is not to replace physicians, but to partner with them and augment their capabilities.

The most impressive aspect of TREWS, according to Zachary Lipton, an assistant professor of machine learning and operations research at Carnegie Mellon University, was not the model’s novelty, but the effort it must have taken to deploy it across five hospitals and 2,000 providers over a two-year period. “In this area, there is a tremendous amount of offline research,” Lipton said, but relatively few studies “actually make it to the level of being deployed widely in a major health system.” It’s so difficult to perform “in the wild” research like this, he added, because it requires collaborations across various disciplines, from product designers to systems engineers to administrators.

As such, by demonstrating how well the algorithm worked in a large clinical study, TREWS has joined an exclusive club. But this uniqueness may be fleeting. As one example, Duke University’s Sepsis Watch algorithm is currently being tested across three hospitals, with results forthcoming. In contrast with TREWS, Sepsis Watch uses a type of machine learning called deep learning. While this can provide more powerful insights, how the deep learning algorithm comes to its conclusions is unexplainable — a situation that computer scientists call the black box problem. The inputs and outputs are visible, but the process in between is impenetrable.

On one hand, there’s the question of whether this is really a problem. For instance, doctors don’t always know how drugs work, Adams said, “but at some point, we have to trust what the medicine is doing.” Lithium, for example, is a widely used, effective treatment for bipolar disorder, but nobody really understands how it works. If an AI system is similarly useful, maybe interpretability doesn’t matter.

Wang suggested that it’s a dangerous conclusion. “How can you confidently say your algorithm is accurate?” he asked. After all, it’s difficult to know anything for sure when a model’s mechanics are a black box. That’s why TREWS, as a simpler algorithm that can explain itself, might be a more promising approach. “If you have this set of rules,” Wang said, “people can easily validate that everywhere.”

Indeed, providers trusted TREWS largely because they could see the measurements used to arrive at its alert. Of the clinicians interviewed, none fully understood machine learning, but that level of comprehension wasn’t necessary. As one provider who used TREWS said, “the extent that I can see all of the factors that are playing a role in the decision, that’s helpful to me to trust it. I don’t think my understanding has to go beyond that.”

In machine learning, while the specific algorithmic design is important, the results have to speak for themselves. By catching 82 percent of sepsis cases and reducing time to antibiotics by 1.85 hours, TREWS reduced patient deaths by nearly one-fifth. “This tool is number one very good, number two received well by clinicians, and number three impacts mortality,” Adams said. “That combination makes it very special.”

Shariat, on the other hand, the ICU physician at Washington Hospital in California, was more cautious about these findings. For one, these studies only compared patients with sepsis who had the TREWS alert confirmed within 3 hours versus those who didn’t. “They’re just telling us that this alert system that we’re studying is more effective if someone responds to it,” Shariat said. A more robust approach would have been to conduct a randomized controlled trial, the gold standard of medical research, where half of patients got TREWS in their electronic record while the other half didn’t. Saria said that randomization would have been difficult to do given patient safety concerns, and Shariat agreed. Even so, he said that the absence “makes the data less rigorous.”

Shariat also worries that the sheer volume of alerts, with about two out of three being false positives, might still contribute to alert fatigue — and potentially overtreatment with fluids and antibiotics, which can lead to serious medical complications like pulmonary edema and antibiotic resistance. Saria acknowledged that TREWS’ false positive rate, while lower than existing electronic health record systems, could certainly improve, but said it will always be critical for clinicians to continue to use their own judgment.
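
The alert burden behind that two-out-of-three figure can be worked through with a little arithmetic. The sketch below is illustrative only: the batch of 300 alerts is a made-up number, not a figure from the studies, and the function name is hypothetical.

```python
from fractions import Fraction


def alert_breakdown(total_alerts, false_positive_share):
    """Split a batch of alerts into true and false alarms.

    Returns (true alerts, false alerts, alerts reviewed per true case).
    """
    false_alerts = total_alerts * false_positive_share
    true_alerts = total_alerts - false_alerts
    alerts_per_true_case = total_alerts / true_alerts
    return true_alerts, false_alerts, alerts_per_true_case


# Hypothetical batch of 300 alerts, with two out of three being false
# positives, as described above.
true_alerts, false_alerts, per_case = alert_breakdown(300, Fraction(2, 3))
print(true_alerts, false_alerts, per_case)  # 100 200 3
```

Under that assumption, a clinician would review about three alerts for every one that turns out to be a true alarm, which is the workload Shariat is worried about.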

The studies also have a conflict of interest: Saria is entitled to revenue distribution from TREWS, as is Johns Hopkins. “If this goes primetime and they sell it to every hospital, there’s so much money,” Shariat said. “It’s billions and billions of dollars.”

Saria maintained that these studies went through rigorous internal and external review processes to manage conflicts of interest and that the vast majority of study authors don’t have a financial stake in this research. Regardless, Shariat said it will be crucial to have independent validation to confirm these findings and ensure the system is truly generalizable.

The Epic Sepsis Model, a widely used algorithm that scans through electronic records but doesn’t use machine learning, is a cautionary example here, according to David Bates, chief of general internal medicine at Brigham and Women’s Hospital. He explained how the model was developed at a couple of hospitals with promising results before being deployed at hundreds of others. The model then deteriorated, identifying only 33 percent of patients with sepsis and having an 88 percent false positive rate. “You can’t really predict how much the performance is going to degrade,” Bates said, “without actually going and looking.”

Despite the potential drawbacks, Orlaith Staunton, Rory’s mother, told Undark that TREWS could have saved her son’s life. “There was complete breakdown in my son’s situation,” she said, with none of his clinicians considering sepsis until it was too late. An early warning system that alerted them about the condition, she added, “would make the world of difference.”

After Rory’s death, the Stauntons started the nonprofit End Sepsis to ensure that no other family would have to go through their pain. Because of their efforts, New York State mandated that hospitals develop sepsis protocols, and the CDC declared sepsis a medical emergency. But none of this will ever bring back Rory, Ciaran Staunton said: “We will never be happy again.”

This research is personal for Saria as well. Almost a decade ago, her nephew died of sepsis. By the time it was discovered, there was nothing his doctors could do. “It all happened too quickly, and we lost him,” she said. That’s precisely why early detection is so important. Life and death can be mere minutes away. “Last year, we flew helicopters on Mars,” Saria said, “but we’re still freaking killing patients every day.”

Simar Bajaj studies the history of science at Harvard University and is a research fellow at Stanford and Massachusetts General Hospital.