Medical Advice From a Bot: The Unproven Promise of Babylon Health
Hamish Fraser first encountered Babylon Health in 2017, when he and a colleague helped Wired U.K. test the accuracy of several artificial intelligence-powered symptom checkers meant to offer medical advice to anyone with a smartphone. Among the competitors, Babylon’s symptom checker performed worst in identifying common illnesses, including asthma and shingles. Fraser, then a health informatics expert at the University of Leeds in England, figured that the company would need to improve vastly to stick around.
“At that point I had no prejudice or knowledge of any of them, so I had no axe to grind, and I thought ‘Oh that’s not really good,’” says Fraser, now at Brown University. “I thought they would disappear, right? How wrong I was.”
Much has changed since the Wired U.K. article came out. Since early 2018, the London-based Babylon Health has grown from just 300 employees to approximately 1,500. The company has a valuation of more than $2 billion and says it wants to “put an affordable and accessible health service in the hands of every person on earth.” In England, Babylon operates the fifth-largest practice under the country’s mostly government-funded National Health Service, allowing patients near London and Birmingham to video chat with doctors or be seen in a clinic if necessary. The company claims to have processed 700,000 digital consultations between patients and physicians, with plans to offer services in other U.K. cities in the future.
Babylon promises to save money on rising health care costs by using AI to filter patients so that only those who need medical attention will take up time and resources. Between Babylon’s work in England and its overseas ventures, the company says its symptom checker has been used more than 1.7 million times by people in England, the European Union, Canada, Southeast Asia, and Saudi Arabia. Babylon is also looking to expand more broadly soon, including into the United States and China.
Rapid expansion could be a problem because “this kind of tech — and not just symptom checkers but other digital interventions — you can spin them up and change them very quickly,” says David Wong, a lecturer in AI in health care at the University of Manchester in England, who worked with Fraser on the Wired U.K. test. “But the potential effects that they have are huge,” Wong says, and Babylon in particular is “an example of a company where they’ve had a very big effect very quickly.”
Such speedy deployment has raised serious concerns among experts who say Babylon Health has rushed to market without adequate proof that its products work. So far, there are no peer-reviewed, randomized control studies (the gold standard for evidence in medical science) showing how the AI performs in the real world on real patients. Yet Babylon’s symptom checker is already affecting thousands of people daily, with the approval of government regulators in countries where it’s offered.
“They have managed to be commissioned by the NHS to do this job without ever having to test the product on real patients and without any independent scrutiny, and yet this seems to be OK for regulators,” says Margaret McCartney, a general practitioner in Glasgow, Scotland and a Babylon critic. “I think it’s staggering.”
Babylon Health says it has satisfied NHS requirements and already meets or exceeds regulations in each country where its technology is used. The company also says it’s enlisting university researchers to help start randomized control studies. “We have taken important steps on testing and validating the safety and efficacy of the technology,” says Keith Grimes, clinical innovation director at Babylon Health. “It’s just that hasn’t been in the form of clinical trials yet.”
Babylon Health’s symptom checker appears as a chatbot that users can interact with through an app or website. When the user types out their main symptoms as a brief sentence or phrase, the symptom checker asks questions about possible related symptoms. At the end, the symptom checker identifies possible causes and recommends a course of action, such as booking a video consultation with a human physician or going to a hospital.
The symptom checker’s underlying technology is known as a knowledge graph, which functions like a digital encyclopedia of medicine that maps out relationships among various diseases, symptoms, and conditions. The relationships are represented by millions of data points from hundreds of medical sources that are continually updated. The symptom checker can also consult health records — including medical history and data gathered through interactions with Babylon — to map out possible connections among different users’ health conditions.
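To make the idea of a knowledge graph concrete, here is a minimal sketch in Python. It is not Babylon’s implementation: the `CAUSES` table, the function names, and the simple overlap count are all assumptions made for illustration, and the entries are placeholders rather than medical data. It shows only how links between conditions and symptoms can be followed to propose candidate causes and related follow-up questions.

```python
from collections import defaultdict

# Hypothetical toy graph: condition -> symptoms it can cause.
# Illustrative placeholders only, not medical data.
CAUSES = {
    "asthma": {"wheezing", "shortness of breath", "cough"},
    "shingles": {"rash", "burning pain", "fever"},
    "influenza": {"fever", "cough", "muscle aches"},
}


def build_symptom_index(causes):
    """Invert the graph: symptom -> set of conditions linked to it."""
    index = defaultdict(set)
    for condition, symptoms in causes.items():
        for symptom in symptoms:
            index[symptom].add(condition)
    return index


def candidate_conditions(reported, causes):
    """Rank conditions by how many reported symptoms they are linked to."""
    index = build_symptom_index(causes)
    counts = defaultdict(int)
    for symptom in reported:
        for condition in index.get(symptom, ()):
            counts[condition] += 1
    # Highest overlap first; ties broken alphabetically for stable output.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def follow_up_symptoms(reported, causes):
    """Symptoms linked to any candidate condition but not yet reported."""
    reported = set(reported)
    related = set()
    for condition, _ in candidate_conditions(reported, causes):
        related |= causes[condition] - reported
    return sorted(related)
```

Calling `candidate_conditions({"fever", "cough"}, CAUSES)` ranks the toy "influenza" entry first because it is linked to both reported symptoms, while `follow_up_symptoms` lists the related symptoms a chatbot might ask about next. A real system would weight these links with probabilities rather than counting overlaps.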
The knowledge graph can be further tailored by adding data that helps assess the likelihood of different health conditions across populations and geographic locations.
Babylon creates “a model of medicine which is not just applicable for the United Kingdom and the United States, but globally,” says Saurabh Johri, chief scientist at Babylon Health. The model, he adds, must be adjustable “to reflect the local burden of disease, so if a patient presents with vomiting, fever, and diarrhea in London, it’s less likely that they have malaria than if we’re in Rwanda.”
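Johri’s malaria example can be read as an application of Bayes’ rule: the same symptoms point to different causes depending on how common each cause is locally. The sketch below is a hedged illustration of that reasoning, not the company’s model. The `posterior` helper, the region labels, and every number are made-up assumptions chosen only to show the effect of changing the prior.

```python
def posterior(priors, likelihoods):
    """Bayes' rule over mutually exclusive candidate causes.

    priors:      cause -> P(cause) in the local population
    likelihoods: cause -> P(observed symptoms | cause)
    """
    unnormalized = {c: priors[c] * likelihoods[c] for c in priors}
    total = sum(unnormalized.values())
    return {c: v / total for c, v in unnormalized.items()}


# P(vomiting, fever, diarrhea | cause): assumed identical in every region.
# Made-up numbers for illustration only.
LIKELIHOODS = {"malaria": 0.6, "gastroenteritis": 0.3}

# P(cause) before any symptoms are known: this is what differs by region.
# Made-up numbers for illustration only.
PRIORS = {
    "low_malaria_region": {"malaria": 0.001, "gastroenteritis": 0.999},
    "high_malaria_region": {"malaria": 0.2, "gastroenteritis": 0.8},
}

for region, priors in PRIORS.items():
    result = posterior(priors, LIKELIHOODS)
    print(region, {cause: round(p, 4) for cause, p in result.items()})
```

With these assumed inputs, the probability of malaria stays far below one percent in the low-prevalence region but rises to about one in three in the high-prevalence region, even though the symptoms and likelihoods are identical. Only the local prior changed.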
Many other health companies use the popular AI technique of machine learning, and the subset deep learning, to train computer software to analyze patient data to detect symptoms and possibly diagnose patients. By sifting through huge amounts of raw health data, such techniques can often teach the system to find hidden patterns and connections among data points that humans — and human medical knowledge — might fail to recognize. Babylon’s approach is different because its AI assessments directly reflect existing human medical knowledge and human understanding of the relationships between symptoms and their causes, as opposed to relying upon the powerful but sometimes inscrutable machine-driven approach.
So far, Babylon makes some use of deep learning to help interpret patients’ messages to the chatbot. The Babylon AI also uses deep learning to speed up the computationally demanding task of searching the knowledge graph for all possible combinations of symptoms, diseases, and risk factors that could match each patient case. But overall, the Babylon AI draws primarily upon human medical knowledge rather than trusting machine reasoning to connect all the dots.
Babylon’s approach could have some advantages. One downside to machine learning and deep learning: They require huge amounts of relevant training data and massive computing power to learn patterns. Depending on the country and health care system, it’s not necessarily easy to access all the relevant health data needed to train the computer software. Another downside is transparency. Machine learning techniques often leave human experts clueless about exactly how the software made connections between various data points.
Babylon’s approach, on the other hand, may have an easier time when it comes to transparency. The company uses models that allow its team of clinicians and computer engineers to “look under the hood,” Johri says, and understand the symptom checker’s line of reasoning.
Despite these advantages, Babylon Health hasn’t exactly proven itself a model of responsible behavior. The company has tried to silence critics through legal action, and it received a reprimand from U.K. regulators for promoting “misleading” advertising. In interviews with Wired U.K. and Forbes, former Babylon employees painted a picture of a corporate culture at odds with the deliberate approach necessary for rigorously testing the safety and efficacy of AI in health care. According to Forbes: “Interviews with current and former Babylon staff and outside doctors reveal broad concerns that the company has rushed to deploy software that has not been carefully vetted, then exaggerated its effectiveness.” (The company has pushed back strongly against those claims.)
Babylon Health has also stirred public controversy with claims that critics have described as misleading. On June 27, 2018, during a livestreamed presentation held at the Royal College of Physicians in London, the company catapulted into the media spotlight by claiming its AI could diagnose common diseases as well as human physicians. Those claims relied on a company study that tested Babylon’s symptom checker against medical evaluations from a group of seven physicians. Babylon’s study also tested the symptom checker on parts of the Membership of the Royal College of General Practitioners exam, which is required for U.K. physician certification, and checked performance against historical results from an independent 2015 study that evaluated several symptom checkers.
But academics and medical organizations soon raised red flags. The Royal College of General Practitioners, the British Medical Association, and the Royal College of Physicians all issued statements questioning Babylon’s claims, even though the Royal College of Physicians hosted the company’s presentation and helped with the study. For starters, the study only tested the AI on parts of the official medical exam and did not include any testing in clinical settings with real patients.
Fraser and Wong, the researchers who helped Wired U.K. test symptom checkers in 2017, also questioned the study, which tested just a handful of doctors and didn’t go through a review by any independent experts. The two decided to take a closer look. In a paper published in 2018 in The Lancet, they concluded that Babylon’s study doesn’t offer convincing evidence that its symptom checker “can perform better than doctors in any realistic situation, and there is a possibility that it might perform significantly worse.”
The findings have real-world implications for patients. “If a symptom checker tells you to stay home and not see the doctor, that is a consequential decision if it means necessary care is delayed or never received,” says Enrico Coiera, director of the Center for Health Informatics at Macquarie University in Sydney, Australia, and an author on the 2018 Lancet paper.
Even before that wave of criticism, Babylon had begun preliminary talks with Stanford University about an additional pilot study, says Megan Mahoney, a Stanford clinical researcher who co-authored Babylon’s 2018 paper.
“It looks like the AI actually might have some promise,” Mahoney says, explaining, “we really have the responsibility to take it to the next step of rigor in evaluating it, because it actually could be helpful in augmenting and supporting access to care.”
Mahoney described Babylon’s 2018 paper as “great for an internal validation study.” Despite her optimism, she cautioned that she would never think about integrating such AI into a real-world health care setting or medical practice based on the Babylon study alone.
When Undark asked about the 2018 study controversy, Babylon responded with a statement that read, in part: “Some media outlets may have misinterpreted what was claimed but we stand by our original science and results.” The statement also described the 2018 test as a “preliminary piece of work” that pitted the company’s AI against a “small sample of doctors.” And Babylon also referred to the study’s conclusion: “Further studies using larger, real-world cohorts will be required to demonstrate the relative performance of these systems to human doctors.”
Even Babylon acknowledges the preliminary study does not meet the gold standard of evidence for a medical study. But that hasn’t stopped the company — or regulators — from allowing patients to use the symptom checker.
The approach is akin to trying a new drug without testing it properly, says Isaac Kohane, a biomedical informatics researcher at Harvard Medical School. Computing, he adds, “may be the big 21st-century drug — let’s treat it just as responsibly.”
If Babylon does follow through with randomized control trials, Fraser says it could go a long way toward building trust as the company expands into American and Asian health care markets. The company plans to submit a trial protocol for publication in a peer-reviewed journal in the next several months, according to Johri, who adds: “We’ll be conducting those trials in the U.K. We’re in conversations with partners in China and the U.S.”
Current U.S. Food and Drug Administration guidelines suggest the agency will exercise “enforcement discretion” toward AI symptom checkers like Babylon’s because they pose a relatively lower risk to the public than other medical devices. The FDA has “chosen to exempt symptom checkers — and some similar interventions for whatever reason — to encourage innovation,” Fraser says. “But they appear to have the power to regulate these a lot more if they wanted to.”
For now, some independent experts continue to raise concerns about the current version of Babylon’s symptom checker. In early September, an anonymous NHS consultant, who frequently critiques Babylon Health under the Twitter pseudonym Dr. Murphy, demonstrated a case of possible gender bias in Babylon’s symptom checker.
For a 59-year-old female smoker who complained about sudden chest pain and nausea, the symptom checker suggests either depression or a panic attack as the possible non-emergency causes. For an identical patient profile labeled as male, the symptom checker raises the additional possibilities of serious heart problems, with the recommendations of either going to the emergency room or calling an ambulance.
Rather than hitting back as it has done in the past, Babylon adopted a conciliatory tone in its Twitter response to the critique. And in a follow-up blog post about the controversy, Babylon acknowledged bias in health care as an issue but defended the performance of its symptom checker.
That has left Dr. Murphy unconvinced about the company’s willingness to address potential issues with its AI: “The most dangerous doctor is the one who fails to recognize or learn from their mistakes.”
Jeremy Hsu is a freelance journalist based in New York City. He frequently writes about science and technology for Scientific American, IEEE Spectrum, and NBC News, among other publications.
The issue is not so much AI-directed diagnosis as the practice of modern medicine itself, which over the years has become acutely symptomatic in its approach to and understanding of human health. Such symptomatic, increasingly empirical and atomised practice is what you get from most doctors, who learn medicine through a less embodied and more discrete understanding of the human body, seen as the sum of its parts or even merely as its parts. Such particularistic and super-specialised practice emerges from a flawed epistemic premise, which quantifies well-being or poor health in terms of a physiological behaviourism that can be measured and grasped through standard protocols and checklists. Such reductive practice by doctors and in hospitals using SOPs then lends itself to automated diagnosis, which can logically be outsourced to machines that are increasingly well programmed with algorithmic software. These are the perils of an instrumentalised practice of science in general, let alone medicine.