AI chatbots may seem medical-book smart, but their grades falter when interacting with real people.

In the lab, AI chatbots could identify medical issues with 95 percent accuracy and correctly recommend actions such as calling a doctor or going to urgent care more than 56 percent of the time. When humans conversationally presented medical scenarios to the AI chatbots, things got messier. Accuracy dropped to less than 35 percent for diagnosing the condition and about 44 percent for identifying the right action, researchers report February 9 in Nature Medicine.

The drop in chatbots’ performance between the lab and real-world conditions indicates “AI has the medical knowledge, but people struggle to get useful advice from it,” says Adam Mahdi, a mathematician who runs the University of Oxford Reasoning with Machines Lab that conducted the study. 

To test the bots’ accuracy in making diagnoses in the lab, Mahdi and colleagues fed scenarios describing 10 medical conditions to the large language models (LLMs) GPT-4o, Command R+ and Llama 3. They tracked how well each chatbot diagnosed the problem and advised what to do about it.

Then the team randomly assigned almost 1,300 study volunteers to feed the crafted scenarios to one of those LLMs or to use some other method to decide what to do in that situation. Volunteers were also asked why they reached their conclusion and what they thought the medical problem was. Most people who didn’t use chatbots plugged symptoms into Google or other search engines. Participants using chatbots performed worse not only than the chatbots assessing the scenarios alone in the lab but also than participants using search tools. Those who consulted Dr. Google diagnosed the problem more than 40 percent of the time, compared with an average of about 35 percent for those who used bots. That’s a statistically meaningful difference, Mahdi says.

The AI chatbots were state of the art in late 2024, when the study was done, and so accurate that there was little room left to improve their medical knowledge. “The problem was interaction with people,” Mahdi says.

In some cases, chatbots provided incorrect, incomplete or misleading information. But mostly the problem seems to be the way people engaged with the LLMs. People tend to dole out information slowly, instead of giving the whole story at once, Mahdi says. And chatbots can be easily distracted by irrelevant or partial information. Participants sometimes ignored chatbot diagnoses even when they were correct.

Small changes in the way people described the scenarios made a big difference in the chatbot’s response. For instance, two people described a subarachnoid hemorrhage, a type of stroke in which blood floods the space between the brain and the tissues that cover it. Both participants told GPT-4o about headaches, light sensitivity and stiff necks. One volunteer said they’d “suddenly developed the worst headache ever,” prompting GPT-4o to correctly advise seeking immediate medical attention.

Another volunteer called it a “terrible headache.” GPT-4o suggested that person might have a migraine and should rest in a dark, quiet room — a recommendation that might kill the patient.

Why subtle changes in the description so dramatically changed the response isn’t known, Mahdi says. It’s part of AI’s black box problem, in which even a model’s creators can’t follow its reasoning.

Results of the study suggest that “none of the tested language models were ready for deployment in direct patient care,” Mahdi and colleagues say.

Other groups have come to the same conclusion. In a report published January 21, the global nonprofit patient safety organization ECRI listed the use of AI chatbots for medicine, at both ends of the stethoscope, as the most significant health technology hazard for 2026. The report cites AI chatbots confidently suggesting erroneous diagnoses, inventing body parts, recommending medical products or procedures that could be dangerous, advising unnecessary tests or treatments and reinforcing biases or stereotypes that can make health disparities worse. Studies have also demonstrated how chatbots can make ethical blunders when used as therapists.

Yet most physicians are now using chatbots in some fashion, such as for transcribing medical records or reviewing test results, says Scott Lucas, ECRI’s vice president for device safety. OpenAI announced ChatGPT for Healthcare and Anthropic launched Claude for Healthcare in January. ChatGPT already fields more than 40 million healthcare questions daily.

And it’s no wonder people turn to chatbots for medical assistance, Lucas says. “They can access billions of data points and aggregate data and put it into a digestible, believable, compelling format that can give you pointed advice on nearly exactly the question that you were asking and do it in a confident way.” But “commercial LLMs are not ready for primetime clinical use. To rely solely on the output of the LLM, that is not safe.”

Eventually both the AI models and users may become sophisticated enough to bridge the communications gap that Mahdi’s study highlights, Lucas says.

The study confirms concerns about the safety and reliability of LLMs in patient care that the machine learning community has discussed for a long time, says Michelle Li, a medical AI researcher at Harvard Medical School. This and other studies have illustrated the weaknesses of AI in real medical settings, she says. Li and colleagues published a study February 3 in Nature Medicine suggesting possible improvements in the training, testing and implementation of AI models, changes that may make them more reliable in a variety of medical contexts.

Mahdi plans to do additional studies of AI interactions in other languages and over time. The findings may help AI developers design stronger models that people can get accurate answers from, he says.

“The first step is to fix the measuring problem,” Mahdi says. “We haven’t been measuring what matters,” which is how AI performs for real people.

