AI chatbots are worse than search engines for medical advice

There is a clear gap between the theoretical medical knowledge of large language models (LLMs) and their practical usefulness for patients, according to a new study from the Oxford Internet Institute and the Nuffield Department of Primary Care Health Sciences at the University of Oxford. The research, conducted in collaboration with MLCommons and other institutions, involved 1,298 people in the UK.

In the study, one group was asked to use LLMs such as GPT-4o, Llama 3, and Command R to assess health symptoms and suggest courses of action, while a control group relied on their usual methods, such as search engines or their own knowledge.

The results showed that the group using generative AI (genAI) tools performed no better than the control group at assessing the urgency of a condition, and was actually worse at identifying the correct medical condition, according to The Register.

The researchers point to two main problems. First, users had difficulty providing the chatbots with relevant and complete information. Second, the models sometimes gave contradictory or outright wrong advice.

The study also shows that traditional AI tests, such as medical test questions, do not reflect how people actually use the systems in real life. Passing a theoretical test is not the same as functioning safely in an interactive healthcare situation. As a result, the researchers believe today’s AI chatbots are not yet ready to be used as reliable medical advisors for the general public.


Story added 10 February 2026; the full text is available from the original content source.