AI models show high accuracy in tests but fail during realistic patient conversations

Human doctors still excel in gathering patient information and making diagnoses

02-Jan-2025

Key points from the article:

Researchers at Harvard University tested the diagnostic reasoning of advanced AI models, including GPT-4, GPT-3.5, Meta’s Llama-2, and Mistral-v2, using simulated doctor-patient conversations. They developed a benchmark called CRAFT-MD, based on 2,000 cases drawn from US medical board exams, to evaluate how well AI could gather medical histories and deliver accurate diagnoses. GPT-4 acted as the “patient AI” and also graded the performance, with human medical experts double-checking the results and reviewing the conversations for accuracy.

The study revealed a stark drop in diagnostic accuracy during open-ended interactions. GPT-4, the best-performing model, achieved 82% accuracy when presented with structured summaries and multiple-choice answers but fell to 26% during simulated patient conversations. The other models performed even worse. Moreover, the AI models struggled to collect complete medical histories, with GPT-4 succeeding only 71% of the time. This incomplete information further contributed to incorrect diagnoses in many cases.

Simulated conversations offered a more realistic test of clinical reasoning than traditional exams. They mirrored real-world scenarios where patients may not provide crucial details unless prompted effectively. Shreya Johri, a researcher on the project, noted that this approach highlights skills vital for clinical practice that aren’t tested in written case studies.
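To make the conversational setup concrete, here is a minimal sketch in Python. It is not the CRAFT-MD code: the `Case` type, the keyword-matching patient, the `toy_doctor` function, and the two cases are all hypothetical stand-ins for the language models and exam cases described above, and the grader is a plain string comparison rather than GPT-4 plus human review. It only illustrates the idea that a simulated patient reveals a detail when asked about it, so a doctor agent that never asks the right question can miss the diagnosis.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Case:
    opening: str           # what the simulated patient says first
    facts: Dict[str, str]  # topic keyword -> detail revealed only if asked
    diagnosis: str         # reference answer used by the grader


def patient_reply(case: Case, question: str) -> str:
    """Stub patient (assumption): reveals a fact only when its topic is asked about."""
    for topic, fact in case.facts.items():
        if topic in question.lower():
            return fact
    return "I'm not sure."


def run_conversation(case: Case, doctor: Callable[[List[str]], str],
                     max_turns: int = 5) -> str:
    """Alternate doctor and patient turns until the doctor commits to a diagnosis."""
    transcript = [case.opening]
    for _ in range(max_turns):
        message = doctor(transcript)
        if message.startswith("DIAGNOSIS:"):
            return message[len("DIAGNOSIS:"):].strip()
        transcript.append(message)
        transcript.append(patient_reply(case, message))
    return ""  # the doctor never reached a diagnosis


def accuracy(cases: List[Case], doctor: Callable[[List[str]], str]) -> float:
    """Fraction of cases where the final diagnosis matches the reference."""
    correct = sum(
        run_conversation(c, doctor).lower() == c.diagnosis.lower() for c in cases
    )
    return correct / len(cases)


def toy_doctor(transcript: List[str]) -> str:
    """Hypothetical doctor agent: asks one fixed question, then diagnoses."""
    if len(transcript) == 1:
        return "Do you have a rash?"
    if "ring-shaped" in transcript[-1]:
        return "DIAGNOSIS: tinea corporis"
    return "DIAGNOSIS: eczema"


# Two made-up toy cases, not drawn from the benchmark.
CASES = [
    Case("I have an itchy patch on my arm.",
         {"rash": "Yes, it is ring-shaped with a clear center."},
         "tinea corporis"),
    Case("My skin has been itchy for weeks.",
         {"roommate": "My roommate has been itchy too."},
         "scabies"),
]

print(accuracy(CASES, toy_doctor))
```

In this toy run the doctor gets the first case right because its one question happens to elicit the key detail, and gets the second wrong because it never asks about close contacts. A real evaluation would replace the stubs with model calls and a much more careful grader.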

Eric Topol of the Scripps Research Translational Institute emphasized that while strong performance on such benchmarks could make AI a useful tool in healthcare, it cannot replace the holistic judgment of experienced physicians. Medical practice involves complexities like managing patients, coordinating teams, and addressing systemic healthcare issues that go beyond AI’s capabilities.

The findings underline the limitations of current AI in clinical reasoning, even though the models show promise for supporting healthcare professionals in specific tasks. The research was published in the peer-reviewed journal Nature Medicine.

Mentioned in this article:

Eric Topol

Founder and director of the Scripps Research Translational Institute, American cardiologist, scientist and author

Harvard University

Private Ivy League research university in Massachusetts

Nature Medicine

Scientific journal providing information from all areas of medicine

Scripps Research Institute

Medical research facility with focus on research and education in the biomedical sciences

Topics mentioned on this page:
AI Doctor