Key points from the article:
OpenAI has launched HealthBench, a new open-source benchmark designed to evaluate how well large language models (LLMs) perform in real-world healthcare scenarios. Unlike previous evaluations based on multiple-choice medical exam questions, HealthBench tests AI systems in more realistic and complex situations. These include multi-turn, multilingual conversations between patients and clinicians across various specialties and contexts, reflecting how people actually use AI in healthcare settings.
Developed with input from 262 physicians across 60 countries, HealthBench includes 5,000 simulated health conversations. Model responses are graded against physician-written rubrics comprising 48,562 unique criteria, spanning dimensions such as accuracy, safety, communication, and clinical appropriateness. The benchmark also organizes conversations into seven themes, including emergency care and global health, to better mirror real clinical challenges.
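The article does not spell out how rubric grading translates into a score, so here is a minimal Python sketch of one plausible scheme: each criterion carries a point value (negative for harmful behaviors), a grader marks each criterion as met or unmet, and the conversation score is earned points divided by the maximum attainable points. The class names, field names, and scoring rule below are illustrative assumptions, not HealthBench's published schema.

```python
from dataclasses import dataclass

# Hypothetical types for illustration; the actual open-source
# HealthBench data format may differ.

@dataclass
class RubricCriterion:
    description: str   # e.g. "Advises the user to seek emergency care"
    points: int        # positive for desirable behavior, negative for harmful

@dataclass
class GradedCriterion:
    criterion: RubricCriterion
    met: bool          # whether a grader judged the response to satisfy it

def score_conversation(graded: list[GradedCriterion]) -> float:
    """Rubric score: points earned over maximum attainable points.

    Only positive-point criteria count toward the maximum; negative-point
    criteria can only subtract. The result is clipped to [0, 1].
    """
    earned = sum(g.criterion.points for g in graded if g.met)
    max_points = sum(g.criterion.points for g in graded
                     if g.criterion.points > 0)
    if max_points == 0:
        return 0.0
    return max(0.0, min(1.0, earned / max_points))

# Example: one positive criterion met, one missed, no safety penalty.
graded = [
    GradedCriterion(RubricCriterion("Asks about symptom duration", 5), met=True),
    GradedCriterion(RubricCriterion("Recommends appropriate follow-up", 5), met=False),
    GradedCriterion(RubricCriterion("Suggests an unsafe medication dose", -8), met=False),
]
print(score_conversation(graded))  # 0.5: one of two positive criteria met
```

One design note on this sketch: keeping negative-point criteria out of the denominator means a response cannot raise its score merely by avoiding harm; it must actively satisfy the positive criteria.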
According to Karan Singhal, who leads OpenAI's health AI team, HealthBench serves the AI research community by encouraging shared standards and improvement, and healthcare organizations by providing high-quality evidence for evaluating AI. Ethan Goh, M.D., from Stanford AI Research, praised the benchmark for moving beyond outdated exam-style evaluations and better reflecting how AI is used in practice.
HealthBench is OpenAI's first major tool aimed specifically at healthcare and arrives amid growing industry partnerships. These include collaborations with Sanofi, Formation Bio, Iodine Software, Color Health, and UTHealth Houston, which are applying OpenAI models in areas such as cancer care, clinical trial recruitment, and hospital administration.