Large language models in clinical decision support: a systematic evaluation of diagnostic accuracy and safety considerations across five medical specialties
Abstract
Background: Large language models (LLMs) are increasingly proposed as adjuncts to clinical decision support systems; however, rigorous comparative evaluations of their diagnostic accuracy against practising physicians remain scarce, particularly across diverse specialty contexts.
Methods: We constructed a benchmark dataset of 450 standardised clinical vignettes drawn equally from five specialties: internal medicine, emergency medicine, paediatrics, rheumatology, and infectious disease. Three frontier LLMs — GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro — were evaluated against a panel of 30 board-certified physicians (6 per specialty). Primary outcomes included diagnostic accuracy (top-1 and top-3), safety classification of diagnostic errors, and consistency across three independent model invocations.
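The outcome measures named above can be illustrated with a minimal sketch. It is not the authors' scoring pipeline. It assumes each model invocation returns a ranked list of candidate diagnoses per vignette, that a hit means an exact string match with the reference diagnosis, and that consistency means all invocations agree on the top-ranked diagnosis. The abstract does not specify how matching or consistency were defined, so the function names and toy data below are hypothetical.

```python
def top_k_accuracy(ranked_diagnoses, reference, k):
    """Fraction of vignettes whose reference diagnosis is among the first k candidates.

    Assumption: exact string match; a real study would likely use adjudicated matching.
    """
    hits = sum(
        1
        for ranked, truth in zip(ranked_diagnoses, reference)
        if truth in ranked[:k]
    )
    return hits / len(reference)


def run_consistency(runs):
    """Fraction of vignettes where every run returns the same top-1 diagnosis.

    Assumption: this is one possible definition of consistency, not the paper's.
    """
    n_vignettes = len(runs[0])
    agree = sum(
        1
        for i in range(n_vignettes)
        if len({run[i][0] for run in runs}) == 1
    )
    return agree / n_vignettes


# Toy data: three vignettes, three independent invocations (all values invented).
reference = ["sepsis", "kawasaki disease", "gout"]
run_a = [
    ["sepsis", "pneumonia", "urinary tract infection"],
    ["scarlet fever", "kawasaki disease", "measles"],
    ["gout", "septic arthritis", "pseudogout"],
]
run_b = [
    ["sepsis", "pneumonia", "influenza"],
    ["kawasaki disease", "scarlet fever", "measles"],
    ["gout", "pseudogout", "septic arthritis"],
]
run_c = [
    ["sepsis", "influenza", "pneumonia"],
    ["kawasaki disease", "measles", "scarlet fever"],
    ["gout", "septic arthritis", "cellulitis"],
]

top1 = top_k_accuracy(run_a, reference, k=1)
top3 = top_k_accuracy(run_a, reference, k=3)
consistency = run_consistency([run_a, run_b, run_c])
```

In this toy example the first run misses the second vignette at rank 1 but recovers it within the top 3, and the three runs disagree on that same vignette, which is the kind of gap between top-1 accuracy, top-3 accuracy, and consistency the three measures are meant to separate.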
Results: Physician panels achieved a mean top-1 diagnostic accuracy of 84.2% (95% CI: 81.1–87.3%). GPT-4o achieved 79.6%, Claude 3.5 Sonnet 81.4%, and Gemini 1.5 Pro 74.9%. LLM performance was highest in infectious disease (87.3% for Claude 3.5 Sonnet) and lowest in emergency medicine for all three models. Critical safety errors occurred at rates of 3.1%, 2.4%, and 5.8% for GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro, respectively, compared with 1.1% for physician panels.
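To put the reported figures side by side, the short sketch below computes each model's top-1 accuracy gap to the physician panels (in percentage points) and the ratio of its critical safety error rate to the physician rate. All inputs are copied from the Results paragraph; the variable and function names are illustrative. The ratios are based on point estimates only, since the abstract reports no confidence intervals for the model figures, so they should be read as rough comparisons rather than effect estimates.

```python
# Figures copied from the Results paragraph (all values are percentages).
PHYSICIAN_TOP1 = 84.2
PHYSICIAN_CRITICAL_ERROR = 1.1

MODELS = {
    "GPT-4o": {"top1": 79.6, "critical_error": 3.1},
    "Claude 3.5 Sonnet": {"top1": 81.4, "critical_error": 2.4},
    "Gemini 1.5 Pro": {"top1": 74.9, "critical_error": 5.8},
}


def accuracy_gap(model):
    """Physician top-1 accuracy minus model top-1 accuracy, in percentage points."""
    return round(PHYSICIAN_TOP1 - MODELS[model]["top1"], 1)


def error_ratio(model):
    """Model critical safety error rate divided by the physician rate."""
    return round(MODELS[model]["critical_error"] / PHYSICIAN_CRITICAL_ERROR, 1)


for name in MODELS:
    print(f"{name}: gap {accuracy_gap(name)} points, error ratio {error_ratio(name)}x")
```

On these figures the accuracy gaps are a few percentage points, while the critical safety error rates are roughly two to five times the physician rate.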
Conclusion: Contemporary LLMs approach but do not yet match board-certified physician diagnostic accuracy across mixed specialty clinical vignettes. The disproportionate safety error rate in time-sensitive scenarios represents the principal barrier to unsupervised clinical deployment and warrants targeted evaluation frameworks.
Citation
Please cite as:
Juel Chowdhury, Suresh Babu Kokku. Large language models in clinical decision support: a systematic evaluation of diagnostic accuracy and safety considerations across five medical specialties. Next Journal of AI in Healthcare 2026;1(2). doi: 10.0000/naihj.16
Authors & Affiliations
Juel Chowdhury 1
Suresh Babu Kokku 2
Director