NGHC Publications
Published on 21.May.2026 in July

Large language models in clinical decision support: a systematic evaluation of diagnostic accuracy and safety considerations across five medical specialties

Juel Chowdhury 1; Suresh Babu Kokku 2

Abstract

Background: Large language models (LLMs) are increasingly proposed as adjuncts to clinical decision support systems; however, rigorous comparative evaluations of their diagnostic accuracy against practising physicians remain scarce, particularly across diverse specialty contexts.

Methods: We constructed a benchmark dataset of 450 standardised clinical vignettes drawn equally (90 per specialty) from five specialties: internal medicine, emergency medicine, paediatrics, rheumatology, and infectious disease. Three frontier LLMs (GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro) were evaluated against a panel of 30 board-certified physicians (6 per specialty). Primary outcomes included diagnostic accuracy (top-1 and top-3), safety classification of diagnostic errors, and consistency across three independent model invocations.
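The top-1 and top-3 accuracy and the cross-invocation consistency outcomes can be illustrated with a minimal sketch. The abstract does not state how a predicted diagnosis was matched to the reference or how consistency was scored, so the exact string match, the "all invocations agree on the top-1 diagnosis" definition, the function names, and the toy data below are all assumptions made for illustration.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    reference_diagnoses: Sequence[str],
    k: int,
) -> float:
    """Fraction of vignettes whose reference diagnosis is among the first k predictions.

    Assumption: diagnoses are compared by exact string match.
    """
    if len(ranked_predictions) != len(reference_diagnoses):
        raise ValueError("one ranked prediction list is needed per vignette")
    hits = sum(
        reference in predictions[:k]
        for predictions, reference in zip(ranked_predictions, reference_diagnoses)
    )
    return hits / len(reference_diagnoses)


def run_consistency(top1_by_run: Sequence[Sequence[str]]) -> float:
    """Fraction of vignettes for which every invocation gave the same top-1 diagnosis.

    Assumption: this is one plausible definition, not necessarily the study's.
    """
    per_vignette = list(zip(*top1_by_run))  # regroup answers by vignette
    agreeing = sum(len(set(answers)) == 1 for answers in per_vignette)
    return agreeing / len(per_vignette)


# Hypothetical toy data: three vignettes, ranked differential per vignette.
predictions = [
    ["sepsis", "pneumonia", "urinary tract infection"],
    ["gout", "septic arthritis", "rheumatoid arthritis"],
    ["asthma", "bronchiolitis", "croup"],
]
reference = ["sepsis", "septic arthritis", "croup"]

# Hypothetical top-1 answers from three independent invocations.
runs = [
    ["sepsis", "gout", "asthma"],
    ["sepsis", "gout", "croup"],
    ["sepsis", "gout", "asthma"],
]

top1 = top_k_accuracy(predictions, reference, k=1)
top3 = top_k_accuracy(predictions, reference, k=3)
consistency = run_consistency(runs)
```

In this toy example only the first vignette is correct at rank 1, all three are correct within the top 3, and the third vignette is the only one on which the invocations disagree.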

Results: Physician panels achieved a mean top-1 diagnostic accuracy of 84.2% (95% CI: 81.1–87.3%). The corresponding top-1 accuracy was 79.6% for GPT-4o, 81.4% for Claude 3.5 Sonnet, and 74.9% for Gemini 1.5 Pro. LLM performance was highest in infectious disease (87.3% for Claude 3.5 Sonnet) and lowest in emergency medicine for all three models. Critical safety errors occurred at rates of 3.1%, 2.4%, and 5.8% for GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro, respectively, compared with 1.1% for physician panels.
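To put the reported figures side by side, the short sketch below computes each model's top-1 accuracy gap to the physician panels (in percentage points) and the ratio of its critical safety error rate to the physician rate. The input numbers are copied from the Results paragraph; the variable and function names are illustrative. The ratios are simple quotients of point estimates, and no uncertainty is attached to them because the abstract does not report the underlying counts.

```python
# Figures copied from the Results paragraph (all values are percentages).
PHYSICIAN = {"top1": 84.2, "critical_error": 1.1}
MODELS = {
    "GPT-4o": {"top1": 79.6, "critical_error": 3.1},
    "Claude 3.5 Sonnet": {"top1": 81.4, "critical_error": 2.4},
    "Gemini 1.5 Pro": {"top1": 74.9, "critical_error": 5.8},
}


def compare_to_physicians(model: dict) -> tuple:
    """Return (accuracy gap in percentage points, critical error rate ratio)."""
    accuracy_gap = round(PHYSICIAN["top1"] - model["top1"], 1)
    error_ratio = round(model["critical_error"] / PHYSICIAN["critical_error"], 2)
    return accuracy_gap, error_ratio


comparison = {name: compare_to_physicians(stats) for name, stats in MODELS.items()}

for name, (gap, ratio) in comparison.items():
    print(f"{name}: {gap} points below physicians, "
          f"{ratio}x the physician critical error rate")
```

On these figures the accuracy gaps are 4.6, 2.8, and 9.3 percentage points, while the critical error rates are roughly 2.8, 2.2, and 5.3 times the physician rate, which is the disproportion the Conclusion refers to.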

Conclusion: Contemporary LLMs approach but do not yet match board-certified physician diagnostic accuracy across mixed specialty clinical vignettes. The disproportionate safety error rate in time-sensitive scenarios represents the principal barrier to unsupervised clinical deployment and warrants targeted evaluation frameworks.

Citation

Please cite as:

Juel Chowdhury, Suresh Babu Kokku. Large language models in clinical decision support: a systematic evaluation of diagnostic accuracy and safety considerations across five medical specialties. Next Journal of Infectious Diseases 2026;1(2). doi: 10.0000/NJIDE.16

Authors & Affiliations

Juel Chowdhury 1

Suresh Babu Kokku 2

Director

Next Journal of Infectious Diseases ISSN: XXXX-XXXX
