LLMs Demonstrate Mixed Performance in Breast Cancer Diagnostics

A 2026 JMIR Medical Informatics study evaluated nine large language models, including ChatGPT‑4o and Claude 3 Opus, on 50 breast‑cancer guideline questions, comparing their yes/no answers and accompanying analyses with those of radiologists (residents, fellows, and attendings). Measured against the 2024 NCCN and 2013 ACR BI‑RADS standards, ChatGPT‑4o and the Claude models scored highest and outperformed fellows on some metrics (P<.05), yet could not fully replace clinical expertise.
Scoring Rationale
Rigorous peer‑reviewed evaluation with practical clinician comparisons, but the small question set and narrow scope limit generalizability.
Sources
- Evaluation of Large Language Models for Radiologists’ Support in Multidisciplinary Breast Cancer Teams: Comparative Study (medinform.jmir.org)


