Uriel Katz, M.D., Eran Cohen, M.D., Eliya Shachar, M.D., Jonathan Somer, B.Sc., Adam Fink, M.D., Eli Morse, M.D., Beki Shreiber, B.Sc., and Ido Wolf, M.D.
BACKGROUND Artificial intelligence (AI) is a rapidly advancing technology with considerable promise for the field of medicine. As a preliminary step toward integrating AI into medical practice, it is imperative to ascertain whether model performance is comparable with that of physicians. We present a systematic comparison of the performance of a large language model (LLM) with that of a large cohort of physicians. The cohort includes all residents who took the medical specialist license examination in Israel in 2022 across the core medical disciplines: internal medicine, general surgery, pediatrics, psychiatry, and obstetrics and gynecology (OB/GYN). We provide the examinations as an accessible benchmark dataset for the medical machine learning and natural language processing communities, which may be adapted for future LLM studies.
METHODS We evaluated the performance of generative pretrained transformer 3.5 (GPT-3.5) and GPT-4 on the 2022 Israeli board residency examinations and compared the results with those of 849 practicing physicians. Official physician scores were obtained from the Israeli Medical Association. To compare GPT and physician performance, we computed the model's percentile rank among physician scores on each examination. To account for model stochasticity, we administered each examination to the model 120 times.
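For illustration, the percentile comparison can be sketched as follows. This is a minimal sketch, not the study's code: the variable names, the synthetic scores, and the choice of taking the 95% interval as the 2.5th and 97.5th percentiles of the percentile ranks across the 120 repetitions are all assumptions; the authors may have derived the confidence interval differently.

```python
import numpy as np

rng = np.random.default_rng(0)

def percentile_among_physicians(model_score, physician_scores):
    """Percentile rank of one model run: the share of physicians who
    scored strictly below the model, expressed as a percentage."""
    return 100.0 * np.mean(physician_scores < model_score)

# Synthetic stand-ins (hypothetical): real inputs would be the official
# physician scores for one examination and the model's score on each of
# the 120 independent repetitions.
physician_scores = rng.normal(70, 8, size=250)
model_scores = rng.normal(72, 3, size=120)

percentiles = np.array(
    [percentile_among_physicians(s, physician_scores) for s in model_scores]
)

median_percentile = np.median(percentiles)
# One plausible 95% interval: the 2.5th and 97.5th percentiles across runs.
ci_low, ci_high = np.percentile(percentiles, [2.5, 97.5])
print(f"median percentile {median_percentile:.1f}% "
      f"(95% CI, {ci_low:.1f} to {ci_high:.1f})")
```

Summarizing over the 120 repetitions in this way reports the model's typical standing among physicians while making the run-to-run variability of the LLM explicit.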
RESULTS GPT-4 ranked higher than the majority of physicians in psychiatry, with a median percentile of 74.7% (95% confidence interval [CI] for the percentile, 66.2 to 81.0), and it performed similarly to the median physician in general surgery and internal medicine, with median percentiles of 44.4% (95% CI, 38.9 to 55.5) and 56.6% (95% CI, 44.0 to 65.7), respectively. GPT-4 performance was lower in pediatrics and OB/GYN but remained higher than a considerable fraction of practicing physicians, with median percentiles of 17.4% (95% CI, 9.55 to 30.9) and 23.44% (95% CI, 14.84 to 44.5), respectively. GPT-3.5 did not pass the examination in any discipline and scored below the majority of physicians in all five disciplines. Overall, GPT-4 passed the board residency examination in four of five specialties, achieving a median score above the official passing score of 65%.
CONCLUSIONS The advancement from GPT-3.5 to GPT-4 marks a critical milestone: LLMs achieving physician-level performance. These findings underscore the growing maturity of LLM technology and urge the medical community to explore its widespread applications.