Adam Rodman, M.D., M.P.H., Laura Zwaan, Ph.D., Andrew Olson, M.D., and Arjun K. Manrai, Ph.D.
Improved performance of large language models (LLMs) on traditional reasoning assessments has led to benchmark saturation, spurring efforts to develop new benchmarks, including synthetic computational simulations of clinical practice involving multiple AI agents. We argue that it is crucial to ground such efforts in extensive human validation. We conclude with four recommendations to help researchers better evaluate LLMs for clinical practice.