Tao Tu, Ph.D., Shekoofeh Azizi, Ph.D., Danny Driess, M.S., Mike Schaekermann, Ph.D., Mohamed Amin, B.S., Pi-Chuan Chang, Ph.D., Andrew Carroll, Ph.D., Charles Lau, M.B.A., Ryutaro Tanno, Ph.D., Ira Ktena, Ph.D., Anil Palepu, M.S., Basil Mustafa, M.S., Aakanksha Chowdhery, Ph.D., Yun Liu, Ph.D., Simon Kornblith, Ph.D., David Fleet, Ph.D., Philip Mansfield, Ph.D., Sushant Prakash, M.S., Renee Wong, B.Sc., Sunny Virmani, M.S., Christopher Semturs, M.S., S. Sara Mahdavi, Ph.D., Bradley Green, Ph.D., Ewa Dominowska, M.S., Blaise Aguera y Arcas, M.S., Joelle Barral, Ph.D., Dale Webster, Ph.D., Greg S. Corrado, Ph.D., Yossi Matias, Ph.D., Karan Singhal, M.S., Pete Florence, Ph.D., Alan Karthikesalingam, M.D., Ph.D., and Vivek Natarajan, M.S.
BACKGROUND Medicine is inherently multimodal, requiring the simultaneous interpretation and integration of insights across many data modalities spanning text, imaging, genomics, and more. Generalist biomedical artificial intelligence systems that flexibly encode, integrate, and interpret these data might better enable impactful applications ranging from scientific discovery to care delivery.
METHODS To catalyze development of these models, we curated MultiMedBench, a new multimodal biomedical benchmark. MultiMedBench encompasses 14 diverse tasks, such as medical question answering, mammography and dermatology image interpretation, radiology report generation and summarization, and genomic variant calling. We then introduced Med-PaLM Multimodal (Med-PaLM M), our proof of concept for a generalist biomedical AI system that flexibly encodes and interprets biomedical data including clinical language, imaging, and genomics with the same set of model weights. To further probe the capabilities and limitations of Med-PaLM M, we conducted a radiologist evaluation of model-generated (and human) chest x-ray reports.
RESULTS We observed encouraging performance across model scales. Med-PaLM M reached performance competitive with or exceeding the state of the art on all MultiMedBench tasks, often surpassing specialist models by a wide margin. In a side-by-side ranking on 246 retrospective chest x-rays, clinicians expressed a pairwise preference for Med-PaLM M reports over those produced by radiologists in up to 40.50% of cases, suggesting potential clinical utility.
CONCLUSIONS Although considerable work is needed to validate these models in real-world cases and to understand whether cross-modality generalization is possible, our results represent a milestone toward the development of generalist biomedical artificial intelligence systems. (Funded by Alphabet Inc. and/or a subsidiary thereof.)