LLM-Assisted Reanalysis of Unsolved Rare Disease Genomes Increases Diagnostic Yield

    Aaron Jaech, Ph.D., Morgan Cheatham, M.D., Suyash S. Shringarpure, Ph.D., Casie A. Genetti, M.S., Pratiksha Pradhan, M.S., Aarti Bagul, M.S., Trishan Panch, M.D., Benjamin Rader, M.D., M.P.H., Kara Sewalk, M.P.H., Rebecca Distler, M.P.H., Katherine N. Anderson, M.S., Melissa Stewart, M.P.H., Sarah Lejfer, M.P.H., David C. Glahn, Ph.D., Richard D. Goldstein, M.D., Monica Wojcik, M.D., M.P.H., Alan H. Beggs, Ph.D., John S. Brownstein, Ph.D., and Catherine A. Brownstein, Ph.D., M.P.H.

    Abstract

    Background: Rare and undiagnosed genetic disorders affect millions of patients globally, and many patients endure years of inconclusive testing. Conventional genomic interpretation can be costly and insufficiently sensitive, and it is rarely repeated as knowledge evolves.

    Methods: We conducted a retrospective multicohort reanalysis using a large language model (LLM)-assisted workflow that ingests clinician notes, Human Phenotype Ontology (HPO) terms, and a filtered variant table to propose explanation-rich candidate hypotheses for expert adjudication under American College of Medical Genetics and Genomics and Association for Molecular Pathology criteria. A diagnosis was defined a priori as a variant classified as pathogenic or likely pathogenic, confirmed in a Clinical Laboratory Improvement Amendments-certified laboratory, and returned to families. Secondary outputs included "rediscoveries" of externally established diagnoses not yet available locally and hypothesis generation signals.

    Results: Across four cohorts, new local diagnoses were made in 10 of 100 rare disease neurodevelopmental cases (10.0%; exact binomial 95% confidence interval [CI], 4.9 to 17.6), 4 of 61 neuromuscular cases (6.6%; 95% CI, 1.8 to 16.0), 2 of 200 cases of sudden unexpected death in pediatrics (1.0%; 95% CI, 0.1 to 3.6), and 2 of 15 early psychosis cases (13.3%; 95% CI, 1.7 to 40.5), for an overall diagnostic yield of 18 of 376 (4.8%; 95% CI, 2.9 to 7.5). We identified seven rediscoveries in which pathogenic or likely pathogenic findings had been established externally but were not available in the local research record at the time of review. In one case, the model's synthesis of genotype-quality patterns and phenotype concordance triaged a putative 22q11.2 deletion that was subsequently confirmed by whole-genome sequencing. The workflow also generated testable biological hypotheses, including a candidate association between the sphingosine-1-phosphate receptor 1 gene (S1PR1) and vitiligo.
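
The exact binomial intervals reported above can be recomputed from the counts alone. The following is a minimal sketch, assuming "exact binomial" refers to the Clopper-Pearson method (the abstract does not name the method); the function names and the `COHORTS` table are illustrative and are not taken from the study's code.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_decreasing(f, target: float, iters: int = 100) -> float:
    """Bisection on [0, 1] for f(p) = target, where f is decreasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid  # f is still too large, so p must increase
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Two-sided exact (Clopper-Pearson) interval for k successes in n trials."""
    # Lower bound: the p at which P(X >= k) = alpha/2, i.e. P(X <= k-1) = 1 - alpha/2.
    lower = 0.0 if k == 0 else _solve_decreasing(
        lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2
    )
    # Upper bound: the p at which P(X <= k) = alpha/2.
    upper = 1.0 if k == n else _solve_decreasing(
        lambda p: binom_cdf(k, n, p), alpha / 2
    )
    return lower, upper


# Counts (diagnoses, cases) as reported in the Results paragraph.
COHORTS = {
    "neurodevelopmental": (10, 100),
    "neuromuscular": (4, 61),
    "sudden unexpected death in pediatrics": (2, 200),
    "early psychosis": (2, 15),
}
COHORTS["overall"] = (
    sum(k for k, _ in COHORTS.values()),
    sum(n for _, n in COHORTS.values()),
)

for name, (k, n) in COHORTS.items():
    lo, hi = clopper_pearson(k, n)
    print(f"{name}: {k}/{n} = {100 * k / n:.1f}% "
          f"(95% CI, {100 * lo:.1f} to {100 * hi:.1f})")
```

The bisection avoids any dependency beyond the standard library; a statistics package with an exact binomial interval routine would give the same bounds.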

    Conclusions: In retrospective reanalysis, an explanation-first LLM applied to routine HPO terms and variant tables produced clinically relevant gains in diagnostic yield, surfaced overlooked pathogenic findings, and generated biologically grounded hypotheses. These results motivate prospective multicenter evaluation with predefined end points, calibration reporting, and comparator baselines. (Funded by the U.S. National Institute of Child Health and Human Development and others.)