Assessment of Short-Answer Questions by ChatGPT in a Medical School Course
Grey Kuling, Ph.D., Sam Pullman, B.Sc., Dzhuliyan Vasilev, M.D., M.M.Sc., Noelle Ozimek, M.S., Nathan Palmer, Ph.D., Randall W. King, M.D., Ph.D., Barbara Cockrill, M.D., and Henrike Besche, Ph.D.
Abstract
Background: Frequent low-stakes testing is a powerful tool for learning. Short-answer questions promote critical thinking but are underutilized in formative testing because they require time-intensive grading. In this retrospective case study, we evaluated the performance of generative pretrained transformer (GPT) 4o in grading open-ended short-answer responses from first-year medical students. Our primary goal was to compare artificial intelligence (AI)- and human-assigned scores.
Methods: We evaluated GPT-4o grading of short-answer responses from 169 first-year medical students in the 2023–2024 Harvard Medical School Foundations course. Student answers were graded by 10 trained human graders on two distinct criteria: (1) factual accuracy and (2) completeness of the response. Two datasets were created: a multi-grader set (n=822 responses), in which the 10 human graders independently scored the same responses to enable prompt refinement in line with pedagogical principles, and a single-grader set (n=8964 responses), in which each response was graded by one human grader to simulate real-world conditions for model evaluation. GPT prompts were iteratively refined to promote semantic fairness and pedagogical alignment, and the degree of agreement between human and AI grading was determined by quadratic-weighted Cohen's kappa.
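For readers unfamiliar with the agreement statistic, the following is a minimal sketch of quadratic-weighted Cohen's kappa for two raters. It is not the study's code. The function name, the integer score scale, and the example scores are assumptions made for illustration, since the grading scale is not stated here.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Quadratic-weighted Cohen's kappa for two lists of integer scores.

    min_score and max_score define the (assumed) ordinal grading scale.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters need the same non-zero number of scores")
    k = max_score - min_score + 1  # number of score categories
    if k < 2:
        raise ValueError("The scale needs at least two categories")
    n = len(rater_a)

    # Observed confusion matrix: rows are rater A, columns are rater B.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    # Marginal counts for each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Disagreement weight grows with the squared score distance.
            weight = ((i - j) ** 2) / ((k - 1) ** 2)
            expected = hist_a[i] * hist_b[j] / n  # chance-level count
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        # Both raters used one identical category for every response.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical scores on an assumed 0-2 scale, not data from the study.
human = [0, 0, 2, 2]
model = [0, 2, 0, 2]
print(quadratic_weighted_kappa(human, model, 0, 2))
```

Because the weights are quadratic, a 2-point disagreement is penalized four times as heavily as a 1-point disagreement, which suits ordinal grading scales where near misses matter less than large discrepancies.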
Results: In the multi-grader dataset, GPT achieved a quadratic-weighted kappa of 0.443±0.127 for factual accuracy and 0.429±0.145 for completeness when compared with human graders, with 94% of GPT-generated factual accuracy grades falling within the full range of human scores. In the single-grader dataset, the kappa for GPT–human agreement (adjusted for rater variability using an attenuation factor) was 0.741 for factual accuracy and 0.655 for completeness, with over 85% of GPT scores falling within 1 point of the human-assigned score.
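The two percentage-based agreement measures reported above can be illustrated with a short sketch. The function names and example scores below are hypothetical and are not taken from the study; the sketch only shows one plausible way to compute "within the range of human scores" and "within 1 point of the human score".

```python
def fraction_within_human_range(model_scores, human_scores_per_response):
    """Share of model scores lying between the lowest and highest
    human score given to the same response (multi-grader setting)."""
    if len(model_scores) != len(human_scores_per_response) or not model_scores:
        raise ValueError("Inputs must be non-empty and of equal length")
    hits = sum(
        min(humans) <= model <= max(humans)
        for model, humans in zip(model_scores, human_scores_per_response)
    )
    return hits / len(model_scores)


def fraction_within_tolerance(model_scores, human_scores, tolerance=1):
    """Share of model scores within `tolerance` points of the single
    human score for the same response (single-grader setting)."""
    if len(model_scores) != len(human_scores) or not model_scores:
        raise ValueError("Inputs must be non-empty and of equal length")
    hits = sum(
        abs(model - human) <= tolerance
        for model, human in zip(model_scores, human_scores)
    )
    return hits / len(model_scores)


# Hypothetical example scores, not data from the study.
multi_grader_humans = [[1, 2, 3], [1, 2], [3, 3]]
print(fraction_within_human_range([2, 0, 3], multi_grader_humans))
print(fraction_within_tolerance([2, 0, 3, 1], [3, 2, 3, 1]))
```

These measures complement kappa: they are easier to interpret but do not correct for agreement expected by chance.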
Conclusions: Iterative prompt engineering guided by pedagogical principles and linguistic flexibility yielded moderate concordance between GPT and human graders. AI-driven grading systems have the potential to reduce the grading burden of short-answer test questions. (Funded by Harvard Medical School.)