Grading LLMs on the Ability to Grade

Rebecca Sternschein, M.D., M.H.P.E.
Abstract
In this study, Kuling et al. use a large language model to score short-answer question responses consistently and in a fraction of the time required by human graders. The authors position their work within the framework of automated short-answer grading research, which explores the use of technology to automate grading. Multiple-choice questions are the existing question format for which automatic grading is feasible, but these have limitations as a form of assessment. This study was performed with short-answer questions designed to serve as formative assessment for learners, to promote learning and provide insight into their reasoning. Thus, the potential value of facilitating or automating grading in this context is significant and tied to the goals of this type of assessment. This editorial focuses on the questions raised by this work about assessment in medical education: What is its purpose and how does this factor into the choice of assessment tools?
