Feasibility of Using the Privacy-preserving Large Language Model Vicuna for Labeling Radiology Reports
Pritam Mukherjee, Benjamin Hou, Ricardo B Lanfredi, Ronald M Summers
Radiology . 2023 Oct;309(1):e231147. doi: 10.1148/radiol.231147.
Background Large language models (LLMs) such as ChatGPT, though proficient in many text-based tasks, are not suitable for use with radiology reports due to patient privacy constraints. Purpose To test the feasibility of using an alternative LLM (Vicuna-13B) that can be run locally for labeling radiography reports. Materials and Methods Chest radiography reports from the MIMIC-CXR and National Institutes of Health (NIH) data sets were included in this retrospective study. Reports were examined for 13 findings. Outputs reporting the presence or absence of the 13 findings were generated by Vicuna by using a single-step or multistep prompting strategy (prompts 1 and 2, respectively). Agreements between Vicuna outputs and CheXpert and CheXbert labelers were assessed using Fleiss κ. Agreement between Vicuna outputs from three runs under a hyperparameter setting that introduced some randomness (temperature, 0.7) was also assessed. The performance of Vicuna and the labelers was assessed in a subset of 100 NIH reports annotated by a radiologist with use of area under the receiver operating characteristic curve (AUC). Results A total of 3269 reports from the MIMIC-CXR data set (median patient age, 68 years [IQR, 59-79 years]; 161 male patients) and 25 596 reports from the NIH data set (median patient age, 47 years [IQR, 32-58 years]; 1557 male patients) were included. Vicuna outputs with prompt 2 showed, on average, moderate to substantial agreement with the labelers on the MIMIC-CXR (κ median, 0.57 [IQR, 0.45-0.66] with CheXpert and 0.64 [IQR, 0.45-0.68] with CheXbert) and NIH (κ median, 0.52 [IQR, 0.41-0.65] with CheXpert and 0.55 [IQR, 0.41-0.74] with CheXbert) data sets, respectively. Vicuna with prompt 2 performed at par (median AUC, 0.84 [IQR, 0.74-0.93]) with both labelers on nine of 11 findings. Conclusion In this proof-of-concept study, outputs of the LLM Vicuna reporting the presence or absence of 13 findings on chest radiography reports showed moderate to substantial agreement with existing labelers.