Deep Learning: ChatGPT Imaging Pearls | CTisus
Imaging Pearls ❯ Deep Learning ❯ ChatGPT


  • “Generative artificial intelligence models exhibit amazing abilities but make serious errors. We have a very limited understanding of why they work well at all or of the circumstances under which they give incorrect responses. This suggests the need for additional research and great caution in deploying such models for critical applications. Since the availability of ChatGPT in late 2022, based on OpenAI’s GPT-3.5 large language model, those of us who have explored its capabilities have been amazed by its facility with language and its abilities to generate coherent — and even insightful — synopses; answer questions about everything from general knowledge to domain-specific topics; offer advice on how to accomplish tasks, including for medical diagnosis, therapy, and prognosis; deduce consequences of assumptions; and even write effective computer programs. Nevertheless, I would urge great caution in adopting such methods in health care, mainly because of our lack of understanding of how they accomplish the miraculous-seeming things they are able to do.”
    Large Language Models Seem Miraculous, but Science Abhors Miracles
    Peter Szolovits
    NEJM AI 2024; 1 (6)
  • “Nevertheless, one can speculate about what might be happening inside LLMs. Data compression research shows that the most effective way to compress data and allow it to be regenerated (e.g., ZIP files, JPEG images, etc.) is to identify common patterns in the data. LLMs, using huge neural networks, can identify such patterns much better than earlier prespecified methods. So, perhaps LLMs may be learning multiple functions that can approximately reproduce their inputs, and these functions correspond to different facilities that the LLMs exhibit, such as relevant fact retrieval, inference, grammatical knowledge, writing style, etc. However, we do not know how to identify these facilities within the myriad trained weights in the network. To my knowledge, nobody has yet presented a coherent and insightful explanation for how the diverse capabilities of LLMs emerge.”
    Large Language Models Seem Miraculous, but Science Abhors Miracles
    Peter Szolovits
    NEJM AI 2024; 1 (6)
  • “Research on the interpretation of how large artificial neural networks operate may lead to insights about LLMs as well.5 Concern about the easy success of adversarial attacks on machine-learning systems is also leading to a new understanding about how robustness against such attacks likely leads to more correct models that are robust to internal errors as well. Work on explaining why machine-learning models come to a particular conclusion is challenging because that conclusion is likely to be the result of a vast number of numerical operations that would not be innately meaningful to a user. However, it is often possible to build simpler, easier-to-explain models that approximate the outputs of the complex model, at least in the neighborhood of the question posed. Perhaps such methods can provide good-enough explanations of complex outputs to be useful.”
    Large Language Models Seem Miraculous, but Science Abhors Miracles
    Peter Szolovits
    NEJM AI 2024; 1 (6)
  • “There is considerable policy discussion in government about the need for regulation of AI systems revolving around the issues discussed here. My personal view is that using generative AI is sensible as long as it is done under careful human supervision. Good uses include cleaning up or suggesting appropriate language, summarizing long texts, or using it in tasks such as cohort selection, where strict accuracy is desirable but errors are tolerable. I would be much more cautious in using such systems where critical decisions that affect specific individuals’ diagnosis and treatment are at stake and where errors can be very harmful. Moreover, I would certainly avoid situations in which a system is allowed to make consequential autonomous decisions. Because of our lack of understanding of how such systems work, caution is required, and we need to make much greater investments in understanding the “miracle” of generative AI and not just its exploitation.”
    Large Language Models Seem Miraculous, but Science Abhors Miracles
    Peter Szolovits
    NEJM AI 2024; 1 (6)
  • The integration of artificial intelligence (AI) in medical image interpretation requires effective collaboration between clinicians and AI algorithms. Although previous studies demonstrated the potential of AI assistance in improving overall clinician performance, the individual impact on clinicians remains unclear. This large-scale study examined the heterogeneous effects of AI assistance on 140 radiologists across 15 chest X-ray diagnostic tasks and identified predictors of these effects. Surprisingly, conventional experience-based factors, such as years of experience, subspecialty and familiarity with AI tools, fail to reliably predict the impact of AI assistance. Additionally, lower-performing radiologists do not consistently benefit more from AI assistance, challenging prevailing assumptions. Instead, we found that the occurrence of AI errors strongly influences treatment outcomes, with inaccurate AI predictions adversely affecting radiologist performance on the aggregate of all pathologies and on half of the individual pathologies investigated. Our findings highlight the importance of personalized approaches to clinician–AI collaboration and the importance of accurate AI models. By understanding the factors that shape the effectiveness of AI assistance, this study provides valuable insights for targeted implementation of AI, enabling maximum benefits for individual clinicians in clinical practice.
    Heterogeneity and predictors of the effects of AI assistance on radiologists
    Feiyang Yu et al.
    Nature Medicine | Volume 30 | March 2024 | 837–849
  • “The results revealed that the regression coefficient on unassisted error was significantly different from zero when considering all pathologies aggregated (P < 0.001; Fig. 3a and Supplementary Table 9). Similarly, the regression coefficient was significant on most individual pathologies (Benjamini–Hochberg-adjusted P < 0.05), except for atelectasis, pneumothorax and shoulder fracture (Benjamini–Hochberg-adjusted P > 0.05; Fig. 3b). We similarly constructed a linear regression model regressing from unassisted AUROC and an intercept term to assisted AUROC. The regression coefficient on unassisted AUROC was again significant on all pathologies aggregated (P < 0.001; Fig. 3c and Supplementary Table 33), whereas the coefficient was insignificant on abnormal (P > 0.05). Together, these findings indicate that unassisted error serves as a strong predictor of assisted error in most cases.”
    Heterogeneity and predictors of the effects of AI assistance on radiologists
    Feiyang Yu et al.
    Nature Medicine | Volume 30 | March 2024 | 837–849
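The regression described in the quote above, assisted error regressed on unassisted error plus an intercept, can be sketched with ordinary least squares. This is illustrative only; the numbers below are invented, not the paper's data:

```python
import statistics

# Hypothetical per-radiologist error rates (invented for illustration)
unassisted = [0.30, 0.22, 0.41, 0.18, 0.35, 0.27]  # error without AI
assisted   = [0.26, 0.20, 0.34, 0.17, 0.30, 0.24]  # error with AI assistance

def ols_slope_intercept(x, y):
    """Ordinary least squares fit for y = a*x + b."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, my - a * mx

slope, intercept = ols_slope_intercept(unassisted, assisted)
# A clearly positive slope is what "unassisted error predicts assisted error" means
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

In the study this fit is repeated per pathology, with significance testing on the slope; the sketch shows only the core regression step.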
  • “We found that experience-based radiologist characteristics, including years of experience, subspecialty in thoracic radiology and experience with AI tools, did not serve as reliable predictors of treatment effect, in terms of both calibration performance and discrimination performance. These findings challenge the associations between experience-based radiologist characteristics and the treatment effect of AI assistance reported in previous research24–28. The observed variability could be attributed to our larger and more diverse sample size, encompassing 140 radiologists with varying skill levels, experiences and preferences. Additionally, our study’s inclusion of a wide range of diagnostic tasks enables a robust examination of the complex factors influencing the treatment effect. Furthermore, the performance characteristics and quality of the specific AI system may play an important role, highlighting the need for developers to consider these factors when deploying AI assistance. To optimize the implementation of AI assistance, a comprehensive assessment of multiple factors, including the clinical task, patient population and AI system, is essential.”
    Heterogeneity and predictors of the effects of AI assistance on radiologists
    Feiyang Yu et al.
    Nature Medicine | Volume 30 | March 2024 | 837–849

  • Improving Efficiencies While Also Delivering Better Health Care Outcomes: A Role for Large Language Models. 
    Rao SK, Fishman EK, Rizk RC, Chu LC, Rowe SP. 
    J Am Coll Radiol. 2024 Jan 12:S1546-1440(24)00005-X. doi: 10.1016/j.jacr.2024.01.003. Epub ahead of print. PMID: 38220038.
  •  “Generative artificial intelligence (AI), specifically the large language models (LLMs) that underlie impressive new applications such as ChatGPT, are already fundamentally changing medicine. Unlike more traditional AI systems that produce simple outputs such as a number (say, the predicted length of stay for a patient in the hospital) or a category (say, “malignant” or “benign” for a radiologic system), “generative AI” refers broadly to systems whose outputs take the form of more unstructured media objects, such as images and documents. Under the hood, many of these systems are actually built by executing models that serve a more classical purpose. Generative text models, for example, generate whole documents by iteratively predicting “what word comes next.” But the ability to produce a whole document with desired properties unlocks a host of exciting applications.”
    Improving Efficiencies While Also Delivering Better Health Care Outcomes: A Role for Large Language Models.  
    Rao SK, Fishman EK, Rizk RC, Chu LC, Rowe SP.  
    J Am Coll Radiol. 2024 Jan 12:S1546-1440(24)00005-X. doi: 10.1016/j.jacr.2024.01.003. Epub ahead of print. PMID: 38220038
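The "iteratively predicting what word comes next" idea in the quote above can be shown in miniature. A toy sketch in which a hand-built bigram table stands in for the neural network; every word and transition here is invented for illustration:

```python
# Toy next-word table; a real LLM replaces this lookup with a neural network
# that outputs a probability over the whole vocabulary at each step.
bigram_next = {
    "the": "patient",
    "patient": "has",
    "has": "pneumonia",
}

def generate(prompt: str, max_words: int = 10) -> str:
    """Greedily append the most likely next word until no continuation exists."""
    words = prompt.split()
    for _ in range(max_words):
        nxt = bigram_next.get(words[-1])
        if nxt is None:  # no known continuation: stop generating
            break
        words.append(nxt)
    return " ".join(words)

print(generate("the"))  # "the patient has pneumonia"
```

The loop is the whole trick: a model that only ever answers "what word comes next" still produces a complete document by feeding its own output back in as context.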
  • “We build and leverage generative AI to unburden clinicians from clerical work to help them focus more fully on their patients. Briefly, Abridge AI helps clinicians capture conversations, structure and summarize the information exchanged, and integrate that information back into the medical record. As clinicians, we are tasked to serve three constituents for patients we see. Those constituents are (1) our care team members, who benefit from clinical notes that convey our thought process; (2) ourselves, as the physicians who also need to place orders, report diagnostic codes, and handle procedure codes for billing and revenue cycle; and (3) our patients, the most important constituents, who benefit from visit summaries and access to their OpenNotes in their portals.”
    Improving Efficiencies While Also Delivering Better Health Care Outcomes: A Role for Large Language Models.  
    Rao SK, Fishman EK, Rizk RC, Chu LC, Rowe SP.  
    J Am Coll Radiol. 2024 Jan 12:S1546-1440(24)00005-X. doi: 10.1016/j.jacr.2024.01.003. Epub ahead of print. PMID: 38220038
  • “The arrival of LLMs that can provide real-time assistance to physicians may allow a remarkable increase in their bandwidth, regardless of specialty. In radiology, leveraging these emerging technologies will potentially allow more scans to be read without adding burden or stress to the interpreting radiologist. We would be in the “high consequences for factual inaccuracies and high volume of decisions” quadrant of Figure 1, where the assistance of an LLM would be its key feature. The importance of that added bandwidth would be its potential to ameliorate disparities by democratizing the expertise of radiologists who are already able to handle large volumes or who may have special skill sets in less common examinations. Such improvements in access to care would hopefully have downstream effects of improved outcomes in marginalized populations.”
    Improving Efficiencies While Also Delivering Better Health Care Outcomes: A Role for Large Language Models.  
    Rao SK, Fishman EK, Rizk RC, Chu LC, Rowe SP.  
    J Am Coll Radiol. 2024 Jan 12:S1546-1440(24)00005-X. doi: 10.1016/j.jacr.2024.01.003. Epub ahead of print. PMID: 38220038
  • “Large language models summarizing clinical data promise powerful opportunities to streamline information-gathering from the EHR. But by dealing in language, they also bring unique risks that are not clearly covered by existing FDA regulatory safeguards. As summarization tools speed closer to clinical practice, transparent development of standards for LLM-generated clinical summaries, paired with pragmatic clinical studies, will be critical to the safe and prudent rollout of these technologies. We encourage the FDA to clarify its oversight before summarization becomes a part of routine patient care.”
    AI-Generated Clinical Summaries Require More Than Accuracy.
    Goodman KE, Yi PH, Morgan DJ.  
    JAMA. 2024 Jan 29. doi: 10.1001/jama.2024.0555. Epub ahead of print. PMID: 38285439. 
  • First, we need comprehensive standards for LLM-generated summaries, with domains that extend beyond accuracy and that include stress-testing for sycophancy and small but clinically important errors. These standards should reflect scientific and clinical consensus, with input beyond the few large technology companies developing health care LLMs. Second, LLMs performing clinical summarization are ultimately clinical aids. Regardless of current FDA regulation, we believe that they should be clinically tested to quantify clinical harms and benefits before widespread deployment. This testing carries minimal risk and could be performed as quality improvement in a learning health system.
    AI-Generated Clinical Summaries Require More Than Accuracy.
    Goodman KE, Yi PH, Morgan DJ.  
    JAMA. 2024 Jan 29. doi: 10.1001/jama.2024.0555. Epub ahead of print. PMID: 38285439.
  • “Yet there is also the potential for patient harm because LLMs performing summarization are unlikely to fall under FDA medical device oversight and could reach clinics without safety and efficacy safeguards. Indeed, FDA final guidance for clinical decision support software — published 2 months before ChatGPT’s release — provides an unintentional “roadmap” for how LLMs could avoid FDA regulation. Even LLMs performing sophisticated summarization tasks would not clearly qualify as devices because they provide general language-based outputs rather than specific predictions or numeric estimates of disease. With careful implementation, we expect that many LLMs summarizing clinical data could meet device-exemption criteria.”
    AI-Generated Clinical Summaries Require More Than Accuracy.
    Goodman KE, Yi PH, Morgan DJ. 
    JAMA. 2024 Jan 29. doi: 10.1001/jama.2024.0555. Epub ahead of print. PMID: 38285439.
  • “Informed consent is integral to the practice of medicine. Most informed consent documents are written at a reading level that surpasses the reading comprehension level of the average American. Large language models, a type of artificial intelligence (AI) with the ability to summarize and revise content, present a novel opportunity to make the language used in consent forms more accessible to the average American and thus, improve the quality of informed consent. In this study, we present the experience of the largest health care system in the state of Rhode Island in implementing AI to improve the readability of informed consent documents, highlighting one tangible application for emerging AI in the clinical setting.”
    Using ChatGPT to Facilitate Truly Informed Medical Consent
    Fatima N. Mirza et al.
    NEJM AI 2024; 1 (2)
  • “Several learning points can be gleaned from our experience using AI to improve informed consent at the largest health care system in Rhode Island. First, the process of text simplification is a task that plays to the natural “style transfer” strengths of LLMs, which are trained on a large text corpus of heterogeneous writing styles and reading levels. For a task that may take a proofreader hours or days, GPT-4 simplified a 947-word document in less than 1 minute.”
    Using ChatGPT to Facilitate Truly Informed Medical Consent
    Fatima N. Mirza et al.
    NEJM AI 2024; 1 (2)
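Because work like the Mirza et al. study is judged by reading level, a rough self-check of a simplified form is easy to script. A minimal sketch of the Flesch-Kincaid grade formula, assuming a crude vowel-group syllable heuristic (real readability tools use pronunciation dictionaries); the sample sentences are invented:

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count runs of vowels; always at least 1 syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

dense = "The procedure necessitates anticoagulation discontinuation preoperatively."
plain = "You must stop your blood thinner before the surgery."
print(fk_grade(dense) > fk_grade(plain))  # the simplified text scores at a lower grade
```

A check like this can gate an LLM-simplified consent form against a target grade level before a human reviewer signs off on it.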
  • “Our experience highlights a tangible way that AI can be deployed in the clinical setting to improve patient experiences in the here and now. Whereas the introduction of any new clinical technology presents the potential hazard of amplifying existing disparities in patient care, the reverse can also hold true. In the mission of creating a truly informed consent process, AI has the potential to broaden and strengthen our communication to reach countless vulnerable patients stranded at the periphery in a status quo mired by excessively arcane and verbose medical forms.”
    Using ChatGPT to Facilitate Truly Informed Medical Consent
    Fatima N. Mirza et al.
    NEJM AI 2024; 1 (2)
  • “More interesting now, after the heights of the pandemic, we are beginning to see a market need for technology-driven efficiencies to help health care systems deal with historic losses over the past couple of years. For example, staffing shortages are top cost drivers and a concern for hospital executives. Strategies that may have worked in the past, raising prices via market mergers and consolidation or asking clinicians to see more patients and boost volumes, are unlikely to work this time given market dynamics. Instead, the new name of the game for health systems is productivity increase but with a twist. They need to improve care delivery experiences and outcomes while also improving efficiencies. Most important, they need to meet the health care demand without increasing the exodus of frontline health care workers.”
    Improving Efficiencies While Also Delivering Better Health Care Outcomes: A Role for Large Language Models.
    Rao SK, Fishman EK, Rizk RC, Chu LC, Rowe SP.  
    J Am Coll Radiol. 2024 Jan 12:S1546-1440(24)00005-X. doi: 10.1016/j.jacr.2024.01.003. Epub ahead of print

  • “There is broad recognition that the burnout-inducing amount of paperwork associated with modern medical practice needs to be addressed to limit the early departure of clinicians from the field. A recent report in the Journal of General Internal Medicine suggested that doctors need 27 hours a day to complete all their work. An AMA study from 2021 suggested that 63% of physicians surveyed reported burnout. Expenses of $4.6 billion annually were related to physician turnover and reduced clinical hours in 2019. All of these statistics have only worsened on the other side of the (peak) pandemic. Our priority is to assist and integrate for the depth of the workflow in an enterprise way, spanning the work that happens before, during, and after a patient encounter from notes to orders and coding.”
    Improving Efficiencies While Also Delivering Better Health Care Outcomes: A Role for Large Language Models.
    Rao SK, Fishman EK, Rizk RC, Chu LC, Rowe SP.  
    J Am Coll Radiol. 2024 Jan 12:S1546-1440(24)00005-X. doi: 10.1016/j.jacr.2024.01.003. Epub ahead of print
  • “Perhaps the most important lesson for radiologists is that we need to have a seat at the table as LLMs are adopted more broadly as assistants and augmenters. We can help drive the maximum value for ourselves and, most important, our patients from those emerging technologies.”  
    Improving Efficiencies While Also Delivering Better Health Care Outcomes: A Role for Large Language Models.
    Rao SK, Fishman EK, Rizk RC, Chu LC, Rowe SP.  
    J Am Coll Radiol. 2024 Jan 12:S1546-1440(24)00005-X. doi: 10.1016/j.jacr.2024.01.003. Epub ahead of print
  • Application: Prior health AI applications often centered around diagnostics, but there are many low-risk yet high-value scenarios across the entire health system that are ripe for impact
    Evaluation: Real-world use cases are often under-represented in existing health AI benchmarks; scaling realistic benchmark creation and evaluation is of increasing urgency
    Modeling: Unlike standard contrastive learning, multimodal generative AI can benefit from gravitating to text as the “interlingua” of all modalities, given the vast amount of human knowledge captured in state-of-the-art large language models.
    Multimodal Generative AI for Precision Health
    Hoifung Poon
    NEJM AI (in press)
  • We assessed the performance of the newly released AI GPT-4 in diagnosing complex medical case challenges and compared the success rate to that of medical-journal readers. GPT-4 correctly diagnosed 57% of cases, outperforming 99.98% of simulated human readers generated from online answers. We highlight the potential for AI to be a powerful supportive tool for diagnosis; however, further improvements, validation, and addressing of ethical considerations are needed before clinical implementation.
    Use of GPT-4 to Diagnose Complex Clinical Cases
    Alexander V. Eriksen, M.D., Soren Moller, M.Sc., Ph.D., and Jesper Ryg, M.D., Ph.D.
    NEJM AI 2023; 1 (1)
  • “An important study limitation is the use of a poorly characterized population of human journal readers with unknown levels of medical skills. Moreover, we cannot assess whether the responses provided for the clinical cases reflect their maximum effort. Consequently, our results may represent a best-case scenario in favor of GPT-4. The assumption of independent answers on the 38 cases in our pseudo population is somewhat unrealistic, because some readers might consistently perform differently from others and the frequency at which participants respond correctly to the cases might depend on the level of medical skills as well as the distribution of these. However, even in the extreme case of maximally correlated correct answers among the medical-journal readers, GPT- 4 would still perform better than 72% of human readers.”
    Use of GPT-4 to Diagnose Complex Clinical Cases
    Alexander V. Eriksen, M.D., Soren Moller, M.Sc., Ph.D., and Jesper Ryg, M.D., Ph.D.
    NEJM AI 2023; 1 (1)
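The pseudo-population reasoning in the quote above can be illustrated with a toy Monte Carlo simulation. This is not the authors' code; the per-case success rates and the model's score below are invented for illustration:

```python
import random

random.seed(0)

# Hypothetical probability that a journal reader answers each case correctly
# (the study derived such rates from online poll answers; these are made up).
p_correct = [0.2, 0.35, 0.5, 0.15, 0.4, 0.3, 0.25, 0.45]
gpt4_correct = 5  # suppose the model solved 5 of these 8 cases

def simulate_reader(probs):
    """One pseudo-reader: answer each case independently with its success rate."""
    return sum(random.random() < p for p in probs)

scores = [simulate_reader(p_correct) for _ in range(10_000)]
pct = 100 * sum(s < gpt4_correct for s in scores) / len(scores)
print(f"GPT-4 outperforms {pct:.1f}% of simulated readers")
```

The independence assumption in `simulate_reader` is exactly the limitation the authors flag: if readers' correct answers are correlated, the simulated population spreads out and the model's percentile drops, which is why they also report the maximally correlated worst case.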
  • “Additionally, whereas GPT-4 in our study worked only on written records, future AI tools that are more specialized are expected to include other data sources, including medical imaging and structured numerical measurements, in their predictions. Importantly, future models should include training data from developing countries to ensure a broad, global benefit of this technology and reduce the potential for health care disparities. AI based on LLMs might be relevant not only for in-patient hospital settings but also for first-line screening that is performed either in general practice or by patients themselves. As we move toward this future, the ethical implications surrounding the lack of transparency by commercial models such as GPT-4 also need to be addressed,1 as well as regulatory issues on data protection and privacy. Finally, clinical studies evaluating accuracy, safety, and validity should precede future implementation.”
    Use of GPT-4 to Diagnose Complex Clinical Cases
    Alexander V. Eriksen, M.D., Soren Moller, M.Sc., Ph.D., and Jesper Ryg, M.D., Ph.D.
    NEJM AI 2023; 1 (1)
  • “Overall, GPT-4 has potential clinical use in older patients without a definitive clinical diagnosis after 1 month but requires comprehensive entry of demographic and clinical (including radiological and pharmacological) information. GPT-4 may increase confidence in diagnosis and earlier commencement of appropriate treatment, alert clinicians missing important diagnoses, and offer suggestions similar to specialists to achieve the correct clinical diagnosis, which has potential value in low-income countries with lack of specialist care. Clinicians need to be aware that GPT-4 is limited in multifocal infection, and the suggested management plan should be correlated with clinical context, as suggestions may be redundant. Clinicians should consider a drug review and review the possible diagnosis of malignant disease if suggested.”
    Use of GPT-4 to Analyze Medical Records of Patients With Extensive Investigations and Delayed Diagnosis
    Yat-Fung Shea et al.
    JAMA Network Open. 2023;6(8):e2325000
  • “This study has several limitations. First, GPT-4 may not detect 2 focuses of infection or pinpoint the source of recurrent infection. Second, GPT-4 did not suggest the use of gallium scan or 18-fluorodeoxyglucose positron emission tomography to look for infections or malignant neoplasms in all but 1 patient. Third, some investigations may not be appropriate (eg, temporal artery biopsy in the absence of typical symptoms of giant cell arteritis). Overall, our findings suggest that the use of AI in diagnosis is both promising and challenging.”
    Use of GPT-4 to Analyze Medical Records of Patients With Extensive Investigations and Delayed Diagnosis
    Yat-Fung Shea et al.
    JAMA Network Open. 2023;6(8):e2325000
  • “Radiologists have experienced rapid and far-reaching technological changes in their practices. Whether it is computerized tomography, where the conversation focused on number of detectors (64, 128, 256, 512), or transformation of numbers of X-ray tubes in a scanner (1 or 2) into a whole new technology with photon scanning, or magnetic resonance where the arguments have progressed from 1.5 to 3 Tesla scanners to 7 Tesla units, the technology is transformed quickly. Radiologists and their patients have benefited from change and celebrate it. But those changes took years, and the end-users of those technological changes generally watched its development from afar. Even the integration of Artificial Intelligence (AI) into clinical practice has developed gradually, from the theoretical use of AI to help with image interpretation or workflow management into frequent, integrated use in both private practice and academic practice has become standard clinical practice, whether for pulmonary embolism detection, analysis of mammograms, or triage of an emergency room reading list.”
    Watching Innovation in Real Time: The Story of ChatGPT and Radiology.  
    Fishman EK, Weeks WB, Lavista Ferres JM, Chu LC.  
    Can Assoc Radiol J. 2023 May 3:8465371231174817. doi: 10.1177/08465371231174817. Epub ahead of print. PMID: 37138372.
  • “Some of the explanation for these improvements in performance may be explained by the fact that, to date, ChatGPT has been trained only on publicly available information and not specifically on medicine. As versions of large language models are trained on medical books, articles, journals, and guidelines, the power of large language models in medicine will become even stronger. That ChatGPT can pass medical licensing exams is especially impressive when one realizes that the systems were not specifically trained for these tasks and that the available information was often limited as, for example, they could not look at material behind a paywall. We are beginning to realize that if properly guided, the power of this new technology will redefine medicine.3,4 The key then will be to use this technology to keep the doctor-patient relationship alive rather than replacing it. Clinicians need to guide the technology into practice.”
    Watching Innovation in Real Time: The Story of ChatGPT and Radiology.  
    Fishman EK, Weeks WB, Lavista Ferres JM, Chu LC.  
    Can Assoc Radiol J. 2023 May 3:8465371231174817. doi: 10.1177/08465371231174817. Epub ahead of print. PMID: 37138372.
  • “The speed of change in the world of technology has never been as rapid as it is today. For clinicians - as well as clinical researchers - this can translate to both the best of times and the worst of times. While clinicians and clinical researchers become excited by the development of a new computer chip, CT scanner, or vaccine, they know that time will pass before the new technology impacts our daily workflow. That is, until the introduction of large language models like ChatGPT, an offering that allows us to use the technology as it is being developed. This technology – which has been in development for decades - has the chance to impact everything we do at work or at home. And advancements in this technology, fostered by incredible computing power and amounts of data available for learning, occur so quickly that before you can write and publish an article, your descriptions may be one to two versions behind. The story of ChatGPT is not only one of imagining what could be but also of seeing the change in near real time.”
    Watching Innovation in Real Time: The Story of ChatGPT and Radiology.  
    Fishman EK, Weeks WB, Lavista Ferres JM, Chu LC.  
    Can Assoc Radiol J. 2023 May 3:8465371231174817. doi: 10.1177/08465371231174817. Epub ahead of print. PMID: 37138372.
  • “In a recent editorial, Microsoft's president Brad Smith highlights that, like any other technology in history, AI will be used both as a tool and a weapon. In the fields of medicine and research, large language models can be intentionally misused to create scientific-sounding papers containing false information, which can easily spread on the internet. As some individuals are advocating for a shift away from traditional peer-review processes in medical journals, we must emphasize that the human peer-review process is more vital and critical than ever. Further, in what may be an arms race, reviewers and editors must make use of tools such as ChatGPT to uncover unethical activities on the part of authors.”
    Beyond chatting: The opportunities and challenges of ChatGPT in medicine and radiology
    Juan M. Lavista Ferres, William B. Weeks, Linda C. Chu, Steven P. Rowe, Elliot K. Fishman
    Diagnostic and Interventional Imaging, Volume 104, Issue 6, 2023, Pages 263-264
  • “AI will undoubtedly have a profound impact across all aspects of civilization. While we should embrace its benefits, we must also be aware of its potential drawbacks and limitations. With every step forward, there are bound to be missteps, and the road ahead will likely be full of surprises. Technology will continue to challenge, motivate, and occasionally even frighten us, but ultimately it has the potential to enhance our lives in countless ways.”
    Beyond chatting: The opportunities and challenges of ChatGPT in medicine and radiology
    Juan M. Lavista Ferres, William B. Weeks, Linda C. Chu, Steven P. Rowe, Elliot K. Fishman
    Diagnostic and Interventional Imaging, Volume 104, Issue 6, 2023, Pages 263-264
  • “Looking toward the future, AI systems will undoubtedly continue to advance and evolve [3]. As with any tool, it is essential to understand where it can contribute and in what scenarios it will not. When Tim Berners-Lee created the World Wide Web (WWW) in the 1990s, people, for the first time, had access to information that was not previously available. As we have come to learn, the web also is full of inaccuracies and misinformation. As a society, we are still working to help humans deal with this, but this was not a reason to throw out the WWW.”
    Beyond chatting: The opportunities and challenges of ChatGPT in medicine and radiology
    Juan M. Lavista Ferres, William B. Weeks, Linda C. Chu, Steven P. Rowe, Elliot K. Fishman
    Diagnostic and Interventional Imaging, Volume 104, Issue 6, 2023, Pages 263-264
  • IMPORTANCE Consumers are increasingly using artificial intelligence (AI) chatbots as a source of information. However, the quality of the cancer information generated by these chatbots has not yet been evaluated using validated instruments.
    OBJECTIVE To characterize the quality of information and presence of misinformation about skin, lung, breast, colorectal, and prostate cancers generated by 4 AI chatbots.
    CONCLUSIONS AND RELEVANCE Findings of this cross-sectional study suggest that AI chatbots generally produce accurate information for the top cancer-related search queries, but the responses are not readily actionable and are written at a college reading level. These limitations suggest that AI chatbots should be used supplementarily and not as a primary source for medical information.
    Assessment of Artificial Intelligence Chatbot Responses to Top Searched Queries About Cancer
    Alexander Pan et al.
    JAMA Oncol. doi:10.1001/jamaoncol.2023.2947
  • “The primary outcomes were the quality of consumer health information based on the validated DISCERN instrument (scores from 1 [low] to 5 [high] for quality of information) and the understandability and actionability of this information based on the understandability and actionability domains of the Patient Education Materials Assessment Tool (PEMAT) (scores of 0%-100%, with higher scores indicating a higher level of understandability and actionability). Secondary outcomes included misinformation scored using a 5-item Likert scale (scores from 1 [no misinformation] to 5 [high misinformation]) and readability assessed using the Flesch-Kincaid Grade Level readability score.”
    Assessment of Artificial Intelligence Chatbot Responses to Top Searched Queries About Cancer
    Alexander Pan et al.
    JAMA Oncol. doi:10.1001/jamaoncol.2023.2947
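    The Flesch-Kincaid Grade Level cited in this methodology follows a published formula: 0.39 × (words/sentences) + 11.8 × (syllables/words) − 15.59. A minimal sketch of that computation is below; the syllable counter is a rough vowel-group heuristic for illustration only, not the validated tooling the study would have used.

    ```python
    import re

    def flesch_kincaid_grade(text):
        """Approximate Flesch-Kincaid Grade Level of a passage."""
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z']+", text)

        def syllables(word):
            # Crude heuristic: count groups of consecutive vowels.
            return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

        total_syllables = sum(syllables(w) for w in words)
        return (0.39 * len(words) / len(sentences)
                + 11.8 * total_syllables / len(words)
                - 15.59)

    simple = flesch_kincaid_grade("The cat sat on the mat.")
    dense = flesch_kincaid_grade(
        "Comprehensive multidisciplinary evaluation necessitates "
        "interdepartmental collaboration.")
    ```

    Longer sentences with more polysyllabic words push the grade level up, which is why chatbot answers written in dense clinical prose score at a college reading level.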
  • Key Points
    Question What is the quality of cancer-related health information outputted by artificial intelligence (AI) chatbots?
    Findings In this cross-sectional study, the responses of 4 AI chatbots to the top search queries related to the 5 most prevalent cancers were high quality but were written at a college reading level and had poor actionability.
    Meaning Findings of this study suggest that AI chatbots are an accurate and reliable supplementary resource for medical information but are limited in their readability and should not replace health care professionals for individualized health care questions.  
    Assessment of Artificial Intelligence Chatbot Responses to Top Searched Queries About Cancer
    Alexander Pan et al.
    JAMA Oncol. doi:10.1001/jamaoncol.2023.2947
  • “Artificial intelligence chatbots are becoming a major source of medical information for consumers. Findings of this cross-sectional study suggest that they generally produce reliable and accurate medical information about lung, breast, colorectal, skin, and prostate cancers. However, the usefulness of the information is limited by its poor readability and lack of visual aids. These limitations suggest that AI chatbots should be used supplementarily and not as a primary source for medical information. To this end, AI chatbots typically encourage users to seek medical attention relating to cancer symptoms and treatment.”
    Assessment of Artificial Intelligence Chatbot Responses to Top Searched Queries About Cancer
    Alexander Pan et al.
    JAMA Oncol. doi:10.1001/jamaoncol.2023.2947
  • Large language models (LLMs) can respond to free-text queries without being specifically trained in the task in question, causing excitement and concern about their use in healthcare settings. ChatGPT is a generative artificial intelligence (AI) chatbot produced through sophisticated fine-tuning of an LLM, and other tools are emerging through similar developmental processes. Here we outline how LLM applications such as ChatGPT are developed, and we discuss how they are being leveraged in clinical settings. We consider the strengths and limitations of LLMs and their potential to improve the efficiency and effectiveness of clinical, educational and research work in medicine.
    Large language models in medicine
    Arun James Thirunavukarasu et al.
    Nature Medicine https://doi.org/10.1038/s41591-023-02448-8 
  • ChatGPT (OpenAI) is an LLM chatbot: a generative AI application that now produces text in response to multimodal input (having previously accepted only text input). Its backend LLM is Generative Pretrained Transformer 3.5 or 4 (GPT-3.5 or GPT-4), described below. ChatGPT’s impact stems from its conversational interactivity and near-human-level or equal-to-human-level performance in cognitive tasks across fields, including medicine. ChatGPT has attained passing-level performance in United States Medical Licensing Examinations, and there have been suggestions that LLM applications may be ready for use in clinical, educational or research settings.
    Large language models in medicine
    Arun James Thirunavukarasu et al.
    Nature Medicine https://doi.org/10.1038/s41591-023-02448-8 
  • Deep learning: a variant of machine learning involving neural networks with multiple layers of processing ‘perceptrons’ (nodes), which together facilitate extraction of higher features of unstructured input data (for example, images, video and text).  
    Generative artificial intelligence: computational systems capable of producing content, such as text, images or sound, on demand.
    Large language model: a type of AI model using deep neural networks to learn the relationships between words in natural language, using large datasets of text to train.
    Machine learning: a field of AI featuring models that enable computers to learn and make predictions based on input data, learning from experience.  
    Natural language processing: a field of AI research focusing on the interaction between computers and human language.
    Neural network: computing systems inspired by biological neural networks, comprising ‘perceptrons’ (nodes), usually arranged in layers, communicating with one another and performing transformations upon input data.
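  • The glossary's neural-network entries can be made concrete with a minimal sketch of one layer of "perceptrons": each node computes a weighted sum of its inputs plus a bias, then applies a nonlinear activation. All weights and values below are illustrative, not drawn from the cited paper.

    ```python
    import math

    def perceptron_layer(inputs, weights, biases):
        """One layer of perceptron nodes with sigmoid activation."""
        outputs = []
        for node_weights, bias in zip(weights, biases):
            z = sum(w * x for w, x in zip(node_weights, inputs)) + bias
            outputs.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid squashes to (0, 1)
        return outputs

    # Two input features fed into a layer of three nodes; stacking many
    # such layers is what makes a network "deep".
    layer_out = perceptron_layer([0.5, -1.0],
                                 [[0.1, 0.4], [-0.3, 0.8], [0.7, 0.2]],
                                 [0.0, 0.1, -0.2])
    ```

    Deep learning consists of composing many such layers and tuning the weights from data; in large language models the trained weights number in the billions.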
  • “The release of ChatGPT by OpenAI in November 2022 marked a pivotal moment in the introduction of a new generation of artificial intelligence (AI) to the broader public. The “GPT” in ChatGPT stands for “generative pre-trained transformer” and is a language model trained to generate realistic- and natural-sounding language. ChatGPT can generate coherent, grammatically correct text, which represents a significant step forward in AI development and has the potential to revolutionize a range of industries, including medicine and radiology.”
    Beyond chatting: The opportunities and challenges of ChatGPT in medicine and radiology.  
    Ferres JML, Weeks WB, Chu LC, Rowe SP, Fishman EK.  
    Diagn Interv Imaging. 2023 104 (2023) 263−264 
  • “In the late 1970s, Steve Jobs found inspiration in a Scientific American paper that examined the efficiency of locomotion across various species. Although humans were not ranked as the most efficient species, a human on a bicycle was twice as efficient as the most efficient animal, the condor. Jobs used this analogy to highlight how humans can create tools that enhance their capabilities. Fast-forward to today and large language models powered by AI are akin to bicycles, a new addition to our toolbox that can boost our ability to learn, work, and create a positive impact on the planet. Nevertheless, it is essential to acknowledge that AI has its limitations and, at least for now, must be used in tandem with human ingenuity and decision-making.”
    Beyond chatting: The opportunities and challenges of ChatGPT in medicine and radiology.  
    Ferres JML, Weeks WB, Chu LC, Rowe SP, Fishman EK.  
    Diagn Interv Imaging. 2023 104 (2023) 263−264 
  • “AI is not a new technology: it has been in development for decades and has already had a significant impact on various industries. For example, over the last decade, the use of AI in medical imaging has increased productivity and accuracy. It is important to understand that AI is not infallible. For the time being, even more advanced AI systems, such as ChatGPT, are reactive and do not plan or reason; they may make mistakes or generate inaccurate information, which can be mitigated but likely not completely avoided.”
    Beyond chatting: The opportunities and challenges of ChatGPT in medicine and radiology.  
    Ferres JML, Weeks WB, Chu LC, Rowe SP, Fishman EK.  
    Diagn Interv Imaging. 2023 104 (2023) 263−264 
  •  “Although the primary emphasis of large language models has been on their chat capabilities, these models are transformative in the field of natural language processing. Much of medical knowledge is captured in unstructured text, ranging from doctors’ notes and diagnoses to research papers. GPT-type models can revolutionize our capacity to transform these data into valuable information, providing researchers with an enormous tool to discover new medical breakthroughs.”
    Beyond chatting: The opportunities and challenges of ChatGPT in medicine and radiology.  
    Ferres JML, Weeks WB, Chu LC, Rowe SP, Fishman EK.  
    Diagn Interv Imaging. 2023 104 (2023) 263−264 
  • “ChatGPT is able to generate coherent research articles, which on initial review may closely resemble authentic articles published by academic researchers. However, all of the articles we assessed were factually inaccurate and had fictitious references. It is worth noting, however, that the articles generated may appear authentic to an untrained reader.”
    A comparison of ChatGPT‑generated articles with human‑written articles
    Sisith Ariyaratne et al
    Skeletal Radiology 2023 (in press)
  • “Our study had several limitations. We used a relatively small sample size, analyzing only 5 articles generated by ChatGPT. We also used version 3.0, which may have certain limitations, including ability to generate accurate information. Further studies analyzing a larger number of articles with more advanced versions of the AI software would ultimately be needed to definitively assess its reliability in generating scientific articles and could be a topic for future research.”
    A comparison of ChatGPT‑generated articles with human‑written articles
    Sisith Ariyaratne et al
    Skeletal Radiology 2023 (in press)
  • “The use of ChatGPT and other related AI technology in nurse education is expected to continue to grow as technology advances and students and educators will become more comfortable with its use. However, it is important to note that while AI technology can enhance teaching and learning, it should not replace human interaction and support. Nurse educators and students should be mindful of the limitations of AI technology and ensure that it is used in conjunction with other teaching methods to provide holistic nurse education.”  
    Chatting or cheating? The impacts of ChatGPT and other artificial intelligence language models on nurse education  
    Edmond Pui Hang Choi et al.
    Nurse Education Today 125 (2023) 105796
  • “While there is certainly potential for ChatGPT to enhance the teaching and learning experience, there are also concerns about its impact on students' critical thinking and clinical reasoning skills. To understand the impacts of ChatGPT on nurse education, more empirical research is needed to investigate: (i) the impact of ChatGPT on student learning outcomes, such as critical thinking, clinical reasoning and knowledge acquisition; (ii) the role of ChatGPT in nurse educators' teaching and how it affects their workload, teaching practices and student engagement; and (iii) the ethical considerations and implications of using ChatGPT in nurse education.”  
    Chatting or cheating? The impacts of ChatGPT and other artificial intelligence language models on nurse education  
    Edmond Pui Hang Choi et al.
    Nurse Education Today 125 (2023) 105796
  •  “Nurse educators should teach students when it is appropriate to use ChatGPT, how to critically appraise the contents generated by it and how to avoid over-reliance on it (Mhlanga, 2023). Nurse educators should help students develop critical and independent thinking skills to evaluate the validity, appropriateness and relevance of the information provided by ChatGPT. For example, they can teach students to consider potential biases and cross-validate information using reputable sources. Nursing students should be encouraged to use multiple sources of information, such as textbooks, academic journals and clinical protocols and guidelines, in addition to ChatGPT as human-driven verification processes are indispensable (van Dis et al., 2023).”  
    Chatting or cheating? The impacts of ChatGPT and other artificial intelligence language models on nurse education  
    Edmond Pui Hang Choi et al.
    Nurse Education Today 125 (2023) 105796
  • “In this cross-sectional study, a chatbot generated quality and empathetic responses to patient questions posed in an online forum. Further exploration of this technology is warranted in clinical settings, such as using chatbot to draft responses that physicians could then edit. Randomized trials could assess further if using AI assistants might improve responses, lower clinician burnout, and improve patient outcomes.”
    Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum
    John W. Ayers et al.
    JAMA Intern Med. doi:10.1001/jamainternmed.2023.1838
  • Question Can an artificial intelligence chatbot assistant provide responses to patient questions that are of comparable quality and empathy to those written by physicians?
    Findings In this cross-sectional study of 195 randomly drawn patient questions from a social media forum, a team of licensed health care professionals compared physician and chatbot responses to patient questions asked publicly on a social media forum. The chatbot responses were preferred over physician responses and rated significantly higher for both quality and empathy.
    Meaning These results suggest that artificial intelligence assistants may be able to aid in drafting responses to patient questions.
    Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum
    John W. Ayers et al.
    JAMA Intern Med. doi:10.1001/jamainternmed.2023.1838
  • “ChatGPT represents a new generation of AI technologies driven by advances in large language models. ChatGPT reached 100 million users within 64 days of its November 30, 2022 release and is widely recognized for its ability to write near-human-quality text on a wide range of topics. The system was not developed to provide health care, and its ability to help address patient questions is unexplored. We tested its ability to respond with high-quality and empathetic answers to patients’ health care questions, by comparing the chatbot responses with physicians’ responses to questions posted on a public social media forum.”
    Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum
    John W. Ayers et al.
    JAMA Intern Med. doi:10.1001/jamainternmed.2023.1838
  • “While this cross-sectional study has demonstrated promising results in the use of AI assistants for patient questions, it is crucial to note that further research is necessary before any definitive conclusions can be made regarding their potential effect in clinical settings. Despite the limitations of this study and the frequent overhyping of new technologies, studying the addition of AI assistants to patient messaging workflows holds promise with the potential to improve both clinician and patient outcomes.”
    Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum
    John W. Ayers et al.
    JAMA Intern Med. doi:10.1001/jamainternmed.2023.1838

  • “We should be clear-eyed about the risks inherent to any new technology, especially one that carries existential implications. And yet, I am cautiously optimistic about a future of improved health care system efficiency, better patient outcomes, and reduced burnout; a future where AI enables us to get back to the reason why we decided to pursue medicine in the first place—to get up from the computer and back to the bedside.”
    Medicine in the Era of Artificial Intelligence: Hey Chatbot, Write Me an H&P
    Teva D. Brender
    JAMA Internal Medicine Published online April 28, 2023
  • “However, my excitement is tempered by a healthy dose of skepticism. For instance, consider the example of more analog technology. Despite their initial promise, the effect of medical scribes on health care quality, patient satisfaction, and physician productivity and burnout has been decidedly mixed. One might counter that, leveraging the power of big data, AI’s potential is limitless. Nevertheless, we should remain open to the eventuality that, like medical scribes, AI will similarly underdeliver, or that its implementation in health care might be slower and the initial use cases more circumscribed than the proponents hope.”
    Medicine in the Era of Artificial Intelligence: Hey Chatbot, Write Me an H&P
    Teva D. Brender
    JAMA Internal Medicine Published online April 28, 2023
  • “Finally, these programs are not sentient, they simply use massive amounts of text to predict one word after another, and their outputs may mix truth with patently false statements called hallucinations. As such, physicians will need to learn how to integrate these tools into clinical practice, defining clear boundaries between full, supervised, and proscribed autonomy. Just as I do not routinely count the little boxes when determining a heart rate, instead trusting the computer-generated electrocardiogram report, I always meticulously scrutinize the waveform before activating the catheterization laboratory for an ST-elevation myocardial infarction.”
    Medicine in the Era of Artificial Intelligence Hey Chatbot, Write Me an H&P
    Teva D. Brender
    JAMA Internal Medicine Published online April 28, 2023
  • “A generative pretrained transformer (GPT) is an AI tool that produces text resembling human writing, allowing users to interact with AI almost as if they are communicating with another person. The sudden rise in popularity of LLMs was driven largely by GPT-3, OpenAI’s third iteration, which was called the fastest growing app of all time and the most innovative LLM. People use GPT by entering prompts—text instructions in the form of questions or commands. Creating effective AI prompts is an art as much as a science, and the possibilities seem endless. One can use GPT like a search engine. However, GPT’s predictive algorithms can also answer questions that have never been posed.”
    AI-Generated Medical Advice-GPT and Beyond.  
    Haupt CE, Marks M.
    JAMA. 2023 Apr 25;329(16):1349-1350.
  • “For clinicians, GPT can potentially ease burnout by taking on repetitive tasks. It could provide clinical decision support and be incorporated into electronic medical record platforms like Epic. GPT might augment or replace frequently used resources like UpToDate. In theory, physicians could enter patient information into the software and ask for a differential diagnosis or preliminary treatment plan. However, current versions of GPT are not HIPAA compliant and could jeopardize patient privacy. Until professional grade versions with adequate safeguards are available, clinicians should avoid inputting protected health information.”
    AI-Generated Medical Advice-GPT and Beyond.  
    Haupt CE, Marks M.
    JAMA. 2023 Apr 25;329(16):1349-1350.
  • “With respect to AI-generated medical advice, as with other innovations, we suggest focusing on relevant social relationships and how the technology affects them. If clinicians use LLMs to aid decision-making, they function like other medical resources or tools. However, using AI to replace human judgment poses safety risks to patients and may expose clinicians to legal liability. Until its accuracy and reliability are proven, GPT should not replace clinician judgment. Although clinicians are not responsible for harms caused by consumer-facing LLMs, they should educate patients about the risks. They might also advocate for FTC regulation that protects patients from false or misleading AI-generated medical advice.”
    AI-Generated Medical Advice-GPT and Beyond.  
    Haupt CE, Marks M.
    JAMA. 2023 Apr 25;329(16):1349-1350.
  • “When reliable LLMs do surface, they may well be found among specialized systems rather than generalist systems like ChatGPT. The problem of nontransparent and indiscriminate information sourcing is tractable, and market innovations are already emerging as companies develop LLM products specifically for clinical settings. These models focus on narrower tasks than systems like ChatGPT, making validation easier to perform. Specialized systems can vet LLM outputs against source articles for hallucination, train on electronic health records, or integrate traditional elements of clinical decision support software. Some medical informatics researchers are more sanguine than others about the prospects for specialized systems to outperform generalist models. As evidence continues to emerge, medical informatics researchers will have an important role to play in helping physicians understand the current situation of the specialized systems.”
    ChatGPT and Physicians’ Malpractice Risk
    Michelle M. Mello, JD, PhD, MPhil; Neel Guha, MS
    JAMA Health Forum. 2023;4(5):e231938. 
  • “At their current stage, LLMs have a tendency to generate factually incorrect outputs (called hallucination). The potential to mislead physicians is magnified by the fact that most LLMs source information nontransparently. Typically, no list of references is provided by which a physician may evaluate the reliability of the information used to generate the output. When references are given, they are often insufficient or unsupportive of the generated output (if not entirely fabricated).”
    ChatGPT and Physicians’ Malpractice Risk
    Michelle M. Mello, JD, PhD, MPhil; Neel Guha, MS
    JAMA Health Forum. 2023;4(5):e231938. 
  • “ChatGPT has exploded into the national consciousness. The potential for large language models (LLMs) such as ChatGPT, Bard, and many others to support or replace humans in a range of areas is now clear—and medical decisions are no exception. This has sharpened a perennial medicolegal question: How can physicians incorporate promising new technologies into their practice without increasing liability risk? The answer lawyers often give is that physicians should use LLMs to augment, not replace, their professional judgment. Physicians might be forgiven for finding such advice unhelpful. No competent physician would blindly follow model output. But what exactly does it mean to augment clinical judgment in a legally defensible fashion?”
    ChatGPT and Physicians’ Malpractice Risk
    Michelle M. Mello, JD, PhD, MPhil; Neel Guha, MS
    JAMA Health Forum. 2023;4(5):e231938. 
  • “ChatGPT, for example, was trained on a great deal of data from the Internet. As the quality of responses depends on the quality of the training data, ChatGPT at times produces unsettling responses. It also “hallucinates” by providing answers that are not derivable from the training data. Importantly, GAI models can be iteratively “fine-tuned” with additional domain-specific training data. Applied appropriately, fine-tuning can produce specialized models that are more accurate and predictable and thus safer for specific use cases.”  
    Getting Generative AI Right
    Samuel Aronson, Ted W. Lieu, Benjamin M. Scirica
    NEJM April 2023 DOI: 10.1056/CAT.23.0063
  • “The Centers for Medicare & Medicaid Services (CMS) should establish reimbursement policies to incentivize the use of time-saving, validated GAI capabilities. The CMS Innovation Center (CMMI) could be used to expedite the testing of new GAI reimbursement models before these models are broadly rolled out. Many billing codes tie compensation to a certain number of minutes per case rather than to quality and efficiency. As a counterbalance, billing codes for use of GAI in specific circumstances could create a powerful incentive for private financing of GAI capabilities. CMS also could consider sponsoring a new GAI competition.”  
    Getting Generative AI Right
    Samuel Aronson, Ted W. Lieu, Benjamin M. Scirica
    NEJM April 2023 DOI: 10.1056/CAT.23.0063
  • Us: Will governments, clinicians, and technologists be supportive?  
    ChatGPT: There is increasing recognition among clinicians, technologists, and governments of the potential benefits of deploying generative AI to assist with administrative and documentation tasks in health care, and of the need to focus on more complex clinical decision support uses. However, there may also be concerns around ethical data use, patient privacy, and the need for ongoing education and training to ensure that this technology is used effectively and responsibly. As such, it will be important for stakeholders to work together to address these challenges and support the responsible deployment of generative AI in health care.  
    Getting Generative AI Right
    Samuel Aronson, Ted W. Lieu, Benjamin M. Scirica
    NEJM April 2023 DOI: 10.1056/CAT.23.0063
  • AI in Clinical Practice: Barriers
    Reimbursement challenges
    Physician over-reliance on AI
    Potential issues with bias in AI algorithms (usually unintentional)
    Liability issues
    Patient comfort with AI assisted medicine
    Black box nature of AI
  • “A chatbot consists of two main components: a general-purpose AI system and a chat interface. This article considers specifically an AI system called GPT-4 (Generative Pretrained Transformer 4) with a chat interface; this system is widely available and in active development by OpenAI, an AI research and deployment company. To use a chatbot, one starts a “session” by entering a query — usually referred to as a “prompt” — in plain natural language. Typically, but not always, the user is a human being. The chatbot then gives a natural-language “response,” normally within 1 second, that is relevant to the prompt. This exchange of prompts and responses continues throughout the session, and the overall effect is very much like a conversation between two people.”
    Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine
    Peter Lee, Sebastien Bubeck, Joseph Petro
    n engl j med 388;13 nejm.org March 30, 2023
  • “A false response by GPT-4 is sometimes referred to as a “hallucination,” and such errors can be particularly dangerous in medical scenarios because the errors or falsehoods can be subtle and are often stated by the chatbot in such a convincing manner that the person making the query may be convinced of its veracity. It is thus important to check or verify the output of GPT-4.”
    Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine
    Peter Lee, Sebastien Bubeck, Joseph Petro
    n engl j med 388;13 nejm.org March 30, 2023
  • “GPT-4 was not programmed for a specific “assigned task” such as reading images or analyzing medical notes. Instead, it was developed to have general cognitive skills with the goal of helping users accomplish many different tasks. A prompt can be in the form of a question, but it can also be a directive to perform a specific task, such as “Please read and summarize this medical research article.” Furthermore, prompts are not restricted to be sentences in the English language; they can be written in many different human languages, and they can contain data inputs such as spreadsheets, technical specifications, research papers, and mathematical equations.”
    Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine
    Peter Lee, Sebastien Bubeck, Joseph Petro
    N Engl J Med 2023;388(13). March 30, 2023
  • “Even though GPT-4 was trained only on openly available information on the Internet, when it is given a battery of test questions from the USMLE, it answers correctly more than 90% of the time. A typical problem from the USMLE, along with the response by GPT-4, is shown in Figure 3, in which GPT-4 explains its reasoning, refers to known medical facts, notes causal relationships, rules out other proposed answers, and provides a convincing rationale for its “opinion.””
    Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine
    Peter Lee, Sebastien Bubeck, Joseph Petro
    N Engl J Med 2023;388(13). March 30, 2023
  • “This knowledge of medicine makes GPT-4 potentially useful not only in clinical settings but also in research. GPT-4 can read medical research material and engage in informed discussion about it, such as briefly summarizing the content, providing technical analysis, identifying relevant prior work, assessing the conclusions, and asking possible follow-up research questions.”
    Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine
    Peter Lee, Sebastien Bubeck, Joseph Petro
    N Engl J Med 2023;388(13). March 30, 2023
  • “Perhaps the most important point is that GPT-4 is not an end in and of itself. It is the opening of a door to new possibilities as well as new risks. We speculate that GPT-4 will soon be followed by even more powerful and capable AI systems — a series of increasingly powerful and generally intelligent machines. These machines are tools, and like all tools, they can be used for good but have the potential to cause harm. If used carefully and with an appropriate degree of caution, these evolving tools have the potential to help health care providers give the best care possible.”
    Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine
    Peter Lee, Sebastien Bubeck, Joseph Petro
    N Engl J Med 2023;388(13). March 30, 2023
  • “It is important to understand that this is a fast-moving field, so to some extent, what we publish may have the resolution of a snapshot of the landscape taken from a bullet train. Specifically, things happening in close temporal proximity to publication may be blurred because they are changing quickly, but the distant background will be in reasonably good focus.”
    Artificial Intelligence and Machine Learning in Clinical Medicine, 2023  
    Charlotte J. Haug, Jeffrey M. Drazen  
    N Engl J Med 2023;388:1201-8.  
  • “A chatbot is a computer program that uses AI and natural-language processing to understand questions and automate responses to them, simulating human conversation. A very early medical chatbot, ELIZA, was developed between 1964 and 1966 by Joseph Weizenbaum at the Artificial Intelligence Laboratory of the Massachusetts Institute of Technology.”
    Artificial Intelligence and Machine Learning in Clinical Medicine, 2023  
    Charlotte J. Haug, Jeffrey M. Drazen  
    N Engl J Med 2023;388:1201-8. 
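ELIZA, mentioned above, worked not by understanding language but by matching user input against a list of hand-written patterns and filling the match into a response template. A minimal sketch of that pattern-matching idea (the rules here are invented for illustration, not Weizenbaum's original script):

```python
import re

# ELIZA-style rules: (pattern, response template) pairs, tried in order.
# These example rules are illustrative, not the historical ELIZA script.
RULES = [
    (re.compile(r"i feel (.*)", re.I), "Why do you feel {0}?"),
    (re.compile(r"i am (.*)", re.I), "How long have you been {0}?"),
]

def respond(utterance):
    """Return the first rule's template filled with the matched text,
    or a generic fallback when no rule matches."""
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return template.format(match.group(1))
    return "Please tell me more."

print(respond("I feel tired"))  # Why do you feel tired?
```

The contrast with modern LLM chatbots is the point: ELIZA's behavior is fully specified by its rule list, whereas a model like GPT-4 generates responses from learned statistical patterns rather than explicit templates.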
  • “The application of greatest potential and concern is the use of chatbots to make diagnoses or recommend treatment. A user without clinical experience could have trouble differentiating fact from fiction. Both these issues are addressed in the article by Lee and colleagues, who point out the strengths and weaknesses of using chatbots in medicine. Since the authors have created one such entity, bias is likely. Nevertheless, we think that chatbots will become important tools in the practice of medicine. Like any good tool, they can help us do our job better, but if not used properly, they have the potential to do damage. Since the tools are new and hard to test with the use of the traditional methods noted above, the medical community will be learning how to use them, but learn we must. There is no question that the chatbots will also learn from their users.”
    Artificial Intelligence and Machine Learning in Clinical Medicine, 2023  
    Charlotte J. Haug, Jeffrey M. Drazen
    N Engl J Med 2023;388:1201-8. 
  • “We firmly believe that the introduction of AI and machine learning in medicine has helped health professionals improve the quality of care that they can deliver and has the promise to improve it even more in the near future and beyond. Just as computer acquisition of radiographic images did away with the x-ray file room and lost images, AI and machine learning can transform medicine. Health professionals will figure out how to work with AI and machine learning as we grow along with the technology. AI and machine learning will not put health professionals out of business; rather, they will make it possible for health professionals to do their jobs better and leave time for the human–human interactions that make medicine the rewarding profession we all value.”
    Artificial Intelligence and Machine Learning in Clinical Medicine, 2023  
    Charlotte J. Haug, Jeffrey M. Drazen  
    N Engl J Med 2023;388:1201-8. 
  • “Artificial intelligence (AI) technologies to help authors improve the preparation and quality of their manuscripts and published articles are rapidly increasing in number and sophistication. These include tools to assist with writing, grammar, language, references, statistical analysis, and reporting standards. Editors and publishers also use AI-assisted tools for myriad purposes, including to screen submissions for problems (eg, plagiarism, image manipulation, ethical issues), triage submissions, validate references, edit, and code content for publication in different media and to facilitate postpublication search and discoverability.”
    Nonhuman “Authors” and Implications for the Integrity of Scientific Publication and Medical Knowledge.  
    Flanagin A, Bibbins-Domingo K, Berkwits M, Christiansen SL.
    JAMA. Published online January 31, 2023. doi:10.1001/jama.2023.1344
  • “In November 2022, OpenAI released a new open source, natural language processing tool called ChatGPT. ChatGPT is an evolution of a chatbot that is designed to simulate human conversation in response to prompts or questions (GPT stands for “generative pretrained transformer”). The release has prompted immediate excitement about its many potential uses but also trepidation about potential misuse, such as concerns about using the language model to cheat on homework assignments, write student essays, and take examinations, including medical licensing examinations. In January 2023, Nature reported on 2 preprints and 2 articles published in the science and health fields that included ChatGPT as a bylined author. Each of these includes an affiliation for ChatGPT, and 1 of the articles includes an email address for the nonhuman “author.” According to Nature, that article’s inclusion of ChatGPT in the author byline was an “error that will soon be corrected.” However, these articles and their nonhuman “authors” have already been indexed in PubMed and Google Scholar.”
    Nonhuman “Authors” and Implications for the Integrity of Scientific Publication and Medical Knowledge.  
    Flanagin A, Bibbins-Domingo K, Berkwits M, Christiansen SL.
    JAMA. Published online January 31, 2023. doi:10.1001/jama.2023.1344
  • “Nonhuman artificial intelligence, language models, machine learning, or similar technologies do not qualify for authorship. If these models or tools are used to create content or assist with writing or manuscript preparation, authors must take responsibility for the integrity of the content generated by these tools. Authors should report the use of artificial intelligence, language models, machine learning, or similar technologies to create content or assist with writing or editing of manuscripts in the Acknowledgment section or the Methods section if this is part of formal research design or methods. This should include a description of the content that was created or edited and the name of the language model or tool, version and extension numbers, and manufacturer. (Note: this does not include basic tools for checking grammar, spelling, references, etc.)”
    Nonhuman “Authors” and Implications for the Integrity of Scientific Publication and Medical Knowledge.  
    Flanagin A, Bibbins-Domingo K, Berkwits M, Christiansen SL.
    JAMA. Published online January 31, 2023. doi:10.1001/jama.2023.1344
  • “Transformative, disruptive technologies, like AI language models, create promise and opportunities as well as risks and threats for all involved in the scientific enterprise. Calls for journals to implement screening for AI-generated content will likely escalate, especially for journals that have been targets of paper mills and other unscrupulous or fraudulent practices. But with large investments in further development, AI tools may be capable of evading any such screens. Regardless, AI technologies have existed for some time, will be further and faster developed, and will continue to be used in all stages of research and the dissemination of information, hopefully with innovative advances that offset any perils. In this era of pervasive misinformation and mistrust, responsible use of AI language models and transparent reporting of how these tools are used in the creation of information and publication are vital to promote and protect the credibility and integrity of medical research and trust in medical knowledge.”
    Nonhuman “Authors” and Implications for the Integrity of Scientific Publication and Medical Knowledge.  
    Flanagin A, Bibbins-Domingo K, Berkwits M, Christiansen SL.
    JAMA. Published online January 31, 2023. doi:10.1001/jama.2023.1344
  • “This exploratory study found that a popular online AI model provided largely appropriate responses to simple CVD prevention questions as evaluated by preventive cardiology clinicians. Findings suggest the potential of interactive AI to assist clinical workflows by augmenting patient education and patient-clinician communication around common CVD prevention queries. For example, such an application may provide conversational responses to simple queries on informational platforms or create automated draft responses to patient electronic messages for clinicians. Whether these approaches can improve readability should be explored, because prior work has indicated low readability of certain online patient educational materials for CVD prevention.”
    Appropriateness of Cardiovascular Disease Prevention Recommendations Obtained From a Popular Online Chat-Based Artificial Intelligence Model.  
    Sarraju A, Bruemmer D, Van Iterson E, Cho L, Rodriguez F, Laffin L.  
    JAMA. Published online February 03, 2023. doi:10.1001/jama.2023.1044
  • “AI model responses to 21 of 25 questions (84%) were graded as appropriate in both contexts (Table). Four responses (16%) were graded as inappropriate in both contexts. For 3 of the 4 sets of responses, all 3 responses had inappropriate information; for 1 set, 1 of 3 responses was inappropriate. For example, the AI model responded to questions about exercise by firmly recommending both cardiovascular activity and lifting weights, which may be incorrect and potentially harmful for certain patients. Responses about interpreting a low-density lipoprotein cholesterol level of 200 mg/dL lacked relevant details, including familial hypercholesterolemia and genetic considerations. Responses about inclisiran suggested that it is commercially unavailable. No responses were graded as unreliable.”
    Appropriateness of Cardiovascular Disease Prevention Recommendations Obtained From a Popular Online Chat-Based Artificial Intelligence Model.  
    Sarraju A, Bruemmer D, Van Iterson E, Cho L, Rodriguez F, Laffin L.  
    JAMA. Published online February 03, 2023. doi:10.1001/jama.2023.1044

Copyright © 2024 The Johns Hopkins University, The Johns Hopkins Hospital, and The Johns Hopkins Health System Corporation. All rights reserved.