Comparison of AI and resident doctors in diagnosing and managing acute respiratory cases: a retrospective study

Bibliographic Details
Main Authors: Abdurrahmaan Manga, Saher Sultan, Ammarah Yasmeen, Fasihul Khan
Format: Article
Language: English
Published: Elsevier 2025-06-01
Series: Future Healthcare Journal
Online Access: http://www.sciencedirect.com/science/article/pii/S2514664525001626
Description
Summary: Introduction: The use of artificial intelligence (AI) models, such as ChatGPT, in clinical medicine has generated significant interest due to their potential to support diagnostic and treatment decisions.1 These AI models have shown an ability to process and summarise complex medical information; however, their effectiveness compared with resident doctors in real-world clinical settings is not yet well understood.2 It is particularly important to assess their role in acute respiratory medicine, where timely and accurate decision-making can significantly affect patient outcomes. This study aimed to evaluate how the diagnostic and management decisions made by resident doctors and by ChatGPT compared with a consultant-defined gold standard in a tertiary hospital setting.

Methods: A retrospective analysis was conducted on 45 patients admitted to the respiratory admissions unit. Exclusion criteria included patients lacking capacity (eg, confusion or dementia), those with language barriers preventing a coherent history, patients with a primary diagnosis unrelated to respiratory medicine, and patients clerked alongside registrars or consultants, because we wanted to compare AI outcomes solely with those of resident doctors. Patient-clerking proformas and investigations were scanned and entered into ChatGPT 4.0 using a structured prompt to generate diagnostic and management plans. Plans devised by resident doctors and by ChatGPT were independently assessed by blinded respiratory specialists using a 5-point Likert scale (1 = poor, 5 = fully concordant) relative to the consultant-defined gold standard. A 5-point Likert scale was chosen because it provides a straightforward way to assess agreement while avoiding decision fatigue among adjudicators. Blinding was ensured by randomly mixing the order of plans and standardising the AI's language to resemble the resident doctors' style. Medians and interquartile ranges (IQRs) were used to summarise scores. Wilcoxon signed-rank tests assessed differences, and weighted Cohen's kappa examined agreement between resident doctors and ChatGPT; the percentage of perfect agreement (score = 5) was also calculated.

Results and discussion: No statistically significant difference was found between resident doctors and ChatGPT for diagnosis (Wilcoxon p=0.35) or management (Wilcoxon p=0.14). The median (IQR) score for resident doctor diagnoses was 4.00 (4–5), while for ChatGPT diagnoses it was 5.00 (4–5). For management plans, the resident doctor median (IQR) score was 4.00 (3–4), whereas ChatGPT scored 4.00 (4–5). ChatGPT also achieved higher perfect agreement with consultant plans than resident doctors for both diagnosis (51.1% vs 42.2%) and management (40.0% vs 24.4%). Weighted kappa revealed very poor agreement between resident doctors and ChatGPT for diagnosis (0.056) but fair agreement for management (0.305). Score distributions indicated that ChatGPT more frequently achieved perfect scores (5), whereas resident doctors more commonly scored 4 (Figs 1 and 2).

Conclusion: ChatGPT's diagnostic and management plans performed comparably to those of resident doctors and demonstrated higher perfect agreement with consultant-defined standards. However, the low diagnostic agreement between ChatGPT and resident doctors suggests differences in clinical reasoning. This study highlights the potential of AI to improve diagnostic accuracy and help resident doctors refine management strategies. Further research is needed to explore its integration into clinical workflows and training programmes, as well as its ability to augment decision-making to enhance, rather than replace, resident doctor expertise.3
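
The following Python sketch illustrates the kind of analysis described in the Methods: paired Likert scores for resident-doctor and ChatGPT plans compared with a Wilcoxon signed-rank test, weighted Cohen's kappa, medians with IQRs, and the percentage of perfect agreement. It is not part of the published study; the score values and the linear kappa weighting are illustrative assumptions, since the abstract does not report the raw data or the weighting scheme.

import numpy as np
from scipy.stats import wilcoxon
from sklearn.metrics import cohen_kappa_score

# Hypothetical paired Likert scores (1-5) awarded by blinded specialists to the
# resident-doctor and ChatGPT plans for the same patients, relative to the
# consultant-defined gold standard.
resident_scores = np.array([4, 4, 5, 3, 4, 5, 4, 4, 3, 5, 4, 4, 5, 4, 3])
chatgpt_scores = np.array([5, 4, 5, 4, 5, 5, 4, 5, 4, 5, 4, 5, 5, 4, 4])

def median_iqr(scores):
    # Median and interquartile range used to summarise each set of scores
    q1, med, q3 = np.percentile(scores, [25, 50, 75])
    return med, (q1, q3)

# Wilcoxon signed-rank test on the paired scores (tests for a systematic difference)
stat, p_value = wilcoxon(resident_scores, chatgpt_scores)

# Weighted Cohen's kappa between the two sets of scores; linear weighting is an
# assumption here, as the abstract does not state which scheme was used.
kappa = cohen_kappa_score(resident_scores, chatgpt_scores, weights="linear")

# Percentage of "perfect agreement" with the consultant gold standard (score == 5)
perfect_resident = np.mean(resident_scores == 5) * 100
perfect_chatgpt = np.mean(chatgpt_scores == 5) * 100

print(f"Resident median (IQR): {median_iqr(resident_scores)}")
print(f"ChatGPT median (IQR): {median_iqr(chatgpt_scores)}")
print(f"Wilcoxon signed-rank p = {p_value:.3f}")
print(f"Weighted kappa = {kappa:.3f}")
print(f"Perfect scores: residents {perfect_resident:.1f}%, ChatGPT {perfect_chatgpt:.1f}%")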
ISSN: 2514-6645