Performance of large language models in the differential diagnosis of benign and malignant biliary stricture
Main Authors: | , , , , , , , , , , , , , , , , , , , |
---|---|
Format: | Article |
Language: | English |
Published: | Frontiers Media S.A., 2025-07-01 |
Series: | Frontiers in Oncology |
Online Access: | https://www.frontiersin.org/articles/10.3389/fonc.2025.1613818/full |
Summary: | Background: Distinguishing benign from malignant biliary strictures remains challenging. Large language models (LLMs) show promise in enhancing diagnostic accuracy. This study aimed to evaluate the performance of ten LLMs in the differential diagnosis of benign and malignant biliary strictures. Methods: Consecutive patients with biliary strictures undergoing endoscopic retrograde cholangiopancreatography (ERCP) at Xijing Hospital between January and December 2024 were retrospectively analyzed. Ten LLMs were systematically prompted with standardized clinical, laboratory, and imaging data. Performance was compared against tumor markers (CA19-9, CEA), a new multivariable clinical model, and ten independent, experienced pancreaticobiliary physicians. Subgroup analyses compared hilar (n=29) with non-hilar strictures. Gold-standard diagnosis relied on histopathology and ≥3-month follow-up. Results: Among the 159 included patients (83 benign, 76 malignant), four LLMs (Kimi, Deepseek-R1, Claude-3.5S, Llama-3.1), the clinical model (AUC: 0.83), and six physicians achieved >80% accuracy. Kimi demonstrated the highest accuracy (87%), significantly outperforming 70% of physicians (7/10, p<0.01). Three other LLMs (Deepseek-R1: 83%, Claude-3.5S: 82%, Llama-3.1: 81%) and the clinical model performed comparably to physicians (78-84%, p>0.05), collectively surpassing tumor markers (CA19-9 accuracy: 66%, CEA: 71%). Physicians demonstrated higher accuracy for hilar strictures (87% vs. 79% for non-hilar, p<0.001). LLMs showed similar performance across stricture locations (hilar: 64-95%; non-hilar: 62-88%, p>0.05). For hilar strictures, 7/10 physicians achieved significantly higher accuracy (87-90%) than 8/10 LLMs (64-84%, p<0.05). Conclusions: Using clinical, laboratory, and imaging data, some LLMs achieved diagnostic accuracy comparable to or exceeding that of clinical models and experienced physicians in differentiating benign from malignant biliary strictures. However, for hilar strictures, LLM performance was inferior to that of over half of the physicians. |
---|---|
ISSN: | 2234-943X |