Performance of large language models in the differential diagnosis of benign and malignant biliary stricture

Background: Distinguishing benign from malignant biliary strictures remains challenging. Large Language Models (LLMs) show promise in enhancing diagnostic accuracy. This study aimed to evaluate the performances of ten LLMs in the differential diagnosis of benign and malignant biliary strictures. Method...

Bibliographic Details
Main Authors: Chenxi Kang, Jing Li, Xintian Yang, Gui Ren, Linhui Zhang, Wei Wang, Xin Liu, Lei Wang, Guochen Shang, Jianglong Hong, Bingnian Wan, Yu Du, Wei Zeng, Yaling Liu, Tongxin Li, Lijun Lou, Hui Luo, Shuhui Liang, Yong Lv, Yanglin Pan
Format: Article
Language: English
Published: Frontiers Media S.A. 2025-07-01
Series: Frontiers in Oncology
Subjects: large language model; biliary stricture; cholangiocarcinoma; prediction model; diagnosis
Online Access: https://www.frontiersin.org/articles/10.3389/fonc.2025.1613818/full
_version_ 1839641860045275136
author Chenxi Kang
Jing Li
Xintian Yang
Gui Ren
Linhui Zhang
Wei Wang
Xin Liu
Lei Wang
Guochen Shang
Jianglong Hong
Bingnian Wan
Yu Du
Wei Zeng
Yaling Liu
Tongxin Li
Lijun Lou
Hui Luo
Shuhui Liang
Yong Lv
Yanglin Pan
author_sort Chenxi Kang
collection DOAJ
description Background: Distinguishing benign from malignant biliary strictures remains challenging. Large Language Models (LLMs) show promise in enhancing diagnostic accuracy. This study aimed to evaluate the performances of ten LLMs in the differential diagnosis of benign and malignant biliary strictures.
Methods: Consecutive patients with biliary strictures undergoing endoscopic retrograde cholangiopancreatography (ERCP) at Xijing Hospital between January and December 2024 were retrospectively analyzed. Ten LLMs were systematically prompted with standardized clinical, laboratory, and imaging data. Performance was compared against tumor markers (CA19-9, CEA), a new multivariable clinical model, and ten independent, experienced pancreaticobiliary physicians. Subgroup analyses assessed hilar (n=29) versus non-hilar strictures. Gold-standard diagnosis relied on histopathology and ≥3-month follow-up.
Results: Among the 159 included patients (83 benign, 76 malignant), four LLMs (Kimi, Deepseek-R1, Claude-3.5S, Llama-3.1), the clinical model (AUC: 0.83), and six physicians achieved >80% accuracy. Kimi demonstrated superior accuracy (87%), significantly outperforming 70% of physicians (7/10, p<0.01). Three other LLMs (Deepseek-R1: 83%, Claude-3.5S: 82%, Llama-3.1: 81%) and the clinical model performed comparably to physicians (78-84%, p>0.05), collectively surpassing tumor markers (CA19-9 accuracy: 66%, CEA: 71%). Physicians demonstrated higher accuracy for hilar strictures (87% vs. 79% for non-hilar, p<0.001). LLMs showed similar performance across stricture locations (hilar: 64-95%; non-hilar: 62-88%, p>0.05). For hilar strictures, 7/10 physicians achieved significantly higher accuracy (87-90%) than 8/10 LLMs (64-84%, p<0.05).
Conclusions: Using clinical, lab, and imaging data, some LLMs achieved diagnostic accuracy comparable to or exceeding clinical models and experienced physicians for differentiating benign versus malignant strictures. However, for hilar strictures, LLM performance was inferior to over half of the physicians.
format Article
id doaj-art-da97d55f4965448f83c2bc49b3de6b42
institution Matheson Library
issn 2234-943X
language English
publishDate 2025-07-01
publisher Frontiers Media S.A.
record_format Article
series Frontiers in Oncology
spelling doaj-art-da97d55f4965448f83c2bc49b3de6b42
2025-07-03T04:10:23Z
eng
Frontiers Media S.A.
Frontiers in Oncology
2234-943X
2025-07-01
15
10.3389/fonc.2025.1613818
1613818
Performance of large language models in the differential diagnosis of benign and malignant biliary stricture
0 Chenxi Kang: Xijing Hospital of Digestive Diseases, Air Force Medical University, Xi’an, China
1 Jing Li: Xijing Hospital of Digestive Diseases, Air Force Medical University, Xi’an, China
2 Xintian Yang: Xijing Hospital of Digestive Diseases, Air Force Medical University, Xi’an, China
3 Gui Ren: Xijing Hospital of Digestive Diseases, Air Force Medical University, Xi’an, China
4 Linhui Zhang: Xijing Hospital of Digestive Diseases, Air Force Medical University, Xi’an, China
5 Wei Wang: Department of Gastroenterology, People’s Liberation Army Joint Logistics Support Force 940th Hospital, Lanzhou, Gansu, China
6 Xin Liu: Department of Gastroenterology, Third People’s Hospital of Gansu Province, Lanzhou, Gansu, China
7 Lei Wang: Department of Gastroenterology, Ankang Traditional Chinese Medicine Hospital, Ankang, China
8 Guochen Shang: Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei, China
9 Jianglong Hong: First Affiliated Hospital of Anhui Medical University, Hefei, Anhui, China
10 Bingnian Wan: Yantai Ludong Hospital, Shandong Provincial Hospital Group, Yantai, Shandong, China
11 Yu Du: Department of Gastroenterology, Qinzhou Second People’s Hospital, Qinzhou, China
12 Wei Zeng: Xiang’an Hospital, Xiamen University, Xiamen, Fujian, China
13 Yaling Liu: Xijing Hospital of Digestive Diseases, Air Force Medical University, Xi’an, China
14 Tongxin Li: Xijing Hospital of Digestive Diseases, Air Force Medical University, Xi’an, China
15 Lijun Lou: Xijing Hospital of Digestive Diseases, Air Force Medical University, Xi’an, China
16 Hui Luo: Xijing Hospital of Digestive Diseases, Air Force Medical University, Xi’an, China
17 Shuhui Liang: Xijing Hospital of Digestive Diseases, Air Force Medical University, Xi’an, China
18 Yong Lv: Xijing Hospital of Digestive Diseases, Air Force Medical University, Xi’an, China
19 Yanglin Pan: Xijing Hospital of Digestive Diseases, Air Force Medical University, Xi’an, China
https://www.frontiersin.org/articles/10.3389/fonc.2025.1613818/full
large language model
biliary stricture
cholangiocarcinoma
prediction model
diagnosis
title Performance of large language models in the differential diagnosis of benign and malignant biliary stricture
topic large language model
biliary stricture
cholangiocarcinoma
prediction model
diagnosis
url https://www.frontiersin.org/articles/10.3389/fonc.2025.1613818/full