Performance of large language models in the differential diagnosis of benign and malignant biliary stricture
Background: Distinguishing benign from malignant biliary strictures remains challenging. Large Language Models (LLMs) show promise in enhancing diagnostic accuracy. This study aimed to evaluate the performances of ten LLMs in the differential diagnosis of benign and malignant biliary strictures. Method...
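Main Authors: | Chenxi Kang, Jing Li, Xintian Yang, Gui Ren, Linhui Zhang, Wei Wang, Xin Liu, Lei Wang, Guochen Shang, Jianglong Hong, Bingnian Wan, Yu Du, Wei Zeng, Yaling Liu, Tongxin Li, Lijun Lou, Hui Luo, Shuhui Liang, Yong Lv, Yanglin Pan |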
---|---|
Format: | Article |
Language: | English |
Published: | Frontiers Media S.A., 2025-07-01 |
Series: | Frontiers in Oncology |
Subjects: | large language model; biliary stricture; cholangiocarcinoma; prediction model; diagnosis |
Online Access: | https://www.frontiersin.org/articles/10.3389/fonc.2025.1613818/full |
_version_ | 1839641860045275136 |
---|---|
author | Chenxi Kang; Jing Li; Xintian Yang; Gui Ren; Linhui Zhang; Wei Wang; Xin Liu; Lei Wang; Guochen Shang; Jianglong Hong; Bingnian Wan; Yu Du; Wei Zeng; Yaling Liu; Tongxin Li; Lijun Lou; Hui Luo; Shuhui Liang; Yong Lv; Yanglin Pan |
author_sort | Chenxi Kang |
collection | DOAJ |
description | Background: Distinguishing benign from malignant biliary strictures remains challenging. Large Language Models (LLMs) show promise in enhancing diagnostic accuracy. This study aimed to evaluate the performances of ten LLMs in the differential diagnosis of benign and malignant biliary strictures. Methods: Consecutive patients with biliary strictures undergoing endoscopic retrograde cholangiopancreatography (ERCP) at Xijing Hospital between January and December 2024 were retrospectively analyzed. Ten LLMs were systematically prompted with standardized clinical, laboratory, and imaging data. Performance was compared against tumor markers (CA19-9, CEA), a new multivariable clinical model, and ten independent, experienced pancreaticobiliary physicians. Subgroup analyses assessed hilar (n=29) versus non-hilar strictures. Gold-standard diagnosis relied on histopathology and ≥3-month follow-up. Results: Among the 159 included patients (83 benign, 76 malignant), four LLMs (Kimi, Deepseek-R1, Claude-3.5S, Llama-3.1), the clinical model (AUC: 0.83), and six physicians achieved >80% accuracy. Kimi demonstrated superior accuracy (87%), significantly outperforming 70% of physicians (7/10, p<0.01). Three other LLMs (Deepseek-R1: 83%, Claude-3.5S: 82%, Llama-3.1: 81%) and the clinical model performed comparably to physicians (78-84%, p>0.05), collectively surpassing tumor markers (CA19-9 accuracy: 66%, CEA: 71%). Physicians demonstrated higher accuracy for hilar strictures (87% vs. 79% for non-hilar, p<0.001). LLMs showed similar performance across stricture locations (hilar: 64-95%; non-hilar: 62-88%, p>0.05). For hilar strictures, 7/10 physicians achieved significantly higher accuracy (87-90%) than 8/10 LLMs (64-84%, p<0.05). Conclusions: Using clinical, lab, and imaging data, some LLMs achieved diagnostic accuracy comparable to or exceeding clinical models and experienced physicians for differentiating benign versus malignant strictures. However, for hilar strictures, LLM performance was inferior to over half of the physicians. |
format | Article |
id | doaj-art-da97d55f4965448f83c2bc49b3de6b42 |
institution | Matheson Library |
issn | 2234-943X |
language | English |
publishDate | 2025-07-01 |
publisher | Frontiers Media S.A. |
record_format | Article |
series | Frontiers in Oncology |
spelling | doaj-art-da97d55f4965448f83c2bc49b3de6b42; 2025-07-03T04:10:23Z; eng; Frontiers Media S.A.; Frontiers in Oncology; 2234-943X; 2025-07-01; 15; 10.3389/fonc.2025.1613818; 1613818; Performance of large language models in the differential diagnosis of benign and malignant biliary stricture; Authors: Chenxi Kang [0], Jing Li [1], Xintian Yang [2], Gui Ren [3], Linhui Zhang [4], Wei Wang [5], Xin Liu [6], Lei Wang [7], Guochen Shang [8], Jianglong Hong [9], Bingnian Wan [10], Yu Du [11], Wei Zeng [12], Yaling Liu [13], Tongxin Li [14], Lijun Lou [15], Hui Luo [16], Shuhui Liang [17], Yong Lv [18], Yanglin Pan [19]; Affiliations, listed in author order: [0-4] Xijing Hospital of Digestive Diseases, Air Force Medical University, Xi’an, China; [5] Department of Gastroenterology, People’s Liberation Army Joint Logistics Support Force 940th Hospital, Lanzhou, Gansu, China; [6] Department of Gastroenterology, Third People’s Hospital of Gansu Province, Lanzhou, Gansu, China; [7] Department of Gastroenterology, Ankang Traditional Chinese Medicine Hospital, Ankang, China; [8] Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei, China; [9] First Affiliated Hospital of Anhui Medical University, Hefei, Anhui, China; [10] Yantai Ludong Hospital, Shandong Provincial Hospital Group, Yantai, Shandong, China; [11] Department of Gastroenterology, Qinzhou Second People’s Hospital, Qinzhou, China; [12] Xiang’an Hospital, Xiamen University, Xiamen, Fujian, China; [13-19] Xijing Hospital of Digestive Diseases, Air Force Medical University, Xi’an, China; https://www.frontiersin.org/articles/10.3389/fonc.2025.1613818/full; Keywords: large language model; biliary stricture; cholangiocarcinoma; prediction model; diagnosis |
title | Performance of large language models in the differential diagnosis of benign and malignant biliary stricture |
topic | large language model; biliary stricture; cholangiocarcinoma; prediction model; diagnosis |
url | https://www.frontiersin.org/articles/10.3389/fonc.2025.1613818/full |