Population-based colorectal cancer risk prediction using a SHAP-enhanced LightGBM model

Background: Colorectal cancer (CRC) is one of the most common cancers worldwide, and early detection and risk stratification play a critical role in reducing both incidence and mortality. We aimed to develop and validate a machine learning (ML) model using clinical data to improve CRC identification and...

Bibliographic Details
Main Authors: Guinian Du, Hui Lv, Yishan Liang, Jingyue Zhang, Qiaoling Huang, Guiming Xie, Xian Wu, Hao Zeng, Lijuan Wu, Jianbo Ye, Wentan Xie, Xia Li, Yifan Sun
Format: Article
Language: English
Published: Frontiers Media S.A. 2025-07-01
Series: Frontiers in Oncology
Subjects:
Online Access: https://www.frontiersin.org/articles/10.3389/fonc.2025.1575844/full
_version_ 1839626738829623296
author Guinian Du
Hui Lv
Yishan Liang
Jingyue Zhang
Qiaoling Huang
Guiming Xie
Xian Wu
Hao Zeng
Lijuan Wu
Jianbo Ye
Wentan Xie
Xia Li
Yifan Sun
collection DOAJ
description Background: Colorectal cancer (CRC) is one of the most common cancers worldwide, and early detection and risk stratification play a critical role in reducing both incidence and mortality. We aimed to develop and validate a machine learning (ML) model using clinical data to improve CRC identification and prognostic evaluation.
Methods: We analyzed multicenter datasets comprising 676 CRC patients and 410 controls from Guigang City People’s Hospital (2020-2024) for model training and internal validation, with 463 patients from Laibin City People’s Hospital for external validation. Seven ML algorithms were compared, and Light Gradient Boosting Machine (LightGBM) was selected as the best-performing framework. Model performance was assessed using the area under the receiver operating characteristic curve (AUROC), calibration curves, Brier scores, and decision curve analysis. SHAP (SHapley Additive exPlanations) was used for feature interpretation.
Results: The LightGBM model achieved AUROCs of 0.9931 (95% CI: 0.9883-0.998) in the training cohort and 0.9429 (95% CI: 0.9176-0.9682) in external validation. Calibration curves showed strong agreement between predicted and observed outcomes (Brier score = 0.139). SHAP analysis identified 13 key predictors, with age (mean SHAP value = 0.216) and CA19-9 (mean SHAP value = 0.198) as the dominant contributors. Other significant variables included hematological parameters (WBC, RBC, HGB, PLT), biochemical markers (ALT, TP, ALB, UREA, uric acid), and gender. A web-based risk calculator was developed for real-time probability estimation.
Conclusions: Our LightGBM-based model achieves high predictive accuracy while maintaining clinical interpretability, bridging the gap between complex ML systems and practical clinical decision-making. The identified biomarker panel provides biological insights into CRC pathogenesis. This tool shows significant potential for optimizing early diagnosis and personalized risk assessment in CRC management.
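The abstract above reports discrimination as AUROC and calibration as a Brier score. As a rough illustration of what those two numbers measure, here is a minimal pure-Python sketch. The labels and probabilities are made-up toy values, not data from the study, and the function names `auroc` and `brier_score` are hypothetical helpers, not the authors' code.

```python
from typing import Sequence


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and 0/1 outcome."""
    if len(y_true) != len(y_prob) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def auroc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Probability that a random case scores higher than a random control.

    Computed by comparing every case/control pair; ties count as half.
    """
    cases = [p for y, p in zip(y_true, y_prob) if y == 1]
    controls = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for pc in cases:
        for pn in controls:
            if pc > pn:
                wins += 1.0
            elif pc == pn:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy example (assumed values): 1 = CRC case, 0 = control.
y_true = [1, 1, 1, 0, 0]
y_prob = [0.9, 0.8, 0.4, 0.5, 0.1]

print("AUROC:", round(auroc(y_true, y_prob), 4))
print("Brier score:", round(brier_score(y_true, y_prob), 4))
```

AUROC depends only on how cases and controls are ranked relative to each other, while the Brier score also penalizes probabilities that are far from the observed outcome, which is why the two are usually reported together.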
format Article
id doaj-art-7d1a1111f52b4ed19d74f70b22c43c0c
institution Matheson Library
issn 2234-943X
language English
publishDate 2025-07-01
publisher Frontiers Media S.A.
record_format Article
series Frontiers in Oncology
spelling doaj-art-7d1a1111f52b4ed19d74f70b22c43c0c
2025-07-17T04:10:20Z
eng
Frontiers Media S.A.
Frontiers in Oncology
2234-943X
2025-07-01
15
10.3389/fonc.2025.1575844
1575844
Population-based colorectal cancer risk prediction using a SHAP-enhanced LightGBM model
Guinian Du (0), Department of Laboratory Medicine, Eighth Affiliated Hospital of Guangxi Medical University, Guigang City People’s Hospital, Guigang, Guangxi, China
Hui Lv (1), Department of Laboratory Medicine, Eighth Affiliated Hospital of Guangxi Medical University, Guigang City People’s Hospital, Guigang, Guangxi, China
Yishan Liang (2), Department of Laboratory Medicine, Eighth Affiliated Hospital of Guangxi Medical University, Guigang City People’s Hospital, Guigang, Guangxi, China
Jingyue Zhang (3), Department of Laboratory Medicine, Eighth Affiliated Hospital of Guangxi Medical University, Guigang City People’s Hospital, Guigang, Guangxi, China
Qiaoling Huang (4), Department of Laboratory Medicine, The People Hospital of Laibin, Laibin, Guangxi, China
Guiming Xie (5), Department of Laboratory Medicine, Eighth Affiliated Hospital of Guangxi Medical University, Guigang City People’s Hospital, Guigang, Guangxi, China
Xian Wu (6), Department of Laboratory Medicine, Eighth Affiliated Hospital of Guangxi Medical University, Guigang City People’s Hospital, Guigang, Guangxi, China
Hao Zeng (7), Department of Laboratory Medicine, Eighth Affiliated Hospital of Guangxi Medical University, Guigang City People’s Hospital, Guigang, Guangxi, China
Lijuan Wu (8), Department of Laboratory Medicine, Eighth Affiliated Hospital of Guangxi Medical University, Guigang City People’s Hospital, Guigang, Guangxi, China
Jianbo Ye (9), Department of Endocrinology, The People Hospital of Guigang, Guigang, Guangxi, China
Wentan Xie (10), Department of Laboratory Medicine, Eighth Affiliated Hospital of Guangxi Medical University, Guigang City People’s Hospital, Guigang, Guangxi, China
Xia Li (11), The Office of Administration, Liuzhou Municipal Liutie Central Hospital, Liuzhou, Guangxi, China
Yifan Sun (12), Department of Laboratory Medicine, Eighth Affiliated Hospital of Guangxi Medical University, Guigang City People’s Hospital, Guigang, Guangxi, China
https://www.frontiersin.org/articles/10.3389/fonc.2025.1575844/full
Colorectal cancer
Risk prediction
Machine learning
LightGBM model
Early diagnosis
title Population-based colorectal cancer risk prediction using a SHAP-enhanced LightGBM model
topic Colorectal cancer
Risk prediction
Machine learning
LightGBM model
Early diagnosis
url https://www.frontiersin.org/articles/10.3389/fonc.2025.1575844/full