Interpretable Machine Learning for Serum-Based Metabolomics in Breast Cancer Diagnostics: Insights from Multi-Objective Feature Selection-Driven LightGBM-SHAP Models

Bibliographic Details
Main Authors: Emek Guldogan, Fatma Hilal Yagin, Hasan Ucuzal, Sarah A. Alzakari, Amel Ali Alhussan, Luca Paolo Ardigò
Format: Article
Language: English
Published: MDPI AG 2025-06-01
Series: Medicina
Subjects: breast cancer, metabolomics, explainable AI, LightGBM, SHAP, biomarkers
Online Access: https://www.mdpi.com/1648-9144/61/6/1112
_version_ 1839653395801047040
author Emek Guldogan
Fatma Hilal Yagin
Hasan Ucuzal
Sarah A. Alzakari
Amel Ali Alhussan
Luca Paolo Ardigò
author_sort Emek Guldogan
collection DOAJ
description <i>Background and Objectives:</i> Breast cancer accounts for 12.5% of all new cancer cases in women worldwide. Early detection significantly improves survival rates, but traditional biomarkers like CA 15-3 and HER2 lack sensitivity and specificity, particularly for early-stage disease. Advances in metabolomics and machine learning, particularly explainable artificial intelligence (XAI), offer new opportunities for identifying robust biomarkers and improving diagnostic accuracy. This study aimed to identify and validate serum-based metabolic biomarkers for breast cancer using advanced metabolomic profiling techniques and a Light Gradient Boosting Machine (LightGBM) model. Additionally, SHapley Additive exPlanations (SHAP) were applied to enhance model interpretability and biological insight. <i>Materials and Methods:</i> The study included 103 breast cancer patients and 31 healthy controls. Serum samples underwent liquid and gas chromatography–time-of-flight mass spectrometry (LC-TOFMS and GC-TOFMS). Mutual Information (MI), Sparse Partial Least Squares (sPLS), Boruta, and Multi-Objective Feature Selection (MOFS) approaches were applied to the data for biomarker discovery. LightGBM, AdaBoost, and Random Forest were employed for classification, and class imbalance was addressed with the Synthetic Minority Oversampling Technique (SMOTE). SHAP analysis ranked metabolites based on their contribution to model predictions. <i>Results:</i> Compared to other feature selection approaches, the MOFS approach was more robust in terms of predictive performance, and metabolites identified by this method were used in subsequent analyses for biomarker discovery. LightGBM outperformed the AdaBoost and Random Forest models, achieving 86.6% accuracy, 89.1% sensitivity, 84.2% specificity, and an F1-score of 87.0%. SHAP analysis identified 2-aminobutyric acid, choline, and coproporphyrin as the most influential metabolites, with dysregulation of these markers associated with breast cancer risk. <i>Conclusions:</i> This study is among the first to integrate SHAP explainability with metabolomic profiling, bridging computational predictions and biological insights for improved clinical adoption. This study demonstrates the effectiveness of combining metabolomics with XAI-driven machine learning for breast cancer diagnostics. The identified biomarkers not only improve diagnostic accuracy but also reveal critical metabolic dysregulations associated with disease progression.
format Article
id doaj-art-b2fe86a504ea486aa148a2bcbea1a5a7
institution Matheson Library
issn 1010-660X
1648-9144
language English
publishDate 2025-06-01
publisher MDPI AG
record_format Article
series Medicina
spelling doaj-art-b2fe86a504ea486aa148a2bcbea1a5a7
2025-06-25T14:09:55Z
eng
MDPI AG
Medicina
1010-660X
1648-9144
2025-06-01
61
6
1112
10.3390/medicina61061112
Interpretable Machine Learning for Serum-Based Metabolomics in Breast Cancer Diagnostics: Insights from Multi-Objective Feature Selection-Driven LightGBM-SHAP Models
Emek Guldogan [0]
Fatma Hilal Yagin [1]
Hasan Ucuzal [2]
Sarah A. Alzakari [3]
Amel Ali Alhussan [4]
Luca Paolo Ardigò [5]
[0] Department of Biostatistics, and Medical Informatics, Faculty of Medicine, Inonu University, 44280 Malatya, Turkey
[1] Department of Biostatistics, Faculty of Medicine, Malatya Turgut Ozal University, 44210 Malatya, Turkey
[2] Department of Biostatistics, and Medical Informatics, Faculty of Medicine, Inonu University, 44280 Malatya, Turkey
[3] Department of Computer Sciences, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
[4] Department of Computer Sciences, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
[5] Department of Teacher Education, NLA University College, Linstows Gate 3, 0166 Oslo, Norway
<i>Background and Objectives:</i> Breast cancer accounts for 12.5% of all new cancer cases in women worldwide. Early detection significantly improves survival rates, but traditional biomarkers like CA 15-3 and HER2 lack sensitivity and specificity, particularly for early-stage disease. Advances in metabolomics and machine learning, particularly explainable artificial intelligence (XAI), offer new opportunities for identifying robust biomarkers and improving diagnostic accuracy. This study aimed to identify and validate serum-based metabolic biomarkers for breast cancer using advanced metabolomic profiling techniques and a Light Gradient Boosting Machine (LightGBM) model. Additionally, SHapley Additive exPlanations (SHAP) were applied to enhance model interpretability and biological insight.
<i>Materials and Methods:</i> The study included 103 breast cancer patients and 31 healthy controls. Serum samples underwent liquid and gas chromatography–time-of-flight mass spectrometry (LC-TOFMS and GC-TOFMS). Mutual Information (MI), Sparse Partial Least Squares (sPLS), Boruta, and Multi-Objective Feature Selection (MOFS) approaches were applied to the data for biomarker discovery. LightGBM, AdaBoost, and Random Forest were employed for classification, and class imbalance was addressed with the Synthetic Minority Oversampling Technique (SMOTE). SHAP analysis ranked metabolites based on their contribution to model predictions. <i>Results:</i> Compared to other feature selection approaches, the MOFS approach was more robust in terms of predictive performance, and metabolites identified by this method were used in subsequent analyses for biomarker discovery. LightGBM outperformed the AdaBoost and Random Forest models, achieving 86.6% accuracy, 89.1% sensitivity, 84.2% specificity, and an F1-score of 87.0%. SHAP analysis identified 2-aminobutyric acid, choline, and coproporphyrin as the most influential metabolites, with dysregulation of these markers associated with breast cancer risk. <i>Conclusions:</i> This study is among the first to integrate SHAP explainability with metabolomic profiling, bridging computational predictions and biological insights for improved clinical adoption. This study demonstrates the effectiveness of combining metabolomics with XAI-driven machine learning for breast cancer diagnostics. The identified biomarkers not only improve diagnostic accuracy but also reveal critical metabolic dysregulations associated with disease progression.
https://www.mdpi.com/1648-9144/61/6/1112
breast cancer
metabolomics
explainable AI
LightGBM
SHAP
biomarkers
title Interpretable Machine Learning for Serum-Based Metabolomics in Breast Cancer Diagnostics: Insights from Multi-Objective Feature Selection-Driven LightGBM-SHAP Models
title_sort interpretable machine learning for serum based metabolomics in breast cancer diagnostics insights from multi objective feature selection driven lightgbm shap models
topic breast cancer
metabolomics
explainable AI
LightGBM
SHAP
biomarkers
url https://www.mdpi.com/1648-9144/61/6/1112