Enhancing early gestational diabetes mellitus prediction with imputation-based machine learning framework: A comparative study on real-world clinical records
Objective: Gestational diabetes mellitus (GDM) is one of the most common pregnancy complications. Electronic health records (EHRs) promise GDM risk prediction, but missing data poses a challenge to developing reliable and generalizable risk prediction models. This study aims to address the problem of...
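Main Authors: | Leyao Ma, Lin Yang, Yaxin Wang, Jie Hao, Yini Li, Liangkun Ma, Ziyang Wang, Ye Li, Suhan Zhang, Mingyue Hu, Jiao Li, Yin Sun |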
---|---|
Format: | Article |
Language: | English |
Published: | SAGE Publishing, 2025-07-01 |
Series: | Digital Health |
Online Access: | https://doi.org/10.1177/20552076251352436 |
_version_ | 1839609357664256000 |
---|---|
author | Leyao Ma, Lin Yang, Yaxin Wang, Jie Hao, Yini Li, Liangkun Ma, Ziyang Wang, Ye Li, Suhan Zhang, Mingyue Hu, Jiao Li, Yin Sun |
author_sort | Leyao Ma |
collection | DOAJ |
description | Objective: Gestational diabetes mellitus (GDM) is one of the most common pregnancy complications. Electronic health records (EHRs) promise GDM risk prediction, but missing data poses a challenge to developing reliable and generalizable risk prediction models. This study aims to address the problem of missing EHR data in GDM prediction before 12 weeks' gestation. Methods: A total of 5066 women with singleton pregnancies, aged 18 to 50, were included in this retrospective study. The study evaluated 6 imputation methods, each combined with 4 machine learning classification models. The evaluation covered downstream predictive performance, robustness to variable missingness, ability to restore the original data distribution, and influence on feature selection, all based on 10-fold cross-validation. Results: Our findings revealed a significant improvement in model performance with imputation. When using the top 30 features, logistic regression (LR) with multivariate imputation by chained equations using classification and regression trees (MICE) achieved the highest area under the receiver operating characteristic curve, 0.6899, compared to 0.6336 for the LR model without imputation. MICE also led to the best average performance across prediction models and yielded the most accurate restoration of the original data distribution. LR models trained on data imputed by MICE remained the most robust across varying levels of missingness. The classification algorithm primarily accounted for differences in predictive performance. In addition, we identified 18 key features for early GDM prediction in the Chinese population. Conclusion: This study demonstrates the critical role of imputation in improving the performance and fairness of GDM prediction models. The findings provide practical guidance for integrating imputation into clinical machine learning pipelines. |
format | Article |
id | doaj-art-8f5d5c59c9b7423f8a96d4ed6faf95d8 |
institution | Matheson Library |
issn | 2055-2076 |
language | English |
publishDate | 2025-07-01 |
publisher | SAGE Publishing |
record_format | Article |
series | Digital Health |
spelling | doaj-art-8f5d5c59c9b7423f8a96d4ed6faf95d8; 2025-07-30T10:03:40Z; eng; SAGE Publishing; Digital Health; 2055-2076; 2025-07-01; 11; 10.1177/20552076251352436; Enhancing early gestational diabetes mellitus prediction with imputation-based machine learning framework: A comparative study on real-world clinical records; Leyao Ma (0), Lin Yang (1), Yaxin Wang (2), Jie Hao (3), Yini Li (4), Liangkun Ma (5), Ziyang Wang (6), Ye Li (7), Suhan Zhang (8), Mingyue Hu (9), Jiao Li (10), Yin Sun (11); Institute of Medical Information, Chinese Academy of Medical Science & Peking Union Medical College, Beijing, China; Key Laboratory of Medical Information Intelligent Technology, , Beijing, China; The Obstetrics and Gynecology Hospital of Fudan University, Shanghai, China; Institute of Medical Information, Chinese Academy of Medical Science & Peking Union Medical College, Beijing, China; National Clinical Research Center for Obstetric & Gynecologic Diseases, Department of Obstetrics and Gynecology, , , Beijing, China; National Clinical Research Center for Obstetric & Gynecologic Diseases, Department of Obstetrics and Gynecology, , , Beijing, China; Institute of Medical Information, Chinese Academy of Medical Science & Peking Union Medical College, Beijing, China; National Clinical Research Center for Obstetric & Gynecologic Diseases, Department of Obstetrics and Gynecology, , , Beijing, China; National Clinical Research Center for Obstetric & Gynecologic Diseases, Department of Obstetrics and Gynecology, , , Beijing, China; National Clinical Research Center for Obstetric & Gynecologic Diseases, Department of Obstetrics and Gynecology, , , Beijing, China; Key Laboratory of Medical Information Intelligent Technology, , Beijing, China; National Clinical Research Center for Obstetric & Gynecologic Diseases, Department of Obstetrics and Gynecology, , , Beijing, China; https://doi.org/10.1177/20552076251352436 |
title | Enhancing early gestational diabetes mellitus prediction with imputation-based machine learning framework: A comparative study on real-world clinical records |
title_sort | enhancing early gestational diabetes mellitus prediction with imputation based machine learning framework a comparative study on real world clinical records |
url | https://doi.org/10.1177/20552076251352436 |
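
The description field above outlines an evaluation in which missing values are imputed, a classifier produces risk scores, and performance is measured by the area under the receiver operating characteristic curve (AUROC) under cross-validation. The following is a minimal sketch of that evaluation pattern, not the authors' implementation. It assumes column-mean imputation as a simple stand-in for MICE with classification and regression trees, a hand-written additive score in place of a trained classifier, a made-up two-feature toy dataset, 3 folds instead of 10, and AUROC pooled over out-of-fold scores. All function names and values are hypothetical.

```python
from statistics import mean


def auroc(scores, labels):
    """AUROC as the probability that a random positive case scores higher
    than a random negative case (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_column_means(rows):
    """Learn one mean per column from observed (non-None) values only.
    Assumes every column has at least one observed value."""
    return [mean(v for v in col if v is not None) for col in zip(*rows)]


def impute(rows, means):
    """Replace each None with the learned mean of its column."""
    return [[m if v is None else v for v, m in zip(row, means)]
            for row in rows]


def kfold(n, k):
    """Yield (train_idx, test_idx); fold i holds indices i, i+k, i+2k, ..."""
    for i in range(k):
        test = list(range(i, n, k))
        held_out = set(test)
        train = [j for j in range(n) if j not in held_out]
        yield train, test


def cross_validated_auroc(X, y, k, score_fn):
    """Impute inside each fold, score held-out rows, pool the scores."""
    scores = [None] * len(X)
    for train, test in kfold(len(X), k):
        # Fit the imputer on the training fold only, so that held-out
        # rows never influence the values used to fill them in.
        means = fit_column_means([X[j] for j in train])
        for j, row in zip(test, impute([X[j] for j in test], means)):
            scores[j] = score_fn(row)
    return auroc(scores, y)


# Hypothetical toy data: two numeric features, None marks a missing value.
X = [[5.1, 22.0], [None, 30.5], [4.6, None],
     [5.9, 31.0], [4.8, 21.0], [None, 24.0]]
y = [0, 1, 0, 1, 0, 1]

# Hypothetical risk score: the plain sum of the two imputed features.
result = cross_validated_auroc(X, y, k=3, score_fn=lambda row: row[0] + row[1])
print(f"pooled out-of-fold AUROC: {result:.4f}")
```

The design point is that the imputer's parameters are learned on the training fold alone. Fitting them on the full dataset before splitting would leak information from the held-out rows into the evaluation.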