Enhancing early gestational diabetes mellitus prediction with imputation-based machine learning framework: A comparative study on real-world clinical records

Objective Gestational diabetes mellitus (GDM) is one of the most common pregnancy complications. Electronic health records (EHRs) promise GDM risk prediction, but missing data poses a challenge to developing reliable and generalizable risk prediction models. This study aims to address the problem of...

Bibliographic Details
Main Authors: Leyao Ma, Lin Yang, Yaxin Wang, Jie Hao, Yini Li, Liangkun Ma, Ziyang Wang, Ye Li, Suhan Zhang, Mingyue Hu, Jiao Li, Yin Sun
Format: Article
Language: English
Published: SAGE Publishing 2025-07-01
Series: Digital Health
Online Access: https://doi.org/10.1177/20552076251352436
collection DOAJ
description Objective: Gestational diabetes mellitus (GDM) is one of the most common pregnancy complications. Electronic health records (EHRs) promise GDM risk prediction, but missing data poses a challenge to developing reliable and generalizable risk prediction models. This study aims to address the problem of missing EHR data in GDM prediction before 12 weeks gestation.
Methods: A total of 5066 women with singleton pregnancies, aged 18 to 50, were included in this retrospective study. This study evaluated 6 imputation methods, combined with 4 classification machine learning models. The evaluation encompassed downstream predictive performance, robustness to variable missingness, ability to restore original data distribution, and influence on feature selection based on 10-fold cross-validation.
Results: Our findings revealed a significant improvement in model performance with imputation. When using the top 30 features, logistic regression (LR) with multivariate imputation by chained equations using classification and regression trees (mice) achieved the highest area under the receiver operating characteristic curve of 0.6899, compared to 0.6336 for the LR model without imputation. Mice also led to the best average performance across prediction models and yielded the most accurate restoration of the original data distribution. LR models trained on data imputed by mice remained the most robust across varying levels of missingness. The classification algorithm primarily accounted for differences in predictive performance. In addition, we identified 18 key features for early GDM prediction in the Chinese population.
Conclusion: This study demonstrates the critical role of imputation in improving the performance and fairness of GDM prediction models. The findings provide practical guidance for integrating imputation into clinical machine learning pipelines.
id doaj-art-8f5d5c59c9b7423f8a96d4ed6faf95d8
institution Matheson Library
issn 2055-2076
affiliations Leyao Ma: Institute of Medical Information, Chinese Academy of Medical Science & Peking Union Medical College, Beijing, China
Lin Yang: Key Laboratory of Medical Information Intelligent Technology, , Beijing, China
Yaxin Wang: The Obstetrics and Gynecology Hospital of Fudan University, Shanghai, China
Jie Hao: Institute of Medical Information, Chinese Academy of Medical Science & Peking Union Medical College, Beijing, China
Yini Li: National Clinical Research Center for Obstetric & Gynecologic Diseases, Department of Obstetrics and Gynecology, , , Beijing, China
Liangkun Ma: National Clinical Research Center for Obstetric & Gynecologic Diseases, Department of Obstetrics and Gynecology, , , Beijing, China
Ziyang Wang: Institute of Medical Information, Chinese Academy of Medical Science & Peking Union Medical College, Beijing, China
Ye Li: National Clinical Research Center for Obstetric & Gynecologic Diseases, Department of Obstetrics and Gynecology, , , Beijing, China
Suhan Zhang: National Clinical Research Center for Obstetric & Gynecologic Diseases, Department of Obstetrics and Gynecology, , , Beijing, China
Mingyue Hu: National Clinical Research Center for Obstetric & Gynecologic Diseases, Department of Obstetrics and Gynecology, , , Beijing, China
Jiao Li: Key Laboratory of Medical Information Intelligent Technology, , Beijing, China
Yin Sun: National Clinical Research Center for Obstetric & Gynecologic Diseases, Department of Obstetrics and Gynecology, , , Beijing, China
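The evaluation design summarized in the description (an imputation step combined with a classifier, scored by area under the ROC curve under 10-fold cross-validation) can be illustrated with a minimal sketch. This is not the authors' pipeline: mean imputation stands in for mice with classification and regression trees, a hand-rolled gradient-descent logistic regression stands in for the study's LR model, the data are synthetic, and every function name below is hypothetical. The one point it aims to show is that imputation statistics are computed on the training folds only and then applied to the held-out fold.

```python
import math
import random
from statistics import mean


def roc_auc(labels, scores):
    """AUC via the rank-sum identity; tied scores share an average rank."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs both classes in the evaluated fold")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # 1-based average rank
        i = j + 1
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def impute_with_train_means(train_rows, test_rows):
    """Fill None with column means estimated from the training rows only."""
    means = []
    for c in range(len(train_rows[0])):
        observed = [r[c] for r in train_rows if r[c] is not None]
        means.append(mean(observed) if observed else 0.0)

    def fill(rows):
        return [[means[c] if v is None else v for c, v in enumerate(r)]
                for r in rows]

    return fill(train_rows), fill(test_rows)


def fit_logistic(rows, labels, lr=0.1, epochs=200):
    """Batch gradient descent for logistic regression; w[0] is the bias."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            err = 1.0 / (1.0 + math.exp(-z)) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w


def linear_scores(w, rows):
    """Linear scores; monotone in predicted probability, so AUC is unchanged."""
    return [w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) for x in rows]


def cross_validated_auc(rows, labels, k=10, seed=0):
    """Mean AUC over k folds, imputing inside each fold to avoid leakage."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    aucs = []
    for f in range(k):
        test_idx = idx[f::k]
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        train_x, test_x = impute_with_train_means(
            [rows[i] for i in train_idx], [rows[i] for i in test_idx])
        w = fit_logistic(train_x, [labels[i] for i in train_idx])
        aucs.append(roc_auc([labels[i] for i in test_idx],
                            linear_scores(w, test_x)))
    return mean(aucs)


def make_toy_data(n=400, missing_rate=0.2, seed=1):
    """Synthetic two-feature data with values masked completely at random."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        x = [rng.gauss(0, 1), rng.gauss(0, 1)]
        p = 1.0 / (1.0 + math.exp(-(1.5 * x[0] - 1.0 * x[1])))
        labels.append(1 if rng.random() < p else 0)
        rows.append([None if rng.random() < missing_rate else v for v in x])
    return rows, labels


rows, labels = make_toy_data()
auc = cross_validated_auc(rows, labels)
print(f"mean 10-fold AUC on toy data: {auc:.3f}")
```

Swapping `impute_with_train_means` for another imputer while keeping the fold loop fixed is one way to compare imputation methods on equal footing; the figures reported in the description come from the study's own data and methods, not from this toy setup.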