Application of SMOTE-ENN Method in Data Balancing for Classification of Diabetes Health Indicators with C4.5 Algorithm

Bibliographic Details
Main Authors: Bakti Putra Pamungkas, Muhammad Jauhar Vikri, Ita Aristia Sa'ida
Format: Article
Language: English
Published: LPPM ISB Atma Luhur 2025-05-01
Series: Jurnal Sisfokom
Online Access: https://jurnal.atmaluhur.ac.id/index.php/sisfokom/article/view/2350
Description
Summary: Class imbalance in health datasets often degrades the performance of classification models, especially in detecting minority classes such as diabetics. This study evaluates the effect of the SMOTE-ENN method on the performance of the C4.5 algorithm in classifying diabetes health indicators. The dataset is the 2021 Diabetes Binary Health Indicators BRFSS dataset from Kaggle, which contains 236,378 respondent records with an imbalanced class distribution: 85.80% non-diabetic and 14.20% diabetic. SMOTE was used to add synthetic samples to the minority class, while ENN was applied to remove samples considered noise. After balancing, the C4.5 algorithm was used for classification. Evaluation used accuracy, precision, recall, and F1-score. Applying SMOTE-ENN improved accuracy from 79.49% to 80.33% and precision from 29% to 30%. Although recall did not increase, the method improved the overall stability of the predictions, especially in how accurately the positive class is classified. The novelty of this research lies in applying SMOTE-ENN together with the C4.5 algorithm to a large-scale health dataset, a combination that has not been widely explored. Further exploration of other balancing techniques and algorithms is still needed to obtain better classification results on imbalanced data.
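
The two balancing steps named in the summary can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the toy 2-D data, the function names (`smote`, `enn`, `_nearest`), the neighbor counts (`k=5` for SMOTE, `k=3` for ENN), and the majority-vote rule used for ENN are all assumptions chosen for illustration. Only the roughly 86/14 class split mirrors the distribution reported above. The C4.5 classification step is not included.

```python
import random
from collections import Counter


def _dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _nearest(point, pool, k, skip_index=None):
    """Indices of the k points in pool closest to point (excluding skip_index)."""
    order = sorted(
        (i for i in range(len(pool)) if i != skip_index),
        key=lambda i: _dist2(point, pool[i]),
    )
    return order[:k]


def smote(X, y, minority, k=5, seed=0):
    """Add synthetic minority samples until both classes have equal size.

    Each synthetic point lies on the segment between a minority sample
    and one of its k nearest minority neighbors.
    """
    rng = random.Random(seed)
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    n_needed = (len(X) - len(minority_pts)) - len(minority_pts)
    X_new, y_new = list(X), list(y)
    for _ in range(max(n_needed, 0)):
        i = rng.randrange(len(minority_pts))
        base = minority_pts[i]
        neighbor = minority_pts[rng.choice(_nearest(base, minority_pts, k, skip_index=i))]
        gap = rng.random()  # interpolation factor in [0, 1)
        X_new.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
        y_new.append(minority)
    return X_new, y_new


def enn(X, y, k=3):
    """Drop every sample whose label differs from the majority vote
    of its k nearest neighbors (assumed variant of the editing rule)."""
    keep = []
    for i, (x, label) in enumerate(zip(X, y)):
        votes = Counter(y[j] for j in _nearest(x, X, k, skip_index=i))
        if votes.most_common(1)[0][0] == label:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (hypothetical): 86 majority points and 14 minority points.
rng = random.Random(42)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(86)]
X += [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(14)]
y = [0] * 86 + [1] * 14

X_bal, y_bal = smote(X, y, minority=1)   # oversample the minority class
X_clean, y_clean = enn(X_bal, y_bal)     # then remove likely-noisy samples

print("original:   ", Counter(y))
print("after SMOTE:", Counter(y_bal))
print("after ENN:  ", Counter(y_clean))
```

Resampling of this kind is normally applied to the training split only, so that the test set keeps the original class distribution. The brute-force neighbor search here is quadratic in the number of samples and would need an indexed search to handle a dataset of the size described above.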
ISSN: 2301-7988, 2581-0588