COMPARISON OF DATASET OVERSAMPLING ALGORITHMS AND THEIR APPLICABILITY TO THE CATEGORIZATION PROBLEM

Bibliographic Details
Main Authors: Denys Teslenko, Anna Sorokina, Artem Khovrat, Nural Huliiev, Valentyna Kyriy
Format: Article
Language: English
Published: Kharkiv National University of Radio Electronics 2023-08-01
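Series: Сучасний стан наукових досліджень та технологій в промисловості
Subjects: categorization; machine learning; methods of balancing; data generation methods; dataset; unbalanced datasets
Online Access: https://itssi-journal.com/index.php/ittsi/article/view/399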
author Denys Teslenko
Anna Sorokina
Artem Khovrat
Nural Huliiev
Valentyna Kyriy
collection DOAJ
description The subject of the research is the classification problem in machine learning when the classes in a dataset are imbalanced. The purpose of the work is to analyze existing solutions and algorithms for handling imbalance in datasets of different types and from different industries, and to compare the algorithms experimentally. The article addresses the following tasks: to analyze approaches to the problem (preprocessing methods, learning methods, hybrid methods, and algorithmic approaches); to define and describe the oversampling algorithms most often used to balance datasets; to select classification algorithms that serve as a tool for assessing balancing quality by checking the applicability of the datasets obtained after oversampling; to determine the metrics used to compare classification quality; and to conduct experiments according to the proposed methodology. For clarity, datasets with varying degrees of imbalance were considered (the number of minority-class instances was 15, 30, 45, and 60% of the number of majority-class instances). The following methods were used: analytical and inductive methods to determine the necessary set of experiments and to form hypotheses about their results; experimental and graphical methods to obtain a visual comparison of the selected algorithms. The following results were obtained: using quality metrics, all algorithms were evaluated experimentally on two different datasets, the Titanic passenger dataset and a dataset for detecting fraudulent transactions in bank accounts. The results indicated that SMOTE and SVM SMOTE were the most applicable and that Borderline SMOTE and k-means SMOTE performed worst; the results of each algorithm and its potential uses are also described.
Conclusions: applying analytical and experimental methods provided a comprehensive comparative description of the existing balancing algorithms. The superiority of oversampling algorithms over undersampling algorithms was demonstrated. The selected algorithms were compared using different classification algorithms. The results were presented in graphs and tables and summarized in heat maps. The conclusions can be used when choosing the optimal balancing algorithm in machine learning.
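The imbalance levels and the oversampling step mentioned in the description can be illustrated with a minimal sketch. It is not the authors' code or data: the function name `smote_like`, the synthetic point pool, the majority size of 100, and the neighbour count `k=5` are all assumptions made for illustration, and the interpolation is only a simplified SMOTE-style scheme (a synthetic point is placed on the segment between a minority point and one of its nearest minority neighbours).

```python
import random


def smote_like(minority, n_new, k=5, rng=None):
    """Create n_new synthetic points by SMOTE-style interpolation (illustrative only)."""
    rng = rng or random.Random(0)
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` by squared Euclidean distance
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


# Hypothetical data: 100 majority instances and a pool of minority points.
rng = random.Random(42)
majority_n = 100
pool = [(rng.gauss(2.0, 0.5), rng.gauss(2.0, 0.5)) for _ in range(majority_n)]

balanced_sizes = {}
for percent in (15, 30, 45, 60):  # minority size as a percentage of the majority
    minority = pool[: majority_n * percent // 100]
    synthetic = smote_like(minority, majority_n - len(minority), rng=rng)
    balanced_sizes[percent] = len(minority) + len(synthetic)
```

In this sketch each imbalanced subset is oversampled until the minority class matches the majority class in size; a classifier would then be trained on the balanced data and scored with the chosen quality metrics.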
format Article
id doaj-art-bfa6df24a9d843b7bcc7c72ed2c4444f
institution Matheson Library
issn 2522-9818
2524-2296
language English
publishDate 2023-08-01
publisher Kharkiv National University of Radio Electronics
record_format Article
series Сучасний стан наукових досліджень та технологій в промисловості
spelling doaj-art-bfa6df24a9d843b7bcc7c72ed2c4444f; 2025-08-02T18:23:03Z; eng; Kharkiv National University of Radio Electronics; Сучасний стан наукових досліджень та технологій в промисловості; ISSN 2522-9818, 2524-2296; 2023-08-01; 2(24); DOI 10.30837/ITSSI.2023.24.161; Denys Teslenko, Anna Sorokina, Artem Khovrat, Nural Huliiev, Valentyna Kyriy (all Kharkiv National University of Radio Electronics); https://itssi-journal.com/index.php/ittsi/article/view/399
title COMPARISON OF DATASET OVERSAMPLING ALGORITHMS AND THEIR APPLICABILITY TO THE CATEGORIZATION PROBLEM
topic categorization; machine learning; methods of balancing; data generation methods; dataset; unbalanced datasets.
url https://itssi-journal.com/index.php/ittsi/article/view/399