Boost-Classifier-Driven Fault Prediction Across Heterogeneous Open-Source Repositories

Ensuring reliability, availability, and security in modern software systems hinges on early fault detection, yet predicting which parts of a codebase are most at risk remains a significant challenge. In this paper, we analyze 2.4 million commits drawn from 33 heterogeneous open-source projects, span...

Bibliographic Details
Main Authors:	Philip König, Sebastian Raubitzek, Alexander Schatten, Dennis Toth, Fabian Obermann, Caroline König, Kevin Mallinger
Format:	Article
Language:	English
Published:	MDPI AG 2025-07-01
Series:	Big Data and Cognitive Computing
Subjects:	fault prediction; machine learning; CatBoost; code metrics; feature importance; artificial intelligence
Online Access:	https://www.mdpi.com/2504-2289/9/7/174

author	Philip König, Sebastian Raubitzek, Alexander Schatten, Dennis Toth, Fabian Obermann, Caroline König, Kevin Mallinger
author_sort	Philip König
collection	DOAJ
description	Ensuring reliability, availability, and security in modern software systems hinges on early fault detection, yet predicting which parts of a codebase are most at risk remains a significant challenge. In this paper, we analyze 2.4 million commits drawn from 33 heterogeneous open-source projects, spanning healthcare, security tools, data processing, and more. By examining each repository per file and per commit, we derive process metrics (e.g., churn, file age, revision frequency) alongside size metrics and entropy-based indicators of how scattered changes are over time. We train and tune a gradient boosting model to classify bug-prone commits under realistic class-imbalance conditions, achieving robust predictive performance across diverse repositories. Moreover, a comprehensive feature-importance analysis shows that files with long lifespans (high age), frequent edits (revision count), and widely scattered changes (entropy metrics) are especially vulnerable to defects. These insights can help practitioners and researchers prioritize testing and tailor maintenance strategies, ultimately strengthening software dependability.
format	Article
id	doaj-art-7c2ae9f91a3d4dba96aff484f33f83e6
institution	Matheson Library
issn	2504-2289
language	English
publishDate	2025-07-01
publisher	MDPI AG
record_format	Article
series	Big Data and Cognitive Computing
spelling	doaj-art-7c2ae9f91a3d4dba96aff484f33f83e6 | 2025-07-25T13:14:08Z | eng | MDPI AG | Big Data and Cognitive Computing | 2504-2289 | 2025-07-01 | 9 | 7 | 174 | 10.3390/bdcc9070174 | Boost-Classifier-Driven Fault Prediction Across Heterogeneous Open-Source Repositories | Philip König (0); Sebastian Raubitzek (1); Alexander Schatten (2); Dennis Toth (3); Fabian Obermann (4); Caroline König (5); Kevin Mallinger (6) | SBA Research gGmbH, Floragasse 7/5.OG, 1040 Vienna, Austria; SBA Research gGmbH, Floragasse 7/5.OG, 1040 Vienna, Austria; Institute of Information Systems Engineering, TU Wien, Favoritenstrasse 9–11/194, 1040 Vienna, Austria; SBA Research gGmbH, Floragasse 7/5.OG, 1040 Vienna, Austria; SBA Research gGmbH, Floragasse 7/5.OG, 1040 Vienna, Austria; Christian Doppler Laboratory for Assurance and Transparency in Software Protection, Research Group Security & Privacy, Faculty of Computer Science, University of Vienna, Kolingasse 14–16, 1040 Vienna, Austria; SBA Research gGmbH, Floragasse 7/5.OG, 1040 Vienna, Austria | Ensuring reliability, availability, and security in modern software systems hinges on early fault detection, yet predicting which parts of a codebase are most at risk remains a significant challenge. In this paper, we analyze 2.4 million commits drawn from 33 heterogeneous open-source projects, spanning healthcare, security tools, data processing, and more. By examining each repository per file and per commit, we derive process metrics (e.g., churn, file age, revision frequency) alongside size metrics and entropy-based indicators of how scattered changes are over time. We train and tune a gradient boosting model to classify bug-prone commits under realistic class-imbalance conditions, achieving robust predictive performance across diverse repositories. Moreover, a comprehensive feature-importance analysis shows that files with long lifespans (high age), frequent edits (revision count), and widely scattered changes (entropy metrics) are especially vulnerable to defects. These insights can help practitioners and researchers prioritize testing and tailor maintenance strategies, ultimately strengthening software dependability. | https://www.mdpi.com/2504-2289/9/7/174 | fault prediction; machine learning; CatBoost; code metrics; feature importance; artificial intelligence
title	Boost-Classifier-Driven Fault Prediction Across Heterogeneous Open-Source Repositories
topic	fault prediction; machine learning; CatBoost; code metrics; feature importance; artificial intelligence
url	https://www.mdpi.com/2504-2289/9/7/174
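
The description mentions entropy-based indicators of how scattered a file's changes are over time. This record does not give the paper's exact definition, so the following is only a minimal sketch of one plausible formulation: normalized Shannon entropy over a file's change counts per time window. The function name `change_entropy` and the windowing of changes into a list of counts are assumptions made for illustration, not the authors' implementation.

```python
import math


def change_entropy(changes_per_window):
    """Normalized Shannon entropy of change counts across time windows.

    Hypothetical helper (not from the paper): returns 0.0 when all changes
    fall into a single window and 1.0 when they are spread evenly.
    """
    total = sum(changes_per_window)
    n_windows = len(changes_per_window)
    if total == 0 or n_windows < 2:
        return 0.0  # no changes, or nothing to spread them across

    entropy = 0.0
    for count in changes_per_window:
        if count > 0:
            p = count / total
            entropy += p * math.log2(1.0 / p)  # Shannon term, always >= 0

    # Divide by the maximum possible entropy so files with different
    # numbers of windows are comparable.
    return entropy / math.log2(n_windows)


if __name__ == "__main__":
    print(change_entropy([10, 0, 0, 0]))  # all changes in one window
    print(change_entropy([4, 4, 4, 4]))   # changes spread evenly
    print(change_entropy([8, 1, 1, 0]))   # somewhere in between
```

Under this assumed definition, a file edited in one burst scores near 0 and a file edited steadily across its lifetime scores near 1; such a value could sit next to churn, file age, and revision count as one input feature for a gradient boosting classifier.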


Similar Items