A Big Data Framework for Scalable and Cross-Dataset Capable Machine Learning in Network Intrusion Detection Systems
Network Intrusion Detection Systems (NIDS) are widely used to secure modern networks, but deploying accurate and scalable Machine Learning (ML)-based detection in high-speed environments remains challenging. Traditional approaches often fail to generalize across different network environments, leadi...
Main Authors: | Vinicius M. de Oliveira, Henrique M. de Oliveira, Gabriel M. Santos, Jhonatan Geremias, Eduardo K. Viegas |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2025-01-01 |
Series: | IEEE Access |
Subjects: | Network intrusion detection; machine learning; big data; generalization |
Online Access: | https://ieeexplore.ieee.org/document/11082153/ |
_version_ | 1839614762622648320 |
---|---|
author | Vinicius M. de Oliveira; Henrique M. de Oliveira; Gabriel M. Santos; Jhonatan Geremias; Eduardo K. Viegas |
author_facet | Vinicius M. de Oliveira; Henrique M. de Oliveira; Gabriel M. Santos; Jhonatan Geremias; Eduardo K. Viegas |
author_sort | Vinicius M. de Oliveira |
collection | DOAJ |
description | Network Intrusion Detection Systems (NIDS) are widely used to secure modern networks, but deploying accurate and scalable Machine Learning (ML)-based detection in high-speed environments remains challenging. Traditional approaches often fail to generalize across different network environments, leading to significant performance degradation in cross-dataset evaluations. Additionally, ensuring near real-time inference while ingesting large volumes of network events requires efficient processing pipelines. In this work, we propose a distributed ensemble-based NIDS designed to improve both accuracy and scalability in large-scale network environments. Our approach leverages a Big Data framework to decouple event ingestion from inference, ensuring high-speed processing without sacrificing detection performance. We implement our system using Apache Spark and Apache Kafka, enabling real-time event ingestion, efficient model inference, and periodic model updates through distributed storage. The ensemble classification scheme enhances generalization capabilities by combining multiple classifiers, reducing accuracy loss in cross-dataset scenarios. Experimental evaluations conducted on three benchmark datasets—UNSW-NB15, CS-CIC-IDS, and BoT-IoT—demonstrate that our proposed approach consistently outperforms traditional techniques. Our model achieves an F-Measure improvement of up to 0.46 in cross-dataset evaluations, addressing the generalization limitations of individual classifiers. Additionally, it achieves near real-time inference throughput comparable to traditional classifiers, processing up to 1.07M events per second with three workers, while our distributed training pipeline scales efficiently, reducing model training time by up to 62% in the same setup. |
format | Article |
id | doaj-art-dba9fbe7aecf453489ded8068f0e152b |
institution | Matheson Library |
issn | 2169-3536 |
language | English |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj-art-dba9fbe7aecf453489ded8068f0e152b; 2025-07-25T23:00:52Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2025-01-01; vol. 13, pp. 129419-129431; DOI 10.1109/ACCESS.2025.3589872; document 11082153; A Big Data Framework for Scalable and Cross-Dataset Capable Machine Learning in Network Intrusion Detection Systems; Vinicius M. de Oliveira; Henrique M. de Oliveira; Gabriel M. Santos; Jhonatan Geremias; Eduardo K. Viegas (https://orcid.org/0000-0002-5050-6363); all authors: Graduate Program in Computer Science, Pontifical Catholic University of Paraná, Curitiba, Paraná, Brazil; https://ieeexplore.ieee.org/document/11082153/; Network intrusion detection; machine learning; big data; generalization |
title | A Big Data Framework for Scalable and Cross-Dataset Capable Machine Learning in Network Intrusion Detection Systems |
title_sort | big data framework for scalable and cross dataset capable machine learning in network intrusion detection systems |
topic | Network intrusion detection; machine learning; big data; generalization |
url | https://ieeexplore.ieee.org/document/11082153/ |
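
The description in this record says the ensemble combines multiple classifiers and is evaluated by F-Measure, but the record does not say how the votes are combined. The sketch below assumes simple majority voting, which may differ from the authors' scheme. The function names `majority_vote` and `f_measure` and the toy labels are hypothetical, chosen only to illustrate the idea; they are not results from the article.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per event.

    Assumption: plain majority voting; the article's actual
    combination rule is not given in this record.
    """
    combined = []
    for votes in zip(*predictions):
        # most_common(1) returns the label with the highest vote count
        label, _ = Counter(votes).most_common(1)[0]
        combined.append(label)
    return combined


def f_measure(y_true, y_pred, positive=1):
    """F-Measure (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (1 = attack, 0 = normal); made up for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
clf_a = [1, 0, 0, 1, 0, 1]
clf_b = [1, 1, 1, 0, 0, 0]
clf_c = [0, 0, 1, 1, 0, 0]

ensemble = majority_vote([clf_a, clf_b, clf_c])
print("ensemble labels:", ensemble)
print("F-Measure, classifier A:", f_measure(y_true, clf_a))
print("F-Measure, ensemble:", f_measure(y_true, ensemble))
```

In this toy case each classifier makes errors on different events, so the vote cancels them out. Whether that holds in practice depends on how correlated the individual classifiers' errors are.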