A Big Data Framework for Scalable and Cross-Dataset Capable Machine Learning in Network Intrusion Detection Systems

Bibliographic Details
Main Authors: Vinicius M. de Oliveira, Henrique M. de Oliveira, Gabriel M. Santos, Jhonatan Geremias, Eduardo K. Viegas
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects: Network intrusion detection, machine learning, big data, generalization
Online Access: https://ieeexplore.ieee.org/document/11082153/
author Vinicius M. de Oliveira
Henrique M. de Oliveira
Gabriel M. Santos
Jhonatan Geremias
Eduardo K. Viegas
collection DOAJ
description Network Intrusion Detection Systems (NIDS) are widely used to secure modern networks, but deploying accurate and scalable Machine Learning (ML)-based detection in high-speed environments remains challenging. Traditional approaches often fail to generalize across different network environments, leading to significant performance degradation in cross-dataset evaluations. Additionally, ensuring near real-time inference while ingesting large volumes of network events requires efficient processing pipelines. In this work, we propose a distributed ensemble-based NIDS designed to improve both accuracy and scalability in large-scale network environments. Our approach leverages a Big Data framework to decouple event ingestion from inference, ensuring high-speed processing without sacrificing detection performance. We implement our system using Apache Spark and Apache Kafka, enabling real-time event ingestion, efficient model inference, and periodic model updates through distributed storage. The ensemble classification scheme enhances generalization capabilities by combining multiple classifiers, reducing accuracy loss in cross-dataset scenarios. Experimental evaluations conducted on three benchmark datasets—UNSW-NB15, CS-CIC-IDS, and BoT-IoT—demonstrate that our proposed approach consistently outperforms traditional techniques. Our model achieves an F-Measure improvement of up to 0.46 in cross-dataset evaluations, addressing the generalization limitations of individual classifiers. Additionally, it achieves near real-time inference throughput comparable to traditional classifiers, processing up to 1.07M events per second with three workers, while our distributed training pipeline scales efficiently, reducing model training time by up to 62% in the same setup.
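The description says the ensemble combines multiple classifiers but does not state the combination rule. As a rough illustration only, the sketch below assumes a simple majority vote over three toy threshold classifiers. The feature names (`pkts_per_s`, `duration_s`, `bytes`), the thresholds, and the function names are all hypothetical and are not taken from the article.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical types: an event is a dict of numeric flow features, and a
# classifier maps an event to the label "attack" or "normal".
Event = Dict[str, float]
Classifier = Callable[[Event], str]


def by_packet_rate(event: Event) -> str:
    # Illustrative threshold, not from the article.
    return "attack" if event["pkts_per_s"] > 1000 else "normal"


def by_duration(event: Event) -> str:
    # Illustrative threshold, not from the article.
    return "attack" if event["duration_s"] < 0.01 else "normal"


def by_volume(event: Event) -> str:
    # Illustrative threshold, not from the article.
    return "attack" if event["bytes"] > 1_000_000 else "normal"


def majority_vote(classifiers: List[Classifier], event: Event) -> str:
    """Return the label predicted by most classifiers.

    Assumes an odd number of classifiers and two labels, so no tie occurs.
    """
    votes = Counter(clf(event) for clf in classifiers)
    label, _count = votes.most_common(1)[0]
    return label


ensemble = [by_packet_rate, by_duration, by_volume]

# Two of the three toy classifiers flag this event, so the ensemble does too.
suspicious = {"pkts_per_s": 5000.0, "duration_s": 0.5, "bytes": 2_000_000.0}
# None of the toy classifiers flag this event.
benign = {"pkts_per_s": 10.0, "duration_s": 0.5, "bytes": 100.0}

suspicious_label = majority_vote(ensemble, suspicious)
benign_label = majority_vote(ensemble, benign)
```

A vote of this kind lets one classifier's mistake on an unfamiliar traffic pattern be outvoted by the others, which is one plausible reading of how combining classifiers could reduce accuracy loss across datasets.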
format Article
id doaj-art-dba9fbe7aecf453489ded8068f0e152b
institution Matheson Library
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-dba9fbe7aecf453489ded8068f0e152b
2025-07-25T23:00:52Z
eng
IEEE
IEEE Access
2169-3536
2025-01-01
Volume 13, pages 129419-129431
DOI 10.1109/ACCESS.2025.3589872
IEEE Xplore document 11082153
Vinicius M. de Oliveira
Henrique M. de Oliveira
Gabriel M. Santos
Jhonatan Geremias
Eduardo K. Viegas (https://orcid.org/0000-0002-5050-6363)
Affiliation (all five authors): Graduate Program in Computer Science, Pontifical Catholic University of Paraná, Curitiba, Paraná, Brazil
title A Big Data Framework for Scalable and Cross-Dataset Capable Machine Learning in Network Intrusion Detection Systems
topic Network intrusion detection
machine learning
big data
generalization
url https://ieeexplore.ieee.org/document/11082153/