A Big Data Framework for Scalable and Cross-Dataset Capable Machine Learning in Network Intrusion Detection Systems
Network Intrusion Detection Systems (NIDS) are widely used to secure modern networks, but deploying accurate and scalable Machine Learning (ML)-based detection in high-speed environments remains challenging. Traditional approaches often fail to generalize across different network environments, leadi...
Saved in:
Main Authors: | , , , , |
---|---|
Format: | Article |
Language: | English |
Published: |
IEEE
2025-01-01
|
Series: | IEEE Access |
Subjects: | |
Online Access: | https://ieeexplore.ieee.org/document/11082153/ |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Summary: | Network Intrusion Detection Systems (NIDS) are widely used to secure modern networks, but deploying accurate and scalable Machine Learning (ML)-based detection in high-speed environments remains challenging. Traditional approaches often fail to generalize across different network environments, leading to significant performance degradation in cross-dataset evaluations. Additionally, ensuring near real-time inference while ingesting large volumes of network events requires efficient processing pipelines. In this work, we propose a distributed ensemble-based NIDS designed to improve both accuracy and scalability in large-scale network environments. Our approach leverages a Big Data framework to decouple event ingestion from inference, ensuring high-speed processing without sacrificing detection performance. We implement our system using Apache Spark and Apache Kafka, enabling real-time event ingestion, efficient model inference, and periodic model updates through distributed storage. The ensemble classification scheme enhances generalization capabilities by combining multiple classifiers, reducing accuracy loss in cross-dataset scenarios. Experimental evaluations conducted on three benchmark datasets—UNSW-NB15, CS-CIC-IDS, and BoT-IoT—demonstrate that our proposed approach consistently outperforms traditional techniques. Our model achieves an F-Measure improvement of up to 0.46 in cross-dataset evaluations, addressing the generalization limitations of individual classifiers. Additionally, it achieves near real-time inference throughput comparable to traditional classifiers, processing up to 1.07M events per second with three workers, while our distributed training pipeline scales efficiently, reducing model training time by up to 62% in the same setup. |
---|---|
ISSN: | 2169-3536 |