OPTISTACK: A Hybrid Ensemble Learning and XAI-Based Approach for Malware Detection in Compressed Files
Main Authors: | , , , |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2025-01-01 |
Series: | IEEE Access |
Subjects: | |
Online Access: | https://ieeexplore.ieee.org/document/11036813/ |
Summary: | The increasing reliance on compressed file formats for data storage and transmission has made them attractive vectors for malware propagation, as their structural complexity enables evasion of conventional detection mechanisms. Although entropy-based analysis has been widely applied in executable malware detection, its application to compressed file formats remains underexplored. Moreover, existing approaches are predominantly limited to Shannon entropy, failing to exploit the discriminative power of higher-order statistical metrics. Additionally, standalone machine learning models often suffer from limited generalizability and lack interpretability, hindering their real-world deployment in security-critical systems. To address these challenges, we propose OPTISTACK, a novel stacking ensemble framework that integrates Random Forest (RF), Decision Tree (DT), and XGBoost (XGB) as base learners with a Logistic Regression (LR) meta-classifier. Our model leverages an advanced entropy-based feature space, including Rényi entropy (with $\alpha = 2, 4, 6$), mean entropy, and quartile-based entropy (25th and 75th percentiles), to capture fine-grained statistical variations in compressed data. To the best of our knowledge, this is the first study to integrate higher-order entropy metrics and distributional entropy features into a stacking ensemble model for malware detection in compressed files. Extensive evaluation on the NapierOne dataset across six prevalent compression formats (ZIP, 7ZIP, GZIP [GNU Zip], RAR [Roshal Archive], TAR [Tape Archive], and ZLIB) demonstrates that OPTISTACK significantly outperforms traditional models, achieving 99.45% accuracy, 99.62% F1-score, 98.80% MCC, and 94.11% AUC-ROC. Our PDP-ICE analysis reveals that minor variations in the 25th- and 75th-percentile entropy values lead to substantial shifts in classification probabilities, underscoring their critical role in model sensitivity and robustness. SHAP-based interpretability analysis further identifies the 25th-percentile entropy as the most influential feature across all models. Additionally, we introduce an entropy network graph-based vulnerability analysis that reveals ZIP and RAR as the most malware-prone formats. By combining stacking ensemble learning, advanced entropy metrics, and Explainable AI (XAI) techniques, OPTISTACK delivers a robust, interpretable, and generalizable framework for detecting malware in compressed file environments, addressing key limitations in existing cybersecurity methodologies. |
---|---|
ISSN: | 2169-3536 |
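
The entropy features named in the summary can be illustrated with a small sketch. It uses the standard definition of Rényi entropy of order α, H_α = log2(Σ p_i^α) / (1 − α), over the byte-value distribution of a file. This is not the authors' implementation: the function names, the 256-byte block size, and the choice to take the mean and the 25th/75th percentiles over per-block Shannon entropies are all assumptions made for illustration, since the record does not say how those statistics are computed.

```python
import math
import statistics
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits (0.0 for empty input)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())


def renyi_entropy(data: bytes, alpha: float) -> float:
    """Renyi entropy of order alpha (alpha > 0, alpha != 1), in bits."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    if not data:
        return 0.0
    n = len(data)
    power_sum = sum((c / n) ** alpha for c in Counter(data).values())
    return math.log2(power_sum) / (1.0 - alpha)


def entropy_features(data: bytes, block_size: int = 256) -> dict:
    """Build a feature dict for one file.

    Assumption (not from the source): mean and quartile entropy are taken
    over Shannon entropies of fixed-size blocks; block_size is hypothetical.
    """
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    block_entropies = [shannon_entropy(b) for b in blocks]

    if len(block_entropies) >= 2:
        # quantiles(n=4) returns the three quartile cut points.
        q25, _, q75 = statistics.quantiles(block_entropies, n=4, method="inclusive")
    else:
        # Too few blocks to interpolate: fall back to the single value (or 0.0).
        q25 = q75 = block_entropies[0] if block_entropies else 0.0

    # Whole-file Renyi entropy for the orders listed in the summary.
    features = {f"renyi_alpha_{a}": renyi_entropy(data, a) for a in (2, 4, 6)}
    features["mean_entropy"] = (
        statistics.fmean(block_entropies) if block_entropies else 0.0
    )
    features["entropy_q25"] = q25
    features["entropy_q75"] = q75
    return features
```

Calling `entropy_features(open(path, "rb").read())` on an archive would give one feature row per file. Under this sketch's assumptions, such rows could then be fed to any classifier, including a stacking ensemble of the kind the summary describes.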