MPJ-SPARK Integration-Based Technique to Enhance Big Data Analytics in High Performance Computing Environments

Bibliographic Details
Main Authors: Sakhr A. Saleh, Maher A. Khemakhem, Fathy E. Eassa
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/11062570/
Description
Summary: The explosion of data from various sources such as smartphone applications, sensors, social media, and High-Performance Computing (HPC) simulations has driven demand for high-performance data analytics. Traditional analytics tools lag behind HPC in computational efficiency, whereas machine learning workloads require substantial resources. However, integrating HPC and big data presents challenges due to architectural differences. This study introduces an MPJ-Spark integration-based technique that includes a novel multi-Spark-driver architecture to bridge this gap. MPJ-Spark enables a single application to execute concurrently across multiple Spark drivers, thereby improving parallelization and resource management. The methodology involves designing an MPJ-Spark cluster that integrates Message Passing in Java (MPJ) with Spark for efficient communication across HPC nodes. A single MPJ root process manages cluster communications, partitions the input file, and distributes partition metadata to MPJ workers. Each worker operates with an isolated Spark driver, processes tasks independently, and returns results to the root process for aggregation. This eliminates remote shuffling and improves network efficiency. A key-value data structure was developed to facilitate data exchange and to convert Resilient Distributed Datasets (RDDs) into contiguous arrays for MPJ. A shared-storage-aware file manager was designed to improve the reading, writing, and partitioning of the datasets. MPJ-Spark was evaluated on the Aziz Supercomputer using the WordCount workload across datasets ranging from 32 GB to 4.3 TB. The results demonstrated a significant improvement in execution time, ranging from 4x to 6x faster than Spark. This technique enables big data applications to leverage HPC's computational power and effectively address the gap between these platforms.
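The key-value exchange described in the summary — converting per-driver results into contiguous arrays for MPJ transfer and aggregating them at the root — can be sketched as follows. This is a hypothetical illustration under assumed names (`pack`, `merge`, `KeyValueExchange`), not the authors' actual implementation; it shows only the pack-at-worker / merge-at-root pattern using plain Java collections in place of real MPJ sends.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: flatten a key-value result (e.g., a collected
// WordCount partition from one Spark driver) into contiguous parallel
// arrays suitable for an MPJ point-to-point send, then merge several
// such arrays at the root process.
public class KeyValueExchange {

    // Contiguous representation: parallel key/count arrays.
    static class Packed {
        final String[] keys;
        final long[] counts;
        Packed(String[] keys, long[] counts) { this.keys = keys; this.counts = counts; }
    }

    // Worker side: convert a per-driver result map into contiguous arrays.
    static Packed pack(Map<String, Long> counts) {
        String[] keys = new String[counts.size()];
        long[] values = new long[counts.size()];
        int i = 0;
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            keys[i] = e.getKey();
            values[i] = e.getValue();
            i++;
        }
        return new Packed(keys, values);
    }

    // Root side: aggregate packed arrays received from each worker,
    // summing counts for keys that appear in multiple partitions.
    static Map<String, Long> merge(Packed... parts) {
        Map<String, Long> total = new LinkedHashMap<>();
        for (Packed p : parts)
            for (int i = 0; i < p.keys.length; i++)
                total.merge(p.keys[i], p.counts[i], Long::sum);
        return total;
    }

    public static void main(String[] args) {
        Map<String, Long> worker1 = new LinkedHashMap<>();
        worker1.put("spark", 3L); worker1.put("mpj", 1L);
        Map<String, Long> worker2 = new LinkedHashMap<>();
        worker2.put("spark", 2L); worker2.put("hpc", 5L);

        Map<String, Long> total = merge(pack(worker1), pack(worker2));
        System.out.println(total); // aggregated word counts at the root
    }
}
```

In the real system the `Packed` arrays would be transmitted with MPJ primitives rather than passed in-process; contiguous primitive arrays are the natural payload for message-passing libraries, which is presumably why the authors avoid shipping RDD objects directly.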
ISSN:2169-3536
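The root process's partitioning step — splitting a shared input file into per-worker byte ranges and distributing the resulting metadata — might look like the following sketch. The class and method names (`PartitionPlanner`, `plan`) and the even byte-range split are assumptions for illustration; the paper's shared-storage-aware file manager is not described in enough detail here to reproduce.

```java
// Hypothetical sketch of the root's partitioning step: split a file of
// `totalBytes` into per-worker (offset, length) metadata, which the MPJ
// root would then distribute to each worker's isolated Spark driver.
public class PartitionPlanner {

    // Returns one {offset, length} pair per worker, covering the file
    // exactly once with no overlap; the remainder bytes are spread over
    // the first few workers so sizes differ by at most one byte.
    static long[][] plan(long totalBytes, int workers) {
        long base = totalBytes / workers;
        long rem = totalBytes % workers;
        long[][] parts = new long[workers][2];
        long offset = 0;
        for (int w = 0; w < workers; w++) {
            long len = base + (w < rem ? 1 : 0);
            parts[w][0] = offset; // start offset in the shared file
            parts[w][1] = len;    // bytes assigned to this worker
            offset += len;
        }
        return parts;
    }

    public static void main(String[] args) {
        for (long[] p : plan(10, 3))
            System.out.println("offset=" + p[0] + " length=" + p[1]);
    }
}
```

Because every worker reads its own disjoint range from shared storage and returns results only to the root, no worker-to-worker shuffle is needed, which matches the summary's claim that remote shuffling is eliminated.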