A performance enhanced distributed computing framework for clustering by local direction centrality upon Apache Spark

Bibliographic Details
Main Authors: Zhipeng Gui, Zichen Huang, Huan Chen, Dehua Peng, Yuhang Liu, Guangyao Fang, Qianxi Lan, Anqi Zhao, Huayi Wu
Format: Article
Language: English
Published: Taylor & Francis Group 2025-07-01
Series: Big Earth Data
Online Access: https://www.tandfonline.com/doi/10.1080/20964471.2025.2529639
Description
Summary: Clustering by local direction centrality (CDC) is a recently proposed, versatile algorithm adept at identifying clusters with heterogeneous density and weak connectivity. Its advantages in accuracy and robustness have been widely validated in computer science, bioscience, and geoscience. However, it has quadratic time complexity due to costly K-nearest neighbor (KNN) search and internal connection operations, which hinders its ability to handle large-scale datasets. To improve its computational efficiency and scalability, we propose a performance-enhanced distributed framework for CDC, named D-CDC, that combines workflow-level algorithm optimization with distributed computational acceleration. Specifically, KDTree spatial indexing is leveraged to reduce the KNN search complexity to logarithmic time, and KNN constraints and disjoint sets are introduced to decrease the computational cost of internal connection. In addition, to minimize cross-partition communication, we design an Improved QuadTree (ImprovedQT) spatial partitioning method that accounts for cluster completeness and shape regularity. We then implement D-CDC on the Apache Spark framework using Resilient Distributed Dataset (RDD) customization techniques. Experiments on six synthetic datasets demonstrate that D-CDC generally preserves the clustering accuracy of the original CDC and achieves up to a 600-fold speedup, reducing the runtime from 142,590 s to 236 s on million-scale datasets. A real-world case study on over 2 million enterprise registration POI records in the Chinese mainland further validates that D-CDC can efficiently identify fine-grained and weakly connected aggregation patterns in large-scale geospatial data.
ISSN: 2096-4471, 2574-5417
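
Illustrative sketch: the summary names two of the workflow-level optimizations, KD-tree indexing for KNN search and disjoint sets for connecting points under a KNN constraint. The Python below is a minimal single-machine sketch of how those two pieces can fit together. It is not the authors' D-CDC code: it omits the direction-centrality step that CDC uses to separate internal from boundary points, as well as the ImprovedQT partitioning and the Spark layer. All names (build_kdtree, knn, DisjointSet, knn_constrained_components) are hypothetical, and reading "KNN constraint" as "only merge a point with its own k nearest neighbors" is an assumption.

```python
import heapq
from math import dist


def build_kdtree(points, indices=None, depth=0):
    """Build a KD-tree as nested tuples (point_index, axis, left, right)."""
    if indices is None:
        indices = list(range(len(points)))
    if not indices:
        return None
    axis = depth % len(points[0])  # cycle through the coordinate axes
    indices = sorted(indices, key=lambda i: points[i][axis])
    mid = len(indices) // 2  # median point becomes the splitting node
    return (indices[mid], axis,
            build_kdtree(points, indices[:mid], depth + 1),
            build_kdtree(points, indices[mid + 1:], depth + 1))


def knn(points, node, query, k, heap=None):
    """Return the k nearest neighbors of `query` as a heap of (-distance, index)."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    idx, axis, left, right = node
    d = dist(points[idx], query)
    if len(heap) < k:
        heapq.heappush(heap, (-d, idx))
    elif d < -heap[0][0]:
        heapq.heapreplace(heap, (-d, idx))  # evict the current farthest candidate
    diff = query[axis] - points[idx][axis]
    near, far = (left, right) if diff < 0 else (right, left)
    knn(points, near, query, k, heap)
    # Search the far side only if the splitting plane is closer than the
    # current k-th neighbor (or fewer than k candidates were found so far).
    if len(heap) < k or abs(diff) < -heap[0][0]:
        knn(points, far, query, k, heap)
    return heap


class DisjointSet:
    """Union-find structure with path halving."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def knn_constrained_components(points, k):
    """Label points by merging each one only with its k nearest neighbors.

    Assumption: this stands in for a KNN-constrained connection step; real
    CDC would connect internal points only, after computing direction
    centrality, which is not modeled here.
    """
    tree = build_kdtree(points)
    sets = DisjointSet(len(points))
    for i, p in enumerate(points):
        # Ask for k + 1 results because the query point is its own nearest neighbor.
        for _, j in knn(points, tree, p, k + 1):
            sets.union(i, j)
    return [sets.find(i) for i in range(len(points))]
```

As a usage example, calling knn_constrained_components on two well-separated 2 x 2 grids of points with k = 2 gives every point in one grid the same label and the two grids different labels. Restricting merges to neighbor pairs means each point triggers only k union operations rather than a comparison against every other point, which is the kind of saving the summary attributes to KNN constraints and disjoint sets.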