A performance enhanced distributed computing framework for clustering by local direction centrality upon Apache Spark
Main Authors: | , , , , , , , , |
---|---|
Format: | Article |
Language: | English |
Published: | Taylor & Francis Group, 2025-07-01 |
Series: | Big Earth Data |
Subjects: | |
Online Access: | https://www.tandfonline.com/doi/10.1080/20964471.2025.2529639 |
Summary: | Clustering by local direction centrality (CDC) is a recently proposed, versatile algorithm adept at identifying clusters with heterogeneous density and weak connectivity. Its advantages in accuracy and robustness have been widely validated in computer science, bioscience, and geoscience. However, it has quadratic time complexity due to costly K-nearest neighbor (KNN) search and internal connection operations, which hinders its ability to handle large-scale datasets. To improve its computational efficiency and scalability, we propose a performance-enhanced distributed framework for CDC, named D-CDC, that combines workflow-level algorithm optimization with distributed computational acceleration. Specifically, KDTree spatial indexing is leveraged to reduce the KNN search complexity to logarithmic time, and KNN constraints and disjoint sets are introduced to decrease the computational cost of internal connection. In addition, to minimize cross-partition communication, we design an Improved QuadTree (ImprovedQT) spatial partitioning method that considers cluster completeness and shape regularity. We then implement D-CDC on the Apache Spark framework using Resilient Distributed Dataset (RDD) customization techniques. Experiments on six synthetic datasets demonstrate that D-CDC generally preserves the clustering accuracy of the original CDC and achieves up to a 600-fold speedup, reducing the runtime from 142,590 s to 236 s on million-scale datasets. A real-world case study on over 2 million enterprise registration POI records in the Chinese mainland further validates that D-CDC can efficiently identify fine-grained and weakly connected aggregation patterns in large-scale geospatial data. |
---|---|
ISSN: | 2096-4471 2574-5417 |
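
The summary names two single-machine optimizations: KDTree indexing for the KNN search, and KNN constraints combined with disjoint sets for the connection step. The sketch below illustrates those two ideas only, in plain Python with no Spark. It is not the D-CDC implementation: the function names, the 2-D point type, and the `max_dist` threshold (a hypothetical stand-in for CDC's actual connection criterion, which this record does not describe) are all assumptions made for illustration.

```python
import heapq
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


class KDNode:
    def __init__(self, index: int, axis: int,
                 left: "Optional[KDNode]", right: "Optional[KDNode]"):
        self.index, self.axis, self.left, self.right = index, axis, left, right


def build_kdtree(points: Sequence[Point], indices: List[int],
                 depth: int = 0) -> Optional[KDNode]:
    """Split on alternating axes at the median point."""
    if not indices:
        return None
    axis = depth % 2
    indices = sorted(indices, key=lambda i: points[i][axis])
    mid = len(indices) // 2
    return KDNode(indices[mid], axis,
                  build_kdtree(points, indices[:mid], depth + 1),
                  build_kdtree(points, indices[mid + 1:], depth + 1))


def knn(points: Sequence[Point], root: Optional[KDNode],
        query: int, k: int) -> List[int]:
    """Indices of the k nearest neighbours of points[query], nearest first."""
    q = points[query]
    heap: List[Tuple[float, int]] = []  # max-heap via negated squared distance

    def visit(node: Optional[KDNode]) -> None:
        if node is None:
            return
        p = points[node.index]
        if node.index != query:
            d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
            if len(heap) < k:
                heapq.heappush(heap, (-d2, node.index))
            elif d2 < -heap[0][0]:
                heapq.heapreplace(heap, (-d2, node.index))
        diff = q[node.axis] - p[node.axis]
        near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
        visit(near)
        # Descend the far side only if the splitting plane is closer
        # than the current k-th neighbour (or fewer than k were found).
        if len(heap) < k or diff * diff < -heap[0][0]:
            visit(far)

    visit(root)
    return [i for _, i in sorted(heap, reverse=True)]


class DisjointSet:
    def __init__(self, n: int):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra  # attach the smaller tree under the larger
        self.size[ra] += self.size[rb]


def connect_by_knn(points: Sequence[Point], k: int, max_dist: float) -> List[int]:
    """Merge each point with those of its k nearest neighbours within max_dist.

    max_dist is a hypothetical threshold, not CDC's real connection rule.
    Returns one representative label per point.
    """
    root = build_kdtree(points, list(range(len(points))))
    ds = DisjointSet(len(points))
    for i, (xi, yi) in enumerate(points):
        for j in knn(points, root, i, k):  # candidates limited to KNN
            xj, yj = points[j]
            if (xi - xj) ** 2 + (yi - yj) ** 2 <= max_dist ** 2:
                ds.union(i, j)
    return [ds.find(i) for i in range(len(points))]


if __name__ == "__main__":
    demo = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
    print(connect_by_knn(demo, k=2, max_dist=2.0))
```

Restricting merge candidates to each point's KNN list replaces an all-pairs comparison with roughly n·k checks, and the disjoint set makes each merge nearly constant time. That is the general reason these two structures reduce the cost of a connection step. How D-CDC applies them in detail is described in the article itself.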