Big Data Clustering Using Spark MLlib and HDFS

Authors

  • Prof. (Dr) MSR Prasad, K L E F Deemed To Be University, Green Fields, Vaddeswaram, Andhra Pradesh 522302, India. Email: email2msr@gmail.com

Keywords

big data, clustering, Spark MLlib, HDFS, K-Means, Gaussian Mixture Model, Bisecting K-Means, scalability, silhouette score, distributed computing

Abstract

Clustering at web scale strains conventional machine‐learning stacks because algorithms must iterate over terabytes of data while respecting storage locality, memory limits, and network constraints. Apache Spark and the Hadoop Distributed File System (HDFS) provide a practical foundation for unsupervised learning at this scale: Spark’s in-memory, iterative computing model drastically reduces disk I/O relative to classic MapReduce, and HDFS supplies fault-tolerant, high-throughput storage with data locality awareness. This manuscript presents an end-to-end approach for big data clustering using Spark MLlib on HDFS. After motivating use cases and reviewing relevant work, we detail a methodology that covers data ingestion, feature engineering, dimensionality reduction, algorithm selection (K-Means, Bisecting K-Means, Gaussian Mixture Models), hyper-parameter tuning, and distributed evaluation (Silhouette, Davies–Bouldin, and Calinski–Harabasz indices).

We then describe a simulation study that emulates both synthetic, multi-density clusters and a semi-structured behavioral dataset, executed on a modest multi-node cluster with HDFS‐resident Parquet inputs. Results show that MLlib’s K-Means with k-means|| initialization provides strong baselines and the best time-to-insight for large, moderately separated clusters; Bisecting K-Means yields more stable partitions on highly imbalanced cluster sizes; and Gaussian Mixture Models capture elliptical structure but at increased computational cost. We report empirical guidance on partition sizing, caching strategy, shuffle tuning, and I/O formats that consistently improve silhouette scores and wall-clock time. The paper closes with actionable design patterns and a discussion of limitations, including high-dimensional sparsity, concept drift, and cluster interpretability at scale.


Published

2026-03-02

How to Cite

Prof. (Dr) MSR Prasad. “Big Data Clustering Using Spark MLlib and HDFS”. International Journal of Advanced Research in Computer Science and Engineering (IJARCSE) 2, no. 1 (March 2, 2026): 22–32. Accessed March 5, 2026. https://ijarcse.org/index.php/ijarcse/article/view/117.
