Big Data Clustering Using Spark MLlib and HDFS
Keywords:
big data, clustering, Spark MLlib, HDFS, K-Means, Gaussian Mixture Model, Bisecting K-Means, scalability, silhouette score, distributed computing

Abstract
Clustering at web scale strains conventional machine‐learning stacks because algorithms must iterate over terabytes of data while respecting storage locality, memory limits, and network constraints. Apache Spark and the Hadoop Distributed File System (HDFS) provide a practical foundation for unsupervised learning at this scale: Spark’s in-memory, iterative computing model drastically reduces disk I/O relative to classic MapReduce, and HDFS supplies fault-tolerant, high-throughput storage with data locality awareness. This manuscript presents an end-to-end approach for big data clustering using Spark MLlib on HDFS. After motivating use cases and reviewing relevant work, we detail a methodology that covers data ingestion, feature engineering, dimensionality reduction, algorithm selection (K-Means, Bisecting K-Means, Gaussian Mixture Models), hyper-parameter tuning, and distributed evaluation (Silhouette, Davies–Bouldin, and Calinski–Harabasz indices).
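The distributed evaluation step rests on internal indices such as the silhouette score, where each point i gets s(i) = (b − a) / max(a, b), with a the mean intra-cluster distance and b the mean distance to the nearest other cluster. As a single-machine sketch for intuition only (in MLlib the analogous computation is performed in a distributed fashion by `ClusteringEvaluator`), the index can be written in plain Python; the function and toy data below are illustrative, not drawn from the paper:

```python
from math import dist  # Euclidean distance (Python 3.8+)

def silhouette(points, labels):
    """Mean silhouette coefficient over all points.

    s(i) = (b - a) / max(a, b), where a is the mean distance from
    point i to members of its own cluster and b is the mean distance
    to the nearest other cluster. Singleton clusters contribute 0.
    """
    # Group point indices by cluster label.
    clusters = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)
    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # s(i) = 0 by convention; adds nothing to the sum
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(dist(points[i], points[j]) for j in idx) / len(idx)
            for other, idx in clusters.items() if other != label
        )
        total += (b - a) / max(a, b)
    return total / len(points)

# Two well-separated toy clusters score close to 1.
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
print(silhouette(pts, [0, 0, 0, 1, 1, 1]))
```

Scores near 1 indicate compact, well-separated partitions; values near 0 or below indicate overlapping or misassigned points, which is how the comparisons across K-Means, Bisecting K-Means, and GMM are read in the study.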
We then describe a simulation study that emulates both synthetic, multi-density clusters and a semi-structured behavioral dataset, executed on a modest multi-node cluster with HDFS‐resident Parquet inputs. Results show that MLlib’s K-Means with k-means|| initialization provides strong baselines and the best time-to-insight for large, moderately separated clusters; Bisecting K-Means yields more stable partitions on highly imbalanced cluster sizes; and Gaussian Mixture Models capture elliptical structure but at increased computational cost. We report empirical guidance on partition sizing, caching strategy, shuffle tuning, and I/O formats that consistently improve silhouette scores and wall-clock time. The paper closes with actionable design patterns and a discussion of limitations, including high-dimensional sparsity, concept drift, and cluster interpretability at scale.
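The tuning levers mentioned above (partition sizing, caching, shuffle behavior, I/O formats) map onto a handful of Spark properties. The fragment below is a hedged starting point in `spark-defaults.conf` style, with illustrative values rather than the paper's measured configuration; appropriate numbers depend on cluster size, executor cores, and HDFS block size:

```properties
# Illustrative settings for iterative MLlib clustering over HDFS-resident
# Parquet. Values are starting points to tune, not measured results.

# Shuffle parallelism: a common rule of thumb is 2-3x total executor cores.
spark.sql.shuffle.partitions        400

# Kryo serialization reduces shuffle and cache footprint versus Java serialization.
spark.serializer                    org.apache.spark.serializer.KryoSerializer

# Input split size; 128 MB aligns with the default HDFS block size.
spark.sql.files.maxPartitionBytes   134217728

# Executor sizing for iterative, cache-heavy workloads.
spark.executor.memory               8g
spark.memory.fraction               0.6
```

In addition, persisting the assembled feature DataFrame (for example with `MEMORY_AND_DISK` storage) before the iterative fit avoids re-reading Parquet from HDFS on every K-Means or GMM iteration, which is typically where the wall-clock savings over MapReduce-style execution come from.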
License
Copyright (c) 2026. The journal retains copyright of all published articles, ensuring that authors have control over their work while allowing wide dissemination.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Articles are published under the Creative Commons Attribution-NonCommercial 4.0 License (CC BY-NC 4.0), allowing others to distribute, remix, adapt, and build upon the work for non-commercial purposes while crediting the original author.
