Big Data Clustering Using Spark MLlib and HDFS

Authors

  • Prof. (Dr) MSR Prasad, K L E F Deemed To Be University, Green Fields, Vaddeswaram, Andhra Pradesh 522302, India. Email: email2msr@gmail.com

Keywords

big data, clustering, Spark MLlib, HDFS, K-Means, Gaussian Mixture Model, Bisecting K-Means, scalability, silhouette score, distributed computing

Abstract

Clustering at web scale strains conventional machine‐learning stacks because algorithms must iterate over terabytes of data while respecting storage locality, memory limits, and network constraints. Apache Spark and the Hadoop Distributed File System (HDFS) provide a practical foundation for unsupervised learning at this scale: Spark’s in-memory, iterative computing model drastically reduces disk I/O relative to classic MapReduce, and HDFS supplies fault-tolerant, high-throughput storage with data locality awareness. This manuscript presents an end-to-end approach for big data clustering using Spark MLlib on HDFS. After motivating use cases and reviewing relevant work, we detail a methodology that covers data ingestion, feature engineering, dimensionality reduction, algorithm selection (K-Means, Bisecting K-Means, Gaussian Mixture Models), hyper-parameter tuning, and distributed evaluation (Silhouette, Davies–Bouldin, and Calinski–Harabasz indices).

We then describe a simulation study that emulates both synthetic, multi-density clusters and a semi-structured behavioral dataset, executed on a modest multi-node cluster with HDFS‐resident Parquet inputs. Results show that MLlib’s K-Means with k-means|| initialization provides strong baselines and the best time-to-insight for large, moderately separated clusters; Bisecting K-Means yields more stable partitions on highly imbalanced cluster sizes; and Gaussian Mixture Models capture elliptical structure but at increased computational cost. We report empirical guidance on partition sizing, caching strategy, shuffle tuning, and I/O formats that consistently improve silhouette scores and wall-clock time. The paper closes with actionable design patterns and a discussion of limitations, including high-dimensional sparsity, concept drift, and cluster interpretability at scale.


Published

2026-03-02

How to Cite

Prof. (Dr) MSR Prasad. “Big Data Clustering Using Spark MLlib and HDFS”. International Journal of Advanced Research in Computer Science and Engineering (IJARCSE) 2, no. 1 (March 2, 2026): 22–32. Accessed March 5, 2026. https://ijarcse.org/index.php/ijarcse/article/view/117.
