Big Data Clustering Using Spark MLlib and HDFS
Keywords:
big data, clustering, Spark MLlib, HDFS, K-Means, Gaussian Mixture Model, Bisecting K-Means, scalability, silhouette score, distributed computing

Abstract
Clustering at web scale strains conventional machine‐learning stacks because algorithms must iterate over terabytes of data while respecting storage locality, memory limits, and network constraints. Apache Spark and the Hadoop Distributed File System (HDFS) provide a practical foundation for unsupervised learning at this scale: Spark’s in-memory, iterative computing model drastically reduces disk I/O relative to classic MapReduce, and HDFS supplies fault-tolerant, high-throughput storage with data locality awareness. This manuscript presents an end-to-end approach for big data clustering using Spark MLlib on HDFS. After motivating use cases and reviewing relevant work, we detail a methodology that covers data ingestion, feature engineering, dimensionality reduction, algorithm selection (K-Means, Bisecting K-Means, Gaussian Mixture Models), hyper-parameter tuning, and distributed evaluation (Silhouette, Davies–Bouldin, and Calinski–Harabasz indices).
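The distributed evaluation step rests on internal indices such as the silhouette score, where each point i gets s(i) = (b − a) / max(a, b), with a the mean intra-cluster distance and b the mean distance to the nearest other cluster. As a single-machine sketch for intuition only (in MLlib the analogous computation is performed in a distributed fashion by `ClusteringEvaluator`), the index can be written in plain Python; the function and toy data below are illustrative, not drawn from the paper:

```python
from math import dist  # Euclidean distance (Python 3.8+)

def silhouette(points, labels):
    """Mean silhouette coefficient over all points.

    s(i) = (b - a) / max(a, b), where a is the mean distance from
    point i to members of its own cluster and b is the mean distance
    to the nearest other cluster. Singleton clusters contribute 0.
    """
    # Group point indices by cluster label.
    clusters = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)
    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # s(i) = 0 by convention; adds nothing to the sum
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(dist(points[i], points[j]) for j in idx) / len(idx)
            for other, idx in clusters.items() if other != label
        )
        total += (b - a) / max(a, b)
    return total / len(points)

# Two well-separated toy clusters score close to 1.
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
print(silhouette(pts, [0, 0, 0, 1, 1, 1]))
```

Scores near 1 indicate compact, well-separated partitions; values near 0 or below indicate overlapping or misassigned points, which is how the comparisons across K-Means, Bisecting K-Means, and GMM are read in the study.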
We then describe a simulation study that emulates both synthetic, multi-density clusters and a semi-structured behavioral dataset, executed on a modest multi-node cluster with HDFS‐resident Parquet inputs. Results show that MLlib’s K-Means with k-means|| initialization provides strong baselines and the best time-to-insight for large, moderately separated clusters; Bisecting K-Means yields more stable partitions on highly imbalanced cluster sizes; and Gaussian Mixture Models capture elliptical structure but at increased computational cost. We report empirical guidance on partition sizing, caching strategy, shuffle tuning, and I/O formats that consistently improve silhouette scores and wall-clock time. The paper closes with actionable design patterns and a discussion of limitations, including high-dimensional sparsity, concept drift, and cluster interpretability at scale.
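The tuning levers mentioned above (partition sizing, caching, shuffle behavior, I/O formats) map onto a handful of Spark properties. The fragment below is a hedged starting point in `spark-defaults.conf` style, with illustrative values rather than the paper's measured configuration; appropriate numbers depend on cluster size, executor cores, and HDFS block size:

```properties
# Illustrative settings for iterative MLlib clustering over HDFS-resident
# Parquet. Values are starting points to tune, not measured results.

# Shuffle parallelism: a common rule of thumb is 2-3x total executor cores.
spark.sql.shuffle.partitions        400

# Kryo serialization reduces shuffle and cache footprint versus Java serialization.
spark.serializer                    org.apache.spark.serializer.KryoSerializer

# Input split size; 128 MB aligns with the default HDFS block size.
spark.sql.files.maxPartitionBytes   134217728

# Executor sizing for iterative, cache-heavy workloads.
spark.executor.memory               8g
spark.memory.fraction               0.6
```

In addition, persisting the assembled feature DataFrame (for example with `MEMORY_AND_DISK` storage) before the iterative fit avoids re-reading Parquet from HDFS on every K-Means or GMM iteration, which is typically where the wall-clock savings over MapReduce-style execution come from.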
License
Copyright (c) 2026. The journal retains copyright of all published articles, ensuring that authors have control over their work while allowing wide dissemination.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Articles are published under the Creative Commons Attribution-NonCommercial 4.0 License (CC BY-NC 4.0), allowing others to distribute, remix, adapt, and build upon the work for non-commercial purposes while crediting the original author.
