Clustering with Spark on OpenStack Cloud. [OpenStack] [Hadoop] [Spark] [Ansible] [YAML] [Scala] [SBT] [Shell]

mjaglan/bigdata-example-project

bigdata-example-project

This project uses KMeans clustering with the Euclidean distance measure to group similar data points into 8 clusters, and then reports the sum of squared errors (SSE) of the resulting clusters.
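As an illustration of the SSE metric, here is a minimal plain-Scala sketch that does not use Spark. The object and function names (`SseSketch`, `squaredDistance`, `nearest`, `sse`) and the toy data are assumptions made for this example; they are not taken from the project's kmeans.demo.scala.

```scala
object SseSketch {
  // Hypothetical point type for this sketch: a fixed-length vector of doubles.
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the centroid closest to p (first one wins on ties).
  def nearest(p: Point, centroids: Seq[Point]): Int =
    centroids.indices.minBy(i => squaredDistance(p, centroids(i)))

  // SSE: sum of squared distances from each point to its closest centroid.
  def sse(points: Seq[Point], centroids: Seq[Point]): Double =
    points.map(p => squaredDistance(p, centroids(nearest(p, centroids)))).sum

  def main(args: Array[String]): Unit = {
    // Toy data (made up): two well-separated pairs of points.
    val points = Seq(
      Vector(0.0, 0.0), Vector(0.0, 2.0),
      Vector(10.0, 0.0), Vector(10.0, 2.0)
    )
    val centroids = Seq(Vector(0.0, 1.0), Vector(10.0, 1.0))
    // Each point is at distance 1 from its centroid, so the SSE is 4.0.
    println(sse(points, centroids))
  }
}
```

A lower SSE means the points sit closer to their assigned cluster centers, which is why it serves as the quality figure for a clustering run.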

The objective is to run the analysis on an OpenStack cloud by automating the major steps with Ansible. The Ansible scripts create the VMs, set up the Hadoop cluster, install the required software, retrieve the dataset and upload it to HDFS, and copy the analysis code to the master node of the Hadoop cluster. Then log in to the master node, run the analysis code on the data in HDFS, retrieve the results, and display the algorithm's output.

youtube1

Results: Running the KMeans algorithm for 30 iterations on 13,700+ records with 8 clusters gave a sum of squared errors (SSE) of around 6300 ± 500. We ran the code from scratch multiple times.
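One plausible reason for a spread in SSE across runs is that k-means depends on its starting centroids. The sketch below is a plain-Scala version of the standard Lloyd iteration with a seeded random start. It is an assumption-laden illustration, not the project's Spark code: `KMeansSketch`, its functions, the seed parameter, and the toy data are all hypothetical.

```scala
import scala.util.Random

object KMeansSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the centroid closest to p (first one wins on ties).
  def nearest(p: Point, centroids: Seq[Point]): Int =
    centroids.indices.minBy(i => squaredDistance(p, centroids(i)))

  // Component-wise mean of a non-empty group of points.
  def mean(group: Seq[Point]): Point =
    group.transpose.map(_.sum / group.size).toVector

  // Lloyd's algorithm: start from k randomly chosen points, then repeat
  // "assign each point to its nearest centroid, recompute each centroid".
  def kmeans(points: Seq[Point], k: Int, iterations: Int, seed: Long): Seq[Point] = {
    val initial = new Random(seed).shuffle(points).take(k)
    (1 to iterations).foldLeft(initial) { (centroids, _) =>
      val groups = points.groupBy(p => nearest(p, centroids))
      // A centroid that attracted no points keeps its previous position.
      centroids.indices.map(i => groups.get(i).map(mean).getOrElse(centroids(i)))
    }
  }

  // Sum of squared distances from each point to its closest centroid.
  def sse(points: Seq[Point], centroids: Seq[Point]): Double =
    points.map(p => squaredDistance(p, centroids(nearest(p, centroids)))).sum

  def main(args: Array[String]): Unit = {
    // Toy data (made up); the real run used 13,700+ records and 8 clusters.
    val points = Seq(
      Vector(0.0, 0.0), Vector(0.0, 2.0),
      Vector(10.0, 0.0), Vector(10.0, 2.0)
    )
    // Same data, same iteration count, different seeds.
    for (seed <- 1L to 5L) {
      val centroids = kmeans(points, k = 2, iterations = 30, seed = seed)
      println(s"seed=$seed SSE=${sse(points, centroids)}")
    }
  }
}
```

On this toy data the result depends only on which two points are drawn as the initial centroids: a left/right split gives an SSE of 4.0, while a start from two points on the same side settles into a worse grouping with an SSE of 100.0. Real data shows the same effect on a smaller scale.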

Implementation: The entry point for this project is launch.sh, located in /src. The /src/twitter/ directory contains the main source code:

site.yml
|--software.yml  // install necessary software on the VM
|--dataset.yml   // retrieve the dataset and upload it to HDFS
|--analysis.yml  // copy the analysis code-base

These playbooks install the necessary software on the VM, retrieve the dataset and upload it to HDFS, and copy the following analysis code base:

main.sh
|--twitter.sbt
|--kmeans.demo.scala

to the master node.

To learn how to run this project, refer to the installation.rst file. For a sample video demo of this project, see:

youtube2
