Big Data Computing course assignments, focused on using the Apache Spark framework to tackle clustering problems on big data.

federicochiarello/BigDataComputing

Big Data Computing

Homeworks:

This repository contains all the assignments for the Big Data Computing course.
The purpose of these homeworks is to get acquainted with Spark and its use in implementing MapReduce algorithms.
The last two homeworks focus on the k-center with z outliers problem, a robust version of the k-center problem that is useful for analyzing noisy data.
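
To make the objective concrete, here is a minimal sketch of how a k-center with z outliers solution can be scored: the cost is the largest distance from a point to its nearest center, after the z farthest points have been discarded. The function name `kcenter_outliers_objective` and the toy data are illustrative assumptions, not code from the homeworks.

```python
import math


def kcenter_outliers_objective(points, centers, z):
    """Largest point-to-nearest-center distance, ignoring the z farthest points.

    Hypothetical helper for illustration; not taken from the homework code.
    """
    # Distance from every point to its closest center, in increasing order.
    dists = sorted(min(math.dist(p, c) for c in centers) for p in points)
    # Drop the z largest distances: those points are treated as outliers.
    kept = dists[:max(0, len(dists) - z)]
    return kept[-1] if kept else 0.0


# Toy data (assumed): three points near the origin and one far-away noisy point.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
centers = [(0.0, 0.0)]

plain = kcenter_outliers_objective(points, centers, z=0)   # plain k-center cost
robust = kcenter_outliers_objective(points, centers, z=1)  # one outlier allowed
print(plain, robust)
```

With z = 0 the single noisy point dominates the cost, while with z = 1 it is discarded and the cost drops to the radius of the actual cluster. This is the sense in which the problem is a robust version of k-center.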

  • Homework 1:
    In the first homework I developed a Spark program to analyze an online retailer's dataset of customer transactions, where each transaction represents several products purchased by one customer.

  • Homework 2:
    In the second homework I implemented the sequential 3-approximation algorithm for the k-center with z outliers problem. This algorithm, proposed by Charikar et al., is simple to implement but has superlinear complexity.

  • Homework 3:
    In the third homework I implemented a 2-round MapReduce coreset-based algorithm for the k-center with z outliers problem, where the use of the inefficient 3-approximation is confined to a small coreset, which is computed in parallel through the efficient Farthest-First Traversal.
    This implementation was run on a large dataset (about 1.2M points in 7 dimensions) on the CloudVeneto cluster.
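
The coreset step described for Homework 3 can be sketched sequentially, for a single partition, as follows. This is a plain-Python illustration under stated assumptions, not the Spark code from the repository: the helper names, the choice of the first point as the initial center, and weighting each coreset point by the number of partition points closest to it are all assumptions made for the example.

```python
import math


def farthest_first_traversal(points, k):
    """Pick up to k centers: start from points[0] (an assumed choice), then
    repeatedly add the point farthest from the centers chosen so far."""
    centers = [points[0]]
    # nearest[i] = distance from points[i] to its closest chosen center
    nearest = [math.dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        idx = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[idx])
        # One pass over the points updates all nearest-center distances.
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[idx]))
    return centers


def weighted_coreset(partition, size):
    """Hypothetical helper: coreset points from the traversal, each weighted by
    how many partition points are closest to it (an assumed weighting)."""
    centers = farthest_first_traversal(partition, size)
    weights = [0] * len(centers)
    for p in partition:
        j = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
        weights[j] += 1
    return list(zip(centers, weights))


# Toy partition (assumed): two tight groups plus one isolated point.
partition = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (20.0, 20.0)]
coreset = weighted_coreset(partition, 3)
print(coreset)
```

Each traversal step costs one pass over the partition, which is why it is cheap enough to run on every partition in parallel. In a two-round scheme, the small weighted coresets would then be gathered so that the expensive sequential algorithm only runs on their union.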

Plots:

I also developed some programs that plot the results of the different clustering algorithms.
Below are two examples of the results obtained on an Uber dataset.


clustering example


coreset example
