This repository contains all the assignments of the Big Data Computing course.
The purpose of these homeworks is to get acquainted with Spark and with its use to implement MapReduce algorithms.
The last two homeworks focus on the k-center with z outliers problem, a robust version of the k-center problem that is useful in the analysis of noisy data.
---
Homework 1:
In the first homework I developed a Spark program to analyze the dataset of an online retailer, which records a large number of customer transactions, where each transaction lists the products purchased by a customer.

---
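The exact queries of this homework are not reproduced here; as an illustrative sketch, the pure-Python snippet below mimics the MapReduce pattern such a Spark program relies on (the map phase emits key-value pairs, the reduce phase aggregates by key, like `flatMap` followed by `reduceByKey` on an RDD). The transaction fields and values are hypothetical, not the course's actual schema.

```python
from collections import defaultdict

# Hypothetical transaction lines: (TransactionID, ProductID, Quantity, CustomerID, Country).
# These field names are illustrative only.
rows = [
    ("T1", "P1", 2, "C1", "Italy"),
    ("T1", "P2", 1, "C1", "Italy"),
    ("T2", "P1", 5, "C2", "France"),
    ("T3", "P2", 3, "C1", "Italy"),
]

def map_phase(rows):
    # Map: emit (ProductID, 1) for every transaction line with positive quantity.
    for _, product, qty, _, _ in rows:
        if qty > 0:
            yield (product, 1)

def reduce_phase(pairs):
    # Reduce: sum the values of each key, like reduceByKey(lambda a, b: a + b) in Spark.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

# Number of transaction lines in which each product appears.
popularity = reduce_phase(map_phase(rows))
```

In Spark the same pattern would be distributed across partitions, with the reduce step performed first locally and then across executors.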
Homework 2:
In the second homework I implemented the 3-approximation sequential algorithm for the k-center with z outliers problem. This algorithm, proposed by Charikar et al., is simple to implement but has superlinear complexity.

---
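As a rough illustration of how the algorithm works, here is an unweighted pure-Python sketch of the Charikar et al. scheme: guess a radius r, greedily pick the point whose small ball covers the most uncovered points, discard everything in a larger ball around it, and double r if more than z points remain uncovered. The homework version also carries per-point weights and an epsilon parameter; this simplified sketch fixes eps and assumes distinct points.

```python
import math
from itertools import combinations

def seq_outliers(P, k, z, eps=0.0):
    """Sketch of the sequential 3-approximation for k-center with z outliers."""
    # Initial guess: half the minimum pairwise distance among the first k+z+1 points.
    first = P[:k + z + 1]
    r = min(math.dist(p, q) for p, q in combinations(first, 2)) / 2
    while True:
        Z = list(P)   # points not yet covered
        S = []        # chosen centers
        while len(S) < k and Z:
            # Pick the point whose ball of radius (1+2eps)r covers the most uncovered points.
            best_center, best_ball = None, []
            for x in P:
                ball = [y for y in Z if math.dist(x, y) <= (1 + 2 * eps) * r]
                if len(ball) > len(best_ball):
                    best_center, best_ball = x, ball
            S.append(best_center)
            # Remove every uncovered point within (3+4eps)r of the new center.
            Z = [y for y in Z if math.dist(best_center, y) > (3 + 4 * eps) * r]
        if len(Z) <= z:   # at most z outliers left uncovered: done
            return S
        r *= 2            # radius guess too small: double it and retry
```

Each attempt scans all points for every candidate center, which is where the superlinear cost mentioned above comes from.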
Homework 3:
In the third homework I implemented a 2-round MapReduce coreset-based algorithm for the k-center with z outliers problem, where the use of the inefficient 3-approximation algorithm is confined to a small coreset computed in parallel through the efficient Farthest-First Traversal.
This efficient implementation was run on a big dataset (about 1.2M points in 7 dimensions) on the CloudVeneto cluster.
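The two-round structure can be sketched in pure Python as follows: round 1 extracts k+z+1 coreset points from each partition with Farthest-First Traversal (in Spark this would be a `mapPartitions`), and round 2 runs a sequential algorithm on the small union of coresets only. In the actual homework round 2 runs the Charikar 3-approximation; here, to keep the sketch self-contained, plain Farthest-First Traversal stands in for it.

```python
import math

def farthest_first_traversal(points, k):
    # Gonzalez's heuristic: start from the first point, then repeatedly
    # add the point farthest from the current set of centers.
    centers = [points[0]]
    d = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        i = max(range(len(points)), key=lambda j: d[j])
        centers.append(points[i])
        for j, p in enumerate(points):
            d[j] = min(d[j], math.dist(p, points[i]))
    return centers

def two_round_coreset(points, k, z, n_partitions):
    # Round 1 (a mapPartitions in Spark): k+z+1 coreset points per partition.
    parts = [points[i::n_partitions] for i in range(n_partitions)]
    coreset = []
    for part in parts:
        coreset.extend(farthest_first_traversal(part, min(k + z + 1, len(part))))
    # Round 2 (sequential, on the driver): run the expensive algorithm on the
    # small coreset only. FFT stands in here for the 3-approximation.
    return farthest_first_traversal(coreset, k)
```

Because the expensive sequential step sees only `n_partitions * (k + z + 1)` points rather than the full dataset, the algorithm scales to inputs like the 1.2M-point dataset mentioned above.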
I also developed some programs that generate plots of the results of the different clustering algorithms.
Below are two examples of the results obtained on an Uber dataset.