LSDMLab

This lab processes a large amount of data and implements complex data analyses using Spark. The dataset, made available by Google, contains traces from a cluster of 12,500 machines and the activity on this cluster over 29 days.

1. Lab tasks

The following questions are answered in this lab; a minimal code sketch for the first one follows the list:

• What is the distribution of the machines according to their CPU capacity?

• What is the percentage of computational power lost due to maintenance (a machine went offline and reconnected later)?

• What is the distribution of the number of jobs/tasks per scheduling class?

• Do tasks with a low scheduling class have a higher probability of being evicted?

• In general, do tasks from the same job run on the same machine?

• Are the tasks that request the most resources also the ones that consume the most resources?

• Can we observe correlations between peaks of high resource consumption on some machines and task eviction events?

• Do tasks with a higher priority require more resources?

• What are the hardware specifications of the machines on which tasks of different priorities have/haven't run successfully?
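
As an illustration of the RDD-style analyses used in this lab, the sketch below answers the first question (distribution of machines by CPU capacity). It is a minimal sketch, not the lab's exact code: the field indices assume the Google cluster trace schema, in which a machine event record holds the machine ID in its second field and the normalized CPU capacity in its fifth field.

```python
from pyspark import SparkContext

sc = SparkContext(appName="cpu-capacity-distribution")

# Each record in Machine_events is one CSV line (no header).
machine_events = sc.textFile("Machine_events/*.csv")

cpu_distribution = (
    machine_events
    .map(lambda line: line.split(","))
    .filter(lambda f: len(f) > 4 and f[4] != "")  # drop records with no CPU value
    .map(lambda f: (f[1], f[4]))                  # (machine ID, CPU capacity)
    .distinct()                                   # count each machine once
    .map(lambda pair: (pair[1], 1))
    .reduceByKey(lambda a, b: a + b)              # machines per capacity value
    .collect()
)

for capacity, count in sorted(cpu_distribution):
    print(capacity, count)
```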

2. Extending the work

2.1 Comparison of different solutions

The solution is compared, from different points of view (performance, ease of use, etc.), with other technical approaches to processing the data; a sketch of both alternatives follows this list:

• Compare with the Spark DataFrame API.

• Compare the use of Spark to the use of a non-parallel Python data analysis library such as Pandas.
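
For concreteness, here is a hedged sketch of the same CPU-capacity query in the two alternatives. The column names are assumptions (the raw trace CSVs ship without headers), and the Pandas version illustrates the key trade-off: it must load all the data into a single process's memory.

```python
import glob

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("comparison").getOrCreate()

# Spark DataFrame version: column names are assumed, since the
# trace CSVs have no header row.
df = spark.read.csv("Machine_events/*.csv").toDF(
    "time", "machine_id", "event_type", "platform_id", "cpu", "memory")
df.select("machine_id", "cpu").distinct().groupBy("cpu").count().show()

# Pandas version: everything is read into one process's memory, which
# works for the small machine-events table but does not scale to the
# much larger task-usage tables.
pdf = pd.concat(
    pd.read_csv(path, header=None,
                names=["time", "machine_id", "event_type",
                       "platform_id", "cpu", "memory"])
    for path in glob.glob("Machine_events/*.csv"))
print(pdf.drop_duplicates(["machine_id", "cpu"]).groupby("cpu").size())
```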

2.2 Deploying in the cloud

Our Spark application is deployed in a real distributed environment (a GCP Dataproc cluster) to conduct a performance evaluation in this context.
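
A typical Dataproc deployment looks like the following; the cluster name, region, worker count, and entry-point script are placeholders, not the exact values used in this lab.

```sh
# Create a small Dataproc cluster (names and sizes are placeholders).
gcloud dataproc clusters create lsdm-cluster \
    --region=europe-west1 --num-workers=2

# Submit the PySpark application; in this setting the input data would
# typically live in a GCS bucket rather than on the local filesystem.
gcloud dataproc jobs submit pyspark app.py \
    --cluster=lsdm-cluster --region=europe-west1
```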

3. Instructions for running the code

  • Ensure Spark is installed and configured properly.
  • Ensure the working directory contains the expected folders (Job_events, Machine_events, Task_events, Task_usage), each containing the CSV files.
  • Run the script run-app.sh and enter the question to start the execution (see the example below).
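
The intended interaction is along these lines (the prompt wording is the script's own and is not reproduced here):

```sh
# Make the script executable, then run it and enter the question
# whose analysis should be executed.
chmod +x run-app.sh
./run-app.sh
```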
