Cloud and big data: Programming Assignments

These are the programming assignments for the course Cloud and Big Data at the Complutense University of Madrid, academic year 2017/2018. The statement of each assignment follows:

MapReduce

P11 -> Develop a distributed version of the grep tool to search for words in very large documents. The output should be the numbers of the lines that match a given pattern.
P12 -> Develop a MapReduce job to find the frequency of each URL in a web server log. The output should be the URLs and their frequency.
P13 -> Write a MapReduce job to calculate the average daily stock price at close of Alphabet Inc. (GOOG) per year since 2009. The output should be the year and the average price.
P14 -> Develop a MapReduce job to show movies with an average rating in the ranges:
- (1) 1 or lower
- (2) 2 or lower (but higher than 1)
- (3) 3 or lower (but higher than 2)
- (4) 4 or lower (but higher than 3)
- (5) 5 or lower (but higher than 4)
The job should have two MapReduce phases. The output of the first phase should be the movies and their average rating. The output of the second phase should be the ranges and the titles of the movies in each range.
P15 -> Pig and Hive are higher-level abstractions of MapReduce. They provide an interface that has nothing to do with “map” or “reduce,” but the systems interpret the higher-level language into a series of MapReduce jobs. Much like how a query planner in an RDBMS translates SQL into actual operations on data, Hive and Pig translate their respective languages into MapReduce operations. Discuss in detail how these two tools could be used to simplify the development of the “Movie Rating Data” exercise.
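As an illustration of the two-phase structure that P14 asks for, the following is a minimal in-memory sketch, not the code in this repository. It assumes a hypothetical input layout of one `title<TAB>rating` pair per line, and the function names (`shuffle`, `phase1_map`, `rating_range`, and so on) are made up for the example. The `shuffle` helper only imitates the sort-and-group step that Hadoop performs between map and reduce.

```python
import math
from itertools import groupby
from operator import itemgetter


def shuffle(pairs):
    """Imitate Hadoop's sort-and-group step between map and reduce."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def phase1_map(lines):
    # Assumed (hypothetical) input layout: "title<TAB>rating" per line.
    for line in lines:
        title, rating = line.rstrip("\n").split("\t")
        yield title, float(rating)


def phase1_reduce(grouped):
    # Phase 1 output: each movie with its average rating.
    for title, ratings in grouped:
        yield title, sum(ratings) / len(ratings)


def rating_range(avg):
    # Range n covers (n - 1, n]; averages of 1 or lower fall in range 1.
    return max(1, math.ceil(avg))


def phase2_map(averages):
    for title, avg in averages:
        yield rating_range(avg), title


def phase2_reduce(grouped):
    # Phase 2 output: each range with the titles that fall in it.
    for rng, titles in grouped:
        yield rng, sorted(titles)


def run(lines):
    averages = phase1_reduce(shuffle(phase1_map(lines)))
    return dict(phase2_reduce(shuffle(phase2_map(averages))))
```

With Hadoop Streaming, each map and reduce function would typically live in its own script reading from standard input, and the output of the first job would be the input of the second.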

Spark

P21 -> Develop a Spark version of the grep tool to search for words in very large documents. The output should be the numbers of the lines that match a given pattern.
P22 -> Develop a Spark script to find the frequency of each URL in a web server log. The output should be the URLs and their frequency.
P23 -> Write a Spark script to calculate the average daily stock price at close of Alphabet Inc. (GOOG) per year since 2009. The output should be the year and the average price.
P24 -> Develop a Spark script to show movies with an average rating in the ranges:
- (1) 1 or lower
- (2) 2 or lower (but higher than 1)
- (3) 3 or lower (but higher than 2)
- (4) 4 or lower (but higher than 3)
- (5) 5 or lower (but higher than 4)
The job should have two MapReduce phases. The output of the first phase should be the movies and their average rating. The output of the second phase should be ranges and the title of the movies.
P25 -> Develop a Spark script to answer the following questions:
- What is the average rating given by each user?
- What is the overall average rating?
- What is the average rating of each movie?
- What is the average rating of each genre?
- Which are the top 10? Which are the top 10 each month?
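As one illustration of the Spark exercises above, the per-year average of P23 could be sketched as follows. This is a sketch under assumptions, not the script in this repository: it supposes a CSV file whose first column is a date in `YYYY-MM-DD` form and whose fifth column is the closing price, and the names `parse_close`, `add_pairs` and `average_close_per_year` are hypothetical. The `sc` argument is expected to be an existing `SparkContext`.

```python
def parse_close(line):
    """Return (year, (close, 1)), or None for the header or a malformed row."""
    fields = line.split(",")
    try:
        # Assumed layout: column 0 is "YYYY-MM-DD", column 4 is the close.
        return int(fields[0][:4]), (float(fields[4]), 1)
    except (IndexError, ValueError):
        return None


def add_pairs(a, b):
    """Add two (sum, count) pairs element-wise."""
    return a[0] + b[0], a[1] + b[1]


def average_close_per_year(sc, path, first_year=2009):
    """Return a sorted list of (year, average close) from first_year onwards."""
    return (
        sc.textFile(path)
        .map(parse_close)
        .filter(lambda kv: kv is not None and kv[0] >= first_year)
        .reduceByKey(add_pairs)            # sum and count per year
        .mapValues(lambda s: s[0] / s[1])  # average per year
        .sortByKey()
        .collect()
    )
```

Carrying a `(sum, count)` pair through `reduceByKey` avoids collecting every price of a year on a single node, which a `groupByKey` followed by an average would do.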

P3. Meteorite Landing

NASA’s Open Data Portal hosts a comprehensive data set from The Meteoritical Society that contains information on all of the known meteorite landings. The table consists of 34,513 meteorites and includes fields like the type of meteorite, the mass and the year: https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh. Write a MapReduce job and a Spark script to calculate the average mass per year of a type of meteorite specified as an argument.
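The map and reduce logic for this exercise can be sketched in plain Python as below. This is a minimal in-memory sketch rather than the code in this repository. It assumes the CSV export has a header row with columns named `recclass` (the type), `mass (g)` and `year`; those column names, and the helper names `map_rows`, `reduce_pairs` and `average_mass_per_year`, are assumptions made for the example. The year is taken as the first run of four digits in the `year` field, so the exact date format does not matter.

```python
import csv
import re
from collections import defaultdict

YEAR = re.compile(r"\d{4}")


def map_rows(rows, wanted_type):
    """Emit (year, mass) for rows of the requested meteorite type."""
    for row in rows:
        year = YEAR.search(row.get("year") or "")
        mass = row.get("mass (g)") or ""
        if row.get("recclass") != wanted_type or not year or not mass:
            continue  # wrong type, or year/mass missing
        try:
            yield int(year.group()), float(mass)
        except ValueError:
            continue  # mass is not a number


def reduce_pairs(pairs):
    """Average the masses per year."""
    totals = defaultdict(lambda: [0.0, 0])
    for year, mass in pairs:
        totals[year][0] += mass
        totals[year][1] += 1
    return {year: total / count for year, (total, count) in sorted(totals.items())}


def average_mass_per_year(csv_lines, wanted_type):
    # csv.DictReader handles quoted fields that contain commas.
    return reduce_pairs(map_rows(csv.DictReader(csv_lines), wanted_type))
```

In a Hadoop Streaming job the type would typically reach the mapper as a command-line argument, and in Spark the same `map_rows` logic could feed a `reduceByKey` over `(sum, count)` pairs.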

P4. Distributed Markov Shakespeare

In this exercise we will explore function composition, narrow and wide dependencies, and stages in the DAG by applying them to a slightly involved distributed computation, to gain further insight into this programming approach. The task is to build a simple statistical language model for the writing style of Shakespeare. Your model should be a simple Markov chain of order 2. You will use that model to generate novel sentences based on Shakespeare’s original texts.
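To make the model concrete, here is a minimal single-machine sketch of an order-2 Markov chain over words. It is not the Spark script in this repository, and the function names (`trigrams`, `build_model`, `generate`) are hypothetical. Each state is a pair of consecutive words, and the model counts which word follows each pair.

```python
import random
from collections import Counter, defaultdict


def trigrams(sentence):
    """Return ((w1, w2), w3) for every three consecutive words."""
    words = sentence.split()
    return [((a, b), c) for a, b, c in zip(words, words[1:], words[2:])]


def build_model(sentences):
    """Map each two-word state to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        for state, following in trigrams(sentence):
            model[state][following] += 1
    return model


def generate(model, start, max_words=20, rng=random):
    """Extend the two-word start by sampling followers in proportion to their counts."""
    words = list(start)
    while len(words) < max_words:
        followers = model.get(tuple(words[-2:]))
        if not followers:
            break  # the current state was never seen with a follower
        choices, weights = zip(*followers.items())
        words.append(rng.choices(choices, weights=weights)[0])
    return " ".join(words)
```

In a Spark version, extracting the trigrams would be a narrow transformation such as `flatMap`, while counting followers per state needs a shuffle, for example with `reduceByKey`, which is a wide dependency and starts a new stage.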

al3xhh/CloudAndBigDataProgrammingAssigments
