Skip to content

daiphuongngo/World-Sales-Analysis-Hadoop-Hive-HDFS-Zeppelin-Spark-SQL-Scala-Tableau-PowerBI

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

29 Commits
 
 
 
 
 
 
 
 

Repository files navigation

World-Sales-Analysis-Hadoop-HDFS-Zeppelin-Spark-SQL-Scala-Tableau-Power Bi

Overview:

The World Sales here is a small dataset used to demonstrate skills in both Big Data and Visualization platforms, languages and tools. The target of this project is to determine the most profitable Region and Country by different factors and methods.

Platforms, Languages and Tools:

  • Hadoop

  • HDFS

  • Zeppelin

  • Spark

  • Scala

  • SQL

  • Tableau

  • Power BI

Dataset:

worldsales.csv

Key variables:

Id Region Country Item_Type Sales_Channel Order_Priority Order_Date Order_ID Ship_Date Units_Sold Unit_Price Unit_Cost Total_Revenue Total_Cost Total_Profit

Conclusion:

Europe is the most profitable region. Belarus is the most profitable country worldwide.

Table of Content:

1. Load data

Load data into HDFS

Upload the file worldsales.csv to HDFS’s tmp folder

Screenshot 2021-06-23 214558

2. Create external table in Hive for analysis in Zeppelin

-- Hive

CREATE EXTERNAL TABLE IF NOT EXISTS worldsales (Id INT, Region STRING, Country STRING, Item_Type STRING, Sales_Channel STRING, Order_Priority STRING, Order_Date DATETIME, Order_ID INT, 
Ship_Date DATETIME, Units_Sold INT, Unit_Price INT, Unit_Cost INT, Total_Revenue DOUBLE, Total_Cost DOUBLE, Total_Profit DOUBLE)
COMMENT 'Data of the World Sales'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/tmp/worldsales'

-- Zeppelin -- Spark

Load data into Zeppelin

// Create a worldsales DataFrame from CSV file
%spark2
val worldsales = (spark.read 
.option("header", "true") --// Use first line as header
.option("inferSchema", "true") --// Infer schema
.csv("/tmp/worldsales.csv"))

Screenshot 2021-06-23 214558

3. Show the newly created dataframe

%spark2
worldsales.select("Id", "Region", "Country", "Item_Type", "Sales_Channel", "Order_Priority", "Order_Date", "Order_ID", "Ship_Date", "Units_Sold", "Unit_Price", "Unit_Cost", "Total_Revenue", "Total_Cost", "Total Profit").show()

Screenshot 2021-06-23 220234

4. Print the dataframe schema

%spark2
// Print the Schema in a tree format
worldsales.printSchema()

Screenshot 2021-06-23 220419

5. Filter the dataframe to show units sold > 8000 and unit cost > 500

Method 1:

%spark2
// Create a Dataset containing worldsales with units sold and unit price using “filter”
val worldsales_dataset = worldsales.select(("Id", "Region", "Country", "Item_Type", "Sales_Channel", "Order_Priority", "Order_Date", "Order_ID", "Ship_Date", "Units_Sold", "Unit_Price", "Unit_Cost", "Total_Revenue", "Total_Cost", "Total Profit")
					.filter($"Units_Sold" > 8000)
					.filter($"Unit_Cost" > 500)
worldsales_dataset.show()

Method 2:

worldsales.filter("Units_Sold" > 8000 && "Unit_Cost" > 500).show()

Screenshot 2021-06-23 221000

Filter

Regions with Unit Sold greater than 8,000 and Unit Cost greater than 500

This led to only 2 Sub-Saharan countries: Senegal and Swaziland after the filteration. Senegal's Units Sold was 8,989 and Unit Cost was 502.54 while Swaziland's Units Sold was 9,915 and Unit Cost was 524.96. Both has the same Sales Channel as Offline.

6. Show the dataframe in group by “Region” and count

%spark2
worldsales.groupBy("region").count().show()

Q5

Count by Region

Count by Region

7. Create a separate dataframe with the group by results

%spark2
val worldsales_results = worldsales.groupBy("region").count()
worldsales_results.show()

Q6

There were the most activities in Sub-Saharan Africa and Europe. Meanwhile, North America and Australia and Oceania wasn't active in trading with Sub-Saharan Africa.

8. Save the new subset dataframe as a CSV file into HDFS

%spark2
worldsales_results.coalesce(1).write.format(“csv”).option(“header”, “true”).save(“/tmp/worldsales_results.csv”)

Q7 results

9. Create two views using the “createOrReplaceTempView” command

9.a. View on “Salesview” from the first dataframe

%spark2
worldsales.createOrReplaceTempView(“Salesview”)

Q8 1

9.b. View on “Regionview” from the second dataframe

%spark2
Worldsales_results.createOrReplaceTempView(“Regionview”)

Q8 2

-- SQL

10. Using SQL select all from “Regionview” view and show in a line graph

%spark2.sql
SELECT * FROM Regionview

Q9

Region Line

Region Line

The Line chart illustrates the dynamic Sales & Trading activities between Europe and Sub-Saharan Africa. But there was no energetic performance between North America, Australia & Oceania and Sub-Sharan Africa.

11. Using SQL select from the “Salesview” view – the region and sum of units sold and group by region and display in a data grid view

%spark2.sql
SELECT region, SUM(Units_Sold) AS Sum_Units_Sold
FROM Salesview
GROUP BY region

Q10

Region vs Sum Units Sold

Region vs Sum Units Sold

There was a positive correlation of Sum Units Sold between Europe and Sub-Saharan Africa. These two regions and continents had the highest Sum Units Sold. In contrast, North America, Australia & Oceania had the lowest Sum Units Sold. Other regions and continents played moderately around the average Sum Units Sold.

12. Using SQL select from the “Salesview” view – the region and sum of total_profit and group by region and display in a Bar chart

%spark2.sql
SELECT region, SUM(Total_Profit)
FROM Salesview
GROUP BY region

Q11

Europe and Sub-Saharan Africa certainly dominated Sum of Total Profit, ranking 2nd and 1st, respectively. North America, Australia & Oceania in the other hand gained the lowest Sum of Total Profit.

13. From the “Salesview” view, show the Total Profit as Profit, the Total Revenue as Revenue and the Total Cost as Cost from “Salesview” group by Region – The client wants to see this data in a Line chart in order to see the correlation between Cost, Revenue, Profit between Regions.

%spark2.sql
SELECT region, SUM(Total_Profit) AS Profit, SUM(total_revenue) AS Revenue, SUM(total_cost) AS Cost
FROM Salesview
GROUP BY region

Q12 have to drag

Q12

Cost, Revenue, Profit between Regions

Cost, Revenue, Profit between Regions

The correlations between these fields between Regions were the same. Europe and Sub-Saharan Africa gained the highest figures in all 3 fields while North America, Australia & Oceania's fields were significantly low.

Sum Cost between Regions on Map

Sum Cost between Country on Map

Sum Profit between Regions on Map

Total Profit by Country

Sum Revenue between Regions on Map

Total Revenue by Country on Map

Average Profit by Region on Map

Average Profit by Country

14. The customer is planning to open up a new store and searching for the best location for it, they need to see the Average Profit in each Region as a percentage (Pie chart) compared to other Regions

Now I will use both views created to plot the Pie chart and also point out the region where it is most profitable.

%spark2.sql
SELECT a.Region, AVG(Total_Profit) 
FROM Salesview b , Regionview a
WHERE a.Region = b.Region
GROUP BY a.Region

Q13 - Orren's solution on pie chart

Average Profit by Region

The Pie chart demonstrates that Europe and Sub-Saharan Africa took half of the worldwide Total Profit. Europe's Average Profit at 27% is the highest among all continents. Thefore, it is proven that Europe would be the most profitable Region.

More specifically, Belarus is the most profitable country by Average Profit.

Dashboard - Sales Performance by Region

Dashboard - Sales Performance by Region

Dashboard - Maps of Cost, Revenue, Profit

Dashboard - Maps of Cost, Revenue, Profit

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published