World-Sales-Analysis-Hadoop-HDFS-Zeppelin-Spark-SQL-Scala-Tableau-Power Bi

Overview:

The World Sales here is a small dataset used to demonstrate skills in both Big Data and Visualization platforms, languages and tools. The target of this project is to determine the most profitable Region and Country by different factors and methods.

Platforms, Languages and Tools:

Hadoop
HDFS
Zeppelin
Spark
Scala
SQL
Tableau
Power BI

Dataset:

worldsales.csv

Key variables:

Id	Region	Country	Item_Type	Sales_Channel	Order_Priority	Order_Date	Order_ID	Ship_Date	Units_Sold	Unit_Price	Unit_Cost	Total_Revenue	Total_Cost	Total_Profit

Conclusion:

Europe is the most profitable region. Belarus is the most profitable country worldwide.

Table of Content:

1. Load data

Load data into HDFS

Upload the file worldsales.csv to HDFS’s tmp folder

2. Create external table in Hive for analysis in Zeppelin

-- Hive

CREATE EXTERNAL TABLE IF NOT EXISTS worldsales (Id INT, Region STRING, Country STRING, Item_Type STRING, Sales_Channel STRING, Order_Priority STRING, Order_Date DATETIME, Order_ID INT, 
Ship_Date DATETIME, Units_Sold INT, Unit_Price INT, Unit_Cost INT, Total_Revenue DOUBLE, Total_Cost DOUBLE, Total_Profit DOUBLE)
COMMENT 'Data of the World Sales'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/tmp/worldsales'

-- Zeppelin -- Spark

Load data into Zeppelin

// Create a worldsales DataFrame from CSV file
%spark2
val worldsales = (spark.read 
.option("header", "true") --// Use first line as header
.option("inferSchema", "true") --// Infer schema
.csv("/tmp/worldsales.csv"))

3. Show the newly created dataframe

%spark2
worldsales.select("Id", "Region", "Country", "Item_Type", "Sales_Channel", "Order_Priority", "Order_Date", "Order_ID", "Ship_Date", "Units_Sold", "Unit_Price", "Unit_Cost", "Total_Revenue", "Total_Cost", "Total Profit").show()

4. Print the dataframe schema

%spark2
// Print the Schema in a tree format
worldsales.printSchema()

5. Filter the dataframe to show units sold > 8000 and unit cost > 500

Method 1:

%spark2
// Create a Dataset containing worldsales with units sold and unit price using “filter”
val worldsales_dataset = worldsales.select(("Id", "Region", "Country", "Item_Type", "Sales_Channel", "Order_Priority", "Order_Date", "Order_ID", "Ship_Date", "Units_Sold", "Unit_Price", "Unit_Cost", "Total_Revenue", "Total_Cost", "Total Profit")
					.filter($"Units_Sold" > 8000)
					.filter($"Unit_Cost" > 500)
worldsales_dataset.show()

Method 2:

worldsales.filter("Units_Sold" > 8000 && "Unit_Cost" > 500).show()

This led to only 2 Sub-Saharan countries: Senegal and Swaziland after the filteration. Senegal's Units Sold was 8,989 and Unit Cost was 502.54 while Swaziland's Units Sold was 9,915 and Unit Cost was 524.96. Both has the same Sales Channel as Offline.

6. Show the dataframe in group by “Region” and count

%spark2
worldsales.groupBy("region").count().show()

7. Create a separate dataframe with the group by results

%spark2
val worldsales_results = worldsales.groupBy("region").count()
worldsales_results.show()

There were the most activities in Sub-Saharan Africa and Europe. Meanwhile, North America and Australia and Oceania wasn't active in trading with Sub-Saharan Africa.

8. Save the new subset dataframe as a CSV file into HDFS

%spark2
worldsales_results.coalesce(1).write.format(“csv”).option(“header”, “true”).save(“/tmp/worldsales_results.csv”)

9. Create two views using the “createOrReplaceTempView” command

9.a. View on “Salesview” from the first dataframe

%spark2
worldsales.createOrReplaceTempView(“Salesview”)

9.b. View on “Regionview” from the second dataframe

%spark2
Worldsales_results.createOrReplaceTempView(“Regionview”)

-- SQL

10. Using SQL select all from “Regionview” view and show in a line graph

%spark2.sql
SELECT * FROM Regionview

The Line chart illustrates the dynamic Sales & Trading activities between Europe and Sub-Saharan Africa. But there was no energetic performance between North America, Australia & Oceania and Sub-Sharan Africa.

11. Using SQL select from the “Salesview” view – the region and sum of units sold and group by region and display in a data grid view

%spark2.sql
SELECT region, SUM(Units_Sold) AS Sum_Units_Sold
FROM Salesview
GROUP BY region

There was a positive correlation of Sum Units Sold between Europe and Sub-Saharan Africa. These two regions and continents had the highest Sum Units Sold. In contrast, North America, Australia & Oceania had the lowest Sum Units Sold. Other regions and continents played moderately around the average Sum Units Sold.

12. Using SQL select from the “Salesview” view – the region and sum of total_profit and group by region and display in a Bar chart

%spark2.sql
SELECT region, SUM(Total_Profit)
FROM Salesview
GROUP BY region

Europe and Sub-Saharan Africa certainly dominated Sum of Total Profit, ranking 2nd and 1st, respectively. North America, Australia & Oceania in the other hand gained the lowest Sum of Total Profit.

13. From the “Salesview” view, show the Total Profit as Profit, the Total Revenue as Revenue and the Total Cost as Cost from “Salesview” group by Region – The client wants to see this data in a Line chart in order to see the correlation between Cost, Revenue, Profit between Regions.

%spark2.sql
SELECT region, SUM(Total_Profit) AS Profit, SUM(total_revenue) AS Revenue, SUM(total_cost) AS Cost
FROM Salesview
GROUP BY region

The correlations between these fields between Regions were the same. Europe and Sub-Saharan Africa gained the highest figures in all 3 fields while North America, Australia & Oceania's fields were significantly low.

14. The customer is planning to open up a new store and searching for the best location for it, they need to see the Average Profit in each Region as a percentage (Pie chart) compared to other Regions

Now I will use both views created to plot the Pie chart and also point out the region where it is most profitable.

%spark2.sql
SELECT a.Region, AVG(Total_Profit) 
FROM Salesview b , Regionview a
WHERE a.Region = b.Region
GROUP BY a.Region

The Pie chart demonstrates that Europe and Sub-Saharan Africa took half of the worldwide Total Profit. Europe's Average Profit at 27% is the highest among all continents. Thefore, it is proven that Europe would be the most profitable Region.

More specifically, Belarus is the most profitable country by Average Profit.

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
Big Data		Big Data
Dataset		Dataset
Tableau		Tableau
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

World-Sales-Analysis-Hadoop-HDFS-Zeppelin-Spark-SQL-Scala-Tableau-Power Bi

Overview:

Platforms, Languages and Tools:

Dataset:

Key variables:

Conclusion:

Table of Content:

1. Load data

Load data into HDFS

2. Create external table in Hive for analysis in Zeppelin

Load data into Zeppelin

3. Show the newly created dataframe

4. Print the dataframe schema

5. Filter the dataframe to show units sold > 8000 and unit cost > 500

Method 1:

Method 2:

6. Show the dataframe in group by “Region” and count

7. Create a separate dataframe with the group by results

8. Save the new subset dataframe as a CSV file into HDFS

9. Create two views using the “createOrReplaceTempView” command

9.a. View on “Salesview” from the first dataframe

9.b. View on “Regionview” from the second dataframe

-- SQL

10. Using SQL select all from “Regionview” view and show in a line graph

11. Using SQL select from the “Salesview” view – the region and sum of units sold and group by region and display in a data grid view

12. Using SQL select from the “Salesview” view – the region and sum of total_profit and group by region and display in a Bar chart

13. From the “Salesview” view, show the Total Profit as Profit, the Total Revenue as Revenue and the Total Cost as Cost from “Salesview” group by Region – The client wants to see this data in a Line chart in order to see the correlation between Cost, Revenue, Profit between Regions.

14. The customer is planning to open up a new store and searching for the best location for it, they need to see the Average Profit in each Region as a percentage (Pie chart) compared to other Regions

Dashboard - Sales Performance by Region

Dashboard - Maps of Cost, Revenue, Profit

About

Releases

Packages

daiphuongngo/World-Sales-Analysis-Hadoop-Hive-HDFS-Zeppelin-Spark-SQL-Scala-Tableau-PowerBI

Folders and files

Latest commit

History

Repository files navigation

World-Sales-Analysis-Hadoop-HDFS-Zeppelin-Spark-SQL-Scala-Tableau-Power Bi

Overview:

Platforms, Languages and Tools:

Dataset:

Key variables:

Conclusion:

Table of Content:

1. Load data

Load data into HDFS

2. Create external table in Hive for analysis in Zeppelin

Load data into Zeppelin

3. Show the newly created dataframe

4. Print the dataframe schema

5. Filter the dataframe to show units sold > 8000 and unit cost > 500

Method 1:

Method 2:

6. Show the dataframe in group by “Region” and count

7. Create a separate dataframe with the group by results

8. Save the new subset dataframe as a CSV file into HDFS

9. Create two views using the “createOrReplaceTempView” command

9.a. View on “Salesview” from the first dataframe

9.b. View on “Regionview” from the second dataframe

-- SQL

10. Using SQL select all from “Regionview” view and show in a line graph

11. Using SQL select from the “Salesview” view – the region and sum of units sold and group by region and display in a data grid view

12. Using SQL select from the “Salesview” view – the region and sum of total_profit and group by region and display in a Bar chart

13. From the “Salesview” view, show the Total Profit as Profit, the Total Revenue as Revenue and the Total Cost as Cost from “Salesview” group by Region – The client wants to see this data in a Line chart in order to see the correlation between Cost, Revenue, Profit between Regions.

14. The customer is planning to open up a new store and searching for the best location for it, they need to see the Average Profit in each Region as a percentage (Pie chart) compared to other Regions

Dashboard - Sales Performance by Region

Dashboard - Maps of Cost, Revenue, Profit

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages