This is a fictional project for study purposes. The company, business context and insights are not real. The dataset used in this project is from Kaggle and is available there.
House Rocket is a real estate company. It buys houses at a good price and sells them later. The company has a dataset with information about many houses available for purchase. The House Rocket data scientist should help the CEO by answering two questions and creating two tools for understanding the dataset.
Which houses should the House Rocket CEO buy and at what price? The source code can be found here and the dashboard is available here.
When is the best time to sell them and what would be the selling price? The source code can be found here.
An interactive dashboard in which it is possible to filter the data according to the CEO requirements and explore more about it. The dashboard was created using the Python package called Streamlit. It is available on Streamlit Cloud here.
A few insights (hypotheses) about the dataset, each evaluated as true or false.
Information about the attributes can be found here.
Attribute | Description |
---|---|
id | Unique ID for each home sold |
date | Date of the home sale |
price | Price of each home sold |
bedrooms | Number of bedrooms |
bathrooms | Number of bathrooms, where .5 accounts for a room with a toilet but no shower |
sqft_living | Square footage of the apartment's interior living space |
sqft_lot | Square footage of the land space |
floors | Number of floors |
waterfront | A dummy variable for whether the apartment was overlooking the waterfront or not |
view | An index from 0 to 4 of how good the view of the property was |
condition | An index from 1 to 5 on the condition of the apartment |
grade | An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design |
sqft_above | The square footage of the interior housing space that is above ground level |
sqft_basement | The square footage of the interior housing space that is below ground level |
yr_built | The year the house was initially built |
yr_renovated | The year of the house's last renovation |
zipcode | What zipcode area the house is in |
lat | Latitude of the house |
long | Longitude of the house |
sqft_living15 | The square footage of interior housing living space for the nearest 15 neighbors |
sqft_lot15 | The square footage of the land lots of the nearest 15 neighbors |
- The zipcode, condition and grade were the most important variables to decide which houses should be purchased or not. Only houses with condition greater than or equal to three and grade greater than or equal to seven were classified as houses to be purchased.
- The season was considered an important variable to find the best moment to sell the house.
- The median price was considered a better metric to evaluate if the house should be purchased because the mean value can vary considerably if a house in one region is priced much higher than other houses.
- The median price per zipcode was also considered to set the selling price. Houses with a price below the median have 30% profit and houses above the median have 10% profit.
- The price per square foot of living area was the variable analyzed to decide whether or not to buy a house.
- Values equal to zero in the column yr_renovated correspond to houses that were never renovated.
- The price column represents the value at which the house was advertised for sale.
- The date column represents the first day the house was for sale.
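The buying and pricing rules listed above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the sample rows are made up, the helper names (`median_by_zipcode`, `should_buy`, `selling_price`) are hypothetical, and two details are assumptions, namely that the purchase check compares price per square foot against the zipcode median, and that a house priced exactly at the median gets the 10% margin. The season is left out to keep the sketch short.

```python
from collections import defaultdict
from statistics import median

# Hypothetical sample rows; field names follow the attribute table.
houses = [
    {"id": 1, "zipcode": "98001", "price": 200_000, "sqft_living": 2000, "condition": 3, "grade": 7},
    {"id": 2, "zipcode": "98001", "price": 450_000, "sqft_living": 1500, "condition": 4, "grade": 8},
    {"id": 3, "zipcode": "98001", "price": 300_000, "sqft_living": 1500, "condition": 2, "grade": 7},
    {"id": 4, "zipcode": "98002", "price": 500_000, "sqft_living": 2500, "condition": 5, "grade": 9},
    {"id": 5, "zipcode": "98002", "price": 900_000, "sqft_living": 3000, "condition": 3, "grade": 6},
]


def price_per_sqft(house):
    """Price divided by the interior living area."""
    return house["price"] / house["sqft_living"]


def median_by_zipcode(rows, value):
    """Median of value(row) for each zipcode."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["zipcode"]].append(value(row))
    return {zipcode: median(values) for zipcode, values in groups.items()}


def should_buy(house, median_ppsf):
    """Condition >= 3, grade >= 7, and cheaper per sqft than the zipcode median (assumed comparison)."""
    return (
        house["condition"] >= 3
        and house["grade"] >= 7
        and price_per_sqft(house) < median_ppsf[house["zipcode"]]
    )


def selling_price(house, median_price):
    """Add 30% if the price is below the zipcode median, otherwise 10%."""
    margin = 0.30 if house["price"] < median_price[house["zipcode"]] else 0.10
    return house["price"] * (1 + margin)


median_ppsf = median_by_zipcode(houses, price_per_sqft)
median_price = median_by_zipcode(houses, lambda h: h["price"])

to_buy = [h["id"] for h in houses if should_buy(h, median_ppsf)]
print(to_buy)  # [1, 4]
```

Using the median rather than the mean keeps a single very expensive house from shifting the threshold for its whole zipcode, which is the motivation given in the list above.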
- Download the dataset from Kaggle.
- Understand the business problem.
- Clean, analyse and explore the dataset using data science packages in Python.
- Answer the main questions from the business problem.
- Develop dashboard for the CEO using Streamlit and deploy on the Streamlit Cloud.
- Create possible insights and analyse them.
I1: Houses that have some kind of river, lake or sea in front of them are at least 30% more expensive than the others that don't have water in front of them.
True: Houses that have some kind of river, lake or sea in front of them are 212.64% more expensive.
I2: Houses built before 1955 are 50% cheaper.
False: The prices of houses built before and after 1955 are almost the same.
I3: The average house price is 10% greater in the summer than in every other season.
False: The average house price in the spring is greater than in the summer.
I4: The average price increased by 10% from 2014 to 2015.
False: The mean price of the houses is almost the same in the two years considered.
I5: The difference between the lowest and highest monthly average price is greater than 10% of the maximum value.
False: The average price in April is a little less than 10% greater than the average price in February.
I6: Houses that were never renovated are at least 20% cheaper.
True: Houses that were never renovated are 30% cheaper than renovated ones.
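Each insight above comes down to comparing the average price of two groups. A minimal sketch of that comparison follows; the prices are invented for illustration, `percent_difference` is a hypothetical helper, and taking the second group's mean as the baseline is an assumption, since the text does not say how the percentages were computed.

```python
from statistics import mean


def percent_difference(group, baseline):
    """How much higher the mean of `group` is than the mean of `baseline`, in percent of the baseline."""
    baseline_mean = mean(baseline)
    return (mean(group) - baseline_mean) / baseline_mean * 100


# Hypothetical prices, split by the waterfront dummy variable.
waterfront = [900_000, 1_100_000]
no_waterfront = [400_000, 600_000]

difference = percent_difference(waterfront, no_waterfront)
# I1 claims "at least 30% more expensive".
hypothesis_holds = difference >= 30
print(difference, hypothesis_holds)  # 100.0 True
```

For a "cheaper" claim such as I6, the same function can be applied with the never-renovated houses (yr_renovated equal to zero) as `group` and the renovated ones as `baseline`; a negative result then means the group is cheaper.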
The proposed solution would result in an average profit of 100 K per house purchased and sold.
House Rocket would get a profit of 998 M if all the houses were bought requiring an investment of 5,134 M.
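As a quick check on the scale of these figures, the return on the total investment can be derived from the two totals above. The snippet below is plain arithmetic on the stated numbers; the variable names are illustrative only.

```python
# Totals stated above, in millions.
total_profit_m = 998
total_investment_m = 5134

# Profit as a fraction of the capital invested.
return_on_investment = total_profit_m / total_investment_m
print(f"{return_on_investment:.1%}")  # 19.4%
```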
The questions that motivated this project were answered. By analysing the dataset, it was possible to find out which houses should be bought based on their price, zipcode, condition and grade. The dashboard was created using Streamlit and deployed on Streamlit Cloud. The insights were generated based on the dataset from Kaggle.
- Improve Streamlit dashboard to add new features.
- Analyse the data to find out if houses in bad condition should be bought and renovated.
- Develop a machine learning model to predict whether a house with known attributes should be bought at a given price.
- Develop a machine learning model to predict an adequate selling price for a house that was already bought.