The CEO of Rossmann wants to renovate all stores and has asked what the income of all the stores will be in the next 6 weeks. A regression model would be of great help.

m4theus4ndr4de/regression-drugstore-sales-prediction

Drugstore Sales Prediction

This is a fictional project for study purposes. The business context and the insights are not real. The dataset is from Rossmann, a large drugstore chain in Europe with many stores all over the continent. The dataset is available on Kaggle.

1. Description of the Business Problem

The CEO of Rossmann wants to invest in renovating all stores and asked the store managers to report what the income of each store will be in the next 6 weeks, so that he can decide how much to invest in each one. The managers in turn asked Rossmann's data department for a way to answer the CEO accurately. A regression model would therefore be of great help.

The tools that were created:

Machine Learning Regression Model: Using the dataset from Kaggle, a machine learning regression model was created to be used for future predictions.

The notebook used to create the model is available here.

Flask Prediction API: The model is hosted on the Render Cloud and is accessible through an API built with Flask. The API source code is available here.

Telegram Chat Bot: A chat bot on Telegram (a messaging app) is available so that the CEO can send the id number of a store and receive the sales prediction for the next 6 weeks. You can click here to try it out by sending a number of up to 4 digits. The bot source code is available here.
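
As a rough illustration of how the bot and the API could fit together, the sketch below validates a chat message as a store id of up to 4 digits and builds (without sending) a JSON POST request. The URL, the `/predict` path, the `store_id` payload field and both function names are hypothetical; the actual bot and API in the linked repositories may work differently.

```python
import json
import re
import urllib.request

# Hypothetical endpoint: the real API URL is not given in this README.
API_URL = "https://example.invalid/predict"


def parse_store_id(message_text):
    """Return the store id as an int if the message is 1 to 4 digits, else None.

    A leading "/" is tolerated, since Telegram commands start with it.
    """
    match = re.fullmatch(r"/?(\d{1,4})", message_text.strip())
    return int(match.group(1)) if match else None


def build_prediction_request(store_id, url=API_URL):
    """Build (but do not send) a JSON POST request for one store."""
    body = json.dumps({"store_id": store_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_prediction_request(parse_store_id("42"))
```

Rejecting malformed messages before calling the API keeps invalid ids from reaching the model.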

2. Dataset Attributes

Information about the attributes can be found here.

| Attribute | Description |
| --- | --- |
| id | an Id that represents a (Store, Date) pair within the test set |
| Store | a unique Id for each store |
| Sales | the turnover for any given day (this is what you are predicting) |
| Customers | the number of customers on a given day |
| Open | an indicator for whether the store was open: 0 = closed, 1 = open |
| StateHoliday | indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None |
| SchoolHoliday | indicates if the (Store, Date) was affected by the closure of public schools |
| StoreType | differentiates between 4 different store models: a, b, c, d |
| Assortment | describes an assortment level: a = basic, b = extra, c = extended |
| CompetitionDistance | distance in meters to the nearest competitor store |
| CompetitionOpenSince[Month/Year] | gives the approximate year and month of the time the nearest competitor was opened |
| Promo | indicates whether a store is running a promo on that day |
| Promo2 | Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating |
| Promo2Since[Year/Week] | describes the year and calendar week when the store started participating in Promo2 |
| PromoInterval | describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store |

3. Business Premises

The premises that were assumed for the development of the business problem solution are:

  • The model will be available on GitHub, so the model file has to be smaller than 50 MB.
  • The extra assortment level is considered greater than extended and basic.
  • Stores that have no competitor information in the dataset (CompetitionDistance and CompetitionOpenSince[Month/Year] not available) were interpreted as stores that have no competitors near them.
  • For stores with no competitors, NA values in the CompetitionDistance column were replaced with 200000, meaning that the nearest competitor is very far away.
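
The last premise can be sketched as follows. This is a minimal, dependency-free illustration on made-up rows; the function name and the row layout are assumptions, and with pandas the same idea is typically a `fillna(200000)` on the column.

```python
# Sentinel from the premise above: a missing distance means
# "the nearest competitor is very far away".
NO_COMPETITOR_DISTANCE = 200000


def fill_competition_distance(rows, fill_value=NO_COMPETITOR_DISTANCE):
    """Return copies of the rows with missing CompetitionDistance replaced."""
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        if new_row.get("CompetitionDistance") is None:
            new_row["CompetitionDistance"] = fill_value
        filled.append(new_row)
    return filled


# Toy rows for illustration only (not taken from the real dataset).
stores = [
    {"Store": 1, "CompetitionDistance": 1270.0},
    {"Store": 2, "CompetitionDistance": None},
]
cleaned = fill_competition_distance(stores)
```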

4. Solution Strategy

  1. Understand the Business problem.
  2. Download the dataset from Kaggle.
  3. Clean the dataset by removing outliers, NA values and unnecessary features.
  4. Explore the data to create hypotheses, think about a few insights and validate them.
  5. Prepare the data for the modeling algorithms by encoding variables, splitting it into train and test datasets and performing other necessary operations.
  6. Create the models using machine learning algorithms.
  7. Evaluate the created models to find the one that best fits the problem.
  8. Tune the model to achieve a better performance.
  9. Deploy the model in production so that it is available to the user.
  10. Find possible improvements to be explored in the future.
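
For step 5, one plausible way to split the data is to hold out the final 6 weeks, mirroring the forecast horizon. The README does not say how the split was actually done, so the time-based cutoff, the function name and the toy rows below are all assumptions.

```python
from datetime import date, timedelta


def split_last_weeks(rows, weeks=6):
    """Split rows into (train, test), where test covers the final `weeks` weeks."""
    last_day = max(row["Date"] for row in rows)
    cutoff = last_day - timedelta(weeks=weeks)
    train = [row for row in rows if row["Date"] <= cutoff]
    test = [row for row in rows if row["Date"] > cutoff]
    return train, test


# Toy data: one row per day for 10 weeks, made up for illustration.
start = date(2015, 1, 1)
rows = [{"Date": start + timedelta(days=i), "Sales": 100 + i} for i in range(70)]
train, test = split_last_weeks(rows)
```

Splitting by date rather than at random keeps future sales out of the training set, which matters when the goal is a forecast.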

5. The Insights

I1: Stores with greater assortments should sell more.

True: Stores with greater assortment sell more.

I2: Stores with closer competitors should sell less.

False: Stores with closer competitors sell almost the same average amount as the others.

I3: Stores that have a competitor for longer periods of time should sell more.

False: Stores with competitors for a longer time sell less.

I4: Stores with longer active promotions should sell more.

False: Stores with longer promotions stop selling more after some time.

I5: Stores with more consecutive promotions should sell more.

False: Stores with more consecutive promotions sell less.

I6: Stores open during the Christmas holiday should sell more.

False: Stores open during Christmas don't sell more.

I7: Stores should sell more over the years.

False: Stores sell less over the years.

I8: Stores should sell more in the second half of the year.

True: The stores sell more in the second half of the year.

I9: Stores should sell more after the 10th of each month.

True: Stores really sell more after the 10th day of each month.

I10: Stores should sell less on weekends.

True: Saturday and Sunday are the worst selling days.

I11: Stores should sell less during school holidays.

True: Stores sell less during school holidays, except in August.
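
Hypotheses like I10 can be checked by grouping sales and comparing averages. The sketch below computes mean sales per weekday on made-up rows; the function name and the figures are illustrative assumptions, not the analysis from the notebook.

```python
from collections import defaultdict
from datetime import date


def mean_sales_by_weekday(rows):
    """Return {weekday: mean sales}, with weekday 0 = Monday ... 6 = Sunday."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        day = row["Date"].weekday()
        totals[day] += row["Sales"]
        counts[day] += 1
    return {day: totals[day] / counts[day] for day in totals}


# Toy rows, made up for illustration (not real Rossmann figures).
rows = [
    {"Date": date(2015, 7, 27), "Sales": 6000},  # Monday
    {"Date": date(2015, 7, 20), "Sales": 5000},  # Monday
    {"Date": date(2015, 7, 25), "Sales": 3000},  # Saturday
]
by_day = mean_sales_by_weekday(rows)
```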

6. Machine Learning Modeling

The final result of this project is a regression model. In all, 5 models were created. One of them is a simple baseline that predicts the average sales and serves as a comparison for the machine learning models. The other models initially created were Linear Regression, Regularized Linear Regression (Lasso), Random Forest and XGBoost.
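
A baseline of this kind could look like the sketch below. It assumes the average is taken per store, which the README does not state; the function names and the toy rows are hypothetical.

```python
from collections import defaultdict


def fit_average_model(train_rows):
    """Learn the mean of Sales for each Store from the training rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in train_rows:
        totals[row["Store"]] += row["Sales"]
        counts[row["Store"]] += 1
    return {store: totals[store] / counts[store] for store in totals}


def predict_average(model, rows):
    """Predict each row's sales as the learned mean of its store."""
    return [model[row["Store"]] for row in rows]


# Toy rows, made up for illustration.
train = [
    {"Store": 1, "Sales": 100.0},
    {"Store": 1, "Sales": 300.0},
    {"Store": 2, "Sales": 50.0},
]
model = fit_average_model(train)
predictions = predict_average(model, [{"Store": 1}, {"Store": 2}])
```

Any machine learning model that cannot beat this baseline is not adding value over a simple average.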

The Boruta algorithm was used for feature selection, and 18 features were selected for the final model. The models were evaluated on three metrics: Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE) and Root Mean Squared Error (RMSE). The performance of the initial models is shown in the table below.

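| Model Name | MAE | MAPE | RMSE |
| --- | --- | --- | --- |
| Random Forest Regressor | 680.19 | 0.10 | 1008.96 |
| XGBoost Regressor | 874.26 | 0.13 | 1256.33 |
| Average Model | 1354.80 | 0.46 | 1835.14 |
| Linear Regression | 1867.09 | 0.29 | 2671.05 |
| Lasso | 2198.58 | 0.34 | 3110.51 |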
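
For reference, the three metrics can be computed as in the sketch below. MAPE is expressed as a fraction, matching the scale of the table, and the sketch assumes no true value is zero (days with zero sales would have to be excluded first). The sample values are made up.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean Absolute Percentage Error as a fraction; true values must be non-zero."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Toy values, made up for illustration.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]
scores = (mae(y_true, y_pred), mape(y_true, y_pred), rmse(y_true, y_pred))
```

RMSE penalizes large errors more heavily than MAE, which is why the two can rank models differently.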

7. Final Model

To decide which would be the final model, cross-validation was carried out to evaluate the performance of the algorithms in a more robust way. The cross-validated metrics are shown in the table below.

| Model Name | MAE | MAPE | RMSE |
| --- | --- | --- | --- |
| Random Forest Regressor | 837.52 +/- 216.76 | 0.12 +/- 0.02 | 1254.42 +/- 316.65 |
| XGBoost Regressor | 1069.47 +/- 139.48 | 0.15 +/- 0.02 | 1523.41 +/- 182.76 |
| Linear Regression | 2081.73 +/- 295.63 | 0.3 +/- 0.02 | 2952.52 +/- 468.37 |
| Lasso | 2388.68 +/- 398.48 | 0.34 +/- 0.01 | 3369.37 +/- 567.55 |
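
The README does not describe the cross-validation scheme. One common choice for sales forecasting is an expanding-window split, where each fold validates on a later 6-week block; the sketch below assumes that scheme, and the function names, the fold count and the use of the population standard deviation are all assumptions.

```python
import statistics
from datetime import date, timedelta


def time_series_folds(last_date, n_folds=5, weeks=6):
    """Yield (valid_start, valid_end) pairs, oldest fold first.

    For each fold, train on rows dated on or before valid_start and
    validate on rows with valid_start < Date <= valid_end.
    """
    window = timedelta(weeks=weeks)
    for k in range(n_folds, 0, -1):
        valid_start = last_date - k * window
        yield valid_start, valid_start + window


def summarize(scores):
    """Format fold scores as 'mean +/- std', the layout used in the table."""
    return f"{statistics.mean(scores):.2f} +/- {statistics.pstdev(scores):.2f}"


folds = list(time_series_folds(date(2015, 7, 31), n_folds=2))
summary = summarize([1.0, 3.0])
```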

The Random Forest model was the best among all the models created. However, XGBoost was chosen to be deployed because it tends to take up less disk space than Random Forest. After choosing which would be the final model, a random search hyperparameter optimization was used to improve the performance of the model. The final model evaluation metrics are in the table below.

| Model Name | MAE | MAPE | RMSE |
| --- | --- | --- | --- |
| XGBoost Regressor | 653.39 | 0.10 | 956.03 |
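
Random search, as mentioned above, samples parameter combinations at random instead of trying every one. The sketch below shows the idea with a stand-in scoring function; the search space values, `toy_error` and the function names are hypothetical, and in practice the scoring function would train the model and return a cross-validated error.

```python
import random


def random_search(param_space, evaluate, n_iter=10, seed=0):
    """Sample n_iter random combinations; return (best_params, best_score).

    Lower scores are treated as better, as with MAE or RMSE.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space: the names mirror common XGBoost parameters,
# but the values actually searched in the project are not stated.
space = {"n_estimators": [100, 300, 500], "max_depth": [3, 5, 9], "eta": [0.01, 0.03]}


def toy_error(params):
    """Stand-in for 'train the model and return its validation error'."""
    return abs(params["max_depth"] - 5) + params["eta"]


best_params, best_score = random_search(space, toy_error, n_iter=20)
```

Fixing the seed makes the search reproducible, which helps when comparing runs.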

8. Conclusion

XGBoost was chosen as the prediction model because it can be trained faster than a Random Forest model when a GPU is used. The model used in deployment was not the best-performing one, but it is considerably smaller than the others because it uses fewer estimators, and its error metrics are not far from those of the best model. A chat bot that answers with the predicted income for the next 6 weeks was also developed as a hands-on tool. The CEO can now easily get the predicted income of each store by simply sending a message to the chat bot.

9. Future Work

  • Develop more features for the bot.
  • Create an options menu for the Telegram bot.
  • Develop a model to determine the profit of the next day, month and year.
  • Improve model prediction capabilities by adding new features.
  • Search for stores with a high prediction error and find a way to improve their predictions.
  • Try other machine learning algorithms.
