
The business team asked the data scientists to select the most valuable customers for the company. Recency, frequency and monetary aspects were considered by the business team as the main characteristics to evaluate the customers in clusters.


m4theus4ndr4de/clustering-loyalty-program-creation


Loyalty Program Creation

This is a fictional project for studying purposes. The business context and the insights are not real. The dataset is available on Kaggle.

1. Description of the Business Problem

The business leaders of an e-commerce company concluded that a good strategy to leverage sales is to create a loyalty program for their customers. So, the business team asked the data scientists to select the most valuable customers for the company. The business team considered recency, frequency and monetary value the main characteristics for grouping the customers into clusters.

The tools that were created:

Machine Learning Clustering Model: Using the dataset from Kaggle, a clustering model was created to group the existing customers and to identify the cluster of future customers.

The notebook used to create the model is available here.

2. Dataset Attributes

| Attribute | Description |
| --- | --- |
| InvoiceNo | Number of the purchase invoice. |
| StockCode | Code of the stock the object comes from. |
| Description | Description of the item purchased. |
| Quantity | Quantity of the item purchased. |
| InvoiceDate | Date of the invoice. |
| UnitPrice | Price of one item of the object purchased. |
| CustomerID | Identification number of the client responsible for the purchase. |
| Country | The country the purchase comes from. |
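The recency, frequency and monetary features named in the problem description can be derived from these columns. The block below is a minimal sketch with made-up rows, not the notebook's code. The function name `rfm_features`, the output column names, the choice of the last invoice date as the reference date, and the definition of frequency as the count of distinct invoices are all assumptions; the notebook may define these features differently.

```python
import pandas as pd

# Made-up sample rows that use the column names from the attribute table.
invoices = pd.DataFrame({
    "InvoiceNo": ["1001", "1001", "1002", "1003"],
    "Quantity": [2, 1, 5, 3],
    "InvoiceDate": pd.to_datetime(
        ["2017-01-05", "2017-01-05", "2017-02-10", "2017-01-20"]
    ),
    "UnitPrice": [10.0, 4.0, 2.0, 1.5],
    "CustomerID": [1, 1, 1, 2],
})


def rfm_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per customer with recency, frequency and monetary value."""
    df = df.assign(Revenue=df["Quantity"] * df["UnitPrice"])
    # Assumption: recency is measured against the most recent invoice date.
    reference_date = df["InvoiceDate"].max()
    grouped = df.groupby("CustomerID")
    return pd.DataFrame({
        "recency_days": (reference_date - grouped["InvoiceDate"].max()).dt.days,
        # Assumption: frequency is the number of distinct invoices.
        "frequency": grouped["InvoiceNo"].nunique(),
        "monetary": grouped["Revenue"].sum(),
    })


rfm = rfm_features(invoices)
print(rfm)
```

Each row of `rfm` describes one customer, which is the shape a clustering algorithm expects as input.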

3. Business Premises

The premises that were assumed for the development of the business problem solution are:

  • Stock codes with letters, like POST, D, PADS, were discarded because it is not possible to know exactly what they mean.
  • Unit prices lower than 0.04 were not considered because they seem to be wrong.
  • Customers who return almost every purchase they make were not considered.
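The first two premises can be expressed as row filters. This is a rough sketch on made-up rows, not the notebook's code. The function name `apply_premises` is hypothetical, and treating "stock codes with letters" as codes made only of letters is an assumption. The third premise is left out because the source does not say how "almost every purchase" was measured.

```python
import pandas as pd

# Made-up rows; only the columns needed by the premises are included.
raw = pd.DataFrame({
    "StockCode": ["85123A", "POST", "22423", "D", "71053"],
    "UnitPrice": [2.55, 18.0, 0.01, 5.0, 3.39],
})


def apply_premises(df: pd.DataFrame, min_unit_price: float = 0.04) -> pd.DataFrame:
    """Drop rows that violate the stock code and unit price premises."""
    # Assumption: a code such as POST or D consists of letters only.
    only_letters = df["StockCode"].str.match(r"^[A-Za-z]+$", na=False)
    # Unit prices lower than the threshold are treated as data errors.
    valid_price = df["UnitPrice"] >= min_unit_price
    return df[~only_letters & valid_price]


clean = apply_premises(raw)
print(clean)
```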

4. Solution Strategy

  1. Understand the business problem.
  2. Download the dataset.
  3. Clean the dataset by removing outliers, NA values and unnecessary features.
  4. Prepare the data for the modeling algorithms by encoding variables, splitting the train and test datasets and performing other necessary operations.
  5. Create the models using machine learning algorithms.
  6. Evaluate the created models to find the one that best fits the problem.
  7. Tune the model to achieve better performance.
  8. Explore the data to create hypotheses, think about a few insights and validate them.
  9. Deploy the model in production so that it is available to the user.
  10. Find possible improvements to be explored in the future.

5. The Insights

I1: The customers of the loyalty program have a purchase volume (products) above 10% of the total purchases.

True: The loyalty program cluster has 34% of the total products purchased.

I2: The customers of the loyalty program have a volume (revenue) of purchases above 10% of the total purchases.

True: The loyalty program cluster has 46% of the total revenue.

I3: Loyalty program customers have a lower number of returns than the average of the other customers.

False: The loyalty program cluster has an average quantity of returns above the average of the other customers.

I4: The median billing by loyalty program customers is 10% higher than the median billing overall.

True: The median billing of the loyalty program cluster is 215% above the overall median.

I5: Loyalty program customers are in the third quantile.

False: They are mostly in the first quantile.
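Insights such as I1, I2 and I4 reduce to simple aggregations over a customer table with a cluster label. The sketch below uses made-up numbers; the column names and the helper functions `cluster_share` and `median_uplift` are hypothetical and only illustrate the kind of check involved.

```python
import pandas as pd

# Made-up customer table; cluster 0 plays the role of the loyalty program cluster.
customers = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 2],
    "gross_revenue": [400.0, 200.0, 150.0, 150.0, 100.0],
    "products": [30, 20, 20, 20, 10],
})


def cluster_share(df: pd.DataFrame, column: str, cluster: int) -> float:
    """Percentage of the column total that comes from one cluster."""
    in_cluster = df["cluster"] == cluster
    return 100 * df.loc[in_cluster, column].sum() / df[column].sum()


def median_uplift(df: pd.DataFrame, column: str, cluster: int) -> float:
    """How far, in percent, the cluster median sits above the overall median."""
    in_cluster = df["cluster"] == cluster
    return 100 * (df.loc[in_cluster, column].median() / df[column].median() - 1)


revenue_share = cluster_share(customers, "gross_revenue", 0)
products_share = cluster_share(customers, "products", 0)
uplift = median_uplift(customers, "gross_revenue", 0)
print(revenue_share, products_share, uplift)
```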

6. Machine Learning Modeling

The final result of this project is a clustering model. Dimensionality reduction algorithms such as PCA (Principal Component Analysis), UMAP (Uniform Manifold Approximation and Projection) and t-SNE (t-Distributed Stochastic Neighbor Embedding) were used to create embedding spaces as alternatives to the original feature space. Three types of clustering models were created: k-Means, GMM (Gaussian Mixture Model) and HC (Hierarchical Clustering). The table below presents some of the models created, the space used to create each model, the number of clusters and the silhouette score.

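| Model Name | Space Creation | Number of Clusters | Silhouette Score |
| --- | --- | --- | --- |
| k-Means | Features | 2 | 0.69 |
| GMM | Features | 2 | -0.01 |
| HC | Features | 2 | 0.65 |
| k-Means | UMAP | 15 | 0.56 |
| GMM | UMAP | 14 | 0.47 |
| HC | UMAP | 15 | 0.54 |
| k-Means | t-SNE | 13 | 0.45 |
| GMM | t-SNE | 13 | 0.36 |
| HC | t-SNE | 12 | 0.42 |
| k-Means | Tree Embedding Space | 15 | 0.48 |
| GMM | Tree Embedding Space | 2 | 0.43 |
| HC | Tree Embedding Space | 15 | 0.48 |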
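A comparison like the one in this table can be sketched with scikit-learn by fitting each model type for several cluster counts and recording the silhouette score. The data below is synthetic, and the helper `silhouette_by_model`, the hyperparameters and the range of cluster counts are assumptions rather than the settings used in the notebook.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the customer feature space or an embedding of it.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)


def silhouette_by_model(X, n_clusters, seed=0):
    """Fit k-Means, GMM and HC with the same cluster count and score each one."""
    models = {
        "k-Means": KMeans(n_clusters=n_clusters, n_init=10, random_state=seed),
        "GMM": GaussianMixture(n_components=n_clusters, random_state=seed),
        "HC": AgglomerativeClustering(n_clusters=n_clusters, linkage="ward"),
    }
    return {
        name: silhouette_score(X, model.fit_predict(X))
        for name, model in models.items()
    }


# One row of scores per candidate number of clusters.
scores = {k: silhouette_by_model(X, k) for k in range(2, 6)}
for k, row in scores.items():
    print(k, {name: round(value, 2) for name, value in row.items()})
```

The silhouette score ranges from -1 to 1, with higher values indicating tighter, better separated clusters, which is why the GMM result of -0.01 on the raw features stands out as a poor fit.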

7. Final Model

The final model was chosen based on the number of clusters selected by the business team, taking the silhouette scores into account. The characteristics of the final model are presented in the table below.

| Model Name | Space Creation | Number of Clusters | Silhouette Score |
| --- | --- | --- | --- |
| k-Means | UMAP | 11 | 0.52 |

The business team believes that eleven is the best number of clusters. It is a good choice because the silhouette score of 0.52 is among the highest values found across all the models created, and the number of clusters is not high. The profile of each cluster, with its average metrics, is presented in the table below.

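| Cluster Number | Number of Customers | Customers Percentage | Gross Revenue | Recency | Products Purchased | Frequency | Returns |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 755 | 13.3 | 6260.09 | 11.7 | 241.9 | 0.05 | 76.6 |
| 1 | 383 | 6.7 | 2663.62 | 4.2 | 175.5 | 0.14 | 17.5 |
| 2 | 836 | 14.7 | 1705.62 | 36.6 | 98.1 | 0.04 | 16.7 |
| 3 | 392 | 6.9 | 1164.39 | 100.2 | 61.8 | 0.19 | 8.4 |
| 4 | 429 | 7.5 | 1028.46 | 290.7 | 59.7 | 0.63 | 202.2 |
| 5 | 277 | 4.9 | 906.62 | 362.6 | 65.1 | 1.05 | 2.5 |
| 6 | 586 | 10.3 | 861.55 | 35.1 | 44.7 | 0.71 | 3.5 |
| 7 | 595 | 10.4 | 774.43 | 135.1 | 65.1 | 0.77 | 3.7 |
| 8 | 391 | 6.9 | 647.62 | 199.1 | 47.2 | 1.02 | 2.4 |
| 9 | 408 | 7.2 | 606.15 | 56.2 | 46.1 | 1.07 | 6.6 |
| 10 | 643 | 11.3 | 492.88 | 246.8 | 39.6 | 1.02 | 1.6 |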
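A profile table of this kind is a per-cluster average plus a customer count and percentage. The sketch below shows one way to build it with pandas on made-up data; the column names and the function `cluster_profile` are hypothetical, and sorting by average gross revenue is an assumption based on the order of the rows in the table.

```python
import pandas as pd

# Made-up customers that already carry a cluster label.
customers = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1],
    "gross_revenue": [5000.0, 7000.0, 800.0, 1000.0, 1200.0],
    "recency_days": [5, 15, 100, 200, 300],
    "returns": [40, 60, 2, 4, 6],
})


def cluster_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Average each metric per cluster and add customer counts and percentages."""
    profile = df.groupby("cluster").mean()
    counts = df.groupby("cluster").size()
    profile.insert(0, "n_customers", counts)
    profile.insert(1, "pct_customers", 100 * counts / len(df))
    # Assumption: the most valuable cluster (highest average revenue) comes first.
    return profile.sort_values("gross_revenue", ascending=False)


profile = cluster_profile(customers)
print(profile)
```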

8. Conclusion

Several models were created to meet the demand of the business team. Finally, it was possible to find a model that satisfied the data and business teams simultaneously. The features created at the beginning of the modeling process were effective in separating the customers into clusters and in finding the cluster with the most valuable customers. The model can now be used by the business team to find the right marketing strategy for each customer according to the group they belong to and achieve higher profit.

9. Future Work

  • Try other clustering modeling algorithms.
  • Try other embedding spaces with more than 2 components.
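As a starting point for the second item, the sketch below embeds synthetic data into 2, 3 and 4 components and scores a k-Means clustering in each space. PCA is used here only because it ships with scikit-learn; it stands in for whichever embedding algorithm is tried. The helper `score_embedding` and every parameter value are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the customer features.
X, _ = make_blobs(n_samples=200, n_features=6, centers=4, random_state=0)


def score_embedding(X, n_components, n_clusters=4, seed=0):
    """Embed X with PCA, cluster the embedding and return both with the score."""
    embedding = PCA(n_components=n_components, random_state=seed).fit_transform(X)
    labels = KMeans(
        n_clusters=n_clusters, n_init=10, random_state=seed
    ).fit_predict(embedding)
    return embedding, silhouette_score(embedding, labels)


results = {n: score_embedding(X, n) for n in (2, 3, 4)}
for n, (embedding, score) in results.items():
    print(n, embedding.shape, round(score, 2))
```

Silhouette scores computed in different embedding spaces use different distances, so they are best read as a rough guide rather than a strict ranking.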
