A case study intended to focus on demonstrating how to develop a useful RFM Classification model and predict customer value from the model.

Shane-McCallum/E-Commerce-RFM-Classification-Case-Study

E-Commerce RFM Segmentation with KMeans

Every business has consumers; without them there are no sales, and the business would die. It is no surprise, then, that many businesses have a vested interest in understanding who their consumers are, what they want, when they want it, and how much they are spending on it. The process of finding these answers is known as customer segmentation. There are many methods used for customer segmentation today. One of the most universally useful is a Recency, Frequency, and Monetary Value (RFM) analysis. An RFM analysis allows businesses to segment their consumers into categories or tiers based on each consumer's scores across the three dimensions. Businesses can then see who their highest, average, and lowest performing consumers are. This enables a business to direct marketing campaigns and strategies to the proper audience more accurately, which in turn should lead to more sales at higher volume. Below, I have assembled a case study demonstrating how to develop a useful RFM analysis and classification model from the UCI E-Commerce store data.

1. Problem Identification

Problem Statement PDF

As always, in order to develop an applicable solution, there must first be a SMART (S – Specific, M – Measurable, A – Achievable, R – Realistic, T – Timebound) problem. In this case study, the problem has been clearly provided and I know exactly what the solution should look like. It is clear that an RFM analysis will be needed to segment the consumer data into the proper tiers for precise and specialized marketing. This often looks like email notifications about discount codes for items left in a checkout cart, or an invitation to a loyalty program or subscription option for the most valued consumers.

2. Data Wrangling

E-commerce datasets are often private, proprietary information for most companies. This makes them quite hard to find among publicly available data. However, the data used for this case study was made free to the public through UCI Machine Learning Repository and is well known by most in the data science and analytics community. It is available for download here.

With regard to cleaning the data, some minor touches were needed before the data was appropriate for segmentation.

First, the data had about 135,000 entries where the customer ID was blank. This could easily be explained as customers purchasing under guest accounts instead of making an account with an assigned customer ID number. As there is no way of tracking a specific guest's recency, frequency, and total monetary value, I removed them from the data. Additionally, I noted that returns are recorded as negative values for the quantity of the product in the transaction. This makes sense, as the product is returned and the original sale is nullified. However, a negative quantity is assigned a UnitPrice of GBP0.00, which only throws off the data's minimum and median values for UnitPrice as well as its standard deviation. Therefore, all return transactions were removed from the data as well. Finally, there were some outliers of considerable magnitude in Quantity and Revenue. These outliers would skew the segmentation of the consumer base into tiers, so I removed the top 0.99% of them.
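The three cleaning steps above can be sketched in pandas. This is a minimal illustration, not the code used in the case study: the sample rows are made up, `clean_transactions` and its `upper_q` parameter are hypothetical names, and the outlier rule (drop rows above a high quantile of Quantity or Revenue) is an assumption about how the trimming might be done.

```python
import pandas as pd

# Made-up rows standing in for the UCI data; column names follow that dataset.
raw = pd.DataFrame({
    "CustomerID": [17850.0, None, 17850.0, 13047.0, 12583.0, 13047.0],
    "Quantity": [6, 2, -6, 3, 5000, 4],
    "UnitPrice": [2.55, 1.85, 2.55, 4.25, 1.00, 3.75],
})


def clean_transactions(df: pd.DataFrame, upper_q: float = 0.99) -> pd.DataFrame:
    """Drop guest rows, returns, and extreme Quantity/Revenue values."""
    out = df.dropna(subset=["CustomerID"])      # guest checkouts have no ID
    out = out[out["Quantity"] > 0].copy()       # returns carry a negative Quantity
    out["Revenue"] = out["Quantity"] * out["UnitPrice"]

    # Assumed outlier rule: keep rows at or below the upper_q quantile
    # of both columns, with thresholds computed before any row is dropped.
    keep = pd.Series(True, index=out.index)
    for col in ("Quantity", "Revenue"):
        keep &= out[col] <= out[col].quantile(upper_q)
    return out[keep].reset_index(drop=True)


cleaned = clean_transactions(raw)
print(cleaned)
```

Computing both thresholds before filtering keeps the two outlier checks from compounding each other, which matters on small samples.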

3. Exploratory Data Analysis (EDA)

To prepare for a proper RFM analysis, I wanted to be sure that I had cleaned the data enough to get a reliable representation of the client's customer base. In addition, the exploratory analysis would reveal whether any further cleaning was needed. What I was checking for here were heavily skewed bar plots that would indicate a strong imbalance among the features. First up was to examine the customer base itself. I am not as concerned about which customers purchased the most products as I am about whether only a few of the several thousand customers make up most of the purchases. To check for this, I plotted the number of purchases per customer as a bar plot.

Top 100 Customers by # of Purchases

Wow, alright; CustomerID 17841 has made nearly 8,000 purchases in the last year alone. Following that, though, there is a gradual drop off in the number of purchases. As long as the data does not "flatten out" in comparison to the maximum, I am not too concerned.

Up next, I wanted to make sure there wasn’t a bias in the data on which products were purchased, as that would also signal a single product sustaining the business, and therefore an RFM analysis would be of little use.

Top 50 Most Often Purchased Items

Great, nothing of concern here. Next up, I wanted to take a look at which countries comprised the data to be sure I could give a confident representation of them.

Countries by # of Transactions from Them

Here, it is clear that the data lacks sufficient transactions, and therefore likely a sufficient consumer base, for any country outside of the UK. So, in order to preserve the authenticity of the segmentation as much as I could, I removed the transactions made from other countries.

The last bit of EDA I wanted to do was a cohort analysis. A cohort analysis shows how many customers returned in each month after the month in which they completed their first transaction. Customer cohorts are incredibly valuable because they create mutually exclusive customer segments. This allows a marketing team to clearly measure a product's lifecycle among the customer base as well as the standard customer lifecycle, such as yearly purchase cycles. In the cohort table, each row represents a cohort, identified by the month in which its customers first purchased from the store, and each column represents a subsequent month, showing the percentage of the cohort's original customers who were retained in that month.

Cohort Analysis
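One way to build such a retention table is sketched below. The data is made up, `retention_table` is a hypothetical helper name, and the choice to measure cohorts in calendar months is an assumption based on the description above.

```python
import pandas as pd

# Made-up transactions: one row per purchase.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 3, 3],
    "InvoiceDate": pd.to_datetime([
        "2011-01-05", "2011-02-10", "2011-01-20",
        "2011-03-02", "2011-02-14", "2011-03-30",
    ]),
})


def retention_table(df: pd.DataFrame) -> pd.DataFrame:
    """Rows: first-purchase month. Columns: months since first purchase."""
    d = df.copy()
    # Truncate each purchase date to the first day of its month.
    d["InvoiceMonth"] = d["InvoiceDate"].dt.to_period("M").dt.to_timestamp()
    # A customer's cohort is the month of their first purchase.
    d["CohortMonth"] = d.groupby("CustomerID")["InvoiceMonth"].transform("min")
    # Whole months elapsed between the cohort month and this purchase.
    d["CohortIndex"] = (
        (d["InvoiceMonth"].dt.year - d["CohortMonth"].dt.year) * 12
        + (d["InvoiceMonth"].dt.month - d["CohortMonth"].dt.month)
    )
    counts = (
        d.groupby(["CohortMonth", "CohortIndex"])["CustomerID"]
        .nunique()
        .unstack("CohortIndex")
    )
    # Divide every column by the cohort's size in its first month.
    return counts.div(counts[0], axis=0)


retention = retention_table(tx)
print(retention.round(2))
```

Cells with no data (months that have not happened yet for a late cohort) come out as NaN, which is why cohort heatmaps typically have a triangular shape.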

4. Modeling with KMeans Clustering

To begin, I grouped the data by Customer ID and created the features for Recency, Frequency, and Monetary Value. Again, Recency is the number of days since the customer's last purchase, so a smaller number here means a higher Recency score. Frequency and Monetary Value are exactly what they sound like, so a higher value here means a higher score in the segmentation.

RFM_seg initial table
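A minimal sketch of that grouping step follows. The rows are made up and `build_rfm` is a hypothetical name. Two choices here are assumptions rather than details from the case study: Recency is measured from a snapshot date one day after the latest transaction, and Frequency counts distinct invoices rather than line items.

```python
import pandas as pd

# Made-up, already-cleaned transactions.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3],
    "InvoiceNo": ["A1", "A2", "B1", "C1", "C2", "C2"],
    "InvoiceDate": pd.to_datetime([
        "2011-11-01", "2011-12-01", "2011-06-15",
        "2011-12-05", "2011-12-09", "2011-12-09",
    ]),
    "Revenue": [10.0, 20.0, 5.0, 7.5, 2.5, 10.0],
})


def build_rfm(df: pd.DataFrame, snapshot_date=None) -> pd.DataFrame:
    """Aggregate transactions into one Recency/Frequency/Monetary row per customer."""
    if snapshot_date is None:
        # Assumption: "today" is the day after the last recorded transaction.
        snapshot_date = df["InvoiceDate"].max() + pd.Timedelta(days=1)
    return df.groupby("CustomerID").agg(
        Recency=("InvoiceDate", lambda s: (snapshot_date - s.max()).days),
        Frequency=("InvoiceNo", "nunique"),   # assumption: distinct invoices
        MonetaryValue=("Revenue", "sum"),
    )


rfm = build_rfm(tx)
print(rfm)
```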

Now, the client wants the customers segmented into three tiers: Best Customers, Average Customers, and Weak Customers. To do this, I score each of the three features from 1 to 4, which divides the customers into quartiles on each feature and allows for clean and even segmentation. A customer segmented as a Best Customer must have a total score of 9 or more (such as Recency 3, Frequency 3, and Monetary Value 3).

RFM seg w/ RFM Level
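The quartile scoring can be sketched with `pd.qcut`. The data is made up and `score_rfm` is a hypothetical name. The cutoff of 9 for Best Customers comes from the text above; the cutoff of 6 between Weak and Average is an assumption made only so the example is complete.

```python
import pandas as pd

# Made-up RFM table for eight customers.
rfm = pd.DataFrame(
    {
        "Recency": [1, 5, 10, 20, 40, 80, 160, 320],
        "Frequency": [50, 40, 30, 20, 10, 5, 2, 1],
        "MonetaryValue": [5000, 4000, 3000, 2000, 1000, 500, 200, 100],
    },
    index=pd.Index(range(1, 9), name="CustomerID"),
)


def score_rfm(rfm: pd.DataFrame) -> pd.DataFrame:
    """Score each feature 1-4 by quartile, then sum and label the total."""
    out = rfm.copy()
    # rank(method="first") breaks ties so qcut always finds four distinct bins.
    # Recency labels are reversed: fewer days since purchase earns a higher score.
    out["R"] = pd.qcut(out["Recency"].rank(method="first"), 4, labels=[4, 3, 2, 1]).astype(int)
    out["F"] = pd.qcut(out["Frequency"].rank(method="first"), 4, labels=[1, 2, 3, 4]).astype(int)
    out["M"] = pd.qcut(out["MonetaryValue"].rank(method="first"), 4, labels=[1, 2, 3, 4]).astype(int)
    out["RFM_Score"] = out[["R", "F", "M"]].sum(axis=1)
    # Totals run from 3 to 12. Best is 9 or more (from the text);
    # the Weak/Average boundary at 6 is an assumed value.
    out["RFM_Level"] = pd.cut(
        out["RFM_Score"], bins=[2, 5, 8, 12], labels=["Weak", "Average", "Best"]
    )
    return out


scored = score_rfm(rfm)
print(scored[["R", "F", "M", "RFM_Score", "RFM_Level"]])
```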

To properly implement KMeans clustering, the data first needs to be normalized. This makes all of the features of the RFM segmentation easily comparable and prevents KMeans from outputting heavily skewed and stretched clusters. The distribution graphs below make it clear that normalizing the data makes the features much easier to compare.

Not-Normalized Distributions

Normalized Distributions
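The case study does not say which normalization was applied. A common choice for right-skewed RFM data, shown here purely as an assumption, is a log transform followed by standardization to zero mean and unit variance. `normalize_rfm` is a hypothetical name and the values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up RFM values; all are positive, as the log transform requires.
rfm = pd.DataFrame({
    "Recency": [1, 10, 40, 160, 320],
    "Frequency": [50, 20, 8, 2, 1],
    "MonetaryValue": [5000.0, 1500.0, 400.0, 90.0, 25.0],
})


def normalize_rfm(rfm: pd.DataFrame) -> pd.DataFrame:
    """Assumed recipe: log1p to reduce skew, then scale each column."""
    logged = np.log1p(rfm)
    scaled = StandardScaler().fit_transform(logged)
    return pd.DataFrame(scaled, index=rfm.index, columns=rfm.columns)


normalized = normalize_rfm(rfm)
print(normalized.round(2))
```

KMeans relies on Euclidean distance, so without a step like this the feature with the largest raw range (usually Monetary Value) would dominate the clustering.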

Next, I fit KMeans for a range of cluster counts to see which value of K had the "best" sum of squared errors (SSE). The best value is usually the one at the "crook of the elbow," where adding more clusters stops reducing the SSE substantially. However, I know that the client wants three tiers, which means I need at least three clusters.

KMeans Elbow Check
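The elbow check can be sketched as a loop over K that records scikit-learn's `inertia_`, which is the within-cluster sum of squared distances. The data below is synthetic (three well-separated blobs standing in for normalized RFM features), and `sse_by_k` is a hypothetical name.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for normalized RFM data: three blobs in 3-D.
rng = np.random.default_rng(0)
centers = np.array([[-4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 6.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 3)) for c in centers])


def sse_by_k(X, k_values):
    """Fit KMeans once per k and return {k: SSE}."""
    sse = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
        sse[k] = model.inertia_
    return sse


sse = sse_by_k(X, range(1, 7))
for k, value in sse.items():
    print(k, round(value, 1))
```

On this synthetic data the drop in SSE flattens sharply after K=3. Real RFM data is rarely this clean, which is consistent with the weak elbow seen in the plot above.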

Well, the elbow isn't very pronounced. So, to see which number of clusters would be best, I will use another method: the Silhouette Score. A silhouette plot provides a great visual representation of the KMeans clustering algorithm at work. What I want to see here is a set of even "silhouettes," indicating that the clusters are evenly sized and not overlapping one another.

Silhouette Score plots for K = 2, 3, 4, 5, and 6

Finally, I will check the Average Silhouette Score. Generally, the K value whose average score is closest to 1 is the best. Values near 0 indicate overlapping clusters, and values near -1 indicate that samples have likely been assigned to the wrong cluster.

Average Silhouette Score Graph
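The average score for each K can be computed with scikit-learn's `silhouette_score`. This sketch reuses synthetic three-blob data in place of the real normalized RFM table, and `silhouette_by_k` is a hypothetical name.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for normalized RFM data: three blobs in 3-D.
rng = np.random.default_rng(0)
centers = np.array([[-4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 6.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 3)) for c in centers])


def silhouette_by_k(X, k_values):
    """Return {k: mean silhouette score}. Requires k >= 2."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores


scores = silhouette_by_k(X, range(2, 7))
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The per-cluster silhouette plots mentioned earlier use `silhouette_samples` from the same module, which returns one score per data point instead of the mean.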

K=3 provides the best score for the segmentation, since K=2 is not really an option given what the client wants. With the number of clusters chosen, I run the KMeans algorithm and fit it to the normalized RFM data. To visualize and compare, I have plotted both the customer segmentations and the KMeans clusters on snake plots. Snake plots are great for comparing segments across the various features to identify where they differ.

RFM Snake Plot KMeans Clusters Snake Plot
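A snake plot needs the data in long format: one row per customer per attribute. The sketch below fits KMeans with K=3 on synthetic data and reshapes the result with `melt`. The data and variable names are illustrative assumptions, and the plotting call itself is omitted.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the normalized RFM table.
rng = np.random.default_rng(0)
centers = np.array([[-4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 6.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 3)) for c in centers])
normalized = pd.DataFrame(X, columns=["Recency", "Frequency", "MonetaryValue"])

# Fit the final model and attach each customer's cluster label.
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(normalized)
normalized["Cluster"] = labels

# Long format: one row per (customer, attribute) pair.
snake = normalized.melt(id_vars="Cluster", var_name="Attribute", value_name="Value")

# The line a snake plot draws per cluster is the mean value of each attribute.
snake_means = snake.groupby(["Cluster", "Attribute"])["Value"].mean().unstack("Attribute")
print(snake_means.round(2))
```

Passing `snake` to a line-plot function with the attribute on the x-axis and the cluster as the hue would produce the plot itself.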

It is clear that the Best Customers are in Cluster 1. The Average Customers probably overlap a little between Cluster 0 and Cluster 2, but are mostly found in Cluster 0. Finally, the Weak Customers are found in Cluster 2.

To better understand how customers were assigned to their segments, I made a heatmap of the relative importance of each attribute within each segment and how that determined the segment. Ideally, this heatmap and the one showing the relative importance of attributes for the clusters should be similar. The heatmaps show that the further from 0 an attribute's score is, the more important that attribute is in determining what falls into that segment or cluster.

Relative Importance of Attributes for Segments Relative Importance of Attributes for Clusters
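The case study does not give the formula behind these heatmaps. One common definition, used here as an assumption, is the group's average for an attribute divided by the population average, minus 1, so that 0 means "same as the overall average." The data is made up and `relative_importance` is a hypothetical name.

```python
import pandas as pd

# Made-up raw (not normalized) RFM values with a cluster label per customer.
rfm = pd.DataFrame({
    "Recency": [5, 10, 100, 200],
    "Frequency": [20, 10, 2, 1],
    "MonetaryValue": [1000.0, 600.0, 300.0, 100.0],
    "Cluster": [1, 1, 2, 2],
})


def relative_importance(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """Assumed formula: group mean / population mean - 1, per attribute."""
    features = df.drop(columns=group_col)
    group_avg = features.groupby(df[group_col]).mean()
    population_avg = features.mean()
    return group_avg / population_avg - 1


rel = relative_importance(rfm, "Cluster")
print(rel.round(2))
```

The same function works for the tier labels by passing the segment column as `group_col`, which makes the two heatmaps directly comparable.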

5. Conclusion and Future Tests

The client now has a clear segmentation of their consumer base and can see who their best, average, and weakest customers are based on their recency, frequency, and monetary value. Additionally, the client can continually feed future customer information into the model and segment future customers rather easily. However, there are some important caveats. First and foremost, this model is designed only for the client's consumers located within the UK. Additionally, the data has been cleaned of most of the client's outlier consumers, so the model cannot be applied to those consumers. That said, as data on the client's consumer transactions continues to be fed into the model, the outliers may shrink into the standard quartiles of the data. For future tests, I would encourage the client to copy this model and apply it to the data they have for their consumers from other countries. Once there is a healthy enough population of customers from these other countries, the client could have several different models tracking their consumer base across various continents experiencing different trends.
