Health Insurance Cross Sell

This is a fictional project for study purposes. The business context and the insights are not real. The dataset comes from a health insurance company that sells various kinds of insurance. The dataset is available on Kaggle.

1. Description of the Business Problem

An insurance company sells health insurance to its customers and wants to start offering them vehicle insurance to diversify its products. The company will call these customers to offer the new insurance. It surveyed its customers to collect data and find out which of them would be interested in vehicle insurance, so that it can cross-sell. The company has the capacity to make only 2,000 calls. It believes that one way to reach as many interested customers as possible with the fewest calls is to build a machine learning model that sorts the customer list so that the customers most likely to buy come first. This is a type of classification problem known as learning to rank.

The tools that were created:

Machine Learning Classification Model: Using the dataset from Kaggle, a machine learning classification model was created to be used for future predictions.

The notebook used to create the model is available here.

Flask Prediction API: The model is hosted in the cloud on Heroku and is accessible through an API built with Flask. The API source code is available here.

Google Sheets Script: A Google Sheets script was developed to make predictions for several customers at once. The spreadsheet is available here. The top menu has an item called "Health Insurance Prediction". To make predictions, the user clicks on it and then on "Get Prediction"; the predictions for all rows in the spreadsheet then appear in the prediction column.
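As a rough illustration of the batch flow described above, a client could turn spreadsheet rows into a single JSON request body before sending it to the prediction API. This is only a sketch: the function name `build_payload` and the record-per-row JSON layout are assumptions, since the README does not specify the request format, and the HTTP call itself is left out because the endpoint URL is not given here.

```python
import json


def build_payload(header, rows):
    """Turn spreadsheet rows into a JSON array of records keyed by column name.

    The record-per-row layout is an assumption, not the documented API format.
    """
    records = [dict(zip(header, row)) for row in rows]
    return json.dumps(records)


# Made-up rows using a few column names from the dataset description.
header = ["id", "Gender", "Age", "Previously_Insured"]
rows = [
    [1, "Male", 44, 0],
    [2, "Female", 29, 1],
]

payload = build_payload(header, rows)
print(payload)
```

Sending all rows in one request keeps the number of API calls independent of the number of customers in the sheet.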

2. Dataset Attributes

Information about the attributes can be found here.

| Attribute | Description |
| --- | --- |
| id | Unique ID for the customer |
| Gender | Gender of the customer |
| Age | Age of the customer |
| Driving_License | 0: Customer does not have a driving license, 1: Customer already has a driving license |
| Region_Code | Unique code for the region of the customer |
| Previously_Insured | 1: Customer already has vehicle insurance, 0: Customer doesn't have vehicle insurance |
| Vehicle_Age | Age of the vehicle |
| Vehicle_Damage | 1: Customer's vehicle was damaged in the past, 0: Customer's vehicle was not damaged in the past |
| Annual_Premium | The amount the customer needs to pay as premium in the year |
| Policy_Sales_Channel | Anonymized code for the channel used to reach the customer, i.e. different agents, over mail, over phone, in person, etc. |
| Vintage | Number of days the customer has been associated with the company |
| Response | 1: Customer is interested, 0: Customer is not interested |

3. Business Premises

The premises that were assumed for the development of the business problem solution are:

  • Cross-selling is a sales technique that involves selling an additional product or service to an existing customer.
  • Learning to rank is a kind of classification problem in which the objective is to order a table by the probability that each row belongs to a specific class.
  • No Policy Sales Channel is better than the others; they should all carry the same weight in the model's prediction.
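The ranking idea in the premises above can be sketched in a few lines: score every customer, sort by score in descending order, and call only the top of the list. The function name `rank_customers` and the scores below are made up for illustration; in the project the scores would come from the trained classifier's predicted probability of a positive `Response`.

```python
def rank_customers(customers, scores):
    """Return customers sorted by descending predicted probability of interest."""
    paired = sorted(zip(scores, customers), key=lambda pair: pair[0], reverse=True)
    return [customer for _, customer in paired]


# Made-up customer IDs and predicted probabilities.
customer_ids = [101, 102, 103, 104, 105]
scores = [0.10, 0.80, 0.35, 0.60, 0.05]

# With a budget of two calls, only the two highest-scoring customers are called.
call_list = rank_customers(customer_ids, scores)[:2]
print(call_list)  # [102, 104]
```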

4. Solution Strategy

  1. Understand the Business problem.
  2. Download the dataset from Kaggle.
  3. Clean the dataset by removing outliers, NA values and unnecessary features.
  4. Explore the data to create hypotheses, think about a few insights and validate them.
  5. Prepare the data for the modeling algorithms by encoding variables, splitting it into train and test datasets and performing other necessary operations.
  6. Create the models using machine learning algorithms.
  7. Evaluate the created models to find the one that best fits the problem.
  8. Tune the model to achieve a better performance.
  9. Deploy the model in production so that it is available to the user.
  10. Find possible improvements to be explored in the future.
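As one possible reading of the train/test split in step 5, the sketch below splits row indices so that both parts keep the same share of interested customers. Stratifying by the target is an assumption (the README does not say how the split was done), and `stratified_split`, the 20% test fraction and the toy labels are hypothetical.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split row indices into train and test, preserving the class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 10 interested customers out of 100.
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 80 20
```

Keeping the class ratio stable matters here because interested customers are a small minority, so a plain random split could leave the test set with too few positives to evaluate a ranking.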

5. Machine Learning Modeling

The final result of this project is a classification model used to rank the table. Six models were created and compared: KNN (K-Nearest Neighbors), Logistic Regression, Extra Trees, Random Forest, XGBoost and LightGBM.

The Boruta algorithm was used to select features for the model, but it selected only one feature. The dataset features are not very good at explaining whether or not a customer wants vehicle insurance. The features for the model were therefore chosen based on feature importance in an Extra Trees model, and seven features were selected. The models were evaluated on two metrics, Precision at K and Recall at K, computed over the first 2,000 rows of the table the models should rank. The initial performance of each model is shown in the table below.

| Model Name | Precision at K | Recall at K |
| --- | --- | --- |
| LightGBM | 0.4153 | 0.0895 |
| XGBoost | 0.4078 | 0.0879 |
| Random Forest | 0.3363 | 0.0725 |
| KNN | 0.3338 | 0.0719 |
| Extra Trees | 0.3288 | 0.0709 |
| Logistic Regression | 0.3028 | 0.0653 |
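For reference, Precision at K and Recall at K can be computed as in the sketch below, using their usual definitions: rank the rows by score, take the top K, and count how many of them are truly interested. The function name and the toy data are illustrative, not taken from the project notebook, and the sketch assumes K is no larger than the number of rows.

```python
def precision_recall_at_k(y_true, scores, k):
    """Compute (precision at k, recall at k) for binary labels and model scores."""
    # Row indices ordered from the highest score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_k = order[:k]

    hits = sum(y_true[i] for i in top_k)   # interested customers in the top k
    total_positives = sum(y_true)          # interested customers overall

    precision = hits / k
    recall = hits / total_positives if total_positives else 0.0
    return precision, recall


# Toy example: 8 customers, 3 of them interested.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]

precision, recall = precision_recall_at_k(y_true, scores, k=4)
print(precision)  # 0.5
```

Precision at K answers "what share of the K calls succeed", while Recall at K answers "what share of all interested customers do the K calls reach", which is why recall stays low when K is small relative to the dataset.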

6. Final Model

To decide which model would be the final one, cross-validation was carried out to evaluate the performance of the algorithms in a more robust way. The resulting metrics are shown in the table below.

| Model Name | Precision at K | Recall at K |
| --- | --- | --- |
| LightGBM CV | 0.4222 +/- 0.0037 | 0.1128 +/- 0.0007 |
| XGBoost CV | 0.4120 +/- 0.0055 | 0.1102 +/- 0.0013 |
| Random Forest CV | 0.3526 +/- 0.0106 | 0.0942 +/- 0.0031 |
| KNN CV | 0.3374 +/- 0.0059 | 0.0904 +/- 0.0015 |
| Extra Trees CV | 0.3216 +/- 0.0039 | 0.0860 +/- 0.0011 |
| Logistic Regression CV | 0.2958 +/- 0.0111 | 0.0792 +/- 0.0032 |
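Values of the form "mean +/- deviation" like those above are typically obtained by scoring the model on each validation fold and then aggregating the fold scores. The sketch below shows one way to do that with the standard library only. The number of folds, the use of the population standard deviation and the fold scores are all assumptions made for illustration; none of them are taken from the project.

```python
import statistics


def k_fold_indices(n_rows, n_folds):
    """Yield (train_idx, valid_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, valid_idx
        start += size


def summarize(fold_scores):
    """Format fold scores as 'mean +/- population standard deviation'."""
    mean = statistics.mean(fold_scores)
    std = statistics.pstdev(fold_scores)
    return f"{mean:.4f} +/- {std:.4f}"


# Made-up precision-at-K scores for five folds (not results from the project).
fold_scores = [0.42, 0.43, 0.41, 0.42, 0.43]
print(summarize(fold_scores))
```

In a real run, each fold score would come from training the model on `train_idx` and computing the metric on `valid_idx`.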

The LightGBM model was the best among all the models created. It was the one selected to be deployed. After choosing which would be the final model, a random search hyperparameter optimization was used to improve the performance of the model. The final model evaluation metrics are in the table below.

| Model Name | Precision at K | Recall at K |
| --- | --- | --- |
| LightGBM | 0.433 +/- 0.0067 | 0.1158 +/- 0.0018 |
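Random search, the tuning method mentioned above, draws hyperparameter combinations at random and keeps the best-scoring one. The sketch below is a generic version of that idea, not the project's tuning code. The search space is hypothetical (the parameter names mirror common LightGBM settings, but the values actually searched are not given in this README), and `toy_evaluate` is a stand-in for the cross-validated Precision at K a real run would compute.

```python
import random


def random_search(param_space, evaluate, n_iter, seed=0):
    """Sample n_iter random parameter sets and return the best (params, score)."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Draw one value per hyperparameter, independently and uniformly.
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space.
param_space = {
    "n_estimators": [100, 300, 500],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 5, 7],
}


def toy_evaluate(params):
    """Stand-in objective; a real run would train and cross-validate the model."""
    return -abs(params["learning_rate"] - 0.05) - abs(params["max_depth"] - 5) / 100


best_params, best_score = random_search(param_space, toy_evaluate, n_iter=20)
print(best_params, best_score)
```

Compared with an exhaustive grid search, random search bounds the cost by `n_iter` regardless of how many values each hyperparameter can take.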

7. Business Results

When applied to the initial dataset of 381,109 clients, the model's top 2,000 clients would include 701 more clients who want vehicle insurance than 2,000 clients picked at random from the database. This represents an increase of 297.03% in the number of successful calls.
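The two figures above can be cross-checked with a little arithmetic. Assuming the percentage is the number of extra interested clients divided by the number a random selection would reach, the implied random baseline and model total follow directly from the stated values; the variable names are illustrative.

```python
extra_clients = 701          # additional interested clients, as stated above
relative_increase = 2.9703   # 297.03% expressed as a fraction

# Assumption: relative_increase = extra_clients / baseline_hits
baseline_hits = extra_clients / relative_increase
model_hits = baseline_hits + extra_clients

print(round(baseline_hits), round(model_hits))  # 236 937
```

Under that assumption, a random list of 2,000 calls would reach about 236 interested clients, against about 937 for the ranked list.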

8. Conclusion

Although the dataset features are not very good for building classification models that predict whether or not a customer would like vehicle insurance, the resulting model sorts the table better than a random ordering. The model can help the company achieve a higher success rate when calling customers. However, more features would be of great help in improving the model's predictive power.

9. Future Work

  • Improve model prediction capabilities by adding new features.
  • Explore the dataset to find possible insights.
  • Try other machine learning algorithms.