This repository contains a small pre-task for potential ML team members for UBC Launch Pad.
The dataset bundled in this repository contains information about credit card bill payments, courtesy of the UCI Machine Learning Repository. Your task is to train a model on this data to predict whether or not a customer will default on their next bill payment.
Most of the work should be done in `model.py`. It contains a barebones model class; your job is to implement the `fit` and `predict` methods in whatever way you want (feel free to import any libraries you wish). You can look at `main.py` to see how these methods will be called. Don't worry about getting "good" results (this dataset is very tough to predict on); treat this as an exploratory task!
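As a rough illustration of the interface described above, a model exposing `fit` and `predict` might look like the sketch below. The class name `RandomBaselineModel`, the method signatures, and the `seed` parameter are assumptions made for illustration; check `model.py` and `main.py` for the actual names and call pattern.

```python
import random


class RandomBaselineModel:
    """Hypothetical stand-in: predicts 0 or 1 at random, ignoring the features."""

    def __init__(self, seed=None):
        # A private generator keeps results reproducible when a seed is given.
        self._rng = random.Random(seed)

    def fit(self, X, y):
        # Nothing to learn for a random baseline; a real model would train here.
        return self

    def predict(self, X):
        # One prediction (0 or 1) per input row.
        return [self._rng.randint(0, 1) for _ in X]


# Toy inputs for illustration only.
model = RandomBaselineModel(seed=0).fit([[1.0, 2.0], [3.0, 4.0]], [0, 1])
predictions = model.predict([[5.0, 6.0], [7.0, 8.0], [9.0, 10.0]])
```

Returning `self` from `fit` is a common convention that allows chaining, but it is not required by anything stated in this README.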
To run this code, you'll need Python and three libraries: NumPy, SciPy, and scikit-learn. After invoking `python main.py` from your shell of choice, you should see the model accuracy printed: approximately 50% if you haven't changed anything, since the provided model predicts completely randomly.
Here are the things you should do:
- Fork this repo, so we can see your code!
- Install the required libraries using `pip install -r requirements.txt` (if needed).
- Ensure you see the model's accuracy/precision/recall scores printed when running `python main.py`.
- Replace the placeholder code in `model.py` with your own model.
- Fill in the "write-up" section below in your forked copy of the README.
Good luck, and have fun with this! 🚀
Give a brief summary of the approach you took, and why! Include your model's accuracy/precision/recall scores as well!
My first step was to explore the data and check whether any columns contain NaN values. As shown in the Jupyter notebook provided, no column contains missing values. I then inspected the labels of the training data and found that the default and non-default classes are imbalanced.
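The two checks described above can be sketched in plain Python as follows. This is only an illustration on toy data, not the code from the notebook; the helper names `count_missing` and `label_balance` are hypothetical.

```python
import math
from collections import Counter


def count_missing(rows):
    """Count NaN entries across all rows (rows: iterable of numeric sequences)."""
    return sum(1 for row in rows for value in row if math.isnan(value))


def label_balance(labels):
    """Return the fraction of samples that fall in each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy data for illustration only; not taken from the real dataset.
toy_X = [
    [20000.0, 2.0],
    [50000.0, float("nan")],
    [30000.0, 1.0],
    [10000.0, 1.0],
]
toy_y = [0, 0, 0, 1]

missing = count_missing(toy_X)   # one NaN in the toy rows
balance = label_balance(toy_y)   # share of each label
```

A result of zero from `count_missing` corresponds to the "no missing values" finding, and strongly unequal fractions from `label_balance` indicate class imbalance.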
Because of this imbalance, I decided to use a tree-based model, and I chose LightGBM.
LightGBM has many hyperparameters to tune, so I wrote a script, `tuning.py`, to do the hyperparameter tuning. Finally, I used the best hyperparameters found as the parameters in `model.py`.
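One common way to tune hyperparameters is an exhaustive grid search; whether `tuning.py` uses this or another strategy is not stated here. The sketch below shows the general idea only. The grid values are made up, and `toy_score` is a placeholder for "train a model with these parameters and return a validation score".

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid; return the best params and score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid; the values are illustrative, not the ones used in tuning.py.
param_grid = {"num_leaves": [15, 31, 63], "learning_rate": [0.01, 0.1]}


def toy_score(params):
    # Placeholder scoring function that peaks at num_leaves=31, learning_rate=0.1.
    # A real version would fit a model and evaluate it on held-out data.
    return -abs(params["num_leaves"] - 31) - abs(params["learning_rate"] - 0.1)


best_params, best_score = grid_search(param_grid, toy_score)
```

Because the classes are imbalanced, the choice of score matters: a search that maximizes accuracy alone can favor models that rarely predict a default.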
Here is a summary of the steps I took:
- Step 1: look at the data to see if there are any missing values
- Step 2: look at the labels to see if the classes are imbalanced
- Step 3: determine the best model to use in this case
- Step 4: perform hyperparameter tuning to obtain a better result
Accuracy: 82.107%
Precision: 66.102%
Recall: 35.956%
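For readers unfamiliar with how these three scores relate, here is a minimal sketch of their definitions on toy labels, treating `1` (default) as the positive class. It is not the code from `main.py`, which may compute them differently; the function name `classification_scores` is hypothetical.

```python
def classification_scores(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels, 1 = positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Toy labels for illustration only; not taken from the real dataset.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]
accuracy, precision, recall = classification_scores(y_true, y_pred)
```

In this toy case accuracy is noticeably higher than recall, which shows how a model can be right most of the time while still missing many of the positive cases.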
`X_train` and `X_test` contain data of the following form:
Column(s) | Data |
---|---|
0 | Amount of credit given, in dollars |
1 | Gender (1 = male, 2 = female) |
2 | Education (1 = graduate school; 2 = university; 3 = high school; 4 = others) |
3 | Marital status (1 = married; 2 = single; 3 = others) |
4 | Age, in years |
5–10 | History of past payments over 6 months (-1 = on-time; 1 = one month late; …) |
11–16 | Amount of previous bill over 6 months, in dollars |
17–22 | Amount of previous payment over 6 months, in dollars |
`y_train` and `y_test` contain a `1` if the customer defaulted on their next payment, and a `0` otherwise.
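The column layout in the table above can be captured as named slices, which may make feature engineering easier to read. This is a sketch only: the group names and the `split_row` helper are made up for illustration, while the column ranges come from the table.

```python
# Column ranges taken from the table; the group names are hypothetical.
COLUMN_GROUPS = {
    "credit_amount": slice(0, 1),
    "gender": slice(1, 2),
    "education": slice(2, 3),
    "marital_status": slice(3, 4),
    "age": slice(4, 5),
    "payment_history": slice(5, 11),   # columns 5-10
    "bill_amounts": slice(11, 17),     # columns 11-16
    "payment_amounts": slice(17, 23),  # columns 17-22
}


def split_row(row):
    """Split one 23-value feature row into named groups."""
    return {name: list(row[cols]) for name, cols in COLUMN_GROUPS.items()}


# Toy row whose values equal their own column index, for easy checking.
groups = split_row(list(range(23)))
```

The same slices can be applied to the second axis of a 2-D NumPy array, for example `X_train[:, COLUMN_GROUPS["payment_history"]]`, assuming `X_train` is a 2-D array with 23 columns.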