Project for Fundamentals of Data Science 2018/2019, MSc in Computer Science.
Forked from luigiberducci; group composed of luigiberducci, angelodimambro, and me.
Kaggle Score: 0.11440
- 3 new features introduced: total number of bathrooms, number of garage cars multiplied by garage area, total square feet
- removal of multicollinear features
- automatic removal of features receiving a caret importance score of 0 under a Lasso regression model, repeated until the model's RMSE stopped decreasing
- Lasso regression model
- Ridge regression model
- eXtreme Gradient Boosting model
- Support Vector Machines
- Ensemble model (average)
- Stacked regression model (both variants A and B)
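The three engineered features listed above can be sketched as follows. This is a Python/pandas illustration (the original project used R with caret), and the column names are assumptions modeled on the Kaggle "House Prices" dataset; counting half baths as 0.5 is also an assumption.

```python
import pandas as pd

# Toy frame with hypothetical Ames-style column names (assumption).
df = pd.DataFrame({
    "FullBath": [2, 1], "HalfBath": [1, 0],
    "BsmtFullBath": [1, 0], "BsmtHalfBath": [0, 1],
    "GarageCars": [2, 1], "GarageArea": [480.0, 240.0],
    "TotalBsmtSF": [856, 920], "FirstFlrSF": [856, 920], "SecondFlrSF": [854, 0],
})

# 1) total number of bathrooms (half baths weighted 0.5 -- an assumption)
df["TotBathrooms"] = (df["FullBath"] + 0.5 * df["HalfBath"]
                      + df["BsmtFullBath"] + 0.5 * df["BsmtHalfBath"])

# 2) number of garage cars multiplied by garage area
df["GarageCarsArea"] = df["GarageCars"] * df["GarageArea"]

# 3) total square feet (basement + first + second floor)
df["TotalSF"] = df["TotalBsmtSF"] + df["FirstFlrSF"] + df["SecondFlrSF"]
```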
Our ensemble model computes a weighted average of the predictions produced by a set of simple models, using the following weights:
| Model | Weight |
|-------|--------|
| Lasso | 0.5    |
| Ridge | 0.5    |
| XGB   | 3.5    |
| SVM   | 5      |
These weights were optimized via 10-fold CV, minimizing the average RMSE with respect to the weights.
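The weighted average above can be sketched as follows; note that the weights do not sum to 1, so the sum is normalized by the total weight (0.5 + 0.5 + 3.5 + 5 = 9.5). The per-model predictions are illustrative values only.

```python
import numpy as np

weights = {"lasso": 0.5, "ridge": 0.5, "xgb": 3.5, "svm": 5.0}

# toy per-model predictions for two houses (illustrative values only)
preds = {
    "lasso": np.array([12.0, 11.5]),
    "ridge": np.array([12.1, 11.4]),
    "xgb":   np.array([12.2, 11.6]),
    "svm":   np.array([12.0, 11.5]),
}

# weighted average, normalized by the total weight
ensemble = sum(w * preds[m] for m, w in weights.items()) / sum(weights.values())
```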
In our stacked regression models, the predictions of a set of simple models are used to train a meta-model.
Variants:
- Variant A: the meta-model is trained on the average of the out-of-fold predictions produced during the simple models' k-fold training
- Variant B: the meta-model is trained on predictions produced by new instances of the simple models, each trained on the whole training set
Our stacked regression model uses the following recipe:
| Simple models | Meta-model   |
|---------------|--------------|
| Lasso         | Specific XGB |
| Ridge         |              |
| XGB           |              |
| SVM           |              |
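Variant B of the stacking recipe can be sketched as below on synthetic data. This is a Python/scikit-learn illustration of the idea only (the original project used R); `GradientBoostingRegressor` stands in for the XGB meta-model to keep the sketch dependency-light, and all hyperparameters are placeholder assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for XGB

# synthetic regression data (placeholder for the house-prices training set)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

# Variant B: new instances of the simple models, trained on the whole
# training set; their predictions form the meta-model's feature matrix.
base_models = [Lasso(alpha=0.01), Ridge(alpha=1.0), SVR(),
               GradientBoostingRegressor(random_state=0)]
meta_X = np.column_stack([m.fit(X, y).predict(X) for m in base_models])

# the meta-model learns how to combine the simple models' predictions
meta_model = GradientBoostingRegressor(random_state=0).fit(meta_X, y)
stacked_preds = meta_model.predict(meta_X)
```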
Our final predictions are computed in the following way:
`predictions = (2 * ensemble + xgb + svm + stacked_variantA + stacked_variantB) / 6`
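The final blend gives the ensemble double weight, hence the divisor of 6. A minimal numeric sketch, with illustrative per-model values for a single house:

```python
# illustrative predictions for one house (values are placeholders)
ensemble, xgb, svm = 12.10, 12.05, 12.00
stacked_variantA, stacked_variantB = 12.08, 12.02

# ensemble counted twice, five terms total, divided by 6
predictions = (2 * ensemble + xgb + svm + stacked_variantA + stacked_variantB) / 6
```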