Skip to content

Abhinav330/Classification-problem-using-ensambling-on-titanic-dataset

Repository files navigation

Codacy Badge GitHub Pipenv locked dependency version GitHub Pipenv locked dependency version GitHub Pipenv locked dependency version GitHub Pipenv locked dependency version GitHub Pipenv locked dependency version GitHub Pipenv locked dependency version GitHub Pipenv locked Python version GitHub repo size

Code Summary

This Python script is a comprehensive example of a data analysis and machine learning project using the Titanic dataset. It covers various stages, including data loading, preprocessing, visualization, feature engineering, modeling, and generating predictions for a Kaggle competition.

Data Loading and Preprocessing

The script starts by importing necessary libraries and loading the 'train.csv' and 'test.csv' datasets. It also combines both datasets for data preprocessing and exploration. Key steps in this section include:

  • Handling missing values and visualizing them using a heatmap.
  • Exploring and visualizing data distributions, especially 'Age' and 'Fare' columns.
  • Creating new features like 'Ticket_Count' and 'Price' based on ticket counts and fares.
  • Handling outliers in the 'Price' column.
  • Extracting titles from names and grouping them into categories.
  • Filling missing values in 'Age' and 'Embarked' columns.

Data Visualization and Exploration

The script uses various plots to explore the data and relationships between variables, including:

  • Bar plots to visualize the survival rate by different attributes.
  • Kernel density estimate (KDE) plots for 'Age' and 'Price' by survival status.
  • Heatmaps to visualize correlations between variables.
  • FacetGrid for kernel density estimation of 'Age' and 'Price' by survival.

Feature Engineering

Feature engineering involves:

  • Creating new features like 'IsAlone' and 'IsCabin' based on family size and cabin information.
  • Creating an 'Age_Range' feature by binning 'Age' values.
  • Handling missing values in the 'Fare' column.
  • Creating an interaction feature 'c*f' by multiplying 'Fare' and 'Pclass' columns.

Model Building and Evaluation

The script builds a machine learning model using several classifiers and a voting ensemble method. Key steps include:

  • Data encoding using the CatBoost encoder.
  • Trying multiple classifiers and the voting ensemble method.
  • Evaluating model performance with classification reports and confusion matrices.
  • Using cross-validation to estimate model accuracy.

Generating Predictions

The script generates predictions for the 'test.csv' dataset using the trained model. It also creates a submission file in the required format for the Kaggle competition.

Referenes: