luuisotorres/Kaggle-Titanic-Machine-Learning-Competition-with-PySpark


Kaggle: Titanic Machine Learning Competition with PySpark

--

Python Jupyter Notebook

--

About

I recently completed a course on PySpark, and this notebook is my first attempt at working with it and learning how it can be used for EDA and machine learning.

PySpark is the Python interface for Apache Spark. It lets you write Spark applications using Python APIs and is helpful for working with real-time and large-scale data.

The Titanic Machine Learning Competition

This project is based on the Titanic dataset provided by the Titanic ML challenge on Kaggle. The task is to build a machine learning model that predicts whether a passenger survived, based on attributes such as socio-economic class, age, and gender.

The model is evaluated by its accuracy score, i.e., the percentage of passengers whose outcome is predicted correctly.

This is a binary classification problem, and the classes used for predictions are 1 for survived and 0 for deceased.
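
To make the accuracy metric concrete, here is a minimal sketch in plain Python. The `accuracy` helper and the label lists are made up for illustration; they are not taken from the notebook or from the Kaggle data.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Illustrative labels only: 1 = survived, 0 = deceased.
y_true = [1, 0, 0, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]

# 6 of the 8 predictions match the true labels.
print(accuracy(y_true, y_pred))  # 0.75
```

Multiplying the result by 100 gives the percentage of correctly predicted passengers.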

I used PySpark for exploratory data analysis and data cleansing, and to build logistic regression, random forest classifier, and GBTClassifier models.

Libraries used

  • PySpark

Kaggle

You can also see this notebook on Kaggle. Just click here to see it.

Author

Luís Fernando Torres