Microsoft-Malware-Prediction-pyspark

Built a scalable data science pipeline in Spark, using PySpark, to predict whether a PC is infected with malware. The data consists of information about each machine's hardware and the antivirus software installed on it.

This dataset is available on Kaggle: https://www.kaggle.com/c/microsoft-malware-prediction

Models Used (PySpark):

1. Gradient Boosting Decision Tree Classifier
2. Random Forest Classifier
3. Logistic Regression
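
As a rough illustration of the three model families listed above, the sketch below instantiates the corresponding `pyspark.ml.classification` estimators with their default hyperparameters. The helper `make_baseline`, the dictionary keys, and the column names `features` and `HasDetections` are assumptions made for this example, not names taken from the notebooks. The PySpark import is deferred into the function so the snippet loads even where Spark is not installed.

```python
# Keys for the three model families listed above (hypothetical names).
BASELINE_MODELS = ("gbt", "random_forest", "logistic_regression")


def make_baseline(name, features_col="features", label_col="HasDetections"):
    """Return an untrained PySpark estimator with default hyperparameters.

    The column names are assumptions; adjust them to the prepared DataFrame.
    """
    if name not in BASELINE_MODELS:
        raise ValueError(f"unknown model {name!r}; expected one of {BASELINE_MODELS}")

    # Imported lazily so this snippet can be loaded without PySpark.
    from pyspark.ml.classification import (
        GBTClassifier,
        LogisticRegression,
        RandomForestClassifier,
    )

    estimators = {
        "gbt": GBTClassifier,
        "random_forest": RandomForestClassifier,
        "logistic_regression": LogisticRegression,
    }
    return estimators[name](featuresCol=features_col, labelCol=label_col)
```

Calling `make_baseline("gbt")` requires an active SparkSession, because PySpark estimators create their JVM counterpart on construction. The returned estimator is then trained with `.fit(train_df)`.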

Pipeline Steps:

1. Data Preprocessing:

  a. Data Cleaning
  b. Feature Engineering
  c. Feature Selection
  d. Data Encoding
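
The encoding step could be sketched as follows, assuming Spark 3.x (where `OneHotEncoder` accepts `inputCols`/`outputCols`) and assuming that categorical columns are indexed, one-hot encoded, and then assembled with the numeric columns into a single vector. The helper names, the `_idx`/`_ohe` suffixes, and the `features` output column are hypothetical; the notebooks may encode the data differently.

```python
def encoded_names(categorical_cols):
    """Derive (hypothetical) index and one-hot column names per categorical column."""
    index_cols = [f"{c}_idx" for c in categorical_cols]
    onehot_cols = [f"{c}_ohe" for c in categorical_cols]
    return index_cols, onehot_cols


def build_encoding_pipeline(categorical_cols, numeric_cols, output_col="features"):
    """Assemble StringIndexer -> OneHotEncoder -> VectorAssembler stages."""
    # Imported lazily so this snippet can be loaded without PySpark.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

    index_cols, onehot_cols = encoded_names(categorical_cols)

    # One indexer per categorical column; "keep" sends unseen labels to an extra bucket.
    indexers = [
        StringIndexer(inputCol=col, outputCol=idx, handleInvalid="keep")
        for col, idx in zip(categorical_cols, index_cols)
    ]
    encoder = OneHotEncoder(inputCols=index_cols, outputCols=onehot_cols)

    # Combine encoded and numeric columns into the single vector the models consume.
    assembler = VectorAssembler(
        inputCols=onehot_cols + list(numeric_cols), outputCol=output_col
    )
    return Pipeline(stages=indexers + [encoder, assembler])
```

A call such as `build_encoding_pipeline(["ProductName"], ["Census_ProcessorCoreCount"]).fit(df).transform(df)` would add the `features` column; the two column names here are used only as examples. `VectorAssembler` raises an error on null values by default, so missing numeric values need to be handled during data cleaning first.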

2. Exploratory Data Analysis

  a. Identification of correlated features
  b. Data exploration
  c. Analysis to support model interpretation
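
One way to identify correlated features is to scan a correlation matrix for pairs whose absolute correlation exceeds a cutoff. The sketch below does this in plain Python; the function name, the 0.9 threshold, and the toy matrix are assumptions for illustration only. In PySpark, such a matrix could be obtained with `pyspark.ml.stat.Correlation.corr(df, "features")`, which returns a one-row DataFrame holding the matrix.

```python
from itertools import combinations


def correlated_pairs(names, corr_matrix, threshold=0.9):
    """Return (feature_a, feature_b, r) for pairs with |r| >= threshold, strongest first."""
    if len(names) != len(corr_matrix):
        raise ValueError("names and corr_matrix must have the same length")

    pairs = []
    # Visit each unordered pair once (upper triangle of the matrix).
    for i, j in combinations(range(len(names)), 2):
        r = corr_matrix[i][j]
        if abs(r) >= threshold:
            pairs.append((names[i], names[j], r))
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Toy matrix with made-up values, only to show the call.
names = ["f1", "f2", "f3"]
matrix = [
    [1.0, 0.95, 0.10],
    [0.95, 1.0, -0.92],
    [0.10, -0.92, 1.0],
]
print(correlated_pairs(names, matrix))
# [('f1', 'f2', 0.95), ('f2', 'f3', -0.92)]
```

For each reported pair, one of the two features is a candidate for removal during feature selection.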

3. Model Implementation and Tuning:

  a. Build models with default settings as baselines
  b. Experiment with hyperparameters (to identify the extreme values at which each hyperparameter causes overfitting)
  c. Tune the models by grid search with cross validation (based on the reasonable parameter grid identified by the experiments)
  d. Evaluate the models at each step using AUC (area under the ROC curve), a suitable metric for binary classification with balanced classes
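
The tuning step can be sketched with PySpark's `ParamGridBuilder`, `CrossValidator`, and `BinaryClassificationEvaluator`. This is a minimal sketch under assumptions: the helper names, the `HasDetections` label column, and the default of 3 folds are illustrative and not taken from the notebooks. The small `count_fits` helper shows why the grid should be narrowed by experiments first, since the number of fits grows multiplicatively with the grid.

```python
from math import prod


def count_fits(param_values, num_folds):
    """Fits performed during cross validation (excluding the final refit of the best model)."""
    return prod(len(values) for values in param_values.values()) * num_folds


def build_cross_validator(estimator, grid, label_col="HasDetections", num_folds=3):
    """Wrap an estimator in a CrossValidator scored by area under the ROC curve.

    `grid` maps estimator Param objects (e.g. estimator.maxDepth) to candidate values.
    """
    # Imported lazily so this snippet can be loaded without PySpark.
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    builder = ParamGridBuilder()
    for param, values in grid.items():
        builder = builder.addGrid(param, values)

    evaluator = BinaryClassificationEvaluator(
        labelCol=label_col, metricName="areaUnderROC"
    )
    return CrossValidator(
        estimator=estimator,
        estimatorParamMaps=builder.build(),
        evaluator=evaluator,
        numFolds=num_folds,
    )


# Hypothetical grid: 3 depths x 2 forest sizes, 3 folds.
print(count_fits({"maxDepth": [3, 5, 7], "numTrees": [20, 50]}, num_folds=3))  # 18
```

With a random forest estimator `rf`, a call might look like `build_cross_validator(rf, {rf.maxDepth: [3, 5, 7], rf.numTrees: [20, 50]}).fit(train_df)`; the resulting model exposes the averaged AUC per parameter combination through `avgMetrics`.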

4. Feature Importance:

  a. Obtain Feature Importances from each model
  b. Identify the top 10 most important features
  c. Analyse those features to give recommendations
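
Ranking the importances can be sketched in plain Python as below. The function name and the toy scores are assumptions for illustration. In PySpark, fitted tree models expose a `featureImportances` vector and a fitted logistic regression exposes `coefficients`; either can be converted with `.toArray()` and passed in (for coefficients, their absolute values would be one possible proxy for importance).

```python
def top_features(names, importances, k=10):
    """Pair feature names with importance scores and return the k largest."""
    if len(names) != len(importances):
        raise ValueError("names and importances must have the same length")
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy scores with made-up values, only to show the call.
print(top_features(["f1", "f2", "f3"], [0.2, 0.5, 0.3], k=2))
# [('f2', 0.5), ('f3', 0.3)]
```

After one-hot encoding, the assembled vector has one slot per category rather than per original column, so `names` must list the vector slots in order for the pairing to be correct.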

Notebooks:

1. Create Dataframe.ipynb
2. EDA 1.ipynb
3. EDA 2.ipynb
4. EDA_ATTR(29-55).ipynb
5. GBM Classifier.ipynb
6. Preprocessing part 1.ipynb
7. Preprocessing part 2.ipynb
8. Random Forest Classifier.ipynb

jimitmistry/Microsoft-Malware-Prediction-pyspark