jimitmistry/Microsoft-Malware-Prediction-pyspark

Microsoft-Malware-Prediction-pyspark

Built a scalable data science pipeline in Spark, using PySpark, to predict whether a PC has malware. The data consists of information about each machine's hardware and the antivirus software installed on it.

This dataset is available on Kaggle: https://www.kaggle.com/c/microsoft-malware-prediction

Models Used (PySpark):

1. Gradient Boosting Decision Tree Classifier
2. Random Forest Classifier
3. Logistic Regression

Pipeline Steps:

1. Data Preprocessing:

  a. Data Cleaning
  b. Feature Engineering
  c. Feature Selection
  d. Data Encoding

2. Exploratory Data Analysis

  a. Identification of correlated features
  b. Data Exploration
  c. Analysis to Support Model Interpretation

3. Model Implementation and Tuning:

  a. Build Default models as baseline models
  b. Experiment with hyperparameters (to identify the extreme values at which the models start to overfit)
  c. Tune the models by Grid Search using Cross Validation (based on the reasonable parameter grid identified by the experiments)
  d. Evaluate all the models in each step using AUC (a suitable metric for binary classification with balanced classes)

4. Feature Importance:

  a. Obtain Feature Importances from each model
  b. Identify the top 10 important features
  c. Analyse those features to give recommendations
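
As an illustration of step 3c, here is a minimal sketch of how a grid search with cross validation, scored by AUC, might be set up in PySpark for the Random Forest model. It is not the repository's actual code. The parameter values in `RF_GRID`, the column names `features` and `label`, the fold count and the helper names `expand_grid` and `build_cross_validator` are all assumptions made for this example.

```python
from itertools import product

# Hypothetical search space: the README does not list the values that were tried.
RF_GRID = {"maxDepth": [5, 10], "numTrees": [20, 50]}


def expand_grid(grid):
    """Expand {param: [values]} into one dict per parameter combination."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


def build_cross_validator(grid=RF_GRID, num_folds=3):
    """Return a CrossValidator that tunes a RandomForestClassifier by AUC.

    Needs an active SparkSession. The column names are assumed, not taken
    from the repository.
    """
    # Imported here so that expand_grid can be used without a Spark install.
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    rf = RandomForestClassifier(featuresCol="features", labelCol="label", seed=42)

    # Register every hyperparameter of the search space with the grid builder.
    builder = ParamGridBuilder()
    for name, values in grid.items():
        builder.addGrid(rf.getParam(name), values)

    # areaUnderROC is the AUC metric named in step 3d.
    evaluator = BinaryClassificationEvaluator(
        labelCol="label", metricName="areaUnderROC"
    )
    return CrossValidator(
        estimator=rf,
        estimatorParamMaps=builder.build(),
        evaluator=evaluator,
        numFolds=num_folds,
        seed=42,
    )


if __name__ == "__main__":
    # Number of candidate models that each fold would have to train.
    print(len(expand_grid(RF_GRID)), "parameter combinations")
```

With a hypothetical training DataFrame `train_df` that has a `features` vector column and a `label` column, `build_cross_validator().fit(train_df)` would return a fitted model whose `avgMetrics` attribute holds the mean AUC for each parameter combination. Keeping the grid small matters here, since every combination is trained once per fold.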