
Predicting whether Android apps are malicious or benign from their requested permissions, using machine learning.


pankaj-2k01/Android-Malware-Detection-System-Using-Machine-Learning


Machine Learning Final Project

Purpose:

Course project for CSE343: Machine Learning at IIITD, under the guidance of Professor Jainendra Shukla.

Contributors:

Motivation:

As the Android market booms, the number of apps with malicious behaviour keeps growing. According to ZDNet, 10-24% of the apps on the Play Store could be malicious. On the surface these apps look like any other app, but they exploit the user's system in various harmful ways. Current malware detection methods are resource-heavy and exhaustive, yet they fail to keep pace with new malware.

What can help us overcome these challenges?

  • A strategy that can assess and analyse the data of confirmed malicious applications.
  • A model that can accurately predict whether an application is malicious based on its permissions.
  • A machine learning malware detection model that relies on publicly available metadata.

Introduction:

Despite the growth in malware, there is not yet an effective and robust method for detecting malicious applications. Given the increasing applicability of machine learning across domains, we believe malware detection can be addressed with machine learning techniques. Our project aims at a detailed and systematic study of malware detection using machine learning, and at building an efficient ML model that classifies apps as benign (0) or malware (1) based on the permissions they request. This study proposes:

  • Examining and Evaluating Android metadata and Permissions as Malware Predictors
  • Proposing a machine learning malware detection strategy that relies on publicly available metadata information.
  • Analysing such a model and determining its utility as a first-stage malware filter for Android malware detection

Dataset Description:

Details:

  • The dataset is taken from Kaggle and contains the permission details of almost 30k apps.
  • There are 183 features in the dataset like Dangerous Permissions Count, Default : Access DRM content, Default : Move application resource, etc.
  • There is one target class (binary- 0/1) named - ‘Class’, indicating Benign(0) and Malware(1) applications.
  • There are 29,999 records: 20,000 malware and 9,999 benign apps.
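
The class counts above imply a noticeable imbalance. As a small worked example (plain arithmetic on the numbers quoted in this list, not code from the project), the share of malware in the dataset is also the accuracy that a trivial "always predict malware" classifier would reach, which is a useful reference point when reading accuracy figures:

```python
# Class counts as reported in the dataset description.
n_malware = 20_000
n_benign = 9_999
n_total = n_malware + n_benign

# Fraction of records labelled malware (class 1).
malware_share = n_malware / n_total

# A classifier that labels every app as malware is right on every
# malware record and wrong on every benign one, so its accuracy
# equals the malware share.
majority_baseline_accuracy = malware_share

print(f"total records: {n_total}")
print(f"malware share / majority baseline accuracy: {majority_baseline_accuracy:.3f}")
```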

Preprocessing, Visualization and Analysis: Data is read from a CSV file into a dataframe for easy use. Only the required attributes are kept: all columns carrying information other than permissions are removed, and app names are mapped to an index so that their information can be accessed easily. The data is checked for null/missing values, which are replaced by the mean of the corresponding column. The data is then analysed in terms of the distribution of malware and benign applications in various settings, and several plots are built to understand and visualise the results. Matplotlib and Seaborn are used for plotting and visualization.
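
The preprocessing steps described here can be sketched roughly as follows. This is a minimal illustration, not the project's notebook: the tiny in-memory dataframe stands in for the Kaggle CSV, and the `App` and `Category` column names are assumptions made for the example (only `Class` and the two permission column names appear in the dataset description).

```python
import numpy as np
import pandas as pd

# Stand-in for the Kaggle CSV; in practice this would come from pd.read_csv(...).
df = pd.DataFrame({
    "App": ["a", "b", "c", "d"],                          # hypothetical column name
    "Category": ["Tools", "Games", "Tools", "Social"],    # hypothetical column name
    "Dangerous Permissions Count": [3.0, np.nan, 7.0, 2.0],
    "Default : Access DRM content": [0.0, 1.0, np.nan, 0.0],
    "Class": [0, 1, 1, 0],                                # 0 = benign, 1 = malware
})

# Map app names to their row index for easy lookup later.
app_index = {name: i for i, name in enumerate(df["App"])}

# Drop columns that carry information other than permissions.
non_permission_cols = ["App", "Category"]
features = df.drop(columns=non_permission_cols + ["Class"])
target = df["Class"]

# Replace missing values with the mean of each column.
features = features.fillna(features.mean())

# Class distribution, the quantity shown in the class distribution plots.
class_counts = target.value_counts().sort_index()
print(class_counts)
```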

Plots:

[Figures: Unsampled Class Distribution; Undersampled Class Distribution; Oversampled Class Distribution; Classification of Apps using Categories]

Methodology:

After preprocessing, the data is split into training and testing sets in an 8:2 ratio. We applied both undersampling and oversampling to the dataset, but the results were not promising. On the sampled data we tried several classifiers, including logistic regression, decision trees, and Naive Bayes, and the outcomes were unsatisfactory. Inspecting the dataset showed that it is highly multivariate, so we applied PCA and plotted the variance percentage against the PCA features. Based on that plot, we chose to use the inverse transform.

We then applied further classifiers to the resulting dataset. Random Forest came first and gave a considerable improvement in accuracy. Next, we used boosting to increase prediction accuracy, both on the unsampled dataset and on the dataset obtained after selecting reliable features; the results show that the model improves. Finally, we applied SVM and MLP to the final dataset and obtained some of our best results. Comparing the results before and after feature selection shows that accuracy improved.
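
The split and the two sampling strategies can be sketched as below. This is an illustration under stated assumptions rather than the project's code: the data is synthetic with a 2:1 malware-to-benign ratio, the split is stratified (the README does not say whether it was), and resampling is done with `sklearn.utils.resample` on the training portion only, whereas the project may have used a different library or resampled before splitting.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic binary permission matrix: 200 malware (1) and 100 benign (0) apps.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 10)).astype(float)
y = np.array([1] * 200 + [0] * 100)

# 8:2 train/test split; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

X_major = X_train[y_train == 1]   # malware is the majority class
X_minor = X_train[y_train == 0]   # benign is the minority class

# Undersampling: shrink the majority class to the minority size.
X_major_down = resample(X_major, replace=False, n_samples=len(X_minor), random_state=0)
X_under = np.vstack([X_major_down, X_minor])
y_under = np.array([1] * len(X_major_down) + [0] * len(X_minor))

# Oversampling: grow the minority class (with replacement) to the majority size.
X_minor_up = resample(X_minor, replace=True, n_samples=len(X_major), random_state=0)
X_over = np.vstack([X_major, X_minor_up])
y_over = np.array([1] * len(X_major) + [0] * len(X_minor_up))

print(len(y_train), len(y_under), len(y_over))
```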


PCA features vs Variance Percentage
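
A plot of this kind can be derived from scikit-learn's `PCA` as in the sketch below, which also shows the inverse transform mentioned in the methodology. The data is synthetic, and the 95% variance threshold is an assumption chosen for illustration; the README does not state how many components were kept.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 20 correlated features driven by 5 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
X = latent @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(200, 20))

# Keep as many components as needed to explain 95% of the variance
# (the threshold is an assumption for this example).
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

# Per-component and cumulative variance percentage, the quantities
# plotted in a "PCA features vs variance percentage" chart.
variance_pct = pca.explained_variance_ratio_ * 100
cumulative_pct = np.cumsum(variance_pct)

# Inverse transform: project the reduced data back into the original
# feature space, giving a denoised version of X with the original shape.
X_restored = pca.inverse_transform(X_reduced)

print(X_reduced.shape, X_restored.shape)
print(np.round(cumulative_pct, 1))
```

The restored matrix keeps the original column layout, so classifiers trained on it still see one column per permission feature.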

Libraries Used:

Results and Analysis:

On Basic Models

| Model | Metric | Unsampled | Oversampled | Undersampled |
|---|---|---|---|---|
| Logistic Regression | Training Accuracy | 0.69 | 0.63 | 0.63 |
| Logistic Regression | Test Accuracy | 0.68 | 0.62 | 0.63 |
| Logistic Regression | Recall Score | 0.95 | 0.66 | 0.67 |
| Logistic Regression | ROC Score | 0.53 | 0.61 | 0.62 |
| Naive Bayes | Training Accuracy | 0.68 | 0.53 | 0.53 |
| Naive Bayes | Test Accuracy | 0.67 | 0.53 | 0.53 |
| Naive Bayes | Recall Score | 0.97 | 0.98 | 0.99 |
| Naive Bayes | ROC Score | 0.52 | 0.51 | 0.50 |
| Decision Tree | Training Accuracy | 0.67 | 0.57 | 0.55 |
| Decision Tree | Test Accuracy | 0.67 | 0.55 | 0.56 |
| Decision Tree | Recall Score | 0.99 | 0.68 | 0.79 |
| Decision Tree | ROC Score | 0.51 | 0.54 | 0.55 |

As the table shows, sampling is not effective in our case, so we move forward with the unsampled data only.
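
To help read the table above, the sketch below shows how the four reported metrics relate to a confusion matrix. The counts are hypothetical, chosen only to resemble the unsampled logistic regression row, and the "ROC score" is assumed to be computed from hard 0/1 predictions, in which case it equals the mean of recall and specificity. Under that assumption, high recall combined with a ROC score near 0.5 indicates a model that labels almost every app as malware.

```python
# Hypothetical confusion-matrix counts for a test set of 6,000 apps
# (4,000 malware, 2,000 benign); not taken from the project.
tp, fn = 3800, 200    # malware predicted as malware / as benign
tn, fp = 220, 1780    # benign predicted as benign / as malware

accuracy = (tp + tn) / (tp + fn + tn + fp)
recall = tp / (tp + fn)            # true positive rate on malware
specificity = tn / (tn + fp)       # true negative rate on benign apps

# With hard 0/1 predictions, the ROC AUC reduces to the mean of
# recall and specificity (also called balanced accuracy).
roc_from_labels = (recall + specificity) / 2

print(f"accuracy={accuracy:.2f} recall={recall:.2f} "
      f"specificity={specificity:.2f} roc={roc_from_labels:.2f}")
```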

| Model | Optimal Parameters | Training Accuracy | Test Accuracy | Recall | ROC |
|---|---|---|---|---|---|
| SVM | default | 0.85 | 0.85 | 0.94 | 0.80 |
| Random Forest | n_estimators=200, n_jobs=-1 | 0.87 | 0.86 | 0.93 | 0.81 |
| MLP | random_state=42, max_iter=300 | 0.85 | 0.85 | 0.95 | 0.80 |

The results show that all three models perform more or less the same, with Random Forest slightly ahead at a test accuracy of 86%. As seen in the table, accuracy follows the order Random Forest > MLP > SVM, although MLP and SVM are tied at 0.85 to two decimal places.
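
The comparison above can be reproduced in outline with scikit-learn, using the parameters listed in the table. This is a minimal sketch on synthetic data from `make_classification`, so its numbers will not match the table; the stratified split and the computation of the ROC score from predicted labels are assumptions, not details confirmed by the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the permission dataset (about 2/3 malware).
X, y = make_classification(
    n_samples=600, n_features=20, weights=[1 / 3, 2 / 3], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Parameters as listed in the results table.
models = {
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(n_estimators=200, n_jobs=-1),
    "MLP": MLPClassifier(random_state=42, max_iter=300),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "train_accuracy": accuracy_score(y_train, model.predict(X_train)),
        "test_accuracy": accuracy_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        # Assumption: ROC score computed from hard predictions.
        "roc": roc_auc_score(y_test, y_pred),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 2) for k, v in metrics.items()})
```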

Conclusion:

Through this project we learned:

  • Different ways to visualize the data for a better understanding of the features.
  • How to use machine learning models such as Logistic Regression, Naive Bayes and Decision Tree to model the problem.
  • How to use platforms like Kaggle and Google Colab.
  • How to work and collaborate in teams.

References:

  • [1] Dynamic Permissions based Android Malware Detection using Machine Learning Techniques

  • [2] Machine Learning for Android Malware Detection Using Permission and API Calls

  • [3] Android Permission Dataset
