Spotting The Wolf In Sheep’s Clothing: Malware Detection for Android Applications Based on Structured Heterogeneous Information Networks

Braden Riggs, Raya Kavosh | Department of Data Science | University of California, San Diego, USA


Based on a paper: https://www.cse.ust.hk/~yqsong/papers/2017-KDD-HINDROID.pdf

Written report by Braden Riggs & Raya Kavosh: https://docs.google.com/document/d/1yZ8BqL1IgKfWMAvT7HqVKmIjasgUteZnEZf0Wczic34/edit?usp=sharing

Running Package:

In command line: >>> python3 run.py

To run a test: >>> python3 run.py -t (or python3 run.py --test, or python3 run.py --Test)
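As an illustration of how the three equivalent test flags could be accepted, here is a minimal sketch using argparse. This is an assumption about how such a flag might be handled, not the actual contents of run.py; the function name parse_args and the attribute name test are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line flags; hypothetical stand-in for run.py's own handling."""
    parser = argparse.ArgumentParser(description="Run the pipeline (illustrative sketch).")
    # All three spellings from the README map to the same boolean attribute.
    parser.add_argument(
        "-t", "--test", "--Test",
        dest="test",
        action="store_true",
        help="run the pipeline in test mode",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print("test mode" if args.test else "full run")
```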

Docker Container: >>> dockerfile

For EDA notebook:

First run the "python3 run.py" command; this creates the JSON files the EDA uses. Once the data_extract directory is populated with these files, the EDA notebook can run.

For adjusting data ingestion and params: config/data_params.json

There are 4 config files to adjust:

  • config/data_params.json

        mal_fp: Location of malware apps
    
        benign_fp: Location of benign apps
    
        limiter: if set to false, the pipeline processes every app in the directory; otherwise it processes only the number of apps specified by lim_mal and lim_benign
    
        lim_mal: maximum number of malware apps parsed
    
        lim_benign: maximum number of benign apps parsed
    
  • config/dict_build.json

        directory: filepath to find processed files
        
        verbose: if set to true, more print statements are emitted to help track progress
        
        truncate: if set to true, matrices A, B, P, and I have all APIs at or below the lower_bound_api_count threshold filtered out, speeding up runtime significantly
        
        lower_bound_api_count: APIs occurring this many times or fewer are filtered out; values greater than 1 can result in accuracy loss
    
  • config/parsing_data.json

        multithreading: if enabled, speeds up the feature parsing stage
        
        out_path: output path for created files
        
        verbose: if set to true, more print statements are emitted to help track progress
    
  • config/model.json

        multithreading: if enabled, speeds up the model training stage
        
        test_split: portion of the data held out for testing model performance
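To make the limiter and truncate settings above concrete, here is a minimal sketch of how they might be applied. It is an illustration under assumptions, not the project's implementation: the helper names select_apps and truncate_apis, the example paths, and the choice to count API occurrences across all apps are all hypothetical.

```python
import json
from collections import Counter

# Example values only; the keys follow config/data_params.json, the paths are made up.
params = json.loads("""
{
    "mal_fp": "data/malware",
    "benign_fp": "data/benign",
    "limiter": true,
    "lim_mal": 2,
    "lim_benign": 1
}
""")


def select_apps(mal_apps, benign_apps, params):
    """Return the apps to process, honoring limiter, lim_mal, and lim_benign."""
    if not params["limiter"]:
        # limiter is false: process every app found.
        return list(mal_apps), list(benign_apps)
    return list(mal_apps)[:params["lim_mal"]], list(benign_apps)[:params["lim_benign"]]


def truncate_apis(app_to_apis, lower_bound_api_count):
    """Drop APIs whose total count is equal to or less than the lower bound.

    Assumption: counts are total occurrences across all apps.
    """
    counts = Counter(api for apis in app_to_apis.values() for api in apis)
    keep = {api for api, n in counts.items() if n > lower_bound_api_count}
    return {app: [a for a in apis if a in keep] for app, apis in app_to_apis.items()}


mal, benign = select_apps(["m1", "m2", "m3"], ["b1", "b2"], params)
filtered = truncate_apis({"app_a": ["x", "y"], "app_b": ["x"]}, lower_bound_api_count=1)
```

With a lower bound of 1, an API seen only once is removed, which shrinks the matrices while keeping APIs shared between apps.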
    

Results:

The models were trained and tested on a selection of 96 apps: 48 benign and 48 malicious. We chose 96 because it divides evenly into 12 groups of 8, allowing us to multithread the feature extraction and matrix creation and effectively cut computation time by a factor of 8. Even so, it took over 2 hours to extract the features, train the model, and evaluate performance. This balanced dataset was then split, with 70% of the apps used for training and 30% used for testing. Additionally, we tested a Logistic Regression model included with the EDA portion of our project; this model is the baseline against which we evaluate the performance of our new SVM kernels. The logistic regression model was trained on a range of features, including the unique APIs in each app and various method counts. The performance of the baseline logistic regression model and the custom SVM kernels is as follows:

[Figure: performance of the baseline logistic regression model and the custom SVM kernels]

For analysis and further details see: https://docs.google.com/document/d/1yZ8BqL1IgKfWMAvT7HqVKmIjasgUteZnEZf0Wczic34/edit?usp=sharing
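The batching and 70/30 split described in the Results section can be sketched as follows. This is a hedged illustration, not the project's code: the helper names batches and stratified_split are hypothetical, the app names are placeholders, and splitting per class (so both sets stay balanced) with round() for the test size is an assumption.

```python
import random


def batches(items, size):
    """Split items into consecutive groups of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def stratified_split(apps, labels, test_split, seed=0):
    """Shuffle each class separately and hold out `test_split` of it for testing."""
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(set(labels)):
        group = [a for a, l in zip(apps, labels) if l == label]
        rng.shuffle(group)
        n_test = round(len(group) * test_split)  # assumption: round to nearest app
        test += group[:n_test]
        train += group[n_test:]
    return train, test


apps = [f"app_{i}" for i in range(96)]   # placeholder names
labels = [0] * 48 + [1] * 48             # 0 = benign, 1 = malicious

groups = batches(apps, 8)                # 96 apps -> 12 groups of 8
train, test = stratified_split(apps, labels, test_split=0.3)
```

Under these assumptions each class contributes 14 test apps, so the split is 68 training apps and 28 test apps; a different rounding rule would shift this by an app or two.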

Acknowledgments:

Special Thanks to Aaron Fraenkel and Shivam Lakhotia for mentoring this project.

Thanks to the UCSD-DSMLP server for hosting the project.