Skip to content

This script processes the CICIDS2017 dataset for cybersecurity tasks, reading, cleaning, and merging data from multiple days. It explores feature distributions, reduces memory usage, handles imbalanced classes with SMOTE, and applies Random Forest models for classification, leveraging GPU acceleration if available.

License

Notifications You must be signed in to change notification settings

shashankp914/Intrusion-Detection-System-with-AI-ML

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 

Repository files navigation

This README file provides an overview of a machine learning project that analyzes the CICIDS2017 network traffic dataset for identifying malicious traffic.

Libraries Used:

  • pandas (data processing, CSV file I/O)
  • seaborn (data visualization)
  • matplotlib.pyplot (data visualization)
  • sklearn (machine learning algorithms)
  • numpy (numerical operations)

Data Preprocessing:

  1. Import Libraries: Import necessary libraries like pandas, seaborn, matplotlib, etc.
  2. Read Data: Read the CICIDS2017 dataset CSV files for different days (Monday, Tuesday, etc.) using pandas.read_csv().
  3. Clean Data:
    • Handle missing values (e.g., drop rows, fill with mean/median).
    • Remove irrelevant columns.
    • Encode categorical features (e.g., label encoding).
    • Reduce memory usage of dataframes (e.g., using pandas.DataFrame.dtypes).
    • Identify and handle meaningless features with only one unique value.
  4. Dimensionality Reduction: (Optional) Apply techniques like PCA or TSNE to visualize data in lower dimensions for easier analysis.
  5. Feature Selection: Analyze feature importance and select relevant features for model training.

Exploratory Data Analysis (EDA):

  1. Data Distribution: Analyze the distribution of features for different traffic types (benign, DoS, etc.) using bar plots, histograms, etc.
  2. Correlation Analysis: Identify correlations between features and the target variable (traffic type) using correlation coefficients (heatmap).
  3. Class Imbalance: Check for class imbalance in the target variable (unequal distribution of traffic types).
    • If imbalanced, apply oversampling techniques (e.g., SMOTE) to balance the data.

Machine Learning Model Training:

  1. Split Data: Split the preprocessed data into training and testing sets using sklearn.model_selection.train_test_split.
  2. Model Selection: Choose a suitable machine learning model for classification (e.g., Random Forest Classifier).
    • Consider using cuml library for GPU acceleration if available.
  3. Model Training: Train the model on the training data.
  4. Model Evaluation: Evaluate the model's performance on the testing data using metrics like accuracy, confusion matrix, classification report.

Results and Discussion:

  1. Present Results: Report the model's accuracy, confusion matrix, and classification report.
  2. Discuss Findings: Analyze the results, identify strengths and weaknesses of the model, discuss the impact of feature selection, etc.
  3. Future Work: Outline potential improvements and future work directions for the project.

Code Structure:

  • The code is likely organized with functions for data preprocessing, feature engineering, model training, and evaluation.
  • Comments are included to explain code sections.

About

This script processes the CICIDS2017 dataset for cybersecurity tasks, reading, cleaning, and merging data from multiple days. It explores feature distributions, reduces memory usage, handles imbalanced classes with SMOTE, and applies Random Forest models for classification, leveraging GPU acceleration if available.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published