📧 Email Spam Classification Project

Overview

This project builds a machine learning model that classifies messages as spam or not spam using natural language processing (NLP) techniques. Although the project is framed around email, the model is trained on the SMS Spam Collection Dataset from Kaggle, which contains 5572 SMS messages labeled as spam or ham (not spam).

Motivation

Spam emails are a significant issue, causing inconvenience and security risks. This project aims to develop an effective spam classification model to help filter out unwanted messages, enhancing email security and user experience.

Problem Statement

The goal is to classify emails as spam or ham using NLP and machine learning techniques, with a focus on high precision so that false positives (legitimate messages flagged as spam) are kept to a minimum.

Not Spam Email

Success Metrics

The performance of the models is evaluated using the following metrics:

Accuracy
Precision

Spam Email

Because the dataset is imbalanced (ham messages far outnumber spam), accuracy alone can look high even for a weak spam filter, so precision is prioritized over accuracy.
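To make the trade-off concrete, here is a minimal sketch of how accuracy and precision are computed from confusion-matrix counts. The counts are hypothetical and chosen only for illustration; they are not results from this project.

```python
def accuracy(counts):
    """Share of all messages that were classified correctly."""
    total = counts["tp"] + counts["fp"] + counts["tn"] + counts["fn"]
    return (counts["tp"] + counts["tn"]) / total


def precision(counts):
    """Share of messages flagged as spam that really were spam."""
    flagged = counts["tp"] + counts["fp"]
    # Convention used here: precision is 0.0 when nothing was flagged.
    return counts["tp"] / flagged if flagged else 0.0


# Hypothetical test set with 900 ham and 100 spam messages.
# "always_ham" never flags anything, yet still scores 90% accuracy.
always_ham = {"tp": 0, "fp": 0, "tn": 900, "fn": 100}
# "filter_b" catches 80 spam messages and wrongly flags 20 ham messages.
filter_b = {"tp": 80, "fp": 20, "tn": 880, "fn": 20}

for name, counts in [("always_ham", always_ham), ("filter_b", filter_b)]:
    print(name, round(accuracy(counts), 3), round(precision(counts), 3))
```

The first classifier shows why accuracy is misleading on imbalanced data: it is right 90% of the time without catching a single spam message.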

Methodology

Data Cleaning 🧹
- Removed duplicates, handled missing values, and transformed the text data.
Exploratory Data Analysis (EDA) 📊
- Analyzed the distribution of spam and ham emails.
Text Preprocessing ✍️
- Converted text to lower case, removed stop words, and applied stemming.
Vectorization 🧮
- Used Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) techniques for text vectorization.
Model Building 🛠️
- Implemented various models including:
  - Multinomial Naive Bayes
  - Bernoulli Naive Bayes
  - Gaussian Naive Bayes
Evaluation 📈
- Evaluated models based on accuracy and precision.
Improvement 🔧
- Tuned hyperparameters and tried different vectorization techniques to improve performance.
Website 🌐
- Built a user-friendly web interface using Streamlit.
Deployment 🚀
- Deployed the application on Streamlit Cloud.
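
The preprocessing, vectorization, and model-building steps above can be sketched roughly as follows. This is an illustrative outline, not the project's actual notebook code: the function name transform_text, the inline stop-word list, the regex tokenizer, and the six toy messages are all assumptions made so the example runs without downloading any NLTK corpora.

```python
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Assumption: a small inline stop-word list; the project may instead use
# NLTK's stop-word corpus, which needs a separate download.
STOP_WORDS = {"a", "an", "and", "are", "at", "for", "i", "is", "me",
              "of", "the", "to", "we", "will", "you", "your"}

stemmer = PorterStemmer()


def transform_text(text):
    """Lower-case, tokenize, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(stemmer.stem(tok) for tok in tokens if tok not in STOP_WORDS)


# Toy corpus (hypothetical): 1 = spam, 0 = ham.
messages = [
    "WINNER! Claim your free prize now",
    "Free entry to win cash, text WIN now",
    "Urgent: claim your free cash prize",
    "Are we meeting for lunch today?",
    "I will call you after the meeting",
    "See you at lunch tomorrow",
]
labels = [1, 1, 1, 0, 0, 0]

# TF-IDF turns each cleaned message into a weighted term vector.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform([transform_text(m) for m in messages])

model = MultinomialNB()
model.fit(X, labels)


def predict(message):
    """Return 1 for spam and 0 for ham, reusing the fitted vectorizer."""
    features = vectorizer.transform([transform_text(message)])
    return int(model.predict(features)[0])


print(transform_text("Running to the MEETING!"))
print(predict("Claim your free cash prize now"), predict("Lunch meeting today?"))
```

Swapping TfidfVectorizer for CountVectorizer gives the Bag of Words variant mentioned in the Vectorization step; the rest of the sketch stays the same.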

Best Model

The Multinomial Naive Bayes model achieved the highest precision, so it was chosen for this project. Although BernoulliNB and GaussianNB showed better overall performance, MultinomialNB's high precision makes it the better fit for a filter whose main job is to avoid flagging legitimate messages as spam.
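
A comparison of this kind can be set up as in the sketch below, which fits the three Naive Bayes variants on the same features and ranks them by precision. The train and test messages are hypothetical toy data, so the printed scores say nothing about the project's real results; only the selection logic is the point.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Hypothetical toy split: 1 = spam, 0 = ham.
train_texts = [
    "free prize claim now", "win cash prize today", "urgent free cash offer",
    "claim your free entry", "lunch meeting today", "call me after work",
    "see you tomorrow at lunch", "are you coming home tonight",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["free cash prize", "win a free entry", "lunch tomorrow", "call me tonight"]
test_labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
# GaussianNB does not accept sparse matrices, so the features are made dense.
X_train = vectorizer.fit_transform(train_texts).toarray()
X_test = vectorizer.transform(test_texts).toarray()

models = {
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),
    "GaussianNB": GaussianNB(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, train_labels)
    predicted = model.predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(test_labels, predicted),
        # zero_division=0 avoids a warning if a model flags nothing as spam.
        "precision": precision_score(test_labels, predicted, zero_division=0),
    }

# Pick the model by precision first, matching the project's stated priority.
best_by_precision = max(scores, key=lambda name: scores[name]["precision"])
print(scores)
print("best by precision:", best_by_precision)
```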

Dataset

The raw dataset contained 5572 rows and 5 columns. After data cleaning and EDA, the focus was on two columns:

target: The label indicating if the message is spam or ham.
transformed_text: The cleaned and preprocessed text of the message.
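
As a rough illustration of how the raw table might be reduced to these columns, the sketch below uses a small in-memory DataFrame in place of the real file. The raw column names (v1, v2, and three unnamed columns) are an assumption about the CSV layout, the intermediate text column is hypothetical, and the lower-casing step is only a stand-in for the full text preprocessing.

```python
import pandas as pd

# Stand-in for the raw file: 5 columns, of which only two carry data.
# Column names are assumed, not taken from the project.
raw = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "spam"],
    "v2": ["See you at lunch", "Free prize, call now",
           "See you at lunch", "Win cash today"],
    "Unnamed: 2": [None] * 4,
    "Unnamed: 3": [None] * 4,
    "Unnamed: 4": [None] * 4,
})

# Keep the label and message columns and give them descriptive names.
df = raw[["v1", "v2"]].rename(columns={"v1": "target", "v2": "text"})

# Drop rows with missing text, then remove exact duplicate rows.
df = df.dropna(subset=["text"]).drop_duplicates(keep="first").reset_index(drop=True)

# Encode the label numerically: ham -> 0, spam -> 1.
df["target"] = df["target"].map({"ham": 0, "spam": 1})

# Placeholder for the real preprocessing (lower-casing, stop words, stemming).
df["transformed_text"] = df["text"].str.lower()

print(df[["target", "transformed_text"]])
print("spam share:", df["target"].mean())
```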

Requirements

The following libraries were used in this project:

Streamlit
NLTK
Pandas
NumPy
Scikit-learn
Wordcloud

Steps Followed

Data Cleaning 🧹
EDA 📊
Text Preprocessing ✍️
Model Building 🛠️
Evaluation 📈
Improvement 🔧
Website 🌐
Deployment 🚀
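
For the last two steps, a web app usually needs the fitted vectorizer and model saved to disk and reloaded at start-up. The sketch below shows one way to do that round trip with pickle. The file names vectorizer.pkl and model.pkl, the helper names, the temporary directory, and the toy training data are assumptions for illustration, and the Streamlit layer itself is left out.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data: 1 = spam, 0 = ham.
texts = [
    "free prize claim now", "win cash prize now", "urgent free cash offer",
    "lunch meeting today", "call me after work", "see you tomorrow at lunch",
]
labels = [1, 1, 1, 0, 0, 0]

vectorizer = TfidfVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

# Save both objects; the app must reuse the SAME fitted vectorizer.
artifact_dir = Path(tempfile.mkdtemp())
with open(artifact_dir / "vectorizer.pkl", "wb") as fh:
    pickle.dump(vectorizer, fh)
with open(artifact_dir / "model.pkl", "wb") as fh:
    pickle.dump(model, fh)


def load_artifacts(directory):
    """Reload the fitted vectorizer and model from a directory."""
    with open(Path(directory) / "vectorizer.pkl", "rb") as fh:
        loaded_vectorizer = pickle.load(fh)
    with open(Path(directory) / "model.pkl", "rb") as fh:
        loaded_model = pickle.load(fh)
    return loaded_vectorizer, loaded_model


def classify(message, fitted_vectorizer, fitted_model):
    """Return a display label for a single message."""
    features = fitted_vectorizer.transform([message])
    return "Spam" if fitted_model.predict(features)[0] == 1 else "Not Spam"


app_vectorizer, app_model = load_artifacts(artifact_dir)
print(classify("claim your free prize", app_vectorizer, app_model))
print(classify("lunch tomorrow", app_vectorizer, app_model))
```

In a Streamlit script, classify would typically be called with the text a user enters in a widget such as st.text_area, and the returned label shown on the page. In the real app, the same text preprocessing used during training would also have to be applied before vectorizing.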

Conclusion

This project successfully built an email spam classifier with high precision using the Multinomial Naive Bayes model. The application is deployed and accessible through a user-friendly Streamlit interface.

