Friday 29 March 2024

Spam Mail Prediction using Machine Learning

This project involves building a spam mail detector using Python within the Google Colab environment. By leveraging machine learning techniques, we aim to automatically classify emails as either spam or legitimate. The detector will enhance user security by filtering out potentially harmful emails.

Source code (with description)

Importing the Dependencies

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
  • Importing Libraries: The code begins by importing necessary libraries such as NumPy, Pandas, scikit-learn's train_test_split, TfidfVectorizer, LogisticRegression, and accuracy_score from sklearn.metrics.

  • Data Preparation: The workflow assumes a dataset that contains the text of each email along with a label indicating whether it is spam or not. This dataset is loaded into a pandas DataFrame.

  • Splitting Data: The train_test_split function from scikit-learn is used to split the dataset into training and testing sets. This step is crucial for evaluating the performance of the spam mail detector.

  • Feature Extraction: The TfidfVectorizer is employed to convert text data into numerical vectors. TF-IDF stands for Term Frequency-Inverse Document Frequency. A word receives a higher weight when it occurs often in a given email but in few emails across the dataset, and a lower weight when it occurs in almost every email.

  • Model Training: A logistic regression model is chosen for training. Logistic regression is a commonly used algorithm for binary classification tasks like spam detection.

  • Model Evaluation: Once the model is trained, it's evaluated using the testing dataset, and the accuracy score is computed to assess how well the model performs in distinguishing between spam and non-spam emails.

Overall, this code snippet demonstrates the process of building a simple yet effective spam mail detector using Python and scikit-learn libraries.
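
To make the TF-IDF weighting described above more concrete, here is a minimal sketch on a made-up three-message corpus (not the mail_data.csv dataset). It assumes scikit-learn's default settings, under which the inverse document frequency of a term is ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term. The variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration
docs = ["free prize now", "meeting at noon", "free meeting"]

vectorizer = TfidfVectorizer()
vectorizer.fit(docs)

n_docs = len(docs)
vocab = vectorizer.vocabulary_  # maps each term to its column index

# Document frequency: how many documents contain the term
df_free = sum("free" in d.split() for d in docs)    # appears in 2 documents
df_prize = sum("prize" in d.split() for d in docs)  # appears in 1 document

# Smoothed IDF, computed by hand
idf_free = np.log((1 + n_docs) / (1 + df_free)) + 1
idf_prize = np.log((1 + n_docs) / (1 + df_prize)) + 1

# Compare the hand-computed values with the ones the vectorizer learned
print(idf_free, vectorizer.idf_[vocab["free"]])
print(idf_prize, vectorizer.idf_[vocab["prize"]])
```

The rarer term ("prize") ends up with the larger IDF, which is why words that appear in only a few emails contribute more to the feature vector than words that appear everywhere.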

Data Collection & Pre-Processing

# loading the data from the CSV file into a pandas DataFrame
raw_mail_data = pd.read_csv('/content/mail_data.csv')

Code Description: Loading the Dataset

This line of code reads the CSV file 'mail_data.csv' into a pandas DataFrame called raw_mail_data. It is the first step of preprocessing; every later step works on this DataFrame.

print(raw_mail_data)

Code Description: Printing the Raw Data

This line of code prints the contents of the DataFrame raw_mail_data. It lets you visually inspect the loaded data and check its columns and contents before any preprocessing.

# replace the null values with an empty string
mail_data = raw_mail_data.where(pd.notnull(raw_mail_data), '')

Code Description: Handling Null Values

This line of code replaces any null values in raw_mail_data with an empty string '' and stores the result in mail_data. Without this step, a missing message would be a NaN value, which the text vectorizer used later cannot process.
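
As a small illustration of the null-handling step, the sketch below applies the same where(...) call to a made-up three-row DataFrame and compares it with pandas' fillna(''), which is a common alternative. The toy data and variable names are assumptions for the example, not part of the original dataset.

```python
import numpy as np
import pandas as pd

# Toy data with one missing message (invented for illustration)
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["See you at 5", np.nan, "Lunch?"],
})

# Approach used in this post: keep non-null cells, replace the rest with ''
cleaned_where = toy.where(pd.notnull(toy), '')

# Alternative: fill missing cells directly
cleaned_fillna = toy.fillna('')

print(cleaned_where["Message"].tolist())
print(cleaned_fillna["Message"].tolist())
print(cleaned_where.isnull().sum().sum())  # number of remaining nulls
```

Both calls should leave the non-null cells untouched and turn the missing message into an empty string.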
# printing the first 5 rows of the dataframe
mail_data.head()

Code Description: Displaying DataFrame Contents

This line of code returns the first 5 rows of the DataFrame mail_data, which a notebook such as Colab displays as a table. The .head() method provides a quick overview of the dataset's structure and content, aiding initial data exploration.

# checking the number of rows and columns in the dataframe
mail_data.shape

Code Description: Determining DataFrame Dimensions

This line of code checks the number of rows and columns in the DataFrame mail_data using the .shape attribute. It returns a tuple containing the number of rows followed by the number of columns, providing insight into the dataset's size and structure.

Label Encoding

# label spam mail as 0; ham mail as 1

mail_data.loc[mail_data['Category'] == 'spam', 'Category'] = 0
mail_data.loc[mail_data['Category'] == 'ham', 'Category'] = 1

Code Description: Labeling Spam and Ham Mail

This code segment assigns labels to the mail categories in the DataFrame mail_data. It sets the label 'spam' to 0 and the label 'ham' to 1. This labeling is crucial for training a machine learning model to distinguish between spam and non-spam emails.
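
The same encoding can also be written with a single mapping. The sketch below is an alternative to the two .loc assignments, not the method used in this post; the toy DataFrame and the name label_map are made up for illustration.

```python
import pandas as pd

# Toy data (invented for illustration)
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["See you at 5", "Win a prize", "Lunch?"],
})

# Same convention as in this post: spam -> 0, ham -> 1
label_map = {"spam": 0, "ham": 1}
toy["Category"] = toy["Category"].map(label_map)

print(toy["Category"].tolist())
print(toy["Category"].dtype)
```

One design difference worth knowing: with map, any category that is missing from the dictionary becomes NaN, which makes unexpected labels easy to spot, and the resulting column already has an integer dtype when every label is mapped.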

# separating the data as texts and label

X = mail_data['Message']

Y = mail_data['Category']

Code Description: Separating Data into Texts and Labels

This code segment separates the data in the DataFrame mail_data into two components: texts (X) and labels (Y).

  • X contains the text content of the emails, stored in the column labeled 'Message'.
  • Y contains the corresponding labels indicating whether each email is spam (0) or ham (1), stored in the column labeled 'Category'.

This separation prepares the data for further processing, such as feature extraction and model training.

print(X)

Code Description: Printing Text Data

This line of code prints the text content of the emails stored in the variable X. It provides a glimpse into the textual information contained within the emails, which is useful for understanding the dataset before any text-based processing.

print(Y)

Code Description: Printing Label Data

This line of code prints the labels associated with each email stored in the variable Y. The labels indicate whether each email is classified as spam (0) or ham (1). Printing Y provides insight into the distribution of labels within the dataset, which is essential for model training and evaluation.
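
Printing Y shows the individual labels; to see the label distribution directly, pandas' value_counts can summarize it. The sketch below uses a made-up label Series (Y_toy) rather than the real data.

```python
import pandas as pd

# Toy labels using this post's convention: 0 = spam, 1 = ham
Y_toy = pd.Series([1, 1, 1, 0, 1, 0, 1, 1], name="Category")

counts = Y_toy.value_counts()                 # absolute counts per label
shares = Y_toy.value_counts(normalize=True)   # fraction of each label

print(counts)
print(shares)
```

Knowing how unbalanced the two classes are helps when interpreting the accuracy scores later on.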

Splitting the data into training data & test data

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=3)


Code Description: Splitting Data into Training and Testing Sets

This line of code utilizes the train_test_split function from scikit-learn to split the data into training and testing sets. Here's what each component does:

  • X_train and Y_train: These variables represent the features (text data) and labels for the training set, respectively. They contain a subset of the original data that will be used to train the machine learning model.
  • X_test and Y_test: Similarly, these variables represent the features and labels for the testing set. They contain a separate subset of the original data that will be used to evaluate the trained model's performance.
  • test_size=0.2: This parameter specifies the proportion of the dataset that should be allocated to the testing set. In this case, 20% of the data is reserved for testing, while the remaining 80% is used for training.
  • random_state=3: This parameter sets the seed for the random number generator, ensuring reproducibility. By using the same seed, you'll obtain the same random splits each time you run the code.

Splitting the data into training and testing sets is essential for assessing the model's generalization performance on unseen data.
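
The effect of test_size and random_state can be checked on a small made-up dataset. The sketch below also shows the optional stratify argument of train_test_split, which is not used in this post but keeps the spam/ham ratio similar in both subsets. All variable names and data are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten toy messages with alternating labels (invented for illustration)
texts = pd.Series([f"message {i}" for i in range(10)])
labels = pd.Series([0, 1] * 5)

# Two splits with the same seed
a_train, a_test, _, _ = train_test_split(texts, labels, test_size=0.2, random_state=3)
b_train, b_test, _, _ = train_test_split(texts, labels, test_size=0.2, random_state=3)

print(len(a_train), len(a_test))                        # 8 training rows, 2 test rows
print(a_test.index.tolist() == b_test.index.tolist())   # same seed, same split

# Optional: stratify keeps the class proportions in each subset
_, _, s_train_y, s_test_y = train_test_split(
    texts, labels, test_size=0.2, random_state=3, stratify=labels
)
print(sorted(s_test_y.tolist()))
```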

print(X.shape)
print(X_train.shape)
print(X_test.shape)

Code Description: Printing Data Shapes

These lines of code print the shapes of different data arrays, providing insights into their dimensions. Here's what each print statement does:

  • print(X.shape): This line prints the shape of X, the pandas Series holding the text of all emails. Because a Series is one-dimensional, the shape is a one-element tuple containing only the number of samples (emails); there is no feature dimension until the text is vectorized.
  • print(X_train.shape): This line prints the shape of X_train, the text of the emails in the training set. It shows how many emails were assigned to training, which is about 80% of the total.
  • print(X_test.shape): Similarly, this line prints the shape of X_test, the text of the emails in the testing set. It shows how many emails were held out for testing, which is about 20% of the total.

Printing the shapes of these arrays helps in understanding the distribution of data between training and testing sets and ensures proper data handling during model training and evaluation.
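
The difference between the shape of the raw text and the shape of the vectorized features can be seen on a tiny example. The sketch below uses made-up messages and default TfidfVectorizer settings; X_toy and toy_vectorizer are illustrative names.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Raw text lives in a one-dimensional Series
X_toy = pd.Series(["win a free prize", "see you at lunch", "free lunch today"])
print(X_toy.shape)  # (3,)

# After vectorization there are two dimensions: samples x vocabulary terms
toy_vectorizer = TfidfVectorizer()
toy_features = toy_vectorizer.fit_transform(X_toy)
print(toy_features.shape)
print(len(toy_vectorizer.vocabulary_))
```

The second dimension of the feature matrix equals the size of the vocabulary learned by the vectorizer, not the length of any message.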

Feature Extraction

# transform the text data to feature vectors that can be used as input to the Logistic regression

feature_extraction = TfidfVectorizer(min_df = 1, stop_words='english', lowercase=True)

X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)

# convert Y_train and Y_test values as integers

Y_train = Y_train.astype('int')
Y_test = Y_test.astype('int')

Code Description: Transforming Text Data into Feature Vectors

This code segment transforms the text data into feature vectors suitable for input to a Logistic Regression model. Here's what each part does:

  • TfidfVectorizer: This class from scikit-learn is used for feature extraction. It converts a collection of raw text documents into a matrix of TF-IDF features.
  • min_df: This parameter sets the minimum document frequency: a term is kept only if it appears in at least this many documents. Here it is set to 1, which is also the default, so no term is dropped for being too rare.
  • stop_words: Setting this to 'english' removes common English stop words (such as "the" and "and") during feature extraction, since they carry little information for telling spam from ham.
  • lowercase: This parameter ensures that all text is converted to lowercase before feature extraction to maintain consistency.
  • fit_transform: This method fits the TfidfVectorizer to the training data (X_train) and transforms it into a feature matrix (X_train_features). It learns the vocabulary and IDF weights from the training data and applies it to transform the training text data into a numerical feature representation.
  • transform: This method applies the learned vocabulary and IDF weights from the training data to transform the testing data (X_test) into a feature matrix (X_test_features). It's essential to use the same transformation for testing data as for training data.

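  • astype('int'): This method converts Y_train and Y_test to integers. After the string labels are replaced with 0 and 1, the Category column is typically still stored with the generic object dtype, so the explicit conversion gives the logistic regression model numeric class labels.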

Overall, this code segment prepares the text data for input to the logistic regression model by converting it into a numerical feature representation using TF-IDF vectorization.
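
The distinction between fit_transform on the training text and transform on the test text can be sketched on made-up messages. The example below uses the same vectorizer settings as this post; the texts and variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training and test messages (invented for illustration)
train_texts = ["Win a FREE prize now", "Are we meeting for lunch today"]
test_texts = ["Free lunch for the winner"]

vec = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

# Learn the vocabulary and IDF weights from the training text only
train_features = vec.fit_transform(train_texts)

# Reuse that vocabulary for the test text; nothing new is learned here
test_features = vec.transform(test_texts)

print(sorted(vec.vocabulary_))
print(train_features.shape, test_features.shape)
print("winner" in vec.vocabulary_)  # word seen only at test time
```

Both matrices have the same number of columns because they share one vocabulary. A word that appears only in the test text, such as "winner" here, has no column and is simply ignored, which is the expected behavior when the vectorizer is fitted on training data alone.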
print(X_train)

Code Description: Printing the Training Text

This line of code prints X_train, the raw text of the emails in the training set. X_train itself is not changed by the vectorizer; the TF-IDF features are stored separately in X_train_features.

print(X_train_features)

Code Description: Printing Transformed Feature Vectors

This line of code prints the feature vectors derived from the training text. Because TF-IDF vectorization returns a sparse matrix, the output lists only the non-zero entries, each as a (row, column) position followed by its TF-IDF weight.

Training the Model

Logistic Regression

model = LogisticRegression()

Code Description: Initializing Logistic Regression Model

This line of code initializes a logistic regression model using scikit-learn's LogisticRegression class. Logistic regression is a widely used algorithm for binary classification tasks, making it suitable for this spam detection project. By initializing the model, we're preparing it for training on the transformed feature vectors.

# training the Logistic Regression model with the training data
model.fit(X_train_features, Y_train)

Code Description: Training the Logistic Regression Model

This line of code trains the logistic regression model using the training data (X_train_features and Y_train). The fit() method is used to fit the model to the training data, allowing it to learn the relationships between the features (transformed text data) and the target labels (spam or ham).

Evaluating the trained model

# prediction on training data

prediction_on_training_data = model.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)

Code Description: Making Predictions on Training Data

This code segment predicts the labels for the training data using the trained logistic regression model (model). Here's what each part does:

  • model.predict(X_train_features): This line predicts the labels for the training data (X_train_features) using the trained logistic regression model. The predict() method takes the feature vectors as input and returns the predicted labels.
  • accuracy_score(Y_train, prediction_on_training_data): This line calculates the accuracy of the model's predictions on the training data. It compares the predicted labels (prediction_on_training_data) with the actual labels (Y_train) and computes the accuracy score, which represents the proportion of correctly classified instances.

print('Accuracy on training data : ', accuracy_on_training_data)

Code Description: Printing Accuracy on Training Data

This line of code prints the accuracy of the logistic regression model on the training data. It provides an assessment of how well the model performs in predicting the labels of the training instances. The accuracy score represents the proportion of correctly classified instances out of all instances in the training set.

# prediction on test data

prediction_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)

Code Description: Making Predictions on Test Data

This code segment predicts the labels for the test data using the trained logistic regression model (model). Here's what each part does:

  • model.predict(X_test_features): This line predicts the labels for the test data (X_test_features) using the trained logistic regression model. The predict() method takes the feature vectors as input and returns the predicted labels.

  • accuracy_score(Y_test, prediction_on_test_data): This line calculates the accuracy of the model's predictions on the test data. It compares the predicted labels (prediction_on_test_data) with the actual labels (Y_test) and computes the accuracy score, which represents the proportion of correctly classified instances in the test set.

print('Accuracy on test data : ', accuracy_on_test_data)

Code Description: Printing Accuracy on Test Data

This line of code prints the accuracy of the logistic regression model on the test data. It provides an assessment of how well the model generalizes to unseen data by predicting the labels of the test instances. The accuracy score represents the proportion of correctly classified instances out of all instances in the test set.
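
Accuracy is simply the share of predictions that match the true labels, which the sketch below verifies by hand on made-up labels. It also prints a confusion matrix, an extra check that is not part of this post's code, because accuracy alone can look good when most mail is ham. All arrays are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels using this post's convention: 0 = spam, 1 = ham
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
y_pred_model = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 1])   # misses one spam mail
y_pred_all_ham = np.ones_like(y_true)                      # always predicts ham

# accuracy_score equals the fraction of matching labels
manual_accuracy = (y_true == y_pred_model).mean()
print(manual_accuracy, accuracy_score(y_true, y_pred_model))

# A baseline that never flags spam still scores well on unbalanced data
print(accuracy_score(y_true, y_pred_all_ham))

# Rows are true labels (0, 1); columns are predicted labels (0, 1)
cm = confusion_matrix(y_true, y_pred_model, labels=[0, 1])
print(cm)
```

Comparing the training and test accuracy is still useful: a training score far above the test score suggests overfitting, while similar scores suggest the model generalizes reasonably well.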

Building a Predictive System

input_mail = ["your mail"]

# convert text to feature vectors
input_data_features = feature_extraction.transform(input_mail)

# making prediction

prediction = model.predict(input_data_features)
print(prediction)

if (prediction[0] == 1):
    print('Ham mail')
else:
    print('Spam mail')

Code Description: Making Prediction on Input Mail

This code segment takes an input email, converts it into feature vectors using the same TF-IDF transformation applied during training, and then makes a prediction using the trained logistic regression model (model). Here's what each part does:

  • input_mail: This variable stores the input email as a list containing a single string.
  • feature_extraction.transform(input_mail): This line transforms the input email into feature vectors using the same TF-IDF vectorizer (feature_extraction) that was fitted on the training data. The transform() method converts the input email text into numerical representations suitable for input to the logistic regression model.
  • model.predict(input_data_features): This line predicts the label of the input email using the trained logistic regression model. The predict() method takes the transformed feature vectors (input_data_features) as input and returns the predicted label.
  • print(prediction): This line prints the predicted label for the input email.
  • The subsequent if statement checks if the predicted label is 1 (indicating a ham mail) or 0 (indicating a spam mail) and prints the corresponding message.

This code allows for the prediction of whether the input email is likely to be spam or ham based on the trained logistic regression model.
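
If several emails need to be checked, the steps above can be wrapped in a small helper. The sketch below is one possible way to do that; the function name classify_mail and the toy training data are hypothetical and stand in for the fitted feature_extraction and model objects from this post.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def classify_mail(text, vectorizer, classifier):
    """Return 'Ham mail' or 'Spam mail' for a single message string."""
    # transform expects an iterable of documents, so wrap the string in a list
    features = vectorizer.transform([text])
    label = classifier.predict(features)[0]
    return 'Ham mail' if label == 1 else 'Spam mail'


# Toy training data (invented for illustration): 0 = spam, 1 = ham
toy_texts = [
    "win free prize", "free cash prize", "claim free cash",
    "today meeting lunch", "meeting monday lunch", "agenda meeting monday",
]
toy_labels = [0, 0, 0, 1, 1, 1]

toy_vectorizer = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
toy_model = LogisticRegression()
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

spam_result = classify_mail("claim your free cash prize", toy_vectorizer, toy_model)
ham_result = classify_mail("lunch meeting on monday", toy_vectorizer, toy_model)
print(spam_result)
print(ham_result)
```

In the notebook from this post, the same helper could be called as classify_mail("your mail", feature_extraction, model), since it only relies on the vectorizer's transform method and the model's predict method.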


In this blog post, we've explored the development of a spam mail detector using Python and machine learning techniques. By combining TF-IDF vectorization with logistic regression, we've built a model that classifies emails as either spam or ham, and we've measured its accuracy on both the training data and held-out test data.

This project not only demonstrates the effectiveness of machine learning in tackling real-world problems but also underscores its importance in enhancing user security and efficiency. With this spam mail detector, users can better protect themselves from potentially harmful emails, ensuring a safer and more productive online experience.

By following the steps outlined in this project, you too can implement your own spam mail detector and contribute to the ongoing efforts to combat email spam. Stay tuned for more exciting projects and insights on data science and machine learning!


