Spam Mail Prediction using Machine Learning
- Importing Libraries: The code begins by importing the necessary libraries: NumPy, Pandas, and scikit-learn's train_test_split, TfidfVectorizer, LogisticRegression, and accuracy_score (from sklearn.metrics).
- Data Preparation: The walkthrough assumes you have a dataset containing email content along with labels indicating whether each email is spam or not, stored in a Pandas DataFrame.
- Splitting Data: The train_test_split function from scikit-learn is used to split the dataset into training and testing sets. This step is crucial for evaluating the performance of the spam mail detector.
- Feature Extraction: The TfidfVectorizer is employed to convert text data into numerical vectors. TF-IDF stands for Term Frequency-Inverse Document Frequency: a word is weighted up the more often it appears in a given email and weighted down the more emails across the dataset contain it.
- Model Training: A logistic regression model is chosen for training. Logistic regression is a commonly used algorithm for binary classification tasks like spam detection.
- Model Evaluation: Once the model is trained, it's evaluated using the testing dataset, and the accuracy score is computed to assess how well the model performs in distinguishing between spam and non-spam emails.
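The imports listed in the first step can be sketched as follows. The module paths are the standard scikit-learn locations for these names; the exact import lines of the original notebook are not shown in this post, so treat this as an assumed reconstruction.

```python
import numpy as np
import pandas as pd

# Splitting, vectorizing, modelling and scoring utilities from scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
```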
Data Collection & Pre-Processing
Printing raw_mail_data sends the DataFrame to the output console. It allows you to visually inspect the loaded data, helping you understand its structure and contents, which is crucial for data exploration and preprocessing tasks.
Missing (null) values in raw_mail_data are replaced with an empty string ''. It ensures consistency in the data and prevents potential issues during subsequent processing steps.
Code Description: Displaying DataFrame Contents
This line of code prints the first 5 rows of the DataFrame mail_data. By using the .head() method, it provides a quick overview of the dataset's structure and content, aiding in initial data exploration and understanding.
Code Description: Determining DataFrame Dimensions
This line of code checks the number of rows and columns in the DataFrame mail_data using the .shape attribute. It returns a tuple containing the number of rows followed by the number of columns, providing insight into the dataset's size and structure.
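A minimal sketch of these inspection and cleaning steps is shown below. The four-row DataFrame is a hypothetical stand-in for the real dataset (the post does not show how the file is loaded), and fillna('') is just one way to replace missing values with an empty string; the original notebook may do it differently.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loaded dataset; the real data comes from a file.
raw_mail_data = pd.DataFrame({
    "Category": ["ham", "spam", "ham", "spam"],
    "Message": [
        "See you at lunch tomorrow",
        "WIN a free prize now",
        np.nan,  # a missing message, to show the cleaning step
        "Claim your cash reward today",
    ],
})

# Inspect the raw data
print(raw_mail_data)

# Replace missing values with an empty string (assumed approach)
mail_data = raw_mail_data.fillna("")

# First 5 rows, then (rows, columns)
print(mail_data.head())
print(mail_data.shape)
```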
Label Encoding
Code Description: Labeling Spam and Ham Mail
This code segment assigns labels to the mail categories in the DataFrame mail_data. It sets the label 'spam' to 0 and the label 'ham' to 1. This labeling is crucial for training a machine learning model to distinguish between spam and non-spam emails.
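One way to apply this encoding is sketched below, using a small hypothetical DataFrame. The use of Series.map is an assumption made for illustration; the original code may assign the labels another way, but the resulting mapping (spam to 0, ham to 1) is the one described above.

```python
import pandas as pd

# Hypothetical sample data with the same column names as the post
mail_data = pd.DataFrame({
    "Category": ["ham", "spam", "ham", "spam"],
    "Message": [
        "See you at lunch tomorrow",
        "WIN a free prize now",
        "Can you call me later",
        "Claim your cash reward today",
    ],
})

# Encode the text categories as integers: spam -> 0, ham -> 1
mail_data["Category"] = mail_data["Category"].map({"spam": 0, "ham": 1})

print(mail_data)
```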
Code Description: Separating Data into Texts and Labels
This code segment separates the data in the DataFrame mail_data into two components: texts (X) and labels (Y).
- X contains the text content of the emails, stored in the column labeled 'Message'.
- Y contains the corresponding labels indicating whether each email is spam (0) or ham (1), stored in the column labeled 'Category'.
This separation prepares the data for further processing, such as feature extraction and model training.
Code Description: Printing Text Data
This line of code prints the text content of the emails stored in the variable X. It provides a glimpse into the textual information contained within the emails, which is crucial for understanding the dataset and performing text-based analysis tasks.
Code Description: Printing Label Data
This line of code prints the labels associated with each email stored in the variable Y. The labels indicate whether each email is classified as spam (0) or ham (1). Printing Y provides insight into the distribution of labels within the dataset, which is essential for model training and evaluation.
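The separation and the two print calls described above can be sketched like this. The sample rows are hypothetical and already label-encoded; the value_counts() call is an addition for illustration, since it summarizes the label distribution more directly than printing Y alone.

```python
import pandas as pd

# Hypothetical, already label-encoded data (spam -> 0, ham -> 1)
mail_data = pd.DataFrame({
    "Category": [1, 0, 1, 0],
    "Message": [
        "See you at lunch tomorrow",
        "WIN a free prize now",
        "Can you call me later",
        "Claim your cash reward today",
    ],
})

# Texts and labels
X = mail_data["Message"]
Y = mail_data["Category"]

print(X)
print(Y)

# Added for illustration: how many emails carry each label
print(Y.value_counts())
```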
Splitting the data into training data & test data
Code Description: Splitting Data into Training and Testing Sets
This line of code utilizes the train_test_split function from scikit-learn to split the data into training and testing sets. Here's what each component does:
- X_train and Y_train: These variables represent the features (text data) and labels for the training set, respectively. They contain a subset of the original data that will be used to train the machine learning model.
- X_test and Y_test: Similarly, these variables represent the features and labels for the testing set. They contain a separate subset of the original data that will be used to evaluate the trained model's performance.
- test_size=0.2: This parameter specifies the proportion of the dataset that should be allocated to the testing set. In this case, 20% of the data is reserved for testing, while the remaining 80% is used for training.
- random_state=3: This parameter sets the seed for the random number generator, ensuring reproducibility. By using the same seed, you'll obtain the same random splits each time you run the code.
Splitting the data into training and testing sets is essential for assessing the model's generalization performance on unseen data.
Code Description: Printing Data Shapes
These lines of code print the shapes of different data arrays, providing insights into their dimensions. Here's what each print statement does:
- print(X.shape): This line prints the shape of X, which holds the text data of all emails. Because X is a single DataFrame column (a Pandas Series), the shape is a one-element tuple giving the number of samples (emails); the number of features is only defined later, after TF-IDF vectorization.
- print(X_train.shape): This line prints the shape of X_train, which holds the text data of emails in the training set. It shows how many emails were allocated to training.
- print(X_test.shape): Similarly, this line prints the shape of X_test, which holds the text data of emails in the testing set. It shows how many emails were reserved for testing.
Printing the shapes of these arrays helps in understanding the distribution of data between training and testing sets and ensures proper data handling during model training and evaluation.
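Both the split and the shape checks can be sketched together as follows. The ten-row dataset is hypothetical and exists only to make the 80/20 proportions easy to see; test_size=0.2 and random_state=3 are the values described above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ten-email dataset (labels: spam -> 0, ham -> 1)
mail_data = pd.DataFrame({
    "Category": [1, 0] * 5,
    "Message": [f"message number {i}" for i in range(10)],
})
X = mail_data["Message"]
Y = mail_data["Category"]

# Hold out 20% of the emails for testing; the fixed seed makes the split repeatable
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=3
)

print(X.shape)        # (10,)
print(X_train.shape)  # (8,)
print(X_test.shape)   # (2,)
```

Each shape is a one-element tuple because X is a Series of raw text at this stage, not yet a feature matrix.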
Feature Extraction
Code Description: Transforming Text Data into Feature Vectors
This code segment transforms the text data into feature vectors suitable for input to a Logistic Regression model. Here's what each part does:
- TfidfVectorizer: This class from scikit-learn is used for feature extraction. It converts a collection of raw text documents into a matrix of TF-IDF features.
- min_df: This parameter sets the minimum frequency threshold for including terms in the feature matrix. Here, it's set to 1, meaning terms must appear in at least one document to be considered.
- stop_words: This parameter specifies that common English stop words should be removed from the text data during feature extraction to improve model performance.
- lowercase: This parameter ensures that all text is converted to lowercase before feature extraction to maintain consistency.
- fit_transform: This method fits the TfidfVectorizer to the training data (X_train) and transforms it into a feature matrix (X_train_features). It learns the vocabulary and IDF weights from the training data and applies them to transform the training text data into a numerical feature representation.
- transform: This method applies the learned vocabulary and IDF weights from the training data to transform the testing data (X_test) into a feature matrix (X_test_features). It's essential to use the same transformation for testing data as for training data.
- astype('int'): This method converts the data type of Y_train and Y_test to integers, ensuring compatibility with the logistic regression model.
Code Description: Printing Transformed Feature Vectors
This line of code prints the transformed feature vectors derived from the training text data. These feature vectors have been processed using TF-IDF vectorization, converting the text data into numerical representations suitable for input to the logistic regression model.
Training the Model
Logistic Regression
Code Description: Initializing Logistic Regression Model
This line of code initializes a logistic regression model using scikit-learn's LogisticRegression class. Logistic regression is a widely used algorithm for binary classification tasks, making it suitable for this spam detection project. By initializing the model, we're preparing it for training on the transformed feature vectors.
Code Description: Training the Logistic Regression Model
This line of code trains the logistic regression model using the training data (X_train_features and Y_train). The fit() method is used to fit the model to the training data, allowing it to learn the relationships between the features (transformed text data) and the target labels (spam or ham).
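Initialization and training can be sketched as below. The training texts and labels are hypothetical, and the model is created with scikit-learn's default settings, which is an assumption since the post does not list any constructor arguments.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training texts and labels (spam -> 0, ham -> 1)
X_train = [
    "Are we still meeting for lunch today",
    "WIN a free prize now, claim your reward",
    "Please send me the project report",
    "Free cash offer, click to claim now",
]
Y_train = [1, 0, 1, 0]

# Turn the texts into TF-IDF feature vectors
feature_extraction = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
X_train_features = feature_extraction.fit_transform(X_train)

# Initialize the model, then fit it to the training features and labels
model = LogisticRegression()
model.fit(X_train_features, Y_train)

# One learned weight per vocabulary term
print(model.coef_.shape)
```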
Evaluating the trained model
Code Description: Making Predictions on Training Data
This code segment predicts the labels for the training data using the trained logistic regression model (model). Here's what each part does:
- model.predict(X_train_features): This line predicts the labels for the training data (X_train_features) using the trained logistic regression model. The predict() method takes the feature vectors as input and returns the predicted labels.
- accuracy_score(Y_train, prediction_on_training_data): This line calculates the accuracy of the model's predictions on the training data. It compares the predicted labels (prediction_on_training_data) with the actual labels (Y_train) and computes the accuracy score, which represents the proportion of correctly classified instances.
Code Description: Printing Accuracy on Training Data
This line of code prints the accuracy of the logistic regression model on the training data. It provides an assessment of how well the model performs in predicting the labels of the training instances. The accuracy score represents the proportion of correctly classified instances out of all instances in the training set.
Code Description: Making Predictions on Test Data
This code segment predicts the labels for the test data using the trained logistic regression model (model). Here's what each part does:
- model.predict(X_test_features): This line predicts the labels for the test data (X_test_features) using the trained logistic regression model. The predict() method takes the feature vectors as input and returns the predicted labels.
- accuracy_score(Y_test, prediction_on_test_data): This line calculates the accuracy of the model's predictions on the test data. It compares the predicted labels (prediction_on_test_data) with the actual labels (Y_test) and computes the accuracy score, which represents the proportion of correctly classified instances in the test set.
Code Description: Printing Accuracy on Test Data
This line of code prints the accuracy of the logistic regression model on the test data. It provides an assessment of how well the model generalizes to unseen data by predicting the labels of the test instances. The accuracy score represents the proportion of correctly classified instances out of all instances in the test set.
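The two evaluation steps can be sketched together as follows. The data is a tiny hypothetical sample, so the printed scores say nothing about real-world performance; the variable names accuracy_on_training_data and accuracy_on_test_data are assumed here for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical data (spam -> 0, ham -> 1)
X_train = [
    "Are we still meeting for lunch today",
    "WIN a free prize now, claim your reward",
    "Please send me the project report",
    "Free cash offer, click to claim now",
    "Can you pick up the kids after school",
    "Urgent: claim your free lottery prize",
]
Y_train = [1, 0, 1, 0, 1, 0]
X_test = ["Lunch meeting moved to Friday", "Claim your free cash prize now"]
Y_test = [1, 0]

feature_extraction = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)

model = LogisticRegression()
model.fit(X_train_features, Y_train)

# Accuracy on the data the model was trained on
prediction_on_training_data = model.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)
print("Accuracy on training data:", accuracy_on_training_data)

# Accuracy on held-out data the model has not seen
prediction_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)
print("Accuracy on test data:", accuracy_on_test_data)
```

Comparing the two numbers is the useful part: a training accuracy far above the test accuracy is a common sign of overfitting.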
Building a Predictive System
Code Description: Making Prediction on Input Mail
This code segment takes an input email, converts it into feature vectors using the same TF-IDF transformation applied during training, and then makes a prediction using the trained logistic regression model (model). Here's what each part does:
- input_mail: This variable stores the input email as a list containing a single string.
- feature_extraction.transform(input_mail): This line transforms the input email into feature vectors using the same TF-IDF vectorizer (feature_extraction) that was fitted on the training data. The transform() method converts the input email text into numerical representations suitable for input to the logistic regression model.
- model.predict(input_data_features): This line predicts the label of the input email using the trained logistic regression model. The predict() method takes the transformed feature vectors (input_data_features) as input and returns the predicted label.
- print(prediction): This line prints the predicted label for the input email.
The subsequent if statement checks if the predicted label is 1 (indicating a ham mail) or 0 (indicating a spam mail) and prints the corresponding message.
This code allows for the prediction of whether the input email is likely to be spam or ham based on the trained logistic regression model.
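A compact sketch of such a predictive system is given below. The training texts, the input email, and the printed messages are all hypothetical; a real system would reuse the vectorizer and model fitted on the full dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training data (spam -> 0, ham -> 1)
X_train = [
    "Are we still meeting for lunch today",
    "WIN a free prize now, claim your reward",
    "Please send me the project report",
    "Free cash offer, click to claim now",
]
Y_train = [1, 0, 1, 0]

feature_extraction = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
X_train_features = feature_extraction.fit_transform(X_train)

model = LogisticRegression()
model.fit(X_train_features, Y_train)

# The input email is a list holding a single string
input_mail = ["Congratulations, you have won a free prize. Claim it now"]

# Convert the text with the already fitted vectorizer (transform, not fit_transform)
input_data_features = feature_extraction.transform(input_mail)

# Predict the label and report it
prediction = model.predict(input_data_features)
print(prediction)

if prediction[0] == 1:
    print("Ham mail")
else:
    print("Spam mail")
```

Calling transform rather than fit_transform on the input matters: refitting would build a new vocabulary that no longer lines up with the columns the model was trained on.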
In this blog post, we've explored the development of a spam mail detector using Python and machine learning techniques. By leveraging the power of logistic regression and TF-IDF vectorization, we've created a robust model capable of classifying emails as either spam or ham with high accuracy.
This project not only demonstrates the effectiveness of machine learning in tackling real-world problems but also underscores its importance in enhancing user security and efficiency. With this spam mail detector, users can better protect themselves from potentially harmful emails, ensuring a safer and more productive online experience.
By following the steps outlined in this project, you too can implement your own spam mail detector and contribute to the ongoing efforts to combat email spam. Stay tuned for more exciting projects and insights on data science and machine learning!