Naïve Bayes Spam Classifier¶

Probability is a powerful tool that lets us answer interesting questions about data, and it serves as the foundation of a commonly used machine learning technique for classification. In this notebook, we'll build a Naïve Bayes classifier from scratch, so you'll get hands-on experience coding a machine learning classifier.

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import pandas as pd
from sklearn.model_selection import train_test_split

Part 1: Data Preparation¶

The data we will use for this hands-on exercise was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the UCI Machine Learning Repository. The data collection process is described in more detail here.

In [2]:
data = pd.read_csv('data/sms.csv.gz', compression='gzip', sep='\t', header=None, names=['Label', 'SMS'])
data
Out[2]:
Label SMS
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
... ... ...
5567 spam This is the 2nd time we have tried 2 contact u...
5568 ham Will ü b going to esplanade fr home?
5569 ham Pity, * was in mood for that. So...any other s...
5570 ham The guy did some bitching but I acted like i'd...
5571 ham Rofl. Its true to its name

5572 rows × 2 columns

Basic Data Statistics¶

Compute the fraction of the dataset that is ham and the fraction that is spam.
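
For example, pandas can compute these fractions directly; a minimal sketch using the data frame loaded above:

# Fraction of ham vs. spam in the full dataset
data['Label'].value_counts(normalize=True)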

Data Cleaning¶

Let's do some data cleaning:

  1. remove all punctuation
  2. make all words lowercase
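
One minimal way to do both steps with pandas string methods; the regex (drop anything that is not a word character or whitespace) is our choice of convention:

# Remove punctuation, then lowercase every message.
data['SMS'] = data['SMS'].str.replace(r'[^\w\s]', '', regex=True).str.lower()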

Splitting Training and Testing Data¶

Now, let's split our data into training and testing sets. You can use the train_test_split function in scikit-learn to perform this split.

A common split of training and testing data is 80% in the training set, 20% in the test set.
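
A sketch of the split; the variable names and random_state value are our assumptions, not fixed by the exercise:

# 80/20 split; random_state makes the shuffle reproducible.
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)
train_data = train_data.reset_index(drop=True)
test_data = test_data.reset_index(drop=True)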

Sanity Check the Split¶

As a little sanity check, let's verify that the percentages of spam and non-spam are roughly equivalent in the training set and the testing set.
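
For instance, assuming the train_data/test_data names from the sketch above:

# Class proportions in each split; these should be close to each other.
print(train_data['Label'].value_counts(normalize=True))
print(test_data['Label'].value_counts(normalize=True))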

Build the Vocabulary¶

A first step to constructing the classifier is to collect the unique set of words that occur in the training data, otherwise known as the vocabulary. Construct a list that contains all unique words. You will need this regardless of whether you build the classifier from scratch or whether you use sklearn.

You may need to do this for both the training and test sets, depending on how you write your code, so you might consider making it a function.
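
One possible sketch of such a function; the name build_vocabulary is ours, and it assumes the messages have already been cleaned and lowercased:

def build_vocabulary(messages):
    """Return a list of the unique words across an iterable of messages."""
    vocab = set()
    for message in messages:
        vocab.update(message.split())
    return list(vocab)

vocabulary = build_vocabulary(train_data['SMS'])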

Word Counts for Each Message¶

Naïve Bayes (e.g., multinomial Naïve Bayes) expects a matrix of word counts, with one row per message and one column per vocabulary word (often stored as a sparse matrix). Write a function that performs this transformation, and call it on both your training and test sets to get a word count matrix for each.
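
A minimal sketch that builds a dense DataFrame rather than a true sparse matrix, for readability; the helper name and X_test_wc are our assumptions:

def to_word_counts(messages, vocabulary):
    """Build a message-by-word count matrix as a DataFrame."""
    counts = {word: [0] * len(messages) for word in vocabulary}
    for i, message in enumerate(messages):
        for word in message.split():
            if word in counts:  # ignore words outside the vocabulary
                counts[word][i] += 1
    return pd.DataFrame(counts)

X_train_wc = to_word_counts(train_data['SMS'], vocabulary)
X_test_wc = to_word_counts(test_data['SMS'], vocabulary)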

In [12]:
X_train_wc.head(5)
Out[12]:
curious exorcism savings hv9d 50gbp wc1n3xx abroad option m8 their ... watch vu rice crashed science pulling comment except chances providing
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 8753 columns

Part 2: Naïve Bayes Classifier with sklearn¶

Now that you have your word matrices, counts, and labels for your training and test sets, you can call sklearn's multinomial Naïve Bayes classifier to train and test your model.

  1. Train the classifier.
  2. Count the number of mislabeled points.
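
A sketch, assuming the X_train_wc/X_test_wc matrices and train_data/test_data splits from earlier:

from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train_wc, train_data['Label'])
predictions = model.predict(X_test_wc)

# Count the mislabeled test points.
mislabeled = (predictions != test_data['Label']).sum()
print(f'{mislabeled} mislabeled out of {len(test_data)} messages')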

Part 3: Naïve Bayes Classifier from Scratch¶

Although scikit-learn has a Naïve Bayes classifier built in, we can also implement such a classifier from scratch. In this exercise, you will use labeled SMS messages to classify messages as "spam" or "ham" (legitimate messages).

Count the Occurrences of Each Word¶

We now compute how many times each word occurs in each SMS message.
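
If you do not reuse the count matrix from Part 1, collections.Counter gives per-message counts directly; a sketch with a variable name of our choosing:

from collections import Counter

# One Counter per message, e.g. Counter({'free': 2, 'win': 1, ...})
train_word_counts = [Counter(message.split()) for message in train_data['SMS']]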

Training the Model¶

  1. Split your training data into spam and "ham" (legitimate messages).
  2. Calculate the prior probabilities of spam and ham (the prior probabilities for Bayes' Rule) and save those values in variables.
  3. Calculate and store:
  • the total number of words in spam messages,
  • the total number of words in ham messages, and
  • the total number of unique words in the training data.
  4. We now have everything we need to calculate $P(x_i\ |\ y = Spam)$ and $P(x_i\ |\ y = Ham)$ for all words $x_i$ in the vocabulary. Compute these quantities for each word and store the results in a dictionary. Remember also to smooth these probabilities with an alpha parameter, so that words that never appear in some class in the labeled dataset do not get zero probability. You should create data structures that hold, for each word:
  • the probability that the word appears given that the message is spam, and
  • the probability that the word appears given that the message is ham.
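
A sketch of all four steps; every variable name here is our assumption, and alpha = 1 is standard Laplace smoothing:

from collections import Counter

# 1. Split the training data by class.
spam = train_data[train_data['Label'] == 'spam']
ham = train_data[train_data['Label'] == 'ham']

# 2. Prior probabilities P(Spam) and P(Ham).
p_spam = len(spam) / len(train_data)
p_ham = len(ham) / len(train_data)

# 3. Total words per class and the vocabulary size.
n_spam = sum(len(message.split()) for message in spam['SMS'])
n_ham = sum(len(message.split()) for message in ham['SMS'])
n_vocab = len(vocabulary)

# 4. Laplace-smoothed conditional probabilities P(word | class).
spam_counts = Counter(word for message in spam['SMS'] for word in message.split())
ham_counts = Counter(word for message in ham['SMS'] for word in message.split())
alpha = 1
p_word_given_spam = {word: (spam_counts[word] + alpha) / (n_spam + alpha * n_vocab)
                     for word in vocabulary}
p_word_given_ham = {word: (ham_counts[word] + alpha) / (n_ham + alpha * n_vocab)
                    for word in vocabulary}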

Classifier¶

Now define a function classify that takes a text message as input and outputs a label, 'spam' or 'ham', given the message.
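
One possible sketch, built on the probabilities from the training sketch above and matching the printed output in the cells below; skipping out-of-vocabulary words is one common convention:

import re

def classify(message):
    """Print the (unnormalized) posterior scores and the predicted label."""
    # Apply the same cleaning as the training data: strip punctuation, lowercase.
    words = re.sub(r'[^\w\s]', '', message).lower().split()
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    for word in words:
        if word in p_word_given_spam:  # skip words outside the vocabulary
            p_spam_given_message *= p_word_given_spam[word]
            p_ham_given_message *= p_word_given_ham[word]
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    print('Label:', 'Spam' if p_spam_given_message > p_ham_given_message else 'Ham')

Note that multiplying many small probabilities risks floating-point underflow on long messages; production implementations typically sum log-probabilities instead.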

Testing the Model¶

Example Messages¶

Let's test the spam classifier on a few messages.

In [24]:
classify('Sounds good, Alex, then see u there')
P(Spam|message): 8.011681173647132e-29
P(Ham|message): 5.170079874432155e-28
Label: Ham
In [25]:
classify('YOU WIN THE PRIZE MONEY JACKPOT! CALL 14')
P(Spam|message): 1.0341656349099176e-32
P(Ham|message): 6.673654155714669e-32
Label: Ham

Accuracy on Test Set¶

With obvious spam and non-spam, the classifier seems to be working the way we would expect. Now let's properly evaluate model performance using the test data. First, we just need to update our function to actually return a label rather than print it.
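
One way to do that, reusing the probabilities from the sketches above; classify_label is a name of our choosing:

import re

def classify_label(message):
    """Like classify, but returns the predicted label instead of printing."""
    words = re.sub(r'[^\w\s]', '', message).lower().split()
    p_s, p_h = p_spam, p_ham
    for word in words:
        if word in p_word_given_spam:
            p_s *= p_word_given_spam[word]
            p_h *= p_word_given_ham[word]
    return 'spam' if p_s > p_h else 'ham'

# Accuracy on the held-out test set.
predicted = test_data['SMS'].apply(classify_label)
print('Accuracy:', (predicted == test_data['Label']).mean())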