Probability is a powerful tool that lets us answer interesting questions about data, and it serves as the foundation of a commonly used machine learning technique for classification. We'll also be building a Naïve Bayes classifier from scratch, so you'll get hands-on experience coding a machine learning classifier.
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
from sklearn.model_selection import train_test_split
The data we will use for this hands-on exercise was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the UCI Machine Learning Repository. The data collection process is described in more detail here.
data = pd.read_csv('data/sms.csv.gz', compression='gzip', sep='\t', header=None, names=['Label', 'SMS'])
data
| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |
| 5571 | ham | Rofl. Its true to its name |
5572 rows × 2 columns
Compute the fraction of the dataset that is ham and the fraction that is spam.
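One straightforward way to do this is with value_counts on the Label column of the DataFrame loaded above:

# Fraction of messages in each class (ham vs. spam)
data['Label'].value_counts(normalize=True)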
Now, let's split our data into training and testing sets. You can use the train_test_split function in scikit-learn to perform this split. A common split is 80% of the data in the training set and 20% in the test set.
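For example, something along these lines (the variable names and random_state value are just one choice):

# Split the DataFrame into an 80% training set and a 20% test set
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
train_set = train_set.reset_index(drop=True)
test_set = test_set.reset_index(drop=True)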
As a little sanity check, let's verify that the percentages of spam and non-spam are roughly equivalent in the training set and the testing set.
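Assuming the train_set and test_set names from the sketch above, the class proportions can be compared directly:

# Both splits should have roughly the same ham/spam proportions as the full dataset
print(train_set['Label'].value_counts(normalize=True))
print(test_set['Label'].value_counts(normalize=True))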
A first step to constructing the classifier is to collect the unique set of words that occur in the training data, otherwise known as the vocabulary. Construct a list that contains all unique words. You will need this regardless of whether you build the classifier from scratch or whether you use sklearn.
You may need to do this for both the training and test sets, depending on how you write your code, so you might consider making it a function.
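Here is a minimal sketch of one possible approach. The tokenization (lowercasing and keeping runs of letters, digits, and apostrophes) is an assumption, so your vocabulary size may differ from the 8,753 columns shown later.

import re

def build_vocabulary(messages):
    # Collect the unique, lowercased words across a collection of messages
    vocab = set()
    for msg in messages:
        vocab.update(re.findall(r"[a-z0-9']+", msg.lower()))
    return sorted(vocab)

vocabulary = build_vocabulary(train_set['SMS'])
len(vocabulary)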
Naïve Bayes (e.g., Multinomial Naïve Bayes) expects a matrix of word counts, with one row per message and one column per vocabulary word. Write a function to perform this transformation, and call it on your training and test sets to get a word count matrix for each.
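A dense DataFrame, as sketched below, mirrors the output shown in the next cell; for a large vocabulary a sparse representation (for example scikit-learn's CountVectorizer) would be more memory-efficient. The function name and the tokenization are assumptions carried over from the sketch above.

def make_word_counts(messages, vocabulary):
    # One row per message, one column per vocabulary word, holding raw counts
    word_index = {word: i for i, word in enumerate(vocabulary)}
    counts = [[0] * len(vocabulary) for _ in messages]
    for row, msg in enumerate(messages):
        for word in re.findall(r"[a-z0-9']+", msg.lower()):
            col = word_index.get(word)
            if col is not None:  # words outside the training vocabulary are ignored
                counts[row][col] += 1
    return pd.DataFrame(counts, columns=vocabulary)

X_train_wc = make_word_counts(train_set['SMS'].tolist(), vocabulary)
X_test_wc = make_word_counts(test_set['SMS'].tolist(), vocabulary)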
X_train_wc.head(5)
| | curious | exorcism | savings | hv9d | 50gbp | wc1n3xx | abroad | option | m8 | their | ... | watch | vu | rice | crashed | science | pulling | comment | except | chances | providing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 8753 columns
Now that you have word count matrices and labels for your training and test sets, you can call sklearn's multinomial Naïve Bayes classifier to train and test your model.
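A minimal sketch, assuming the X_train_wc/X_test_wc matrices and the train_set/test_set frames from the earlier cells:

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

nb = MultinomialNB()
nb.fit(X_train_wc, train_set['Label'])

# Accuracy on the held-out test set
accuracy_score(test_set['Label'], nb.predict(X_test_wc))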
Although scikit-learn has a built-in Naïve Bayes classifier, we can also implement such a classifier from scratch. In this exercise, you will use the labeled SMS messages to classify each message as "spam" or legitimate ("ham").
We now compute how many times each word occurs in each SMS message.
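If you already built the word count matrix above, it can be reused here. One possible way to organize the training counts so they can later be split by class (all names are illustrative):

# Attach the labels to the training word counts, then separate by class
train_wc = X_train_wc.copy()
train_wc['Label'] = train_set['Label'].values

spam_wc = train_wc[train_wc['Label'] == 'spam'].drop(columns='Label')
ham_wc = train_wc[train_wc['Label'] == 'ham'].drop(columns='Label')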
Now define a function classify that takes a text message as input and outputs a label, 'spam' or 'ham', given the message.
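A minimal from-scratch sketch, assuming the spam_wc/ham_wc frames and the vocabulary from the previous cells, and using Laplace (add-one) smoothing; the smoothing and tokenization choices here are assumptions, so your probabilities will not match the outputs below exactly.

import re

# Class priors estimated from the training labels
p_spam = (train_set['Label'] == 'spam').mean()
p_ham = (train_set['Label'] == 'ham').mean()

# Per-class word totals and Laplace-smoothed conditional probabilities P(word | class)
alpha = 1
n_vocab = len(vocabulary)
spam_totals = spam_wc.sum()
ham_totals = ham_wc.sum()
p_word_spam = (spam_totals + alpha) / (spam_totals.sum() + alpha * n_vocab)
p_word_ham = (ham_totals + alpha) / (ham_totals.sum() + alpha * n_vocab)

def classify(message):
    # Multiply each prior by P(word | class) for every known word in the message
    p_spam_msg = p_spam
    p_ham_msg = p_ham
    for word in re.findall(r"[a-z0-9']+", message.lower()):
        if word in p_word_spam:  # words outside the vocabulary are skipped
            p_spam_msg *= p_word_spam[word]
            p_ham_msg *= p_word_ham[word]
    print('P(Spam|message):', p_spam_msg)
    print('P(Ham|message):', p_ham_msg)
    print('Label:', 'Spam' if p_spam_msg > p_ham_msg else 'Ham')

Because the two quantities are left unnormalized (each is a prior times a product of word probabilities), they come out as very small numbers, as in the outputs below.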
classify('Sounds good, Alex, then see u there')
P(Spam|message): 8.011681173647132e-29
P(Ham|message): 5.170079874432155e-28
Label: Ham
classify('YOU WIN THE PRIZE MONEY JACKPOT! CALL 14')
P(Spam|message): 1.0341656349099176e-32
P(Ham|message): 6.673654155714669e-32
Label: Ham
The classifier handles the obvious non-spam as we would expect, although the obvious spam example above is still labeled ham. Let's properly evaluate model performance on the test data now; we just need to update our function to actually return its label first, rather than print it.
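One way to make that change, keeping the illustrative names from the sketch above:

def classify_test(message):
    # Same computation as classify, but returning the label instead of printing it
    p_spam_msg = p_spam
    p_ham_msg = p_ham
    for word in re.findall(r"[a-z0-9']+", message.lower()):
        if word in p_word_spam:
            p_spam_msg *= p_word_spam[word]
            p_ham_msg *= p_word_ham[word]
    return 'spam' if p_spam_msg > p_ham_msg else 'ham'

# Fraction of test messages whose predicted label matches the true label
(test_set['SMS'].apply(classify_test) == test_set['Label']).mean()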