Now that we have experience preparing data for input to machine learning libraries, the next step will be to train, tune, and test a model. You will perform all three of these steps in this hands-on activity.
The assignment consists of the following steps:
# Silence scapy's runtime warnings while netml parses packet captures.
import logging
logging.getLogger("scapy.runtime").setLevel(logging.ERROR)

# netml parses PCAPs into flows and features; dump_data/load_data persist and reload results.
from netml.pparser.parser import PCAP
from netml.utils.tool import dump_data, load_data
import pandas as pd
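With these imports in place, one way to pick up where the data-preparation activity left off is to reload the features and labels you saved with dump_data. This is only a sketch: the file name below is a placeholder, and it assumes you dumped the features and labels together as a pair.

# Sketch: reload the (features, labels) pair saved earlier with dump_data().
# 'out/features_labels.dat' is a placeholder for whatever file name you used.
X, y = load_data('out/features_labels.dat')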
The goal of supervised learning is to train a model that takes examples and predicts labels for those examples that are as close as possible to the true labels. In this setting, for instance, a model might take features from a traffic trace and predict whether the traffic constitutes regular web traffic or a scan.
How do you measure whether the model is succeeding if you don't know the true labels for new observations? The solution is to evaluate the trained model on additional data that it has never seen, but for which you already know the correct labels.
This requires that you train the model using only a portion of the entire labeled dataset (the training set) and withhold the rest of the labeled data (the test set) for testing how well the model generalizes to new information.
To evaluate the model, we will need to split the data into train and test sets.
Split your data into a training set and a test set using scikit-learn. A common choice is to train on 80% of the data and withhold the remaining 20% for testing.
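As a sketch, scikit-learn's train_test_split can produce such a split; the 80/20 ratio and fixed random_state below are just one reasonable choice, and X and y stand for your feature matrix and label vector.

from sklearn.model_selection import train_test_split

# Hold out 20% of the labeled examples for testing; stratify keeps the class
# balance similar in both sets, and random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)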
Now that you have split your data into training and testing sets, you are ready to train and evaluate a model.
Import a machine learning model of your choice, use your training set to train the model, and use the test set to evaluate it.
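For example (the choice of model is up to you), a random forest classifier could be trained and given a first evaluation like this:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Any scikit-learn classifier follows the same fit/predict pattern;
# a random forest is just one reasonable default.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)        # train on the training set only
y_pred = model.predict(X_test)     # predict labels for the held-out test set
print("Test accuracy:", accuracy_score(y_test, y_pred))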
You can now evaluate how well your trained model works. There are several useful ways to visualize the results, such as a confusion matrix or a receiver operating characteristic (ROC) curve; below we will gain some experience plotting both. This documentation may help you with plotting these results.
A confusion matrix is one way to understand errors of different types: each off-diagonal cell counts examples of one class that the model predicted as another. Here, many examples fall off the diagonal, suggesting a fair number of incorrect predictions.
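As a sketch, scikit-learn's ConfusionMatrixDisplay (available in scikit-learn 1.0 and later) can build this plot directly from the test-set predictions:

from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Rows are true classes, columns are predicted classes; off-diagonal cells
# count misclassified test examples.
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()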
Many models output a score or probability rather than a hard class label; the predicted class then depends on a decision threshold. Sweeping that threshold across its range and plotting the true positive rate against the false positive rate at each setting produces the ROC curve.
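A sketch of plotting the ROC curve with scikit-learn's RocCurveDisplay, using the model's predicted probability for the positive class (assumed here to be label 1) rather than its hard predictions:

from sklearn.metrics import RocCurveDisplay
import matplotlib.pyplot as plt

# Score each test example with the probability of the positive class, so the
# curve reflects every possible decision threshold rather than a single one.
y_score = model.predict_proba(X_test)[:, 1]
RocCurveDisplay.from_predictions(y_test, y_score)
plt.show()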
From the ROC above, you can also compute a metric called the area under the curve (AUC). Visually, this is the area under the curve that you just plotted. Intuitively, a perfect classifier yields an AUC of 1, while a classifier that guesses at random yields an AUC of about 0.5.
Scikit-learn also has a function for computing AUC. Compute the area under the curve.
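For example, roc_auc_score computes the AUC from the same per-example scores used for the ROC curve above:

from sklearn.metrics import roc_auc_score

# AUC summarizes the ROC curve as a single number between 0 and 1.
print("AUC:", roc_auc_score(y_test, y_score))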
Which evaluation method is more appropriate, and under what circumstances? When might you care more about the confusion matrix (or model accuracy) than about the ROC curve or the area under it?