With a better understanding of data representation, let's now turn to preparing data for input into a machine learning pipeline. In the case of unsupervised learning, a simple matrix-level representation of the traffic can suffice as input to a machine learning model; for supervised learning, we also need accompanying labels.
Often, traffic capture datasets are accompanied by labels. These labels tell us something about the corresponding data points (e.g., flows or packets) in the traffic and can be used to train a model to make predictions on future traffic.
Automated tools exist for assigning labels to traffic flows, including pcapML. Before we use those tools, we will walk through preparing and labeling an existing dataset ourselves: a Log4j trace from Malware Traffic Analysis and a regular (benign) trace.
You can use the netml traffic analysis library to generate features.
import logging
logging.getLogger("scapy.runtime").setLevel(logging.ERROR)  # silence scapy's runtime warnings

from netml.pparser.parser import PCAP  # pcap-to-flow and flow-to-feature conversion
from netml.utils.tool import dump_data, load_data  # helpers for saving/loading processed data

import pandas as pd
Load the Log4j and HTTP packet capture files and extract features from the flows. Feel free to compute features manually, although it will likely be more convenient at this point to use the netml library.
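For example, you might load the two traces as follows (a minimal sketch; the file names are placeholders for wherever you saved the captures, and flow_ptks_thres sets the minimum number of packets a flow must contain to be kept):

log4j_pcap = PCAP('log4j.pcap', flow_ptks_thres=2, verbose=0)  # malicious trace
http_pcap = PCAP('http.pcap', flow_ptks_thres=2, verbose=0)    # benign trace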
Find the function in netml that converts the pcap file into flows. Examine the resulting data structure. What does it contain?
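As a sketch, using the PCAP objects loaded above, the conversion and a quick inspection might look like the following; each entry in the resulting list pairs a flow identifier (the five-tuple) with that flow's packets (the exact structure may vary slightly across netml versions):

log4j_pcap.pcap2flows()  # group the packets in the capture into flows
http_pcap.pcap2flows()

print(type(log4j_pcap.flows))  # a list of (flow_id, packets) pairs
print(log4j_pcap.flows[0])     # inspect the first flow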
How many flows are in each of these pcaps? (Use the netml library output to determine the size of each data structure.)
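One way to count them, again assuming the objects above:

print(len(log4j_pcap.flows), len(http_pcap.flows))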
If you loaded the two pcaps with netml separately, the features will not be of the same dimension.
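One simple workaround is to zero-pad the narrower feature matrix so the two matrices can be stacked into a single training set. A minimal sketch, assuming the objects above and netml's inter-arrival-time ('IAT') feature type; the labels 1 and 0 mark the Log4j and benign flows, respectively:

import numpy as np

log4j_pcap.flow2features('IAT', fft=False, header=False)
http_pcap.flow2features('IAT', fft=False, header=False)

# Pad each feature matrix with zero columns up to the wider of the two widths.
dim = max(log4j_pcap.features.shape[1], http_pcap.features.shape[1])

def pad(X, dim):
    return np.hstack([X, np.zeros((X.shape[0], dim - X.shape[1]))])

X = np.vstack([pad(log4j_pcap.features, dim), pad(http_pcap.features, dim)])
y = np.concatenate([np.ones(log4j_pcap.features.shape[0]),    # 1 = Log4j
                    np.zeros(http_pcap.features.shape[0])])   # 0 = benign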
You should now have data that can be input into a model with scikit-learn. Import the scikit-learn package (sklearn) and a classification model of your choice to test that you can train your model with the above data.
Hint: The function you want to call is fit.
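For example, with a random forest (one arbitrary choice of classifier), trained on the stacked matrices from the sketch above:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)
model.fit(X, y)  # X: flow features, y: 0/1 labels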
Note: If you plan to use a linear model such as logistic regression, your labels should be numerical values; since the problem in this case is binary classification, the appropriate labels are 0 and 1 for the respective classes. (If you are using a tree-based model, the labels can take any format.)
Note that we have not done anything here except train the model on all of the data; with no train/test split, the model will of course be well-fit to the data, and to evaluate it properly we will need to split the data into train and test sets. To simply test that your trained model works, call predict using a feature vector that you generate by hand (e.g., from scratch, using a random set of numbers, or from another pcap).
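For instance, using a random vector of the right dimensionality (purely to check that prediction runs; the output is not meaningful):

import numpy as np

x_new = np.random.rand(1, X.shape[1])  # a made-up flow feature vector
print(model.predict(x_new))            # prints 0 (benign) or 1 (Log4j)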
Consider the following extension to the above exercise: convert additional packet traces (using features from netml or some of your own) into the same feature representation.

The above exercise gives you an example of how to generate features from a packet capture, attach labels to the dataset, and train a model using the labeled data.
Many other steps exist in the machine learning pipeline, including splitting the data into training and test sets, tuning model parameters, and evaluating the model. These will be the next steps we walk through.