Data Preparation¶

With a better understanding of data representation, let's now turn to preparing data for input into a machine learning pipeline. For unsupervised learning, a simple matrix-level representation of the data can suffice as input to a model; for supervised learning, we also need accompanying labels.

Often, traffic capture datasets are accompanied by labels. These labels tell us something about the corresponding data points (e.g., flows, packets) in the traffic, and can be used to train a model for future prediction.

Automated tools exist for assigning labels to traffic flows, including pcapML. Before we use those tools, we will do some of this preparation and labeling ourselves, using an existing dataset: a Log4j trace from a malware traffic analysis exercise and a regular (benign) trace.

You can use the NetML traffic library to generate features.

In [1]:
# Suppress scapy's verbose runtime warnings while parsing packet captures
import logging
logging.getLogger("scapy.runtime").setLevel(logging.ERROR)

# netml's pcap parser and (de)serialization helpers
from netml.pparser.parser import PCAP
from netml.utils.tool import dump_data, load_data

import pandas as pd

Load the Packet Capture Files¶

Load the Log4j and HTTP packet capture files and extract features from the flows. Feel free to compute features manually, although it will likely be more convenient at this point to use the netml library.
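
As a starting point, here is a minimal sketch of loading the two captures with netml's PCAP class. The file paths (data/log4j.pcap and data/http.pcap) are assumptions; point them at wherever you saved the traces, and adjust flow_ptks_thres and verbose to taste.

In [ ]:
# Hypothetical paths; adjust to the location of your downloaded traces.
log4j_pcap = PCAP('data/log4j.pcap', flow_ptks_thres=2, verbose=10)
http_pcap = PCAP('data/http.pcap', flow_ptks_thres=2, verbose=10)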

Convert the Packet Capture Into Flows¶

Find the function in netml that converts the pcap file into flows. Examine the resulting data structure. What does it contain?
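
One way to proceed, sketched below, is to call pcap2flows on each capture and then inspect the flows attribute. In the netml version we used, each entry of flows is a (flow ID, list of packets) pair, where the flow ID is the usual five-tuple; verify this against your installed version.

In [ ]:
# Group packets into flows (netml splits on the five-tuple plus an inactivity interval).
log4j_pcap.pcap2flows(q_interval=0.9)
http_pcap.pcap2flows(q_interval=0.9)

# Inspect the resulting data structure.
print(type(log4j_pcap.flows))
print(log4j_pcap.flows[0])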

Explore the Flows¶

How many flows are in each of these pcaps? (Use the netml library output to determine the size of each data structure.)
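
Assuming the flows attribute is a plain Python list, as above, its length gives the number of flows in each capture:

In [ ]:
print(f"Log4j flows: {len(log4j_pcap.flows)}")
print(f"HTTP flows:  {len(http_pcap.flows)}")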

Normalize the Shapes of Each Feature Set¶

If you loaded the two pcaps with netml separately, the features will not be of the same dimension.

  1. Adjust your data frames so that the two have the same number of columns.
  2. Merge (i.e., concatenate) the two data frames, but preserve the labels as a separate vector called "target" (one way to do both steps is sketched below).
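
Here is one way to do both steps, continuing the sketch above. It assumes you extract netml's inter-arrival-time (IAT) features with flow2features (other feature types would work as well), trims both matrices to a common number of columns, and labels Log4j flows as 1 and benign HTTP flows as 0; the variable names are ours, not part of netml.

In [ ]:
import numpy as np

# Extract inter-arrival-time (IAT) features from each set of flows.
log4j_pcap.flow2features('IAT', fft=False, header=False)
http_pcap.flow2features('IAT', fft=False, header=False)
log4j_features = log4j_pcap.features
http_features = http_pcap.features

# Trim both feature matrices to the same number of columns, then stack them.
n_cols = min(log4j_features.shape[1], http_features.shape[1])
X = pd.DataFrame(np.vstack([log4j_features[:, :n_cols],
                            http_features[:, :n_cols]]))

# Keep the labels as a separate vector: 1 for Log4j flows, 0 for benign HTTP flows.
target = pd.Series([1] * len(log4j_features) + [0] * len(http_features), name='target')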

Try Your Data on a Model¶

You should now have data that can be input into a model with scikit-learn. Import the scikit-learn package (sklearn) and a classification model of your choice to test that you can train your model with the above data.

Hint: The function you want to call is fit.

Note: If you plan to use a linear model such as logistic regression, your labels should be numerical values, and for a binary classification problem, as in this case, the appropriate labels are 0 and 1 for the two classes. (If you are using a tree-based model, the labels can take any format.)

(Note that we have not done anything here except train the model with all of the data. To evaluate the model, we will need to split the data into train and test sets.)
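
As a concrete example, here is a minimal sketch that fits a random forest classifier (one of many reasonable choices) on the X and target built above:

In [ ]:
from sklearn.ensemble import RandomForestClassifier

# Train on the full dataset; no train/test split yet.
model = RandomForestClassifier(random_state=42)
model.fit(X, target)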

Test Your Trained Model¶

We used the entire dataset to train the model in this example (no split), so of course the model will be well-fit to all of the data. To check that your trained model works, call predict on a feature vector that you generate by hand (e.g., from scratch, from a random set of numbers, or from another pcap).
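
For instance, a quick sanity check (assuming the model and X from the sketches above) is to predict on a random vector with the same number of features as the training data:

In [ ]:
import numpy as np

# A made-up "flow": random values with the same width as the training features.
fake_flow = np.random.rand(1, X.shape[1])
print(model.predict(fake_flow))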

Bonus¶

Consider the following extensions to the above exercise:

  • Concatenate or combine multiple features (either from netml or some of your own) into the same feature representation.
  • Normalize your features so that they are in the same range (helpful for some models); a sketch follows below.
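
For the second bullet, one common option (our choice here, not a requirement of the exercise) is min-max scaling with scikit-learn:

In [ ]:
from sklearn.preprocessing import MinMaxScaler

# Rescale every feature column into the [0, 1] range.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)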

The above exercise gives you an example of how to generate features from a packet capture, attach labels to the dataset, and train a model using the labeled data.

Looking Ahead¶

Many other steps exist in the machine learning pipeline, including splitting the data into training and test sets, tuning model parameters, and evaluating the model. These will be the next steps we walk through.