This notebook is designed to give you a very simple example of how to use nPrint in a generic machine learning pipeline, and to show how rapidly we can train new models and test new ideas on network traffic. Note that this example is meant to demonstrate the pipeline, not to solve a hard problem. The traffic was collected to the same website over the course of about 15 seconds.
In this brief overview, you will use the nPrint tool to generate fingerprints from packet captures (pcaps) that can be fed into a variety of machine learning algorithms. By the end of this activity, you will have generated nPrints from raw pcaps, trained a classifier on them, and evaluated how well it distinguishes the two traffic classes.
nPrint must be installed and available on your $PATH for the external commands to work. Note: you may not be able to do this part in Google Colab; it may only work if you are running the notebook on a local (Linux) machine. If that is the case, the second cell, where you execute the commands on the pcaps, may not run, but we have provided the .npt nPrint output files as well, so you can run the rest of the notebook.
You will want to install:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
First, use nPrint to generate nPrints from each traffic trace, including only the TCP headers in the nPrints (check the nPrint usage options to see which flag you will need for that).
nprint = '/usr/local/bin/nprint'
data = 'data/'
cmd_http = '{} -P {}/http.pcap -t -W {}/http.npt'.format(nprint, data, data)
cmd_log4j = '{} -P {}/log4j.pcap -t -W {}/log4j.npt'.format(nprint, data, data)
!{cmd_http}
!{cmd_log4j}
Let's examine the nPrints, which can be loaded directly with pandas. Load the nPrints into data frames using pandas' read_csv function. How many packets are in each nPrint? How many features are in each packet?
import pandas as pd
nprint_http = pd.read_csv('{}/http.npt'.format(data), index_col=0)
nprint_log4j = pd.read_csv('{}/log4j.npt'.format(data), index_col=0)
print('HTTP nPrint: Number of Packets: {0}, Features per packet: {1}'.format(nprint_http.shape[0], nprint_http.shape[1]))
print('Log4j nPrint: Number of Packets: {0}, Features per packet: {1}'.format(nprint_log4j.shape[0], nprint_log4j.shape[1]))
HTTP nPrint: Number of Packets: 24798, Features per packet: 240
Log4j nPrint: Number of Packets: 80682, Features per packet: 240
You should see that each nPrint has the same number of features per packet (the maximum number of bits in a TCP header), since nPrint pads every packet to a fixed-width representation. Let's look at the features themselves.
Notice how each bit (feature) is named according to the exact bit it represents in the packet, and all the possible bits of a TCP header are accounted for.
print(nprint_http.columns)
print(nprint_log4j.columns)
Index(['payload_bit_0', 'payload_bit_1', 'payload_bit_2', 'payload_bit_3', 'payload_bit_4', 'payload_bit_5', 'payload_bit_6', 'payload_bit_7', 'payload_bit_8', 'payload_bit_9', ... 'payload_bit_230', 'payload_bit_231', 'payload_bit_232', 'payload_bit_233', 'payload_bit_234', 'payload_bit_235', 'payload_bit_236', 'payload_bit_237', 'payload_bit_238', 'payload_bit_239'], dtype='object', length=240)
Index(['payload_bit_0', 'payload_bit_1', 'payload_bit_2', 'payload_bit_3', 'payload_bit_4', 'payload_bit_5', 'payload_bit_6', 'payload_bit_7', 'payload_bit_8', 'payload_bit_9', ... 'payload_bit_230', 'payload_bit_231', 'payload_bit_232', 'payload_bit_233', 'payload_bit_234', 'payload_bit_235', 'payload_bit_236', 'payload_bit_237', 'payload_bit_238', 'payload_bit_239'], dtype='object', length=240)
Now we need to take each nPrint and make each packet a "sample" for the machine learning task at hand. In this case, we'll set up a supervised learning task where port 80 traffic is labeled "unencrypted" and port 443 traffic is labeled "encrypted".
import numpy as np
def label_data(data, label, features, labels):
    # Treat each packet (row) of the nPrint as one sample with the given label
    for _, row in data.iterrows():
        features.append(np.array(row))
        labels.append(label)
    return features, labels
def train_eval(features, labels, clf):
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(features, labels)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    # Statistics
    report = classification_report(y_test, y_pred)
    print(report)

    # Let's also get the ROC AUC score while we're here, which requires a probability instead of just the prediction
    y_pred_proba = clf.predict_proba(X_test)
    # predict_proba gives us a probability estimate of each class, while roc_auc just cares about the "positive" class
    y_pred_proba_pos = [sublist[1] for sublist in y_pred_proba]
    roc = roc_auc_score(y_test, y_pred_proba_pos)
    print('ROC AUC Score: {0}'.format(roc))
def eval_nprint(class1, class2):
    (cmd1, label1) = class1
    (cmd2, label2) = class2

    # Generate nPrints
    !{cmd1}
    !{cmd2}

    # Load nPrints (the commands above are expected to write these two files)
    df1 = pd.read_csv('{}/http.npt'.format(data), index_col=0)
    df2 = pd.read_csv('{}/log4j.npt'.format(data), index_col=0)

    # Label each packet, then train and evaluate a random forest
    features = []
    labels = []
    (features, labels) = label_data(df1, label1, features, labels)
    (features, labels) = label_data(df2, label2, features, labels)

    rf = RandomForestClassifier(n_estimators=1000, max_depth=None, min_samples_split=2, random_state=0)
    train_eval(features, labels, rf)

    return rf
We're now ready to train and test a model on the traffic we gathered. Let's split the data into training and testing sets, train a model, and get a statistics report.
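For example, here is a minimal sketch of invoking the pipeline defined above. The variable name rf_tcp and the capture-name labels are illustrative choices; substitute whatever class labels fit your task, such as the unencrypted/encrypted labels described earlier.

# Sketch: run the full pipeline on the TCP-header nPrint commands defined earlier
rf_tcp = eval_nprint((cmd_http, 'http'), (cmd_log4j, 'log4j'))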
nPrint's alignment of each packet lets us see which specific features (parts of the packet) are driving the model's performance. It turns out that the options being set in the TCP header are actually more important than the port numbers themselves!
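As a quick illustration, here is a sketch of inspecting those feature importances, assuming you kept the model returned by eval_nprint (rf_tcp in the sketch above) and that its features line up with the columns of nprint_http:

# Sketch: rank the named bit positions by the random forest's importance scores
ranked = sorted(zip(nprint_http.columns, rf_tcp.feature_importances_), key=lambda pair: pair[1], reverse=True)
for name, score in ranked[:10]:
    print(name, score)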
Now that we have a generic pipeline, we can leverage nPrint's flags to generate different versions of nPrints.
Test a version of this classification problem using only the IPv4 Headers of the packets.
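One possible sketch, assuming nPrint's -4 flag selects IPv4 headers (check nPrint's help output on your system); the command and variable names below are illustrative:

# Sketch: regenerate the nPrints with only IPv4 headers (assumes -4 selects IPv4 headers)
cmd_http_ipv4 = '{} -P {}/http.pcap -4 -W {}/http.npt'.format(nprint, data, data)
cmd_log4j_ipv4 = '{} -P {}/log4j.pcap -4 -W {}/log4j.npt'.format(nprint, data, data)
rf_ipv4 = eval_nprint((cmd_http_ipv4, 'http'), (cmd_log4j_ipv4, 'log4j'))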
How about testing using just the first 30 payload bytes in each packet?
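A similar sketch for the payload-only version, assuming nPrint's -p flag takes the number of payload bytes to include:

# Sketch: regenerate the nPrints with only the first 30 payload bytes (assumes -p takes a byte count)
cmd_http_payload = '{} -P {}/http.pcap -p 30 -W {}/http.npt'.format(nprint, data, data)
cmd_log4j_payload = '{} -P {}/log4j.pcap -p 30 -W {}/log4j.npt'.format(nprint, data, data)
rf_payload = eval_nprint((cmd_http_payload, 'http'), (cmd_log4j_payload, 'log4j'))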
Using this representation performs less well. Why might that be the case?
This hands-on exercise demonstrated how nPrint can be used to rapidly train and test models for different traffic analysis problems. While this problem was contrived and simple, the same basic steps can be performed for any single-packet classification problem.
If you want to train and test using sets of packets as input to a model, you'll either need a model that can handle 2D input, such as a CNN, or you'll need to flatten each 2D packet sample into a 1D sample for use with a model such as the random forest above.
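For instance, here is a minimal sketch of the flattening approach, using an arbitrary window of 10 consecutive packets per sample (the window size and variable names are illustrative):

# Sketch: group every 10 consecutive packets of one nPrint into a single flattened 1D sample
window = 10
flat_samples = []
for start in range(0, len(nprint_http) - window + 1, window):
    sample_2d = nprint_http.iloc[start:start + window].to_numpy()
    flat_samples.append(sample_2d.flatten())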
pcapML is a system for improving the reproducibility of traffic analysis tasks. pcapML leverages the pcapng file format to encode metadata directly into raw traffic captures, removing any ambiguity about which packets belong to a given traffic flow, application, attack, etc., while remaining compatible with popular tools such as tshark and tcpdump.
For dataset curators, pcapML provides an easy way to encode metadata into raw traffic captures, ensuring the dataset is used in a consistent manner. On the analysis side, pcapML provides a standard dataset format for users to work with across different datasets.
import pandas as pd
import logging
logging.getLogger("scapy").setLevel(logging.CRITICAL)
import pcapml_fe
from pcapml_fe_helpers import *
packets = []
for traffic_sample in pcapml_fe.sampler('data/country-of-origin.pcapng'):
    for pkt in traffic_sample.packets:
        # Parse the IP and TCP layers from each packet's raw bytes and record the
        # addressing fields, the packet length, and the sample's metadata label
        pip = IP(pkt.raw_bytes)
        if TCP not in pip:
            continue  # skip any packets without a TCP layer
        ptcp = pip[TCP]
        packets.append((str(pip.src), ptcp.sport, str(pip.dst), ptcp.dport, len(pip), traffic_sample.metadata))
pdf = pd.DataFrame(packets, columns=['src IP', 'src port', 'dst IP', 'dst port', 'length', 'country'])
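As a quick sanity check on the labeled DataFrame (a sketch using only what was built above), you can count how many packets carry each country label:

# Sketch: number of packets per country-of-origin label
print(pdf.groupby('country').size())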