Feature Extraction¶

Data representation plays a critical role in the performance of many machine learning methods. The representation of network traffic often determines the effectiveness of these models as much as the choice of model itself. The wide range of novel events that network operators need to detect (e.g., attacks, malware, new applications, changes in traffic demands) allows for a broad range of possible models and data representations.

NetML is an open-source tool and end-to-end pipeline for anomaly detection in network traffic. This notebook walks through the use of that library.

First, let us load the library.

In [1]:

import logging
logging.getLogger("scapy.runtime").setLevel(logging.ERROR)

from netml.pparser.parser import PCAP
from netml.utils.tool import dump_data, load_data

import pandas as pd

Specify a Packet Capture File¶

Create a pcap data structure from which we would like to extract features. You can use the packet capture files from previous hands-on assignments, although any packet capture file will suffice.

You can define the minimum number of packets that you want to include in each flow.

Convert the Packet Capture Into Flows¶

Find the function in netml that converts the pcap file into flows. Examine the resulting data structure. What does it contain?
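To build intuition for what a conversion from packets to flows involves, here is a minimal sketch in plain Python. It does not use netml and is not the library's implementation: the packet tuples, the `packets_to_flows` helper, and its `min_packets` argument are all hypothetical, chosen only to illustrate grouping packets by five-tuple and discarding flows with too few packets.

```python
from collections import defaultdict

# Hypothetical packet records (not read from a real pcap):
# (timestamp, src_ip, dst_ip, src_port, dst_port, protocol, size_bytes)
packets = [
    (0.00, "10.0.0.1", "10.0.0.2", 5000, 80, "TCP", 60),
    (0.05, "10.0.0.1", "10.0.0.2", 5000, 80, "TCP", 1500),
    (0.07, "10.0.0.3", "10.0.0.2", 5001, 443, "TCP", 60),
    (0.20, "10.0.0.1", "10.0.0.2", 5000, 80, "TCP", 400),
    (0.90, "10.0.0.4", "10.0.0.2", 5002, 53, "UDP", 80),
    (1.10, "10.0.0.3", "10.0.0.2", 5001, 443, "TCP", 1200),
]


def packets_to_flows(packets, min_packets=2):
    """Group packets by five-tuple and keep flows with enough packets.

    Illustrative helper only; the name and signature are assumptions.
    """
    grouped = defaultdict(list)
    for ts, src, dst, sport, dport, proto, size in packets:
        key = (src, dst, sport, dport, proto)
        grouped[key].append((ts, size))
    # Sort each flow by timestamp and drop flows below the threshold.
    return {
        key: sorted(pkts)
        for key, pkts in grouped.items()
        if len(pkts) >= min_packets
    }


flows = packets_to_flows(packets, min_packets=2)
for key, pkts in flows.items():
    print(key, len(pkts), "packets")
```

Each flow here is keyed by its five-tuple and holds a time-ordered list of (timestamp, size) pairs. The data structure that netml returns may be organized differently, so compare it against this sketch when you inspect it.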

Explore the Flows¶

How many flows are in your data structure?

What other information does the flow data structure contain, for each flow?

Extract Features from Each Flow¶

Use the netml library to extract features from each flow.

The documentation and accompanying paper provide examples of features that you can try to extract.

First try to extract the inter-arrival times for each flow.

Interarrival Times¶
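As a reference point for checking the library's output, the inter-arrival times of a single flow can be computed by hand from its packet timestamps. The sketch below uses plain Python and made-up timestamps; `inter_arrival_times` is a hypothetical helper, not a netml function.

```python
def inter_arrival_times(timestamps):
    """Return the gaps, in seconds, between consecutive packet timestamps."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# Made-up packet timestamps (in seconds) for one flow.
example_timestamps = [0.0, 0.5, 0.75, 2.0]
iats = inter_arrival_times(example_timestamps)
print(iats)
```

A flow with n packets yields n - 1 inter-arrival times, so flows of different lengths produce vectors of different lengths.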

Explore the Per-Flow Features¶

Inspect and print the features for each flow. (If you feel compelled: Get fancy! Plot distributions, etc. Whatever you like!)

Other Features and Options¶

Try some of the other features in the netml library.

Here are some of the other possibilities, which can be passed to the library:

IAT: A flow is represented as a timeseries of inter-arrival times between packets, i.e., elapsed time in seconds between any two consecutive packets in the flow.
STATS: A flow is represented as a set of statistical quantities. We choose ten of the most common such statistics in the literature: flow duration, number of packets sent per second, number of bytes per second, and various statistics on packet sizes within each flow: mean, standard deviation, inter-quartile range, minimum, and maximum.
SIZE: A flow is represented as a timeseries of packet sizes in bytes, with one sample per packet.
SAMP-NUM: A flow is partitioned into small intervals of equal length 𝛿𝑡, and the number of packets in each interval is recorded; thus, a flow is represented as a timeseries of packet counts in small time intervals, with one sample per time interval. Here, 𝛿𝑡 might be viewed as a choice of sampling rate for the timeseries, hence the nomenclature.
SAMP-SIZE: A flow is partitioned into time intervals of equal length 𝛿𝑡, and the total packet size (i.e., byte count) in each interval is recorded; thus, a flow is represented as a timeseries of byte counts in small time intervals, with one sample per time interval.
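The representations above can be sketched by hand for a single flow. The following plain-Python example is illustrative only and is not how netml computes its features: the flow data, the helper names `flow_stats` and `sample_flow`, and the choice of population standard deviation are assumptions made for the sketch.

```python
import statistics

# Hypothetical flow: (timestamp in seconds, packet size in bytes).
flow = [(0.0, 100), (0.4, 200), (1.1, 300), (2.5, 400)]


def flow_stats(flow):
    """STATS-style summary of one flow (assumes at least two packets
    and a nonzero duration)."""
    times = [ts for ts, _ in flow]
    sizes = [size for _, size in flow]
    duration = max(times) - min(times)
    q1, _, q3 = statistics.quantiles(sizes, n=4)
    return {
        "duration": duration,
        "pkts_per_sec": len(flow) / duration,
        "bytes_per_sec": sum(sizes) / duration,
        "mean": statistics.mean(sizes),
        "std": statistics.pstdev(sizes),  # population std; an assumption
        "iqr": q3 - q1,
        "min": min(sizes),
        "max": max(sizes),
    }


def sample_flow(flow, dt):
    """SAMP-NUM and SAMP-SIZE style series: per-interval packet counts
    and byte counts, using intervals of length dt seconds."""
    start = min(ts for ts, _ in flow)
    end = max(ts for ts, _ in flow)
    n_bins = int((end - start) // dt) + 1
    counts = [0] * n_bins
    byte_counts = [0] * n_bins
    for ts, size in flow:
        idx = int((ts - start) // dt)
        counts[idx] += 1
        byte_counts[idx] += size
    return counts, byte_counts


stats = flow_stats(flow)
counts, byte_counts = sample_flow(flow, dt=1.0)
print(stats)
print(counts, byte_counts)
```

Note that the STATS vector has the same length for every flow, whereas the sampled series grow with flow duration and shrink as the interval length increases.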

One of the challenges of providing packet traces to models is ensuring that all feature vectors are of the same length. The netml library will do that for you, but there are a number of different ways to solve the problem. What do the following options do? Explore how different settings of each affect the dimensionality of your resulting feature vector.

flow_ptks_thres
q_interval
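One common way to obtain equal-length vectors, sketched below, is to pick a target dimension from a quantile of the observed vector lengths and then truncate longer vectors and zero-pad shorter ones. This is a generic illustration in plain Python and makes no claim about what the options above do inside netml; `pick_dimension`, `fix_length`, and the quantile argument `q` are hypothetical names introduced for the sketch.

```python
import math


def pick_dimension(lengths, q=0.9):
    """Nearest-rank quantile of the vector lengths (illustrative choice)."""
    ordered = sorted(lengths)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def fix_length(vector, dim, pad_value=0.0):
    """Truncate to dim entries, or pad with pad_value up to dim entries."""
    return list(vector[:dim]) + [pad_value] * max(0, dim - len(vector))


# Made-up per-flow feature vectors of unequal length.
raw_features = [[0.1, 0.2], [0.3], [0.1, 0.1, 0.4, 0.2], [0.5, 0.5, 0.5]]

dim = pick_dimension([len(v) for v in raw_features], q=0.9)
matrix = [fix_length(v, dim) for v in raw_features]
for row in matrix:
    print(row)
```

Lowering `q` shrinks the dimension, so more flows are truncated and fewer are padded. When you vary the netml options, check whether the dimensionality of the resulting feature matrix changes in a comparable way.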