Data representation plays a critical role in the performance of many machine learning methods. The representation of network traffic often determines the effectiveness of a model as much as the model itself. The wide range of novel events that network operators need to detect (e.g., attacks, malware, new applications, changes in traffic demands) allows for a broad range of possible models and data representations.
NetML is an open-source tool and end-to-end pipeline for anomaly detection in network traffic. This notebook walks through the use of that library.
First, let us load the library.
import logging
logging.getLogger("scapy.runtime").setLevel(logging.ERROR)
from netml.pparser.parser import PCAP
from netml.utils.tool import dump_data, load_data
import pandas as pd
Create a pcap data structure from which to extract features. You could use the packet capture files from previous hands-on assignments, although any packet capture file will suffice.
You can define the minimum number of packets that you want to include in each flow.
Find the function in netml that converts the pcap file into flows. Examine the resulting data structure. What does it contain?
How many flows are in your data structure?
What other information does the flow data structure contain, for each flow?
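As a point of comparison for what you find, here is a minimal sketch of what grouping packets into flows can look like. It does not use netml, and it is not netml's implementation: the packet tuples, the `group_into_flows` helper, and the `min_packets` parameter are hypothetical names chosen for illustration. It assumes a flow is identified by its five-tuple and keeps only flows that meet the minimum packet count.

```python
from collections import defaultdict

# Hypothetical packet records:
# (timestamp, src_ip, dst_ip, src_port, dst_port, protocol, size_bytes)
packets = [
    (0.00, "10.0.0.1", "10.0.0.2", 50000, 443, "TCP", 60),
    (0.05, "10.0.0.1", "10.0.0.2", 50000, 443, "TCP", 1500),
    (0.07, "10.0.0.3", "10.0.0.2", 50001, 53, "UDP", 80),
    (0.20, "10.0.0.1", "10.0.0.2", 50000, 443, "TCP", 400),
]


def group_into_flows(packets, min_packets=2):
    """Group packets by five-tuple; drop flows with fewer than min_packets packets."""
    flows = defaultdict(list)
    for ts, src, dst, sport, dport, proto, size in packets:
        key = (src, dst, sport, dport, proto)
        # Each flow keeps the per-packet timestamp and size.
        flows[key].append((ts, size))
    return {key: pkts for key, pkts in flows.items() if len(pkts) >= min_packets}


flows = group_into_flows(packets, min_packets=2)
for key, pkts in flows.items():
    print(key, len(pkts))
```

In this sketch each flow is a key (the five-tuple) plus a list of per-packet records. Compare that with what the netml data structure actually stores for each flow.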
Use the netml library to extract features from each flow.
The documentation and accompanying paper provide examples of features that you can try to extract.
First try to extract the inter-arrival times for each flow.
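To make the term concrete, here is a small sketch of how inter-arrival times can be computed from the packet timestamps of a single flow. It is plain Python rather than a netml call; the `inter_arrival_times` helper and the sample timestamps are made up for illustration.

```python
def inter_arrival_times(timestamps):
    """Return the gaps between consecutive packet timestamps, in seconds."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# Hypothetical arrival times (in seconds) for the packets of one flow.
flow_timestamps = [0.0, 0.5, 0.75, 2.0]
iats = inter_arrival_times(flow_timestamps)
print(iats)
```

A flow with n packets yields n - 1 inter-arrival times, so flows of different lengths produce feature lists of different lengths. Keep that in mind when you compare the features that netml returns across flows.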
Inspect and print the features for each flow. (If you feel compelled: Get fancy! Plot distributions, etc. Whatever you like!)
The netml library supports other feature types. Here are some of the other possibilities, which can be passed to the library:
Flow statistics from the literature: flow duration, number of packets sent per second, number of bytes per second, and various statistics on packet sizes within each flow (mean, standard deviation, inter-quartile range, minimum, and maximum).
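The statistics listed above can be computed by hand for one flow, which is a useful sanity check on whatever the library returns. The sketch below uses only the Python standard library. The `flow_statistics` helper, its dictionary keys, and the sample data are hypothetical, and details such as population versus sample standard deviation or the quantile method are assumptions that may differ from netml.

```python
import statistics


def flow_statistics(timestamps, sizes):
    """Compute summary statistics for one flow (needs at least two packets)."""
    duration = max(timestamps) - min(timestamps)
    # Quartiles of the packet sizes; the result depends on the quantile method.
    q1, _, q3 = statistics.quantiles(sizes, n=4)
    return {
        "duration": duration,
        "packets_per_second": len(sizes) / duration if duration > 0 else 0.0,
        "bytes_per_second": sum(sizes) / duration if duration > 0 else 0.0,
        "mean_size": statistics.mean(sizes),
        "std_size": statistics.pstdev(sizes),  # population standard deviation
        "iqr_size": q3 - q1,
        "min_size": min(sizes),
        "max_size": max(sizes),
    }


# Hypothetical flow: four packets over two seconds.
stats = flow_statistics([0.0, 1.0, 1.5, 2.0], [100, 200, 300, 400])
for name, value in stats.items():
    print(f"{name}: {value}")
```

Unlike inter-arrival times, this representation has the same number of entries for every flow, regardless of how many packets the flow contains.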
The netml library will do that for you, but there are a number of different ways to solve the problem. What do some of the following options do? Explore how different settings of the following affect the dimensionality of your resulting feature vector.
What other features might you want to extract from packet captures that are not provided by the netml library?
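On the dimensionality question: per-packet features such as inter-arrival times vary in length from flow to flow, while most models expect every input vector to have the same length. One common way to handle this, shown here only as a sketch, is to truncate long flows and zero-pad short ones. The `to_fixed_length` helper, the `dim` value, and the sample data are hypothetical, and this is not necessarily how netml chooses or enforces the dimension.

```python
def to_fixed_length(values, dim, pad_value=0.0):
    """Truncate or right-pad a sequence so it has exactly dim entries."""
    trimmed = list(values[:dim])
    return trimmed + [pad_value] * (dim - len(trimmed))


# Hypothetical per-flow inter-arrival times of different lengths.
flow_features = [
    [0.5, 0.25, 1.25],
    [0.1],
    [0.2, 0.2, 0.2, 0.2, 0.2],
]

dim = 4  # assumed target dimension; a percentile of flow lengths is another option
matrix = [to_fixed_length(features, dim) for features in flow_features]
for row in matrix:
    print(row)
```

A larger `dim` keeps more of each long flow but adds more padding to short ones, so the setting trades information loss against sparsity.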