In this lab, we will explore the process of data acquisition.
In the case of passive network traffic analysis, there are generally two primary ways of acquiring data: capturing raw packets, and collecting summary statistics about flows.
The advent of more programmability in networks is quickly changing this landscape.
In particular, systems like Retina now make it possible to ask more complex questions of network traffic through passive capture and analysis, but the underlying analysis is still based on raw packet capture.
Because packet captures are so large, it can sometimes be convenient to work with summary statistics about network traffic. Instead of storing the raw packets, the data can record the total number of bytes, packets, and so forth for each flow.
Raw traffic capture is thus sometimes represented as summaries of flow statistics, rather than raw packet traces. In this activity, we will generate the summary statistics and then think about what types of information are (and are not) available in a packet trace summary vs. a raw packet capture.
Load the packet capture from the last assignment.
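import pandas as pd
# Load the gzip-compressed CSV export of the capture (pandas infers the compression from the .gz extension).
ndf = pd.read_csv("data/netflix.csv.gz")
# Preview the first 20 packets.
ndf.head(20)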
| | No. | Time | Source | Destination | Protocol | Length | Info |
|---|---|---|---|---|---|---|---|
0 | 1 | 2018-02-11 08:10:00.534682 | 192.168.43.72 | ns-vip-pro.paris.inria.fr | DNS | 77 | Standard query 0xed0c A fonts.gstatic.com |
1 | 2 | 2018-02-11 08:10:00.534832 | 192.168.43.72 | ns-vip-pro.paris.inria.fr | DNS | 77 | Standard query 0x301a AAAA fonts.gstatic.com |
2 | 3 | 2018-02-11 08:10:00.539408 | 192.168.43.72 | ns-vip-pro.paris.inria.fr | DNS | 87 | Standard query 0x11d3 A googleads.g.doubleclic... |
3 | 4 | 2018-02-11 08:10:00.541204 | 192.168.43.72 | ns-vip-pro.paris.inria.fr | DNS | 87 | Standard query 0x1284 AAAA googleads.g.doublec... |
4 | 5 | 2018-02-11 08:10:00.545785 | 192.168.43.72 | ns-vip-pro.paris.inria.fr | DNS | 78 | Standard query 0x3432 AAAA ytimg.l.google.com |
5 | 6 | 2018-02-11 08:10:00.547036 | 192.168.43.72 | ns-vip-pro.paris.inria.fr | DNS | 96 | Standard query 0xb756 A r4---sn-gxo5uxg-jqbe.g... |
6 | 7 | 2018-02-11 08:10:00.547156 | 192.168.43.72 | ns-vip-pro.paris.inria.fr | DNS | 75 | Standard query 0x62ab A ssl.gstatic.com |
7 | 8 | 2018-02-11 08:10:00.547249 | 192.168.43.72 | ns-vip-pro.paris.inria.fr | DNS | 74 | Standard query 0x42fb A www.google.com |
8 | 9 | 2018-02-11 08:10:00.853950 | ns-vip-pro.paris.inria.fr | 192.168.43.72 | DNS | 386 | Standard query response 0x11d3 A 216.58.213.162 |
9 | 10 | 2018-02-11 08:10:00.853970 | 192.168.43.72 | ns-vip-pro.paris.inria.fr | DNS | 75 | Standard query 0x8756 A www.gstatic.com |
10 | 11 | 2018-02-11 08:10:00.854433 | ns-vip-pro.paris.inria.fr | 192.168.43.72 | DNS | 389 | Standard query response 0x301a AAAA 2a00:1450:... |
11 | 12 | 2018-02-11 08:10:00.854620 | ns-vip-pro.paris.inria.fr | 192.168.43.72 | DNS | 346 | Standard query response 0x62ab A 172.217.18.195 |
12 | 13 | 2018-02-11 08:10:00.854641 | ns-vip-pro.paris.inria.fr | 192.168.43.72 | DNS | 400 | Standard query response 0xb756 A 193.51.224.143 |
13 | 14 | 2018-02-11 08:10:00.854676 | ns-vip-pro.paris.inria.fr | 192.168.43.72 | DNS | 398 | Standard query response 0x1284 AAAA 2a00:1450:... |
14 | 15 | 2018-02-11 08:10:00.859945 | 192.168.43.72 | ns-vip-pro.paris.inria.fr | DNS | 84 | Standard query 0xdd32 A www.googleadservices.com |
15 | 16 | 2018-02-11 08:10:00.861944 | 192.168.43.72 | par10s38-in-f3.1e100.net | TCP | 78 | 58443 > 443 [SYN] Seq=0 Win=65535 Len=0 MSS=14... |
16 | 17 | 2018-02-11 08:10:00.876320 | ns-vip-pro.paris.inria.fr | 192.168.43.72 | DNS | 377 | Standard query response 0xed0c A 216.58.209.227 |
17 | 18 | 2018-02-11 08:10:00.883409 | ns-vip-pro.paris.inria.fr | 192.168.43.72 | DNS | 354 | Standard query response 0x3432 AAAA 2a00:1450:... |
18 | 19 | 2018-02-11 08:10:00.890223 | ns-vip-pro.paris.inria.fr | 192.168.43.72 | DNS | 338 | Standard query response 0x42fb A 216.58.209.228 |
19 | 20 | 2018-02-11 08:10:00.890679 | 192.168.43.72 | ns-vip-pro.paris.inria.fr | DNS | 75 | Standard query 0xf38d A www.youtube.com |
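A flow is defined as a group of packets that share the same attributes, conventionally the 5-tuple of source IP address, destination IP address, source port, destination port, and protocol.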
The CSV file we used in past assignments does not have port numbers, so you can simply group on source and destination IP address.
For each flow in the packet trace, generate the following statistics:
Count the total number of flows in this trace.
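One possible way to approach this step is to group packets on the (Source, Destination) pair and count the groups. The sketch below uses a small hand-made table as a stand-in for the lab data (the addresses and lengths are invented for illustration); the column names match the CSV shown above.

```python
import pandas as pd

# Toy stand-in for the lab's packet table (values are made up).
packets = pd.DataFrame({
    "Source":      ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"],
    "Destination": ["10.0.0.2", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1"],
    "Length":      [77, 1500, 60, 400, 52],
})

# Each distinct (Source, Destination) pair is treated as one flow,
# so A -> B and B -> A count as two separate, directional flows.
flows = packets.groupby(["Source", "Destination"])
num_flows = flows.ngroups
print(num_flows)
```

On the real trace, the same `groupby` can be applied to `ndf` directly.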
Count the total number of bytes for each flow in the trace.
Then, sort the flows by size, in bytes.
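A minimal sketch of the byte count and sort, again on an invented toy table rather than the lab data: sum the `Length` column within each (Source, Destination) group, then sort in descending order.

```python
import pandas as pd

# Toy stand-in for the lab's packet table (values are made up).
packets = pd.DataFrame({
    "Source":      ["10.0.0.1", "10.0.0.2", "10.0.0.2", "10.0.0.1", "10.0.0.2"],
    "Destination": ["10.0.0.2", "10.0.0.1", "10.0.0.1", "10.0.0.3", "10.0.0.1"],
    "Length":      [77, 1500, 1500, 400, 1200],
})

# Total bytes per flow, largest flows first.
bytes_per_flow = (
    packets.groupby(["Source", "Destination"])["Length"]
    .sum()
    .sort_values(ascending=False)
    .rename("bytes")
)
print(bytes_per_flow.head(10))
```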
What do you notice about the large flows? What do they look like?
Count the number of packets in each flow.
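Packet counts can be computed alongside byte counts so the two rankings are easy to compare. The sketch below assumes the same column names as the CSV and uses invented values in which one flow has many small packets and the other has a few large ones.

```python
import pandas as pd

# Toy stand-in for the lab's packet table (values are made up).
packets = pd.DataFrame({
    "Source":      ["10.0.0.1"] * 4 + ["10.0.0.2"] * 2,
    "Destination": ["10.0.0.2"] * 4 + ["10.0.0.1"] * 2,
    "Length":      [60, 60, 60, 60, 1500, 1500],
})

# Named aggregation: one row per flow, with packet and byte totals.
# "count" tallies the non-null Length values, i.e. one per packet here.
flow_stats = (
    packets.groupby(["Source", "Destination"])
    .agg(packets=("Length", "count"), bytes=("Length", "sum"))
    .sort_values("packets", ascending=False)
)
print(flow_stats)
```

In this toy example the flow with the most packets is not the flow with the most bytes, which is the kind of difference the next question asks about.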
What do you notice about these flows? Are they similar to the largest flows by bytes? Which differ?
Compute the duration of each flow by subtracting the time of its first packet from the time of its last packet.
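One way to sketch the duration computation, assuming the `Time` column holds timestamp strings like those in the table above: parse the column to datetimes, take the earliest and latest timestamp per flow, and subtract. The rows below are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the lab's packet table (values are made up).
packets = pd.DataFrame({
    "Time": [
        "2018-02-11 08:10:00.000000",
        "2018-02-11 08:10:02.500000",
        "2018-02-11 08:10:01.000000",
        "2018-02-11 08:10:01.250000",
        "2018-02-11 08:10:05.000000",
    ],
    "Source":      ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.2", "10.0.0.1"],
    "Destination": ["10.0.0.2", "10.0.0.2", "10.0.0.1", "10.0.0.1", "10.0.0.3"],
    "Length":      [77, 1500, 60, 60, 400],
})

# Timestamps read from CSV are strings; convert them before doing arithmetic.
packets["Time"] = pd.to_datetime(packets["Time"])

# First and last packet time per flow, then the difference in seconds.
span = packets.groupby(["Source", "Destination"])["Time"].agg(["min", "max"])
duration = (
    (span["max"] - span["min"])
    .dt.total_seconds()
    .sort_values(ascending=False)
    .rename("duration_s")
)
print(duration)
```

Note that a flow with a single packet has a duration of zero under this definition.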
What are the longest flows in the trace?
Compute the bytes per second and packets per second for each flow.
For a simple feature computation, compute the average bytes and packets per second for each flow, for the entire duration of the flow. If you want to get more clever or fancy, you can do "windowed averages", computing bytes or packets per second for shorter time intervals.
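The sketch below illustrates both ideas on an invented toy table: whole-flow average rates, and a windowed version that buckets each flow's packets into one-second intervals with `pd.Grouper`. The one-second window and the choice to report NaN for zero-duration flows are assumptions made for this example, not requirements of the lab.

```python
import pandas as pd

# Toy stand-in for the lab's packet table (values are made up).
packets = pd.DataFrame({
    "Time": pd.to_datetime([
        "2018-02-11 08:10:00.000",
        "2018-02-11 08:10:00.500",
        "2018-02-11 08:10:02.000",
        "2018-02-11 08:10:01.000",
    ]),
    "Source":      ["10.0.0.1", "10.0.0.1", "10.0.0.1", "10.0.0.2"],
    "Destination": ["10.0.0.2", "10.0.0.2", "10.0.0.2", "10.0.0.1"],
    "Length":      [1000, 1000, 500, 60],
})

# Whole-flow averages: totals divided by the flow's duration.
flows = packets.groupby(["Source", "Destination"]).agg(
    bytes=("Length", "sum"),
    packets=("Length", "count"),
    first=("Time", "min"),
    last=("Time", "max"),
)
duration_s = (flows["last"] - flows["first"]).dt.total_seconds()
# Single-packet flows have zero duration; use NaN instead of dividing by zero.
duration_s = duration_s.where(duration_s > 0)
flows["bytes_per_s"] = flows["bytes"] / duration_s
flows["packets_per_s"] = flows["packets"] / duration_s

# Windowed version: bytes and packets per flow in each 1-second interval.
windowed = (
    packets.groupby(["Source", "Destination", pd.Grouper(key="Time", freq="1s")])["Length"]
    .agg(bytes="sum", packets="count")
)
print(flows[["bytes_per_s", "packets_per_s"]])
print(windowed)
```

Because each window is one second long, the windowed totals can be read directly as bytes per second and packets per second for that interval.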
Some of the libraries that we will use in this class, including the netml library from the University of Chicago, will compute these and other statistics automatically.
What are the largest flows in terms of: Number of bytes? Number of packets?
What do you notice about the flow sizes and the directions of flows?
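To look at direction, one option is to place each flow next to the flow going the opposite way. The sketch below does this by swapping the Source and Destination columns and merging; the toy values and the `reverse_bytes` column name are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the lab's packet table (values are made up).
packets = pd.DataFrame({
    "Source":      ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.2", "10.0.0.1"],
    "Destination": ["10.0.0.2", "10.0.0.2", "10.0.0.1", "10.0.0.1", "10.0.0.9"],
    "Length":      [77, 78, 1500, 1500, 75],
})

# Bytes per directional flow, as ordinary columns.
fwd = (
    packets.groupby(["Source", "Destination"])["Length"]
    .sum()
    .rename("bytes")
    .reset_index()
)

# The same table with the endpoints swapped describes the reverse direction.
rev = fwd.rename(columns={
    "Source": "Destination",
    "Destination": "Source",
    "bytes": "reverse_bytes",
})

# A left merge keeps flows that have no traffic in the reverse direction.
both = fwd.merge(rev, on=["Source", "Destination"], how="left")
both["reverse_bytes"] = both["reverse_bytes"].fillna(0)
print(both.sort_values("bytes", ascending=False))
```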
What kinds of features are not available in packet summary statistics like those above which might be available in a raw packet trace? How might those features be useful for different packet classification problems?