Data Acquisition¶

In this lab, we will explore the process of data acquisition.

For passive network traffic analysis, there are two primary ways of acquiring data:

  • Packet capture
  • Network traffic flow records (commonly exported using protocols such as NetFlow or IPFIX)

The advent of more programmability in networks is quickly changing this landscape.

In particular, systems like Retina now make it possible to ask more complex questions of network traffic through passive capture and analysis, but the underlying analysis still relies on raw packet capture.

Background¶

Because packet captures are so large, it is often convenient to work with summary statistics about network traffic. Instead of raw packets, the data can record per-flow totals such as the number of bytes and the number of packets.

Traffic is thus sometimes represented as summaries of flow statistics rather than as raw packet traces. In this activity, we will generate these summary statistics and then think about what types of information are (and are not) available in a packet trace summary vs. a raw packet capture.

Step 1: Load a Packet Trace¶

Load the packet capture from the last assignment.

In [1]:
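import pandas as pd

# Load the gzip-compressed CSV export of the packet capture
# (pandas infers the compression from the .gz extension).
ndf = pd.read_csv("data/netflix.csv.gz")
# Show the first 20 packets.
ndf.head(20)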
Out[1]:
No. Time Source Destination Protocol Length Info
0 1 2018-02-11 08:10:00.534682 192.168.43.72 ns-vip-pro.paris.inria.fr DNS 77 Standard query 0xed0c A fonts.gstatic.com
1 2 2018-02-11 08:10:00.534832 192.168.43.72 ns-vip-pro.paris.inria.fr DNS 77 Standard query 0x301a AAAA fonts.gstatic.com
2 3 2018-02-11 08:10:00.539408 192.168.43.72 ns-vip-pro.paris.inria.fr DNS 87 Standard query 0x11d3 A googleads.g.doubleclic...
3 4 2018-02-11 08:10:00.541204 192.168.43.72 ns-vip-pro.paris.inria.fr DNS 87 Standard query 0x1284 AAAA googleads.g.doublec...
4 5 2018-02-11 08:10:00.545785 192.168.43.72 ns-vip-pro.paris.inria.fr DNS 78 Standard query 0x3432 AAAA ytimg.l.google.com
5 6 2018-02-11 08:10:00.547036 192.168.43.72 ns-vip-pro.paris.inria.fr DNS 96 Standard query 0xb756 A r4---sn-gxo5uxg-jqbe.g...
6 7 2018-02-11 08:10:00.547156 192.168.43.72 ns-vip-pro.paris.inria.fr DNS 75 Standard query 0x62ab A ssl.gstatic.com
7 8 2018-02-11 08:10:00.547249 192.168.43.72 ns-vip-pro.paris.inria.fr DNS 74 Standard query 0x42fb A www.google.com
8 9 2018-02-11 08:10:00.853950 ns-vip-pro.paris.inria.fr 192.168.43.72 DNS 386 Standard query response 0x11d3 A 216.58.213.162
9 10 2018-02-11 08:10:00.853970 192.168.43.72 ns-vip-pro.paris.inria.fr DNS 75 Standard query 0x8756 A www.gstatic.com
10 11 2018-02-11 08:10:00.854433 ns-vip-pro.paris.inria.fr 192.168.43.72 DNS 389 Standard query response 0x301a AAAA 2a00:1450:...
11 12 2018-02-11 08:10:00.854620 ns-vip-pro.paris.inria.fr 192.168.43.72 DNS 346 Standard query response 0x62ab A 172.217.18.195
12 13 2018-02-11 08:10:00.854641 ns-vip-pro.paris.inria.fr 192.168.43.72 DNS 400 Standard query response 0xb756 A 193.51.224.143
13 14 2018-02-11 08:10:00.854676 ns-vip-pro.paris.inria.fr 192.168.43.72 DNS 398 Standard query response 0x1284 AAAA 2a00:1450:...
14 15 2018-02-11 08:10:00.859945 192.168.43.72 ns-vip-pro.paris.inria.fr DNS 84 Standard query 0xdd32 A www.googleadservices.com
15 16 2018-02-11 08:10:00.861944 192.168.43.72 par10s38-in-f3.1e100.net TCP 78 58443 > 443 [SYN] Seq=0 Win=65535 Len=0 MSS=14...
16 17 2018-02-11 08:10:00.876320 ns-vip-pro.paris.inria.fr 192.168.43.72 DNS 377 Standard query response 0xed0c A 216.58.209.227
17 18 2018-02-11 08:10:00.883409 ns-vip-pro.paris.inria.fr 192.168.43.72 DNS 354 Standard query response 0x3432 AAAA 2a00:1450:...
18 19 2018-02-11 08:10:00.890223 ns-vip-pro.paris.inria.fr 192.168.43.72 DNS 338 Standard query response 0x42fb A 216.58.209.228
19 20 2018-02-11 08:10:00.890679 192.168.43.72 ns-vip-pro.paris.inria.fr DNS 75 Standard query 0xf38d A www.youtube.com

Step 2: Generate Statistics for Each Flow¶

A flow is a group of packets that share the following attributes:

  • Source IP Address
  • Destination IP Address
  • Source Port
  • Destination Port
  • Time interval

The CSV file we used in past assignments does not include port numbers, so you can simply group on source and destination IP address.

For each flow in the packet trace, generate the following statistics:
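Grouping on the address pair alone ignores the "time interval" attribute in the list above. As an optional extension, one way to approximate it is to start a new flow whenever a pair has been silent for longer than an idle timeout. The sketch below is only an illustration: the helper name `assign_flow_ids`, the `flow_seq` column, the 60-second default, and the tiny sample table are all assumptions, not part of the assignment.

```python
import pandas as pd

# Made-up sample using the same column names as the trace; in the lab,
# pass `ndf` instead.
sample = pd.DataFrame({
    "Time": ["2018-02-11 08:10:00", "2018-02-11 08:10:01",
             "2018-02-11 08:10:02", "2018-02-11 08:13:00"],
    "Source": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "Destination": ["10.0.0.2", "10.0.0.2", "10.0.0.1", "10.0.0.2"],
    "Length": [100, 200, 1500, 60],
})

def assign_flow_ids(df, idle_timeout_s=60.0):
    """Add a per-pair sequence number that increments after each idle gap.

    Hypothetical helper: the timeout value is an assumption.
    """
    out = df.copy()
    out["Time"] = pd.to_datetime(out["Time"])
    out = out.sort_values("Time", kind="stable")
    pair = [out["Source"], out["Destination"]]
    # Time since the previous packet of the same (Source, Destination) pair.
    gap = out.groupby(["Source", "Destination"])["Time"].diff()
    # A flow starts at the first packet of a pair or after a long silence.
    starts = gap.isna() | (gap > pd.Timedelta(seconds=idle_timeout_s))
    out["flow_seq"] = starts.astype(int).groupby(pair).cumsum() - 1
    return out

flows = assign_flow_ids(sample)
n_flows = flows.groupby(["Source", "Destination", "flow_seq"]).ngroups
```

With this extra column, a flow key becomes `(Source, Destination, flow_seq)` instead of the address pair alone.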

  • Number of bytes
  • Number of packets
  • Duration (time)

Total Number of Flows¶

Count the total number of flows in this trace.
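A minimal sketch of one way to count flows, assuming a flow is simply a distinct (Source, Destination) pair. The helper name `count_flows` and the small sample table are made up for illustration; in the lab you would pass `ndf`.

```python
import pandas as pd

# Made-up sample using the same column names as the trace.
sample = pd.DataFrame({
    "Time": ["2018-02-11 08:10:00.0", "2018-02-11 08:10:00.5",
             "2018-02-11 08:10:01.0", "2018-02-11 08:10:03.0"],
    "Source": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "Destination": ["10.0.0.2", "10.0.0.2", "10.0.0.1", "10.0.0.3"],
    "Length": [100, 200, 1500, 60],
})

def count_flows(df):
    """Count distinct (Source, Destination) pairs (hypothetical helper)."""
    return df.groupby(["Source", "Destination"]).ngroups

n_flows = count_flows(sample)
```

Note that this treats A to B and B to A as two separate flows, which matters when you look at flow directions later.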

Number of Bytes¶

Count the total number of bytes for each flow in the trace.

Then, sort the flows by size, in bytes.
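One possible approach, assuming the `Length` column holds each packet's size in bytes: sum it per address pair and sort in descending order. The helper name `bytes_per_flow` and the sample table are illustrative only.

```python
import pandas as pd

# Made-up sample using the same column names as the trace.
sample = pd.DataFrame({
    "Source": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "Destination": ["10.0.0.2", "10.0.0.2", "10.0.0.1", "10.0.0.3"],
    "Length": [100, 200, 1500, 60],
})

def bytes_per_flow(df):
    """Total bytes per (Source, Destination) pair, largest first."""
    return (
        df.groupby(["Source", "Destination"])["Length"]
        .sum()
        .sort_values(ascending=False)
        .rename("bytes")
    )

flow_bytes = bytes_per_flow(sample)
```

Calling `bytes_per_flow(ndf).head(10)` would then list the ten largest flows in the trace.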

What do you notice about the large flows? What do they look like?

Number of Packets¶

Count the number of packets in each flow.
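To make the comparison with byte counts easier, a sketch like the following computes both statistics side by side. The helper name `packets_and_bytes` and the sample table are assumptions for illustration.

```python
import pandas as pd

# Made-up sample using the same column names as the trace.
sample = pd.DataFrame({
    "Source": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "Destination": ["10.0.0.2", "10.0.0.2", "10.0.0.1", "10.0.0.3"],
    "Length": [100, 200, 1500, 60],
})

def packets_and_bytes(df):
    """Packet and byte counts per (Source, Destination) pair."""
    stats = df.groupby(["Source", "Destination"]).agg(
        packets=("Length", "size"),  # number of rows, i.e. packets
        bytes=("Length", "sum"),     # total size of those packets
    )
    return stats.sort_values("packets", ascending=False)

stats = packets_and_bytes(sample)
```

In this toy sample the flow with the most packets is not the flow with the most bytes, which is the kind of difference the questions below ask you to look for.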

What do you notice about these flows? Are they similar to the largest flows by bytes? Which differ?

Duration¶

Compute the duration of each flow by subtracting the time of its first packet from the time of its last packet.
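A sketch of this computation, assuming the `Time` column parses with `pd.to_datetime` (as the timestamps shown in Step 1 do). The helper name `flow_durations` and the sample table are made up for illustration.

```python
import pandas as pd

# Made-up sample using the same column names as the trace.
sample = pd.DataFrame({
    "Time": ["2018-02-11 08:10:00.0", "2018-02-11 08:10:00.5",
             "2018-02-11 08:10:01.0", "2018-02-11 08:10:03.0"],
    "Source": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "Destination": ["10.0.0.2", "10.0.0.2", "10.0.0.1", "10.0.0.3"],
    "Length": [100, 200, 1500, 60],
})

def flow_durations(df):
    """Seconds between first and last packet of each address pair."""
    times = pd.to_datetime(df["Time"])
    grouped = times.groupby([df["Source"], df["Destination"]])
    duration = (grouped.max() - grouped.min()).dt.total_seconds()
    return duration.sort_values(ascending=False).rename("duration_s")

durations = flow_durations(sample)
```

A flow with a single packet has a duration of zero under this definition.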

What are the longest flows in the trace?

Bytes and Packets Per Second¶

Compute the bytes per second and packets per second for each flow.

For a simple feature computation, compute the average bytes and packets per second for each flow over the entire duration of the flow. For a more detailed view, you can compute "windowed averages": bytes or packets per second over shorter time intervals.
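Both variants can be sketched as follows. The helper names `flow_rates` and `windowed_bytes_per_s`, the one-second default window, and the sample table are assumptions for illustration. Flows with zero duration (for example, a single packet) get a missing rate here rather than a division by zero; that is one design choice among several.

```python
import pandas as pd

# Made-up sample using the same column names as the trace.
sample = pd.DataFrame({
    "Time": ["2018-02-11 08:10:00.0", "2018-02-11 08:10:00.5",
             "2018-02-11 08:10:01.0", "2018-02-11 08:10:03.0"],
    "Source": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "Destination": ["10.0.0.2", "10.0.0.2", "10.0.0.1", "10.0.0.3"],
    "Length": [100, 200, 1500, 60],
})

def flow_rates(df):
    """Average bytes/s and packets/s over each flow's whole duration."""
    d = df.assign(Time=pd.to_datetime(df["Time"]))
    stats = d.groupby(["Source", "Destination"]).agg(
        bytes=("Length", "sum"),
        packets=("Length", "size"),
        t_first=("Time", "min"),
        t_last=("Time", "max"),
    )
    duration = (stats["t_last"] - stats["t_first"]).dt.total_seconds()
    # Replace zero durations with NaN so the rates come out as missing.
    duration = duration.where(duration > 0)
    stats["duration_s"] = duration
    stats["bytes_per_s"] = stats["bytes"] / duration
    stats["packets_per_s"] = stats["packets"] / duration
    return stats

def windowed_bytes_per_s(df, window_s=1.0):
    """Bytes/s per flow within fixed windows counted from the trace start."""
    d = df.assign(Time=pd.to_datetime(df["Time"]))
    elapsed = (d["Time"] - d["Time"].min()).dt.total_seconds()
    window = (elapsed // window_s).astype(int).rename("window")
    totals = d.groupby(["Source", "Destination", window])["Length"].sum()
    return totals / window_s

rates = flow_rates(sample)
windowed = windowed_bytes_per_s(sample)
```

Windows in which a flow sent nothing simply do not appear in the windowed result; fill them with zeros if you need a regular time series.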

Note¶

Some of the libraries that we will use in this class, including the netml library from the University of Chicago, will compute these and other statistics automatically.

Thought Questions¶

  1. What are the largest flows in terms of: Number of bytes? Number of packets?

  2. What do you notice about the flow sizes and the directions of flows?

  3. What kinds of features are missing from packet summary statistics like those above but might be available in a raw packet trace? How might those features be useful for different packet classification problems?