In this assignment, you will explore a capture of a Netflix video stream. The packet capture contains other traffic in addition to the Netflix traffic, so part of the exercise is to filter the capture down to only the Netflix traffic.
In this hands-on activity, you will learn how to:
Packet captures often contain a mix of traffic from many devices, destinations, and applications. When diagnosing performance problems with a particular service, the first step is often to identify and extract the subset of traffic that corresponds to that service.
In this exercise, we will use Domain Name System (DNS) lookups for a set of domains that we know are associated with Netflix to identify the IP addresses (and thus the traffic flows) that belong to Netflix.
import dnslib

# Substrings of domain names that are known to be associated with Netflix.
NF_DOMAINS = ["nflxvideo", "netflix", "nflxso", "nflxext"]
First, load the traffic capture and inspect it.
import pandas as pd
ndf = pd.read_csv("data/netflix.csv.gz")
ndf.head(20)
(index) | No. | Time | Source | Destination | Protocol | Length | Info |
---|---|---|---|---|---|---|---|
0 | 1 | 2018-02-11 08:10:00.534682 | 192.168.43.72 | ns-vip-pro.paris.inria.fr | DNS | 77 | Standard query 0xed0c A fonts.gstatic.com |
1 | 2 | 2018-02-11 08:10:00.534832 | 192.168.43.72 | ns-vip-pro.paris.inria.fr | DNS | 77 | Standard query 0x301a AAAA fonts.gstatic.com |
2 | 3 | 2018-02-11 08:10:00.539408 | 192.168.43.72 | ns-vip-pro.paris.inria.fr | DNS | 87 | Standard query 0x11d3 A googleads.g.doubleclic... |
3 | 4 | 2018-02-11 08:10:00.541204 | 192.168.43.72 | ns-vip-pro.paris.inria.fr | DNS | 87 | Standard query 0x1284 AAAA googleads.g.doublec... |
4 | 5 | 2018-02-11 08:10:00.545785 | 192.168.43.72 | ns-vip-pro.paris.inria.fr | DNS | 78 | Standard query 0x3432 AAAA ytimg.l.google.com |
5 | 6 | 2018-02-11 08:10:00.547036 | 192.168.43.72 | ns-vip-pro.paris.inria.fr | DNS | 96 | Standard query 0xb756 A r4---sn-gxo5uxg-jqbe.g... |
6 | 7 | 2018-02-11 08:10:00.547156 | 192.168.43.72 | ns-vip-pro.paris.inria.fr | DNS | 75 | Standard query 0x62ab A ssl.gstatic.com |
7 | 8 | 2018-02-11 08:10:00.547249 | 192.168.43.72 | ns-vip-pro.paris.inria.fr | DNS | 74 | Standard query 0x42fb A www.google.com |
8 | 9 | 2018-02-11 08:10:00.853950 | ns-vip-pro.paris.inria.fr | 192.168.43.72 | DNS | 386 | Standard query response 0x11d3 A 216.58.213.162 |
9 | 10 | 2018-02-11 08:10:00.853970 | 192.168.43.72 | ns-vip-pro.paris.inria.fr | DNS | 75 | Standard query 0x8756 A www.gstatic.com |
10 | 11 | 2018-02-11 08:10:00.854433 | ns-vip-pro.paris.inria.fr | 192.168.43.72 | DNS | 389 | Standard query response 0x301a AAAA 2a00:1450:... |
11 | 12 | 2018-02-11 08:10:00.854620 | ns-vip-pro.paris.inria.fr | 192.168.43.72 | DNS | 346 | Standard query response 0x62ab A 172.217.18.195 |
12 | 13 | 2018-02-11 08:10:00.854641 | ns-vip-pro.paris.inria.fr | 192.168.43.72 | DNS | 400 | Standard query response 0xb756 A 193.51.224.143 |
13 | 14 | 2018-02-11 08:10:00.854676 | ns-vip-pro.paris.inria.fr | 192.168.43.72 | DNS | 398 | Standard query response 0x1284 AAAA 2a00:1450:... |
14 | 15 | 2018-02-11 08:10:00.859945 | 192.168.43.72 | ns-vip-pro.paris.inria.fr | DNS | 84 | Standard query 0xdd32 A www.googleadservices.com |
15 | 16 | 2018-02-11 08:10:00.861944 | 192.168.43.72 | par10s38-in-f3.1e100.net | TCP | 78 | 58443 > 443 [SYN] Seq=0 Win=65535 Len=0 MSS=14... |
16 | 17 | 2018-02-11 08:10:00.876320 | ns-vip-pro.paris.inria.fr | 192.168.43.72 | DNS | 377 | Standard query response 0xed0c A 216.58.209.227 |
17 | 18 | 2018-02-11 08:10:00.883409 | ns-vip-pro.paris.inria.fr | 192.168.43.72 | DNS | 354 | Standard query response 0x3432 AAAA 2a00:1450:... |
18 | 19 | 2018-02-11 08:10:00.890223 | ns-vip-pro.paris.inria.fr | 192.168.43.72 | DNS | 338 | Standard query response 0x42fb A 216.58.209.228 |
19 | 20 | 2018-02-11 08:10:00.890679 | 192.168.43.72 | ns-vip-pro.paris.inria.fr | DNS | 75 | Standard query 0xf38d A www.youtube.com |
Next, write an expression that filters the dataframe to include only DNS traffic.
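One possible form of that filter is a boolean mask on the `Protocol` column. The sketch below is a minimal illustration, not the reference solution: `dns_only` and the small `sample` dataframe are hypothetical stand-ins that only mimic the column names shown in the output above.

```python
import pandas as pd


def dns_only(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the rows whose Protocol column is exactly 'DNS'."""
    return df[df["Protocol"] == "DNS"].copy()


# Hypothetical stand-in for ndf, using the same column names as the capture.
sample = pd.DataFrame(
    {
        "Source": ["192.168.43.72", "192.168.43.72", "198.51.100.7"],
        "Destination": ["198.51.100.53", "198.51.100.7", "192.168.43.72"],
        "Protocol": ["DNS", "TCP", "DNS"],
        "Length": [77, 78, 386],
        "Info": [
            "Standard query 0xed0c A fonts.gstatic.com",
            "58443 > 443 [SYN] Seq=0 Win=65535 Len=0",
            "Standard query response 0xed0c A 203.0.113.9",
        ],
    }
)

dns_sample = dns_only(sample)
print(dns_sample[["Protocol", "Info"]])
```

On the real capture, the same call would be `dns_only(ndf)`.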
Because you are looking for the IP addresses that are associated with Netflix traffic, you need to match the responses of the corresponding DNS lookups to the queries that contain Netflix domains. You can link them with the transaction ID in the DNS traffic.
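As a rough sketch of the first half of that linkage, the transaction IDs of Netflix queries can be pulled out of the `Info` text with a regular expression. This assumes the `Info` column has the form seen in the output above (`Standard query <id> <type> <name>`); the helper name `netflix_query_ids` and the sample rows are hypothetical. Transaction IDs are only 16 bits and can be reused within a long trace, so matching on the ID alone is an approximation.

```python
import re

import pandas as pd

NF_DOMAINS = ["nflxvideo", "netflix", "nflxso", "nflxext"]

# Matches queries only: in a response, the word "response" sits where the
# transaction ID is expected, so the pattern does not match.
QUERY_RE = re.compile(r"^Standard query (0x[0-9a-fA-F]+) (\S+) (\S+)")


def netflix_query_ids(dns_df: pd.DataFrame, domains=NF_DOMAINS) -> set:
    """Collect transaction IDs of queries whose name contains a Netflix domain."""
    ids = set()
    for info in dns_df["Info"].dropna():
        m = QUERY_RE.match(info)
        if m and any(d in m.group(3) for d in domains):
            ids.add(m.group(1))
    return ids


# Hypothetical rows in the same format as the capture's Info column.
dns_sample = pd.DataFrame(
    {
        "Info": [
            "Standard query 0xa1b2 A www.netflix.com",
            "Standard query 0xed0c A fonts.gstatic.com",
            "Standard query response 0xa1b2 A 198.51.100.7",
            "Standard query 0xc3d4 AAAA example.nflxvideo.net",
        ]
    }
)

nf_ids = netflix_query_ids(dns_sample)
print(sorted(nf_ids))
```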
Get the IP addresses associated with all Netflix traffic in the trace.
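Given a set of transaction IDs for the Netflix queries (however they were obtained), the addresses can be read from the matching responses. The sketch below assumes responses look like `Standard query response <id> A <address> ...` as in the output above, and validates each token with the standard `ipaddress` module rather than a hand-written pattern. The helper names and sample rows are hypothetical. Note that the capture shows some hosts as resolved names rather than addresses, so the result may not line up directly with the `Source` and `Destination` columns.

```python
import ipaddress
import re

import pandas as pd

RESPONSE_RE = re.compile(r"^Standard query response (0x[0-9a-fA-F]+)\s*(.*)$")


def is_ip(token: str) -> bool:
    """True if the token parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(token)
        return True
    except ValueError:
        return False


def response_addresses(dns_df: pd.DataFrame, txids: set) -> set:
    """Collect every address listed in responses whose ID is in txids."""
    ips = set()
    for info in dns_df["Info"].dropna():
        m = RESPONSE_RE.match(info)
        if m and m.group(1) in txids:
            ips.update(tok for tok in m.group(2).split() if is_ip(tok))
    return ips


# Hypothetical rows; the addresses come from documentation-only ranges.
dns_sample = pd.DataFrame(
    {
        "Info": [
            "Standard query 0xa1b2 A www.netflix.com",
            "Standard query response 0xa1b2 A 198.51.100.7 A 198.51.100.8",
            "Standard query response 0xed0c A 203.0.113.9",
            "Standard query response 0xc3d4 AAAA 2001:db8::1",
        ]
    }
)

nf_ips = response_addresses(dns_sample, {"0xa1b2", "0xc3d4"})
print(sorted(nf_ips))
```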
An important feature for inferring video quality of experience is the throughput of each flow in the video stream. To compute throughput, we divide the number of bytes transferred by the length of the time interval over which they were transferred.
As a first step towards computing that feature, count the number of packets and bytes, in each direction, to each Netflix IP address in the trace.
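One way to organize these counts is a pair of `groupby` aggregations, one per direction, joined on the remote address. This is a sketch under stated assumptions: `per_ip_counts`, `client_ip`, and `service_ips` are hypothetical names, the sample rows are invented, and `Length` is taken to be the full frame length in bytes (headers included), as in the capture's `Length` column.

```python
import pandas as pd


def per_ip_counts(df: pd.DataFrame, client_ip: str, service_ips) -> pd.DataFrame:
    """Count packets and bytes per service IP, split into down and up."""
    down = df[df["Source"].isin(service_ips) & (df["Destination"] == client_ip)]
    up = df[(df["Source"] == client_ip) & df["Destination"].isin(service_ips)]

    down_stats = down.groupby("Source")["Length"].agg(["count", "sum"])
    down_stats.columns = ["down_packets", "down_bytes"]
    up_stats = up.groupby("Destination")["Length"].agg(["count", "sum"])
    up_stats.columns = ["up_packets", "up_bytes"]

    # Outer join keeps addresses that appear in only one direction.
    joined = down_stats.join(up_stats, how="outer").fillna(0).astype(int)
    return joined.rename_axis("ip")


CLIENT = "192.168.43.72"
A, B = "198.51.100.7", "198.51.100.8"  # documentation-only addresses

sample = pd.DataFrame(
    {
        "Source": [CLIENT, A, A, CLIENT, B, CLIENT],
        "Destination": [A, CLIENT, CLIENT, A, CLIENT, "203.0.113.9"],
        "Length": [78, 1514, 1514, 66, 1000, 100],
    }
)

counts = per_ip_counts(sample, CLIENT, {A, B})
print(counts)
```

Dividing `down_bytes` by the duration of the observation window then gives an average downstream throughput for each address.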
Another important feature that can be used in inferring video quality of experience is the number of segments per unit time. In this step we will infer the number of segments downloaded per unit time for each IP address.
The number of segments can be determined by counting the number of continuous downstream transfers separated by a packet with a payload of zero bytes. For the last step, compute the number of segment downloads from each Netflix IP address.
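The counting rule above can be sketched as a small state machine over the downstream payload lengths. This is one interpretation, not the reference solution: it treats each run of consecutive non-empty downstream packets as one segment, and a zero-payload packet as the separator. It also assumes the payload length can be read from a `Len=<n>` field in `Info`, which appears in the TCP rows of the capture; rows without that field are dropped here. `count_segments`, `segments_per_ip`, and the sample rows are hypothetical.

```python
import pandas as pd


def count_segments(payload_lengths) -> int:
    """Count runs of non-zero payloads separated by zero-payload packets."""
    segments = 0
    in_transfer = False
    for n in payload_lengths:
        if n > 0 and not in_transfer:
            segments += 1          # first data packet of a new run
            in_transfer = True
        elif n == 0:
            in_transfer = False    # a zero-byte payload closes the run
    return segments


def segments_per_ip(down_df: pd.DataFrame) -> pd.Series:
    """Apply count_segments to the downstream packets of each source IP."""
    payload = down_df["Info"].str.extract(r"Len=(\d+)", expand=False)
    known = down_df.assign(payload=pd.to_numeric(payload)).dropna(subset=["payload"])
    ordered = known.sort_values("Time")
    return ordered.groupby("Source")["payload"].apply(count_segments)


# Hypothetical downstream packets from two documentation-only addresses.
down_sample = pd.DataFrame(
    {
        "Time": [f"2018-02-11 08:10:0{i}" for i in range(7)],
        "Source": ["198.51.100.7"] * 5 + ["198.51.100.8"] * 2,
        "Info": [
            "443 > 58443 [ACK] Seq=1 Ack=1 Len=1448",
            "443 > 58443 [ACK] Seq=1449 Ack=1 Len=1448",
            "443 > 58443 [ACK] Seq=2897 Ack=200 Len=0",
            "443 > 58443 [ACK] Seq=2897 Ack=200 Len=1448",
            "Application Data",
            "443 > 58444 [ACK] Seq=1 Ack=1 Len=0",
            "443 > 58444 [ACK] Seq=1 Ack=1 Len=512",
        ],
    }
)

segs = segments_per_ip(down_sample)
print(segs)
```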
In this exercise, we used the domain name system (DNS) lookup traffic to identify Netflix traffic. This approach can work in practice but is far from perfect, for a number of reasons:
This turns the problem of service identification (i.e., identifying Netflix traffic itself) into an inference/machine learning problem. What features can you think of that could work for developing a model that can perform this type of inference?