Trees and Ensembles¶

Today we'll investigate whether a network eavesdropper can use device traffic to infer what people are doing inside their homes. We will pretend to be the eavesdropper and use machine learning classifiers, nearest neighbors and tree ensembles, to perform this attack. We'll discuss what makes these algorithms effective, why this constitutes a privacy risk, and how we can protect device owners.

In [2]:
import numpy as np
import pandas as pd

import logging
logging.getLogger("scapy.runtime").setLevel(logging.ERROR)

import sklearn
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

plt.rcParams["figure.figsize"] = (8,6)

import netml
from netml.pparser.parser import PCAP

Application to IoT Privacy¶

In order to apply a classifier to our IoT device network data we need to take the following steps:

  1. Convert the lists of packets into points, with each point representing the device's network activity at a particular time
  2. Associate each point with a label (the activity you were doing with the device at the time of the point).
  3. Divide the points into a training set and a test set, then train a classifier. For K-Nearest Neighbors, "train" is a bit of a misnomer: training consists simply of storing the training points for later distance computations and comparisons; no model parameters are fit.
  4. Predict the labels of the test set using the classifier and calculate the accuracy of the predictions against the real labels.
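The steps above can be sketched end to end with scikit-learn. This is a minimal illustration on random stand-in data (the real points and labels are built below), using `KNeighborsClassifier`:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))       # stand-in 2-D points
y = (X[:, 0] > 0).astype(int)       # stand-in activity labels

# step 3: split, then "train" (store the training points)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# step 4: predict on the test set and score against the true labels
acc = accuracy_score(y_test, knn.predict(X_test))
```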

1. Import data and convert to points¶

The data is currently stored as a list of packets, but we want it as points corresponding to time periods.

In [5]:
hpcap = PCAP('data/nestcam_live.pcap', flow_ptks_thres=2, verbose=10)

hpcap.pcap2pandas()
pcap = hpcap.df

pcap.head(4)
'_pcap2pandas()' starts at 2022-11-03 11:47:26
'_pcap2pandas()' ends at 2022-11-03 11:47:34 and takes 0.1279 mins.
Out[5]:
datetime dns_query dns_resp ip_dst ip_dst_int ip_src ip_src_int is_dns length mac_dst mac_dst_int mac_src mac_src_int port_dst port_src protocol time time_normed
0 2016-07-29 15:10:03 None None 172.24.1.84 2.887254e+09 52.87.161.133 8.781582e+08 False 66 18:b4:30:54:a5:db 27162184033755 b8:27:eb:ed:34:f0 202481601426672 46110.0 443.0 TCP 1469823003.220967 0.000000
1 2016-07-29 15:10:03 None None 172.24.1.84 2.887254e+09 52.87.161.133 8.781582e+08 False 66 18:b4:30:54:a5:db 27162184033755 b8:27:eb:ed:34:f0 202481601426672 46110.0 443.0 TCP 1469823003.260909 0.039942
2 2016-07-29 15:10:03 None None 52.87.161.133 8.781582e+08 172.24.1.84 2.887254e+09 False 1506 b8:27:eb:ed:34:f0 202481601426672 18:b4:30:54:a5:db 27162184033755 443.0 46110.0 TCP 1469823003.271401 0.050434
3 2016-07-29 15:10:03 None None 52.87.161.133 8.781582e+08 172.24.1.84 2.887254e+09 False 1506 b8:27:eb:ed:34:f0 202481601426672 18:b4:30:54:a5:db 27162184033755 443.0 46110.0 TCP 1469823003.272394 0.051427

2. Data cleaning¶

Let's clean up the data a bit first.

  1. Filter the data frame so that it only contains packets sent or received by the web camera.
  2. Assume that the eavesdropper is outside the home and only has access to IP header information (not MAC addresses).
  3. Assume that the eavesdropper only has access to the time each packet was sent and its length (this is a reasonable assumption for encrypted traffic).
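One way to sketch this cleaning step, assuming the camera's IP address is known to the eavesdropper (the function name is ours, and the column names follow the DataFrame shown above; the addresses are illustrative):

```python
import pandas as pd

def clean_for_eavesdropper(pcap: pd.DataFrame, camera_ip: str) -> pd.DataFrame:
    # step 1: keep only packets sent or received by the camera
    cam = pcap[(pcap["ip_src"] == camera_ip) | (pcap["ip_dst"] == camera_ip)]
    # steps 2-3: keep only the fields an outside eavesdropper can see
    # for encrypted traffic: the send time and the packet length
    return cam[["time", "length"]].reset_index(drop=True)

# tiny illustrative frame with the column names from the DataFrame above
demo = pd.DataFrame({
    "ip_src": ["172.24.1.84", "8.8.8.8"],
    "ip_dst": ["52.87.161.133", "1.1.1.1"],
    "time": [1469823003.2, 1469823003.3],
    "length": [66, 100],
})
cleaned = clean_for_eavesdropper(demo, "172.24.1.84")
```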

3. Convert to datetime format (optional).¶

Times are in units of seconds since the "epoch" (January 1, 1970 at 00:00:00 GMT), a common format for timestamps.

Let's convert them to normal-looking times.

Now let's convert the list of packets into send rates by summing the packet lengths in equal-length time windows and dividing by the window length. The send_rates() function is defined below.

In [11]:
def send_rates(data, window_len_sec):
    '''Calculates send rates from packet DataFrames
    Arguments:
      data: pandas DataFrame with 'time' and 'length' columns 
              like that returned from pcap_to_pandas()
      window_len_sec: interval for calculating rates
    Returns:
       rates: array of send rates
       times: array of times corresponding to each window in rates
    '''
    data = data.sort_values(by=["time"])
    windows = []
    times = []
    curr_time = data.iloc[0]["time"]
    end_time = curr_time + window_len_sec
    i = 0
    # walk the packets once, accumulating lengths into fixed-width windows
    while curr_time < data.iloc[-1]["time"]:
        windows.append(0)
        times.append(curr_time)
        # add every packet that falls inside the current window
        while i < len(data) and data.iloc[i]["time"] < end_time:
            windows[-1] += data.iloc[i]["length"]
            i += 1
        # advance to the next window
        curr_time = end_time
        end_time = curr_time + window_len_sec
    # convert byte totals to rates (bytes per second)
    rates = np.array(windows) / float(window_len_sec)
    times = np.array(times)
    return rates, times
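As a sanity check, the same windowed sums can be computed in vectorized form with `np.histogram` (a sketch of ours, not part of the notebook): sum packet lengths into equal-width time bins, then divide by the window length.

```python
import numpy as np

# four packets over ~3 seconds, 100 bytes each
times_s = np.array([0.0, 0.5, 1.5, 2.5])
lengths = np.array([100, 100, 100, 100])
window = 1.0

# bin edges at 0, 1, 2, 3 seconds; weights sum the packet lengths per bin
edges = np.arange(times_s.min(), times_s.max() + window, window)
sums, _ = np.histogram(times_s, bins=edges, weights=lengths)
rates = sums / window   # bytes per second in each 1 s window
```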

4. Explore data representations¶

Often, the choice of data representation is at least as important as the choice of model. Try choosing different values for sampling_interval_sec and see how it affects the plots.

Questions:

  • What may be some benefits/drawbacks of having a small sampling interval?
  • What may be benefits/drawbacks of having a large sampling interval?

5. Represent rates as n-dimensional points¶

Next let's represent each rate as an n-dimensional point. Ultimately, we will associate each n-dimensional point with a specific activity. The variable d sets the dimension of each point. We have set it to two for now so that the points are easy to visualize; below we will increase it to build points in higher dimensions.

The rates_to_points function below generates a sequence of d-dimensional points from the rate timeseries above.

In [29]:
# Sample the send-rate timeseries into points for classifier training
def rates_to_points(rates, times, sampling_period):
    
    # group every `sampling_period` consecutive rates into one point;
    # each point keeps the timestamp of its first rate sample
    points = [rates[i:i + sampling_period] for i in range(0, rates.size, sampling_period)]
    times = [times[i] for i in range(0, times.size, sampling_period)]
    # drop the final point if it has fewer than sampling_period samples
    if len(points) > 0 and len(points[-1]) < sampling_period:
        points, times = points[:-1], times[:-1]
    return np.array(points), np.array(times)
    
# number of send rate samples to include in each point. 
# How many total seconds will each point represent? 
# This is ultimately the *dimension*, d, of the space in our classifier.
d = 2

# get d-dimensional points and the time for each point.
# we need to get the times because we're going to label each point based on an activity at a given time.
points, point_times = rates_to_points(rates, 
                                      rate_times,
                                      d) 

Let's take a quick look at the result. We started with a rate sample at each point in time, then binned those into d-dimensional points, each stamped with a time. If the original rate timeseries had $T$ samples, we end up with roughly $T/d$ points.

Now we have points and associated times. If you choose sampling_period = 2 (as we did with d = 2 above), each point is two-dimensional, which makes it easy to plot. Let's try that first and plot the result.
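A plotting sketch for the two-dimensional case, on stand-in data (the axis labels assume each coordinate is a send rate from one window):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
from pathlib import Path

pts = np.random.default_rng(1).normal(size=(50, 2))  # stand-in 2-D points

fig, ax = plt.subplots()
ax.scatter(pts[:, 0], pts[:, 1])
ax.set_xlabel("send rate in first window (bytes/s)")
ax.set_ylabel("send rate in second window (bytes/s)")
fig.savefig("points.png")
saved = Path("points.png").exists()
```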

7. Associate d-dimensional points with activity labels.¶

First, read the labels from the text file. These labels are analogous to our classes/colors from the first example above, except that instead of {red, green, blue}, we have two labels:

  • livestream, which indicates that the camera is simply monitoring; and
  • motion, which indicates that the camera has detected motion and has begun to record/upload a video.
In [16]:
labels = pd.read_csv('data/nestcam_live_labels.txt', header=None, names=["time", "activity"])
labels.head(10)
Out[16]:
time activity
0 16:10:00 livestream
1 16:12:20 motion
2 16:14:00 livestream
3 16:16:20 motion
4 16:18:00 livestream
5 16:20:45 motion
6 16:22:00 livestream
7 16:24:15 motion
8 16:26:00 livestream
9 16:28:20 motion

Assign a label to each point based on timestamp.

First, read in the labels.

Finally, map the points to labels.
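One way to sketch the mapping, assuming the label times have already been converted to the same epoch scale as the point times (the function name and toy timestamps here are ours): each point gets the most recent activity label that started at or before its timestamp.

```python
import numpy as np

def label_points(point_times, label_times, label_names):
    # index of the most recent label change at or before each point's time
    idx = np.searchsorted(label_times, point_times, side="right") - 1
    idx = np.clip(idx, 0, len(label_names) - 1)
    return np.asarray(label_names)[idx]

# toy timestamps: activities start at t=0, t=10, t=20
activities = label_points(
    point_times=np.array([5.0, 12.0, 25.0]),
    label_times=np.array([0.0, 10.0, 20.0]),
    label_names=["livestream", "motion", "livestream"],
)
```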

Now that we've labeled the data, we can associate each point with a label (class) and re-plot the above scatterplot with colors corresponding to the classes.

Let's divide the points into a training set and a test set.
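The split itself is one line with scikit-learn's train_test_split; the test fraction and random_state here are illustrative choices, not values from the notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # stand-in for the d-dimensional points
y = np.array([0, 1] * 5)           # stand-in for the activity labels

# hold out 30% of the points for testing; fix random_state for repeatability
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```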

Random Forest¶

Now we will train a random forest classifier on the training set.

Note: Your choice of data representation makes a huge difference for accuracy! Try this exercise with different values of the sampling interval and the dimension d, and see how the accuracy changes.

Train a Random Forest Classifier¶

Train a RandomForestClassifier on your labeled data.

Perform prediction on the test set, and use accuracy_score to report accuracy.
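A minimal sketch of these two steps on stand-in data (the real features and labels come from the pipeline above; n_estimators and random_state are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))            # stand-in 2-D points
y = (X.sum(axis=1) > 0).astype(int)      # stand-in livestream/motion labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# train the forest, then score predictions on the held-out test set
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
```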


Discussion Questions¶

1. Why is this attack a privacy risk?¶

2. How could we (IoT device programmers, network operators, etc.) protect people from this attack?¶

Additional Exercises¶

1. Adjust parameters to improve accuracy.¶

Now that we have a baseline accuracy, we can tweak the data preprocessing and classifier parameters to improve the accuracy. Look back through the code we've run so far. Which values have we set arbitrarily that could affect the results? Try changing these parameters and re-running the code to see how the classification accuracy is affected. Remember to re-run all of the cells below each change (or just restart the kernel and re-run all cells).
