Today we'll investigate whether a network eavesdropper can use device traffic to infer what people are doing inside their homes. We will pretend to be the eavesdropper and use a nearest neighbors classifier to perform this attack. We'll discuss what makes this algorithm effective, why this constitutes a privacy risk, and how we can protect device owners.
import numpy as np
import pandas as pd
import logging
logging.getLogger("scapy.runtime").setLevel(logging.ERROR)
import sklearn
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
plt.rcParams["figure.figsize"] = (8,6)
import netml
from netml.pparser.parser import PCAP
In order to apply a classifier to our IoT device network data, we first need to reshape it: the data is currently stored as a list of packets, but we want it as points corresponding to time periods.
hpcap = PCAP('data/nestcam_live.pcap', flow_ptks_thres=2, verbose=10)
hpcap.pcap2pandas()
pcap = hpcap.df
pcap.head(4)
'_pcap2pandas()' starts at 2022-11-03 11:47:26 '_pcap2pandas()' ends at 2022-11-03 11:47:34 and takes 0.1279 mins.
  | datetime | dns_query | dns_resp | ip_dst | ip_dst_int | ip_src | ip_src_int | is_dns | length | mac_dst | mac_dst_int | mac_src | mac_src_int | port_dst | port_src | protocol | time | time_normed
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 2016-07-29 15:10:03 | None | None | 172.24.1.84 | 2.887254e+09 | 52.87.161.133 | 8.781582e+08 | False | 66 | 18:b4:30:54:a5:db | 27162184033755 | b8:27:eb:ed:34:f0 | 202481601426672 | 46110.0 | 443.0 | TCP | 1469823003.220967 | 0.000000 |
1 | 2016-07-29 15:10:03 | None | None | 172.24.1.84 | 2.887254e+09 | 52.87.161.133 | 8.781582e+08 | False | 66 | 18:b4:30:54:a5:db | 27162184033755 | b8:27:eb:ed:34:f0 | 202481601426672 | 46110.0 | 443.0 | TCP | 1469823003.260909 | 0.039942 |
2 | 2016-07-29 15:10:03 | None | None | 52.87.161.133 | 8.781582e+08 | 172.24.1.84 | 2.887254e+09 | False | 1506 | b8:27:eb:ed:34:f0 | 202481601426672 | 18:b4:30:54:a5:db | 27162184033755 | 443.0 | 46110.0 | TCP | 1469823003.271401 | 0.050434 |
3 | 2016-07-29 15:10:03 | None | None | 52.87.161.133 | 8.781582e+08 | 172.24.1.84 | 2.887254e+09 | False | 1506 | b8:27:eb:ed:34:f0 | 202481601426672 | 18:b4:30:54:a5:db | 27162184033755 | 443.0 | 46110.0 | TCP | 1469823003.272394 | 0.051427 |
Let's clean up the data a bit first.
Times are in units of seconds since the "epoch" (January 1, 1970 at 00:00:00 GMT), a common format for timestamps.
Let's convert them to normal-looking times.
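With pandas this conversion is a single call. A minimal sketch using epoch timestamps like those in the `time` column above (note that the result is UTC unless you localize it):

```python
import pandas as pd

# epoch-seconds timestamps, as in the capture's `time` column
ts = pd.Series([1469823003.220967, 1469823003.260909])

# interpret the floats as seconds since the epoch
dt = pd.to_datetime(ts, unit="s")
```

In the full DataFrame you would apply the same call to `pcap['time']` and store the result in a new column.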
Now let's convert the list of packets into send rates by calculating the total amount of data sent (the sum of packet lengths) during equal-length time windows. The send_rates() function is defined below.
def send_rates(data, window_len_sec):
    '''Calculates send rates from packet DataFrames

    Arguments:
      data: pandas DataFrame with 'time' and 'length' columns
            like that returned from pcap_to_pandas()
      window_len_sec: interval (in seconds) for calculating rates

    Returns:
      rates: array of send rates (bytes per second)
      times: array of window start times corresponding to each rate
    '''
    data = data.sort_values(by=["time"])
    windows = []
    times = []
    curr_time = data.iloc[0]["time"]
    end_time = curr_time + window_len_sec
    i = 0
    while curr_time < data.iloc[-1]["time"]:
        # start a new window
        windows.append(0)
        times.append(curr_time)
        # accumulate the bytes of every packet that falls in this window
        while i < len(data) and data.iloc[i]["time"] < end_time:
            windows[-1] += data.iloc[i]["length"]
            i += 1
        curr_time = end_time
        end_time = curr_time + window_len_sec
    # convert byte totals to bytes-per-second rates
    rates = np.array(windows) / float(window_len_sec)
    times = np.array(times)
    return rates, times
Often, the choice of data representation is at least as important as the choice of model. Try choosing different values for the window length (window_len_sec) and see how it affects the plots.
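An equivalent, vectorized way to compute windowed send rates uses np.histogram with packet lengths as weights. This is a sketch on synthetic packet data, not the notebook's send_rates() function:

```python
import numpy as np

# synthetic packets: arrival times (seconds) and lengths (bytes)
times = np.array([0.1, 0.4, 1.2, 1.9, 2.5])
lengths = np.array([100, 200, 300, 400, 500])

window = 1.0  # window length in seconds
edges = np.arange(times.min(), times.max() + window, window)

# sum packet lengths falling in each window, then divide by window length
bytes_per_window, _ = np.histogram(times, bins=edges, weights=lengths)
rates = bytes_per_window / window  # bytes per second in each window
```

Because the histogram call replaces the inner Python loop, this scales much better to large captures.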
Next, let's represent each rate as an n-dimensional point. Ultimately, we will associate each n-dimensional point with a specific activity. The d variable sets the number of dimensions for each point. We have set this to two for now so that visualization is easy, but below we will expand on this to work with points in higher dimensions.
The rates_to_points function below generates $m$ $d$-dimensional points from the rate timeseries above.
# Sample function to group send rates into fixed-dimension points for classifier training
def rates_to_points(rates, times, sampling_period):
    # group every `sampling_period` consecutive rates into one point;
    # numpy slicing past the end of the array is safe here
    points = [rates[i:i + sampling_period] for i in range(0, rates.size, sampling_period)]
    times = [times[i] for i in range(0, times.size, sampling_period)]
    # drop the final (possibly shorter) point so all points have equal dimension
    return np.array(points[:-1]), np.array(times[:-1])
# number of send rate samples to include in each point.
# How many total seconds will each point represent?
# This is ultimately the *dimension* of the space, d, in our classifier.
d = 2

# get d-dimensional points and the time for each point.
# we need the times because we're going to label each point
# based on the activity at a given time.
points, point_times = rates_to_points(rates, rate_times, d)
Let's take a quick look at the result. We started off with a rate sample at each point in time. Then we binned those into d-dimensional points, each associated with a time. We thus have $T/d$ data points if our original rate timeseries had $T$ samples.
Now we have points and associated times. If you choose sampling_period = 2, then each sample corresponds to a two-dimensional point, which allows us to plot the points. Let's try that first and plot the result.
First, read the labels from the text file. These labels are analogous to our classes/colors from the first example above, except that instead of {red, green, blue}, we have two labels:

- livestream, which indicates that the camera is simply monitoring; and
- motion, which indicates that the camera has detected motion and has begun to record/upload a video.

labels = pd.read_csv('data/nestcam_live_labels.txt', header=None, names=["time", "activity"])
labels.head(10)
  | time | activity
---|---|---
0 | 16:10:00 | livestream |
1 | 16:12:20 | motion |
2 | 16:14:00 | livestream |
3 | 16:16:20 | motion |
4 | 16:18:00 | livestream |
5 | 16:20:45 | motion |
6 | 16:22:00 | livestream |
7 | 16:24:15 | motion |
8 | 16:26:00 | livestream |
9 | 16:28:20 | motion |
Now assign a label to each point based on its timestamp: first read in the labels (done above), then map each point to the activity in effect at its time.
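One way to sketch the mapping step: assign each point the most recent label at or before its timestamp. This helper is an illustration, not the notebook's code, and it assumes the label times have already been converted to the same epoch-seconds scale as point_times:

```python
import numpy as np

def label_points(point_times, label_times, label_names):
    """Assign each point the most recent label at or before its timestamp."""
    # index of the last label time <= each point time
    idx = np.searchsorted(label_times, point_times, side="right") - 1
    idx = np.clip(idx, 0, len(label_names) - 1)
    return np.asarray(label_names)[idx]

# hypothetical demo: activity changes at t = 0, 10, 20
labels_demo = label_points(
    point_times=np.array([5.0, 15.0, 25.0]),
    label_times=np.array([0.0, 10.0, 20.0]),
    label_names=["livestream", "motion", "livestream"],
)
```

searchsorted keeps the lookup O(log n) per point, which matters when the capture has many windows.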
Now that we've labeled the data, we can associate each point with a label (class) and re-plot the scatterplot above with the colors corresponding to classes.
Let's divide the points into a training set and a test set.
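The split can be done with sklearn's train_test_split; a minimal sketch with stand-in arrays (the test_size and random_state values here are illustrative choices):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# stand-in points and labels
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# hold out 30% of the points for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```

Fixing random_state makes the split reproducible across re-runs of the notebook.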
Now we will train a random forest classifier on the training set.
Note: Your choice of data representation actually makes a huge difference as far as accuracy is concerned! Try this exercise with different window lengths and values of d and see how it affects accuracy!
Train a RandomForestClassifier on your labeled data. Then perform prediction on the test set, and use accuracy_score to report accuracy.
Now that we have a baseline accuracy, we can tweak the data preprocessing and classifier parameters to improve the accuracy. Look back through the code we've run so far. Which values have we set arbitrarily that could affect the results? Try changing these parameters and re-running the code to see how the classification accuracy is affected. Remember to re-run all of the cells below each change (or just restart the kernel and re-run all cells).