In this hands-on assignment, we will explore learned models, in particular linear regression. Linear models can work well when there is a linear relationship between the target variable being predicted and the input features. In other words, when the target prediction can be modeled as a linear combination of the input features, linear models may be appropriate.
We'll explore the relationship between bytes and packets in this hands-on, which may at times be linear. We will also explore how basis expansion---in particular polynomial basis expansion---can allow linear regression to fit more complex relationships between features and targets.
# Machine Learning Libraries
import numpy as np
import pandas as pd
from sklearn import linear_model

# Suppress scapy runtime warnings
import logging
logging.getLogger("scapy.runtime").setLevel(logging.ERROR)

# Local path to the netml source (adjust for your own machine)
import sys
sys.path.insert(1, "/Users/feamster/research/netml/src/")
from netml.pparser.parser import PCAP
from netml.utils.tool import dump_data, load_data

# Plotting Library
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams["figure.figsize"] = (8, 6)
# Load the HTTP trace and split it into flows (keeping flows with at least two packets)
hpcap = PCAP('data/http.pcap', flow_ptks_thres=2, verbose=10)
hpcap.pcap2flows()
len(hpcap.flows)
'_pcap2flows()' starts at 2022-10-31 14:13:13
pcap_file: data/http.pcap
ith_packets: 0
ith_packets: 10000
ith_packets: 20000
len(flows): 593
total number of flows: 593. Num of flows < 2 pkts: 300, and >=2 pkts: 293 without timeout splitting.
kept flows: 293. Each of them has at least 2 pkts after timeout splitting.
flow_durations.shape: (293, 1)
         col_0
count  293.000
mean    11.629
std     15.820
min      0.000
25%      0.076
50%      0.455
75%     20.097
max     46.235
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293 entries, 0 to 292
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   col_0   293 non-null    float64
dtypes: float64(1)
memory usage: 2.4 KB
None
0th_flow: len(pkts): 4
After splitting flows, the number of subflows: 291 and each of them has at least 2 packets.
'_pcap2flows()' ends at 2022-10-31 14:13:18 and takes 0.0843 mins.
291
Use the netml "STATS" option to generate a set of summary-statistic features for each flow.
The exercise below requires a slight modification to the netml library (for versions <= 0.2.1) to add two additional features: the total number of packets and the total number of bytes in each flow. You can modify the library (line 455 of parser.py) to add these features, or optionally explore the relationships between some of the other features.
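For example, a minimal sketch of the feature extraction step (the exact set and order of columns depends on your netml version and on the modification above):

# Generate per-flow statistical features; hpcap.features becomes a
# (num_flows x num_features) numpy array.
hpcap.flow2features('STATS', fft=False, header=False)
features = hpcap.features
features.shape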
Explore the relationship between some of these features. A natural pair to examine is the number of bytes in a flow and the number of packets in a flow.
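For instance, if your modified library appends the packet and byte counts as the last two feature columns (an assumption; adjust the indices for your version), you could pull them out as follows:

# Column positions are an assumption based on the modification above.
packets = features[:, -2]      # total packets per flow
total_bytes = features[:, -1]  # total bytes per flow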
Train a LinearRegression model from scikit-learn to model this relationship, and output the resulting predictions into a vector y_hat.
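A minimal sketch, using the packets and total_bytes arrays extracted above:

# Fit bytes as a linear function of packets.
reg = linear_model.LinearRegression()
X = packets.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix
reg.fit(X, total_bytes)
y_hat = reg.predict(X)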
Plot the relationship that your model has learned: plot the learned model (which should be a line) along with the original points. Label your axes!
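One way to draw this, continuing the sketch above:

plt.scatter(packets, total_bytes, label='flows')
plt.plot(packets, y_hat, color='red', label='linear fit')
plt.xlabel('Packets per flow')
plt.ylabel('Bytes per flow')
plt.legend()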
What do you notice about the relationship, and how it relates to the original points?
You can quantify how well your fit performs by computing the error, in terms of the residual sum of squares.
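For example, with the predictions from above:

# Residual sum of squares (RSS) of the linear fit.
rss = np.sum((total_bytes - y_hat) ** 2)
rss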
Recall that one of the benefits of a polynomial basis expansion is that it allows a linear model to be fit to a polynomial expansion of the features, thereby capturing nonlinear relationships.
We will do that below. Let's first create the regular features and then the polynomial expansion. You will need the PolynomialFeatures class from sklearn.preprocessing, as well as its fit_transform method, to generate your feature expansion.
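A sketch of the expansion, using a degree-2 polynomial (the degree is an arbitrary choice):

from sklearn.preprocessing import PolynomialFeatures

# Expand each packet count x into the columns [1, x, x^2].
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)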
Train the linear regression model on the expanded set of features, and generate a new set of predictions, y_hat_poly.
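Continuing the sketch:

# Fit a linear model on the polynomial features.
reg_poly = linear_model.LinearRegression()
reg_poly.fit(X_poly, total_bytes)
y_hat_poly = reg_poly.predict(X_poly)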
Visualize your results again. What do you notice about the predicted values?
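For example (sorting by packet count so the fitted curve draws left to right):

order = np.argsort(packets)
plt.scatter(packets, total_bytes, label='flows')
plt.plot(packets[order], y_hat_poly[order], color='red', label='polynomial fit')
plt.xlabel('Packets per flow')
plt.ylabel('Bytes per flow')
plt.legend()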
Evaluate your error once again. What happened to the overall mean squared error?
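For example:

# Mean squared error of the polynomial fit.
mse = np.mean((total_bytes - y_hat_poly) ** 2)
mse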
In the earlier parts of this hands-on, we explored a simple relationship between features and outcome variables. You can extend your analysis by exploring other relationships, such as those between one or more of the other features already output by netml.
In the above example, we used a polynomial basis expansion to reduce the error on the training set. But we have no test set, so it is difficult to tell whether the model above is overfit to the training data.
In order for us to tell whether this model is a good fit, we need to test it on data that the model has not yet seen. This requires splitting the data into a training set and a test set.
A typical split might be 80% training data and 20% test data. You can also perform this process repeatedly with different splits and average the results; this process is called cross-validation.
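One way to perform a single 80/20 split, sketched with scikit-learn's train_test_split (the random_state is an arbitrary choice for reproducibility):

from sklearn.model_selection import train_test_split

# Hold out 20% of the flows as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, total_bytes, test_size=0.2, random_state=42)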
The first two steps are the same as above, but you might try doing this for a larger sample so that your network traffic has more flows (i.e., data points).