In this hands-on assignment, we will explore learned models, in particular linear regression. Linear models can work well when there is a linear relationship between the target variable being predicted and the input features. In other words, when the target prediction can be modeled as a linear combination of the input features, linear models may be appropriate.
We'll explore the relationship between bytes and packets in this hands-on, which may at times be linear. We will also explore how basis expansion---in particular polynomial basis expansion---can allow linear regression to fit more complex relationships between features and targets.
# Machine Learning Libraries
import numpy as np
import pandas as pd
from sklearn import linear_model

# Suppress scapy runtime warnings
import logging
logging.getLogger("scapy.runtime").setLevel(logging.ERROR)

# Local path to the netml source (adjust for your own machine)
import sys
sys.path.insert(1, "/Users/feamster/research/netml/src/")
from netml.pparser.parser import PCAP
from netml.utils.tool import dump_data, load_data

# Plotting Library
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams["figure.figsize"] = (8, 6)
# Load the HTTP trace and split it into flows (keeping flows with at least two packets)
hpcap = PCAP('data/http.pcap', flow_ptks_thres=2, verbose=10)
hpcap.pcap2flows()
len(hpcap.flows)
'_pcap2flows()' starts at 2022-10-31 14:13:13
pcap_file: data/http.pcap
ith_packets: 0
ith_packets: 10000
ith_packets: 20000
len(flows): 593
total number of flows: 593. Num of flows < 2 pkts: 300, and >=2 pkts: 293 without timeout splitting.
kept flows: 293. Each of them has at least 2 pkts after timeout splitting.
flow_durations.shape: (293, 1)
         col_0
count  293.000
mean    11.629
std     15.820
min      0.000
25%      0.076
50%      0.455
75%     20.097
max     46.235
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293 entries, 0 to 292
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   col_0   293 non-null    float64
dtypes: float64(1)
memory usage: 2.4 KB
None
0th_flow: len(pkts): 4
After splitting flows, the number of subflows: 291 and each of them has at least 2 packets.
'_pcap2flows()' ends at 2022-10-31 14:13:18 and takes 0.0843 mins.
291
Use the netml "STATS" option to generate a set of summary-statistic features for each flow.
The exercise below requires a slight modification to the netml library (for versions <= 0.2.1) to add two additional features: the total number of packets and the total number of bytes in each flow. You can modify the library (line 455 of parser.py) to add these features, or optionally explore the relationships between some of the other features.
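For example, a minimal sketch of the feature extraction step (the exact set and order of columns depends on your netml version and on the modification above):

# Generate per-flow statistical features; hpcap.features becomes a
# (num_flows x num_features) numpy array.
hpcap.flow2features('STATS', fft=False, header=False)
features = hpcap.features
features.shape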
Explore the relationship between some of these features. A natural pair to examine is the number of bytes in a flow and the number of packets in a flow.
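For instance, if your modified library appends the packet and byte counts as the last two feature columns (an assumption; adjust the indices for your version), you could pull them out as follows:

# Column positions are an assumption based on the modification above.
packets = features[:, -2]      # total packets per flow
total_bytes = features[:, -1]  # total bytes per flow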
Train a LinearRegression model from scikit-learn to model this relationship, and output the resulting predictions into a vector y_hat.
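A minimal sketch, using the packets and total_bytes arrays extracted above:

# Fit bytes as a linear function of packets.
reg = linear_model.LinearRegression()
X = packets.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix
reg.fit(X, total_bytes)
y_hat = reg.predict(X)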
Plot the relationship that your model has learned: plot the learned model (which should be a line) along with the original points. Label your axes!
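One way to draw this, continuing the sketch above:

plt.scatter(packets, total_bytes, label='flows')
plt.plot(packets, y_hat, color='red', label='linear fit')
plt.xlabel('Packets per flow')
plt.ylabel('Bytes per flow')
plt.legend()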
What do you notice about the relationship, and how it relates to the original points?
You can quantify how well your fit performs by computing the error, in terms of the residual sum of squares.
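For example, with the predictions from above:

# Residual sum of squares (RSS) of the linear fit.
rss = np.sum((total_bytes - y_hat) ** 2)
rss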
Recall that one of the benefits of a polynomial basis expansion is that it allows a linear model to be fit to a polynomial expansion of the features, thereby capturing nonlinear relationships.
We will do that below. Let's first create the regular features and then the polynomial expansion. You will need the PolynomialFeatures class from sklearn.preprocessing, as well as its fit_transform method, to generate your feature expansion.
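A sketch of the expansion, using a degree-2 polynomial (the degree is an arbitrary choice):

from sklearn.preprocessing import PolynomialFeatures

# Expand each packet count x into the columns [1, x, x^2].
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)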
Train the linear regression model on the expanded set of features, and generate a new set of predictions, y_hat_poly.
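Continuing the sketch:

# Fit a linear model on the polynomial features.
reg_poly = linear_model.LinearRegression()
reg_poly.fit(X_poly, total_bytes)
y_hat_poly = reg_poly.predict(X_poly)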
Visualize your results again. What do you notice about the predicted values?
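For example (sorting by packet count so the fitted curve draws left to right):

order = np.argsort(packets)
plt.scatter(packets, total_bytes, label='flows')
plt.plot(packets[order], y_hat_poly[order], color='red', label='polynomial fit')
plt.xlabel('Packets per flow')
plt.ylabel('Bytes per flow')
plt.legend()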
Evaluate your error once again. What happened to the overall mean squared error?
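For example:

# Mean squared error of the polynomial fit.
mse = np.mean((total_bytes - y_hat_poly) ** 2)
mse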
In the earlier parts of this hands-on, we explored a simple relationship between features and outcome variables. You can extend your analysis by exploring other relationships, such as those between one or more of the other features already output by netml.
In the above example, we used a polynomial basis expansion to reduce the error on the training set. But we have no test set, so it is difficult to tell whether the model above is overfit to the training data.
In order for us to tell whether this model is a good fit, we need to test it on data that the model has not yet seen. This requires splitting the data into a training set and a test set.
A typical split might be 80% training data and 20% test data. You can also perform this process repeatedly with different splits and average the results; this process is called cross-validation.
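One way to perform a single 80/20 split, sketched with scikit-learn's train_test_split (the random_state is an arbitrary choice for reproducibility):

from sklearn.model_selection import train_test_split

# Hold out 20% of the flows as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, total_bytes, test_size=0.2, random_state=42)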
The first two steps are the same as above, but you might try doing this for a larger sample so that your network traffic has more flows (i.e., data points).