In this exercise, we will use principal component analysis (PCA) to perform dimensionality reduction on our dataset. Recall that the netml library has mechanisms to extract a wide range of statistics from traffic flows. We will start with those (an N-dimensional feature set) and then use PCA to reduce the dimensionality of the data.
import logging

# Silence scapy's runtime warnings, which are noisy when parsing pcap files.
logging.getLogger("scapy.runtime").setLevel(logging.ERROR)

from netml.pparser.parser import PCAP
import pandas as pd
Load the Log4j and HTTP packet capture files and extract features from the flows. You are free to compute features manually, although it will likely be more convenient at this point to use the netml library.
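Here is a minimal sketch of this step based on the netml README; the file paths are placeholders, and the flow_ptks_thres value and the 'STATS' feature type are assumptions you may need to adjust for your data and netml version.
pcap_http = PCAP('data/http.pcap', flow_ptks_thres=2)    # benign HTTP capture (path is a placeholder)
pcap_log4j = PCAP('data/log4j.pcap', flow_ptks_thres=2)  # Log4j capture (path is a placeholder)

# Group packets into flows.
pcap_http.pcap2flows()
pcap_log4j.pcap2flows()

# Compute per-flow statistical features (assumed 'STATS' feature type).
pcap_http.flow2features('STATS', fft=False, header=False)
pcap_log4j.flow2features('STATS', fft=False, header=False)

X_http, X_log4j = pcap_http.features, pcap_log4j.features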
Find the function in netml that converts the pcap file into flows. Examine the resulting data structure. What does it contain?
How many flows are in each of these pcaps? (Use the netml library output to determine the size of each data structure.)
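For example, assuming (as in the netml README) that pcap2flows() stores the extracted flows on the parser's flows attribute:
# Number of flows extracted from each capture.
print(len(pcap_http.flows), len(pcap_log4j.flows))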
You should now have data that can be input into a model with scikit-learn. Import the scikit-learn package (sklearn) and a classification model of your choice to test that you can train your model with the above data.

Hint: The function you want to call is fit.
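As a minimal sketch, assuming X_http and X_log4j are the per-flow feature matrices from above, and adopting the (assumed) convention that HTTP flows are labeled 0 and Log4j flows are labeled 1:
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stack the two feature matrices into one dataset with binary labels.
X = np.vstack([X_http, X_log4j])
y = np.concatenate([np.zeros(len(X_http)), np.ones(len(X_log4j))])

# Any scikit-learn classifier works here; logistic regression is just one choice.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)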
Before applying dimensionality reduction, we can first build on our understanding of classifiers from past notebooks to see which features may be most important.
Let's first train a model with a Decision Tree, then with a Random Forest classifier, both of which give us nice ways of exploring feature importance.
Train a decision tree below.
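For example, a sketch using the X and y arrays built above (the max_depth value is an arbitrary choice that keeps the tree small enough to visualize):
from sklearn.tree import DecisionTreeClassifier

# A shallow tree is easier to inspect in the visualization below.
dt = DecisionTreeClassifier(max_depth=3, random_state=42)
dt.fit(X, y)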
Below is an example showing how the resulting decision tree can be visualized. We kept the number of features small above to make visualization quick. You may need to install some dependencies (pydotplus, plus the graphviz executable) to get this cell to run.
To install graphviz on a Mac, for example, you can use a tool like Homebrew (e.g., brew install graphviz).
from sklearn import tree
import pydotplus
from IPython.display import Image
# Human-readable names for the flow features, in column order.
labels = ['duration', 'packets per second', 'bytes per second',
          'mean', 'standard deviation',
          '25', 'median', '75',
          'minimum', 'maximum',
          'packets', 'bytes']
dot_data = tree.export_graphviz(dt, out_file=None, feature_names=labels)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
We can look at the accuracy of this simple model using K-fold cross-validation and an accuracy score.
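For instance, a sketch using scikit-learn's cross_val_score (the number of folds is an arbitrary choice):
from sklearn.model_selection import KFold, cross_val_score

# 5-fold cross-validated accuracy of the decision tree.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(dt, X, y, cv=kf, scoring='accuracy')
print(scores.mean(), scores.std())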
Random forest improves on a decision tree classifier using an ensemble learning method called bagging. In a random forest, many decision trees are trained on bootstrap samples of the training data (and random subsets of the features), and their individual predictions are combined by majority vote.
The 'balanced' class-weight mode (class_weight='balanced'), which we use below, uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data, as n_samples / (n_classes * np.bincount(y)).
Train a random forest classifier using the same approach as you used for the decision tree.
Evaluate your random forest classifier using the same evaluation method as for the decision tree. Does accuracy improve?
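A sketch of both steps, assuming the same X, y, and cross-validation setup as above (the number of trees is an arbitrary choice):
from sklearn.ensemble import RandomForestClassifier

# 'balanced' adjusts class weights inversely to class frequencies.
rf = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
rf.fit(X, y)

rf_scores = cross_val_score(rf, X, y, cv=kf, scoring='accuracy')
print(rf_scores.mean(), rf_scores.std())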
One of the powerful aspects of a random forest classifier is that it is easy to plot and understand feature importance. The feature_importances_ attribute holds these relative feature importance values, which we can plot and visualize.
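For example, a sketch that plots the importances against the feature labels defined earlier:
import matplotlib.pyplot as plt

# Sort features by their relative importance in the fitted random forest.
order = np.argsort(rf.feature_importances_)[::-1]
plt.bar(range(len(order)), rf.feature_importances_[order])
plt.xticks(range(len(order)), [labels[i] for i in order], rotation=90)
plt.ylabel('relative feature importance')
plt.tight_layout()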
One way to reduce the dimensionality of a dataset is with a technique called principal components analysis (PCA). PCA is a linear transformation that maps the points into a new space whose axes (the principal components) are orthogonal and ordered by how much of the dataset's variance they capture.
So, the first principal component (PC1) is a vector oriented in the direction that captures the highest variance in the dataset. You can think of this as the single dimension that contains the most information about the dataset. PC2 captures the next highest variance, in a direction orthogonal to PC1, and so forth.
There are many applications of PCA, but one is the visualization of a high-dimensional dataset, since PCA is just a transformation of the data that does not inherently lose information. When we project onto only the top two dimensions, some information is lost (whatever lies in the lower principal components), but we can visualize the data in terms of the two dimensions that capture the most "information" (i.e., variance) in the dataset.
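As a minimal sketch of that two-dimensional view, assuming the X and y arrays from above:
from sklearn.decomposition import PCA

# Project the 12-dimensional flow features onto the first two principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='coolwarm', alpha=0.5)
plt.xlabel('PC1')
plt.ylabel('PC2');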
The first few principal components will capture most of the variance. We can look at the explained variance ratio to understand how much variance each principal component captures, which gives us a sense of where we might be able to cut off this data, particularly if we look at the cumulative explained variance.
pca.explained_variance_ratio_
array([0.91489261, 0.08507621])
# Re-fit PCA, keeping all 12 components, to examine the full variance profile.
pca = PCA(n_components=12)
pca.fit(X)

plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.ylim([0.9, 1])
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
If we look at the top principal component, we can also see that it is a linear combination of our original features, with a disproportionate "amount" of weight on the bytes-per-second feature (and, to a lesser extent, the bytes feature).
pca.components_[0]
array([-1.11131659e-06, 7.11715178e-03, 9.83734774e-01, 8.89456135e-05, -2.89616810e-07, 1.03849723e-04, 1.05013360e-04, 1.02492726e-04, 3.33622642e-06, 5.35074914e-05, 1.23282170e-04, 1.79485885e-01])
# Sort in descending order
indices = np.argsort(pca.components_[0])[::-1]
# Sort the labels in a corresponding fashion
names = [labels[i] for i in indices]
names
['bytes per second', 'bytes', 'packets per second', 'packets', 'median', '25', '75', 'mean', 'maximum', 'minimum', 'standard deviation', 'duration']
Evaluate the accuracy of a random forest classifier trained on the lower-dimensional data, in the same fashion as above. What happens to the accuracy? Why?
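As a sketch, assuming the cross-validation setup from above and keeping only the first two principal components:
# Reduce the features with PCA, then evaluate a random forest on the reduced data.
X_reduced = PCA(n_components=2).fit_transform(X)
rf_pca = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
pca_scores = cross_val_score(rf_pca, X_reduced, y, cv=kf, scoring='accuracy')
print(pca_scores.mean(), pca_scores.std())
(In a stricter evaluation you would fit the PCA inside each fold, e.g., with a scikit-learn Pipeline, so that the test folds do not influence the projection.)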
What might be a good reason to reduce the dimensionality of the dataset?