Chapter 6: Unsupervised Learning

In this chapter, we introduce unsupervised learning, the process by which a machine learning model can learn from unlabeled examples. The goal of unsupervised learning is to identify patterns in data that are useful for understanding the data or processing the data further.

Most data in the world is unlabeled, including most network data. For example, packet captures do not inherently include labels of perceived quality of service or presence of malware. The prevalence of unlabeled data makes unsupervised learning a powerful tool for data analytics.

Throughout this chapter, we will describe a variety of unsupervised learning models, using networking examples as a guide. This book does not assume you have seen these models before, so readers aiming to build basic intuition about different models will find this chapter helpful. Readers who are already familiar with these models may still find the examples useful, as they show cases where particular models or types of models are suited to different problems, as well as cases in the networking domain where these models have been successfully applied.

We organize our discussion of unsupervised learning into the following categories: (1) dimensionality reduction (i.e., models that reduce the number of features in a data set to those most useful for a task); (2) clustering (i.e., models that group data based on similarity); and (3) semi-supervised learning (i.e., models that use unsupervised techniques to prepare data for supervised learning).

Dimensionality Reduction

Networking data sets are often high dimensional, meaning that each example in the data set has many features. This is typically true regardless of the measurement approach used to collect the data set. Individual packets have many headers. IPFIX records have many fields for each recorded flow. Internet scan results contain a variety of information about each endpoint or IP address scanned. While high-dimensional data provides lots of information that is useful for machine learning, it also poses two distinct challenges.

First, high-dimensional data can’t be easily plotted on a 2D or 3D graph. This makes it difficult to explore the data visually and gain an intuition about important patterns that may be essential to understanding the meaning of the data. Producing useful visualizations of high-dimensional data requires reducing the number of features such that 2D or 3D visualizations are possible.

Second, the training time of most machine learning models increases with the number of features in a data set. For very high-dimensional data, it may be desirable to reduce the number of features as a preprocessing step to make training computationally feasible. While this may reduce model performance, it is preferable to a model that can’t feasibly be trained at all.

Dimensionality reduction algorithms can be used to address either of these challenges. These algorithms work by removing or combining features to produce a new version of the dataset in a lower target dimensionality while attempting to preserve important patterns within the data. There are many dimensionality reduction algorithms, far more than we can discuss in this book, so we focus on three commonly used algorithms that can be readily applied to network data.

Principal Component Analysis

The goal of principal component analysis (PCA) is to transform the data to have a new, smaller set of features derived from the original features. The PCA algorithm attempts to minimize the amount that individual data points change as a result of the transformation while maximizing the variance of the data points in the new, lower dimensionality. This tradeoff seeks to reduce information loss caused by dimensionality reduction.

The features in the target dimensionality are called principal components. Each of the principal components is a combination of the original features. Regular PCA is limited to linear combinations, but alternatives, such as kernel PCA, can also account for non-linear relationships in the data.

PCA is a non-parametric, deterministic dimensionality reduction algorithm. This means that PCA does not have any parameters that require training and that independent applications of PCA to the same dataset with the same target dimensionality will produce the same results. Applying PCA requires choosing the target dimensionality and whether you would like to use the linear or kernel version of the algorithm. Kernel PCA also requires the choice of a kernel function, for which a polynomial or radial basis function (RBF) kernel is often a good place to start.

Using PCA to visualize packet capture data in 2D.
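
The sketch below (which assumes the same http.pcap trace used in the t-SNE example later in this chapter) loads a packet capture with netml, treats each packet’s source and destination addresses and ports as features, and projects those features onto two principal components with scikit-learn:

from netml.pparser.parser import PCAP
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load the packet capture and use addresses and ports as the features.
pcap = PCAP('../examples/notebooks/data/http.pcap')
pcap.pcap2pandas()
pdf = pcap.df[['ip_dst_int', 'port_dst', 'ip_src_int', 'port_src']]

# Project the four features onto the first two principal components.
# (To capture non-linear relationships, sklearn.decomposition.KernelPCA
# with an RBF or polynomial kernel is a drop-in alternative.)
pca = PCA(n_components=2)
result = pca.fit_transform(pdf)

plt.scatter(result[:, 0], result[:, 1], s=8)
plt.title("PCA Result: Packets in 2D")
plt.show()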

t-Distributed Stochastic Neighbor Embedding

t-distributed stochastic neighbor embedding (t-SNE) is a dimensionality reduction algorithm that typically produces much cleaner visualizations in two or three dimensions than PCA. t-SNE is particularly useful when you want to visualize your data to gain intuition about underlying patterns that might prove informative for supervised models or clustering.

t-SNE uses probability distributions to spread out dissimilar points while keeping similar points near each other in the target dimensionality. The algorithm involves three main steps:

  1. Fitting a normal (Gaussian) distribution to the distances between pairs of points in the original data

  2. Mapping the normal distribution in the original high-dimensional space to a t-distribution in the target dimensional space in a way that minimizes the divergence between the two distributions

  3. Selecting new locations for the points in the target dimensional space by drawing from this t-distribution

Because t-distributions have more probability density in their tails than a normal distribution, this spreads out dissimilar points in the target dimensionality while keeping similar points in proximity. Visualizations produced using t-SNE show distinct clustering if such structure exists in the original high-dimensional data.

t-SNE is a non-parametric, stochastic dimensionality reduction algorithm. This means that t-SNE does not have any parameters that require training. However, the use of a randomized draw to place data points in the target dimensionality means that independent applications of t-SNE to the same data set may produce different results.

Using t-SNE to visualize packet capture data in 2D.

from netml.pparser.parser import PCAP
from sklearn.manifold import TSNE
import matplotlib
import matplotlib.pyplot as plt
import pandas

# Load the packet capture and convert it to a pandas DataFrame.
pcap = PCAP('../examples/notebooks/data/http.pcap')
pcap.pcap2pandas()

# Use each packet's source/destination addresses and ports as features.
pdf = pcap.df[['ip_dst_int', 'port_dst', 'ip_src_int', 'port_src']]
is_dns = pcap.df['is_dns']

# Reduce the four features to two dimensions with t-SNE.
tsne = TSNE(n_components=2)
result = tsne.fit_transform(pdf)

# Color points by flow (one hash-derived color per unique feature tuple)
# and by whether each packet is DNS.
dns_colors = pandas.factorize(is_dns)[0]
flow_colors = pdf.apply(lambda row: hash(tuple(row)), axis=1)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
ax1.scatter(result[:,0], result[:,1], c=flow_colors, s=8, cmap="tab20")
ax2.scatter(result[:,0], result[:,1], c=dns_colors, s=8,
            cmap=matplotlib.colors.ListedColormap(['blue', 'red']))
ax1.set_title("TSNE Result: Packets Colored by Flow")
ax2.set_title("TSNE Result: DNS Packets in Red")
ax1.tick_params(axis='both', labelsize=0)
ax2.tick_params(axis='both', labelsize=0)
plt.savefig("unsupervised_tsne.png")
plt.show()
[Figure: t-SNE result, with packets colored by flow (left) and DNS packets in red (right)]
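
Because t-SNE is stochastic, re-running this example may produce a different layout; passing a fixed seed (e.g., TSNE(n_components=2, random_state=42)) makes the result reproducible.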

Autoencoders

Autoencoders are unsupervised neural network models that perform dimensionality reduction.

An autoencoder network has input and output layers that are the same size as the number of features in the data. The intermediate layers of the network have an “hourglass” shape, with decreasing numbers of nodes from the input layer to a central “encoding” layer and increasing numbers of nodes from the encoding layer to the output layer. This reduction in layer size forces information loss as each example passes through the autoencoder, since the encoding layer cannot retain all features of the input data.

TODO: IMAGE OF AUTOENCODER

Autoencoders are trained to reproduce input examples as closely as possible in their output. In other words, the same data is used as both the training examples and the training labels. This causes the network to find parameters such that the encoding layer retains the most important information about the input features and serves as the target low-dimensional representation. The size of the encoding layer is selected beforehand to match the target dimensionality of the dimensionality reduction process.

Unlike PCA and t-SNE, autoencoders can discover highly nonlinear relationships between features in the original dataset and use these relationships to find good lower-dimensionality representations. Also unlike PCA and t-SNE, autoencoders are parametric, meaning that they require training. In practice, autoencoders are more frequently used to reduce the number of features to make training computationally feasible than to produce a 2D or 3D version of the data for visualization.

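A minimal sketch of an autoencoder in Keras appears below. The feature matrix X here is random stand-in data; in practice it would be, for example, scaled flow statistics, and the layer sizes would be tuned to the data:

import numpy
from tensorflow import keras

# Stand-in data: 1,000 examples with 16 features, scaled to [0, 1].
n_features = 16
X = numpy.random.rand(1000, n_features).astype("float32")

encoding_dim = 2  # target dimensionality of the reduction

# "Hourglass" architecture: layers shrink to the encoding layer, then
# expand back to the original number of features.
inputs = keras.Input(shape=(n_features,))
hidden = keras.layers.Dense(8, activation="relu")(inputs)
encoded = keras.layers.Dense(encoding_dim, activation="relu")(hidden)
hidden = keras.layers.Dense(8, activation="relu")(encoded)
outputs = keras.layers.Dense(n_features, activation="sigmoid")(hidden)

autoencoder = keras.Model(inputs, outputs)
encoder = keras.Model(inputs, encoded)  # the input-to-encoding half

# Train the network to reproduce its input: the same data serves as both
# the examples and the labels.
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=20, batch_size=32, verbose=0)

# The low-dimensional representation of the data.
X_reduced = encoder.predict(X)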

Clustering

Clustering algorithms group data points by similarity, identifying latent structure in the dataset.

Clustering algorithms are extremely useful for data exploration, as understanding a data set often requires understanding similarities among groups of data points. For example, it might be valuable to know that a dataset of network flows can be naturally grouped into two clusters: “elephant” flows consuming lots of bandwidth and “mouse” flows consuming relatively little bandwidth. Similarly, it might be valuable to learn that a dataset of packets naturally clusters into three groups: network configuration packets, user application packets, and malicious packets. If the clusters found by a clustering algorithm do not match your understanding of the data, it may be a sign that something more interesting is going on in the data set, motivating further exploration.

Clustering algorithms are also useful for anomaly detection, a machine learning task that involves identifying anomalous data points that are dissimilar to most other points in the data set. This is practically relevant for security tasks, as anomalous packets or flows might be due to novel network attacks.

There are many clustering algorithms, far more than we can discuss in this book, so we focus on four commonly used algorithms that can be readily applied to network data.

K-Means

K-means is a fairly simple algorithm that clusters a dataset into K groups:

  1. Choose a target number of clusters K

  2. Choose K random points as starting centroids (points that define the center of a cluster)

  3. Assign all other points in the data set to the cluster with the closest centroid

  4. Update the centroids to the mean location of the points in each cluster

  5. Repeat steps 3 and 4 until the centroid locations stop changing.

This algorithm is fast and always converges, although not necessarily to the best possible clustering. However, it has drawbacks that can limit its applicability. Most importantly, you have to choose the number of clusters. This can be straightforward if you have existing knowledge about the structure of the dataset. For example, if you have a dataset of IP packets that you want to cluster into TCP and UDP traffic, you could choose K=2 and then check whether the discovered clusters actually match these protocols.

If you don’t know the number of clusters, you can run K-means with increasing cluster numbers to see which produces the cleanest clustering, but you might be better off choosing a different algorithm that does not require an a priori choice of the number of clusters. K-means also performs poorly for non-spherical clusters or clusters of varying density (where some clusters have points that are much more similar to each other than the points in other clusters). If your data falls into either of these categories, you might also be better off choosing a different algorithm.

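As a minimal sketch (with random stand-in data in place of real flow features), scikit-learn’s KMeans implements the loop described above:

import numpy
from sklearn.cluster import KMeans

# Stand-in data: replace with real features such as per-flow statistics.
X = numpy.random.rand(500, 4)

# Cluster into K=2 groups; n_init restarts the algorithm from multiple
# random centroids and keeps the best result.
kmeans = KMeans(n_clusters=2, n_init=10)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # final centroid locations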

Gaussian Mixture Models

This alternative to K-means defines clusters not just by their center point (centroid) but also by the variance of the distribution of points in each cluster. This assumes that the locations of points in each cluster follow a normal (Gaussian) distribution. While this may not be strictly true for some data sets, it is often a good approximation for large data sets due to the central limit theorem.

The process of applying Gaussian mixture models (GMMs) is fairly similar to K-means. You must choose a number of clusters (or repeat the model iteratively with different numbers of clusters), and the model will find a normal distribution for each cluster with a mean and variance that best fits the data in the cluster.

Gaussian mixture models can also be used to generate new data by drawing new data points from the normal distributions corresponding to each cluster. This allows you to create new data with similar characteristics to your training data, which can be useful for augmenting a data set to provide enough data for training a supervised algorithm.

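A minimal sketch with scikit-learn (again with stand-in data) follows; GaussianMixture fits one normal distribution per cluster, and sample() draws new synthetic points from the fitted distributions:

import numpy
from sklearn.mixture import GaussianMixture

X = numpy.random.rand(500, 4)  # stand-in for real features

# Fit a two-component mixture: one normal distribution per cluster.
gmm = GaussianMixture(n_components=2)
gmm.fit(X)

labels = gmm.predict(X)         # cluster assignments
synthetic, _ = gmm.sample(100)  # 100 new points drawn from the mixture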

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

DBSCAN uses data point density to identify clusters, similarly to how humans visually identify clusters of points on a plot. High-density groups of points (groups with relatively many points a relatively small distance from each other) become clusters. These clusters are defined by core examples and a neighborhood distance.

DBSCAN has a lot of advantages. It does not force you to choose the number of clusters beforehand; it will find as many groups of nearby dense points as it can. It also works for clusters that aren’t spherical. DBSCAN is frequently used for anomaly detection, because it can automatically identify points that don’t fit into any existing clusters. This is very useful in networking problems, such as malicious traffic detection, where identifying unusual examples is valuable.

DBSCAN has some disadvantages due to its dependency on data density. If you have some clusters that are tightly packed and other clusters that are more spread out, DBSCAN may be unable to achieve the desired clustering. DBSCAN can also struggle with high-dimensional data, because the “curse of dimensionality” means that all data points appear far apart in high-dimensional space.

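A minimal sketch with scikit-learn (with stand-in data; the eps and min_samples parameters set the neighborhood distance and the number of neighbors required for a core example):

import numpy
from sklearn.cluster import DBSCAN

X = numpy.random.rand(500, 4)  # stand-in for real features

# eps: neighborhood distance; min_samples: neighbors needed for a core
# example. No number of clusters is specified.
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X)

# Points that fall in no cluster are labeled -1: candidate anomalies.
print("anomalies:", (labels == -1).sum())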

Hierarchical Clustering

Hierarchical clustering algorithms construct a “dendrogram,” or tree diagram, that illustrates how examples can be progressively grouped based on similarity. This provides a nice visualization of your dataset, indicating which points are more closely related than others. You can choose which similarity metric is used to construct the dendrogram (Euclidean distance is a common choice) based on how you want to interpret data point similarity. For example, you might want to hierarchically cluster a packet capture dataset based on the proximity of packets in the IP address space. In this case, you could choose the Hamming distance metric, which measures the number of bit positions in which IP addresses differ.

If you want to create a specific set of clusters from a hierarchical dendrogram, you can divide the tree at a specific similarity threshold. All data points at least that similar to each other are then part of the same cluster.

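A minimal sketch using SciPy (with stand-in data): linkage() builds the dendrogram, dendrogram() plots it, and fcluster() divides the tree at a distance threshold to produce a flat set of clusters:

import numpy
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
import matplotlib.pyplot as plt

X = numpy.random.rand(50, 4)  # stand-in for real features

# Build the dendrogram with average linkage and Euclidean distance.
# (metric='hamming' would instead group points by differing positions.)
Z = linkage(X, method='average', metric='euclidean')

dendrogram(Z)
plt.show()

# Cut the tree at a distance threshold to obtain flat clusters.
labels = fcluster(Z, t=0.5, criterion='distance')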

Semi-Supervised Learning

Semi-supervised learning leverages unsupervised learning to speed up the process of labeling data for a supervised model. In nearly all fields of machine learning, manual labeling is tedious. This is especially true for network data: imagine going through a packet capture dataset to manually apply a label to every individual packet.

Semi-supervised learning allows you to combine a relatively small number of manual labels with a clustering algorithm to produce a fully labeled dataset.

Semi-supervised learning starts by applying a clustering algorithm to group the unlabeled training data. You then manually label a few randomly selected points from each cluster and propagate the most frequent manual label in each cluster to the other points in the cluster. This gives you a fully labeled data set even though you only had to manually label a few data points.

Ideally, the clustering algorithm produces clusters in which all points are from the same class. In practice, some clusters may have examples from multiple classes. You can perform semi-supervised learning recursively to address this issue. For example, if some clusters have points from different classes, you can re-run the clustering algorithm on these clusters individually to identify sub-clusters corresponding to single classes.

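A minimal sketch of this workflow appears below. The data is a random stand-in, and manual_label() is a hypothetical placeholder for a human inspecting and labeling a single example:

import numpy
from sklearn.cluster import KMeans

X = numpy.random.rand(1000, 4)  # stand-in for unlabeled features

def manual_label(example):
    # Hypothetical stand-in: a human would inspect the example and
    # return its class (e.g., 0 for benign, 1 for malicious).
    return 0

# Step 1: cluster the unlabeled data.
kmeans = KMeans(n_clusters=10, n_init=10)
clusters = kmeans.fit_predict(X)

# Steps 2 and 3: manually label a few points per cluster, then propagate
# the most frequent manual label to the rest of the cluster.
rng = numpy.random.default_rng(0)
labels = numpy.empty(len(X), dtype=int)
for c in range(10):
    members = numpy.flatnonzero(clusters == c)
    sampled = rng.choice(members, size=min(5, len(members)), replace=False)
    votes = [manual_label(X[i]) for i in sampled]
    labels[members] = max(set(votes), key=votes.count)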