Chapter 7: Large Language Models
Background
Large language models have gained increasing prevalence in many aspects of society today, with interactive systems such as OpenAI's ChatGPT and Google's Bard allowing users to query and receive answers for a diverse set of general downstream tasks. Although these systems are highly engineered and tailored, the underlying architecture of models such as GPT and LaMDA is the transformer. In a transformer model, each word in a given input is represented by one or more tokens, and each input is represented with an embedding, a high-dimensional representation of the input that captures its semantics. Self-attention is then used to assign a weight to each token in an input based on its importance to its surrounding tokens. This mechanism allows transformer models to capture fine-grained relationships in input semantics that are not captured by other machine learning models, including conventional neural networks.
The use of large language models in computer networks is rapidly expanding. One area where these models show promise is network security and performance troubleshooting. In this chapter, we will explore some of these early examples in more detail, as well as discuss some of the practical hurdles to deploying LLMs in production networks.
Large language models typically operate on a vocabulary of words. Since this book is about applications of machine learning to networking, the models we work with will ultimately operate on network data (e.g., packets, elements of network traffic), not words or text in a language. Nonetheless, before we talk about applications of LLMs to networking, it helps to understand the basic design of LLMs and how they operate on text data. We provide this overview through background on two key concepts in LLMs: vectors and transformers.
Vectors
Language models represent each word as a long array of numbers called a word vector. Each word has a corresponding word vector, and each word thus represents a point in a high-dimensional space. This representation allows models to reason about spatial relationships between words. For example, the word vector for "cat" might be close to the word vector for "dog", since these words are semantically similar. In contrast, the word vector for "cat" might be far from the word vector for "computer", since these words are semantically different. In the mid-2010s, Google's word2vec project led to significant advances in the quality of word vectors; specifically, these vectors allowed various semantic relationships, such as analogies, to be captured as spatial relationships between vectors.
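To make this idea concrete, here is a toy sketch using invented four-dimensional vectors (real word vectors, such as those produced by word2vec, have hundreds of dimensions); cosine similarity is one common way to measure how close two word vectors are:

```python
import numpy as np

# Toy 4-dimensional word vectors; these values are invented for
# illustration only. Real models learn these vectors from data.
vectors = {
    "cat":      np.array([0.8, 0.1, 0.6, 0.2]),
    "dog":      np.array([0.7, 0.2, 0.5, 0.3]),
    "computer": np.array([0.1, 0.9, 0.2, 0.8]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: values near 1.0 mean
    similar direction (semantically close), values near 0 mean unrelated."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(vectors["cat"], vectors["dog"]))       # high
print(cosine_similarity(vectors["cat"], vectors["computer"]))  # lower
```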
While word vectors, and simple arithmetic operations on these vectors, have turned out to be useful for capturing these relationships, they miss another important characteristic: words can change meaning depending on context (e.g., the word "sound" might mean very different things depending on whether we were talking about a proof or a musical performance). Fortunately, word vectors have also been useful as input to more complex large language models that are capable of reasoning about the meaning of words from context. These models are capable of capturing the meaning of sentences and paragraphs, and are the basis for many modern machine learning applications. LLMs comprise many layers of transformers, a concept we will discuss next.
Transformers
The fundamental building block of a large language model is the transformer. In large language models, each token is represented as a high-dimensional vector; in GPT-3, for example, each token is represented by a vector of nearly 13,000 dimensions. The model first applies what is referred to as an attention layer, which assigns a weight to each token in the input based on its relationships to the other tokens in the input. Within the attention layer, so-called attention heads retrieve information from earlier words in the prompt.
Second, the feed-forward portion of the model uses the results from the attention layer to predict the next token in a sequence given the previous tokens. The weights calculated by the self-attention mechanism are used to compute a weighted average of the token vectors in the input, and this weighted average is then used to predict the next token in the sequence. The feed-forward layers in some sense represent a database of information that the model has learned from the training data; feed-forward layers effectively encode relationships between tokens as seen elsewhere in the training data.
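The following is a minimal, untrained sketch of these two steps for a single attention head, using random weight matrices; real transformer blocks use multiple heads, add residual connections and layer normalization, and learn their weights during training:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over a sequence of
    token vectors X (one row per token)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    # Attention weights: how much each token attends to every other token.
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    # Each output row is a weighted average of the value vectors.
    return weights @ V

def feed_forward(X, W1, b1, W2, b2):
    """Position-wise feed-forward layer applied to each token vector."""
    return np.maximum(0, X @ W1 + b1) @ W2 + b2  # ReLU activation

# Toy dimensions: 5 tokens, 8-dimensional embeddings.
rng = np.random.default_rng(0)
d, d_ff = 8, 32
X = rng.normal(size=(5, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)

out = feed_forward(self_attention(X, Wq, Wk, Wv), W1, b1, W2, b2)
print(out.shape)  # (5, 8): one updated vector per token
```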
Large language models tend to have many sets of attention and feed-forward layers, resulting in the ability to make fairly complex predictions on text. Of course, network traffic does not have the same form or structure as text, but if packets are treated as tokens, and the sequence of packets is treated as a sequence of tokens, then the same mechanism can be used to predict the next packet in a sequence given the previous packets. This is the basic idea behind the use of large language models in network traffic analysis.
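As a hypothetical illustration of this idea, the sketch below maps packets to discrete tokens by assigning an identifier to each distinct combination of a few coarse packet attributes; the specific fields and the `Packet` type are our own illustrative assumptions, not a standard encoding:

```python
from dataclasses import dataclass

@dataclass
class Packet:
    protocol: str       # e.g., "TCP"
    direction: str      # "in" or "out"
    length_bucket: int  # packet size, bucketed to limit vocabulary size

vocab: dict[tuple, int] = {}

def packet_to_token(pkt: Packet) -> int:
    """Assign each distinct (protocol, direction, size bucket) tuple
    its own token id, mirroring how words map to token ids in text."""
    key = (pkt.protocol, pkt.direction, pkt.length_bucket)
    if key not in vocab:
        vocab[key] = len(vocab)  # next unused token id
    return vocab[key]

flow = [Packet("TCP", "out", 1), Packet("TCP", "in", 3),
        Packet("TCP", "out", 1)]
token_sequence = [packet_to_token(p) for p in flow]
print(token_sequence)  # e.g., [0, 1, 0]: ready for a sequence model
```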
A key distinction of large language models from other types of machine learning approaches that we've read about in previous chapters is that training them doesn't rely on having explicitly labeled data. Instead, the model is trained on a large corpus of text and learns to predict the next word in a sequence given the previous words. This is, in some sense, another form of unsupervised learning.
Transformers tend to work well on problems that (1) can be represented as sequences of structured input; and (2) have input spaces so large that no single feature set can sufficiently represent them. In computer networking, several areas, including protocol analysis and traffic analysis, share these characteristics. In both cases, manual analysis of network traffic can be cumbersome, and some of the other machine learning models and approaches we have covered in previous chapters can also fall short on certain types of problems. For example, mapping byte offsets or header fields and their data types for all protocols, while also considering all values a field may take, can yield prohibitively large feature spaces. Detecting and mitigating protocol misconfiguration, in particular, can be well-suited to transformer models, where small nuances, interactions, or misinterpretations of protocol settings can lead to complicated corner cases and unexpected behavior that may be challenging to encode in either static rule sets or formal methods approaches.
BERT is a popular transformer-based model that has been successfully extended to a number of domains, with modifications to the underlying vocabulary used during training. At a high level, BERT operates in two phases: pre-training and fine-tuning. In the pre-training phase, BERT is trained over unlabeled input and evaluated on two tasks (masked token prediction and next-sentence prediction) to verify its understanding of the input. After pre-training, BERT models may then be fine-tuned with labeled data to perform tasks such as classification (or, in other domains, text generation) that share the same input format.
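As a rough sketch of the fine-tuning phase, the following uses the Hugging Face transformers library to load a pre-trained BERT model with a classification head and take a single gradient step on a labeled example; the label set and training details here are illustrative assumptions, not a complete recipe:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Two labels as an assumed example (e.g., benign vs. misconfigured).
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Tokenize one illustrative input and pair it with a label.
inputs = tokenizer(["example input text"], return_tensors="pt",
                   padding=True, truncation=True)
labels = torch.tensor([1])

# One gradient step; a real fine-tuning run iterates over a labeled dataset.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
```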
In recent years, transformer-based models have been applied to large text corpora to perform a variety of tasks, including question answering, text generation, and translation. Their utility outside the context of text, however, and especially on data that does not consist of English words, remains an active area of exploration.
Large Language Models in Networking
The utility of large language models for practical network management applications is an active area of research. In this section, we will explore one early-stage example of applying large language models to network traffic: the analysis of network protocols.
Network Protocol Analysis
We examine a recent example from Chu et al., who explored the use of large language models to detect vulnerable or misconfigured versions of the TLS protocol. In this work, a BERT model was trained using a dataset of TLS handshakes.
A significant challenge in applying large language models to network data is building a vocabulary and corresponding training set that allow the model to understand TLS handshakes. This step is necessary, and important, because existing LLMs are typically trained on text data, with vocabularies based on the English language. In this case, the input to the model is a concatenation of values in the headers of the server_hello and server_hello_done messages, as well as any optional server steps in the TLS handshake. The resulting input was normalized (i.e., to lowercase ASCII characters) and tokenized.
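The sketch below illustrates this general recipe of concatenation, normalization, and tokenization; the field names, dictionary representation of messages, and whitespace tokenization are our own illustrative assumptions, not the paper's exact preprocessing code:

```python
def handshake_to_input(server_hello: dict, server_hello_done: dict) -> str:
    """Concatenate header values from the server's handshake messages
    into a single normalized (lowercase) string."""
    values = list(server_hello.values()) + list(server_hello_done.values())
    return " ".join(str(v).lower() for v in values)

def tokenize(text: str) -> list[str]:
    # Simple whitespace tokenization; real systems may use subword units.
    return text.split()

sample = handshake_to_input(
    {"version": "TLSv1.2", "cipher_suite": "TLS_RSA_WITH_AES_128_CBC_SHA"},
    {"type": "ServerHelloDone"},
)
tokens = tokenize(sample)
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
print(tokens, vocab, sep="\n")
```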
The resulting trained model was evaluated against a set of labeled TLS handshakes, with examples of known misconfigurations coming from the Qualys SSL Server Test website. The model was able to correctly identify TLS misconfigurations with near-perfect accuracy.