Chapter 1: Introduction¶
The Internet is a global network of interconnected computer networks that forms the basis of much modern communication, including web browsing, messaging, email, video conferencing, on-demand movies, and gaming. Over time, it has come to constitute critical infrastructure, as much of society depends on the Internet for work, education, and recreation. The Internet is indeed central to just about every aspect of modern life that we can think of. The COVID-19 pandemic, in particular, highlighted the importance of Internet access—and the dire consequences to work, health, education, and education for those without reliable Internet access.
Yet, despite the critical importance of the Internet, it is prone to failures, misconfigurations, and attacks that can lead to disruptions of connectivity and service, leading to downtime that can cause significant inconvenience to users. The Internet is not a self-managing system: as problems arise, network operators must detect, diagnose, and fix these problems so that users can enjoy good performance and uninterrupted service. This process of network management involves monitoring the network for aberrations and fixing problems as (or before) they arise, so that the network can continue to run well.
How Networks Run¶
To maintain good performance for the applications they must support, networks must continually adapt to changing conditions. For example, when a component of the network fails, either due to misconfiguration or a hardware failure, the network must be able to detect the failure and reroute traffic around the failed component. Similarly, when the network is congested or traffic demands change, the network must reroute traffic around the link or provision additional capacity. Finally, when a new application is deployed, the network must be configured to support the application’s requirements. These tasks are often referred to as network management.
In 2020, the COVID-19 pandemic was a testament to the importance of network management—and the fact that networks don’t just automatically adapt to these changes (yet), but rather than human network engineers and operators needed to respond to the changing conditions that resulted from sudden and dramatic changes to traffic patterns. Specifically, as much of society shifted to working from home, Internet traffic patterns changed dramatically. Traffic that previously was exchanged on enterprise, corporate, and university networks suddenly shifted to home networks; video conference traffic from home networks skyrocketed, placing unprecedented load and strain on home networks. And yet, despite predictions of its demise, the Internet actually handled these dramatic shifts quite well.
The process of adapting to these changes, however, was not automatic. Network management tasks on both short and long timescales ensured that the Internet continued to operate well in spite of these dramatic shifts. In the short term, network operators needed to detect and respond to various performance issues, as sudden shifts in traffic placed unprecedented strain on parts of the network: For example, as more traffic shifted to video conferencing, the traffic volumes between access Internet service providers and the cloud service providers hosting Zoom, WebEx, Teams, and other video conference services introduced higher traffic loads, which (at least temporarily) introduced additional congestion on the network. Similarly, as entertainment stayed home, traffic volumes to video on demand services increased beyond normal levels.
In general, communications networks need to be continually maintained by humans, who are typically referred to as network operators. Network management tasks can be short-term or long-term. Short-term tasks involve responding to equipment or infrastructure failures, sudden and unexpected changes in traffic load, and attacks. Long-term tasks involve provisioning capacity, establishing and satisfying service-level agreements, or making other more substantial changes to infrastructure. All of these tasks typically require a careful process, sometimes involving simulations or models to help predict how a particular change to network configuration or infrastructure might ultimately affect how traffic flows through the network and, ultimately, user experience.
For a long time, these network management tasks tasks have typically relied on extensive domain knowledge—and a lot of trial and error. These tasks have also become more difficult and error-prone, even as the systems themselves have become more complex. The need to predict the outcomes of these tasks—and automate them—has become particularly challenging. Whereas several decades ago, the networking industry had some success in creating closed-form models and simulations of protocols, applications and systems, today these systems are far too complex for closed-form analysis.
The Role of Machine Learning¶
The complexity of network management has been contemporaneous with several other trends: (1) the ability to collect an increasing amount of data from (and about) the network; and (2) the emergence of practical machine learning models and algorithms that can make this prediction easier.
Before understanding how machine learning can improve the operation of communications networks, we must first understand what machine learning is, and how it might be used. Early definitions of machine learning date to the 1950s. Samuel defined machine learning as “the ability to learn without being explicitly programmed”. In the 1990s, Mitchell refined this definition: “A computer program is said to learn from experience E, with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.” In this book, we will explore a wide array of machine learning algorithms, but some of the more common themes include recognizing patterns in data, making predictions based on past data or observations, inferring properties that cannot be directly measured, and automatic generation of synthetic data.
Figure 1 illustrates the potential role of machine learning in the network ecosystem. The network produces much information for a network operator to make sense of, including raw traffic flow records, summary statistics, and even user feedback. Measurement, which we will discuss in detail in Chapter 3, is the process by which this raw data can be generated. Data from measurement forms the input to machine learning models. Subsequently, machine learning models can use that data as input, to make predictions or otherwise infer information about the network (e.g., the presence of an attack, the performance of a particular application, a user’s experience).
Given inference or prediction from a model, a network operator can then make decisions about possible changes to a network. Those changes might be implemented directly (and manually) by a network operator; or perhaps in some automated fashion, such as through a program that automatically updates the network configuration. Recent trends in networking have called this closed-loop control a “self-driving network”; in fact, most of the time updates to network configuration typically involve a human in the loop. In this sense, the application of machine learning to networking is complementary to another trend in networking called software-defined networking, which seeks to add programmatic control to network configuration—and, particularly, to control the actions on network traffic, such as how traffic us forwarded. Knowing what changes to make in the first place, however, requires the ability to infer, forecast, and predict, so that the appropriate changes can be made in the first place.
Why Now?¶
Machine learning has been applied to networking since the mid-1990s, dating back to the design of early email spam filters, as well as early anomaly and malware detection systems on closed datasets (e.g., the DARPA 1999 KDD challenge). In the mid-2000s, machine learning applications to networking experienced a renaissance, with applications to both network security and performance. In the security realm, researchers devised techniques to use network traffic as inputs for machine learning models that detect “botnets”, large armies of compromised hosts that launch large-scale Internet attacks. Others determined that applying machine learning to network traffic could be a far more robust technique for detecting email spam than analyzing the contents of the mail. Applications of machine learning in these domains ultimately formed the basis of a wide array of commercial products in the network security space, from startups to Fortune 500 companies. A few years later, machine learning emerged as a feasible approach for predicting the performance of networked systems that were difficult to model in closed form. For example, researchers developed machine learning models for predicting how network configuration changes would ultimately affect the performance of a web application, such as web search response time.
Applications of machine learning to networking in the mid-2000s led to substantial advances in network management in the realms of both security and performance. From a security perspective, modeling network traffic enabled detection of new, previously unseen attacks. These models also assisted with performance prediction and diagnosis: Previously, protocols and systems were simple enough to analyze in closed form, but as networked services and applications such as web hosting, content distribution, and distributed dynamic applications became more prevalent, systems became too complex to model in closed form. Machine learning began to show significant promise in this era, but most of the applications of machine learning remained focused on offline detection and prediction.
Now, nearly 20 years later, machine learning is experiencing yet another rebirth in networking. The “democratization” of machine learning through widely available software libraries and tools and the automation of many machine learning pipelines make machine learning more widely accessible to those who wish to apply it to existing datasets. To complement these developments, the emergence of programmable network telemetry has made it more possible to gather a wide variety of network traffic data—and different representations of that traffic—from the network, often in real-time. The combination of these two developments has created vast possibilities for how machine learning can be applied to networking, from long-term prediction and forecasting, to short-term diagnosis, to real-time adaptation to changing conditions.
What You Will Learn in This Book¶
This book offers a practical introduction to network measurement, machine learning methods and models, and applications. We will explore the details of network measurement, in particular the different types of network data that are available given current network devices, from routers to middleboxes, including the techniques for gathering and acquiring network data; what is possible to learn and infer from different types of network data; and new directions in network measurement and data representation, from advances in network telemetry to existing and emerging ways of representing network data for input to models.
Equipped with a better understanding of how various data is acquired from the network, we will then turn to an overview of machine learning methods. This book is not designed to give you a detailed overview or mathematical foundations behind each of the models—there are plenty of books out there for that already. Our goal in this book, rather, is to give you exposure to a variety of methods, show examples of how these models and methods can be applied in practice, provide an understanding of how these methods work, as well as when a particular model might be appropriate or not to apply for a particular dataset. A large focus of the book is also on how to apply these models in practice; as a result, we will focus significant attention on data acquisition, and representation—as we will see, the choice of how to represent the data to a model is often as important as the choice of model itself.
This practical view of machine learning for networking necessitates a focus on the entire machine learning pipeline, from data acquisition to model maintenance. Whereas most machine learning courses and textbooks focus on the modeling aspects of machine learning in isolation, in practice, the machine learning models themselves are but a small contributor to the overall effectiveness of prediction and inference. As such, we will explore all aspects of the machine learning and data science pipeline, from data acquisition to model deployment and maintenance, to the ethical and legal concerns surrounding applications of machine learning for networking.