Chapter 8: Reinforcement Learning

In this chapter, we introduce reinforcement learning, the process by which a machine learning model acts in an environment and learns to improve its performance based on feedback. The goal of reinforcement learning is to identify a strategy for choosing actions that will maximize the model’s expected rewards over time.

Background

Unlike supervised learning, which requires pairs of examples and correct labels, reinforcement learning only requires feedback when the model does something relevant to the task. Unlike both supervised and unsupervised learning, many reinforcement learning models are able to explore their environment, making tradeoffs between searching for new approaches and optimizing existing knowledge.

Throughout this chapter, we will describe the fundamental concepts of reinforcement learning as well as several reinforcement learning models, using networking examples as a guide. We will focus on a class of reinforcement learning models called Q-learning, as these models are relatively straightforward and perform well on many diverse tasks.

We do not assume you’ve seen these models before, and so readers who are aiming to get basic intuition behind different models will find this chapter helpful. Readers who are already familiar with these models may still find the examples in this chapter helpful, as they present cases where particular models or types of models are suited to different problems, as well as cases in the networking domain where these models have been successfully applied in the past.

We organize our discussion of reinforcement learning into the following topics: (1) important concepts, (2) Q-value iteration, (3) Q-learning, (4) deep Q-learning.

Markov Decision Processes

Reinforcement learning models are set in an environment that defines the possible states where the model can exist, what actions the model can take, and the effects of those actions. We can conceptualize an environment as a Markov decision process, i.e., a graph that connects states (vertices) with actions (edges).

For every pair of states and a corresponding action, there is a transition probability. If the model is in the first state and takes the action, this is the probability that it will end up in the second state. If the environment is deterministic, the transition probabilities are all either 0 or 1, indicating that there is no uncertainty about the results of actions. Real-world environments are typically nondeterministic, or are deterministic but rely on values inaccessible to the model. In these cases, the transition probabilities fall between 0 and 1, reflecting the fact that actions have uncertain outcomes from the model’s perspective.

[Figure: an example Markov decision process, with states as vertices and actions as edges annotated with transition probabilities.]

Many tasks can be expressed in this framework. For example, we can frame a network intrusion detection system as a Markov decision process: the states capture the system’s current view of a traffic flow, the actions are the responses available to the system (such as allowing, blocking, or further inspecting the flow), and the transition probabilities capture how likely each response is to lead to each subsequent state of the network.
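To make this concrete, the sketch below shows one way such an environment might be written down in code. The states, actions, transition probabilities, and rewards are hypothetical placeholders chosen for illustration, not values from a real intrusion detection system.

```python
# A minimal sketch of a Markov decision process as Python dictionaries.
# All states, actions, probabilities, and rewards below are hypothetical.

# Transition probabilities: T[(s, a)] maps next states s' to their probabilities.
T = {
    ("suspicious_flow", "block"):   {"flow_dropped": 1.0},
    ("suspicious_flow", "allow"):   {"benign_traffic": 0.7, "compromised_host": 0.3},
    ("suspicious_flow", "inspect"): {"suspicious_flow": 0.5, "benign_traffic": 0.5},
}

# Reward function: R[(s, a, s')] is the immediate reward for that transition.
R = {
    ("suspicious_flow", "block", "flow_dropped"):       1.0,   # stopped a potential attack
    ("suspicious_flow", "allow", "benign_traffic"):     0.5,   # avoided disrupting a legitimate user
    ("suspicious_flow", "allow", "compromised_host"): -10.0,   # let an attack through
    ("suspicious_flow", "inspect", "suspicious_flow"): -0.1,   # inspection has a small cost
    ("suspicious_flow", "inspect", "benign_traffic"):   0.5,
}

GAMMA = 0.9  # discount factor
```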

Discounted Rewards

Two values are essential to guiding the learning process of reinforcement learning models:

A reward function provides positive feedback to the model when it does something that is useful for the task and negative feedback when it does something that is counterproductive to the task. For every combination of two states and an action, the reward function defines a reward that the model should receive if it starts in the first state, takes the action, and ends up in the second state. The creation of an effective reward function is essential to the learning process, as positive and negative rewards guide the model towards learning a strategy to perform the desired task.

The discount factor is a scalar hyperparameter that quantifies how much the model should value future rewards relative to immediate rewards. A low discount factor incentivizes the model to learn a strategy that maximizes rewards now (or in the near future), potentially lowering the total reward it receives over time (if it were allowed to run long enough to collect rewards in the distant future). A high discount factor prioritizes maximizing the total reward received over time, but might result in lower rewards in the near future.
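To see the effect of the discount factor, the short sketch below computes the discounted sum of a hypothetical stream of future rewards under two different values of \(\gamma\); the reward values and the specific \(\gamma\) settings are illustrative only.

```python
# Effect of the discount factor on a hypothetical stream of future rewards.
# A reward received t steps in the future is scaled by gamma ** t.
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]  # one unit of reward at each of the next five steps

def discounted_return(rewards, gamma):
    """Sum of rewards, each discounted by how far in the future it arrives."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return(rewards, gamma=0.9))  # ~4.10: future rewards still count heavily
print(discounted_return(rewards, gamma=0.1))  # ~1.11: almost only the immediate reward matters
```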

Q-Values

To make things easier (and because it is essential for the models discussed in the next sections), we will introduce one additional composite value: the q-value. Q-values are defined with respect to state/action pairs. The q-value of a state/action pair is the sum of the discounted future rewards that the model can expect if it chooses to take that action when it is in that state.

This is a big concept, so let’s break it down. First, imagine that the model is in a particular state s. It can potentially choose among several actions, but let’s imagine that it chooses action a. As a result of this choice, the model will end up in some other state (based on the transition probabilities) and might receive some immediate reward (based on the reward function). The model will then go on its merry way, continuing to choose actions and receive rewards on into the future. So just how good was the choice of action a from state s? To quantify this, we have to account for any immediate reward received and any future rewards that the model might have set itself up for by making this choice. Since we don’t want this q-value to depend on the future (it should only depend on s and a), let’s assume that the model only makes perfect choices in the future, receiving the maximum amount of reward possible. Of course, these future rewards might be less enticing to the model than immediate rewards based on the discount factor. The q-value therefore combines both the immediate reward and the discounted optimal future rewards into a single number indicating the quality of the decision of taking action a in state s.

The q-value is important because a model that knows (or learns) accurate q-values can perform optimally. When it finds itself in any state, it simply chooses the action for which the corresponding state/action q-value is the highest. Because the q-values have already accounted for the effect of this action on the model’s future options, the model doesn’t need to worry about the future when making any individual decision – it just chooses an action based on the q-values associated with its current state.
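In symbols, this decision rule is the greedy policy with respect to the q-values:

\[\pi(s) = \arg\max_{a} Q(s,a)\]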

This formulation of q-values reframes reinforcement learning as the task of determining accurate q-values for any state-action combination that the model might encounter. Fortunately, this reframing has guided model design and led to many effective reinforcement learning algorithms, including those we discuss below. Unfortunately, it does not by itself overcome a fundamental practical limitation: if there are an infinite number of states and/or actions in an environment, there will also be an infinite number of q-values. The following approaches address this issue for environments of increasing complexity.

Q-Value Iteration

Fortunately, some tasks, especially in the networking domain, have narrowly defined environments in which it is possible to determine and store all information about the environment, including the states, actions, transition probabilities, reward function, and discount factor, prior to any learning taking place. For example, a system that assigns incoming requests to a small, fixed set of servers whose behavior is well characterized can be described completely before any learning begins.

In these cases, the agent can run an offline iterative algorithm to determine all q-values without needing to explore or otherwise interact with the environment whatsoever. Starting with all q-values set to \(0\), the model computes the following update iteratively for each (state, action) pair until the q-values converge, where \(s\) is the current state, \(a\) is the action, \(s'\) is a possible next state, \(a'\) ranges over the actions available from \(s'\), \(T\) is the transition probability function, \(R\) is the reward function, and \(\gamma\) is the discount factor. This process uses the q-values computed at iteration \(k\) to compute the q-values at iteration \(k+1\).

\[Q_{k+1}(s,a) \leftarrow \sum_{s'}T(s,a,s')[R(s,a,s') + \gamma \max_{a'} Q_k(s', a')]\]

Once all q-values have converged, the agent can then use the q-values to act optimally in the environment. Regardless of which state \(s\) the agent finds itself in, it chooses the action \(a\) for which the q-value of the pair \((s,a)\) is the highest.
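The update above can be implemented as a short offline loop. The sketch below assumes the dictionary-based environment representation from the earlier example (the states, actions, \(T\), \(R\), and \(\gamma\) are all supplied up front); it is a minimal illustration rather than a production implementation.

```python
def q_value_iteration(states, actions, T, R, gamma, tol=1e-6):
    """Offline Q-value iteration over a fully known environment.

    T[(s, a)] maps possible next states s' to their transition probabilities;
    R[(s, a, s')] is the immediate reward for that transition.
    """
    Q = {(s, a): 0.0 for s in states for a in actions}  # all q-values start at 0
    while True:
        new_Q = {}  # q-values at iteration k+1, computed from iteration k
        for (s, a) in Q:
            if (s, a) not in T:
                new_Q[(s, a)] = 0.0  # action unavailable in this state
                continue
            new_Q[(s, a)] = sum(
                prob * (R.get((s, a, s2), 0.0)
                        + gamma * max(Q.get((s2, a2), 0.0) for a2 in actions))
                for s2, prob in T[(s, a)].items()
            )
        if max(abs(new_Q[sa] - Q[sa]) for sa in Q) < tol:
            return new_Q
        Q = new_Q


def best_action(Q, s, actions):
    """After convergence, act greedily: pick the action with the highest q-value."""
    return max(actions, key=lambda a: Q.get((s, a), float("-inf")))
```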

Q-Learning

Unfortunately, for most reinforcement learning tasks, the agent does not know much (or anything) about the environment at the outset. Instead, the agent must explore the environment to discover the states and rewards. This is an online process, meaning that the q-values, and therefore the best strategy, cannot be calculated a priori. Only by engaging with the environment can the agent figure out the q-values. Starting with estimates for all q-values at 0, the agent performs the following iterative update after each exploratory step, where \(r\) is the reward received, \(s'\) is the state in which the agent ends up, and \(\alpha\) is a learning rate parameter that controls how fast new exploratory steps update the existing q-value estimates.

\[Q(s,a) \leftarrow (1-\alpha)Q(s,a) + \alpha\left(r+\gamma \max_{a'} Q(s',a')\right)\]

This process tells the agent how to update q-value estimates, but it doesn’t say anything about which actions to take while exploring. Environment exploration is an example of an “exploration/exploitation” tradeoff. On one hand, it makes sense to venture into new areas of the environment to see if they yield new approaches to the solution. On the other hand, it makes sense to fine-tune strategies already known to work well. In practice, Q-learning agents define a (potentially dynamic) parameter \(0 \le \epsilon \le 1\) such that a random action is chosen with probability \(\epsilon\) (exploration) and the action with the highest estimated q-value at the current state is chosen with probability \(1-\epsilon\) (exploitation). Once the agent has finished exploring and is satisfied with its q-value estimates, it can enter a deployment phase where it always chooses the action that maximizes the estimated q-value.
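The sketch below combines the update rule and the epsilon-greedy exploration strategy into a single tabular Q-learning loop. The `env` object and its `reset`/`step` methods are hypothetical placeholders standing in for whatever interface the environment exposes, not a specific library API.

```python
import random
from collections import defaultdict

def q_learning(env, actions_for, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration.

    `env` is a hypothetical environment exposing reset() -> initial state and
    step(state, action) -> (next_state, reward, done); `actions_for[s]` lists
    the actions available in state s.
    """
    Q = defaultdict(float)  # q-value estimates, all starting at 0

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Exploration/exploitation tradeoff: random action with probability epsilon,
            # otherwise exploit the action with the highest current q-value estimate.
            if random.random() < epsilon:
                a = random.choice(actions_for[s])
            else:
                a = max(actions_for[s], key=lambda act: Q[(s, act)])

            s_next, r, done = env.step(s, a)

            # Q-learning update: blend the old estimate with the newly observed sample.
            best_next = max((Q[(s_next, a2)] for a2 in actions_for.get(s_next, [])), default=0.0)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

            s = s_next
    return Q
```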

Approximate Q-Learning

For all but the simplest tasks, it is not possible to enumerate all (state, action) pairs, making it impossible to keep an estimate of every q-value in memory. In analog real-world environments, there are a (practically) infinite number of states. Even digitally represented environments often have more possible states and actions than can be feasibly computed or stored. Much of the progress in reinforcement learning has come from developing models that better handle these aspects of realistic environments.

Rather than creating an agent that attempts to store all q-values, approximate Q-learning has the agent use a machine learning model to predict the q-value for the current state and all possible actions. This approach significantly reduces the amount of storage needed, as the agent need only maintain the model, the current state, and the set of possible actions. A version of approximate Q-learning called deep Q-learning uses a neural network to predict q-values. The agent collects training data during exploration. If the agent takes action \(a\) from state \(s\) during exploration, receives reward \(r\), and ends up in state \(s'\), it updates the label (target) q-value for the pair \((s,a)\) to the following:

\[Q_{target}(s,a) = r + \gamma \max_{a'}Q_{pred}(s',a')\]

After taking several exploratory steps and updating several q-value targets, the agent re-trains the deep learning model such that its predicted q-values are closer to the targets. Once the agent has finished exploring and is satisfied with the final trained deep learning model, it can enter a deployment phase where it always chooses the action with the highest predicted q-value at the current state.
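As a rough illustration, the sketch below shows what one re-training step might look like using PyTorch. The state dimension, number of actions, and network architecture are arbitrary assumptions, and the sketch omits practical details such as replay buffers and terminal-state handling.

```python
import torch
import torch.nn as nn

# Illustrative sizes; a real task would determine these from its state encoding.
STATE_DIM, N_ACTIONS, GAMMA = 8, 4, 0.9

# The network maps a state vector to one predicted q-value per action.
q_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, N_ACTIONS),
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(states, actions, rewards, next_states):
    """One re-training step on a batch of transitions collected during exploration.

    states, next_states: float tensors of shape (batch, STATE_DIM)
    actions: long tensor of shape (batch,) with the action taken in each transition
    rewards: float tensor of shape (batch,)
    """
    # Target q-values: immediate reward plus discounted best predicted future q-value.
    with torch.no_grad():
        next_q = q_net(next_states).max(dim=1).values
        targets = rewards + GAMMA * next_q

    # Predicted q-values for the actions that were actually taken.
    predicted = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Move the predictions toward the targets.
    loss = loss_fn(predicted, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```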