Chapter 5: Supervised Learning¶
In this chapter, we will discuss supervised learning, the process by which a machine learning model can learn from labeled data (sometimes called labeled examples). Supervised learning requires having access to one or more labeled datasets—data that has not only the features, but also an associated label for each data point. For example, in the case of malware classification, the features might include metrics from the network traffic (e.g., bytes per second, packets per second, number of IP addresses contacted), and the labels could be whether the traffic is being generated by a malicious software program (“malware”).
In this chapter, we will describe a variety of supervised learning models, using examples from networking as an illustrative guide. We do not assume you’ve seen these models before, and so readers who want to get basic intuition behind different models and how they can be applied in different network settings should find this chapter illuminating. Readers who are already familiar with these models may also find the discussion helpful, as the examples in the chapter present cases where particular models or types of models are suited to different classification problems, as well as cases in the networking domain where these models have been successfully applied.
We organize the discussion of supervised learning in this chapter into the following categories:
non-parametric models (i.e., models where the size of the model grows with the size of the dataset);
linear models (i.e., models based on a linear prediction function, including possible basis expansion);
tree-based models;
ensemble methods (i.e., models that make predictions by combining the predictions of simpler models); and
deep learning models (i.e., those that can also learn representations of the data).
Non-Parametric Models¶
Non-parametric models grow as the size of the dataset grows. Perhaps the most well-known and widely used non-parametric model is k-nearest neighbors (kNN). We describe this model below as well as examples where kNN has been used in networking. We also provide examples for you to try.
K Nearest Neighbors¶
With k-nearest neighbors, the model simply stores the training data; when asked to predict a label for a new example, it returns the most common label (classification) or the mean label (regression) of the k closest examples in the training data. The model can vary based on the distance function used to define “closest” (there are many options), but ultimately, k-nearest neighbors boils down to finding the closest training examples and predicting their mode or their mean.
Training a k-nearest neighbors model is simple and efficient: there is nothing to do other than store the training data in a data structure, such as a KD-tree, that makes it easy to compare distances between a data point and observations in the training data. All of the work and computational effort thus occurs during prediction, where these distances are computed. The choice of k is a hyperparameter that can be selected using a standard tuning approach, such as a line search, along with a validation set.
Because kNN is sometimes impractical (e.g., storing the training set may be prohibitive, and inference time can be large), it is most often used as a baseline against which to compare other, more practical models. kNN is relatively easy to train and optimize, but its computational performance is poor if the training set has many examples, and it gets worse as the training set grows.
As a concrete illustration, the sketch below trains a kNN classifier with scikit-learn on a small, synthetic flow-statistics dataset; the feature names, values, and the choice of k = 3 are purely illustrative.
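```python
# A minimal kNN sketch using scikit-learn; the flow features and labels
# below are synthetic placeholders, not a real traffic dataset.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Each row: [bytes_per_second, packets_per_second, unique_ips_contacted]
X_train = np.array([
    [5e3, 10, 2],
    [6e3, 12, 3],
    [9e5, 800, 150],
    [8e5, 750, 120],
])
y_train = np.array([0, 0, 1, 1])  # 0 = benign, 1 = malware

# Standardize features so no single feature dominates the distance metric,
# then predict the majority label among the k=3 nearest training examples.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_train, y_train)

X_new = np.array([[7e5, 700, 100]])
print(model.predict(X_new))  # -> [1]
```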
Because kNN operates by computing distances in a multi-dimensional feature space, the model works best when features are standardized to have zero mean and unit variance. Otherwise, the distance computation may be dominated by whichever feature happens to take values with the largest magnitudes.
Unfortunately, kNN also scales poorly with the number of features in each example (i.e., the dimensionality of the feature set). Many distance functions between vectors do not work well in high dimensions, because in higher-dimensional spaces everything starts to be far away from everything else. Even if two examples happen to be close along one dimension, the sheer number of dimensions means they are likely to be far apart along others; as you sum the distances (or squared distances) along each dimension, the total distance can blow up. This phenomenon is called the curse of dimensionality.
You can sometimes mitigate this by using distance functions tailored to particular types of data, but in general, the curse of dimensionality is a problem for any geometric model that relies on vector similarity or vector distance.
Despite their simplicity, kNN models have been successfully used in various contexts to perform basic classification tasks using network data. Examples include: (1) positioning and geolocation; (2) website or device fingerprinting; and (3) attack classification and detection (e.g., DDoS detection). Such classifiers are common in research papers, particularly as a baseline approach or model—or in attack papers where the attack, such as website fingerprinting, need not be efficient (e.g., if it can be performed offline, or simply to demonstrate feasibility of an attack). In practice, other models are typically more common, particularly due to the computational requirements that can make kNN inefficient in practice.
Linear Models¶
Some prediction problems are well suited to linear models (or simple polynomial expansions of linear models). In the case of regression (predicting a continuous target variable), linear regression may be appropriate. In the case of classification (predicting a discrete label), logistic regression or a support vector machine may be appropriate.
Linear Regression¶
Linear regression is one of the simplest supervised models.
Imagine that each data point has N features and is augmented with a constant first entry of 1, so that \(\mathbf{x} = [1, x_1, ..., x_N]\).
A linear regression model trained on this data has N weight parameters and a single bias parameter b, collected as \(\mathbf{w} = [b, w_1, ..., w_N]\). The model predicts labels as the linear combination of a data point’s features weighted by the model parameters: \(\hat{y} = b + w_1x_1 + ... + w_Nx_N\). This can be written compactly as the dot product of the data point and the model parameters: \(\hat{y} = \mathbf{x} \cdot \mathbf{w}\). If predicting labels for an entire dataset \(\mathbf{X}\) (whose columns are the data points), this becomes \(\mathbf{\hat{y}} = \mathbf{w}^T\mathbf{X}\).
This is essentially the equation for a line (\(y = mx + b\)) generalized to more dimensions, where the “slope” of the line is the weight parameters and the “intercept” of the line is the bias parameter.
Training a linear regression model involves choosing the weight and bias parameters to minimize the error between the predicted labels and the actual labels for the training set. There are many error functions that can be used for this training minimization. A common choice, the mean squared error, is also convenient for training: \(Error = \frac{1}{m} \sum^{m}_{i=1}(\mathbf{w}^T\mathbf{x}_i - y_i)^2\)
We can use either a closed-form solution or gradient descent to find values of w that minimize this error across the examples x in the training set.
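To make this concrete, the sketch below fits a linear regression two ways: with the closed-form normal equation in NumPy and with scikit-learn’s LinearRegression. The data (a hypothetical load-versus-latency relationship) is synthetic and purely illustrative.

```python
# Minimal linear regression sketch: closed-form solution vs. scikit-learn.
# The data below is synthetic and for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=50)            # e.g., offered load (Mbps)
y = 2.5 * x + 10 + rng.normal(0, 5, 50)     # e.g., observed latency (ms)

# Closed-form solution: prepend a column of 1s so w = [b, w_1].
X = np.column_stack([np.ones_like(x), x])
w = np.linalg.solve(X.T @ X, X.T @ y)       # solves (X^T X) w = X^T y
print("closed form:", w)

# The same fit with scikit-learn's exact least-squares solver.
model = LinearRegression().fit(x.reshape(-1, 1), y)
print("sklearn:", model.intercept_, model.coef_)
```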
Regularization¶
Ridge regression is just linear regression with the L2 norm of the parameters added to the error function. The weight of this term is controlled by a hyperparameter that lets you tune the relative emphasis given to the simplicity of the model (the L2 penalty) versus the fit of the model. If you set this hyperparameter quite high, gradient descent is strongly incentivized to keep the parameter values small in magnitude and the model simple. If you set it low, the algorithm is incentivized to fit the training data closely.
Lasso regression is just linear regression with the L1 norm of the parameters added to the error function. Instead of using the Euclidean magnitude of the parameter vector as the penalty, you use the sum of the absolute values of the parameters. A benefit of using the L1 norm is that it can push the values of parameters that aren’t particularly important to exactly 0. This means that you could even decide to remove those features altogether and further simplify your model. Unfortunately, lasso regression gradients can behave erratically if there are many correlations between features: as you approach the minimum using gradient descent, the updates can start to bounce around rather than settling into a final value.
To get the benefits of both lasso and ridge regression, you can combine them into ElasticNet. The cost function for ElasticNet includes the original error function for linear regression, the L1 penalty term (from lasso), the L2 penalty term (from ridge), and another hyperparameter r that determines how to mix the two penalties. The more you turn up r, the more it behaves like lasso; the more you turn it down, the more it looks like ridge. For the most part, if your dataset is simple enough for a model to perform well using any one of these approaches, it will likely also perform well using the others. Data is generally either amenable to one of these linear models, or these models just don’t provide enough expressivity and it won’t matter which regularization option you choose [CITATION NEEDED].
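The sketch below shows how ridge, lasso, and elastic net are invoked in scikit-learn; the `alpha` argument controls the overall penalty weight and `l1_ratio` plays the role of the mixing hyperparameter r described above. The data is synthetic, with only two features that actually matter.

```python
# Ridge, lasso, and elastic net sketch; synthetic data for illustration.
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
# Only the first two features actually matter; the rest are noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 100)

ridge = Ridge(alpha=1.0).fit(X, y)                    # L2 penalty
lasso = Lasso(alpha=0.1).fit(X, y)                    # L1 penalty
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # mix of both

# Lasso tends to drive the coefficients of irrelevant features to exactly 0.
print("ridge:      ", ridge.coef_)
print("lasso:      ", lasso.coef_)
print("elastic net:", enet.coef_)
```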
Polynomial Regression (Basis Expansion)¶
Polynomial regression involves preprocessing the features in your dataset to include polynomial combinations of existing features. For example, you might add the square of each feature and all pairwise products of the features. Training a linear regression on these expanded features is then effectively the same as training a quadratic model, because you are taking linear combinations of second-degree combinations of the original features. You could do the same with features that include all of the third-degree combinations of the original features. The number of features grows quickly as you increase the degree of the polynomial, but you gain the ability to use the same linear regression training procedure while modeling higher-degree relationships between your features. This lets you learn curves that aren’t just straight lines [figure], and it lets you learn a polynomial model of any degree using the same training process and gradient descent.
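Here is a minimal sketch of basis expansion using scikit-learn’s PolynomialFeatures: the added quadratic term lets an ordinary linear regression capture a curved relationship. The data is synthetic.

```python
# Polynomial (basis-expansion) regression sketch; synthetic data.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=100).reshape(-1, 1)
y = 0.5 * x[:, 0] ** 2 - x[:, 0] + rng.normal(0, 1, 100)  # quadratic trend

# Expand features to [x, x^2], then fit an ordinary linear regression on them.
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      LinearRegression())
model.fit(x, y)
print(model.predict([[4.0]]))  # close to 0.5*16 - 4 = 4
```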
Logistic Regression¶
Another common form of regression is logistic regression. Logistic regression performs almost exactly the same process as linear regression, but with one change: rather than outputting the linear combination of parameters and features directly, the output of that linear combination is passed through a sigmoid function. This transformation constrains the output prediction to the [0, 1] range; the output is very close to 0 or very close to 1 for most inputs, with a narrow region in the center where the transition happens fairly quickly. This is useful for classification.
When performing classification, you want to know whether a data point is in a specific class or not, so by wrapping the model output in a sigmoid, you can say that if the output is greater than 0.5 you predict class 1, and if it is less than 0.5 you predict class 0. In most cases, the output will already be very close to 1 or 0. Everything else in the training process works the same way: you compute the gradient of the error function using this as your predictor, applying the chain rule to compute the partial derivatives. The sigmoid function is continuous and differentiable, so this poses no problem; gradient descent proceeds exactly as before, with the equations slightly different as a result of the sigmoid. Logistic regression also generalizes to the multi-class case; the generalization is called softmax regression, and we return to the softmax function later when we discuss deep learning.
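Below is a minimal logistic regression sketch with scikit-learn; the sigmoid and the 0.5 decision threshold are applied internally by `predict`, while `predict_proba` exposes the sigmoid outputs directly. The two features and the labeling rule are synthetic.

```python
# Logistic regression sketch; synthetic two-feature data for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
# Two synthetic features, e.g., [mean packet size, flows per minute].
X = rng.normal(size=(200, 2))
# A synthetic rule generates labels: class 1 when a noisy linear score is positive.
y = (1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, 200) > 0).astype(int)

model = LogisticRegression().fit(X, y)
print(model.predict(X[:5]))        # hard 0/1 labels (threshold at 0.5)
print(model.predict_proba(X[:5]))  # sigmoid outputs: P(class 0), P(class 1)
```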
Support Vector Machines¶
If you are trying to perform a binary (or multi-class) classification task with linearly separable data, the optimal model will consist of a line (or plane, or hyperplane) that divides the feature space such that all of the examples on one side are in one class and all of the examples on the other side are in the other class. When asked to predict the class of a new example, you make the prediction based on which side of the line (or plane, or hyperplane) the example falls.
SVMs are very common, and they perform very well for small datasets. A dataset that is small but contains features that separate well in linear space may be very amenable to SVMs. SVMs can produce predictions that are robust to overfitting and often good at generalizing. It is also possible to perform regression with SVMs. To do so, the aim is to fit all of the data within the margin instead of outside the margin. The decision line (or plane) thus becomes a regression line.
Max-Margin Classifiers¶
The question remains: how do we choose which line or plane to use for this model? There might be an infinite number of planes that separate the data, so how do you choose the one with the best chance of optimizing prediction accuracy? Remember that prediction accuracy comes down to how well the model generalizes to new data outside the training set. The core intuition behind an SVM is that the best separating line or plane is the one with the most space between training examples of different classes: the maximum margin, or “max margin”.
Training an SVM involves finding the line that maximizes the margin, i.e., the space separating the line from the training data. The examples that end up closest to this optimal line are called the support vectors; these are the examples that determine the position of the line. If you were to collect a lot more data, but all of that data fell farther from the separating line than the existing support vectors, the position of the line would not change. This makes support vector machines fairly robust to overfitting, because the only data that affects the ultimate position of the model are the examples closest to the margin boundaries.
Training and Prediction¶
The SVM training process involves finding the separating line (or plane) with the maximum margin. Of course, real datasets are rarely linearly separable, so we add another variable to the model that allows for some slack, i.e., for some training examples to fall on the wrong side of the line or plane. The goals of training are to find parameters \(\mathbf{w}\) and b that minimize the error between the predictions and the actual values while also maximizing the margin. This can be achieved either with a quadratic programming solver or with gradient descent, both of which are typically built into the SVM implementations in machine learning libraries. For a linear SVM, the predicted label \(\hat{y}\) is a piecewise function of a linear combination of the features with weights \(\mathbf{w}\) and bias b: if \(\mathbf{w}^T\mathbf{x} + b < 0\), we predict class 0; if \(\mathbf{w}^T\mathbf{x} + b \geq 0\), we predict class 1. The sign of this expression combines the algebra with the geometry: it tells us whether the example falls “above” or “below” the separating line.
Regularization is also possible with SVM models: the higher the value of the hyperparameter C, the more importance the model places on getting the classifications right (i.e., all examples on the correct sides of the margins). The lower you make C, the less importance the model places on a few incorrect classifications as it attempts to find the largest margin possible.
Finally, there are multiple ways to use SVMs to perform multiclass classification. A simple approach is one-versus-rest: if you have N classes, you train N binary SVM classifiers. The first classifier predicts whether an example is in class 1 or some other class, the second predicts whether an example is in class 2 or some other class, and so on. Each of these classifiers gives you a different decision line, so you run a prediction with all of them and choose the classifier that is most confident for that example. Other approaches include one-versus-one.
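The sketch below trains a linear SVM with scikit-learn; the C argument is the regularization hyperparameter discussed above, and multiclass problems are handled internally by training several binary classifiers. The two-cluster data is synthetic.

```python
# Linear SVM sketch; synthetic, roughly separable data for illustration.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc=-2, size=(50, 2)),
               rng.normal(loc=+2, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Larger C -> fewer margin violations tolerated; smaller C -> wider margin.
model = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
model.fit(X, y)
print(model.predict([[-1.5, -2.0], [2.5, 1.5]]))  # -> [0 1]
```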
Kernel Methods¶
Of course, many datasets are not linearly separable. If your dataset relies on nonlinear interactions between features, a linear SVM is going to perform poorly. One option, as we saw with polynomial regression, is to take the existing features and compute polynomial combinations of them. However, such an expansion can generate far too many features for higher-degree combinations, degrading both computational performance and model accuracy.
One solution to this problem is to apply the kernel trick, which relies on the fact that data that is not linearly separable in low dimensions may be linearly separable in higher dimensions. The SVM training algorithm can be reformulated (into the dual form) so that it depends only on similarity computations between pairs of examples, never on the exact feature values themselves. This allows you to use a kernel function in your model, which computes the similarity between examples in a higher-dimensional space without ever actually projecting the examples into that space.
In this dual form, the prediction for a new example \(\mathbf{x}\) takes the form \(\hat{y} = \text{sign}\left(\sum_{i} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b\right)\), where the sum runs over the support vectors \(\mathbf{x}_i\), the \(\alpha_i\) are coefficients learned during training, and K is the kernel function.
There are many well-studied kernel functions that existing machine learning libraries provide, each with pros and cons. Nonetheless, they all allow us to adapt linear SVMs to nonlinear data.
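A minimal kernel-SVM sketch: the same SVC interface, but with a radial basis function (RBF) kernel so the decision boundary can be nonlinear. The concentric-ring data from scikit-learn’s make_circles is synthetic and not linearly separable.

```python
# Kernel SVM sketch on data that is not linearly separable (two rings).
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)            # struggles on this data
rbf = SVC(kernel="rbf", gamma="scale").fit(X, y)   # kernel trick handles it

print("linear training accuracy:", linear.score(X, y))
print("rbf training accuracy:   ", rbf.score(X, y))
```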
Probabilistic Models¶
Naive Bayes¶
Let’s now explore a classifier called the naive Bayes classifier. The naive Bayes classifier belongs to a family of methods based on density estimation (sometimes called kernel density classification). Naive Bayes works well on small datasets, it is computationally efficient for both training and prediction, and it works well in a variety of settings.
It typically requires knowing or estimating the probability density of the features from which we are trying to perform the estimation. There is a notion of an optimal Bayes classifier: if you knew the underlying probability distribution of the data, you could formulate an optimal classifier as follows: given a set of observations X, pick the class Y that maximizes the posterior probability of Y given those observations. Doing so requires knowing the probability distributions of the features on which we are performing this estimation. That distribution typically is not known, but we can make some assumptions. What is commonly done is to select a family of parametric distributions, such as a Gaussian, Bernoulli, or multinomial distribution, estimate the parameters of that distribution from the data, and then perform the computation.
If we are trying to estimate the probability of Y being a particular class given a set of observations X, we can use Bayes’ rule to express that as the likelihood of X given Y times the prior probability of Y, divided by the probability of observing those X values. The naive Bayes classifier is called a probabilistic classifier because it does not just make a class prediction; it can also report the likelihood of an observation belonging to a particular class, given a set of observations. The naive Bayes classifier compares posterior probabilities for each possible class. Because the denominator, the probability of observing the feature values X, is the same for every class, we can drop it and compare only the numerators; the class value for Y with the largest numerator is the predicted class.
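Written out, with \(X = (x_1, ..., x_N)\) denoting the observed features and \(y\) a candidate class, Bayes’ rule gives \(P(y \mid X) = \frac{P(X \mid y)\,P(y)}{P(X)}\); since the denominator \(P(X)\) is the same for every class, it suffices to compare \(P(X \mid y)\,P(y)\) across classes.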
The naive Bayes classifier makes two important assumptions. The first is that the features are conditionally independent given the class; that is, there is no relationship between the likelihoods of pairs or groups of features once the class is known. This assumption is commonly violated in practice, but making it allows naive Bayes to avoid the curse of dimensionality, where the size of the required training set often grows exponentially with the number of features in the model. The second assumption is that each feature follows some statistical distribution; in other words, we have to assume a form for (and estimate) the density of each feature.
Three distributions are commonly used in naive Bayes classification: for numerical features it is common to assume a Gaussian distribution; for binary features, a Bernoulli distribution; and for discrete or count features, a multinomial distribution. Why do we have to make these distributional assumptions? The quantity we are trying to compute involves two terms. The first is the prior probability of Y, which is easy: we simply compute how often each class occurs in our dataset. The second is the probability of observing a set of X values given Y; unless we have an extremely large number of data points, we cannot compute this directly, so we must make some assumptions about this distribution.
This is where we make assumptions about the distribution of X, specifically the distribution of X conditioned on Y. We also invoke the independence assumption: the probabilities of the individual features given Y are independent, so we can compute the joint probability of the X values conditioned on Y by computing the probability of each individual feature given Y and multiplying them together. This independence assumption greatly simplifies the computation: to maximize the posterior probability, we choose the value of Y that maximizes the prior probability of that value of Y times the product of the probabilities of each observation \(x_i\) given that class value of Y.
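Putting these pieces together, the naive Bayes prediction can be written as \(\hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{N} P(x_i \mid y)\), where each one-dimensional likelihood \(P(x_i \mid y)\) is estimated from the training data under the assumed distribution.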
The Naive Bayes classifier has various advantages and disadvantages. It’s efficient and scalable: it’s a very simple algorithm based essentially on counting frequencies of occurrence. It works on very small data sets. It’s interpretable: each distribution is estimated as a one-dimensional distribution because the probabilities of each feature occurring are assumed to be independent. Unfortunately, because the Naive Bayes classifier assumes independence of features, it can’t learn relationships between those features, which may sometimes be something you want to do in practice.
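The sketch below uses scikit-learn’s GaussianNB, which assumes a Gaussian distribution for each feature conditioned on the class; the data is synthetic, with the two classes drawn from Gaussians with different means.

```python
# Gaussian naive Bayes sketch; synthetic numerical features for illustration.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(5)
# Class 0 and class 1 drawn from Gaussians with different means.
X = np.vstack([rng.normal(loc=0.0, size=(100, 3)),
               rng.normal(loc=3.0, size=(100, 3))])
y = np.array([0] * 100 + [1] * 100)

model = GaussianNB().fit(X, y)
print(model.predict([[0.2, -0.1, 0.3], [2.8, 3.1, 2.9]]))  # -> [0 1]
print(model.predict_proba([[1.5, 1.5, 1.5]]))              # posterior per class
```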
Expectation Maximization¶
Decision Trees¶
Another way of performing classification or prediction is through a sequence of decisions based on features. Each decision, or step, subdivides the data into regions; the goal is to end up with regions that contain only data points of a single class. Training a decision tree involves finding the sequence of decisions that subdivides the data into these regions. The model is called a tree because sequences of decisions are easy to interpret when depicted as trees. Once the tree is trained, classification is simple: start at the root node and follow the links corresponding to the example you wish to classify until you reach a leaf node. Decision trees can be used for both classification and regression.
Training¶
The goal of training a decision tree is to train a balanced tree that has the minimal training error, in other words, the minimal difference between the predicted classes in the training set and the true classes. Balancing the tree reduces the computational complexity of the prediction process, because it reduces the maximum number of questions, or splits, that are required to go from the root of the tree to a leaf. Unfortunately, the problem of finding the optimally balanced tree for an arbitrary dataset is NP-complete. In practice, decision trees rely on iterative algorithms that attempt to optimize for balance at each step, but do not guarantee that the final tree is as balanced as possible.
One early decision tree algorithm, the classification and regression tree (CART) algorithm, iteratively selects a feature and finds the boolean comparison or numeric threshold that splits the examples in the training set as evenly as possible by number and as uniformly as possible by class. In other words, the algorithm attempts to choose a feature and a question or threshold that divides the examples into a left child node and a right child node. CART is a greedy algorithm that starts at the root of the tree with all of the training examples and repeats the same process on each of the child nodes. This continues until each leaf node contains examples from a single class only.
The quality of a candidate split is typically measured with an impurity metric such as Gini impurity or entropy; in practice, the two criteria usually lead to very similar trees.
Parameter Tuning¶
Unfortunately, decision trees can be prone to overfitting. As with k-nearest neighbors, decision trees are nonparametric, which means that they can be trained to fit the training data perfectly; in the limit, training will produce a decision tree where every leaf node contains examples from only one class. One way to limit overfitting is to set a maximum-depth (or minimum-split-size) hyperparameter that caps how deep the tree can grow. For any remaining leaf nodes with training examples from more than one class, the mode of the classes in that node serves as the prediction label. Another approach to limiting overfitting is “pruning,” which trains a complete decision tree and then removes splits that cause relatively small decreases in the cost function.
Benefits and Drawbacks¶
Decision trees require very little data pre-processing. It does not matter whether your features are numeric, binary, or nominal; you can still have conditions in the nodes that work for those feature types. For example, one node could split on a numeric feature (e.g., packets_per_flow > 1), while another node could split on a nominal feature (e.g., “is there an ACK packet in the flow?”). You do not need a one-hot encoding or an ordinal encoding; you can feed the features directly into the tree training algorithm, and it will work regardless of their format.
You also do not need to do any standardization or normalization, since decision trees are not geometric models: there is no need to ensure that features have zero mean and unit variance. Decision trees are also easily interpretable by humans; it is possible to look at a decision tree and understand how it arrived at a particular prediction.
Decision trees make it easy to compute and compare the relative importance of features. We often want to know which features of our dataset are particularly important for a particular classification; for example, we might want to know whether the number of packets in a flow is crucially important or peripheral to our problem. In addition to providing a better understanding of the model, this can also provide a better understanding of the underlying phenomenon.
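As a minimal illustration, the sketch below trains a depth-limited tree on a small synthetic flow dataset and then prints the per-feature importance scores the fitted tree exposes. The feature names and labeling rule are hypothetical.

```python
# Decision tree sketch with a depth limit and feature importances.
# The flow features and labels below are synthetic placeholders.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(6)
n = 200
packets_per_flow = rng.integers(1, 100, size=n)
has_ack = rng.integers(0, 2, size=n)           # binary feature, no encoding needed
duration_sec = rng.uniform(0.1, 60.0, size=n)  # irrelevant to the label below

X = np.column_stack([packets_per_flow, has_ack, duration_sec])
y = ((packets_per_flow > 20) & (has_ack == 1)).astype(int)  # synthetic rule

# max_depth limits overfitting by capping the number of splits per path.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
for name, score in zip(["packets_per_flow", "has_ack", "duration_sec"],
                       tree.feature_importances_):
    print(f"{name}: {score:.2f}")
```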
Ensemble Methods¶
If you can train one classifier, why not train more and improve your accuracy by combining their predictions? The core idea behind ensemble learning is that if you have a complex phenomenon you are trying to understand, you can do a better job by training a collection of simpler models with different perspectives than by training a single complex model. This is analogous to the “wisdom of the crowd.”
Voting¶
A “voting classifier” uses several different classes of models (e.g., decision tree, SVM, kNN) and predicts the class that receives the majority of the votes cast by those models. If the phenomenon is complicated enough, no single algorithm may do best on all new examples; by letting several algorithms vote, the ensemble can still produce the right answer as long as a majority of them get it right.
You can also use the confidence of these models to weight the votes (soft voting classifier).
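The sketch below combines three different model families with scikit-learn’s VotingClassifier; setting voting="soft" averages the models’ predicted probabilities rather than counting hard votes. The dataset is synthetic.

```python
# Voting-classifier sketch combining three model families; synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=5)),
        ("svm", SVC(probability=True)),   # probability=True enables soft voting
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    voting="soft",  # average predicted probabilities; use "hard" for majority vote
)
ensemble.fit(X, y)
print("training accuracy:", ensemble.score(X, y))
```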
Bagging & Pasting¶
The next approach, bagging and pasting, trains different instances of the same algorithm on different subsets of the training set. Both help to reduce classification variance by creating N new training sets that are all slightly different: bagging samples from the training set with replacement, while pasting samples without replacement. You then train a different model on each set and use the majority-vote prediction of all of these models.
Random Forests¶
Random forests are a particularly important version of bagging, in which you train many small decision trees limited to a maximum depth (decision trees limited to a single split are called decision stumps). In addition to sampling the training examples, a random forest also considers only a random subset of the features at each split, which further decorrelates the trees. Random forests have the distinction of being a very practical, high-performance algorithm: they can compete with deep learning algorithms, especially on datasets that already have well-defined features. Deep learning really shines when you are given raw, unprocessed data such as images or natural language, but if you are given a dataset with clear existing features, in many cases a random forest will do as well as a deep learning algorithm on that data. Random forests also have many fewer hyperparameters than a neural network, and they are robust to overfitting.
Bagging and random forests are also very amenable to parallelization: once the training sets have been sampled, each model can be trained on a different core (or a different machine in your data center), so all of the models can be trained in parallel.
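A minimal random forest sketch with scikit-learn: n_estimators controls how many trees are trained, max_depth keeps each tree small, and n_jobs=-1 trains the trees in parallel across all available cores. The data is synthetic.

```python
# Random forest sketch; synthetic data, trees trained in parallel.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,   # number of trees in the ensemble
    max_depth=10,       # keep each tree small
    n_jobs=-1,          # train trees in parallel on all cores
    random_state=0,
).fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
```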
Boosting¶
Another ensemble method, boosting, has a different motivation than bagging or random forests. Bagging, pasting, and random forests seek to reduce prediction variance, whereas boosting attempts to reduce bias errors, which happen when you choose a model that is unable to represent the complexity of the data. In boosting, you train one model to make a prediction and then train another model to correct the prediction made by the first. You can repeat this as many times as you like, so that by chaining simple models together you end up with something quite complicated that is able to represent the data very well, even if the data itself is complex. The name comes from the fact that each successive classifier in the sequence tries to boost the performance of the previous ones.
In gradient boosting, you start with your training data and train an initial model, usually a decision tree. This tree gets some of the training set predictions right and some of them wrong, allowing you to compute a residual between the predicted value and the correct value for each example. You then train another decision tree to predict the residual of the first tree. If that prediction is accurate, you can take the prediction of the first tree, correct it by the error predicted by the second tree, and get the right prediction overall. You can also train a third tree to predict the error of the second tree, and so on.
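The sketch below makes the residual-fitting idea explicit: a second small regression tree is trained on the residuals of the first, and the two predictions are summed. The data is synthetic, and in practice you would use a library implementation (e.g., scikit-learn’s GradientBoostingRegressor) rather than chaining trees by hand.

```python
# Hand-rolled two-stage gradient boosting sketch (squared-error residuals).
# Synthetic data; real code would use a library implementation instead.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

tree1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
residuals = y - tree1.predict(X)             # what the first tree got wrong
tree2 = DecisionTreeRegressor(max_depth=2).fit(X, residuals)

# The boosted prediction is the first tree's prediction plus the correction.
y_hat = tree1.predict(X) + tree2.predict(X)
print("stage-1 MSE:", np.mean((y - tree1.predict(X)) ** 2))
print("stage-2 MSE:", np.mean((y - y_hat) ** 2))
```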
Because boosting is inherently sequential, it is not that amenable to parallelized training. Typically, you make each individual classifier very simple and fast to train, so the entire boosted classifier is also efficient.
Another type of boosting, AdaBoost, also uses sequential classifiers that try to improve each other’s performance. The weight given to each example in the training set is increased if the previous classifier got that example incorrect and decreased otherwise. This means that successive classifiers put more effort into correctly predicting examples that were missed by earlier classifiers. Each successive classifier is itself weighted by how well it performs on the entire training set.
There is a proof that AdaBoost combined with any weak learning algorithm, i.e., any classifier that does better than random guessing, will eventually produce a model that perfectly fits the training data. Empirically, this tends to improve test error as well.
Deep Learning¶
In this section, we will begin our exploration of deep learning by discussing a particular type of neural network architecture called feed forward neural networks, which are also sometimes referred to as multi-layer perceptrons. But before we dive into the technical details of feed forward neural networks, it is essential to understand the context of deep learning.
Deep learning is a subset of machine learning that is concerned with representation learning. Unlike traditional machine learning methods, where the features used as input to the model are manually defined by the designer of the model, representation learning relies on the algorithm to learn the best representation of the inputs. An example of a specific type of algorithm that does this is an autoencoder. The idea behind representation learning is that the model should learn the best representation for the input, rather than the designer of the algorithm having to figure out how to represent the inputs to the model.
Deep learning takes this concept of representation learning one step further by introducing many transformations, or layers, in the model; hence the name “deep” learning. The basic unit of deep learning is the neuron, which takes a multidimensional input and applies a weight to each input feature. The result of this weighted combination is then passed through an activation function, which maps it into a constrained range (for example, between 0 and 1); different shapes of activation functions can be used for this purpose.
A feedforward neural network is a particular neural network architecture built by composing layers of these neurons. Training a neural network is iterative, involving both forward and backward propagation through the network: the weights are adjusted to reduce a loss function, and a typical training process might include hundreds or even thousands of epochs.
Multi-Layer Perceptron¶
A multi-layer perceptron takes multidimensional input features and passes them through one or more hidden layers of neurons, each of which applies weights to its inputs and an activation function to the result; the outputs of the final hidden layer are then aggregated by an output neuron that produces the prediction. The training process attempts to find good values for each of the weights in the network. The three most common activation functions are the sigmoid function, the hyperbolic tangent function, and the rectified linear unit (ReLU) function.
Training the weights of a neural net is an iterative process that involves both forward and backward propagation through the network; each epoch involves both. In forward propagation, we pass the inputs through the network, evaluate the output against the true value of y, and compute a loss or error function. Backpropagation then adjusts each of the weights in the network to reduce the resulting error. There is generally no closed-form optimal solution for the weights; training is an iterative search, and a typical training process might include hundreds or even thousands of epochs.
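As a minimal illustration, the sketch below trains a small feedforward network (multi-layer perceptron) with scikit-learn; hidden_layer_sizes sets the number and width of hidden layers, activation selects the activation function, and max_iter bounds the number of training epochs. The data is synthetic.

```python
# Feedforward neural network (MLP) sketch; synthetic data for illustration.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 16),  # two hidden layers
                  activation="relu",            # rectified linear unit
                  max_iter=500,                 # upper bound on training epochs
                  random_state=0),
)
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```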