Backpropagation
Neural Network Topology
- Architecture is fixed
- Activation Functions are fixed
- Train the network for weights and biases
Training a Neural Network
- The training task is to compute the optimal weights and biases by minimizing some cost function.
- The task of training neural networks is exactly the same as that of other ML models such as linear regression, SVMs, etc. The difference between the desired output (the ground truth) and the actual output (from the last layer) defines the cost (or the loss), and we have to tune the parameters w and b such that the total cost is minimized.
- An important point to note is that if the data is large (which is often the case), the loss calculation itself can get computationally expensive. For example, if you have a million data points, they will be fed into the network (in batches), the output will be calculated using feedforward, and the loss $L_i$ (for each data point $i$) will be calculated. The total loss is the sum of the losses of all the individual data points. Hence, Total Loss = $L_1 + L_2 + \ldots + L_N = \sum_{i=1}^{N} L_i$
- The total loss is a function of the $w$'s and $b$'s. Once the total loss is computed, the weights and biases are updated (in the direction of decreasing loss). In other words, $\sum_{i=1}^{N} L_i$ is minimized with respect to the $w$'s and $b$'s.
- This can be done using any optimization routine such as gradient descent.
- We minimize the average loss (the total loss divided by $N$) rather than the total loss itself. Since $N$ is a constant, minimizing the average also minimizes the total loss.
Mathematically
- $h(x_i)$ - output of the last layer for a given set of weights $w$, bias $b$ and input $x_i$
- Expected output - $y_i$
- Cost/Loss: $L_i = L(h(x_i), y_i)$
- Total cost across training data: $\frac{1}{N}\sum_{i=1}^{N} L_i$ (number of data points = $N$)
- Training Problem -> find $w, b$ to minimize the total cost
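As an illustrative sketch (hypothetical helper names, and squared error standing in for a generic loss $L$), the average cost over a dataset could be computed as:

```python
import numpy as np

def feedforward(w, b, x):
    # Stand-in single-layer network: sigmoid(w.x + b).
    # A real network would pass x through all layers.
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

def average_loss(w, b, X, y):
    # L_i = squared error for one point; total loss = sum of L_i;
    # we minimize the average (1/N) * sum of L_i.
    losses = [(feedforward(w, b, x_i) - y_i) ** 2 for x_i, y_i in zip(X, y)]
    return sum(losses) / len(losses)

# Toy data: N = 4 points, 2 features each
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])
print(average_loss(np.array([0.5, -0.5]), 0.1, X, y))
```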
Complexity of the Loss Function
- Training refers to the task of finding the optimal combination of weights and biases to minimise the total loss (with a fixed set of hyperparameters).
The optimisation is done using the familiar gradient descent algorithm. Recall that in gradient descent, the parameter being optimised is iterated in the direction of reducing cost according to the following rule:
- $w_{new} = w_{old} - \alpha \cdot \frac{\partial L}{\partial w}$
- The same can be written for the biases. Note that the weights and biases are often collectively represented by one matrix, which we will call $P$; $P$ will represent the collective matrix of weights and biases.
- The main challenge is that $P$ is a huge matrix, and thus, the total loss $L$ as a function of $P$ is a complex function.
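A minimal sketch of this update rule in numpy (the shape of $P$ and the gradient values are placeholders; in practice $\frac{\partial L}{\partial P}$ comes from backpropagation):

```python
import numpy as np

alpha = 0.01                         # learning rate
P = np.random.randn(100, 100)        # collective parameter matrix (illustrative shape)
grad_P = np.random.randn(100, 100)   # placeholder for dL/dP from backpropagation

# Gradient descent rule: step in the direction of decreasing loss
P = P - alpha * grad_P
```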
Example
- 5 hidden layers (100 neurons each)
- 100 neurons in the input layer
- 100 neurons in the output layer
- 6 weight matrices, one between each pair of consecutive layers
- 100 x 100 = 10,000 interconnections between each pair of consecutive layers
- 6 weight matrices with the above interconnections = 60,000 weights
- 100 biases for each of the 6 layers (5 hidden + output) = 6 x 100 = 600 biases
- Total parameters = 60,600
In the above simple example, we have to obtain values for 60,600 parameters; hence, the training problem is complex even for a small neural network.
To solve this complex optimization problem, we use the gradient descent algorithm.
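The parameter count can be verified with a short computation (assuming, as the example implies, 100 neurons in every layer):

```python
# Layer sizes: input, 5 hidden layers, output (all 100 neurons)
sizes = [100] * 7   # 7 layers -> 6 weight matrices

weights = sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))  # 6 * (100 * 100) = 60000
biases = sum(sizes[1:])                                      # 6 * 100 = 600
print(weights + biases)  # 60600
```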
Gradient Descent
- Gradient tells us in which direction the next iterative step should be taken
- Gradient tells us how large the step will be
- Both the ideas are captured in the gradient term $\frac{\partial L}{\partial w}$: its sign gives the direction, its magnitude the step size.
- The gradient can be thought of as the slope of a hill whose height represents the total cost, and each location on the hill represents a unique combination of weights [w1, w2].
- Then we want to minimize the height (i.e. the cost), i.e. move down the slope of the hill to a point [w1, w2] where the cost/height is minimum.
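To make the hill analogy concrete, here is a toy sketch: gradient descent on the bowl-shaped cost $L(w_1, w_2) = w_1^2 + w_2^2$, whose gradient we can write in closed form:

```python
import numpy as np

def cost(w):
    return w[0] ** 2 + w[1] ** 2            # height of the "hill"

def gradient(w):
    return np.array([2 * w[0], 2 * w[1]])   # slope at location [w1, w2]

w = np.array([3.0, -4.0])   # start somewhere on the hill
alpha = 0.1                 # learning rate
for step in range(50):
    w = w - alpha * gradient(w)   # move downhill (direction of decreasing cost)
print(w, cost(w))  # w approaches [0, 0], where the cost is minimum
```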
Questions
Statement | True / False |
---|---|
Gradient is an n-dimensional vector | True |
Gradient gives the direction of increasing loss | True |
Say a layer in your neural network has a large number of biases - n biases, denoted by the variable b. In some iteration of the algorithm, the gradient is computed to be as follows (illustrative values): $\frac{\partial L}{\partial b} = [0.2, -0.4, \ldots, 0]$
Imagine an n -dimensional space whose each dimension (axis) corresponds to one parameter of the network. Choose all the correct statements:
Statement | True / False |
---|---|
In the next iteration, the algorithm should move in the direction of decreasing b1 | True |
In the next iteration, the algorithm should move in the direction of increasing b1 | False |
In the next iteration, the algorithm should move in the direction of increasing b2 | True |
In the next iteration, the algorithm should move in the direction of decreasing b2 | False |
In the next iteration, the algorithm will take a larger step along the dimension b2 than b1 | True |
In the next iteration, the algorithm will take a larger step along the dimension b1 than b2 | False |
Changing bn slightly does not affect the value of the current loss significantly | True |
- The loss with respect to $b_i$ increases if $\frac{\partial L}{\partial b_i}$ is positive (and vice-versa).
- If it is zero, it means the loss does not depend (locally) on that variable.
- Also, the magnitude of the gradient defines how large a step the algorithm takes along that variable.
Gradient Descent Algorithm
- Training a neural network basically implies finding the correct values for weights and biases that minimise the loss function.
- The model starts with a random guess of the initial weights, predicts the output using feedforward, and changes the weights in the direction of reducing loss.
- This is the gradient descent algorithm.
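Putting those steps together for a single sigmoid neuron (a deliberately tiny stand-in for a full network, with squared-error loss and a hand-derived gradient):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0.0], [1.0]])
y = np.array([0.0, 1.0])
w, b = rng.normal(), rng.normal()   # start with a random guess
alpha = 0.5

for _ in range(1000):
    z = w * X[:, 0] + b
    p = 1 / (1 + np.exp(-z))            # feedforward prediction
    # dL/dz for L = mean((p - y)^2), via the chain rule
    dz = 2 * (p - y) * p * (1 - p)
    w -= alpha * np.mean(dz * X[:, 0])  # move in the direction of reducing loss
    b -= alpha * np.mean(dz)

print(w, b)  # the predictions move toward the targets
```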
Backpropagation
Algorithm
- The data points are first fed forward
- the loss of each data point is computed
- then aggregated to compute the average loss
- then the gradient of loss is computed, and
- finally weights and biases are updated once
Loss function of a neural network
The loss function is defined in terms of the network output $h(x_i)$ and the ground truth $y_i$. Since $h(x_i)$ depends on the weights and biases, the loss $L_i$, in turn, is a function of $(w, b)$. The average loss across all data points is denoted by $L = \frac{1}{N}\sum_{i=1}^{N} L_i$, which we want to minimize.
Cross Entropy Loss
- Cross-entropy loss for one data point: $L = -\sum_{j} y_j \log(p_j)$, where $y$ is the one-hot ground truth vector and $p$ is the vector of predicted probabilities
In case the prediction is completely correct ($p_j = 1$ for the true class), then
- $-1 \cdot \log(1) = 0$
- $-0 \cdot \log(0) = 0$ (by convention)
- so the loss is 0
In case the prediction is completely incorrect ($p_j = 0$ for the true class), then
- $-1 \cdot \log(0) = \infty$
Example
Suppose the ground truth is $y = 1$ and the model predicts $p = 0.8$ for that class. What is the value of the C.E. loss?
- $-\log(0.8) \approx 0.2$
Suppose instead the model predicts a low probability for the true class, say $p = 0.2$. What is the value of the C.E. loss?
- $-\log(0.2) \approx 1.6$, a much larger loss
You can notice that the cross-entropy loss is designed such that when the predicted probability is close to the ground truth, the loss value is close to zero, and vice-versa.
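A small sketch of this computation (the `eps` clip is a common practical guard against $\log(0)$, not something the notes prescribe):

```python
import numpy as np

def cross_entropy(y, p, eps=1e-12):
    # y: one-hot ground truth, p: predicted probabilities
    p = np.clip(p, eps, 1.0)   # avoid log(0) = -inf
    return -np.sum(y * np.log(p))

y = np.array([0.0, 1.0, 0.0])
print(cross_entropy(y, np.array([0.1, 0.8, 0.1])))  # ~0.22, prediction close to truth
print(cross_entropy(y, np.array([0.8, 0.1, 0.1])))  # ~2.30, prediction far from truth
```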
Notations
To simplify the process, we will denote the cumulative input coming into layer $l$ as $z^l$.
Therefore,
- $z^l = W^l \cdot h^{l-1} + b^l$
- softmax output: $p_i = \frac{e^{z_i}}{\sum_{k} e^{z_k}}$
Gradient Descent on Output Layer with 3 neurons
- Softmax output of neuron $i$: $p_i = \frac{e^{z_i}}{\sum_{k=1}^{3} e^{z_k}}$
Loss function will be given as:
- $L = -\sum_{i=1}^{3} y_i \log(p_i)$
- $\frac{\partial L}{\partial p_i} = -\frac{y_i}{p_i}$
- $\frac{\partial p_i}{\partial z_j} = p_i(\delta_{ij} - p_j)$, where $\delta_{ij} = 1$ if $i = j$ and $0$ otherwise
- By the chain rule: $\frac{\partial L}{\partial z_j} = \sum_{i} \frac{\partial L}{\partial p_i} \cdot \frac{\partial p_i}{\partial z_j}$
- Now we substitute and simplify to get:
- $\frac{\partial L}{\partial z_j} = p_j - y_j$, i.e. in vector form $\frac{\partial L}{\partial z} = p - y$
We can use the above equations along with the chain rule to calculate $\frac{\partial L}{\partial W}$ and $\frac{\partial L}{\partial b}$.
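As a quick sanity check (not part of the original derivation), the result $\frac{\partial L}{\partial z} = p - y$ can be verified numerically:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

def ce_loss(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([1.0, 2.0, 0.5])
y = np.array([0.0, 1.0, 0.0])

analytic = softmax(z) - y     # p - y

# central-difference numerical gradient
eps = 1e-6
numeric = np.array([
    (ce_loss(z + eps * np.eye(3)[j], y) - ce_loss(z - eps * np.eye(3)[j], y)) / (2 * eps)
    for j in range(3)
])
print(analytic, numeric)      # the two should match closely
```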
Strategy
The general strategy to compute the gradients in a backward direction is shown in the figure below.
You can see that the gradients are calculated in a backward direction, starting from the output layer. Hence, we'll calculate the gradients in the following sequence ($dX$ is short for $\frac{\partial L}{\partial X}$): $dz^3 \to dW^3, db^3 \to dh^2 \to dz^2 \to dW^2, db^2 \to dh^1 \to dz^1 \to dW^1, db^1$.
This is why the process is known as backpropagation.
Why do we call it backpropagation
We propagate the gradients in a backward direction starting from the output layer
Gradient Descent - Detailed Solution for 3 layers having 2 neurons each
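The full symbolic derivation is not reproduced here, but a minimal numpy sketch of what backpropagation computes for such a network (3 layers of 2 neurons each, sigmoid hidden activations, softmax output, cross-entropy loss; all variable names are illustrative) could look like this:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
# Weight matrices and biases for layers 1..3 (each layer has 2 neurons)
W = [rng.normal(size=(2, 2)) for _ in range(3)]
b = [np.zeros(2) for _ in range(3)]

x = np.array([0.5, -1.0])
y = np.array([1.0, 0.0])

# Feedforward: z^l = W^l h^(l-1) + b^l
h = [x]
for l in range(3):
    z = W[l] @ h[-1] + b[l]
    h.append(softmax(z) if l == 2 else sigmoid(z))
p = h[-1]

# Backward pass: dz^3 = p - y, then propagate through layers 2 and 1
dz = p - y
grads = []
for l in range(2, -1, -1):
    dW = np.outer(dz, h[l])          # dL/dW^l
    db = dz                          # dL/db^l
    grads.append((dW, db))
    if l > 0:
        dh = W[l].T @ dz             # dL/dh^(l-1)
        dz = dh * h[l] * (1 - h[l])  # through the sigmoid: dz^(l-1)
print(grads[0][0])  # dL/dW^3
```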
Stochastic Gradient Descent - SGD
What is SGD and why is it used
- For updating weights and biases using plain (batch) gradient descent, you have to scan through the entire data set to make a single update to the weights.
- This is computationally very expensive for large datasets.
- Thus, you use multiple batches (or mini-batches) of data points, compute the average gradient for a batch, and update the weights based on that gradient.
- But there is a danger in doing this - you are making weight updates based only on gradients computed for small batches, not the entire training set.
- Thus, you make multiple passes through the entire training set using epochs.
- An epoch is one pass through the entire training set, and you use multiple epochs (typically 10, 20, 50, 100 etc.) while training.
- In each epoch, you reshuffle all the data points, divide the reshuffled set into m batches, and update weights based on gradient of each batch.
This training technique is called stochastic gradient descent, commonly abbreviated as SGD.
Algorithm
for epoch in 1 to num_epochs:
    randomize the data
    num_batches = n / m   (n = number of data points, m = batch size)
    for batch in 1 to num_batches:
        compute the gradient of the loss for each input in the batch
        average the weight gradients over the batch
        average the bias gradients over the batch
        w = w - alpha * (average weight gradient)
        b = b - alpha * (average bias gradient)
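A runnable Python sketch of this loop, using linear regression with a hand-computed gradient as a stand-in for the network (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, num_epochs, alpha = 1000, 50, 10, 0.1   # m = batch size
X = rng.normal(size=(n, 2))
y = X @ np.array([2.0, -1.0])                 # targets from a known linear rule

w = np.zeros(2)                                # initial guess
for epoch in range(num_epochs):
    idx = rng.permutation(n)                   # randomize data each epoch
    for batch in idx.reshape(n // m, m):       # num_batches = n/m
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / m    # average gradient over the batch
        w = w - alpha * grad                   # one update per batch
print(w)  # approaches [2, -1]
```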
SGD Training Procedure
- You specify the number of epochs (typical values are 10, 20, 50, 100 etc.) - more epochs require more computational power
- You specify the batch size m (typical values are 32, 64, 128, etc.)
- At the start of each epoch, the data set is reshuffled and divided into n/m batches of m points each.
- The average gradient of each batch is then used to make a weight update.
- The training is complete at the end of all the epochs
Questions
Consider a dataset of 1,00,000 data points. You decide to perform mini-batch (stochastic) gradient descent with a batch size of 50. How many batches will there be?
- 2000
- number of batches = n/m = 1,00,000 / 50 = 2000
Consider a dataset of 1,00,000 data points. You decide to perform mini-batch (stochastic) gradient descent with a batch size of 50 for 3 epochs. How many weight updates will you make by the end of the 3 epochs?
- 6000
- number of updates = number of epochs x number of mini-batch updates per epoch = 3 x 2000 = 6000
Advantages of SGD
- faster updates (you don't need to scan the entire dataset before making a single update)
- the noise in mini-batch gradients helps the algorithm escape shallow local minima, improving the chance of reaching (close to) the global minimum
Exploration and Exploitation
- To avoid the problem of getting stuck at a local optimum, you need to strike a balance between exploration and exploitation.
- Exploration means that you try to minimise the loss function with different starting points of W and b, i.e., you initialise W and b with different values.
- On the other hand, exploitation means that you try to reach the minimum starting from a particular W and b and do not explore the terrain at all.
- That might lead you to the lowest point locally, but not necessarily to the global minimum.