Gradient Descent

Gradient Descent is one of the core optimization algorithms used to train models. It is how machines “learn”.

A machine learning model such as a neural network has weights and biases, and these parameters determine what output it produces for a given input.

Initially, we give the model random values for its weights and biases and then start the learning process. The model takes the first input from the labeled dataset, computes an output using these randomly initialized weights and biases, and then checks how close or far that output is from the actual result. This comparison is done with the cost function (also called the loss function), a formula whose computed value is called the cost. The model then adjusts its weights and biases according to the cost so that next time it gives a better answer.
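To make this concrete, here is a minimal sketch (not from the original text) of the forward pass and cost computation for a tiny one-weight, one-bias model with a squared-error cost; the names `w`, `b`, `x`, and `target` are purely illustrative.

```python
import random

random.seed(0)
w = random.uniform(-1.0, 1.0)   # randomly initialized weight
b = random.uniform(-1.0, 1.0)   # randomly initialized bias

x, target = 2.0, 10.0           # one labeled example: input and true answer

output = w * x + b              # model's output for this input
cost = (output - target) ** 2   # cost: how far the output is from the answer
print(f"output={output:.3f}, cost={cost:.3f}")
```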

The question is: how does a model know what changes to make to its weights and biases so that it improves its answer? The answer lies in calculus.

We need a way to know how much the model's error changes if we change the value of a weight: should we increase or decrease the weight to get a lower error?

The partial derivative of the error function with respect to a weight (the gradient) gives this answer, and it is computed using a process known as backpropagation.
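For the tiny one-weight model sketched above, the chain rule gives this partial derivative directly; backpropagation is essentially the same chain rule applied systematically through many layers. A self-contained sketch with illustrative values:

```python
# Model: output = w * x + b, error function E = (output - target) ** 2.
# Chain rule: dE/dw = 2 * (output - target) * x, dE/db = 2 * (output - target).

w, b = 0.5, 0.0
x, target = 2.0, 10.0

output = w * x + b
error = output - target

grad_w = 2 * error * x   # partial derivative of E with respect to w
grad_b = 2 * error       # partial derivative of E with respect to b

print(f"dE/dw = {grad_w}, dE/db = {grad_b}")   # -36.0 and -18.0
# The gradient is negative here (the output is too small), so the update
# rule below will increase the weight.
```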

Once we have the gradient, we update the weight so that it decreases when the gradient is positive and increases when the gradient is negative.

The formula for computing the updated weight is

$w \leftarrow w - \eta \cdot \frac{\partial E}{\partial w}$.

$\eta$ (eta) is the learning rate, or step size; 0.001 and 0.1 are among the common values for it.
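Continuing the illustrative numbers from the sketch above: with $w = 0.5$, $\frac{\partial E}{\partial w} = -36$ and $\eta = 0.001$, the update gives $w \leftarrow 0.5 - 0.001 \times (-36) = 0.536$. The gradient is negative, so the weight increases, which moves the output closer to the target.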

If the learning rate is large, we take larger steps towards the minimum; if it is small, we take smaller steps.
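As a rough illustration (not from the original text), here is gradient descent on the one-variable error function $E(w) = (w - 3)^2$, whose minimum is at $w = 3$, run with a small and a larger learning rate:

```python
def descend(learning_rate, steps=5, w=0.0):
    """Run a few gradient descent steps on E(w) = (w - 3) ** 2."""
    path = [w]
    for _ in range(steps):
        grad = 2 * (w - 3)              # dE/dw at the current w
        w = w - learning_rate * grad    # the update rule from above
        path.append(w)
    return path

print("eta=0.01:", [round(v, 3) for v in descend(0.01)])  # tiny steps, slow progress
print("eta=0.4: ", [round(v, 3) for v in descend(0.4)])   # larger steps toward w = 3
```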
