Gradient Descent is one of the most important algorithms in machine learning. There are numerous variations of it, and it remains an active research area. This post is part of the notes I wrote down while learning about these variations. It is not complete; I will add more details as I explore more interesting parts of Gradient Descent.
What’s Gradient Descent?
In simple terms, Gradient Descent is an algorithm that is used to find the minimum of a function.
You can imagine the function to be like the bowl here. We need to find the minimum of the function, which is the bottom of the bowl.
Imagine you roll a small spherical ball from the rim. It follows the steepest path downward and settles at the bottom. Gradient descent works the same way: it follows the negative gradient (slope) of the function and moves toward a minimum.
You can see the algorithm moving along the path to reach the minimum in this picture.
Let’s try to find the minimum of a function using Gradient Descent. (This example is taken from the Wikipedia page for Gradient descent.)
The gradient descent algorithm is applied to find a local minimum of the function f(x) = x^4 − 3x^3 + 2, with derivative f'(x) = 4x^3 − 9x^2.
```python
# From calculation, it is expected that the local minimum occurs at x = 9/4
cur_x = 6  # The algorithm starts at x = 6
gamma = 0.01  # step size multiplier
precision = 0.00001
previous_step_size = cur_x

def df(x):
    return 4 * x**3 - 9 * x**2

while previous_step_size > precision:
    prev_x = cur_x
    cur_x += -gamma * df(prev_x)
    previous_step_size = abs(cur_x - prev_x)

print("The local minimum occurs at %f" % cur_x)
```
Here the step size parameter (gamma in the code, often written as alpha) is what neural network training calls the learning rate.
J is called the cost function. It measures the error between the network's predictions and the targets, and it is the value we try to minimize.
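Putting these two terms together, each parameter is updated against the gradient of the cost, scaled by the learning rate. A minimal sketch of this update rule, using a made-up quadratic cost J (not from the post) with a single parameter theta:

```python
def J(theta):
    # Illustrative cost function: minimized at theta = 3
    return (theta - 3.0) ** 2

def dJ(theta):
    # Its derivative with respect to theta
    return 2 * (theta - 3.0)

alpha = 0.1  # learning rate
theta = 0.0
for _ in range(100):
    # The core update rule: theta := theta - alpha * dJ/dtheta
    theta = theta - alpha * dJ(theta)
```

After enough iterations theta settles very close to 3, the minimizer of J.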
Stochastic gradient descent (SGD)
Vanilla gradient descent runs over the whole dataset to compute each parameter update. For large datasets this approach can become computationally impractical on a single machine. Stochastic gradient descent is a slight modification of the vanilla gradient descent algorithm that addresses this.
Vanilla gradient descent estimates the expectation of the gradient using all items in the dataset; SGD uses a single randomly chosen sample for each update.
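The contrast can be sketched on a toy linear-regression cost; the dataset, learning rate, and function names below are illustrative assumptions, not from the post:

```python
import random

# Toy dataset: y is roughly 2x (made-up values for illustration)
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]

def grad(w, x, y):
    # Gradient of the squared error (w*x - y)^2 with respect to w
    return 2 * (w * x - y) * x

def batch_step(w, lr=0.01):
    # Vanilla (batch) GD: average the gradient over ALL samples
    g = sum(grad(w, x, y) for x, y in data) / len(data)
    return w - lr * g

def sgd_step(w, lr=0.01):
    # SGD: update using ONE randomly chosen sample
    x, y = random.choice(data)
    return w - lr * grad(w, x, y)

w = 0.0
for _ in range(1000):
    w = sgd_step(w)
print(w)  # should approach roughly 2
```

Each SGD step is much cheaper than a batch step on a large dataset, at the cost of a noisier trajectory toward the minimum.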
Mini-batch Gradient Descent
Single-sample updates in SGD carry a lot of noise, so a small subset of the samples, called a mini-batch, is used instead. A common mini-batch size is 256. Averaging the gradient over a mini-batch smooths out much of the noise.
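A minimal sketch of a mini-batch update on the same kind of toy regression problem; the data, batch size, and learning rate are illustrative assumptions:

```python
import random

# Illustrative data: y is roughly 3x with small alternating noise
data = [(float(i), 3.0 * i + 0.1 * ((-1) ** i)) for i in range(1, 21)]

def minibatch_step(w, batch_size=4, lr=0.001):
    # Sample a small subset of the data and average its gradient;
    # the averaging cancels out much of the single-sample noise
    batch = random.sample(data, batch_size)
    g = sum(2 * (w * x - y) * x for x, y in batch) / batch_size
    return w - lr * g

w = 0.0
for _ in range(2000):
    w = minibatch_step(w)
```

This sits between the two extremes: cheaper per step than full-batch gradient descent, less noisy per step than single-sample SGD.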