Gradient Descent with Momentum
Why apply momentum? Plain gradient descent can converge slowly in some cases, especially in neural networks. For this reason, several variants have been developed. Momentum is one trick that can make convergence faster.

Adding momentum may speed up convergence, but it does not guarantee it. The updates simply gain momentum.

A related StackOverflow Question: Gradient Descent with Momentum Working Worse than Normal Gradient Descent

Gradient Descent without Momentum

First, let me illustrate plain gradient descent, with no momentum for now.

For this example, I picked a function with two local minima.

Running gradient descent 3 times from different initial points, it converges to either a local minimum or the global minimum:
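The idea can be sketched as follows. This is a minimal Python illustration, not the post's linked R code; the function `f` below is a hypothetical stand-in with two local minima, one of them global.

```python
# Plain gradient descent (no momentum) on a 1D function
# with two local minima. f and the starting points are
# illustrative; the post's own example is in the linked R code.

def f(x):
    return x**4 - 4*x**2 + x   # two minima; the left one is global

def df(x):
    return 4*x**3 - 8*x + 1    # derivative of f

def gradient_descent(x0, alpha=0.01, iters=1000):
    x = x0
    for _ in range(iters):
        x -= alpha * df(x)     # step against the gradient
    return x

# Different initial points can end in different minima:
left  = gradient_descent(-2.0)  # settles in the left (global) minimum
right = gradient_descent(2.0)   # settles in the right (local) minimum
```

Starting on the left of the hump leads to the global minimum, starting on the right leads to the local one, which is exactly what the three runs in the video demonstrate.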

VIDEO

See the code (R)

Gradient Descent with Momentum

Now, let's apply momentum.

First set:

\(vdx = 0\)

\(vdy = 0\)

On iteration i:

Having \(dx\) and \(dy\), calculate the \(vdx\) and \(vdy\) as follows:

\(vdx = \beta*vdx + dx\)

\(vdy = \beta*vdy + dy\)

Then update x and y as follows:

\(x = x - \alpha*vdx\)

\(y = y - \alpha*vdy\)
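The update rules above can be sketched directly in code. This is a minimal Python version; the test surface `g` and the values of `alpha` and `beta` are illustrative assumptions, not taken from the post.

```python
# Momentum updates exactly as defined above, applied to a
# hypothetical two-variable surface g(x, y).

def g(x, y):
    return x**2 + 10*y**2          # illustrative test surface

def grad_g(x, y):
    return 2*x, 20*y               # (dg/dx, dg/dy)

def momentum_descent(x, y, alpha=0.01, beta=0.9, iters=500):
    vdx, vdy = 0.0, 0.0            # first set vdx = 0, vdy = 0
    for _ in range(iters):
        dx, dy = grad_g(x, y)
        vdx = beta * vdx + dx      # vdx = beta*vdx + dx
        vdy = beta * vdy + dy      # vdy = beta*vdy + dy
        x -= alpha * vdx           # x = x - alpha*vdx
        y -= alpha * vdy           # y = y - alpha*vdy
    return x, y
```

Note that `vdx` accumulates past gradients, so the step direction is a decaying average of where the gradient has been pointing; this is what lets the trajectory roll past shallow minima in the scenarios below.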

Let's illustrate now.

In the video below, there are 3 scenarios, respectively:

1. It gets stuck in a local minimum.
2. It reaches the global minimum, bypassing the local minimum.
3. It reaches the local minimum, bypassing the global minimum.

VIDEO

See the code (R)

Illustration on a 3D Surface

A fancier example: testing with Himmelblau's function.

\[f(x,y) = (x^2+y-11)^2+(x+y^2-7)^2\]

Its partial derivatives:

\[\frac{\partial f}{\partial x} = 4x^3+4xy-42x+2y^2-14\] \[\frac{\partial f}{\partial y} = 4y^3+2x^2-26y+4xy-22\]
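As a quick sanity check on these derivatives, here is a direct Python translation. Himmelblau's function has a known minimum at \((3, 2)\), where the function value is 0 and both partials vanish.

```python
# Himmelblau's function and its partial derivatives, as given above.

def f(x, y):
    return (x**2 + y - 11)**2 + (x + y**2 - 7)**2

def grad(x, y):
    dfdx = 4*x**3 + 4*x*y - 42*x + 2*y**2 - 14
    dfdy = 4*y**3 + 2*x**2 + 4*x*y - 26*y - 22
    return dfdx, dfdy

print(f(3, 2))      # 0: (3, 2) is a minimum of Himmelblau's function
print(grad(3, 2))   # (0, 0): the gradient vanishes there
```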

Finally, here is the momentum illustration on a 3D surface:

VIDEO
