Gradient Descent with Momentum

Why apply momentum?

Gradient descent can be computationally expensive in some cases, especially in neural networks, because it may need many iterations to converge. For this reason, several variants have been developed. Momentum is one of the tricks that can make it converge faster.

Adding momentum may speed up convergence, but it does not guarantee it. The update simply carries velocity over from previous steps, which can also make it overshoot a minimum.

A related StackOverflow Question: Gradient Descent with Momentum Working Worse than Normal Gradient Descent

Gradient Descent without Momentum

First, let me illustrate plain gradient descent, with no momentum for now.

For this example, I picked a function with two local minima.

Gradient descent is run three times from different initial points. Depending on where it starts, it converges to either a local minimum or the global minimum:

See the code (R)
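
As a minimal sketch of the experiment above, the following Python snippet runs plain gradient descent from three starting points. The function f(x) = x^4 - 4x^2 + x, the learning rate `alpha`, and the starting points are assumptions chosen for illustration; they are not taken from the linked R code. This function has its global minimum near x = -1.47 and a local minimum near x = 1.35.

```python
def f(x):
    """Hypothetical 1-D test function with one global and one local minimum."""
    return x**4 - 4 * x**2 + x


def df(x):
    """Derivative of f."""
    return 4 * x**3 - 8 * x + 1


def gradient_descent(x0, alpha=0.01, n_iter=1000):
    """Plain gradient descent: step against the derivative on every iteration."""
    x = x0
    for _ in range(n_iter):
        x = x - alpha * df(x)
    return x


# Three runs from different (assumed) starting points.
for start in (-2.0, 0.5, 2.0):
    x_final = gradient_descent(start)
    print(f"start={start:+.1f} -> x={x_final:+.4f}, f(x)={f(x_final):+.4f}")
```

Each run simply slides downhill into whichever basin it starts in, so the result depends entirely on the initial point.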

Gradient Descent with Momentum

Now, let's apply momentum.

First, initialize both velocities to zero:

\(vdx = 0\)

\(vdy = 0\)

On iteration i:

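Given the partial derivatives \(dx\) and \(dy\) at the current point, calculate \(vdx\) and \(vdy\) as follows, where \(\beta\) is the momentum coefficient (between 0 and 1):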

\(vdx = \beta*vdx + dx\)

\(vdy = \beta*vdy + dy\)

Then update \(x\) and \(y\) as follows, where \(\alpha\) is the learning rate:

\(x = x - \alpha*vdx\)

\(y = y - \alpha*vdy\)
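
The update rules above translate almost line by line into code. Here is a minimal Python sketch; the function name `momentum_descent`, the default values of `alpha` and `beta`, and the quadratic test function are assumptions made for illustration, not part of the original R code.

```python
def momentum_descent(grad, x0, y0, alpha=0.01, beta=0.9, n_iter=500):
    """Gradient descent with momentum, following the update rules above.

    grad(x, y) must return the pair (dx, dy) of partial derivatives.
    """
    x, y = x0, y0
    vdx, vdy = 0.0, 0.0          # velocities start at zero
    for _ in range(n_iter):
        dx, dy = grad(x, y)
        vdx = beta * vdx + dx    # accumulate velocity from past gradients
        vdy = beta * vdy + dy
        x = x - alpha * vdx      # step along the velocity, not the raw gradient
        y = y - alpha * vdy
    return x, y


def bowl_grad(x, y):
    """Gradient of the hypothetical test function f(x, y) = x^2 + 10*y^2."""
    return 2 * x, 20 * y


x_min, y_min = momentum_descent(bowl_grad, x0=5.0, y0=5.0)
print(x_min, y_min)  # both should end up very close to 0, the bowl's minimum
```

Setting `beta=0` makes the velocity equal to the current gradient, which recovers plain gradient descent.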

Let's illustrate now. 

The video below shows three scenarios, in order:

  1. It is stuck in a local minimum.
  2. It reaches the global minimum bypassing the local minimum.
  3. It reaches the local minimum bypassing the global minimum.

See the code (R)
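
The three scenarios can be reproduced in a small experiment. The sketch below is only an illustration: the 1-D function f(x) = x^4 - 4x^2 + x (local minimum near x = 1.35, global minimum near x = -1.47), the starting points, and the values of `alpha` and `beta` are all assumptions, not the settings used in the video. With these particular values, the run from 2.0 settles in the local minimum, the run from 2.5 is carried over the hump into the global minimum, and the run from -2.5 overshoots the global minimum and ends in the local one. Small changes to `alpha`, `beta`, or the start can change the outcome.

```python
def df(x):
    """Derivative of the hypothetical function f(x) = x^4 - 4x^2 + x."""
    return 4 * x**3 - 8 * x + 1


def descend(x0, alpha=0.01, beta=0.9, n_iter=2000):
    """1-D gradient descent with momentum; beta=0 gives plain gradient descent."""
    x, v = x0, 0.0
    for _ in range(n_iter):
        v = beta * v + df(x)   # velocity update
        x = x - alpha * v      # position update
    return x


# Compare plain gradient descent and momentum from the same starting points.
for start in (2.0, 2.5, -2.5):
    plain = descend(start, beta=0.0)
    with_momentum = descend(start, beta=0.9)
    print(f"start={start:+.1f}  plain -> {plain:+.3f}  momentum -> {with_momentum:+.3f}")
```

Comparing the two columns shows that momentum changes which minimum is reached, for better or for worse.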

Illustration on a 3D surface

A fancier example: testing with Himmelblau’s function.

\[f(x,y) = (x^2+y-11)^2+(x+y^2-7)^2\]

Its partial derivatives:

\[\frac{\partial f}{\partial x} = 4x^3+4xy-42x+2y^2-14\]

\[\frac{\partial f}{\partial y} = 4y^3+2x^2-26y+4xy-22\]
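
Hand-derived partial derivatives are easy to get wrong, so it is worth checking them numerically. The sketch below compares the chain-rule gradient of Himmelblau's function with a central finite-difference approximation; the helper names `himmelblau_grad` and `numeric_grad`, the step `h`, and the test points are assumptions chosen for illustration.

```python
def himmelblau(x, y):
    """Himmelblau's function."""
    return (x**2 + y - 11) ** 2 + (x + y**2 - 7) ** 2


def himmelblau_grad(x, y):
    """Analytic partial derivatives, obtained with the chain rule."""
    dx = 4 * x**3 + 4 * x * y - 42 * x + 2 * y**2 - 14
    dy = 4 * y**3 + 2 * x**2 - 26 * y + 4 * x * y - 22
    return dx, dy


def numeric_grad(f, x, y, h=1e-5):
    """Central finite-difference approximation of the gradient."""
    dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dx, dy


# The analytic and numeric values should agree closely at every point.
for point in [(0.0, 0.0), (1.5, -2.0), (-3.0, 2.5)]:
    print(point, himmelblau_grad(*point), numeric_grad(himmelblau, *point))
```

The gradient should also vanish at the known minimum (3, 2), which is a quick extra check on the formulas.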

Finally, here is the momentum illustration on a 3D surface:

See the code (Python)
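
For readers who want to experiment without the plotting code, here is a minimal sketch of momentum on Himmelblau's function that only records the path. The function name `momentum_path`, the starting point (0, 0), and the values of `alpha` and `beta` are assumptions for illustration and may differ from the linked Python code.

```python
def himmelblau(x, y):
    """Himmelblau's function; all four of its minima have f = 0."""
    return (x**2 + y - 11) ** 2 + (x + y**2 - 7) ** 2


def himmelblau_grad(x, y):
    """Partial derivatives of Himmelblau's function."""
    dx = 4 * x**3 + 4 * x * y - 42 * x + 2 * y**2 - 14
    dy = 4 * y**3 + 2 * x**2 - 26 * y + 4 * x * y - 22
    return dx, dy


def momentum_path(x0, y0, alpha=0.001, beta=0.9, n_iter=3000):
    """Run gradient descent with momentum and return every visited point."""
    x, y = x0, y0
    vdx, vdy = 0.0, 0.0
    path = [(x, y)]
    for _ in range(n_iter):
        dx, dy = himmelblau_grad(x, y)
        vdx = beta * vdx + dx
        vdy = beta * vdy + dy
        x = x - alpha * vdx
        y = y - alpha * vdy
        path.append((x, y))
    return path


path = momentum_path(0.0, 0.0)
x_end, y_end = path[-1]
print(f"ended at ({x_end:.4f}, {y_end:.4f}), f = {himmelblau(x_end, y_end):.2e}")
```

The returned list of points can be drawn over a surface or contour plot of the function to reproduce an animation like the one above. Which of the four minima the path ends in depends on the starting point and on `alpha` and `beta`.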
