Gradient Descent with Momentum

Why applying Momentum?

Gradient descent might be computationally heavier in some cases, especially in neural networks. Momentum is one of the tricks which make it run faster. 

Gradient Descent with Momentum may converge faster. It may also find the global minimum by passing through a local minimum.

I also need to note that there might be other methods (a.k.a. Optimizers) to make gradient descent run faster. They may already include the combination of several tricks including momentum. If you don't have a particular reason to pick an optimizer, you can directly use one of those advanced optimizers (like Adam Optimizer).

Gradient Descent without Momentum

First, let me illustrate the gradient descent. There is no momentum for now.

For this example, I picked a function with two local minimum points.

Running gradient descent 3 times from different initial points:

See the code (R)

Gradient Descent with Momentum

Now, let's apply momentum.

First set:

vdx = 0
vdy = 0

On iteration i:

Having  \(dx\) and \(dy\), calculate the \(vdx\) and \(vdy\) as follows:

\(vdx = \beta*vdx + dx\)

\(vdy = \beta*vdy + dy\)

Then update x an y as follows:

\(x = x - \alpha*vdx\)

\(y = y - \alpha*vdy\)

Let's illustrate now. 

In below video, there are 3 scenarios, respectively:

  1. It is stuck in a local minimum.
  2. It reaches the global minimum bypassing the local minimum.
  3. It reaches the local minimum bypassing the global minimum.

Code the code (R)

Illustration on a 3D surface

A fancier example. Testing with Himmelblau’s function

\[f(x,y) = (x^2+y-11)^2+(x+y^2-7)^2\]

Its partial derivatives:

\[\frac{\partial f}{\partial x} = 4x^3-4xy-42x+4xy-14\] \[\frac{\partial f}{\partial y} = 4y^3+2x^2-26y+4xy-22\]

Finally, here is the momentum illustration on a 3D surface:

See the code (Python)

  • Gradient Descent