Gradient Descent with Momentum
Why apply momentum? Plain gradient descent can converge slowly in some cases, especially in neural networks. For this reason, several variants have been developed. Momentum is one trick that can make convergence faster.

Adding momentum may speed up convergence, but it does not guarantee it. The updates simply gain momentum.

A related StackOverflow Question: Gradient Descent with Momentum Working Worse than Normal Gradient Descent

Gradient Descent without Momentum

First, let me illustrate plain gradient descent, with no momentum for now.

For this example, I picked a function with two local minima.

Running gradient descent 3 times from different initial points, it converges to either a local minimum or the global minimum:
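The idea can be sketched as follows. This is a minimal Python illustration, not the post's linked R code; the function `f` below is a hypothetical stand-in with two local minima, one of them global.

```python
# Plain gradient descent (no momentum) on a 1D function
# with two local minima. f and the starting points are
# illustrative; the post's own example is in the linked R code.

def f(x):
    return x**4 - 4*x**2 + x   # two minima; the left one is global

def df(x):
    return 4*x**3 - 8*x + 1    # derivative of f

def gradient_descent(x0, alpha=0.01, iters=1000):
    x = x0
    for _ in range(iters):
        x -= alpha * df(x)     # step against the gradient
    return x

# Different initial points can end in different minima:
left  = gradient_descent(-2.0)  # settles in the left (global) minimum
right = gradient_descent(2.0)   # settles in the right (local) minimum
```

Starting on the left of the hump leads to the global minimum, starting on the right leads to the local one, which is exactly what the three runs in the video demonstrate.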

VIDEO

See the code (R)

Gradient Descent with Momentum

Now, let's apply momentum.

First set:

\(vdx = 0\)

\(vdy = 0\)

On iteration i:

Having \(dx\) and \(dy\), calculate the \(vdx\) and \(vdy\) as follows:

\(vdx = \beta*vdx + dx\)

\(vdy = \beta*vdy + dy\)

Then update x and y as follows:

\(x = x - \alpha*vdx\)

\(y = y - \alpha*vdy\)
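The update rules above can be sketched directly in code. This is a minimal Python version; the test surface `g` and the values of `alpha` and `beta` are illustrative assumptions, not taken from the post.

```python
# Momentum updates exactly as defined above, applied to a
# hypothetical two-variable surface g(x, y).

def g(x, y):
    return x**2 + 10*y**2          # illustrative test surface

def grad_g(x, y):
    return 2*x, 20*y               # (dg/dx, dg/dy)

def momentum_descent(x, y, alpha=0.01, beta=0.9, iters=500):
    vdx, vdy = 0.0, 0.0            # first set vdx = 0, vdy = 0
    for _ in range(iters):
        dx, dy = grad_g(x, y)
        vdx = beta * vdx + dx      # vdx = beta*vdx + dx
        vdy = beta * vdy + dy      # vdy = beta*vdy + dy
        x -= alpha * vdx           # x = x - alpha*vdx
        y -= alpha * vdy           # y = y - alpha*vdy
    return x, y
```

Note that `vdx` accumulates past gradients, so the step direction is a decaying average of where the gradient has been pointing; this is what lets the trajectory roll past shallow minima in the scenarios below.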

Let's illustrate now.

In the video below, there are 3 scenarios, respectively:

1. It gets stuck in a local minimum.
2. It reaches the global minimum, bypassing the local minimum.
3. It reaches the local minimum, bypassing the global minimum.

VIDEO

See the code (R)

Illustration on a 3D Surface

A fancier example: testing with Himmelblau's function.

\[f(x,y) = (x^2+y-11)^2+(x+y^2-7)^2\]

Its partial derivatives:

\[\frac{\partial f}{\partial x} = 4x^3+4xy-42x+2y^2-14\] \[\frac{\partial f}{\partial y} = 4y^3+2x^2-26y+4xy-22\]
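As a quick sanity check on these derivatives, here is a direct Python translation. Himmelblau's function has a known minimum at \((3, 2)\), where the function value is 0 and both partials vanish.

```python
# Himmelblau's function and its partial derivatives, as given above.

def f(x, y):
    return (x**2 + y - 11)**2 + (x + y**2 - 7)**2

def grad(x, y):
    dfdx = 4*x**3 + 4*x*y - 42*x + 2*y**2 - 14
    dfdy = 4*y**3 + 2*x**2 + 4*x*y - 26*y - 22
    return dfdx, dfdy

print(f(3, 2))      # 0: (3, 2) is a minimum of Himmelblau's function
print(grad(3, 2))   # (0, 0): the gradient vanishes there
```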

Finally, here is the momentum illustration on a 3D surface:

VIDEO
