Gradient descent can be computationally expensive in some cases, especially in neural networks. Momentum is one of the tricks that can make it converge faster.

Gradient Descent with Momentum may converge faster. It may also find the global minimum by rolling past a local minimum instead of getting stuck in it.
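
To make the idea concrete, here is a minimal sketch of the classical momentum update. This is the standard textbook formulation, not necessarily the exact variant used later in this post, and the learning rate `lr` and momentum coefficient `beta` are illustrative values:

```python
def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    """One step of gradient descent with (classical) momentum.

    The velocity accumulates an exponentially decaying sum of past
    gradients, so consecutive steps in the same direction speed up.
    """
    velocity = beta * velocity - lr * grad
    return w + velocity, velocity
```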

I should also note that there are other methods (a.k.a. optimizers) that make gradient descent run faster. Many of them already combine several tricks, including momentum. If you don't have a particular reason to pick a specific optimizer, you can simply use one of these advanced optimizers (such as the Adam optimizer).

Gradient Descent without Momentum

First, let me illustrate plain gradient descent, with no momentum for now.

For this example, I picked a function with two local minima.

Running gradient descent 3 times from different initial points:
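
Below is a minimal sketch of how such an experiment could be run. The function `f` is an illustrative choice with two local minima (not necessarily the exact function behind the original plots), and the learning rate and starting points are arbitrary:

```python
def f(x):
    # Illustrative toy function with two local minima:
    # roughly x ≈ -1.6 (the global minimum) and x ≈ 1.5.
    return x**4 - 5 * x**2 + x

def grad_f(x):
    # Analytical derivative of f.
    return 4 * x**3 - 10 * x + 1

def gradient_descent(x0, lr=0.01, steps=200):
    # Plain gradient descent: repeatedly step against the gradient.
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

# Run from three different initial points; where each run ends
# depends on which basin of attraction the starting point falls into.
for x0 in (-2.0, 0.5, 2.0):
    x_final = gradient_descent(x0)
    print(f"start {x0:+.1f} -> end {x_final:+.3f}, f = {f(x_final):.3f}")
```

With a setup like this, starting points on different sides of the hump between the two minima end up in different minima, which is the behavior the runs below illustrate.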