# Basic aspects of Gradient Descent and an implementation in Go

This article provides a follow-up of my series about basic nonlinear optimization techniques with sample implementations. Again, we don’t aim to cover all deep aspects, but focus on giving an overview and introduction.

This account introduces another very popular, if not the most popular, algorithm for finding minima of real-valued functions in several variables — the Gradient Descent.

Our main task can be formulated as:

The main idea behind the algorithm is to start at an arbitrary point and to move a small step into a direction which deems promising for the function to fall very quickly. Then we repeat this until we ‘hopefully’ reach a local minimum.

In order to find this direction, one could just test several directions, but this would be cumbersome. Due to the fact that f is differentiable, such a direction can be obtained from f directly.

## Direction of steepest descent:

This way we have formulated the problem of finding a suitable direction into an optimization problem. Notice, this nice and common trick of replacing a function in several variables by one which has only one variable!

Next we apply a fundamental theorem from linear algebra which finally yields an optimal solution:

## The algorithm:

As you see, in each step we move from the current point in direction of the negative gradient. This can be easily implemented in Go:

The function expects as input a starting point x_0, the step-size tau, the gradient and a level of accuracy for the result. The code is straightforward if you know how to use gonum which is the standard matrix library for Go.

You can run and play around with it by using this go-playground: https://play.golang.org/p/RnwXRJLfKlb

Here I have applied the algorithm onto the Himmelblau function:

… which looks like

If you increase the step-size, you will quickly see cases where the algorithm is not converging. A sufficient condition which ensures convergence will be given next.

## Convergence:

In other words, if we choose the step size sufficiently small, the sequence of values under f is strictly decreasing. Moreover, if the function is bounded from below, such a sequence must convergence. From the last equation it is easy to see that even the gradient converges against 0: Just refactor to isolate the gradient on the l.h.s and note that |f(x_n)-f(x_n+1)| approaches 0 in limit due to convergence.

But note, although these all are necessary conditions for a local minimum, one can find examples where the limit is a saddle point (see f(x)=x³). In other words, the result of the algorithm must be verified to really yield a local minimum.