This article provides a follow-up of my series about basic nonlinear optimization techniques with sample implementations. Again, we don’t aim to cover all deep aspects, but focus on giving an overview and introduction.
This account introduces another very popular, if not the most popular, algorithm for finding minima of real-valued functions in several variables — the Gradient Descent.
Our main task can be formulated as:
The main idea behind the algorithm is to start at an arbitrary point and to move a small step into a direction which deems promising for the function to fall very quickly. Then we repeat this until we ‘hopefully’ reach a local minimum.
In order to find this direction, one could just test several directions, but this would be cumbersome. Due to the fact that f is differentiable, such a direction can be obtained from f directly.
Direction of steepest descent:
Let us give some notations and formalize this ‘search of a direction’.
This way we have formulated the problem of finding a suitable direction into an optimization problem. Notice, this nice and common trick of replacing a function in several variables by one which has only one variable!
Next we apply a fundamental theorem from linear algebra which finally yields an optimal solution:
Now we have all together in order to formulate the algorithm in its most basic form:
As you see, in each step we move from the current point in direction of the negative gradient. This can be easily implemented in Go:
The function expects as input a starting point x_0, the step-size tau, the gradient and a level of accuracy for the result. The code is straightforward if you know how to use gonum which is the standard matrix library for Go.
You can run and play around with it by using this go-playground: https://play.golang.org/p/RnwXRJLfKlb
Here I have applied the algorithm onto the Himmelblau function:
… which looks like
If you increase the step-size, you will quickly see cases where the algorithm is not converging. A sufficient condition which ensures convergence will be given next.
This becomes a final rather technical section. If you are interested in such details, it provides a nice example application of traditional theorems from analysis. Though, if you played in above code with the step-size, you convince yourself how such considerations are of immense importance. First we show that under suitable prerequisites the sequence f(x_n) is strictly decreasing.
In other words, if we choose the step size sufficiently small, the sequence of values under f is strictly decreasing. Moreover, if the function is bounded from below, such a sequence must convergence. From the last equation it is easy to see that even the gradient converges against 0: Just refactor to isolate the gradient on the l.h.s and note that |f(x_n)-f(x_n+1)| approaches 0 in limit due to convergence.
But note, although these all are necessary conditions for a local minimum, one can find examples where the limit is a saddle point (see f(x)=x³). In other words, the result of the algorithm must be verified to really yield a local minimum.
There has been developed many variations of gradient-descent which attempts to speed up convergence. Especially, attention has been laid on choosing the step size more optimal. I hope this little account will help you understanding easier these more sophisticated versions whenever needed or at least having given some light to an everyday used black-box.