The Living Thing / Notebooks :

Gradient descent

First order of business

Gradient descent, a classic first order optimisation], with many variants, and many things one might wish to understand.

There are only few things I wish to understand for the moment

Coordinate descent

Descent each coordinate individually.

Small clever hack for certain domains: log gradient descent.


Zeyuan Allen-Zhu : Faster Than SGD 1: Variance Reduction:

SGD is well-known for large-scale optimization. In my mind, there are two (and only two) fundamental improvements since the original introduction of SGD: (1) variance reduction, and (2) acceleration. In this post I’d love to conduct a survey regarding (1),



How and when does it work? and how well? Moritz Hardt, The zen of gradient descent explains it through Chebychev polynomials . Sebastian Bubeck explains it from a different angle, Revisiting Nesterov’s Acceleration to expand upon the rather magical introduction given in his lecture Wibisono et al explain it in terms of

Sundry Hacks

Yellowfin an automatic SGD momentum tuner

Mini-batch and stochastic methods for minimising loss when you have a lot of data, or a lot of parameters, and using it all at once is silly, or when you want to iteratively improve your solution as data comes in, and you have access to a gradient for your loss, ideally automatically calculated. It’s not clear at all that it should work, except by collating all your data and optimising offline, except that much of modern machine learning shows that it does.

Sometimes this apparently stupid trick it might even be fast for small-dimensional cases, so you may as well try.

Technically, “online” optimisation in bandit/RL problems might imply that you have to “minimise regret online”, which has a slightly different meaning and, e.g. involves seeing each training only as it arrives along some notional arrow of time, yet wishing to make the “best” decision at the next time, and possibly choosing your next experiment in order to trade-off exploration versus exploitation etc.

In SGD you can see your data as often as you want and in whatever order, but you only look at a bit at a time. Usually the data is given and predictions make no difference to what information is available to you.

Some of the same technology pops up in each of these notions of online optimisation, but I am really thinking about SGD here.

There are many more permutations and variations used in practice.

Conditional Gradient

a.k.a. Frank-Wolfe algorithm: Don’t know much about this.