
Optimisation

Crawling through alien landscapes in the fog, looking for mountain peaks.

I’m mostly interested in continuous optimisation, but, you know, combinatorial optimisation is a whole thing.

A vast topic, with many sub-topics.

As Moritz Hardt observes (and this is just in the convex context):

It’s easy to spend a semester of convex optimization on various guises of gradient descent alone. Simply pick one of the following variants and work through the specifics of the analysis: conjugate, accelerated, projected, conditional, mirrored, stochastic, coordinate, online. This is to name a few. You may also choose various pairs of attributes such as “accelerated coordinate” descent. Many triples are also valid such as “online stochastic mirror” descent. An expert unlike me would know exactly which triples are admissible. You get extra credit when you use “subgradient” instead of “gradient”. This is really only the beginning of optimization and it might already seem confusing.
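For the flavour of it, here is a minimal sketch, in Python with numpy, of two of those variants: plain gradient descent and its projected cousin. The toy objective, step size, and iteration budget are all arbitrary choices of mine, for illustration only.

```python
import numpy as np

def gradient_descent(grad, x0, step=0.1, n_iter=100):
    """Plain gradient descent: x_{t+1} = x_t - step * grad(x_t)."""
    x = x0
    for _ in range(n_iter):
        x = x - step * grad(x)
    return x

def projected_gradient_descent(grad, project, x0, step=0.1, n_iter=100):
    """Projected variant: after each gradient step, project back onto the feasible set."""
    x = x0
    for _ in range(n_iter):
        x = project(x - step * grad(x))
    return x

# Toy example: minimise ||x - b||^2, unconstrained and over the unit ball.
b = np.array([2.0, 0.5])
grad = lambda x: 2 * (x - b)
project = lambda x: x / max(1.0, np.linalg.norm(x))  # Euclidean projection onto the unit ball

print(gradient_descent(grad, np.zeros(2)))                     # -> close to b
print(projected_gradient_descent(grad, project, np.zeros(2)))  # -> close to b / ||b||
```

Each of the other adjectives in Hardt's list (accelerated, mirrored, stochastic, …) amounts to swapping out one piece of this loop while keeping the rest.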

When I was younger and even more foolish I decided the divide was between online optimization and offline optimization, which in hindsight is neither clear nor useful. Now there are more tightly topical pages, such as gradient descent, Hessian-free optimisation, and surrogate optimisation, and I shall create more such pages as circumstances demand.

Brief taxonomy here.

TODO: Diagram.

See Zeyuan Allen-Zhu and Elad Hazan on their teaching strategy, which also carves the field into 16 different configurations:

The following dilemma is encountered by many of my friends when teaching basic optimization: which variant/proof of gradient descent should one start with? Of course, one needs to decide on which depth of convex analysis one should dive into, and decide on issues such as “should I define strong-convexity?”, “discuss smoothness?”, “Nesterov acceleration?”, etc.

[…] If one wishes to go into more depth, usually in convex optimization courses, one covers the full spectrum of different smoothness / strong-convexity / acceleration / stochasticity regimes, each with a separate analysis (a total of 16 possible configurations!)

This year I’ve tried something different in COS511 @ Princeton, which turns out also to have research significance. We’ve covered basic GD for well-conditioned functions, i.e. smooth and strongly-convex functions, and then extended these results by reduction to all other cases! A (simplified) outline of this teaching strategy is given in chapter 2 of Introduction to Online Convex Optimization.

Classical Strong-Convexity and Smoothness Reductions:

Given any optimization algorithm A for the well-conditioned case (i.e., strongly convex and smooth case), we can derive an algorithm for smooth but not strongly convex functions as follows.

Given a non-strongly convex but smooth objective \(f\), define an objective by \(f_1(x) = f(x) + \epsilon \|x\|^2\).

It is straightforward to see that \(f_1\) differs from \(f\) by at most \(\epsilon\) times a distance factor, and in addition it is \(\epsilon\)-strongly convex. Thus, one can apply A to minimize \(f_1\) and get a solution which is not too far from the optimal solution for \(f\) itself. This simplistic reduction yields an almost optimal rate, up to logarithmic factors.
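Here is a minimal sketch of that reduction, under my own assumptions: `solve_well_conditioned` stands in for the algorithm A (plain gradient descent with a \(1/L\) step size, which is enough for illustration), and the wrapper bolts the \(\epsilon\|x\|^2\) regulariser onto a merely-smooth objective. All names and constants are hypothetical, not Hazan's.

```python
import numpy as np

def solve_well_conditioned(grad, x0, step, n_iter=500):
    """Stand-in for algorithm A: any solver for smooth, strongly convex
    objectives. Here, plain gradient descent."""
    x = x0
    for _ in range(n_iter):
        x = x - step * grad(x)
    return x

def solve_smooth_via_reduction(grad_f, x0, eps=1e-3, smoothness=1.0, n_iter=500):
    """Reduction: minimise f_1(x) = f(x) + eps * ||x||^2 instead of f.
    f_1 is (2*eps)-strongly convex and (smoothness + 2*eps)-smooth,
    and differs from f by at most eps * ||x||^2."""
    grad_f1 = lambda x: grad_f(x) + 2 * eps * x
    step = 1.0 / (smoothness + 2 * eps)   # standard 1/L step size
    return solve_well_conditioned(grad_f1, x0, step, n_iter)

# Toy example: f(x) = ||A x - b||^2 with A rank-deficient, so f is smooth
# but not strongly convex; the reduction still returns a near-minimiser.
A = np.array([[1.0, 0.0], [0.0, 0.0]])   # rank 1
b = np.array([1.0, 0.0])
grad_f = lambda x: 2 * A.T @ (A @ x - b)
L = 2 * np.linalg.norm(A, 2) ** 2        # smoothness constant of f
print(solve_smooth_via_reduction(grad_f, np.zeros(2), eps=1e-3, smoothness=L))
```

One pedantic note: adding \(\epsilon\|x\|^2\) raises the Hessian by \(2\epsilon I\), so strictly speaking \(f_1\) is \(2\epsilon\)-strongly convex; the constant does not change the flavour of the argument.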