Crawling through alien landscapes in the fog, looking for mountain peaks.
I'm mostly interested in continuous optimisation, but, you know, combinatorial optimisation is a whole thing.
A vast topic, with many subtopics.
As Moritz Hardt observes (and this is just in the convex context):
It’s easy to spend a semester of convex optimization on various guises of gradient descent alone. Simply pick one of the following variants and work through the specifics of the analysis: conjugate, accelerated, projected, conditional, mirrored, stochastic, coordinate, online. This is to name a few. You may also choose various pairs of attributes such as “accelerated coordinate” descent. Many triples are also valid such as “online stochastic mirror” descent. An expert unlike me would know exactly which triples are admissible. You get extra credit when you use “subgradient” instead of “gradient”. This is really only the beginning of optimization and it might already seem confusing.
When I was younger and even more foolish I decided the divide was between online optimization and offline optimization, which in hindsight is neither clear nor useful. Now there are more tightly topical pages, such as gradient descent, Hessian-free optimisation, and surrogate optimisation, and I shall create more as circumstances demand.
Brief taxonomy here.
TODO: Diagram.
See Zeyuan Allen-Zhu and Elad Hazan on their teaching strategy, which also yields a split into 16 different areas:
The following dilemma is encountered by many of my friends when teaching basic optimization: which variant/proof of gradient descent should one start with? Of course, one needs to decide on which depth of convex analysis one should dive into, and decide on issues such as “should I define strong-convexity?”, “discuss smoothness?”, “Nesterov acceleration?”, etc.
[…] If one wishes to go into more depth, usually in convex optimization courses, one covers the full spectrum of different smoothness / strong-convexity / acceleration / stochasticity regimes, each with a separate analysis (a total of 16 possible configurations!)
This year I've tried something different in COS511 @ Princeton, which turns out also to have research significance. We've covered basic GD for well-conditioned functions, i.e. smooth and strongly-convex functions, and then extended these results by reduction to all other cases! A (simplified) outline of this teaching strategy is given in chapter 2 of Introduction to Online Convex Optimization.
Classical Strong-Convexity and Smoothness Reductions:
Given any optimization algorithm A for the well-conditioned case (i.e., the strongly convex and smooth case), we can derive an algorithm for smooth but not strongly convex functions as follows.
Given a non-strongly convex but smooth objective $f$, define a new objective $\tilde{f}$ by $\tilde{f}(x) = f(x) + \frac{\epsilon}{2}\|x\|^2$.
It is straightforward to see that $\tilde{f}$ differs from $f$ by at most $\epsilon$ times a distance factor, and in addition it is $\epsilon$-strongly convex. Thus, one can apply A to minimize $\tilde{f}$ and get a solution which is not too far from the optimal solution for $f$ itself. This simplistic reduction yields an almost optimal rate, up to logarithmic factors.
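A minimal sketch of this reduction in code. The toy objective and the choice of plain gradient descent as the well-conditioned base algorithm are my own, for illustration only:

```python
import numpy as np

def make_strongly_convex(f, grad_f, eps):
    """Wrap a merely-smooth objective f so it becomes eps-strongly convex."""
    f_tilde = lambda x: f(x) + 0.5 * eps * (x @ x)
    grad_tilde = lambda x: grad_f(x) + eps * x
    return f_tilde, grad_tilde

def gradient_descent(grad, x0, lr, steps):
    """The 'algorithm A' for the well-conditioned case: plain gradient descent."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# A smooth objective that is flat (not strongly convex) in its second coordinate.
f = lambda x: x[0] ** 2
grad_f = lambda x: np.array([2.0 * x[0], 0.0])

_, grad_tilde = make_strongly_convex(f, grad_f, eps=1e-3)
x_hat = gradient_descent(grad_tilde, np.array([5.0, 5.0]), lr=0.4, steps=500)
```

The regularised problem is well-conditioned, so the base algorithm applies, and the returned point nearly minimises the original $f$ while the flat direction barely moves.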
Keywords: Complementary slackness theorem, high- or very-high-dimensional methods, approximate methods, Lagrange multipliers, primal and dual problems, fixed-point methods, gradient, subgradient, proximal gradient, optimal control problems, convexity, sparsity, ways to avoid wrecking the finding of the extrema of perfectly simple little 10,000-parameter functions before everyone observes that you are a fool in the guise of a mathematician, except everyone is not there because you wandered off the optimal path hours ago, and now you are alone and lost in a valley of lowercase Greek letters.
See also geometry of fitness landscapes, expectation maximisation, matrix factorisations, discrete optimisation, nature-inspired “metaheuristic” optimisation.
General
Brief intro material

Zeyuan Allen-Zhu: Recent Advances in Stochastic Convex and Non-Convex Optimization. Clear, missing some details, but has good pointers.

Basic but enlightening: John Nash's graphical explanation of R's optimization.

Martin Jaggi's Optimization in two hours

Celebrated union of optimisation, computational complexity and command-and-control economics, by that showoff Cosma Shalizi: In Soviet Union, Optimization Problem Solves You.

Elad Hazan, The two cultures of optimization:
The standard curriculum in high school math includes elementary functional analysis, and methods for finding the minima, maxima and saddle points of a single-dimensional function. When moving to high dimensions, this becomes beyond the reach of your typical high-school student: mathematical optimization theory spans a multitude of involved techniques in virtually all areas of computer science and mathematics.
Iterative methods, in particular, are the most successful algorithms for largescale optimization and the most widely used in machine learning. Of these, most popular are firstorder gradientbased methods due to their very low periteration complexity.
However, way before these became prominent, physicists needed to solve large scale optimization problems, since the time of the Manhattan project at Los Alamos. The problems that they faced looked very different, essentially simulation of physical experiments, as were the solutions they developed. The Metropolis algorithm is the basis for randomized optimization methods and Markov Chain Monte Carlo algorithms.[…]
In our recent paper (AbHa15), we show that for convex optimization, the heat path and the central path for IPM with a particular barrier function (called the entropic barrier, following the terminology of the recent excellent work of Bubeck and Eldan) are identical! Thus, in some precise sense, the two cultures of optimization have been studying the same object in disguise, using different techniques.
Textbooks
Whole free textbooks online. Mostly convex.

K. Madsen, H.B. Nielsen, O. Tingleff, Methods for Nonlinear Least Squares Problems is super simple for least-squares-type optimisations

Aharon Ben-Tal and Arkadi Nemirovski's lectures on modern convex optimization

Arkadi Nemirovski, Interior point polynomial time methods in convex programming

Boyd and Vandenberghe's influential Convex Optimization

Bubeck, S. (2014). Convex Optimization: Algorithms and Complexity. arXiv:1405.4980 [cs, math, stat]. Based on Bubeck's course notes.

Elad Hazan's Introduction to Online Convex Optimization.
Constrained optimisation
Related: constraint solvers.
Lagrange multipliers
Constrained optimisation using Lagrange's one weird trick, and the Karush–Kuhn–Tucker conditions. The search for saddle points and roots.
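A tiny worked instance of that saddle-point search (my own toy example): minimising $\frac{1}{2}\|x\|^2$ subject to a single linear constraint, where stationarity of the Lagrangian reduces to one linear KKT system in the primal variables and the multiplier.

```python
import numpy as np

# Minimise 0.5*||x||^2 subject to a.x = b. The Lagrangian
#   L(x, nu) = 0.5*x.x + nu*(a.x - b)
# is stationary where x + nu*a = 0 and a.x = b: a saddle point / root of the
# KKT conditions, here just a linear system in (x, nu).
a = np.array([1.0, 1.0])
b = 1.0
K = np.block([[np.eye(2), a[:, None]],
              [a[None, :], np.zeros((1, 1))]])
sol = np.linalg.solve(K, np.array([0.0, 0.0, b]))
x_star, nu_star = sol[:2], sol[2]
```

Here the solution is the closest point on the constraint line to the origin, and the multiplier prices the constraint.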
Alternating Direction Method of Multipliers
Dunno. It's everywhere, though. Maybe its ubiquity is a problem of definition? (Boyd10)
In this review, we argue that the alternating direction method of multipliers is well suited to distributed convex optimization, and in particular to large-scale problems arising in statistics, machine learning, and related areas. The method was developed in the 1970s, with roots in the 1950s, and is equivalent or closely related to many other algorithms, such as dual decomposition, the method of multipliers, Douglas–Rachford splitting, Spingarn's method of partial inverses, Dykstra's alternating projections, Bregman iterative algorithms for ℓ1 problems, proximal methods, and others. After briefly surveying the theory and history of the algorithm, we discuss applications to a wide variety of statistical and machine learning problems of recent interest, including the lasso, sparse logistic regression, basis pursuit, covariance selection, support vector machines, and many others. We also discuss general distributed optimization, extensions to the nonconvex setting, and efficient implementation, including some details on distributed MPI and Hadoop MapReduce implementations.
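To make the quoted description concrete, a bare-bones ADMM sketch for the lasso (my own minimal implementation, not the reference code from the paper): split the objective as $f(x) = \frac{1}{2}\|Ax-b\|^2$ and $g(z) = \lambda\|z\|_1$ with the consensus constraint $x = z$, then alternate an x-minimisation, a z-minimisation (soft thresholding), and a dual update.

```python
import numpy as np

def admm_lasso(A, b, lam, rho=1.0, iters=1000):
    """Minimise 0.5*||Ax - b||^2 + lam*||x||_1 by ADMM with splitting x = z."""
    n = A.shape[1]
    x = np.zeros(n)
    z = np.zeros(n)
    u = np.zeros(n)                     # scaled dual variable
    Atb = A.T @ b
    # the x-update is a ridge-like linear solve; factor the matrix once
    L = np.linalg.cholesky(A.T @ A + rho * np.eye(n))
    for _ in range(iters):
        x = np.linalg.solve(L.T, np.linalg.solve(L, Atb + rho * (z - u)))
        # z-update: soft thresholding, the proximal operator of the l1 norm
        z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0.0)
        u = u + x - z                   # dual ascent on the consensus constraint
    return z

rng = np.random.seed(0)
A = np.random.randn(50, 10)
x_true = np.zeros(10)
x_true[:3] = [2.0, -3.0, 1.5]
x_hat = admm_lasso(A, A @ x_true, lam=0.5)
```

On this noiseless toy problem the recovered coefficients are close to the truth (with the usual small lasso shrinkage bias) and the off-support entries are driven to zero by the threshold step.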
Reductions
Oh crap, where did I rip this quote from?
The diverse world of machine learning applications has given rise to a plethora of algorithms and optimization methods, finely tuned to the specific regression or classification task at hand. We reduce the complexity of algorithm design for machine learning by reductions: we develop reductions that take a method developed for one setting and apply it to the entire spectrum of smoothness and strong-convexity in applications.
It will be from a Hazan and Allen-Zhu paper.
Continuous approximations of iterations
Recent papers (WiWJ16, WiWi15) argue that discrete gradient steps can be viewed as approximations to a continuous-time ODE whose trajectory approaches the optimum (which in itself is trivial), but moreover that many algorithms fit into the same families of ODEs, that these ODEs explain Nesterov acceleration, and that they generate new, improved optimisation methods (which is not trivial).
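The cheapest possible illustration of the correspondence (my toy example, not one from the cited papers): on $f(x) = \frac{1}{2}x^2$ the gradient flow is $\dot{x} = -x$, and gradient descent with step size $h$ is exactly its forward-Euler discretisation.

```python
import math

grad = lambda x: x            # gradient of f(x) = 0.5*x**2
h = 0.01                      # GD step size, doubling as the Euler time step
x_gd = 1.0
for _ in range(100):          # 100 GD steps = Euler integration to time t = 1
    x_gd -= h * grad(x_gd)

x_flow = math.exp(-1.0)       # exact gradient-flow solution x(t) = exp(-t) at t = 1
```

As $h \to 0$ the discrete iterates track the flow; with $h = 0.01$ they already agree to a few parts per thousand.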
Continuous relaxations of parameters
Solving discrete problems with differentiable, continuous, surrogate parameters.
Convex
…Of composite functions
Hip for sparse regression, compressed sensing, etc. “FISTA” is one option. (Bubeck explains.)
Second order (Quasi-Newton)
If you have the second derivative you can be fancy when finding zeros.
Trust region
Pretend the objective is locally quadratic, then see how bad that pretense was.
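A sketch of that idea (my own minimal implementation, using the simple Cauchy-point step rather than a fancier subproblem solver): build a quadratic model, step within the trusted radius, then grow or shrink the radius according to how well the model's predicted decrease matched the actual one.

```python
import numpy as np

def trust_region(f, grad, hess, x0, delta=1.0, max_iter=200):
    """Trust-region descent using the Cauchy point (model minimiser along -g)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g, B = grad(x), hess(x)
        gnorm = np.linalg.norm(g)
        if gnorm < 1e-10:
            break
        gBg = g @ B @ g
        # Cauchy point: minimise the quadratic model along -g within radius delta
        tau = 1.0 if gBg <= 0 else min(gnorm ** 3 / (delta * gBg), 1.0)
        p = -tau * (delta / gnorm) * g
        predicted = -(g @ p + 0.5 * p @ B @ p)   # decrease the model promised
        actual = f(x) - f(x + p)                 # how bad was the pretense?
        rho = actual / predicted
        if rho < 0.25:
            delta *= 0.25                        # poor model: shrink the region
        elif rho > 0.75 and np.isclose(np.linalg.norm(p), delta):
            delta *= 2.0                         # good model at the boundary: grow
        if rho > 0.1:
            x = x + p                            # accept the step
    return x

A = np.diag([1.0, 10.0])
b = np.array([1.0, 1.0])
x_min = trust_region(lambda x: 0.5 * x @ A @ x - b @ x,
                     lambda x: A @ x - b,
                     lambda x: A,
                     np.zeros(2))
```

On a quadratic the model is exact, so every step is accepted and the method reduces to exact-line-search steepest descent.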
Least Squares
This particular objective function has some particular shortcuts; e.g. you don't necessarily need a gradient to do it.
Sébastien Bubeck has a good writeup: Part 1, Part 2.
Conjugate gradient method
Finding quadratic minima, as in Least Squares.
And also not-quite-linear uses of this.
A first-order method.
 Jonathan Richard Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain is fun
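The algorithm itself is short. A sketch for the quadratic / linear-system case (my transcription of the standard textbook method):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    """Solve Ax = b for symmetric positive-definite A,
    i.e. minimise the quadratic 0.5*x^T A x - b^T x."""
    x = np.zeros_like(b)
    r = b - A @ x                  # residual = negative gradient
    p = r.copy()                   # first search direction
    rs = r @ r
    for _ in range(len(b)):        # exact in at most n steps, in exact arithmetic
        Ap = A @ p
        alpha = rs / (p @ Ap)      # exact line search along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p  # keep directions A-conjugate
        rs = rs_new
    return x

rng = np.random.default_rng(1)
M = rng.standard_normal((20, 20))
A = M.T @ M + 10 * np.eye(20)      # symmetric positive definite
b = rng.standard_normal(20)
x_cg = conjugate_gradient(A, b)
```

Unlike plain steepest descent, each new direction is made conjugate to all previous ones, which is what kills the zigzagging.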
…on a manifold
What if your constraints are naturally represented as some kind of smooth manifold? Is that worth thinking about? Apparently sometimes it is. See, e.g. ToKW16, BMAS14, or the free textbook on this theme.
See also Nicolas Boumal's introductory blog post.
Optimization on manifolds is about solving problems of the form
$$\min_{x \in \mathcal{M}} f(x)$$
where $\mathcal{M}$ is a nice, known manifold. By “nice”, I mean a smooth, finite-dimensional Riemannian manifold.
Practical examples include the following (and all possible products of these):
 Euclidean spaces
 The sphere (set of vectors or matrices with unit Euclidean norm)
 The Stiefel manifold (set of orthonormal matrices)
 The Grassmann manifold (set of linear subspaces of a given dimension; this is a quotient space)
 The rotation group (set of orthogonal matrices with determinant +1)
 The manifold of fixed-rank matrices
 The same, further restricted to positive semidefinite matrices
 The cone of (strictly) positive definite matrices
 …
Conceptually, the key point is to think of optimization on manifolds as unconstrained optimization: we do not think of $\mathcal{M}$ as being embedded in a Euclidean space. Rather, we think of $\mathcal{M}$ as being “the only thing that exists,” and we strive for intrinsic methods. Besides making for elegant theory, it also makes it clear how to handle abstract spaces numerically (such as the Grassmann manifold for example); and it gives algorithms the “right” invariances (computations do not depend on an arbitrarily chosen representation of the manifold).
There are at least two reasons why this class of problems is getting much attention lately. First, it is because optimization problems over the aforementioned sets (mostly matrix sets) come up pervasively in applications, and at some point it became clear that the intrinsic viewpoint leads to better algorithms, as compared to general-purpose constrained optimization methods (where $\mathcal{M}$ is considered as being inside a Euclidean space $\mathcal{E}$, and algorithms move in $\mathcal{E}$, while penalizing distance to $\mathcal{M}$). The second is that, as I will argue momentarily, Riemannian manifolds are “the right setting” to talk about unconstrained optimization. And indeed, there is a beautiful book by [Absil, Sepulchre, Mahony], called Optimization algorithms on matrix manifolds (freely available), that shows how the classical methods for unconstrained optimization (gradient descent, Newton, trust-regions, conjugate gradients…) carry over seamlessly to the more general Riemannian framework.
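A tiny concrete instance of the above (my own sketch, not from the book): Riemannian gradient descent on the unit sphere, where the tangent-space projection and the normalisation retraction are each one line. Minimising $x^\top A x$ over the sphere recovers an extreme eigenvector.

```python
import numpy as np

def sphere_descent(A, x0, lr=0.1, steps=500):
    """Riemannian gradient descent for min x^T A x over the unit sphere."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(steps):
        g = 2 * A @ x                 # Euclidean gradient of x^T A x
        rg = g - (x @ g) * x          # Riemannian gradient: project onto tangent space
        x = x - lr * rg               # step in the tangent space
        x = x / np.linalg.norm(x)     # retraction: back onto the manifold
    return x

A = np.diag([3.0, 2.0, 1.0])
x_opt = sphere_descent(A, np.ones(3))
```

The iterates never leave the manifold, so there is no penalty term and no extrinsic constraint handling; the minimiser is (up to sign) the eigenvector of the smallest eigenvalue.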
To file
Miscellaneous optimisation techniques suggested on LinkedIn.
The whole world of exotic specialized optimisers. See, e.g., Nuit Blanche name-dropping Bregman iteration, alternating methods, augmented Lagrangians…
Coordinate descent
Descend along one coordinate at a time.
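A sketch (mine) for the quadratic case, where each one-coordinate minimisation is exact and the method coincides with Gauss–Seidel iteration:

```python
import numpy as np

def coordinate_descent(A, b, sweeps=100):
    """Minimise 0.5*x^T A x - b^T x (A SPD) by exact minimisation
    over one coordinate at a time."""
    x = np.zeros_like(b)
    for _ in range(sweeps):
        for i in range(len(b)):
            # optimal x[i] with every other coordinate held fixed
            x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_cd = coordinate_descent(A, b)
```

Each inner update is O(n), so a full sweep costs about the same as one gradient evaluation, and for SPD quadratics the sweeps converge linearly.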
Small clever hack for certain domains: log gradient descent.
Fixed point methods
Contraction maps are nice when you have them. TBD.
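The basic Banach iteration is tiny. A sketch (my example, using cosine, which is a contraction near its fixed point):

```python
import math

def fixed_point(g, x0, tol=1e-12, max_iter=1000):
    """Iterate x <- g(x). If g is a contraction, Banach's theorem says this
    converges to the unique fixed point at a linear rate."""
    x = x0
    for _ in range(max_iter):
        x_next = g(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x

x_fix = fixed_point(math.cos, 1.0)    # |cos'(x)| < 1 near the fixed point
```

The contraction factor (here about 0.67, the local derivative magnitude) directly gives the linear convergence rate.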
Primal/dual problems
Majorization-minimization
Difference-of-convex objectives
When your objective function is not convex but you can represent it as a difference of convex functions, somehow or other, use DC optimisation (GaRC09). (I don't think this guarantees you a global optimum, but rather faster convergence to a local one.)
Nonconvex optimisation
“How doomed are you?”
Gradient-free optimization
Not all the methods described here use gradient information, but it is frequently assumed to be something you can access cheaply, so it's worth considering which objectives you can differentiate easily.
Not all objectives are easily differentiable, even when the parameters are continuous. For example, if you are getting your measurements not from a mathematical model but from a physical experiment, you can't differentiate them, since reality itself is usually not analytically differentiable. In this latter case, you are getting close to a question of online experiment design, as in ANOVA, with the further constraint that your function evaluations are possibly stupendously expensive. See Bayesian optimisation for one approach to this in the context of experiment design.
In general situations like this we use gradient-free methods, such as simulated annealing, numerical gradients, etc.
“Metaheuristic” methods
Biologically inspired or arbitrary: evolutionary algorithms, particle swarm optimisation, ant colony optimisation, harmony search. A lot of the tricks from these are adopted into mainstream stochastic methods. Some not.
See biomimetic algorithms for the care and husbandry of such things.
Annealing and Monte Carlo optimisation methods
Simulated annealing: constructing a random process to yield maximally likely estimates for the parameters. This has a statistical-mechanics justification that makes it attractive to physicists, but it's generally useful. You don't necessarily need a gradient here, just the ability to evaluate something interpretable as a “likelihood ratio”. Long story. I don't yet cover this at Monte Carlo methods, but I should.
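A minimal sketch (mine). Note that the acceptance rule only ever evaluates the objective, never a gradient, and the uphill-acceptance probability $\exp(-\Delta/T)$ is exactly the "likelihood ratio" mentioned above.

```python
import math
import random

def simulated_annealing(f, x0, steps=50000, t0=1.0, seed=42):
    """Minimise f with no gradient: random proposals, always accepted downhill,
    accepted uphill with Boltzmann probability exp(-delta/T)."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best, f_best = x, fx
    for k in range(steps):
        temp = t0 * (1.0 - k / steps) + 1e-9    # linear cooling schedule
        y = x + rng.gauss(0.0, 0.5)             # random neighbour proposal
        fy = f(y)
        if fy < fx or rng.random() < math.exp(-(fy - fx) / temp):
            x, fx = y, fy
            if fx < f_best:
                best, f_best = x, fx
    return best, f_best

# Double well: global minimum near x = -1, decoy local minimum near x = +1.
f = lambda x: (x * x - 1.0) ** 2 + 0.2 * x
x_best, f_min = simulated_annealing(f, x0=2.0)
```

Starting in the basin of the decoy minimum, the early high-temperature phase lets the walker hop the barrier, and the cooling phase then refines the estimate in the deeper well. (It is a stochastic method; the seed makes this particular run reproducible.)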
Expectation maximization
My problem: constrained, pathwise sparsifying-penalty optimisers for nonlinear problems
I'm trialling a bunch of sparse Lasso-like regression models.
I want them to be fast-ish to run and fast to develop.
I want them to go in Python.
I would like to be able to vary my regularisation parameter and warm-restart, like the glmnet Lasso.
I would like to be able to handle constraints, especially componentwise nonnegativity, and matrix nonnegative-definiteness.
Notes on that here.
Ideas:
use scipy.optimize
give up the idea of warm restarts, and enforce constraints in the callback.
use cvxpy
Pretty good, but not that fast since it does not in general exploit gradient information. For some problems this is fine, though.
use spams
Wonderful, but only if your problem fits one of their categories. Otherwise you can maybe extract some bits from their code and use them, but that is now a serious project. They have a passable LASSO.
Roll my own algorithm
Potentially yak-shaving. But I can work from the examples of my colleagues, which are special-purpose algorithms, usually reasonably fast ones.
The hip, high-dimensionally-tractable version of classic offline optimisation, one step at a time.
This page is deprecated! It was an awful way of organising optimization subdisciplines, and didn't even meet my study needs. I will gradually be salvaging the good bits and deleting it.
Minibatch and stochastic methods for minimising a loss when you have a lot of data, or a lot of parameters, and using it all at once is silly, or when you want to iteratively improve your solution as data comes in, and you have access to a gradient for your loss, ideally automatically calculated. It's not at all clear that this should work better than collating all your data and optimising offline, except that much of modern machine learning shows that it does.
Sometimes this apparently stupid trick might even be fast for small-dimensional cases, so you may as well try.
Technically, “online” optimisation in bandit/RL problems might imply that you have to “minimise regret online”, which has a slightly different meaning and, e.g., involves seeing each training example only as it arrives along some notional arrow of time, yet wishing to make the “best” decision at the next time, and possibly choosing your next experiment to trade off exploration against exploitation, etc.
In SGD you can see your data as often as you want and in whatever order, but you only look at a bit at a time. Usually the data is given and predictions make no difference to what information is available to you.
Some of the same technology pops up in each of these notions of online optimisation, but I am really thinking about SGD here.
There are many more permutations and variations used in practice.
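The basic pattern, sketched on my own toy example (least-squares regression): revisit the data as often as you like, in any order, but compute each gradient on only a small batch.

```python
import numpy as np

def sgd_linreg(X, y, lr=0.05, batch=10, epochs=100, seed=0):
    """Minibatch SGD for the least-squares loss 0.5*||Xw - y||^2 / n."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        # a fresh random pass over the data, one small batch at a time
        for idx in np.array_split(rng.permutation(n), n // batch):
            g = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * g
    return w

rng = np.random.default_rng(42)
X = rng.standard_normal((200, 3))
w_true = np.array([1.0, -2.0, 3.0])
w_sgd = sgd_linreg(X, X @ w_true)
```

Each step costs O(batch × d) regardless of n, which is the entire point; on this noiseless problem the iterates settle onto the exact solution.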
To file

Elad Hazan and Satyen Kale's tutorial on online convex optimisation.

Geoffrey Hinton's slides are a good overview from the artificial neural network perspective, where “good in messy circumstances” is the rule and convexity is not assumed.

Elad Hazan's Introduction to Online Convex Optimization.

Suvrit Sra's eye-bleedingly ugly introduction to this stuff

Francis Bach's slides on practical ML SGD.
Gradient Descent
The workhorse of many large-ish-scale machine-learning techniques, and especially deep neural networks. See Sebastian Ruder's explanation and his comparison.
Conditional Gradient
a.k.a. the Frank-Wolfe algorithm. Don't know much about this.
Stochastic
Variance-reduced
Zeyuan Allen-Zhu: Faster Than SGD 1: Variance Reduction:
SGD is well-known for large-scale optimization. In my mind, there are two (and only two) fundamental improvements since the original introduction of SGD: (1) variance reduction, and (2) acceleration. In this post I'd love to conduct a survey regarding (1).
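The core trick can be sketched in a few lines (my own minimal SVRG for least squares, following the idea in the post rather than any particular reference implementation): recompute a full gradient occasionally at a snapshot, then correct each cheap stochastic gradient with it, so the variance of the step shrinks as the iterates converge.

```python
import numpy as np

def svrg_least_squares(X, y, lr=0.01, epochs=50, seed=0):
    """SVRG on 0.5*mean((x_i.w - y_i)^2): cheap stochastic steps,
    variance-reduced by a full gradient computed at each snapshot."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        w_snap = w.copy()
        mu = X.T @ (X @ w_snap - y) / n          # full gradient at the snapshot
        for _ in range(2 * n):                   # inner loop of stochastic steps
            i = rng.integers(n)
            gi = X[i] * (X[i] @ w - y[i])        # per-example gradient at w
            gi_snap = X[i] * (X[i] @ w_snap - y[i])
            # unbiased estimate whose variance vanishes as w, w_snap -> optimum
            w -= lr * (gi - gi_snap + mu)
    return w

rng = np.random.default_rng(7)
X = rng.standard_normal((100, 3))
w_true = np.array([0.5, -1.0, 2.0])
w_svrg = svrg_least_squares(X, X @ w_true)
```

Unlike plain SGD with a fixed step size, the corrected steps permit linear convergence, because the noise term is proportional to the distance from the snapshot rather than constant.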
Second order (Quasi-Newton at scale)
LiSSA attempts to make 2nd order gradient descent methods practical (AgBH16):
linear-time stochastic second-order algorithm that achieves linear convergence for typical problems in machine learning while still maintaining run-times theoretically comparable to state-of-the-art first-order algorithms. This relies heavily on the special structure of the optimization problem that allows our unbiased Hessian estimator to be implemented efficiently, using only vector-vector products.
David McAllester observes:
Since the Hessian-vector product can be computed efficiently whenever we can run backpropagation, the conditions under which the LiSSA algorithm can be run are actually much more general than the paper suggests. Backpropagation can be run on essentially any natural loss function.
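The flavour of the estimator can be sketched deterministically (my simplification: the real LiSSA algorithm averages stochastic Hessian estimates, while here I use exact Hessian-vector products and a plain truncated Neumann series):

```python
import numpy as np

def neumann_newton_direction(hvp, g, steps=200):
    """Approximate the Newton direction H^{-1} g using only Hessian-vector
    products, via the Neumann series sum_k (I - H)^k g.
    Valid when the eigenvalues of H lie in (0, 2); rescale otherwise."""
    s = g.copy()
    for _ in range(steps):
        s = g + s - hvp(s)      # s <- g + (I - H) s
    return s

H = np.diag([0.5, 1.0, 1.5])    # a stand-in Hessian with eigenvalues in (0, 2)
g = np.ones(3)
direction = neumann_newton_direction(lambda v: H @ v, g)
```

No matrix is ever formed or inverted; only matrix-vector products are needed, which is what makes the approach compatible with backpropagation.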
What is Francis Bach's new baby? Finite-sample guarantees for certain unusual treatments of SGD for certain problems: (BaMo11, BaMo13)
Beyond stochastic gradient descent for largescale machine learning
Many machine learning and signal processing problems are traditionally cast as convex optimization problems. A common difficulty in solving these problems is the size of the data, where there are many observations ('large n') and each of these is large ('large p'). In this setting, online algorithms such as stochastic gradient descent which pass over the data only once, are usually preferred over batch algorithms, which require multiple passes over the data. In this talk, I will show how the smoothness of loss functions may be used to design novel algorithms with improved behavior, both in theory and practice: in the ideal infinite-data setting, an efficient novel Newton-based stochastic approximation algorithm leads to a convergence rate of O(1/n) without strong convexity assumptions, while in the practical finite-data setting, an appropriate combination of batch and online algorithms leads to unexpected behaviors, such as a linear convergence rate for strongly convex problems, with an iteration cost similar to stochastic gradient descent. (joint work with Nicolas Le Roux, Eric Moulines and Mark Schmidt).
…With sparsity
Hmmm.
Sundry Hacks
…
Yellowfin, an automatic SGD momentum tuner.
Parallel
Classic, basic SGD takes walks through the data set example-wise or feature-wise, but this doesn't work in parallel, so you tend to go for minibatch gradient descent so that you can at least vectorize. Apparently you can make SGD work in “true” parallel across communication-constrained cores, but I don't yet understand how.
Implementations
Specialised optimisation software.
See also statistical software, and online optimisation

SPORCO: a Python package for solving optimisation problems with sparsity-inducing regularisation. These consist primarily of sparse coding and dictionary learning problems, including convolutional sparse coding and dictionary learning, but there is also support for other problems such as Total Variation regularisation and Robust PCA. In the current version, all of the optimisation algorithms are based on the Alternating Direction Method of Multipliers (ADMM).

scipy.optimize.minimize: the Python default. Includes many different algorithms that can do whatever you want. Failure modes are opaque, it is offline-only, and it doesn't support warm restarts, which is a thing for me, but it's a good starting point unless you have reason to prefer others. (i.e. if all your data does not fit in RAM, don't bother.)

SPAMS (SPArse Modeling Software) is an optimization toolbox for solving various sparse estimation problems: dictionary learning and matrix factorization (NMF, sparse PCA, …); solving sparse decomposition problems with LARS, coordinate descent, OMP, SOMP, proximal methods; solving structured sparse decomposition problems (sparse group lasso, tree-structured regularization, structured sparsity with overlapping groups, …). It is developed by Julien Mairal, with the collaboration of Francis Bach, Jean Ponce, Guillermo Sapiro, Rodolphe Jenatton and Guillaume Obozinski. It is coded in C++ with a Matlab interface. Recently, interfaces for R and Python have been developed by Jean-Paul Chieze (INRIA), and archetypal analysis was written by Yuansi Chen (UC Berkeley).

PICOS is a user-friendly interface to several conic and integer programming solvers, very much like YALMIP or CVX under MATLAB.
The main motivation for PICOS is to have the possibility to enter an optimization problem as a high level model, and to be able to solve it with several different solvers. Multidimensional and matrix variables are handled in a natural fashion, which makes it painless to formulate a SDP or a SOCP. This is very useful for educational purposes, and to quickly implement some models and test their validity on simple examples.
PICOS also maintains a list of other solvers.

Manifold optimisation implementations:

… is a free software package for convex optimization based on the Python programming language. It can be used with the interactive Python interpreter, on the command line by executing Python scripts, or integrated in other software via Python extension modules. Its main purpose is to make the development of software for convex optimization applications straightforward by building on Python’s extensive standard library and on the strengths of Python as a highlevel programming language. […]

efficient Python classes for dense and sparse matrices (real and complex), with Python indexing and slicing and overloaded operations for matrix arithmetic

an interface to most of the doubleprecision real and complex BLAS

an interface to LAPACK routines for solving linear equations and leastsquares problems, matrix factorisations (LU, Cholesky, LDLT and QR), symmetric eigenvalue and singular value decomposition, and Schur factorization

an interface to the fast Fourier transform routines from FFTW

interfaces to the sparse LU and Cholesky solvers from UMFPACK and CHOLMOD

routines for linear, secondorder cone, and semidefinite programming problems

routines for nonlinear convex optimization

interfaces to the linear programming solver in GLPK, the semidefinite programming solver in DSDP5, and the linear, quadratic and secondorder cone programming solvers in MOSEK

a modeling tool for specifying convex piecewiselinear optimization problems.
It seems to reinvent half of numpy and scipy. It also seems to be used by all the other Python packages, including…
…is a Python-embedded modeling language for convex optimization problems. It allows you to express your problem in a natural way that follows the math, rather than in the restrictive standard form required by solvers.
So it's a DSL for convex constraint programming. Can be extended heuristically to nonconvex constraints by…

… is a package for modeling and solving problems with convex objectives and decision variables from a nonconvex set. This package provides heuristics such as NC-ADMM (a variation of the alternating direction method of multipliers for nonconvex problems) and relax-round-polish, which can be viewed as a majorization-minimization algorithm. The solver methods provided and the syntax for constructing problems are discussed in our associated paper.


… is a free/open-source library for nonlinear optimization, providing a common interface for a number of different free optimization routines available online as well as original implementations of various other algorithms. Its features include:

Callable from C, C++, Fortran, Matlab or GNU Octave, Python, GNU Guile, Julia, GNU R, Lua, and OCaml.

A common interface for many different algorithms—try a different algorithm just by changing one parameter.

Support for large-scale optimization (some algorithms scalable to millions of parameters and thousands of constraints)…

Algorithms using function values only (derivative-free) and also algorithms exploiting user-supplied gradients.


…(pronounced tee-fox) provides a set of Matlab templates, or building blocks, that can be used to construct efficient, customized solvers for a variety of convex models, including in particular those employed in sparse recovery applications. It was conceived and written by Stephen Becker, Emmanuel J. Candès and Michael Grant.

Stan is famous for Monte Carlo sampling, but also does deterministic optimisation using automatic differentiation. This is a luxurious “full service” option, although with limited scope for customisation; I am curious how it performs in very high dimensions, as L-BFGS does not scale forever.
Optimization algorithms:

Limited-memory BFGS (Stan's default optimization algorithm)

BFGS

Laplace's method for classical standard error estimates and approximate Bayesian posteriors


Optim.jl is a generic optimizer for Julia.

JuMP.jl is a domainspecific modeling language for mathematical optimization embedded in Julia. It currently supports a number of opensource and commercial solvers (Bonmin, Cbc, Clp, Couenne, CPLEX, ECOS, FICO Xpress, GLPK, Gurobi, Ipopt, KNITRO, MOSEK, NLopt, SCS, BARON) for a variety of problem classes, including linear programming, (mixed) integer programming, secondorder conic programming, semidefinite programming, and nonlinear programming.

NLsolve.jl solves systems of nonlinear equations. […]
The package is also able to solve mixed complementarity problems, which are similar to systems of nonlinear equations, except that the equality to zero is allowed to become an inequality if some boundary condition is satisfied. See further below for a formal definition and the related commands.
Since there is some overlap between optimizers and nonlinear solvers, this package borrows some ideas from the Optim package, and depends on it for linesearch algorithms.
Many of these solvers optionally use commercial backends such as Mosek.
Refs
The hip highdimensionallytractable version of classic offline optimisation, one step at a time.
This page is deprecated! It was an awful way of organising optimization subdisciplines, and didn't even meet my study needs. I will gradually be salvaging the good bits and deleting it.
Minibatch and stochastic methods for minimising loss when you have a lot of data, or a lot of parameters, and using it all at once is silly, or when you want to iteratively improve your solution as data comes in, and you have access to a gradient for your loss, ideally automatically calculated. It's not clear at all that it should work, except by collating all your data and optimising offline, except that much of modern machine learning shows that it does.
Sometimes this apparently stupid trick it might even be fast for smalldimensional cases, so you may as well try.
Technically, “online” optimisation in bandit/RL problems might imply that you have to “minimise regret online”, which has a slightly different meaning and, e.g. involves seeing each training only as it arrives along some notional arrow of time, yet wishing to make the “best” decision at the next time, and possibly choosing your next experiment in order to tradeoff exploration versus exploitation etc.
In SGD you can see your data as often as you want and in whatever order, but you only look at a bit at a time. Usually the data is given and predictions make no difference to what information is available to you.
Some of the same technology pops up in each of these notions of online optimisation, but I am really thinking about SGD here.
There are many more permutations and variations used in practice.
To file

Elad Hazan and Satyan Kale's tutorial on online convex optimisation.

Geoffrey Hinton's slides are good overview from the artificial neural network perspective, where “good in messy circumstances” is the rule and convexity is not assumed.
 Elad Hazan's Introduction to Online Convex Optimization.

Suvrit Sra's eyebleeding ugly introduction to this stuff

Francis Bach's slides on practical ML SGD.
Gradient Descent
The workhorse of many largeish scale machinelearning techniques and especially deep neural networks. See Sebastian Ruder's explanation and his comparison.
Conditional Gradient
a.k.a. FrankWolfe algorithm: Don't know much about this.
Stochastic
Variancereduced
Zeyuan AllenZhu : Faster Than SGD 1: Variance Reduction:
SGD is wellknown for largescale optimization. In my mind, there are two (and only two) fundamental improvements since the original introduction of SGD: (1) variance reduction, and (2) acceleration. In this post I'd love to conduct a survey regarding (1),
Second order (Quasinewton at scale)
LiSSA attempts to make 2nd order gradient descent methods practical (AgBH16):
linear time stochastic second order algorithm that achieves linear convergence for typical problems in machine learning while still maintaining runtimes theoretically comparable to stateoftheart first order algorithms. This relies heavily on the special structure of the optimization problem that allows our unbiased hessian estimator to be implemented efficiently, using only vectorvector products.
David McAllester observes:
Since can be computed efficiently whenever we can run backpropagation, the conditions under which the LiSSA algorithm can be run are actually much more general than the paper suggests. Backpropagation can be run on essentially any natural loss function.
What is Francis Bach's new baby? Finite sample guarantees for certain unusual treatements of SGD for certain problems: (BaMo11, BaMo13)
Beyond stochastic gradient descent for largescale machine learning
Many machine learning and signal processing problems are traditionally cast as convex optimization problems. A common difficulty in solving these problems is the size of the data, where there are many observations ('large n') and each of these is large ('large p'). In this setting, online algorithms such as stochastic gradient descent which pass over the data only once, are usually preferred over batch algorithms, which require multiple passes over the data. In this talk, I will show how the smoothness of loss functions may be used to design novel algorithms with improved behavior, both in theory and practice: in the ideal infinitedata setting, an efficient novel Newtonbased stochastic approximation algorithm leads to a convergence rate of O(1/n) without strong convexity assumptions, while in the practical finitedata setting, an appropriate combination of batch and online algorithms leads to unexpected behaviors, such as a linear convergence rate for strongly convex problems, with an iteration cost similar to stochastic gradient descent. (joint work with Nicolas Le Roux, Eric Moulines and Mark Schmidt).
…With sparsity
Hmmm.
Sundry Hacks
…
Yellowfin, an automatic SGD momentum tuner.
Parallel
Classic, basic SGD takes walks through the data set example-wise or feature-wise, but that access pattern doesn't parallelize, so in practice you go for minibatch gradient descent so that you can at least vectorize. Apparently you can make SGD work in “true” parallel across communication-constrained cores, but I don't yet understand how.
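Minibatch SGD itself is simple enough to sketch (the least-squares model, data and hyperparameters are made-up illustrations): each update averages gradients over a small batch, and it is that per-batch average which vectorizes.

```python
# Minibatch SGD for 1-parameter least squares: y ≈ w * x.
import random

random.seed(0)

def sgd(data, batch_size=10, lr=0.01, epochs=50):
    w = 0.0
    for _ in range(epochs):
        random.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # gradient of mean squared error 0.5*(w*x - y)^2 w.r.t. w,
            # averaged over the batch
            g = sum((w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * g
    return w

data = [(x, 3.0 * x) for x in [i / 10 for i in range(100)]]
print(round(sgd(data), 3))  # converges toward 3.0
```

In a real implementation the batch-gradient sum would be a single matrix-vector product, which is exactly the vectorization win over one-example-at-a-time updates.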
Refs
 RoSi71: H. Robbins, D. Siegmund (1971) A convergence theorem for non negative almost supermartingales and some applications. In Optimizing Methods in Statistics (pp. 233–257). Academic Press DOI
 YuTo09: Sangwoon Yun, Kim-Chuan Toh (2009) A coordinate gradient descent method for ℓ1-regularized convex minimization. Computational Optimization and Applications, 48(2), 273–307. DOI
 BeTe09: Amir Beck, Marc Teboulle (2009) A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal on Imaging Sciences, 2(1), 183–202. DOI
 Rupp85: David Ruppert (1985) A Newton-Raphson Version of the Multivariate Robbins-Monro Procedure. The Annals of Statistics, 13(1), 236–245. DOI
 STXY16: Hao-Jun Michael Shi, Shenyinying Tu, Yangyang Xu, Wotao Yin (2016) A Primer on Coordinate Descent Algorithms. ArXiv:1610.00040 [Math, Stat].
 CoPe08: Patrick L. Combettes, Jean-Christophe Pesquet (2008) A proximal decomposition method for solving convex variational inverse problems. Inverse Problems, 24(6), 065014. DOI
 FlPo63: R. Fletcher, M. J. D. Powell (1963) A Rapidly Convergent Descent Method for Minimization. The Computer Journal, 6(2), 163–168. DOI
 ACDL14: Alekh Agarwal, Olivier Chapelle, Miroslav Dudík, John Langford (2014) A Reliable Effective Terascale Linear Learning System. Journal of Machine Learning Research, 15(1), 1111–1133.
 RoMo51: Herbert Robbins, Sutton Monro (1951) A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3), 400–407. DOI
 LuMH15: Aurelien Lucchi, Brian McWilliams, Thomas Hofmann (2015) A Variance Reduced Stochastic Newton Method. ArXiv:1503.08316 [Cs].
 WiWJ16: Andre Wibisono, Ashia C. Wilson, Michael I. Jordan (2016) A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47), E7351–E7358. DOI
 GhLa13a: Saeed Ghadimi, Guanghui Lan (2013a) Accelerated Gradient Methods for Nonconvex Nonlinear and Stochastic Programming. ArXiv:1310.3787 [Math].
 HuPK09: Chonghai Hu, Weike Pan, James T. Kwok (2009) Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems (pp. 781–789). Curran Associates, Inc.
 VSSM06: S.V. N. Vishwanathan, Nicol N. Schraudolph, Mark W. Schmidt, Kevin P. Murphy (2006) Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods. In Proceedings of the 23rd International Conference on Machine Learning.
 Nest07: Yu Nesterov (2007) Accelerating the cubic regularization of Newton’s method on convex problems. Mathematical Programming, 112(1), 159–181. DOI
 PoJu92: B. T. Polyak, A. B. Juditsky (1992) Acceleration of Stochastic Approximation by Averaging. SIAM Journal on Control and Optimization, 30(4), 838–855. DOI
 MHSY13: H. Brendan McMahan, Gary Holt, D. Sculley, Michael Young, Dietmar Ebner, Julian Grady, … Jeremy Kubica (2013) Ad Click Prediction: A View from the Trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1222–1230). New York, NY, USA: ACM DOI
 KiBa15: Diederik Kingma, Jimmy Ba (2015) Adam: A Method for Stochastic Optimization. Proceeding of ICLR.
 Spal00: J. C. Spall (2000) Adaptive stochastic approximation by the simultaneous perturbation method. IEEE Transactions on Automatic Control, 45(10), 1839–1853. DOI
 DuHS11: John Duchi, Elad Hazan, Yoram Singer (2011) Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12(Jul), 2121–2159.
 Chre08: Stephane Chretien (2008) An Alternating l1 approach to the compressed sensing problem. ArXiv:0809.0660 [Stat].
 Rude16: Sebastian Ruder (2016) An overview of gradient descent optimization algorithms. ArXiv:1609.04747 [Cs].
 MZHR16: Ioannis Mitliagkas, Ce Zhang, Stefan Hadjis, Christopher Ré (2016) Asynchrony begets Momentum, with an Application to Deep Learning. ArXiv:1605.09774 [Cs, Math, Stat].
 HaLS15: Elad Hazan, Kfir Levy, Shai Shalev-Shwartz (2015) Beyond Convexity: Stochastic Quasi-Convex Optimization. In Advances in Neural Information Processing Systems 28 (pp. 1594–1602). Curran Associates, Inc.
 Gile08: Mike B. Giles (2008) Collected Matrix Derivative Results for Forward and Reverse Mode Algorithmic Differentiation. In Advances in Automatic Differentiation (Vol. 64, pp. 35–44). Berlin, Heidelberg: Springer Berlin Heidelberg
 BiCY14: Wei Bian, Xiaojun Chen, Yinyu Ye (2014) Complexity analysis of interior point algorithms for non-Lipschitz and nonconvex minimization. Mathematical Programming, 149(1–2), 301–327. DOI
 CGWY12: Xiaojun Chen, Dongdong Ge, Zizhuo Wang, Yinyu Ye (2012) Complexity of unconstrained L_2-L_p minimization. Mathematical Programming, 143(1–2), 371–383. DOI
 HaJN15: Zaid Harchaoui, Anatoli Juditsky, Arkadi Nemirovski (2015) Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming, 152(1–2), 75–112. DOI
 JaFM14: D. Jakovetic, J.M. Freitas Xavier, J.M.F. Moura (2014) Convergence Rates of Distributed Nesterov-Like Gradient Methods on Random Networks. IEEE Transactions on Signal Processing, 62(4), 868–882. DOI
 BoVa04: Stephen P. Boyd, Lieven Vandenberghe (2004) Convex optimization. Cambridge, UK ; New York: Cambridge University Press
 Bube15: Sébastien Bubeck (2015) Convex Optimization: Algorithms and Complexity. Foundations and Trends® in Machine Learning, 8(3–4), 231–357. DOI
 CeBS14: Volkan Cevher, Stephen Becker, Mark Schmidt (2014) Convex Optimization for Big Data. IEEE Signal Processing Magazine, 31(5), 32–43. DOI
 Bach13: Francis Bach (2013) Convex relaxations of structured matrix factorizations. ArXiv:1309.3117 [Cs, Math].
 WuLa08: Tong Tong Wu, Kenneth Lange (2008) Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, 2(1), 224–244. DOI
 Boyd10: Stephen Boyd (2010) Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends® in Machine Learning, 3(1), 1–122. DOI
 MKJS15: Chenxin Ma, Jakub Konečnỳ, Martin Jaggi, Virginia Smith, Michael I. Jordan, Peter Richtárik, Martin Takáč (2015) Distributed Optimization with Arbitrary Local Solvers. ArXiv Preprint ArXiv:1512.04039.
 MaBe17: Siyuan Ma, Mikhail Belkin (2017) Diving into the shallows: a computational perspective on large-scale shallow learning. ArXiv:1703.10622 [Cs, Stat].
 Nest12: Y. Nesterov (2012) Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems. SIAM Journal on Optimization, 22(2), 341–362. DOI
 SGAL14: Levent Sagun, V. Ugur Guney, Gerard Ben Arous, Yann LeCun (2014) Explorations on high dimensional landscapes. ArXiv:1412.6615 [Cs, Stat].
 GoSB15: Tom Goldstein, Christoph Studer, Richard Baraniuk (2015) FASTA: A Generalized Implementation of Forward-Backward Splitting. ArXiv:1501.04979 [Cs, Math].
 AbHa15: Jacob Abernethy, Elad Hazan (2015) Faster Convex Optimization: Simulated Annealing with an Efficient Universal Barrier. ArXiv:1507.02528 [Cs, Math].
 MEKU08: Doug Mcleod, Garry Emmerson, Robert Kohn, Geoff Kingston (2008) Finding the invisible hand: an objective model of financial markets
 Batt92: Roberto Battiti (1992) First- and second-order methods for learning: between steepest descent and Newton’s method. Neural Computation, 4(2), 141–166. DOI
 Dala17: Arnak S. Dalalyan (2017) Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. ArXiv:1704.04752 [Math, Stat].
 Nest12: Yu Nesterov (2012) Gradient methods for minimizing composite functions. Mathematical Programming, 140(1), 125–161. DOI
 KiHa16: Daeun Kim, Justin P. Haldar (2016) Greedy algorithms for nonnegativity-constrained simultaneous sparse recovery. Signal Processing, 125, 274–289. DOI
 FrSc12: Michael P. Friedlander, Mark Schmidt (2012) Hybrid Deterministic-Stochastic Methods for Data Fitting. SIAM Journal on Scientific Computing, 34(3), A1380–A1405. DOI
 DPGC14: Yann Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, Yoshua Bengio (2014) Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems 27 (pp. 2933–2941). Curran Associates, Inc.
 YaMM15: Jiyan Yang, Xiangrui Meng, Michael W. Mahoney (2015) Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments. ArXiv:1502.03032 [Cs, Math, Stat].
 BoLl15: Zdravko I. Botev, Chris J. Lloyd (2015) Importance accelerated Robbins-Monro recursion with applications to parametric confidence limits. Electronic Journal of Statistics, 9(2), 2058–2075. DOI
 Mair15: J. Mairal (2015) Incremental Majorization-Minimization Optimization with Application to Large-Scale Machine Learning. SIAM Journal on Optimization, 25(2), 829–855. DOI
 Nest04: Yurii Nesterov (2004) Introductory Lectures on Convex Optimization (Vol. 87). Boston, MA: Springer US
 StVi16: Damian Straszak, Nisheeth K. Vishnoi (2016) IRLS and Slime Mold: Equivalence and Convergence. ArXiv:1601.02712 [Cs, Math, Stat].
 WiNa16: David Wipf, Srikantan Nagarajan (2016) Iterative Reweighted l1 and l2 Methods for Finding Sparse Solution. Microsoft Research.
 ChYi08: R. Chartrand, Wotao Yin (2008) Iteratively reweighted algorithms for compressive sensing. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2008. ICASSP 2008 (pp. 3869–3872). DOI
 SFJJ15: Virginia Smith, Simone Forte, Michael I. Jordan, Martin Jaggi (2015) L1-Regularized Distributed Optimization: A Communication-Efficient Primal-Dual Framework. ArXiv:1512.04011 [Cs].
 Klei04: Dan Klein (2004) Lagrange multipliers without permanent scarring. University of California at Berkeley, Computer Science Division.
 BoLe04: Léon Bottou, Yann LeCun (2004) Large Scale Online Learning. In Advances in Neural Information Processing Systems 16. Cambridge, MA: MIT Press
 Bott10: Léon Bottou (2010) Large-scale machine learning with stochastic gradient descent. In Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT’2010) (pp. 177–186). Paris, France: Springer
 HoSr15: Reshad Hosseini, Suvrit Sra (2015) Manifold Optimization for Gaussian Mixture Models. ArXiv Preprint ArXiv:1506.07677.
 BMAS14: Nicolas Boumal, Bamdev Mishra, P.-A. Absil, Rodolphe Sepulchre (2014) Manopt, a Matlab Toolbox for Optimization on Manifolds. Journal of Machine Learning Research, 15, 1455–1459.
 MaNT04: K Madsen, H.B. Nielsen, O. Tingleff (2004) Methods for nonlinear least squares problems
 BoLB16: Aleksandar Botev, Guy Lever, David Barber (2016) Nesterov’s Accelerated Gradient and Momentum as approximations to Regularised Update Descent. ArXiv:1607.01981 [Cs, Stat].
 HiSK00: Geoffrey Hinton, Nitish Srivastava, Kevin Swersky (n.d.) Neural Networks for Machine Learning
 BaMo11: Francis Bach, Eric Moulines (2011) Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning. In Advances in Neural Information Processing Systems (NIPS). Spain
 Devo98: Ronald A. DeVore (1998) Nonlinear approximation. Acta Numerica, 7, 51–150. DOI
 BaMo13: Francis R. Bach, Eric Moulines (2013) Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In arXiv:1306.2119 [cs, math, stat] (pp. 773–781).
 WiWi15: Andre Wibisono, Ashia C. Wilson (2015) On Accelerated Methods in Optimization. ArXiv:1509.03616 [Math].
 Heyd74: C. C. Heyde (1974) On martingale limit theory and strong convergence results for stochastic approximation procedures. Stochastic Processes and Their Applications, 2(4), 359–370. DOI
 Pate17: Vivak Patel (2017) On SGD’s Failure in Practice: Characterizing and Overcoming Stalling. ArXiv:1702.00317 [Cs, Math, Stat].
 Gold65: A. Goldstein (1965) On Steepest Descent. Journal of the Society for Industrial and Applied Mathematics Series A Control, 3(1), 147–151. DOI
 Bott98: Léon Bottou (1998) Online Algorithms and Stochastic Approximations. In Online Learning and Neural Networks (Vol. 17, p. 142). Cambridge, UK: Cambridge University Press
 Zink03: Martin Zinkevich (2003) Online Convex Programming and Generalized Infinitesimal Gradient Ascent. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning (pp. 928–935). Washington, DC, USA: AAAI Press
 AlHa16: Zeyuan Allen-Zhu, Elad Hazan (2016) Optimal Black-Box Reductions Between Optimization Objectives. In Advances in Neural Information Processing Systems 29 (pp. 1606–1614). Curran Associates, Inc.
 BePo05: D. Bertsimas, I. Popescu (2005) Optimal Inequalities in Probability Theory: A Convex Optimization Approach. SIAM Journal on Optimization, 15(3), 780–804. DOI
 ScFR09: Mark Schmidt, Glenn Fung, Romer Rosales (2009) Optimization methods for l1-regularization. University of British Columbia, Technical Report TR2009, 19.
 BoCN16: Léon Bottou, Frank E. Curtis, Jorge Nocedal (2016) Optimization Methods for Large-Scale Machine Learning. ArXiv:1606.04838 [Cs, Math, Stat].
 Mair13a: Julien Mairal (2013a) Optimization with First-Order Surrogate Functions. In International Conference on Machine Learning (pp. 783–791).
 BJMO12: Francis Bach, Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski (2012) Optimization with Sparsity-Inducing Penalties. Foundations and Trends® in Machine Learning, 4(1), 1–106. DOI
 ZWLS10: Martin Zinkevich, Markus Weimer, Lihong Li, Alex J. Smola (2010) Parallelized Stochastic Gradient Descent. In Advances in Neural Information Processing Systems 23 (pp. 2595–2603). Curran Associates, Inc.
 FHHT07: Jerome Friedman, Trevor Hastie, Holger Höfling, Robert Tibshirani (2007) Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2), 302–332. DOI
 RoZh07: Saharon Rosset, Ji Zhu (2007) Piecewise linear regularized solution paths. The Annals of Statistics, 35(3), 1012–1030. DOI
 PaBo14: Neal Parikh, Stephen Boyd (2014) Proximal Algorithms. Foundations and Trends® in Optimization, 1(3), 127–239. DOI
 ToKW16: James Townsend, Niklas Koep, Sebastian Weichwald (2016) Pymanopt: A Python Toolbox for Optimization on Manifolds using Automatic Differentiation. Journal of Machine Learning Research, 17(137), 1–5.
 LiMH16: Hongzhou Lin, Julien Mairal, Zaid Harchaoui (2016) QuickeNing: A Generic Quasi-Newton Algorithm for Faster Gradient-Based Optimization. In arXiv:1610.00960 [math, stat].
 MaBo10: J. Mattingley, S. Boyd (2010) Real-Time Convex Optimization in Signal Processing. IEEE Signal Processing Magazine, 27(3), 50–61. DOI
 FoKe09: Alexander I. J. Forrester, Andy J. Keane (2009) Recent advances in surrogate-based optimization. Progress in Aerospace Sciences, 45(1–3), 50–79. DOI
 GaRC09: G. Gasso, A. Rakotomamonjy, S. Canu (2009) Recovering Sparse Signals With a Certain Family of Nonconvex Penalties and DC Programming. IEEE Transactions on Signal Processing, 57(12), 4686–4698. DOI
 LiLR16: Yuanzhi Li, Yingyu Liang, Andrej Risteski (2016) Recovery Guarantee of Nonnegative Matrix Factorization via Alternating Updates. In Advances in Neural Information Processing Systems 29 (pp. 4988–4996). Curran Associates, Inc.
 SFHT11: Noah Simon, Jerome Friedman, Trevor Hastie, Rob Tibshirani (2011) Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent. Journal of Statistical Software, 39(5).
 FrHT10: Jerome Friedman, Trevor Hastie, Rob Tibshirani (2010) Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1–22. DOI
 Jagg13: Martin Jaggi (2013) Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In Journal of Machine Learning Research (pp. 427–435).
 AcNP05: Dimitris Achlioptas, Assaf Naor, Yuval Peres (2005) Rigorous location of phase transitions in hard optimization problems. Nature, 435(7043), 759–764. DOI
 DeBL14: Aaron Defazio, Francis Bach, Simon Lacoste-Julien (2014) SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. In Advances in Neural Information Processing Systems 27.
 AgBH16: Naman Agarwal, Brian Bullins, Elad Hazan (2016) Second Order Stochastic Optimization in Linear Time. ArXiv:1602.03943 [Cs, Stat].
 BoBG09: Antoine Bordes, Léon Bottou, Patrick Gallinari (2009) SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent. Journal of Machine Learning Research, 10, 1737–1754.
 Chen12: Xiaojun Chen (2012) Smoothing methods for nonsmooth, nonconvex minimization. Mathematical Programming, 134(1), 71–99. DOI
 ZYJZ15: Lijun Zhang, Tianbao Yang, Rong Jin, Zhi-Hua Zhou (2015) Sparse Learning for Large-scale and High-dimensional Data: A Randomized Convex-concave Optimization Approach. ArXiv:1511.03766 [Cs].
 MaBP14: Julien Mairal, Francis Bach, Jean Ponce (2014) Sparse modeling for image and vision processing. Foundations and Trends® in Computer Graphics and Vision, 8(2–3), 85–283. DOI
 LaLZ09: John Langford, Lihong Li, Tong Zhang (2009) Sparse Online Learning via Truncated Gradient. In Advances in Neural Information Processing Systems 21 (pp. 905–912). Curran Associates, Inc.
 WrNF09: S. J. Wright, R. D. Nowak, M. A. T. Figueiredo (2009) Sparse Reconstruction by Separable Approximation. IEEE Transactions on Signal Processing, 57(7), 2479–2493. DOI
 Lai03: Tze Leung Lai (2003) Stochastic Approximation. The Annals of Statistics, 31(2), 391–406. DOI
 GhLa13b: Saeed Ghadimi, Guanghui Lan (2013b) Stochastic First- and Zeroth-order Methods for Nonconvex Stochastic Programming. SIAM Journal on Optimization, 23(4), 2341–2368. DOI
 SSBA95: Sashank J. Reddi, Suvrit Sra, Barnabás Póczós, Alex Smola (1995) Stochastic Frank-Wolfe Methods for Nonconvex Optimization.
 Frie02: Jerome H. Friedman (2002) Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4), 367–378. DOI
 Bott12: Léon Bottou (2012) Stochastic Gradient Descent Tricks. In Neural Networks: Tricks of the Trade (pp. 421–436). Springer, Berlin, Heidelberg DOI
 Bott91: Léon Bottou (1991) Stochastic Gradient Learning in Neural Networks. In Proceedings of NeuroNîmes 91. Nimes, France: EC2
 Mair13b: Julien Mairal (2013b) Stochastic majorization-minimization algorithms for large-scale optimization. In Advances in Neural Information Processing Systems (pp. 2283–2291).
 ShTe11: Shai Shalev-Shwartz, Ambuj Tewari (2011) Stochastic Methods for L1-regularized Loss Minimization. Journal of Machine Learning Research, 12, 1865–1892.
 NLST17: Lam M. Nguyen, Jie Liu, Katya Scheinberg, Martin Takáč (2017) Stochastic Recursive Gradient Algorithm for Nonconvex Optimization. ArXiv:1705.07261 [Cs, Math, Stat].
 RHSP16: Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, Alex Smola (2016) Stochastic Variance Reduction for Nonconvex Optimization. In PMLR (Vol. 1603, pp. 314–323).
 ZhWG17: Xiao Zhang, Lingxiao Wang, Quanquan Gu (2017) Stochastic Variance-reduced Gradient Descent for Low-rank Matrix Recovery from Linear Measurements. ArXiv:1701.00481 [Stat].
 Wain14: Martin J. Wainwright (2014) Structured Regularizers for High-Dimensional Problems: Statistical and Computational Issues. Annual Review of Statistics and Its Application, 1(1), 233–253. DOI
 QHSG05: Nestor V. Queipo, Raphael T. Haftka, Wei Shyy, Tushar Goel, Rajkumar Vaidyanathan, P. Kevin Tucker (2005) Surrogate-based analysis and optimization. Progress in Aerospace Sciences, 41(1), 1–28. DOI
 HaZh12: Zhong-Hua Han, Ke-Shi Zhang (2012) Surrogate-based optimization.
 AABB16: Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, … Matthieu Devin (2016) TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. ArXiv Preprint ArXiv:1603.04467.
 BuEl14: Sébastien Bubeck, Ronen Eldan (2014) The entropic barrier: a simple and optimal universal self-concordant barrier. ArXiv:1412.1587 [Cs, Math].
 PoKo97: Stephen Portnoy, Roger Koenker (1997) The Gaussian hare and the Laplacian tortoise: computability of squared-error versus absolute-error estimators. Statistical Science, 12(4), 279–300. DOI
 This97: Ronald A. Thisted (1997) [The Gaussian Hare and the Laplacian Tortoise: Computability of Squared-Error versus Absolute-Error Estimators]: Comment. Statistical Science, 12(4), 296–298.
 MeBM16: Song Mei, Yu Bai, Andrea Montanari (2016) The Landscape of Empirical Risk for Nonconvex Losses. ArXiv:1607.06534 [Stat].
 CHMB15: Anna Choromanska, Mikael Henaff, Michael Mathieu, Gerard Ben Arous, Yann LeCun (2015) The Loss Surfaces of Multilayer Networks. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (pp. 192–204).
 Levy16: Kfir Y. Levy (2016) The Power of Normalization: Faster Evasion of Saddle Points. ArXiv:1611.04831 [Cs, Math, Stat].
 BoBo08: Léon Bottou, Olivier Bousquet (2008) The Tradeoffs of Large Scale Learning. In Advances in Neural Information Processing Systems (Vol. 20, pp. 161–168). NIPS Foundation (http://books.nips.cc)
 Xu11: Wei Xu (2011) Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent. ArXiv:1107.2490 [Cs].
 SaKi16: Tim Salimans, Diederik P Kingma (2016) Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. In Advances in Neural Information Processing Systems 29 (pp. 901–901). Curran Associates, Inc.