Crawling through alien landscapes in the fog, looking for mountain peaks.
I’m mostly interested in continuous optimisation, but, you know, combinatorial optimisation is a whole thing.
A vast topic, with many subtopics, of which I have neither the time nor the expertise to construct a detailed map. As Moritz Hardt observes (and this is just in the convex context),
It’s easy to spend a semester of convex optimization on various guises of gradient descent alone. Simply pick one of the following variants and work through the specifics of the analysis: conjugate, accelerated, projected, conditional, mirrored, stochastic, coordinate, online. This is to name a few. You may also choose various pairs of attributes such as “accelerated coordinate” descent. Many triples are also valid such as “online stochastic mirror” descent. An expert unlike me would know exactly which triples are admissible. You get extra credit when you use “subgradient” instead of “gradient”. This is really only the beginning of optimization and it might already seem confusing.
When I was even younger and yet more foolish, I decided the divide was between online optimization and offline optimization, which in hindsight is neither a clear nor a useful taxonomy for the problems facing me. Now there are more tightly topical pages, such as gradient descent, 2nd-order methods, surrogate optimisation and constrained optimisation, and I shall create more as circumstances demand.
Brief taxonomy here.
🚧 Diagram.
See Zeyuan Allen-Zhu and Elad Hazan on their teaching strategy, which also gives a split into 16 different areas:
The following dilemma is encountered by many of my friends when teaching basic optimization: which variant/proof of gradient descent should one start with? Of course, one needs to decide on which depth of convex analysis one should dive into, and decide on issues such as “should I define strong-convexity?”, “discuss smoothness?”, “Nesterov acceleration?”, etc.
[…] If one wishes to go into more depth, usually in convex optimization courses, one covers the full spectrum of different smoothness/strong-convexity/acceleration/stochasticity regimes, each with a separate analysis (a total of 16 possible configurations!)
This year I’ve tried something different in COS511 @ Princeton, which turns out also to have research significance. We’ve covered basic GD for well-conditioned functions, i.e. smooth and strongly-convex functions, and then extended these results by reduction to all other cases! A (simplified) outline of this teaching strategy is given in chapter 2 of Introduction to Online Convex Optimization.
Classical Strong-Convexity and Smoothness Reductions:
Given any optimization algorithm A for the well-conditioned case (i.e., the strongly convex and smooth case), we can derive an algorithm for smooth but not strongly convex functions as follows.
Given a non-strongly-convex but smooth objective \(f\), define a new objective by \(f_1(x) = f(x) + \epsilon \|x\|^2\).
It is straightforward to see that \(f_1\) differs from \(f\) by at most \(\epsilon\) times a distance factor, and in addition it is \(\epsilon\)-strongly convex. Thus, one can apply A to minimize \(f_1\) and get a solution which is not too far from the optimal solution for \(f\) itself. This simplistic reduction yields an almost optimal rate, up to logarithmic factors.
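The reduction is easy to sketch in code. A toy illustration, entirely mine: the hypothetical objective \(f(x) = x^4/4\) is smooth and convex but not strongly convex (it is flat near its optimum at 0), so we run gradient descent on the regularised objective \(f_1\) instead.

```python
def grad_f(x):
    # Gradient of the toy objective f(x) = x**4 / 4, which is
    # smooth and convex but not strongly convex.
    return x ** 3

def reduced_gd(grad_f, x0, eps, lr, steps):
    """Gradient descent on the regularised objective
    f_1(x) = f(x) + eps * x**2, which *is* strongly convex."""
    x = x0
    for _ in range(steps):
        x -= lr * (grad_f(x) + 2 * eps * x)  # gradient of f_1
    return x

x_star = reduced_gd(grad_f, x0=2.0, eps=1e-3, lr=0.05, steps=5000)
# x_star lands near the true minimiser 0, perturbed by an
# eps-controlled amount.
```

Shrinking \(\epsilon\) trades the perturbation of the solution against the conditioning of \(f_1\), which is the trade-off the quoted passage alludes to.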
Keywords: complementary slackness theorem, high- or very-high-dimensional methods, approximate methods, Lagrange multipliers, primal and dual problems, fixed-point methods, gradient, subgradient, proximal gradient, optimal control problems, convexity, sparsity, ways to avoid wrecking your search for the extrema of perfectly simple little 10,000-parameter functions before everyone observes that you are a fool in the guise of a mathematician; but everyone is not there, because you wandered off the optimal path hours ago, and now you are alone and lost in a valley of lowercase Greek letters.
See also geometry of fitness landscapes, expectation maximisation, matrix factorisations, discrete optimisation, and nature-inspired “metaheuristic” optimisation.
General
Brief intro material
 Luca Trevisan, posts on online optimisation.
 Zeyuan Allen-Zhu: Recent Advances in Stochastic Convex and Non-Convex Optimization. Clear, with good pointers.
 Basic but enlightening: John Nash’s graphical explanation of R’s optimization.
 Martin Jaggi’s Optimization in two hours.
 A celebrated union of optimisation, computational complexity and command-and-control economics, by Cosma Shalizi: In Soviet Union, Optimization Problem Solves You.
Textbooks
Whole free textbooks online. Mostly convex.
 K. Madsen, H.B. Nielsen, O. Tingleff, Methods for Non-Linear Least Squares Problems is super simple for least-squares-type optimisations.
 Aharon Ben-Tal and Arkadi Nemirovski’s lectures on modern convex optimization.
 Arkadi Nemirovski, Interior point polynomial time methods in convex programming.
 Boyd and Vandenberghe’s influential Convex Optimization.
 Bubeck, S. (2014). Convex Optimization: Algorithms and Complexity. arXiv:1405.4980 [cs, math, stat]. Based on Bubeck’s course notes.
 Elad Hazan’s Introduction to Online Convex Optimization.
To file
 Elad Hazan and Satyen Kale’s tutorial on online convex optimisation.
 Suvrit Sra’s eye-bleeding ugly but pertinent introduction to this stuff.
 Francis Bach’s slides on practical ML SGD.
Alternating Direction Method of Multipliers
Dunno. It’s everywhere, though. Maybe this is a problem of definition? (Boyd 2010)
In this review, we argue that the alternating direction method of multipliers is well suited to distributed convex optimization, and in particular to large-scale problems arising in statistics, machine learning, and related areas. The method was developed in the 1970s, with roots in the 1950s, and is equivalent or closely related to many other algorithms, such as dual decomposition, the method of multipliers, Douglas–Rachford splitting, Spingarn’s method of partial inverses, Dykstra’s alternating projections, Bregman iterative algorithms for \(\ell_1\) problems, proximal methods, and others. After briefly surveying the theory and history of the algorithm, we discuss applications to a wide variety of statistical and machine learning problems of recent interest, including the lasso, sparse logistic regression, basis pursuit, covariance selection, support vector machines, and many others. We also discuss general distributed optimization, extensions to the nonconvex setting, and efficient implementation, including some details on distributed MPI and Hadoop MapReduce implementations.
Continuous approximations of iterations
Recent papers (WiWJ16, WiWi15) argue that the discrete time steps of these algorithms can be viewed as discrete approximations to a continuous-time ODE which approaches the optimum (which in itself is trivial), but moreover that many algorithms fit into the same families of ODEs, that these ODEs explain Nesterov acceleration, and that they generate new, improved optimisation methods (which is not trivial).
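The trivial base case is easy to check numerically: plain gradient descent is exactly the forward-Euler discretisation of the gradient flow \(\dot{x} = -\nabla f(x)\). A sketch (my own example, on \(f(x) = x^2/2\)):

```python
import math

def gd(x0, step, n):
    # Gradient descent on f(x) = x**2 / 2, i.e. forward Euler
    # applied to the ODE dx/dt = -f'(x) = -x.
    x = x0
    for _ in range(n):
        x -= step * x
    return x

x0, step, n = 1.0, 0.01, 500
discrete = gd(x0, step, n)              # equals (1 - step) ** n
continuous = x0 * math.exp(-step * n)   # exact ODE solution at t = step * n
# The iterates track the ODE trajectory to O(step).
```

The interesting part of the cited papers is that acceleration and friends correspond to *different*, second-order ODEs, not that this Euler correspondence holds.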
🚧
Optimisation on manifolds
See Nicolas Boumal’s introductory blog post.
Optimization on manifolds is about solving problems of the form
\[\mathrm{minimize}_{x\in\mathcal{M}} f(x),\]
where \(\mathcal{M}\) is a nice, known manifold. By “nice”, I mean a smooth, finite-dimensional Riemannian manifold.
Practical examples include the following (and all possible products of these):
 Euclidean spaces
 The sphere (set of vectors or matrices with unit Euclidean norm)
 The Stiefel manifold (set of orthonormal matrices)
 The Grassmann manifold (set of linear subspaces of a given dimension; this is a quotient space)
 The rotation group (set of orthogonal matrices with determinant +1)
 The manifold of fixed-rank matrices
 The same, further restricted to positive semidefinite matrices
 The cone of (strictly) positive definite matrices
 …
Conceptually, the key point is to think of optimization on manifolds as unconstrained optimization: we do not think of \(\mathcal{M}\) as being embedded in a Euclidean space. Rather, we think of \(\mathcal{M}\) as being “the only thing that exists,” and we strive for intrinsic methods. Besides making for elegant theory, it also makes it clear how to handle abstract spaces numerically (such as the Grassmann manifold for example); and it gives algorithms the “right” invariances (computations do not depend on an arbitrarily chosen representation of the manifold).
There are at least two reasons why this class of problems is getting much attention lately. First, it is because optimization problems over the aforementioned sets (mostly matrix sets) come up pervasively in applications, and at some point it became clear that the intrinsic viewpoint leads to better algorithms, as compared to general-purpose constrained optimization methods (where \(\mathcal{M}\) is considered as being inside a Euclidean space \(\mathcal{E}\), and algorithms move in \(\mathcal{E}\), while penalizing distance to \(\mathcal{M}\)). The second is that, as I will argue momentarily, Riemannian manifolds are “the right setting” to talk about unconstrained optimization. And indeed, there is a beautiful book by [Absil, Sepulchre, Mahony], called Optimization algorithms on matrix manifolds (freely available), that shows how the classical methods for unconstrained optimization (gradient descent, Newton, trust-regions, conjugate gradients…) carry over seamlessly to the more general Riemannian framework.
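To make the intrinsic-methods point concrete, here is a minimal Riemannian gradient descent on the sphere (my own toy, not from the book): minimise \(f(x) = x^\top A x\) over unit vectors by projecting the Euclidean gradient onto the tangent space, then retracting back onto the sphere by normalisation.

```python
import math

# Minimise f(x) = x' A x over the unit circle, with A = diag(1, 3).
# The minimiser is the eigenvector of the smallest eigenvalue,
# i.e. (+/-1, 0), where f = 1.
a = (1.0, 3.0)

def riemannian_step(x, lr):
    g = [2.0 * a[i] * x[i] for i in range(2)]       # Euclidean gradient
    inner = sum(gi * xi for gi, xi in zip(g, x))
    rg = [gi - inner * xi for gi, xi in zip(g, x)]  # project onto tangent space
    y = [xi - lr * rgi for xi, rgi in zip(x, rg)]   # move along the tangent
    norm = math.sqrt(sum(yi * yi for yi in y))
    return [yi / norm for yi in y]                  # retract onto the sphere

x = [math.sqrt(0.5), math.sqrt(0.5)]
for _ in range(200):
    x = riemannian_step(x, lr=0.1)
f_val = a[0] * x[0] ** 2 + a[1] * x[1] ** 2  # approaches 1.0
```

Note there is no penalty term anywhere: the constraint is never violated because the iterates live on the manifold by construction, which is exactly the contrast with the penalised Euclidean methods described above.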
Gradient-free optimization
Not all the methods described here use gradient information, but it’s frequently assumed to be something you can access easily. It’s worth considering which objectives you can optimise easily.
But not all objectives are easily differentiable, even when the parameters are continuous. For example, if you are getting your measurement not from a mathematical model but from a physical experiment, you can’t differentiate it, since reality itself is usually not analytically differentiable. In this latter case you are getting close to a question of online experiment design, as in ANOVA, with the further constraint that your function evaluations are possibly stupendously expensive. See Bayesian optimisation for one approach to this in the context of experiment design.
In situations like this we generally use gradient-free methods, such as simulated annealing, numerical gradients, etc.
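The crudest of these, the numerical gradient, is just finite differences plugged into an ordinary descent loop. A sketch, with a made-up stand-in for a black-box objective:

```python
def num_grad(f, x, h=1e-5):
    # Central finite-difference approximation to the gradient of f at x.
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2.0 * h))
    return g

def black_box(x):
    # Pretend this comes from an experiment we cannot differentiate.
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2

x = [0.0, 0.0]
for _ in range(300):
    g = num_grad(black_box, x)
    x = [xi - 0.1 * gi for xi, gi in zip(x, g)]
# x approaches the minimiser (1, -2).
```

Note the cost: each gradient estimate needs \(2d\) function evaluations in \(d\) dimensions, which is exactly why this is hopeless when evaluations are stupendously expensive and Bayesian optimisation earns its keep.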
“Metaheuristic” methods
Biologically inspired, or just arbitrary: evolutionary algorithms, particle swarm optimisation, ant colony optimisation, harmony search. A lot of the tricks from these are adopted into mainstream stochastic methods. Some are not.
See biomimetic algorithms for the care and husbandry of these.
Annealing and Monte Carlo optimisation methods
Simulated annealing: constructing a process to yield maximally-likely estimates for the parameters. This has a statistical-mechanics justification that makes it attractive to physicists, but it’s generally useful. You don’t necessarily need a gradient here, just the ability to evaluate something interpretable as a “likelihood ratio”. Long story. I don’t yet cover this at Monte Carlo methods, but I should.
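A minimal sketch of the idea (my own toy objective, and an off-the-shelf logarithmic cooling schedule). Notice that only differences \(f(x) - f(y)\), i.e. likelihood-ratio-like quantities \(e^{-(f(y)-f(x))/T}\), are ever used; no gradient appears anywhere.

```python
import math
import random

def anneal(f, x0, steps=20000, t0=1.0, seed=0):
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best, fbest = x, fx
    for k in range(1, steps + 1):
        t = t0 / math.log(k + 1)        # slowly cooling temperature
        y = x + rng.gauss(0.0, 0.5)     # random local proposal
        fy = f(y)
        # Accept downhill moves always; uphill moves with probability
        # exp(-(fy - fx) / t), a likelihood-ratio-like test.
        if fy < fx or rng.random() < math.exp((fx - fy) / t):
            x, fx = y, fy
        if fx < fbest:
            best, fbest = x, fx
    return best, fbest

# A non-convex test objective with global minimum f = 0 at x = 0:
f = lambda x: x * x + 2.0 * math.sin(3.0 * x) ** 2
best, fbest = anneal(f, x0=4.0)
```

Early on, the high temperature lets the chain hop over the \(\sin^2\) barriers; as it cools, it settles into a good basin.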
Elad Hazan, The two cultures of optimization:
The standard curriculum in high school math includes elementary functional analysis, and methods for finding the minima, maxima and saddle points of a single-dimensional function. When moving to high dimensions, this becomes beyond the reach of your typical high-school student: mathematical optimization theory spans a multitude of involved techniques in virtually all areas of computer science and mathematics.
Iterative methods, in particular, are the most successful algorithms for large-scale optimization and the most widely used in machine learning. Of these, most popular are first-order gradient-based methods due to their very low per-iteration complexity.
However, way before these became prominent, physicists needed to solve large-scale optimization problems, since the time of the Manhattan project at Los Alamos. The problems that they faced looked very different, essentially simulation of physical experiments, as were the solutions they developed. The Metropolis algorithm is the basis for randomized optimization methods and Markov Chain Monte Carlo algorithms. […]
In our recent paper (AbHa15), we show that for convex optimization, the heat path and the central path of IPM for a particular barrier function (called the entropic barrier, following the terminology of the recent excellent work of Bubeck and Eldan) are identical! Thus, in some precise sense, the two cultures of optimization have been studying the same object in disguise, using different techniques.
Expectation maximization
Parallel
Classic, basic SGD takes walks through the data set example-wise or feature-wise, but this doesn’t work in parallel, so you tend to go for minibatch gradient descent so that you can at least vectorise. Apparently you can make SGD work in “true” parallel across communication-constrained cores, but I don’t yet understand how.
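The minibatch version is simple enough to sketch; the averaged batch gradient below is the part one would vectorise or farm out across cores (a toy noiseless least-squares problem, my own example):

```python
import random

rng = random.Random(1)
# Synthetic data for the model y = w * x, with true w = 3.
data = [(x, 3.0 * x) for x in (rng.uniform(-1.0, 1.0) for _ in range(1000))]

w, lr, batch_size = 0.0, 0.5, 32
for _ in range(500):
    batch = rng.sample(data, batch_size)
    # Average gradient of (w*x - y)**2 over the minibatch; with numpy
    # this sum would be a single vectorised expression.
    grad = sum(2.0 * (w * x - y) * x for x, y in batch) / batch_size
    w -= lr * grad
# w converges to the true coefficient 3.0.
```

The batch size trades gradient variance against per-step cost, which is the whole vectorisation/parallelisation game in one parameter.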
Implementations
Specialised optimisation software.
See also statistical software, and online optimisation
SPORCO is a Python package for solving optimisation problems with sparsity-inducing regularisation. These consist primarily of sparse coding and dictionary learning problems, including convolutional sparse coding and dictionary learning, but there is also support for other problems such as Total Variation regularisation and Robust PCA. In the current version, all of the optimisation algorithms are based on the Alternating Direction Method of Multipliers (ADMM).
scipy.optimize.minimize: the Python default. Includes many different algorithms that can do whatever you want. Failure modes are opaque, it is offline-only, and it doesn’t support warm restarts, which is a thing for me, but it is a good starting point unless you have reason to prefer others. (I.e. if all your data does not fit in RAM, don’t bother.)
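A minimal call, for the record, on the classic Rosenbrock test function; any other `method` string slots in the same way:

```python
import numpy as np
from scipy.optimize import minimize

def rosen(x):
    # 5-dimensional Rosenbrock function; minimum 0 at x = (1, ..., 1).
    return np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2)

res = minimize(rosen, x0=np.zeros(5), method="BFGS")
# res.x is close to the all-ones vector, res.fun close to 0.
```

No gradient is supplied here, so BFGS falls back to finite differences, which is convenient but one of the places where failure modes get opaque.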

SPAMS (SPArse Modeling Software) is an optimization toolbox for solving various sparse estimation problems: dictionary learning and matrix factorization (NMF, sparse PCA, …); solving sparse decomposition problems with LARS, coordinate descent, OMP, SOMP, proximal methods; solving structured sparse decomposition problems (\(\ell_1/\ell_2\), \(\ell_1/\ell_\infty\), sparse group lasso, tree-structured regularization, structured sparsity with overlapping groups, …). It is developed by Julien Mairal, with the collaboration of Francis Bach, Jean Ponce, Guillermo Sapiro, Rodolphe Jenatton and Guillaume Obozinski. It is coded in C++ with a Matlab interface. Recently, interfaces for R and Python have been developed by Jean-Paul Chieze (INRIA), and archetypal analysis was written by Yuansi Chen (UC Berkeley).

…is a user-friendly interface to several conic and integer programming solvers, very much like YALMIP or CVX under MATLAB.
The main motivation for PICOS is to have the possibility to enter an optimization problem as a high-level model, and to be able to solve it with several different solvers. Multidimensional and matrix variables are handled in a natural fashion, which makes it painless to formulate an SDP or an SOCP. This is very useful for educational purposes, and to quickly implement some models and test their validity on simple examples.
It also maintains a list of other solvers.
Manifold optimisation implementations:

… is a free software package for convex optimization based on the Python programming language. It can be used with the interactive Python interpreter, on the command line by executing Python scripts, or integrated in other software via Python extension modules. Its main purpose is to make the development of software for convex optimization applications straightforward by building on Python’s extensive standard library and on the strengths of Python as a high-level programming language. […]
efficient Python classes for dense and sparse matrices (real and complex), with Python indexing and slicing and overloaded operations for matrix arithmetic
an interface to most of the double-precision real and complex BLAS
an interface to LAPACK routines for solving linear equations and least-squares problems, matrix factorisations (LU, Cholesky, LDLT and QR), symmetric eigenvalue and singular value decomposition, and Schur factorization
an interface to the fast Fourier transform routines from FFTW
interfaces to the sparse LU and Cholesky solvers from UMFPACK and CHOLMOD
routines for linear, second-order cone, and semidefinite programming problems
routines for nonlinear convex optimization
interfaces to the linear programming solver in GLPK, the semidefinite programming solver in DSDP5, and the linear, quadratic and second-order cone programming solvers in MOSEK
a modeling tool for specifying convex piecewise-linear optimization problems.
It seems to reinvent half of numpy and scipy. It also seems to be used by all the other Python packages.

Pyomo is a Python-based open-source software package that supports a diverse set of optimization capabilities for formulating, solving, and analyzing optimization models.
A core capability of Pyomo is modeling structured optimization applications. Pyomo can be used to define general symbolic problems, create specific problem instances, and solve these instances using commercial and open-source solvers. Pyomo’s modeling objects are embedded within a full-featured high-level programming language providing a rich set of supporting libraries, which distinguishes Pyomo from other algebraic modeling languages like AMPL, AIMMS and GAMS. …
Pyomo was formerly released as the Coopr software library.
…is a Python-embedded modeling language for convex optimization problems. It allows you to express your problem in a natural way that follows the math, rather than in the restrictive standard form required by solvers.
So it’s a DSL for convex constraint programming. It can be extended heuristically to non-convex constraints by…
… is a package for modeling and solving problems with convex objectives and decision variables from a non-convex set. This package provides heuristics such as NC-ADMM (a variation of the alternating direction method of multipliers for non-convex problems) and relax-round-polish, which can be viewed as a majorization-minimization algorithm. The solver methods provided and the syntax for constructing problems are discussed in our associated paper.

… is a free/open-source library for nonlinear optimization, providing a common interface for a number of different free optimization routines available online, as well as original implementations of various other algorithms. Its features include:
Callable from C, C++, Fortran, Matlab or GNU Octave, Python, GNU Guile, Julia, GNU R, Lua, and OCaml.
A common interface for many different algorithms: try a different algorithm just by changing one parameter.
Support for large-scale optimization (some algorithms scalable to millions of parameters and thousands of constraints)…
Algorithms using function values only (derivative-free) and also algorithms exploiting user-supplied gradients.

…(pronounced tee-fox) provides a set of Matlab templates, or building blocks, that can be used to construct efficient, customized solvers for a variety of convex models, including in particular those employed in sparse recovery applications. It was conceived and written by Stephen Becker, Emmanuel J. Candès and Michael Grant.
Stan is famous for Monte Carlo sampling, but it also does deterministic optimisation using automatic differentiation. This is a luxurious “full service” option, although with limited scope for customisation; I am curious how it performs in very high dimensions, as L-BFGS does not scale forever.
Optimization algorithms:
Limited-memory BFGS (Stan’s default optimization algorithm)
BFGS
Laplace’s method for classical standard error estimates and approximate Bayesian posteriors
Optim.jl is a generic optimizer for Julia.
JuMP.jl is a domain-specific modeling language for mathematical optimization embedded in Julia. It currently supports a number of open-source and commercial solvers (Bonmin, Cbc, Clp, Couenne, CPLEX, ECOS, FICO Xpress, GLPK, Gurobi, Ipopt, KNITRO, MOSEK, NLopt, SCS, BARON) for a variety of problem classes, including linear programming, (mixed) integer programming, second-order conic programming, semidefinite programming, and nonlinear programming.
NLsolve.jl solves systems of nonlinear equations. […]
The package is also able to solve mixed complementarity problems, which are similar to systems of nonlinear equations, except that the equality to zero is allowed to become an inequality if some boundary condition is satisfied. See further below for a formal definition and the related commands.
Since there is some overlap between optimizers and nonlinear solvers, this package borrows some ideas from the Optim package, and depends on it for line-search algorithms.
Many of these solvers optionally use commercial backends such as Mosek.
To file
Miscellaneous optimisation techniques suggested on LinkedIn
The whole world of exotic specialised optimisers. See, e.g., Nuit Blanche name-dropping Bregman iteration, alternating methods, augmented Lagrangians…
Primal/dual problems
🚧
Majorization-minimization
🚧
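Pending a proper write-up, a one-line illustration of the majorize-minimize idea (my own example): to minimise \(F(x) = |x| + (x-1)^2\), repeatedly replace \(|x|\) with the quadratic majorizer \(x^2/(2|x_k|) + |x_k|/2\) (an upper bound, tight at \(x_k\)) and minimise the resulting surrogate in closed form. The true minimiser is the soft-threshold value \(x^\star = 1/2\).

```python
# Majorize-minimize for F(x) = |x| + (x - 1)**2.
x = 1.0
for _ in range(100):
    # Closed-form minimiser of the quadratic surrogate
    # x**2 / (2*abs(x_k)) + (x - 1)**2:
    x = 2.0 * abs(x) / (2.0 * abs(x) + 1.0)
# x converges to 0.5.
```

Each surrogate touches \(F\) at the current iterate and lies above it everywhere else, so each closed-form minimisation is guaranteed not to increase \(F\); that monotone-descent property is the whole trick.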
Difference-of-convex objectives
When your objective function is not convex but you can represent it in terms of convex functions, somehow or other, use DC (difference-of-convex) optimisation (Gasso, Rakotomamonjy, and Canu 2009). (I don’t think this guarantees you a global optimum, but rather faster convergence to a local one.)
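A skeletal example of the DC iteration (my own toy, illustrating exactly that local-optimum caveat): write \(f = g - h\) with \(g, h\) convex, linearise \(h\) at the current iterate, and solve the resulting convex subproblem.

```python
# Minimise f(x) = x**2 - abs(x), a difference of the convex functions
# g(x) = x**2 and h(x) = abs(x). Local minima sit at x = +/- 0.5,
# with f = -0.25.
def dca_step(xk):
    s = 1.0 if xk >= 0 else -1.0   # a subgradient of h(x) = abs(x) at xk
    # argmin_x of x**2 - s * x, the convex surrogate (closed form):
    return s / 2.0

x = 2.0
for _ in range(10):
    x = dca_step(x)
f_val = x * x - abs(x)
# Converges to x = 0.5 on this side; starting at x = -2.0 would land
# on -0.5 instead. The iteration is local.
```

In general, each step solves a convex problem whose solution does not increase \(f\), so you get cheap monotone descent to a critical point, not a certificate of global optimality.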
Refs
Abadi, Martín, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, and Matthieu Devin. 2016. “TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems.” arXiv Preprint arXiv:1603.04467. http://arxiv.org/abs/1603.04467.
Abernethy, Jacob, and Elad Hazan. 2015. “Faster Convex Optimization: Simulated Annealing with an Efficient Universal Barrier,” July. http://arxiv.org/abs/1507.02528.
Achlioptas, Dimitris, Assaf Naor, and Yuval Peres. 2005. “Rigorous Location of Phase Transitions in Hard Optimization Problems.” Nature 435 (7043): 759–64. https://doi.org/10.1038/nature03602.
Agarwal, Alekh, Olivier Chapelle, Miroslav Dudík, and John Langford. 2014. “A Reliable Effective Terascale Linear Learning System.” Journal of Machine Learning Research 15 (1): 1111–33. http://www.jmlr.org/papers/volume15/agarwal14a/agarwal14a.pdf.
Agarwal, Naman, Brian Bullins, and Elad Hazan. 2016. “Second Order Stochastic Optimization in Linear Time,” February. http://arxiv.org/abs/1602.03943.
Allen-Zhu, Zeyuan, and Elad Hazan. 2016. “Optimal Black-Box Reductions Between Optimization Objectives.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 1606–14. Curran Associates, Inc. http://papers.nips.cc/paper/6364optimalblackboxreductionsbetweenoptimizationobjectives.pdf.
Arnold, Sébastien M. R., and Chunming Wang. 2017. “Accelerating SGD for Distributed Deep-Learning Using Approximated Hessian Matrix.” http://arxiv.org/abs/1709.05069.
Ba, Jimmy, Roger Grosse, and James Martens. 2016. “Distributed Second-Order Optimization Using Kronecker-Factored Approximations,” November. https://openreview.net/forum?id=SkkTMpjex.
Bach, Francis. 2013. “Convex Relaxations of Structured Matrix Factorizations,” September. http://arxiv.org/abs/1309.3117.
Bach, Francis, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. 2012. “Optimization with Sparsity-Inducing Penalties.” Foundations and Trends® in Machine Learning 4 (1): 1–106. https://doi.org/10.1561/2200000015.
Bach, Francis, and Eric Moulines. 2011. “Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning.” In Advances in Neural Information Processing Systems (NIPS). Spain. http://hal.archivesouvertes.fr/hal00608041.
Bach, Francis R., and Eric Moulines. 2013. “Non-Strongly-Convex Smooth Stochastic Approximation with Convergence Rate O(1/n).” In arXiv:1306.2119 [cs, math, stat], 773–81. https://arxiv.org/abs/1306.2119v1.
Battiti, Roberto. 1992. “First- and Second-Order Methods for Learning: Between Steepest Descent and Newton’s Method.” Neural Computation 4 (2): 141–66. https://doi.org/10.1162/neco.1992.4.2.141.
Beck, Amir, and Marc Teboulle. 2009. “A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems.” SIAM Journal on Imaging Sciences 2 (1): 183–202. https://doi.org/10.1137/080716542.
Bertsimas, D., and I. Popescu. 2005. “Optimal Inequalities in Probability Theory: A Convex Optimization Approach.” SIAM Journal on Optimization 15 (3): 780–804. https://doi.org/10.1137/S1052623401399903.
Bian, Wei, Xiaojun Chen, and Yinyu Ye. 2014. “Complexity Analysis of Interior Point Algorithms for Non-Lipschitz and Nonconvex Minimization.” Mathematical Programming 149 (1–2): 301–27. https://doi.org/10.1007/s1010701407535.
Bordes, Antoine, Léon Bottou, and Patrick Gallinari. 2009. “SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent.” Journal of Machine Learning Research 10 (December): 1737–54. http://jmlr.org/papers/volume10/bordes09a/bordes09a.pdf.
Botev, Aleksandar, Guy Lever, and David Barber. 2016. “Nesterov’s Accelerated Gradient and Momentum as Approximations to Regularised Update Descent,” July. http://arxiv.org/abs/1607.01981.
Botev, Zdravko I., and Chris J. Lloyd. 2015. “Importance Accelerated Robbins-Monro Recursion with Applications to Parametric Confidence Limits.” Electronic Journal of Statistics 9 (2): 2058–75. https://doi.org/10.1214/15EJS1071.
Bottou, Léon. 1991. “Stochastic Gradient Learning in Neural Networks.” In Proceedings of Neuro-Nîmes 91. Nimes, France: EC2. http://leon.bottou.org/papers/bottou91c.
———. 1998. “Online Algorithms and Stochastic Approximations.” In Online Learning and Neural Networks, edited by David Saad, 17:142. Cambridge, UK: Cambridge University Press. http://leon.bottou.org/publications/pdf/online1998.pdf.
———. 2010. “Large-Scale Machine Learning with Stochastic Gradient Descent.” In Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT’2010), 177–86. Paris, France: Springer. http://leon.bottou.org/papers/bottou2010.
———. 2012. “Stochastic Gradient Descent Tricks.” In Neural Networks: Tricks of the Trade, 421–36. Lecture Notes in Computer Science. Springer, Berlin, Heidelberg. https://doi.org/10.1007/9783642352898_25.
Bottou, Léon, and Olivier Bousquet. 2008. “The Tradeoffs of Large Scale Learning.” In Advances in Neural Information Processing Systems, edited by J.C. Platt, D. Koller, Y. Singer, and S. Roweis, 20:161–68. NIPS Foundation (http://books.nips.cc). http://leon.bottou.org/papers/bottoubousquet2008.
Bottou, Léon, Frank E. Curtis, and Jorge Nocedal. 2016. “Optimization Methods for Large-Scale Machine Learning,” June. http://arxiv.org/abs/1606.04838.
Bottou, Léon, and Yann LeCun. 2004. “Large Scale Online Learning.” In Advances in Neural Information Processing Systems 16, edited by Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf. Cambridge, MA: MIT Press. http://leon.bottou.org/papers/bottoulecun2004.
Boumal, Nicolas, Bamdev Mishra, P.-A. Absil, and Rodolphe Sepulchre. 2014. “Manopt, a Matlab Toolbox for Optimization on Manifolds.” Journal of Machine Learning Research 15: 1455–9. http://jmlr.org/papers/v15/boumal14a.html.
Boyd, Stephen. 2010. “Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers.” Foundations and Trends® in Machine Learning 3 (1): 1–122. https://doi.org/10.1561/2200000016.
Boyd, Stephen P., and Lieven Vandenberghe. 2004. Convex Optimization. Cambridge, UK; New York: Cambridge University Press. https://web.stanford.edu/~boyd/cvxbook/.
Bubeck, Sébastien. 2015. “Convex Optimization: Algorithms and Complexity.” Foundations and Trends® in Machine Learning 8 (3–4): 231–357. https://doi.org/10.1561/2200000050.
Bubeck, Sébastien, and Ronen Eldan. 2014. “The Entropic Barrier: A Simple and Optimal Universal Self-Concordant Barrier,” December. http://arxiv.org/abs/1412.1587.
Cevher, Volkan, Stephen Becker, and Mark Schmidt. 2014. “Convex Optimization for Big Data.” IEEE Signal Processing Magazine 31 (5): 32–43. https://doi.org/10.1109/MSP.2014.2329397.
Chartrand, R., and Wotao Yin. 2008. “Iteratively Reweighted Algorithms for Compressive Sensing.” In IEEE International Conference on Acoustics, Speech and Signal Processing, 2008. ICASSP 2008, 3869–72. https://doi.org/10.1109/ICASSP.2008.4518498.
Chen, Xiaojun. 2012. “Smoothing Methods for Nonsmooth, Nonconvex Minimization.” Mathematical Programming 134 (1): 71–99. https://doi.org/10.1007/s1010701205690.
Chen, Xiaojun, Dongdong Ge, Zizhuo Wang, and Yinyu Ye. 2012. “Complexity of Unconstrained \(L_2\)-\(L_p\) Minimization.” Mathematical Programming 143 (1–2): 371–83. https://doi.org/10.1007/s1010701206130.
Cho, Minhyung, Chandra Shekhar Dhir, and Jaehyung Lee. 2015. “Hessian-Free Optimization for Learning Deep Multidimensional Recurrent Neural Networks.” In Advances in Neural Information Processing Systems. http://arxiv.org/abs/1509.03475.
Choromanska, Anna, Mikael Henaff, Michael Mathieu, Gerard Ben Arous, and Yann LeCun. 2015. “The Loss Surfaces of Multilayer Networks.” In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, 192–204. http://proceedings.mlr.press/v38/choromanska15.html.
Chretien, Stephane. 2008. “An Alternating L1 Approach to the Compressed Sensing Problem,” September. http://arxiv.org/abs/0809.0660.
Combettes, Patrick L., and Jean-Christophe Pesquet. 2008. “A Proximal Decomposition Method for Solving Convex Variational Inverse Problems.” Inverse Problems 24 (6): 065014. https://doi.org/10.1088/02665611/24/6/065014.
Dalalyan, Arnak S. 2017. “Further and Stronger Analogy Between Sampling and Optimization: Langevin Monte Carlo and Gradient Descent,” April. http://arxiv.org/abs/1704.04752.
Dauphin, Yann, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. 2014. “Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization.” In Advances in Neural Information Processing Systems 27, 2933–41. Curran Associates, Inc. http://arxiv.org/abs/1406.2572.
Defazio, Aaron, Francis Bach, and Simon LacosteJulien. 2014. â€śSAGA: A Fast Incremental Gradient Method with Support for NonStrongly Convex Composite Objectives.â€ť In Advances in Neural Information Processing Systems 27. http://arxiv.org/abs/1407.0202.
Dennis, J. E., and Robert B. Schnabel. 1989. â€śChapter I A View of Unconstrained Optimization.â€ť In Handbooks in Operations Research and Management Science, 1:1â€“72. Optimization. Elsevier. https://doi.org/10.1016/S09270507(89)010029.
Dennis, J., and R. Schnabel. 1996. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Vol. 16. Classics in Applied Mathematics. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9781611971200.
DeVore, Ronald A. 1998. “Nonlinear Approximation.” Acta Numerica 7 (January): 51–150. https://doi.org/10.1017/S0962492900002816.
Ding, Lijun, and Madeleine Udell. 2018. “Frank-Wolfe Style Algorithms for Large Scale Optimization,” August. https://arxiv.org/abs/1808.05274v1.
Duchi, John, Elad Hazan, and Yoram Singer. 2011. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” Journal of Machine Learning Research 12 (Jul): 2121–59. http://www.jmlr.org/papers/v12/duchi11a.html.
Fletcher, R., and M. J. D. Powell. 1963. “A Rapidly Convergent Descent Method for Minimization.” The Computer Journal 6 (2): 163–68. https://doi.org/10.1093/comjnl/6.2.163.
Forrester, Alexander I. J., and Andy J. Keane. 2009. “Recent Advances in Surrogate-Based Optimization.” Progress in Aerospace Sciences 45 (1–3): 50–79. https://doi.org/10.1016/j.paerosci.2008.11.001.
Friedlander, Michael P., and Mark Schmidt. 2012. “Hybrid Deterministic-Stochastic Methods for Data Fitting.” SIAM Journal on Scientific Computing 34 (3): A1380–A1405. https://doi.org/10.1137/110830629.
Friedman, Jerome H. 2002. “Stochastic Gradient Boosting.” Computational Statistics & Data Analysis, Nonlinear Methods and Data Mining, 38 (4): 367–78. https://doi.org/10.1016/S0167-9473(01)00065-2.
Friedman, Jerome, Trevor Hastie, Holger Höfling, and Robert Tibshirani. 2007. “Pathwise Coordinate Optimization.” The Annals of Applied Statistics 1 (2): 302–32. https://doi.org/10.1214/07-AOAS131.
Friedman, Jerome, Trevor Hastie, and Rob Tibshirani. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33 (1): 1–22. https://doi.org/10.18637/jss.v033.i01.
Gasso, G., A. Rakotomamonjy, and S. Canu. 2009. “Recovering Sparse Signals with a Certain Family of Nonconvex Penalties and DC Programming.” IEEE Transactions on Signal Processing 57 (12): 4686–98. https://doi.org/10.1109/TSP.2009.2026004.
Ghadimi, Saeed, and Guanghui Lan. 2013a. “Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming.” SIAM Journal on Optimization 23 (4): 2341–68. https://doi.org/10.1137/120880811.
———. 2013b. “Accelerated Gradient Methods for Nonconvex Nonlinear and Stochastic Programming,” October. http://arxiv.org/abs/1310.3787.
Giles, Mike B. 2008. “Collected Matrix Derivative Results for Forward and Reverse Mode Algorithmic Differentiation.” In Advances in Automatic Differentiation, edited by Christian H. Bischof, H. Martin Bücker, Paul Hovland, Uwe Naumann, and Jean Utke, 64:35–44. Berlin, Heidelberg: Springer Berlin Heidelberg. http://eprints.maths.ox.ac.uk/1079/.
Goldstein, A. 1965. “On Steepest Descent.” Journal of the Society for Industrial and Applied Mathematics Series A Control 3 (1): 147–51. https://doi.org/10.1137/0303013.
Goldstein, Tom, Christoph Studer, and Richard Baraniuk. 2015. “FASTA: A Generalized Implementation of Forward-Backward Splitting,” January. http://arxiv.org/abs/1501.04979.
Han, Zhong-Hua, and Ke-Shi Zhang. 2012. “Surrogate-Based Optimization.” http://cdn.intechopen.com/pdfs/30305/InTech-Surrogate_based_optimization.pdf.
Harchaoui, Zaid, Anatoli Juditsky, and Arkadi Nemirovski. 2015. “Conditional Gradient Algorithms for Norm-Regularized Smooth Convex Optimization.” Mathematical Programming 152 (1–2): 75–112. https://doi.org/10.1007/s10107-014-0778-9.
Hazan, Elad, Kfir Levy, and Shai Shalev-Shwartz. 2015. “Beyond Convexity: Stochastic Quasi-Convex Optimization.” In Advances in Neural Information Processing Systems 28, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 1594–1602. Curran Associates, Inc. http://papers.nips.cc/paper/5718-beyond-convexity-stochastic-quasi-convex-optimization.pdf.
Heyde, C. C. 1974. “On Martingale Limit Theory and Strong Convergence Results for Stochastic Approximation Procedures.” Stochastic Processes and Their Applications 2 (4): 359–70. https://doi.org/10.1016/0304-4149(74)90004-0.
Hinton, Geoffrey, Nitish Srivastava, and Kevin Swersky. n.d. “Neural Networks for Machine Learning.”
Hosseini, Reshad, and Suvrit Sra. 2015. “Manifold Optimization for Gaussian Mixture Models.” arXiv Preprint arXiv:1506.07677. http://arxiv.org/abs/1506.07677.
Hu, Chonghai, Weike Pan, and James T. Kwok. 2009. “Accelerated Gradient Methods for Stochastic Optimization and Online Learning.” In Advances in Neural Information Processing Systems, 781–89. Curran Associates, Inc. http://papers.nips.cc/paper/3817-accelerated-gradient-methods-for-stochastic-optimization-and-online-learning.
Jaggi, Martin. 2013. “Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization.” In Journal of Machine Learning Research, 427–35. http://jmlr.csail.mit.edu/proceedings/papers/v28/jaggi13.html.
Jakovetic, D., J. M. Freitas Xavier, and J. M. F. Moura. 2014. “Convergence Rates of Distributed Nesterov-Like Gradient Methods on Random Networks.” IEEE Transactions on Signal Processing 62 (4): 868–82. https://doi.org/10.1109/TSP.2013.2291221.
Kim, Daeun, and Justin P. Haldar. 2016. “Greedy Algorithms for Nonnegativity-Constrained Simultaneous Sparse Recovery.” Signal Processing 125 (August): 274–89. https://doi.org/10.1016/j.sigpro.2016.01.021.
Kingma, Diederik, and Jimmy Ba. 2015. “Adam: A Method for Stochastic Optimization.” Proceedings of ICLR. http://arxiv.org/abs/1412.6980.
Klein, Dan. 2004. “Lagrange Multipliers Without Permanent Scarring.” University of California at Berkeley, Computer Science Division. http://www.ee.columbia.edu/~vittorio/LagrangeMultipliersKlein.pdf.
Lai, Tze Leung. 2003. “Stochastic Approximation.” The Annals of Statistics 31 (2): 391–406. https://doi.org/10.1214/aos/1051027873.
Langford, John, Lihong Li, and Tong Zhang. 2009. “Sparse Online Learning via Truncated Gradient.” In Advances in Neural Information Processing Systems 21, edited by D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, 905–12. Curran Associates, Inc. http://papers.nips.cc/paper/3585-sparse-online-learning-via-truncated-gradient.pdf.
Levy, Kfir Y. 2016. “The Power of Normalization: Faster Evasion of Saddle Points,” November. http://arxiv.org/abs/1611.04831.
Li, Yuanzhi, Yingyu Liang, and Andrej Risteski. 2016. “Recovery Guarantee of Non-Negative Matrix Factorization via Alternating Updates.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 4988–96. Curran Associates, Inc. http://papers.nips.cc/paper/6417-recovery-guarantee-of-non-negative-matrix-factorization-via-alternating-updates.pdf.
Lin, Hongzhou, Julien Mairal, and Zaid Harchaoui. 2016. “QuickeNing: A Generic Quasi-Newton Algorithm for Faster Gradient-Based Optimization.” http://arxiv.org/abs/1610.00960.
Lucchi, Aurelien, Brian McWilliams, and Thomas Hofmann. 2015. “A Variance Reduced Stochastic Newton Method,” March. http://arxiv.org/abs/1503.08316.
Ma, Chenxin, Jakub Konečný, Martin Jaggi, Virginia Smith, Michael I. Jordan, Peter Richtárik, and Martin Takáč. 2015. “Distributed Optimization with Arbitrary Local Solvers.” arXiv Preprint arXiv:1512.04039. http://arxiv.org/abs/1512.04039.
Ma, Siyuan, and Mikhail Belkin. 2017. “Diving into the Shallows: A Computational Perspective on Large-Scale Shallow Learning,” March. http://arxiv.org/abs/1703.10622.
Madsen, K., H. B. Nielsen, and O. Tingleff. 2004. “Methods for Non-Linear Least Squares Problems.” http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3215/pdf/imm3215.pdf.
Mairal, J. 2015. “Incremental Majorization-Minimization Optimization with Application to Large-Scale Machine Learning.” SIAM Journal on Optimization 25 (2): 829–55. https://doi.org/10.1137/140957639.
Mairal, Julien. 2013a. “Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization.” In Advances in Neural Information Processing Systems, 2283–91. http://papers.nips.cc/paper/5129-stochastic-majorization-minimization-algorithms-for-large-scale-optimization.
———. 2013b. “Optimization with First-Order Surrogate Functions.” In International Conference on Machine Learning, 783–91. http://proceedings.mlr.press/v28/mairal13.html.
Mairal, Julien, Francis Bach, and Jean Ponce. 2014. “Sparse Modeling for Image and Vision Processing.” Foundations and Trends® in Computer Graphics and Vision 8 (2–3): 85–283. https://doi.org/10.1561/0600000058.
Martens, James. 2010. “Deep Learning via Hessian-Free Optimization.” In Proceedings of the 27th International Conference on International Conference on Machine Learning, 735–42. ICML’10. USA: Omnipress. http://www.cs.utoronto.ca/~jmartens/docs/Deep_HessianFree.pdf.
Martens, James, and Ilya Sutskever. 2011. “Learning Recurrent Neural Networks with Hessian-Free Optimization.” In Proceedings of the 28th International Conference on International Conference on Machine Learning, 1033–40. ICML’11. USA: Omnipress. http://dl.acm.org/citation.cfm?id=3104482.3104612.
———. 2012. “Training Deep and Recurrent Networks with Hessian-Free Optimization.” In Neural Networks: Tricks of the Trade, 479–535. Lecture Notes in Computer Science. Springer. http://www.cs.toronto.edu/~jmartens/docs/HF_book_chapter.pdf.
Mattingley, J., and S. Boyd. 2010. “Real-Time Convex Optimization in Signal Processing.” IEEE Signal Processing Magazine 27 (3): 50–61. https://doi.org/10.1109/MSP.2010.936020.
Mcleod, Doug, Garry Emmerson, Robert Kohn, and Geoff Kingston. 2008. “Finding the Invisible Hand: An Objective Model of Financial Markets.”
McMahan, H. Brendan, Gary Holt, D. Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, et al. 2013. “Ad Click Prediction: A View from the Trenches.” In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1222–30. KDD ’13. New York, NY, USA: ACM. https://doi.org/10.1145/2487575.2488200.
Mei, Song, Yu Bai, and Andrea Montanari. 2016. “The Landscape of Empirical Risk for Non-Convex Losses,” July. http://arxiv.org/abs/1607.06534.
Mitliagkas, Ioannis, Ce Zhang, Stefan Hadjis, and Christopher Ré. 2016. “Asynchrony Begets Momentum, with an Application to Deep Learning,” May. http://arxiv.org/abs/1605.09774.
Nesterov, Y. 2012. “Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems.” SIAM Journal on Optimization 22 (2): 341–62. https://doi.org/10.1137/100802001.
Nesterov, Yu. 2007. “Accelerating the Cubic Regularization of Newton’s Method on Convex Problems.” Mathematical Programming 112 (1): 159–81. https://doi.org/10.1007/s10107-006-0089-x.
———. 2012. “Gradient Methods for Minimizing Composite Functions.” Mathematical Programming 140 (1): 125–61. https://doi.org/10.1007/s10107-012-0629-5.
Nesterov, Yurii. 2004. Introductory Lectures on Convex Optimization. Vol. 87. Applied Optimization. Boston, MA: Springer US. http://link.springer.com/10.1007/978-1-4419-8853-9.
Nguyen, Lam M., Jie Liu, Katya Scheinberg, and Martin Takáč. 2017. “Stochastic Recursive Gradient Algorithm for Nonconvex Optimization,” May. http://arxiv.org/abs/1705.07261.
Nocedal, Jorge, and S. Wright. 2006. Numerical Optimization. 2nd ed. Springer Series in Operations Research and Financial Engineering. New York: Springer-Verlag. https://www.springer.com/gp/book/9780387303031.
Parikh, Neal, and Stephen Boyd. 2014. “Proximal Algorithms.” Foundations and Trends® in Optimization 1 (3): 127–239. https://doi.org/10.1561/2400000003.
Patel, Vivak. 2017. “On SGD’s Failure in Practice: Characterizing and Overcoming Stalling,” February. http://arxiv.org/abs/1702.00317.
Pilanci, Mert, and Martin J. Wainwright. 2016. “Iterative Hessian Sketch: Fast and Accurate Solution Approximation for Constrained Least-Squares.” Journal of Machine Learning Research 17 (53): 1–38. http://www.jmlr.org/papers/volume17/14-460/14-460.pdf.
Polyak, B. T., and A. B. Juditsky. 1992. “Acceleration of Stochastic Approximation by Averaging.” SIAM Journal on Control and Optimization 30 (4): 838–55. https://doi.org/10.1137/0330046.
Portnoy, Stephen, and Roger Koenker. 1997. “The Gaussian Hare and the Laplacian Tortoise: Computability of Squared-Error Versus Absolute-Error Estimators.” Statistical Science 12 (4): 279–300. https://doi.org/10.1214/ss/1030037960.
Queipo, Nestor V., Raphael T. Haftka, Wei Shyy, Tushar Goel, Rajkumar Vaidyanathan, and P. Kevin Tucker. 2005. “Surrogate-Based Analysis and Optimization.” Progress in Aerospace Sciences 41 (1): 1–28. https://doi.org/10.1016/j.paerosci.2005.02.001.
Reddi, Sashank J., Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. 2016. “Stochastic Variance Reduction for Nonconvex Optimization.” In PMLR, 1603:314–23. http://proceedings.mlr.press/v48/reddi16.html.
Robbins, Herbert, and Sutton Monro. 1951. “A Stochastic Approximation Method.” The Annals of Mathematical Statistics 22 (3): 400–407. https://doi.org/10.1214/aoms/1177729586.
Robbins, H., and D. Siegmund. 1971. “A Convergence Theorem for Non Negative Almost Supermartingales and Some Applications.” In Optimizing Methods in Statistics, edited by Jagdish S. Rustagi, 233–57. Academic Press. https://doi.org/10.1016/B978-0-12-604550-5.50015-8.
Rosset, Saharon, and Ji Zhu. 2007. “Piecewise Linear Regularized Solution Paths.” The Annals of Statistics 35 (3): 1012–30. https://doi.org/10.1214/009053606000001370.
Ruder, Sebastian. 2016. “An Overview of Gradient Descent Optimization Algorithms,” September. http://arxiv.org/abs/1609.04747.
Ruppert, David. 1985. “A Newton-Raphson Version of the Multivariate Robbins-Monro Procedure.” The Annals of Statistics 13 (1): 236–45. https://doi.org/10.1214/aos/1176346589.
Sagun, Levent, V. Ugur Guney, Gerard Ben Arous, and Yann LeCun. 2014. “Explorations on High Dimensional Landscapes,” December. http://arxiv.org/abs/1412.6615.
Salimans, Tim, and Diederik P. Kingma. 2016. “Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 901–9. Curran Associates, Inc. http://papers.nips.cc/paper/6114-weight-normalization-a-simple-reparameterization-to-accelerate-training-of-deep-neural-networks.pdf.
Reddi, Sashank J., Suvrit Sra, Barnabás Póczós, and Alex Smola. 2016. “Stochastic Frank-Wolfe Methods for Nonconvex Optimization.” https://arxiv.org/abs/1607.08254.
Schmidt, Mark, Glenn Fung, and Romer Rosales. 2009. “Optimization Methods for L1-Regularization.” University of British Columbia, Technical Report TR-2009-19.
Schraudolph, Nicol N. 2002. “Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent.” Neural Computation 14 (7): 1723–38. https://doi.org/10.1162/08997660260028683.
Shalev-Shwartz, Shai, and Ambuj Tewari. 2011. “Stochastic Methods for L1-Regularized Loss Minimization.” Journal of Machine Learning Research 12 (July): 1865–92. http://www.machinelearning.org/archive/icml2009/papers/262.pdf.
Shi, Hao-Jun Michael, Shenyinying Tu, Yangyang Xu, and Wotao Yin. 2016. “A Primer on Coordinate Descent Algorithms,” September. http://arxiv.org/abs/1610.00040.
Simon, Noah, Jerome Friedman, Trevor Hastie, and Rob Tibshirani. 2011. “Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent.” Journal of Statistical Software 39 (5). http://www.jstatsoft.org/v39/i05/paper.
Smith, Virginia, Simone Forte, Michael I. Jordan, and Martin Jaggi. 2015. “L1-Regularized Distributed Optimization: A Communication-Efficient Primal-Dual Framework,” December. http://arxiv.org/abs/1512.04011.
Spall, J. C. 2000. “Adaptive Stochastic Approximation by the Simultaneous Perturbation Method.” IEEE Transactions on Automatic Control 45 (10): 1839–53. https://doi.org/10.1109/TAC.2000.880982.
Straszak, Damian, and Nisheeth K. Vishnoi. 2016. “IRLS and Slime Mold: Equivalence and Convergence,” January. http://arxiv.org/abs/1601.02712.
Thisted, Ronald A. 1997. “[The Gaussian Hare and the Laplacian Tortoise: Computability of Squared-Error Versus Absolute-Error Estimators]: Comment.” Statistical Science 12 (4): 296–98.
Townsend, James, Niklas Koep, and Sebastian Weichwald. 2016. “Pymanopt: A Python Toolbox for Optimization on Manifolds Using Automatic Differentiation.” Journal of Machine Learning Research 17 (137): 1–5. http://jmlr.org/papers/v17/16-177.html.
Vishwanathan, S. V. N., Nicol N. Schraudolph, Mark W. Schmidt, and Kevin P. Murphy. 2006. “Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods.” In Proceedings of the 23rd International Conference on Machine Learning.
Wainwright, Martin J. 2014. “Structured Regularizers for High-Dimensional Problems: Statistical and Computational Issues.” Annual Review of Statistics and Its Application 1 (1): 233–53. https://doi.org/10.1146/annurev-statistics-022513-115643.
Wibisono, Andre, and Ashia C. Wilson. 2015. “On Accelerated Methods in Optimization,” September. http://arxiv.org/abs/1509.03616.
Wibisono, Andre, Ashia C. Wilson, and Michael I. Jordan. 2016. “A Variational Perspective on Accelerated Methods in Optimization.” Proceedings of the National Academy of Sciences 113 (47): E7351–E7358. https://doi.org/10.1073/pnas.1614734113.
Wipf, David, and Srikantan Nagarajan. 2016. “Iterative Reweighted L1 and L2 Methods for Finding Sparse Solution.” Microsoft Research, July. https://www.microsoft.com/en-us/research/publication/iterative-reweighted-l1-l2-methods-finding-sparse-solution/.
Wright, S. J., R. D. Nowak, and M. A. T. Figueiredo. 2009. “Sparse Reconstruction by Separable Approximation.” IEEE Transactions on Signal Processing 57 (7): 2479–93. https://doi.org/10.1109/TSP.2009.2016892.
Wu, Tong Tong, and Kenneth Lange. 2008. “Coordinate Descent Algorithms for Lasso Penalized Regression.” The Annals of Applied Statistics 2 (1): 224–44. https://doi.org/10.1214/07-AOAS147.
Xu, Wei. 2011. “Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent,” July. http://arxiv.org/abs/1107.2490.
Yang, Jiyan, Xiangrui Meng, and Michael W. Mahoney. 2015. “Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments,” February. http://arxiv.org/abs/1502.03032.
Yun, Sangwoon, and Kim-Chuan Toh. 2009. “A Coordinate Gradient Descent Method for ℓ1-Regularized Convex Minimization.” Computational Optimization and Applications 48 (2): 273–307. https://doi.org/10.1007/s10589-009-9251-8.
Zhang, Lijun, Tianbao Yang, Rong Jin, and Zhi-Hua Zhou. 2015. “Sparse Learning for Large-Scale and High-Dimensional Data: A Randomized Convex-Concave Optimization Approach,” November. http://arxiv.org/abs/1511.03766.
Zhang, Xiao, Lingxiao Wang, and Quanquan Gu. 2017. “Stochastic Variance-Reduced Gradient Descent for Low-Rank Matrix Recovery from Linear Measurements,” January. http://arxiv.org/abs/1701.00481.
Zinkevich, Martin. 2003. “Online Convex Programming and Generalized Infinitesimal Gradient Ascent.” In Proceedings of the Twentieth International Conference on International Conference on Machine Learning, 928–35. ICML’03. Washington, DC, USA: AAAI Press. http://dl.acm.org/citation.cfm?id=3041838.3041955.
Zinkevich, Martin, Markus Weimer, Lihong Li, and Alex J. Smola. 2010. “Parallelized Stochastic Gradient Descent.” In Advances in Neural Information Processing Systems 23, edited by J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, 2595–2603. Curran Associates, Inc. http://papers.nips.cc/paper/4006-parallelized-stochastic-gradient-descent.pdf.