The Living Thing / Notebooks :

Optimisation

Usefulness: 🔧
Novelty: 💡
Uncertainty: 🤪 🤪
Incompleteness: 🚧 🚧 🚧

Crawling through alien landscapes in the fog, looking for mountain peaks.

I’m mostly interested in continuous optimisation, but, you know, combinatorial optimisation is a whole thing.

A vast topic, with many sub-topics. I have neither the time nor the expertise to construct a detailed map of these As Moritz Hardt observes (and this is just in the convex context),

It’s easy to spend a semester of convex optimization on various guises of gradient descent alone. Simply pick one of the following variants and work through the specifics of the analysis: conjugate, accelerated, projected, conditional, mirrored, stochastic, coordinate, online. This is to name a few. You may also choose various pairs of attributes such as “accelerated coordinate” descent. Many triples are also valid such as “online stochastic mirror” descent. An expert unlike me would know exactly which triples are admissible. You get extra credit when you use “subgradient” instead of “gradient”. This is really only the beginning of optimization and it might already seem confusing.

When I was even younger and yet more foolish I decided the divide was between online optimization and offline optimization, which in hindsight is neither a clear nor useful taxonomy for the problems facing me. Now there are more tightly topical pages, such as gradient descent, and 2nd order methods, surrogate optimisation, constrained optimisation, and I shall create additional such as circumstances demand.

Brief taxonomy here.

🚧 Diagram.

See Zeyuan Allen-Zhu and Elad Hazan on their teaching strategy which also gives a split into 16 different areas:

The following dilemma is encountered by many of my friends when teaching basic optimization: which variant/proof of gradient descent should one start with? Of course, one needs to decide on which depth of convex analysis one should dive into, and decide on issues such as “should I define strong-convexity?”, “discuss smoothness?”, “Nesterov acceleration?”, etc.

[…] If one wishes to go into more depth, usually in convex optimization courses, one covers the full spectrum of different smoothness/ strong-convexity/ acceleration/ stochasticity regimes, each with a separate analysis (a total of 16 possible configurations!)

This year I’ve tried something different in COS511 @ Princeton, which turns out also to have research significance. We’ve covered basic GD for well-conditioned functions, i.e. smooth and strongly-convex functions, and then extended these result by reduction to all other cases! A (simplified) outline of this teaching strategy is given in chapter 2 of Introduction to Online Convex Optimization.

Classical Strong-Convexity and Smoothness Reductions:

Given any optimization algorithm A for the well-conditioned case (i.e., strongly convex and smooth case), we can derive an algorithm for smooth but not strongly functions as follows.

Given a non-strongly convex but smooth objective \(f\), define a objective by \(f_1(x)=f(x)+e\|x\|^2\).

It is straightforward to see that \(f_1\) differs from \(f\) by at most ϵ times a distance factor, and in addition it is ϵ-strongly convex. Thus, one can apply A to minimize \(f_1\) and get a solution which is not too far from the optimal solution for \(f\) itself. This simplistic reduction yields an almost optimal rate, up to logarithmic factors.

Keywords: Complimentary slackness theorem, High or very high dimensional methods, approximate method, Lagrange multipliers, primal and dual problems, fixed point methods, gradient, subgradient, proximal gradient, optimal control problems, convexity, sparsity, ways to avoid wrecking finding the extrema of perfectly simple little 10000-parameter functions before everyone observes that you are a fool in the guise of a mathematician but everyone is not there because you wandered off the optimal path hours ago, and now you are alone and lost in a valley of lower-case Greek letters.

See also geometry of fitness landscapes, expectation maximisation, matrix factorisations, discrete optimisation, nature-inspired “meta-heuristic” optimisation.

General

Brief intro material

Textbooks

Whole free textbooks online. Mostly convex.

To file

Alternating Direction Method of Multipliers

Dunno. It’s everywhere, though. Maybe this is a problem of definition, though? (Boyd 2010)

In this review, we argue that the alternating direction method of multipliers is well suited to distributed convex optimization, and in particular to large-scale problems arising in statistics, machine learning, and related areas. The method was developed in the 1970s, with roots in the 1950s, and is equivalent or closely related to many other algorithms, such as dual decomposition, the method of multipliers, Douglas–Rachford splitting, Spingarn’s method of partial inverses, Dykstra’s alternating projections, Bregman iterative algorithms for \(\ell_1\) problems, proximal methods, and others. After briefly surveying the theory and history of the algorithm, we discuss applications to a wide variety of statistical and machine learning problems of recent interest, including the lasso, sparse logistic regression, basis pursuit, covariance selection, support vector machines, and many others. We also discuss general distributed optimization, extensions to the nonconvex setting, and efficient implementation, including some details on distributed MPI and Hadoop Map Reduce implementations.

Continuous approximations of iterations

Recent papers (WiWJ16 WiWi15) argue that the discrete time steps can be viewed as a discrete approximation to a continuous time ODE which approaches the optimum (which in itself is trivial), but moreover that many algorithms fit into the same families of ODEs, that these ODEs explain Nesterov acceleration and generate new, improved optimisation methods. (which is not trivial.)

🚧

Optimisation on manifolds

See Nicolas Boumen’s introductory blog post.

Optimization on manifolds is about solving problems of the form

\[\mathrm{minimize}_{x\in\mathcal{M}} f(x),\]

where \(\mathcal{M}\) is a nice, known manifold. By “nice”, I mean a smooth, finite-dimensional Riemannian manifold.

Practical examples include the following (and all possible products of these):

Conceptually, the key point is to think of optimization on manifolds as unconstrained optimization: we do not think of \(\mathcal{M}\) as being embedded in a Euclidean space. Rather, we think of \(\mathcal{M}\) as being “the only thing that exists,” and we strive for intrinsic methods. Besides making for elegant theory, it also makes it clear how to handle abstract spaces numerically (such as the Grassmann manifold for example); and it gives algorithms the “right” invariances (computations do not depend on an arbitrarily chosen representation of the manifold).

There are at least two reasons why this class of problems is getting much attention lately. First, it is because optimization problems over the aforementioned sets (mostly matrix sets) come up pervasively in applications, and at some point it became clear that the intrinsic viewpoint leads to better algorithms, as compared to general-purpose constrained optimization methods (where \(\mathcal{M}\) is considered as being inside a Euclidean space \(\mathcal{E}\), and algorithms move in \(\mathcal{E}\), while penalizing distance to \(\mathcal{M}\)). The second is that, as I will argue momentarily, Riemannian manifolds are “the right setting” to talk about unconstrained optimization. And indeed, there is a beautiful book by [Absil, Sepulchre, Mahony], called Optimization algorithms on matrix manifolds (freely available), that shows how the classical methods for unconstrained optimization (gradient descent, Newton, trust-regions, conjugate gradients…) carry over seamlessly to the more general Riemannian framework.

Gradient-free optimization

Not all the methods described here use gradient information, but it’s frequently assumed to be something you can access easily. It’s worth considering which objectives you can optimize easily

But not all objectives are easily differentiable, even when parameters are continuous. For example, if you are not getting your measurement from a mathematical model, but from a physical experiment you can’t differentiate it since reality itself is usually not analytically differentiable. In this latter case, you are getting close to a question of online experiment design, as in ANOVA, and a further constraint that your function evaluations are possibly stupendously expensive. See Bayesian optimisation for one approach to this i the context of experiment design.

In general situations like this we use gradient-free methods, such as simulated annealing or numerical gradient etc.

“Meta-heuristic” methods

Biologically-inspired or arbitrary. Evolutionary algorithms, particle swarm optimisation, ant colony optimisation, harmony search. A lot of the tricks from these are adopted into mainstream stochastic methods. Some not.

See biometic algorithms for the care and husbandry of such as those.

Annealing and Monte Carlo optimisation methods

Simulated annealing: Constructing a process to yield maximally-likely estimates for the parameters. This has a statistical mechanics justification that makes it attractive to physicists; But it’s generally useful. You don’t necessarily need a gradient here, just the ability to evaluate something interpretable as a “likelihood ratio”. Long story. I don’t yet cover this at Monte carlo methods but I should.

Elad Hazan The two cultures of optimization:

The standard curriculum in high school math includes elementary functional analysis, and methods for finding the minima, maxima and saddle points of a single dimensional function. When moving to high dimensions, this becomes beyond the reach of your typical high-school student: mathematical optimization theory spans a multitude of involved techniques in virtually all areas of computer science and mathematics.

Iterative methods, in particular, are the most successful algorithms for large-scale optimization and the most widely used in machine learning. Of these, most popular are first-order gradient-based methods due to their very low per-iteration complexity.

However, way before these became prominent, physicists needed to solve large scale optimization problems, since the time of the Manhattan project at Los Alamos. The problems that they faced looked very different, essentially simulation of physical experiments, as were the solutions they developed. The Metropolis algorithm is the basis for randomized optimization methods and Markov Chain Monte Carlo algorithms.[…]

In our recent paper (AbHa15), we show that for convex optimization, the heat path and central path for IPM for a particular barrier function (called the entropic barrier, following the terminology of the recent excellent work of Bubeck and Eldan) are identical! Thus, in some precise sense, the two cultures of optimization have been studied the same object in disguise and using different techniques.

Expectation maximization

See expectation maximisation.

Parallel

Classic, basic SGD takes walks through the data set example-wise or feature-wise – but this doesn’t work in parallel, so you tend to go for mini-batch gradient descent so that you can at least vectorize. Apparently you can make SGD work in “true” parallel across communication-constrained cores, but I don’t yet understand how.

Implementations

Specialised optimisation software.

See also statistical software, and online optimisation

Many of these solvers optionally use commercial backends such as Mosek.

To file

Miscellaneous optimisation techniques suggested on Linkedin

The whole world of exotic specialized optimisers. See, e.g. Nuit Blanche name-dropping Bregmann iteration, alternating method, augmented Lagrangian…

Primal/dual problems

🚧

Majorization-minorization

🚧

Difference-of-Convex-objectives

When your objective function is not convex but you can represent it in terms of convex functions, somehow or other, use DC optimisation. (Gasso, Rakotomamonjy, and Canu 2009) (I don’t think this guarantees you a global optimum, but rather faster convergence to a local one)

Refs

Abadi, Martın, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, and Matthieu Devin. 2016. “TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems.” arXiv Preprint arXiv:1603.04467. http://arxiv.org/abs/1603.04467.

Abernethy, Jacob, and Elad Hazan. 2015. “Faster Convex Optimization: Simulated Annealing with an Efficient Universal Barrier,” July. http://arxiv.org/abs/1507.02528.

Achlioptas, Dimitris, Assaf Naor, and Yuval Peres. 2005. “Rigorous Location of Phase Transitions in Hard Optimization Problems.” Nature 435 (7043): 759–64. https://doi.org/10.1038/nature03602.

Agarwal, Alekh, Olivier Chapelle, Miroslav Dudık, and John Langford. 2014. “A Reliable Effective Terascale Linear Learning System.” Journal of Machine Learning Research 15 (1): 1111–33. http://www.jmlr.org/papers/volume15/agarwal14a/agarwal14a.pdf.

Agarwal, Naman, Brian Bullins, and Elad Hazan. 2016. “Second Order Stochastic Optimization in Linear Time,” February. http://arxiv.org/abs/1602.03943.

Allen-Zhu, Zeyuan, and Elad Hazan. 2016. “Optimal Black-Box Reductions Between Optimization Objectives.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 1606–14. Curran Associates, Inc. http://papers.nips.cc/paper/6364-optimal-black-box-reductions-between-optimization-objectives.pdf.

Arnold, Sébastien M. R., and Chunming Wang. 2017. “Accelerating SGD for Distributed Deep-Learning Using Approximated Hessian Matrix.” In. http://arxiv.org/abs/1709.05069.

Ba, Jimmy, Roger Grosse, and James Martens. 2016. “Distributed Second-Order Optimization Using Kronecker-Factored Approximations,” November. https://openreview.net/forum?id=SkkTMpjex.

Bach, Francis. 2013. “Convex Relaxations of Structured Matrix Factorizations,” September. http://arxiv.org/abs/1309.3117.

Bach, Francis, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. 2012. “Optimization with Sparsity-Inducing Penalties.” Foundations and Trends® in Machine Learning 4 (1): 1–106. https://doi.org/10.1561/2200000015.

Bach, Francis, and Eric Moulines. 2011. “Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning.” In Advances in Neural Information Processing Systems (NIPS), –. Spain. http://hal.archives-ouvertes.fr/hal-00608041.

Bach, Francis R., and Eric Moulines. 2013. “Non-Strongly-Convex Smooth Stochastic Approximation with Convergence Rate O(1/N).” In arXiv:1306.2119 [Cs, Math, Stat], 773–81. https://arxiv.org/abs/1306.2119v1.

Battiti, Roberto. 1992. “First-and Second-Order Methods for Learning: Between Steepest Descent and Newton’s Method.” Neural Computation 4 (2): 141–66. https://doi.org/10.1162/neco.1992.4.2.141.

Beck, Amir, and Marc Teboulle. 2009. “A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems.” SIAM Journal on Imaging Sciences 2 (1): 183–202. https://doi.org/10.1137/080716542.

Bertsimas, D., and I. Popescu. 2005. “Optimal Inequalities in Probability Theory: A Convex Optimization Approach.” SIAM Journal on Optimization 15 (3): 780–804. https://doi.org/10.1137/S1052623401399903.

Bian, Wei, Xiaojun Chen, and Yinyu Ye. 2014. “Complexity Analysis of Interior Point Algorithms for Non-Lipschitz and Nonconvex Minimization.” Mathematical Programming 149 (1-2): 301–27. https://doi.org/10.1007/s10107-014-0753-5.

Bordes, Antoine, Léon Bottou, and Patrick Gallinari. 2009. “SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent.” Journal of Machine Learning Research 10 (December): 1737–54. http://jmlr.org/papers/volume10/bordes09a/bordes09a.pdf.

Botev, Aleksandar, Guy Lever, and David Barber. 2016. “Nesterov’s Accelerated Gradient and Momentum as Approximations to Regularised Update Descent,” July. http://arxiv.org/abs/1607.01981.

Botev, Zdravko I., and Chris J. Lloyd. 2015. “Importance Accelerated Robbins-Monro Recursion with Applications to Parametric Confidence Limits.” Electronic Journal of Statistics 9 (2): 2058–75. https://doi.org/10.1214/15-EJS1071.

Bottou, Léon. 1991. “Stochastic Gradient Learning in Neural Networks.” In Proceedings of Neuro-Nîmes 91. Nimes, France: EC2. http://leon.bottou.org/papers/bottou-91c.

———. 1998. “Online Algorithms and Stochastic Approximations.” In Online Learning and Neural Networks, edited by David Saad, 17:142. Cambridge, UK: Cambridge University Press. http://leon.bottou.org/publications/pdf/online-1998.pdf.

———. 2010. “Large-Scale Machine Learning with Stochastic Gradient Descent.” In Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT’2010), 177–86. Paris, France: Springer. http://leon.bottou.org/papers/bottou-2010.

———. 2012. “Stochastic Gradient Descent Tricks.” In Neural Networks: Tricks of the Trade, 421–36. Lecture Notes in Computer Science. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35289-8_25.

Bottou, Léon, and Olivier Bousquet. 2008. “The Tradeoffs of Large Scale Learning.” In Advances in Neural Information Processing Systems, edited by J.C. Platt, D. Koller, Y. Singer, and S. Roweis, 20:161–68. NIPS Foundation (http://books.nips.cc). http://leon.bottou.org/papers/bottou-bousquet-2008.

Bottou, Léon, Frank E. Curtis, and Jorge Nocedal. 2016. “Optimization Methods for Large-Scale Machine Learning,” June. http://arxiv.org/abs/1606.04838.

Bottou, Léon, and Yann LeCun. 2004. “Large Scale Online Learning.” In Advances in Neural Information Processing Systems 16, edited by Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf. Cambridge, MA: MIT Press. http://leon.bottou.org/papers/bottou-lecun-2004.

Boumal, Nicolas, Bamdev Mishra, P.-A. Absil, and Rodolphe Sepulchre. 2014. “Manopt, a Matlab Toolbox for Optimization on Manifolds.” Journal of Machine Learning Research 15: 1455–9. http://jmlr.org/papers/v15/boumal14a.html.

Boyd, Stephen. 2010. “Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers.” Foundations and Trends® in Machine Learning 3 (1): 1–122. https://doi.org/10.1561/2200000016.

Boyd, Stephen P., and Lieven Vandenberghe. 2004. Convex Optimization. Cambridge, UK ; New York: Cambridge University Press. https://web.stanford.edu/~boyd/cvxbook/.

Bubeck, Sébastien. 2015. “Convex Optimization: Algorithms and Complexity.” Foundations and Trends® in Machine Learning 8 (3-4): 231–357. https://doi.org/10.1561/2200000050.

Bubeck, Sébastien, and Ronen Eldan. 2014. “The Entropic Barrier: A Simple and Optimal Universal Self-Concordant Barrier,” December. http://arxiv.org/abs/1412.1587.

Cevher, Volkan, Stephen Becker, and Mark Schmidt. 2014. “Convex Optimization for Big Data.” IEEE Signal Processing Magazine 31 (5): 32–43. https://doi.org/10.1109/MSP.2014.2329397.

Chartrand, R., and Wotao Yin. 2008. “Iteratively Reweighted Algorithms for Compressive Sensing.” In IEEE International Conference on Acoustics, Speech and Signal Processing, 2008. ICASSP 2008, 3869–72. https://doi.org/10.1109/ICASSP.2008.4518498.

Chen, Xiaojun. 2012. “Smoothing Methods for Nonsmooth, Nonconvex Minimization.” Mathematical Programming 134 (1): 71–99. https://doi.org/10.1007/s10107-012-0569-0.

Chen, Xiaojun, Dongdong Ge, Zizhuo Wang, and Yinyu Ye. 2012. “Complexity of Unconstrained L_2-L_p.” Mathematical Programming 143 (1-2): 371–83. https://doi.org/10.1007/s10107-012-0613-0.

Cho, Minhyung, Chandra Shekhar Dhir, and Jaehyung Lee. 2015. “Hessian-Free Optimization for Learning Deep Multidimensional Recurrent Neural Networks.” In Advances in Neural Information Processing Systems. http://arxiv.org/abs/1509.03475.

Choromanska, Anna, MIkael Henaff, Michael Mathieu, Gerard Ben Arous, and Yann LeCun. 2015. “The Loss Surfaces of Multilayer Networks.” In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, 192–204. http://proceedings.mlr.press/v38/choromanska15.html.

Chretien, Stephane. 2008. “An Alternating L1 Approach to the Compressed Sensing Problem,” September. http://arxiv.org/abs/0809.0660.

Combettes, Patrick L., and Jean-Christophe Pesquet. 2008. “A Proximal Decomposition Method for Solving Convex Variational.” Inverse Problems 24 (6): 065014. https://doi.org/10.1088/0266-5611/24/6/065014.

Dalalyan, Arnak S. 2017. “Further and Stronger Analogy Between Sampling and Optimization: Langevin Monte Carlo and Gradient Descent,” April. http://arxiv.org/abs/1704.04752.

Dauphin, Yann, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. 2014. “Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization.” In Advances in Neural Information Processing Systems 27, 2933–41. Curran Associates, Inc. http://arxiv.org/abs/1406.2572.

Defazio, Aaron, Francis Bach, and Simon Lacoste-Julien. 2014. “SAGA: A Fast Incremental Gradient Method with Support for Non-Strongly Convex Composite Objectives.” In Advances in Neural Information Processing Systems 27. http://arxiv.org/abs/1407.0202.

Dennis, J. E., and Robert B. Schnabel. 1989. “Chapter I A View of Unconstrained Optimization.” In Handbooks in Operations Research and Management Science, 1:1–72. Optimization. Elsevier. https://doi.org/10.1016/S0927-0507(89)01002-9.

Dennis, J., and R. Schnabel. 1996. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Vol. 16. Classics in Applied Mathematics. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9781611971200.

DeVore, Ronald A. 1998. “Nonlinear Approximation.” Acta Numerica 7 (January): 51–150. https://doi.org/10.1017/S0962492900002816.

Ding, Lijun, and Madeleine Udell. 2018. “Frank-Wolfe Style Algorithms for Large Scale Optimization,” August. https://arxiv.org/abs/1808.05274v1.

Duchi, John, Elad Hazan, and Yoram Singer. 2011. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” Journal of Machine Learning Research 12 (Jul): 2121–59. http://www.jmlr.org/papers/v12/duchi11a.html.

Fletcher, R., and M. J. D. Powell. 1963. “A Rapidly Convergent Descent Method for Minimization.” The Computer Journal 6 (2): 163–68. https://doi.org/10.1093/comjnl/6.2.163.

Forrester, Alexander I. J., and Andy J. Keane. 2009. “Recent Advances in Surrogate-Based Optimization.” Progress in Aerospace Sciences 45 (1–3): 50–79. https://doi.org/10.1016/j.paerosci.2008.11.001.

Friedlander, Michael P., and Mark Schmidt. 2012. “Hybrid Deterministic-Stochastic Methods for Data Fitting.” SIAM Journal on Scientific Computing 34 (3): A1380–A1405. https://doi.org/10.1137/110830629.

Friedman, Jerome H. 2002. “Stochastic Gradient Boosting.” Computational Statistics & Data Analysis, Nonlinear Methods and Data Mining, 38 (4): 367–78. https://doi.org/10.1016/S0167-9473(01)00065-2.

Friedman, Jerome, Trevor Hastie, Holger Höfling, and Robert Tibshirani. 2007. “Pathwise Coordinate Optimization.” The Annals of Applied Statistics 1 (2): 302–32. https://doi.org/10.1214/07-AOAS131.

Friedman, Jerome, Trevor Hastie, and Rob Tibshirani. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33 (1): 1–22. https://doi.org/10.18637/jss.v033.i01.

Gasso, G., A. Rakotomamonjy, and S. Canu. 2009. “Recovering Sparse Signals with a Certain Family of Nonconvex Penalties and DC Programming.” IEEE Transactions on Signal Processing 57 (12): 4686–98. https://doi.org/10.1109/TSP.2009.2026004.

Ghadimi, Saeed, and Guanghui Lan. 2013a. “Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming.” SIAM Journal on Optimization 23 (4): 2341–68. https://doi.org/10.1137/120880811.

———. 2013b. “Accelerated Gradient Methods for Nonconvex Nonlinear and Stochastic Programming,” October. http://arxiv.org/abs/1310.3787.

Giles, Mike B. 2008. “Collected Matrix Derivative Results for Forward and Reverse Mode Algorithmic Differentiation.” In Advances in Automatic Differentiation, edited by Christian H. Bischof, H. Martin Bücker, Paul Hovland, Uwe Naumann, and Jean Utke, 64:35–44. Berlin, Heidelberg: Springer Berlin Heidelberg. http://eprints.maths.ox.ac.uk/1079/.

Goldstein, A. 1965. “On Steepest Descent.” Journal of the Society for Industrial and Applied Mathematics Series A Control 3 (1): 147–51. https://doi.org/10.1137/0303013.

Goldstein, Tom, Christoph Studer, and Richard Baraniuk. 2015. “FASTA: A Generalized Implementation of Forward-Backward Splitting,” January. http://arxiv.org/abs/1501.04979.

Han, Zhong-Hua, and Ke-Shi Zhang. 2012. “Surrogate-Based Optimization.” http://cdn.intechopen.com/pdfs/30305/InTech-Surrogate_based_optimization.pdf.

Harchaoui, Zaid, Anatoli Juditsky, and Arkadi Nemirovski. 2015. “Conditional Gradient Algorithms for Norm-Regularized Smooth Convex Optimization.” Mathematical Programming 152 (1-2): 75–112. https://doi.org/10.1007/s10107-014-0778-9.

Hazan, Elad, Kfir Levy, and Shai Shalev-Shwartz. 2015. “Beyond Convexity: Stochastic Quasi-Convex Optimization.” In Advances in Neural Information Processing Systems 28, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 1594–1602. Curran Associates, Inc. http://papers.nips.cc/paper/5718-beyond-convexity-stochastic-quasi-convex-optimization.pdf.

Heyde, C. C. 1974. “On Martingale Limit Theory and Strong Convergence Results for Stochastic Approximation Procedures.” Stochastic Processes and Their Applications 2 (4): 359–70. https://doi.org/10.1016/0304-4149(74)90004-0.

Hinton, Geoffrey, Nitish Srivastava, and Kevin Swersky. n.d. “Neural Networks for Machine Learning.”

Hosseini, Reshad, and Suvrit Sra. 2015. “Manifold Optimization for Gaussian Mixture Models.” arXiv Preprint arXiv:1506.07677. http://arxiv.org/abs/1506.07677.

Hu, Chonghai, Weike Pan, and James T. Kwok. 2009. “Accelerated Gradient Methods for Stochastic Optimization and Online Learning.” In Advances in Neural Information Processing Systems, 781–89. Curran Associates, Inc. http://papers.nips.cc/paper/3817-accelerated-gradient-methods-for-stochastic-optimization-and-online-learning.

Jaggi, Martin. 2013. “Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization.” In Journal of Machine Learning Research, 427–35. http://jmlr.csail.mit.edu/proceedings/papers/v28/jaggi13.html.

Jakovetic, D., J.M. Freitas Xavier, and J.M.F. Moura. 2014. “Convergence Rates of Distributed Nesterov-Like Gradient Methods on Random Networks.” IEEE Transactions on Signal Processing 62 (4): 868–82. https://doi.org/10.1109/TSP.2013.2291221.

Kim, Daeun, and Justin P. Haldar. 2016. “Greedy Algorithms for Nonnegativity-Constrained Simultaneous Sparse Recovery.” Signal Processing 125 (August): 274–89. https://doi.org/10.1016/j.sigpro.2016.01.021.

Kingma, Diederik, and Jimmy Ba. 2015. “Adam: A Method for Stochastic Optimization.” Proceeding of ICLR. http://arxiv.org/abs/1412.6980.

Klein, Dan. 2004. “Lagrange Multipliers Without Permanent Scarring.” University of California at Berkeley, Computer Science Division. http://www.ee.columbia.edu/~vittorio/LagrangeMultipliers-Klein.pdf.

Lai, Tze Leung. 2003. “Stochastic Approximation.” The Annals of Statistics 31 (2): 391–406. https://doi.org/10.1214/aos/1051027873.

Langford, John, Lihong Li, and Tong Zhang. 2009. “Sparse Online Learning via Truncated Gradient.” In Advances in Neural Information Processing Systems 21, edited by D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, 905–12. Curran Associates, Inc. http://papers.nips.cc/paper/3585-sparse-online-learning-via-truncated-gradient.pdf.

Levy, Kfir Y. 2016. “The Power of Normalization: Faster Evasion of Saddle Points,” November. http://arxiv.org/abs/1611.04831.

Li, Yuanzhi, Yingyu Liang, and Andrej Risteski. 2016. “Recovery Guarantee of Non-Negative Matrix Factorization via Alternating Updates.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 4988–96. Curran Associates, Inc. http://papers.nips.cc/paper/6417-recovery-guarantee-of-non-negative-matrix-factorization-via-alternating-updates.pdf.

Lin, Hongzhou, Julien Mairal, and Zaid Harchaoui. 2016. “QuickeNing: A Generic Quasi-Newton Algorithm for Faster Gradient-Based Optimization *.” In. http://arxiv.org/abs/1610.00960.

Lucchi, Aurelien, Brian McWilliams, and Thomas Hofmann. 2015. “A Variance Reduced Stochastic Newton Method,” March. http://arxiv.org/abs/1503.08316.

Ma, Chenxin, Jakub Konečnỳ, Martin Jaggi, Virginia Smith, Michael I. Jordan, Peter Richtárik, and Martin Takáč. 2015. “Distributed Optimization with Arbitrary Local Solvers.” arXiv Preprint arXiv:1512.04039. http://arxiv.org/abs/1512.04039.

Ma, Siyuan, and Mikhail Belkin. 2017. “Diving into the Shallows: A Computational Perspective on Large-Scale Shallow Learning,” March. http://arxiv.org/abs/1703.10622.

Madsen, K, H.B. Nielsen, and O. Tingleff. 2004. “Methods for Non-Linear Least Squares Problems.” http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3215/pdf/imm3215.pdf,

Mairal, J. 2015. “Incremental Majorization-Minimization Optimization with Application to Large-Scale Machine Learning.” SIAM Journal on Optimization 25 (2): 829–55. https://doi.org/10.1137/140957639.

Mairal, Julien. 2013a. “Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization.” In Advances in Neural Information Processing Systems, 2283–91. http://papers.nips.cc/paper/5129-stochastic-majorization-minimization-algorithms-for-large-scale-optimization.

———. 2013b. “Optimization with First-Order Surrogate Functions.” In International Conference on Machine Learning, 783–91. http://proceedings.mlr.press/v28/mairal13.html.

Mairal, Julien, Francis Bach, and Jean Ponce. 2014. “Sparse Modeling for Image and Vision Processing.” Foundations and Trends® in Comput Graph. Vis. 8 (2-3): 85–283. https://doi.org/10.1561/0600000058.

Martens, James. 2010. “Deep Learning via Hessian-Free Optimization.” In Proceedings of the 27th International Conference on International Conference on Machine Learning, 735–42. ICML’10. USA: Omnipress. http://www.cs.utoronto.ca/~jmartens/docs/Deep_HessianFree.pdf.

Martens, James, and Ilya Sutskever. 2011. “Learning Recurrent Neural Networks with Hessian-Free Optimization.” In Proceedings of the 28th International Conference on International Conference on Machine Learning, 1033–40. ICML’11. USA: Omnipress. http://dl.acm.org/citation.cfm?id=3104482.3104612.

———. 2012. “Training Deep and Recurrent Networks with Hessian-Free Optimization.” In Neural Networks: Tricks of the Trade, 479–535. Lecture Notes in Computer Science. Springer. http://www.cs.toronto.edu/~jmartens/docs/HF_book_chapter.pdf.

Mattingley, J., and S. Boyd. 2010. “Real-Time Convex Optimization in Signal Processing.” IEEE Signal Processing Magazine 27 (3): 50–61. https://doi.org/10.1109/MSP.2010.936020.

Mcleod, Doug, Garry Emmerson, Robert Kohn, and Geoff Kingston (universit. 2008. “Finding the Invisible Hand: An Objective Model of Financial Markets.”

McMahan, H. Brendan, Gary Holt, D. Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, et al. 2013. “Ad Click Prediction: A View from the Trenches.” In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1222–30. KDD ’13. New York, NY, USA: ACM. https://doi.org/10.1145/2487575.2488200.

Mei, Song, Yu Bai, and Andrea Montanari. 2016. “The Landscape of Empirical Risk for Non-Convex Losses,” July. http://arxiv.org/abs/1607.06534.

Mitliagkas, Ioannis, Ce Zhang, Stefan Hadjis, and Christopher Ré. 2016. “Asynchrony Begets Momentum, with an Application to Deep Learning,” May. http://arxiv.org/abs/1605.09774.

Nesterov, Y. 2012. “Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems.” SIAM Journal on Optimization 22 (2): 341–62. https://doi.org/10.1137/100802001.

Nesterov, Yu. 2007. “Accelerating the Cubic Regularization of Newton’s Method on Convex Problems.” Mathematical Programming 112 (1): 159–81. https://doi.org/10.1007/s10107-006-0089-x.

———. 2012. “Gradient Methods for Minimizing Composite Functions.” Mathematical Programming 140 (1): 125–61. https://doi.org/10.1007/s10107-012-0629-5.

Nesterov, Yurii. 2004. Introductory Lectures on Convex Optimization. Vol. 87. Applied Optimization. Boston, MA: Springer US. http://link.springer.com/10.1007/978-1-4419-8853-9.

Nguyen, Lam M., Jie Liu, Katya Scheinberg, and Martin Takáč. 2017. “Stochastic Recursive Gradient Algorithm for Nonconvex Optimization,” May. http://arxiv.org/abs/1705.07261.

Nocedal, Jorge, and S. Wright. 2006. Numerical Optimization. 2nd ed. Springer Series in Operations Research and Financial Engineering. New York: Springer-Verlag. https://www.springer.com/gp/book/9780387303031.

Parikh, Neal, and Stephen Boyd. 2014. “Proximal Algorithms.” Foundations and Trends® in Optimization 1 (3): 127–239. https://doi.org/10.1561/2400000003.

Patel, Vivak. 2017. “On SGD’s Failure in Practice: Characterizing and Overcoming Stalling,” February. http://arxiv.org/abs/1702.00317.

Pilanci, Mert, and Martin J. Wainwright. 2016. “Iterative Hessian Sketch: Fast and Accurate Solution Approximation for Constrained Least-Squares.” Journal of Machine Learning Research 17 (53): 1–38. http://www.jmlr.org/papers/volume17/14-460/14-460.pdf.

Polyak, B. T., and A. B. Juditsky. 1992. “Acceleration of Stochastic Approximation by Averaging.” SIAM Journal on Control and Optimization 30 (4): 838–55. https://doi.org/10.1137/0330046.

Portnoy, Stephen, and Roger Koenker. 1997. “The Gaussian Hare and the Laplacian Tortoise: Computability of Squared-Error Versus Absolute-Error Estimators.” Statistical Science 12 (4): 279–300. https://doi.org/10.1214/ss/1030037960.

Queipo, Nestor V., Raphael T. Haftka, Wei Shyy, Tushar Goel, Rajkumar Vaidyanathan, and P. Kevin Tucker. 2005. “Surrogate-Based Analysis and Optimization.” Progress in Aerospace Sciences 41 (1): 1–28. https://doi.org/10.1016/j.paerosci.2005.02.001.

Reddi, Sashank J., Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. 2016. “Stochastic Variance Reduction for Nonconvex Optimization.” In PMLR, 1603:314–23. http://proceedings.mlr.press/v48/reddi16.html.

Robbins, Herbert, and Sutton Monro. 1951. “A Stochastic Approximation Method.” The Annals of Mathematical Statistics 22 (3): 400–407. https://doi.org/10.1214/aoms/1177729586.

Robbins, H., and D. Siegmund. 1971. “A Convergence Theorem for Non Negative Almost Supermartingales and Some Applications.” In Optimizing Methods in Statistics, edited by Jagdish S. Rustagi, 233–57. Academic Press. https://doi.org/10.1016/B978-0-12-604550-5.50015-8.

Rosset, Saharon, and Ji Zhu. 2007. “Piecewise Linear Regularized Solution Paths.” The Annals of Statistics 35 (3): 1012–30. https://doi.org/10.1214/009053606000001370.

Ruder, Sebastian. 2016. “An Overview of Gradient Descent Optimization Algorithms,” September. http://arxiv.org/abs/1609.04747.

Ruppert, David. 1985. “A Newton-Raphson Version of the Multivariate Robbins-Monro Procedure.” The Annals of Statistics 13 (1): 236–45. https://doi.org/10.1214/aos/1176346589.

Sagun, Levent, V. Ugur Guney, Gerard Ben Arous, and Yann LeCun. 2014. “Explorations on High Dimensional Landscapes,” December. http://arxiv.org/abs/1412.6615.

Salimans, Tim, and Diederik P Kingma. 2016. “Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 901–1. Curran Associates, Inc. http://papers.nips.cc/paper/6114-weight-normalization-a-simple-reparameterization-to-accelerate-training-of-deep-neural-networks.pdf.

Sashank J. Reddi, Suvrit Sra, Barnabás Póczós, and Alex Smola. 1995. “Stochastic Frank-Wolfe Methods for Nonconvex Optimization.” https://arxiv.org/abs/1607.08254.

Schmidt, Mark, Glenn Fung, and Romer Rosales. 2009. “Optimization Methods for L1-Regularization.” University of British Columbia, Technical Report TR-2009 19.

Schraudolph, Nicol N. 2002. “Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent.” Neural Computation 14 (7): 1723–38. https://doi.org/10.1162/08997660260028683.

Shalev-Shwartz, Shai, and Ambuj Tewari. 2011. “Stochastic Methods for L1-Regularized Loss Minimization.” Journal of Machine Learning Research 12 (July): 1865–92. http://www.machinelearning.org/archive/icml2009/papers/262.pdf.

Shi, Hao-Jun Michael, Shenyinying Tu, Yangyang Xu, and Wotao Yin. 2016. “A Primer on Coordinate Descent Algorithms,” September. http://arxiv.org/abs/1610.00040.

Simon, Noah, Jerome Friedman, Trevor Hastie, and Rob Tibshirani. 2011. “Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent.” Journal of Statistical Software 39 (5). http://www.jstatsoft.org/v39/i05/paper.

Smith, Virginia, Simone Forte, Michael I. Jordan, and Martin Jaggi. 2015. “L1-Regularized Distributed Optimization: A Communication-Efficient Primal-Dual Framework,” December. http://arxiv.org/abs/1512.04011.

Spall, J. C. 2000. “Adaptive Stochastic Approximation by the Simultaneous Perturbation Method.” IEEE Transactions on Automatic Control 45 (10): 1839–53. https://doi.org/10.1109/TAC.2000.880982.

Straszak, Damian, and Nisheeth K. Vishnoi. 2016. “IRLS and Slime Mold: Equivalence and Convergence,” January. http://arxiv.org/abs/1601.02712.

Thisted, Ronald A. 1997. “[The Gaussian Hare and the Laplacian Tortoise: Computability of Squared-Error Versus Absolute-Error Estimators]: Comment.” Statistical Science 12 (4): 296–98.

Townsend, James, Niklas Koep, and Sebastian Weichwald. 2016. “Pymanopt: A Python Toolbox for Optimization on Manifolds Using Automatic Differentiation.” Journal of Machine Learning Research 17 (137): 1–5. http://jmlr.org/papers/v17/16-177.html.

Vishwanathan, S.V. N., Nicol N. Schraudolph, Mark W. Schmidt, and Kevin P. Murphy. 2006. “Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods.” In Proceedings of the 23rd International Conference on Machine Learning.

Wainwright, Martin J. 2014. “Structured Regularizers for High-Dimensional Problems: Statistical and Computational Issues.” Annual Review of Statistics and Its Application 1 (1): 233–53. https://doi.org/10.1146/annurev-statistics-022513-115643.

Wibisono, Andre, and Ashia C. Wilson. 2015. “On Accelerated Methods in Optimization,” September. http://arxiv.org/abs/1509.03616.

Wibisono, Andre, Ashia C. Wilson, and Michael I. Jordan. 2016. “A Variational Perspective on Accelerated Methods in Optimization.” Proceedings of the National Academy of Sciences 113 (47): E7351–E7358. https://doi.org/10.1073/pnas.1614734113.

Wipf, David, and Srikantan Nagarajan. 2016. “Iterative Reweighted L1 and L2 Methods for Finding Sparse Solution.” Microsoft Research, July. https://www.microsoft.com/en-us/research/publication/iterative-reweighted-l1-l2-methods-finding-sparse-solution/.

Wright, S. J., R. D. Nowak, and M. A. T. Figueiredo. 2009. “Sparse Reconstruction by Separable Approximation.” IEEE Transactions on Signal Processing 57 (7): 2479–93. https://doi.org/10.1109/TSP.2009.2016892.

Wu, Tong Tong, and Kenneth Lange. 2008. “Coordinate Descent Algorithms for Lasso Penalized Regression.” The Annals of Applied Statistics 2 (1): 224–44. https://doi.org/10.1214/07-AOAS147.

Xu, Wei. 2011. “Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent,” July. http://arxiv.org/abs/1107.2490.

Yang, Jiyan, Xiangrui Meng, and Michael W. Mahoney. 2015. “Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments,” February. http://arxiv.org/abs/1502.03032.

Yun, Sangwoon, and Kim-Chuan Toh. 2009. “A Coordinate Gradient Descent Method for ℓ 1-Regularized Convex Minimization.” Computational Optimization and Applications 48 (2): 273–307. https://doi.org/10.1007/s10589-009-9251-8.

Zhang, Lijun, Tianbao Yang, Rong Jin, and Zhi-Hua Zhou. 2015. “Sparse Learning for Large-Scale and High-Dimensional Data: A Randomized Convex-Concave Optimization Approach,” November. http://arxiv.org/abs/1511.03766.

Zhang, Xiao, Lingxiao Wang, and Quanquan Gu. 2017. “Stochastic Variance-Reduced Gradient Descent for Low-Rank Matrix Recovery from Linear Measurements,” January. http://arxiv.org/abs/1701.00481.

Zinkevich, Martin. 2003. “Online Convex Programming and Generalized Infinitesimal Gradient Ascent.” In Proceedings of the Twentieth International Conference on International Conference on Machine Learning, 928–35. ICML’03. Washington, DC, USA: AAAI Press. http://dl.acm.org/citation.cfm?id=3041838.3041955.

Zinkevich, Martin, Markus Weimer, Lihong Li, and Alex J. Smola. 2010. “Parallelized Stochastic Gradient Descent.” In Advances in Neural Information Processing Systems 23, edited by J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, 2595–2603. Curran Associates, Inc. http://papers.nips.cc/paper/4006-parallelized-stochastic-gradient-descent.pdf.