# Optimisation, continuous, batch/offline

Crawling through alien landscapes in the fog, looking for mountain peaks, one step at a time.

Twin to online optimization.

A huge and varied research discipline; this notebook will need to be broken down into sub-sections as my understanding improves.

But for now, I don’t recommend that you read this! Other people are far more skilled in this area than me; This is not an educational resource so much as keyword-dump breadcrumb trail marking my plucking of results I need from someone else’s orchard.

Keywords: Complimentary slackness theorem, High or very high dimensional methods, approximate method, Lagrange multipliers, primal and dual problems, fixed point methods, gradient, subgradient, proximal gradient, optimal control problems, convexity, sparsity, ways to avoid wrecking finding the extrema of perfectly simple little 10000-parameter functions before everyone observes that you are a fool in the guise of a mathematician but everyone is not there because you wandered off the optimal path hours ago, and now you are alone and lost in a valley of lower-case Greek letters.

## Typology

TODO: Diagramm.

See Zeyuan Allen-Zhu and Elad Hazan on their teaching strategy which also gives a split into 16 different areas: (Of which I personally need a practical engagement only non-smooth, online variants.)

The following dilemma is encountered by many of my friends when teaching basic optimization: which variant/proof of gradient descent should one start with? Of course, one needs to decide on which depth of convex analysis one should dive into, and decide on issues such as “should I define strong-convexity?”, “discuss smoothness?”, “Nesterov acceleration?”, etc.

[…] If one wishes to go into more depth, usually in convex optimization courses, one covers the full spectrum of different smoothness/ strong-convexity/ acceleration/ stochasticity regimes, each with a separate analysis (a total of 16 possible configurations!)

This year I’ve tried something different in COS511 @ Princeton, which turns out also to have research significance. We’ve covered basic GD for well-conditioned functions, i.e. smooth and strongly-convex functions, and then extended these result by reduction to all other cases! A (simplified) outline of this teaching strategy is given in chapter 2 of Introduction to Online Convex Optimization.

Classical Strong-Convexity and Smoothness Reductions:

Given any optimization algorithm A for the well-conditioned case (i.e., strongly convex and smooth case), we can derive an algorithm for smooth but not strongly functions as follows.

Given a non-strongly convex but smooth objective $f$, define a objective by $f_1(x)=f(x)+e\|x\|^2$.

It is straightforward to see that $f_1$ differs from $f$ by at most ϵ times a distance factor, and in addition it is ϵ-strongly convex. Thus, one can apply A to minimize $f_1$ and get a solution which is not too far from the optimal solution for $f$ itself. This simplistic reduction yields an almost optimal rate, up to logarithmic factors.

See also geometry of fitness landscapes, expectation maximisation, matrix factorisations, discrete optimisation, nature-inspired “meta-heuristic” optimisation.

## General

### Brief intro material

• Zeyuan ALLEN-ZHU: Recent Advances in Stochastic Convex and Non-Convex Optimization. Clear, missing some details but has good pointers.

• basic but enlightening, John Nash’s graphical explanation of R’s optimization

• Martin Jaggi’s Optimization in two hours

• Celebrated union of optimisation, computational complexity and command-and-control-economics, by that showoff Cosma Shalizi: In Soviet Union, Optimization Problem Solves You

• Elad Hazan The two cultures of optimization:

The standard curriculum in high school math includes elementary functional analysis, and methods for finding the minima, maxima and saddle points of a single dimensional function. When moving to high dimensions, this becomes beyond the reach of your typical high-school student: mathematical optimization theory spans a multitude of involved techniques in virtually all areas of computer science and mathematics.

Iterative methods, in particular, are the most successful algorithms for large-scale optimization and the most widely used in machine learning. Of these, most popular are first-order gradient-based methods due to their very low per-iteration complexity.

However, way before these became prominent, physicists needed to solve large scale optimization problems, since the time of the Manhattan project at Los Alamos. The problems that they faced looked very different, essentially simulation of physical experiments, as were the solutions they developed. The Metropolis algorithm is the basis for randomized optimization methods and Markov Chain Monte Carlo algorithms.[…]

In our recent paper (AbHa15), we show that for convex optimization, the heat path and central path for IPM for a particular barrier function (called the entropic barrier, following the terminology of the recent excellent work of Bubeck and Eldan) are identical! Thus, in some precise sense, the two cultures of optimization have been studied the same object in disguise and using different techniques.

### Textbooks

Whole free textbooks online. Mostly convex.

## Constrained optimisation

Related: [constraint solvers](constraint_solvers.md)

### Lagrange multipliers

Constrained optimisation using Lagrange’s one weird trick, and the Karush–Kuhn–Tucker conditions. The search for saddle points and roots.

### Alternating Direction Method of Multipliers

Dunno. It’s everywhere, though. Maybe this is a problem of definition, though? (Boyd10)

In this review, we argue that the alternating direction method of multipliers is well suited to distributed convex optimization, and in particular to large-scale problems arising in statistics, machine learning, and related areas. The method was developed in the 1970s, with roots in the 1950s, and is equivalent or closely related to many other algorithms, such as dual decomposition, the method of multipliers, Douglas–Rachford splitting, Spingarn’s method of partial inverses, Dykstra’s alternating projections, Bregman iterative algorithms for $\ell_1$ problems, proximal methods, and others. After briefly surveying the theory and history of the algorithm, we discuss applications to a wide variety of statistical and machine learning problems of recent interest, including the lasso, sparse logistic regression, basis pursuit, covariance selection, support vector machines, and many others. We also discuss general distributed optimization, extensions to the nonconvex setting, and efficient implementation, including some details on distributed MPI and Hadoop Map Reduce implementations.

## Reductions

Oh crap, where did I rip this quote from?

The diverse world of machine learning applications has given rise to a plethora of algorithms and optimization methods, finely tuned to the specific regression or classification task at hand. We reduce the complexity of algorithm design for machine learning by reductions: we develop reductions that take a method developed for one setting and apply it to the entire spectrum of smoothness and strong-convexity in applications.

It will be from a Hazan and Annen-Zhu paper.

## Continuous approximations of iterations

Recent papers (WiWJ16 WiWi15) argue that the discrete time steps can be viewed as a discrete approximation to a continuous time ODE which approaches the optimum (which in itself is trivial), but moreover that many algorithms fit into the same families of ODEs, that these ODEs explain Nesterov acceleration and generate new, improved optimisation methods. (which is not trivial.)

## Continuous relaxations of parameters

Solving discrete problems with differentiable, continuous, surrogate parameters.

## …Of composite functions

Hip for sparse regression, compressed sensing. etc. “FISTA” is one option. (Bubeck explains.)

## Second order (Quasi-Newton)

If you have the second derivative you can be fancy when finding zeros.

## Least Squares

This particular objective function has some particular shortcuts; e.g. you don’t necessarily need a gradient to do it.

Sebastien Bubeck has a good write-up. Part 1, PArt 2.

Finding quadratic minima, as in Least Squares.

And also not-quite-linear uses of this.

## …on a manifold

What if your constraints are naturally represented as some kind of smooth manifold? Is that worth thinking about? Apparently sometimes it is. See, e.g. ToKW16, BMAS14.

## To file

### Miscellaneous optimisation techniques suggested on Linkedin

The whole world of exotic specialized optimisers. See, e.g. Nuit Blanche name-dropping Bregmann iteration, alternating method, augmented Lagrangian…

### Coordinate descent

Descent each coordinate individually.

Small clever hack: log gradient descent.

### Fixed point methods

Contraction maps are nice when you have them. TBD.

### Difference-of-Convex-objectives

When your objective function is not convex but you can represent it in terms of convex functions, use DC optimisation. (GaRC09.) (I don’t think this guarantees you a global optimum, but rather faster convergence to a local one)

### Non-convex optimisation

“How doomed are you?”

Not all the methods described here use gradient information, but it’s frequently assumed to be something you can access easily. It’s worth considering which objectives you can optimize easily

But not all objectives are easily differentiable, even when parameters are continuous. For example, if you are not getting your measurement from a mathematical model, but from a physical experiment you can’t differentiate it since reality itself is usually not anlytically differentiable. In this latter case, you are getting close to a question of online experiment design, as in ANOVA, and a further constraint that your function evaluations are possibly stupendously expensive. See Bayesian optimisation for one approach to this i the context of experiment design.

In general situations like this we use gradient-free methods, such as simulated annealing or metaheuristic methods or what-have-you.

### “Meta-heuristic” methods

Biologically-inspired or arbitrary. Evolutionary algorithms, particle swarm optimisation, ant colony optimisation, harmony search. A lot of the tricks from these are adopted into mainstream stochastic methods. Some not.

See biometic algorithms for the care and husbandry of such as those.

### Annealing and Monte Carlo optimisation methods

Simulated annealing: Constructing a process to yield maximally-likely estimates for the parameters. This has a statistical mechanics justification that makes it attractive to physicists; But it’s generally useful. You don’t necessarily need a gradient here, just the ability to evaluate something interpretable as a “likelihood ratio”. Long story. I don’t yet cover this at Monte carlo methods but I should.

## My problem: constrained, pathwise sparsifying-penalty optimisers for nonlinear problems

I’m trialling a bunch of sparse Lasso-like regression models. I want them to be fast-ish to run and fast to develop. I want them to go in python. I would like to be able to vary my regularisation parameter and warm-restart, like the glmnet Lasso. I would like to be able to handle constraints, especially component-wise non-negativity, and matrix non-negative-definiteness.

Notes on that here.

Ideas:

### use scipy.optimize

give up the idea of warm restarts, and enforce constraints in the callback.

### use cvxpy

Pretty good, but not that fast since it does not in general exploit gradient information. For some problems this is fine, though.

### use spams

Wonderful, but only if your problem fits one of their categories. Otherwise you can maybe extract some bits from their code and use them, but that is now a serious project. They have a passable LASSO.

### Roll my own algorithm

Potentially yak-shaving. But I can work from the example of my colleagues, which are special-purpose algorithms, usually reasonably fast ones.

pyspsolve
ADMM (Alternating Direction Method of Multipliers) for LASSO, which is a Python implementation for an original MATLAB version.
pytron (python/c++)
A Trust-Region Newton Method in Python using TRON optimization software (files src/tron.{h,cpp}) distributed from the LIBLINEAR sources (v1.93),
in the incoming scikit-learn looks likes a good reference.

## Implementations

Specialised optimisation software.

• SPORCO a Python package for solving optimisation problems with sparsity-inducing regularisation. These consist primarily of sparse coding and dictionary learning problems, including convolutional sparse coding and dictionary learning, but there is also support for other problems such as Total Variation regularisation and Robust PCA. In the current version, all of the optimisation algorithms are based on the Alternating Direction Method of Multipliers (ADMM).

• scipy.optimise.minimize: The python default. Includes many different algorithms than can do whatever you want. Failure modes are opaque, online-only and they don’t support warm-restarts, which is a thing for me, but a good starting point unless you have reason to prefer others. (i.e. if all your data does not fit in RAM, don’t bother.)

• spams

SPAMS (SPArse Modeling Software) is an optimization toolbox for solving various sparse estimation problems. Dictionary learning and matrix factorization (NMF, sparse PCA, …) Solving sparse decomposition problems with LARS, coordinate descent, OMP, SOMP, proximal methods Solving structured sparse decomposition problems (:mathell_1/ell_2, $\ell_1/\ell_\infty,$ sparse group lasso, tree-structured regularization structured sparsity with overlapping groups,…). It is developped by Julien Mairal, with the collaboration of Francis Bach, Jean Ponce, Guillermo Sapiro, Rodolphe Jenatton and Guillaume Obozinski. It is coded in C++ with a Matlab interface. Recently, interfaces for R and Python have been developed by Jean-Paul Chieze (INRIA), and archetypal analysis was written by Yuansi Chen (UC Berkeley).

• picos

…is a user friendly interface to several conic and integer programming solvers, very much like YALMIP or CVX under MATLAB.

The main motivation for PICOS is to have the possibility to enter an optimization problem as a high level model, and to be able to solve it with several different solvers. Multidimensional and matrix variables are handled in a natural fashion, which makes it painless to formulate a SDP or a SOCP. This is very useful for educational purposes, and to quickly implement some models and test their validity on simple examples.

also maintains a list of other solvers.

• Manifold optimisation implementations:

MATLAB: manopt, Python: pymanopt.

• cvxopt

… is a free software package for convex optimization based on the Python programming language. It can be used with the interactive Python interpreter, on the command line by executing Python scripts, or integrated in other software via Python extension modules. Its main purpose is to make the development of software for convex optimization applications straightforward by building on Python’s extensive standard library and on the strengths of Python as a high-level programming language. […]

• efficient Python classes for dense and sparse matrices (real and complex), with Python indexing and slicing and overloaded operations for matrix arithmetic
• an interface to most of the double-precision real and complex BLAS
• an interface to LAPACK routines for solving linear equations and least-squares problems, matrix factorisations (LU, Cholesky, LDLT and QR), symmetric eigenvalue and singular value decomposition, and Schur factorization
• an interface to the fast Fourier transform routines from FFTW
• interfaces to the sparse LU and Cholesky solvers from UMFPACK and CHOLMOD
• routines for linear, second-order cone, and semidefinite programming problems
• routines for nonlinear convex optimization
• interfaces to the linear programming solver in GLPK, the semidefinite programming solver in DSDP5, and the linear, quadratic and second-order cone programming solvers in MOSEK
• a modeling tool for specifying convex piecewise-linear optimization problems.

seems to reinvent half of numpy and scipy. Also seems to be used by the all the other python packages. Including…

• cvxpy

…is a Python-embedded modeling language for convex optimization problems. It allows you to express your problem in a natural way that follows the math, rather than in the restrictive standard form required by solvers.

So it’s a DSL for convex constraint programming. Can be extended heuristically to nonconvex constraints by…

• ncvx

… is a package for modeling and solving problems with convex objectives and decision variables from a nonconvex set. This package provides heuristic such as NC-ADMM (a variation of alternating direction method of multipliers for nonconvex problems) and relax-round-polish, which can be viewed as a majorization-minimization algorithm. The solver methods provided and the syntax for constructing problems are discussed in our associated paper.

• NLopt

… is a free/open-source library for nonlinear optimization, providing a common interface for a number of different free optimization routines available online as well as original implementations of various other algorithms. Its features include:

• Callable from C, C++, Fortran, Matlab or GNU Octave, Python, GNU Guile, Julia, GNU R, Lua, and OCaml.
• A common interface for many different algorithms—try a different algorithm just by changing one parameter.
• Support for large-scale optimization (some algorithms scalable to millions of parameters and thousands of constraints)…
• Algorithms using function values only (derivative-free) and also algorithms exploiting user-supplied gradients.
• …(pronounced tee-fox) provides a set of Matlab templates, or building blocks, that can be used to construct efficient, customized solvers for a variety of convex models, including in particular those employed in sparse recovery applications. It was conceived and written by Stephen Becker, Emmanuel J. Candès and Michael Grant.

• stan is famous for Monte Carlo sampling, but also does deterministic optimisation using automatic differentiation. this is a luxurious “full service” option, although with limited scope for customisation; Curious how it performs in very high dimensions, as L-BFGS does not scale forever.

Optimization algorithms:

• Limited-memory BFGS (Stan’s default optimization algorithm)
• BFGS
• Laplace’s method for classical standard error estimates and approximate Bayesian posteriors
• Optim.jl is a generic optimizer for julia

• JuMP.jl is a domain-specific modeling language for mathematical optimization embedded in Julia. It currently supports a number of open-source and commercial solvers (Bonmin, Cbc, Clp, Couenne, CPLEX, ECOS, FICO Xpress, GLPK, Gurobi, Ipopt, KNITRO, MOSEK, NLopt, SCS, BARON) for a variety of problem classes, including linear programming, (mixed) integer programming, second-order conic programming, semidefinite programming, and nonlinear programming.

• NLsolve.jl solves systems of nonlinear equations. […]

The package is also able to solve mixed complementarity problems, which are similar to systems of nonlinear equations, except that the equality to zero is allowed to become an inequality if some boundary condition is satisfied. See further below for a formal definition and the related commands.

Since there is some overlap between optimizers and nonlinear solvers, this package borrows some ideas from the Optim package, and depends on it for linesearch algorithms.

Many of these solvers optionally use commercial backends such as Mosek.

## Refs

AABB16
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., … Devin, M. (2016) TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. ArXiv Preprint ArXiv:1603.04467.
AbHa15
Abernethy, J., & Hazan, E. (2015) Faster Convex Optimization: Simulated Annealing with an Efficient Universal Barrier. ArXiv:1507.02528 [Cs, Math].
AcNP05
Achlioptas, D., Naor, A., & Peres, Y. (2005) Rigorous location of phase transitions in hard optimization problems. Nature, 435(7043), 759–764. DOI.
AlHa16
Allen-Zhu, Z., & Hazan, E. (2016) Optimal Black-Box Reductions Between Optimization Objectives. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 1606–1614). Curran Associates, Inc.
Bach13
Bach, F. (2013) Convex relaxations of structured matrix factorizations. ArXiv:1309.3117 [Cs, Math].
BJMO12
Bach, F., Jenatton, R., Mairal, J., & Obozinski, G. (2012) Optimization with Sparsity-Inducing Penalties. Foundations and Trends® in Machine Learning, 4(1), 1–106. DOI.
Batt92
Battiti, R. (1992) First-and second-order methods for learning: between steepest descent and Newton’s method. Neural Computation, 4(2), 141–166. DOI.
BeTe09
Beck, A., & Teboulle, M. (2009) A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal on Imaging Sciences, 2(1), 183–202. DOI.
BePo05
Bertsimas, D., & Popescu, I. (2005) Optimal Inequalities in Probability Theory: A Convex Optimization Approach. SIAM Journal on Optimization, 15(3), 780–804. DOI.
BiCY14
Bian, W., Chen, X., & Ye, Y. (2014) Complexity analysis of interior point algorithms for non-Lipschitz and nonconvex minimization. Mathematical Programming, 149(1–2), 301–327. DOI.
BMAS14
Boumal, N., Mishra, B., Absil, P.-A., & Sepulchre, R. (2014) Manopt, a Matlab Toolbox for Optimization on Manifolds. Journal of Machine Learning Research, 15, 1455–1459.
Boyd10
Boyd, S. (2010) Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends® in Machine Learning, 3(1), 1–122. DOI.
BoVa04
Boyd, S. P., & Vandenberghe, L. (2004) Convex optimization. . Cambridge, UK ; New York: Cambridge University Press
Bube15
Bubeck, S. (2015) Convex Optimization: Algorithms and Complexity. Foundations and Trends® in Machine Learning, 8(3–4), 231–357. DOI.
BuEl14
Bubeck, S., & Eldan, R. (2014) The entropic barrier: a simple and optimal universal self-concordant barrier. ArXiv:1412.1587 [Cs, Math].
CeBS14
Cevher, V., Becker, S., & Schmidt, M. (2014) Convex Optimization for Big Data. IEEE Signal Processing Magazine, 31(5), 32–43. DOI.
ChYi08
Chartrand, R., & Yin, W. (2008) Iteratively reweighted algorithms for compressive sensing. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2008. ICASSP 2008 (pp. 3869–3872). DOI.
Chen12
Chen, X. (2012) Smoothing methods for nonsmooth, nonconvex minimization. Mathematical Programming, 134(1), 71–99. DOI.
CGWY12
Chen, X., Ge, D., Wang, Z., & Ye, Y. (2012) Complexity of unconstrained L_2-L_p. Mathematical Programming, 143(1–2), 371–383. DOI.
Chre08
Chretien, S. (2008) An Alternating l1 approach to the compressed sensing problem. ArXiv:0809.0660 [Stat].
CoPe08
Combettes, P. L., & Pesquet, J.-C. (2008) A proximal decomposition method for solving convex variational. Inverse Problems, 24(6), 065014. DOI.
Dala17
Dalalyan, A. S.(2017) Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. ArXiv:1704.04752 [Math, Stat].
DeBL14
Defazio, A., Bach, F., & Lacoste-Julien, S. (2014) SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. In Advances in Neural Information Processing Systems 27.
FlPo63
Fletcher, R., & Powell, M. J. D.(1963) A Rapidly Convergent Descent Method for Minimization. The Computer Journal, 6(2), 163–168. DOI.
FoKe09
Forrester, A. I. J., & Keane, A. J.(2009) Recent advances in surrogate-based optimization. Progress in Aerospace Sciences, 45(1–3), 50–79. DOI.
FHHT07
Friedman, J., Hastie, T., Höfling, H., & Tibshirani, R. (2007) Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2), 302–332. DOI.
FrHT10
Friedman, J., Hastie, T., & Tibshirani, R. (2010) Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1–22. DOI.
GaRC09
Gasso, G., Rakotomamonjy, A., & Canu, S. (2009) Recovering Sparse Signals With a Certain Family of Nonconvex Penalties and DC Programming. IEEE Transactions on Signal Processing, 57(12), 4686–4698. DOI.
Gile08
Giles, M. B.(2008) Collected Matrix Derivative Results for Forward and Reverse Mode Algorithmic Differentiation. In C. H. Bischof, H. M. Bücker, P. Hovland, U. Naumann, & J. Utke (Eds.), Advances in Automatic Differentiation (Vol. 64, pp. 35–44). Berlin, Heidelberg: Springer Berlin Heidelberg
Gold65
Goldstein, A. (1965) On Steepest Descent. Journal of the Society for Industrial and Applied Mathematics Series A Control, 3(1), 147–151. DOI.
GoSB15
Goldstein, T., Studer, C., & Baraniuk, R. (2015) FASTA: A Generalized Implementation of Forward-Backward Splitting. ArXiv:1501.04979 [Cs, Math].
HaZh12
Han, Z.-H., & Zhang, K.-S. (2012) Surrogate-based optimization.
HaJN15
Harchaoui, Z., Juditsky, A., & Nemirovski, A. (2015) Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming, 152(1–2), 75–112. DOI.
HoSr15
Hosseini, R., & Sra, S. (2015) Manifold Optimization for Gaussian Mixture Models. ArXiv Preprint ArXiv:1506.07677.
Jagg13
Jaggi, M. (2013) Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In Journal of Machine Learning Research (pp. 427–435).
JaFM14
Jakovetic, D., Freitas Xavier, J. M., & Moura, J. M. F.(2014) Convergence Rates of Distributed Nesterov-Like Gradient Methods on Random Networks. IEEE Transactions on Signal Processing, 62(4), 868–882. DOI.
KiHa16
Kim, D., & Haldar, J. P.(2016) Greedy algorithms for nonnegativity-constrained simultaneous sparse recovery. Signal Processing, 125, 274–289. DOI.
Klei04
Klein, D. (2004) Lagrange multipliers without permanent scarring. University of California at Berkeley, Computer Science Division.
LiLR16
Li, Y., Liang, Y., & Risteski, A. (2016) Recovery Guarantee of Non-negative Matrix Factorization via Alternating Updates. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 4988–4996). Curran Associates, Inc.
LiMH16
Lin, H., Mairal, J., & Harchaoui, Z. (2016) QuickeNing: A Generic Quasi-Newton Algorithm for Faster Gradient-Based Optimization *. In arXiv:1610.00960 [math, stat].
MKJS15
Ma, C., Konečnỳ, J., Jaggi, M., Smith, V., Jordan, M. I., Richtárik, P., & Takáč, M. (2015) Distributed Optimization with Arbitrary Local Solvers. ArXiv Preprint ArXiv:1512.04039.
MaBe17
Ma, S., & Belkin, M. (2017) Diving into the shallows: a computational perspective on large-scale shallow learning. ArXiv:1703.10622 [Cs, Stat].
Macl16
Maclaurin, D. (2016) Modeling, Inference and Optimization with Composable Differentiable Procedures.
MaNT04
Madsen, K., Nielsen, H. B., & Tingleff, O. (2004) Methods for non-linear least squares problems.
Mair13
Mairal, J. (2013) Optimization with first-order surrogate functions. ArXiv Preprint ArXiv:1305.3120.
Mair15
Mairal, J. (2015) Incremental Majorization-Minimization Optimization with Application to Large-Scale Machine Learning. SIAM Journal on Optimization, 25(2), 829–855. DOI.
MaBP14
Mairal, J., Bach, F., & Ponce, J. (2014) Sparse Modeling for Image and Vision Processing. Foundations and Trends® in Comput Graph. Vis., 8(2–3), 85–283. DOI.
MaBo10
Mattingley, J., & Boyd, S. (2010) Real-Time Convex Optimization in Signal Processing. IEEE Signal Processing Magazine, 27(3), 50–61. DOI.
MEKU08
Mcleod, D., Emmerson, G., Kohn, R., & (universit, G. K.(2008) Finding the invisible hand: an objective model of financial markets.
MeBM16
Mei, S., Bai, Y., & Montanari, A. (2016) The Landscape of Empirical Risk for Non-convex Losses. ArXiv:1607.06534 [Stat].
Nest04
Nesterov, Y. (2004) Introductory Lectures on Convex Optimization. (Vol. 87). Boston, MA: Springer US
Nest07
Nesterov, Y. (2007) Accelerating the cubic regularization of Newton’s method on convex problems. Mathematical Programming, 112(1), 159–181. DOI.
Nest12
Nesterov, Y. (2012) Gradient methods for minimizing composite functions. Mathematical Programming, 140(1), 125–161. DOI.
PoJu92
Polyak, B. T., & Juditsky, A. B.(1992) Acceleration of Stochastic Approximation by Averaging. SIAM Journal on Control and Optimization, 30(4), 838–855. DOI.
PoKo97
Portnoy, S., & Koenker, R. (1997) The Gaussian hare and the Laplacian tortoise: computability of squared-error versus absolute-error estimators. Statistical Science, 12(4), 279–300. DOI.
QHSG05
Queipo, N. V., Haftka, R. T., Shyy, W., Goel, T., Vaidyanathan, R., & Kevin Tucker, P. (2005) Surrogate-based analysis and optimization. Progress in Aerospace Sciences, 41(1), 1–28. DOI.
RoZh07
Rosset, S., & Zhu, J. (2007) Piecewise linear regularized solution paths. The Annals of Statistics, 35(3), 1012–1030. DOI.
SGAL14
Sagun, L., Guney, V. U., Arous, G. B., & LeCun, Y. (2014) Explorations on high dimensional landscapes. ArXiv:1412.6615 [Cs, Stat].
SSBA95
Sashank J. Reddi, Suvrit Sra, Barnabás Póczós, & Alex Smola. (1995) Stochastic Frank-Wolfe Methods for Nonconvex Optimization.
ScFR09
Schmidt, M., Fung, G., & Rosales, R. (2009) Optimization methods for l1-regularization. University of British Columbia, Technical Report TR-2009, 19.
ShTe11
Shalev-Shwartz, S., & Tewari, A. (2011) Stochastic Methods for L1-regularized Loss Minimization. J. Mach. Learn. Res., 12, 1865–1892.
STXY16
Shi, H.-J. M., Tu, S., Xu, Y., & Yin, W. (2016) A Primer on Coordinate Descent Algorithms. ArXiv:1610.00040 [Math, Stat].
SFHT11
Simon, N., Friedman, J., Hastie, T., & Tibshirani, R. (2011) Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent. Journal of Statistical Software, 39(5).
SFJJ15
Smith, V., Forte, S., Jordan, M. I., & Jaggi, M. (2015) L1-Regularized Distributed Optimization: A Communication-Efficient Primal-Dual Framework. ArXiv:1512.04011 [Cs].
StVi16
Straszak, D., & Vishnoi, N. K.(2016) IRLS and Slime Mold: Equivalence and Convergence. ArXiv:1601.02712 [Cs, Math, Stat].
This97
Thisted, R. A.(1997) [The Gaussian Hare and the Laplacian Tortoise: Computability of Squared-Error versus Absolute-Error Estimators]: Comment. Statistical Science, 12(4), 296–298.
ToKW16
Townsend, J., Koep, N., & Weichwald, S. (2016) Pymanopt: A Python Toolbox for Optimization on Manifolds using Automatic Differentiation. Journal of Machine Learning Research, 17(137), 1–5.
Wain14
Wainwright, M. J.(2014) Structured Regularizers for High-Dimensional Problems: Statistical and Computational Issues. Annual Review of Statistics and Its Application, 1(1), 233–253. DOI.
WiWi15
Wibisono, A., & Wilson, A. C.(2015) On Accelerated Methods in Optimization. ArXiv:1509.03616 [Math].
WiWJ16
Wibisono, A., Wilson, A. C., & Jordan, M. I.(2016) A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47), E7351–E7358. DOI.
WiNa16
Wipf, D., & Nagarajan, S. (2016) Iterative Reweighted l1 and l2 Methods for Finding Sparse Solution. Microsoft Research.
WrNF09
Wright, S. J., Nowak, R. D., & Figueiredo, M. A. T.(2009) Sparse Reconstruction by Separable Approximation. IEEE Transactions on Signal Processing, 57(7), 2479–2493. DOI.
WuLa08
Wu, T. T., & Lange, K. (2008) Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, 2(1), 224–244. DOI.
YaMM15
Yang, J., Meng, X., & Mahoney, M. W.(2015) Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments. ArXiv:1502.03032 [Cs, Math, Stat].
YuTo09
Yun, S., & Toh, K.-C. (2009) A coordinate gradient descent method for ℓ 1-regularized convex minimization. Computational Optimization and Applications, 48(2), 273–307. DOI.
ZYJZ15
Zhang, L., Yang, T., Jin, R., & Zhou, Z.-H. (2015) Sparse Learning for Large-scale and High-dimensional Data: A Randomized Convex-concave Optimization Approach. ArXiv:1511.03766 [Cs].