The Living Thing / Notebooks :

Optimisation, continuous, batch/offline

Crawling through alien landscapes in the fog, looking for mountain peaks.

A huge and varied research discipline; this notebook will need to be broken apart This is not an educational resource so much as keyword-dump breadcrumb trail marking my plucking of results I need from someone else’s orchard.

This page is deprecated! It was an awful way of organising optimization fields, and didn’t even meet my study needs. I will gradually be salvaging the good bits and deleting it.

Keywords: Complimentary slackness theorem, High or very high dimensional methods, approximate method, Lagrange multipliers, primal and dual problems, fixed point methods, gradient, subgradient, proximal gradient, optimal control problems, convexity, sparsity, ways to avoid wrecking finding the extrema of perfectly simple little 10000-parameter functions before everyone observes that you are a fool in the guise of a mathematician but everyone is not there because you wandered off the optimal path hours ago, and now you are alone and lost in a valley of lower-case Greek letters.

See also geometry of fitness landscapes, expectation maximisation, matrix factorisations, discrete optimisation, nature-inspired “meta-heuristic” optimisation.


Brief intro material

  • Zeyuan ALLEN-ZHU: Recent Advances in Stochastic Convex and Non-Convex Optimization. Clear, missing some details but has good pointers.

  • basic but enlightening, John Nash’s graphical explanation of R’s optimization

  • Martin Jaggi’s Optimization in two hours

  • Celebrated union of optimisation, computational complexity and command-and-control-economics, by that showoff Cosma Shalizi: In Soviet Union, Optimization Problem Solves You

  • Elad Hazan The two cultures of optimization:

    The standard curriculum in high school math includes elementary functional analysis, and methods for finding the minima, maxima and saddle points of a single dimensional function. When moving to high dimensions, this becomes beyond the reach of your typical high-school student: mathematical optimization theory spans a multitude of involved techniques in virtually all areas of computer science and mathematics.

    Iterative methods, in particular, are the most successful algorithms for large-scale optimization and the most widely used in machine learning. Of these, most popular are first-order gradient-based methods due to their very low per-iteration complexity.

    However, way before these became prominent, physicists needed to solve large scale optimization problems, since the time of the Manhattan project at Los Alamos. The problems that they faced looked very different, essentially simulation of physical experiments, as were the solutions they developed. The Metropolis algorithm is the basis for randomized optimization methods and Markov Chain Monte Carlo algorithms.[…]

    In our recent paper (AbHa15), we show that for convex optimization, the heat path and central path for IPM for a particular barrier function (called the entropic barrier, following the terminology of the recent excellent work of Bubeck and Eldan) are identical! Thus, in some precise sense, the two cultures of optimization have been studied the same object in disguise and using different techniques.

  • Accelerated gradient descent 1 and 2


Whole free textbooks online. Mostly convex.

Constrained optimisation

Related: constraint solvers.

Lagrange multipliers

Constrained optimisation using Lagrange’s one weird trick, and the Karush–Kuhn–Tucker conditions. The search for saddle points and roots.

Alternating Direction Method of Multipliers

Dunno. It’s everywhere, though. Maybe this is a problem of definition, though? (Boyd10)

In this review, we argue that the alternating direction method of multipliers is well suited to distributed convex optimization, and in particular to large-scale problems arising in statistics, machine learning, and related areas. The method was developed in the 1970s, with roots in the 1950s, and is equivalent or closely related to many other algorithms, such as dual decomposition, the method of multipliers, Douglas–Rachford splitting, Spingarn’s method of partial inverses, Dykstra’s alternating projections, Bregman iterative algorithms for \(\ell_1\) problems, proximal methods, and others. After briefly surveying the theory and history of the algorithm, we discuss applications to a wide variety of statistical and machine learning problems of recent interest, including the lasso, sparse logistic regression, basis pursuit, covariance selection, support vector machines, and many others. We also discuss general distributed optimization, extensions to the nonconvex setting, and efficient implementation, including some details on distributed MPI and Hadoop Map Reduce implementations.


Oh crap, where did I rip this quote from?

The diverse world of machine learning applications has given rise to a plethora of algorithms and optimization methods, finely tuned to the specific regression or classification task at hand. We reduce the complexity of algorithm design for machine learning by reductions: we develop reductions that take a method developed for one setting and apply it to the entire spectrum of smoothness and strong-convexity in applications.

It will be from a Hazan and Allen-Zhu paper.

Continuous approximations of iterations

Recent papers (WiWJ16 WiWi15) argue that the discrete time steps can be viewed as a discrete approximation to a continuous time ODE which approaches the optimum (which in itself is trivial), but moreover that many algorithms fit into the same families of ODEs, that these ODEs explain Nesterov acceleration and generate new, improved optimisation methods. (which is not trivial.)

Continuous relaxations of parameters

Solving discrete problems with differentiable, continuous, surrogate parameters.


…Of composite functions

Hip for sparse regression, compressed sensing. etc. “FISTA” is one option. (Bubeck explains.)

Second order (Quasi-Newton)

If you have the second derivative you can be fancy when finding zeros.

Trust region

Pretend the gradient is locally quadratic, then seeing how bad that pretense was.

Least Squares

This particular objective function has some particular shortcuts; e.g. you don’t necessarily need a gradient to do it.

Sebastien Bubeck has a good write-up. Part 1, PArt 2.

Conjugate gradient method

Finding quadratic minima, as in Least Squares.

And also not-quite-linear uses of this.

A first order method.

…on a manifold

What if your constraints are naturally represented as some kind of smooth manifold? Is that worth thinking about? Apparently sometimes it is. See, e.g. ToKW16, BMAS14, or the free textbook on this theme.

See also Nicolas Boumen’s intorductory blog post.

Optimization on manifolds is about solving problems of the form

\begin{equation*} \mathrm{minimize}_{x\in\mathcal{M}} f(x), \end{equation*}

where \(\mathcal{M}\) is a nice, known manifold. By “nice”, I mean a smooth, finite-dimensional Riemannian manifold.

Practical examples include the following (and all possible products of these):

  • Euclidean spaces
  • The sphere (set of vectors or matrices with unit Euclidean norm)
  • The Stiefel manifold (set of orthonormal matrices)
  • The Grassmann manifold (set of linear subspaces of a given dimension; this is a quotient space)
  • The rotation group (set of orthogonal matrices with determinant +1)
  • The manifold of fixed-rank matrices
  • The same, further restricted to positive semidefinite matrices
  • The cone of (strictly) positive definite matrices

Conceptually, the key point is to think of optimization on manifolds as unconstrained optimization: we do not think of mathcal{M} as being embedded in a Euclidean space. Rather, we think of mathcal{M} as being “the only thing that exists,” and we strive for intrinsic methods. Besides making for elegant theory, it also makes it clear how to handle abstract spaces numerically (such as the Grassmann manifold for example); and it gives algorithms the “right” invariances (computations do not depend on an arbitrarily chosen representation of the manifold).

There are at least two reasons why this class of problems is getting much attention lately. First, it is because optimization problems over the aforementioned sets (mostly matrix sets) come up pervasively in applications, and at some point it became clear that the intrinsic viewpoint leads to better algorithms, as compared to general-purpose constrained optimization methods (where mathcal{M} is considered as being inside a Euclidean space mathcal{E}, and algorithms move in mathcal{E}, while penalizing distance to mathcal{M}). The second is that, as I will argue momentarily, Riemannian manifolds are “the right setting” to talk about unconstrained optimization. And indeed, there is a beautiful book by P.-A. Absil (my PhD advisor), R. Sepulchre (his advisor) and R. Mahony, called Optimization algorithms on matrix manifolds (freely available), that shows how the classical methods for unconstrained optimization (gradient descent, Newton, trust-regions, conjugate gradients…) carry over seamlessly to the more general Riemannian framework. So we can have one theory to cover optimization over a huge class of sets.

To file

Miscellaneous optimisation techniques suggested on Linkedin

The whole world of exotic specialized optimisers. See, e.g. Nuit Blanche name-dropping Bregmann iteration, alternating method, augmented Lagrangian…

Coordinate descent

Descent each coordinate individually.

Small clever hack for certain domains: log gradient descent.

Fixed point methods

Contraction maps are nice when you have them. TBD.

Primal/dual problems



When your objective function is not convex but you can represent it in terms of convex functions, somehow or other, use DC optimisation. (GaRC09.) (I don’t think this guarantees you a global optimum, but rather faster convergence to a local one)

Non-convex optimisation

“How doomed are you?”

Gradient-free optimization

Not all the methods described here use gradient information, but it’s frequently assumed to be something you can access easily. It’s worth considering which objectives you can optimize easily

But not all objectives are easily differentiable, even when parameters are continuous. For example, if you are not getting your measurement from a mathematical model, but from a physical experiment you can’t differentiate it since reality itself is usually not analytically differentiable. In this latter case, you are getting close to a question of online experiment design, as in ANOVA, and a further constraint that your function evaluations are possibly stupendously expensive. See Bayesian optimisation for one approach to this i the context of experiment design.

In general situations like this we use gradient-free methods, such as simulated annealing or numerical gradient etc. “Meta-heuristic” methods ——————————————-

Biologically-inspired or arbitrary. Evolutionary algorithms, particle swarm optimisation, ant colony optimisation, harmony search. A lot of the tricks from these are adopted into mainstream stochastic methods. Some not.

See biometic algorithms for the care and husbandry of such as those.

Annealing and Monte Carlo optimisation methods

Simulated annealing: Constructing a process to yield maximally-likely estimates for the parameters. This has a statistical mechanics justification that makes it attractive to physicists; But it’s generally useful. You don’t necessarily need a gradient here, just the ability to evaluate something interpretable as a “likelihood ratio”. Long story. I don’t yet cover this at Monte carlo methods but I should.

Expectation maximization

See expectation maximisation.

My problem: constrained, pathwise sparsifying-penalty optimisers for nonlinear problems

I’m trialling a bunch of sparse Lasso-like regression models. I want them to be fast-ish to run and fast to develop. I want them to go in python. I would like to be able to vary my regularisation parameter and warm-restart, like the glmnet Lasso. I would like to be able to handle constraints, especially component-wise non-negativity, and matrix non-negative-definiteness.

Notes on that here.


use scipy.optimize

give up the idea of warm restarts, and enforce constraints in the callback.

use cvxpy

Pretty good, but not that fast since it does not in general exploit gradient information. For some problems this is fine, though.

use spams

Wonderful, but only if your problem fits one of their categories. Otherwise you can maybe extract some bits from their code and use them, but that is now a serious project. They have a passable LASSO.

Roll my own algorithm

Potentially yak-shaving. But I can work from the examples of my colleagues, which are special-purpose algorithms, usually reasonably fast ones.

ADMM (Alternating Direction Method of Multipliers) for LASSO, which is a Python implementation for an original MATLAB version.
pytron (python/c++)
A Trust-Region Newton Method in Python using TRON optimization software (files src/tron.{h,cpp}) distributed from the LIBLINEAR sources (v1.93),
in the incoming scikit-learn looks likes a good reference.


Specialised optimisation software.

See also statistical software, and online optimisation

Many of these solvers optionally use commercial backends such as Mosek.


Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., … Devin, M. (2016) TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. ArXiv Preprint ArXiv:1603.04467.
Abernethy, J., & Hazan, E. (2015) Faster Convex Optimization: Simulated Annealing with an Efficient Universal Barrier. ArXiv:1507.02528 [Cs, Math].
Achlioptas, D., Naor, A., & Peres, Y. (2005) Rigorous location of phase transitions in hard optimization problems. Nature, 435(7043), 759–764. DOI.
Allen-Zhu, Z., & Hazan, E. (2016) Optimal Black-Box Reductions Between Optimization Objectives. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 1606–1614). Curran Associates, Inc.
Bach, F. (2013) Convex relaxations of structured matrix factorizations. ArXiv:1309.3117 [Cs, Math].
Bach, F., Jenatton, R., Mairal, J., & Obozinski, G. (2012) Optimization with Sparsity-Inducing Penalties. Foundations and Trends® in Machine Learning, 4(1), 1–106. DOI.
Battiti, R. (1992) First-and second-order methods for learning: between steepest descent and Newton’s method. Neural Computation, 4(2), 141–166. DOI.
Beck, A., & Teboulle, M. (2009) A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal on Imaging Sciences, 2(1), 183–202. DOI.
Bertsimas, D., & Popescu, I. (2005) Optimal Inequalities in Probability Theory: A Convex Optimization Approach. SIAM Journal on Optimization, 15(3), 780–804. DOI.
Bian, W., Chen, X., & Ye, Y. (2014) Complexity analysis of interior point algorithms for non-Lipschitz and nonconvex minimization. Mathematical Programming, 149(1–2), 301–327. DOI.
Boumal, N., Mishra, B., Absil, P.-A., & Sepulchre, R. (2014) Manopt, a Matlab Toolbox for Optimization on Manifolds. Journal of Machine Learning Research, 15, 1455–1459.
Boyd, S. (2010) Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends® in Machine Learning, 3(1), 1–122. DOI.
Boyd, S. P., & Vandenberghe, L. (2004) Convex optimization. . Cambridge, UK ; New York: Cambridge University Press
Bubeck, S. (2015) Convex Optimization: Algorithms and Complexity. Foundations and Trends® in Machine Learning, 8(3–4), 231–357. DOI.
Bubeck, S., & Eldan, R. (2014) The entropic barrier: a simple and optimal universal self-concordant barrier. ArXiv:1412.1587 [Cs, Math].
Cevher, V., Becker, S., & Schmidt, M. (2014) Convex Optimization for Big Data. IEEE Signal Processing Magazine, 31(5), 32–43. DOI.
Chartrand, R., & Yin, W. (2008) Iteratively reweighted algorithms for compressive sensing. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2008. ICASSP 2008 (pp. 3869–3872). DOI.
Chen, X. (2012) Smoothing methods for nonsmooth, nonconvex minimization. Mathematical Programming, 134(1), 71–99. DOI.
Chen, X., Ge, D., Wang, Z., & Ye, Y. (2012) Complexity of unconstrained L_2-L_p. Mathematical Programming, 143(1–2), 371–383. DOI.
Chretien, S. (2008) An Alternating l1 approach to the compressed sensing problem. ArXiv:0809.0660 [Stat].
Combettes, P. L., & Pesquet, J.-C. (2008) A proximal decomposition method for solving convex variational. Inverse Problems, 24(6), 065014. DOI.
Dalalyan, A. S.(2017) Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. ArXiv:1704.04752 [Math, Stat].
Defazio, A., Bach, F., & Lacoste-Julien, S. (2014) SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. In Advances in Neural Information Processing Systems 27.
Fletcher, R., & Powell, M. J. D.(1963) A Rapidly Convergent Descent Method for Minimization. The Computer Journal, 6(2), 163–168. DOI.
Forrester, A. I. J., & Keane, A. J.(2009) Recent advances in surrogate-based optimization. Progress in Aerospace Sciences, 45(1–3), 50–79. DOI.
Friedman, J., Hastie, T., Höfling, H., & Tibshirani, R. (2007) Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2), 302–332. DOI.
Friedman, J., Hastie, T., & Tibshirani, R. (2010) Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1–22. DOI.
Gasso, G., Rakotomamonjy, A., & Canu, S. (2009) Recovering Sparse Signals With a Certain Family of Nonconvex Penalties and DC Programming. IEEE Transactions on Signal Processing, 57(12), 4686–4698. DOI.
Giles, M. B.(2008) Collected Matrix Derivative Results for Forward and Reverse Mode Algorithmic Differentiation. In C. H. Bischof, H. M. Bücker, P. Hovland, U. Naumann, & J. Utke (Eds.), Advances in Automatic Differentiation (Vol. 64, pp. 35–44). Berlin, Heidelberg: Springer Berlin Heidelberg
Goldstein, A. (1965) On Steepest Descent. Journal of the Society for Industrial and Applied Mathematics Series A Control, 3(1), 147–151. DOI.
Goldstein, T., Studer, C., & Baraniuk, R. (2015) FASTA: A Generalized Implementation of Forward-Backward Splitting. ArXiv:1501.04979 [Cs, Math].
Han, Z.-H., & Zhang, K.-S. (2012) Surrogate-based optimization.
Harchaoui, Z., Juditsky, A., & Nemirovski, A. (2015) Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming, 152(1–2), 75–112. DOI.
Hosseini, R., & Sra, S. (2015) Manifold Optimization for Gaussian Mixture Models. ArXiv Preprint ArXiv:1506.07677.
Jaggi, M. (2013) Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In Journal of Machine Learning Research (pp. 427–435).
Jakovetic, D., Freitas Xavier, J. M., & Moura, J. M. F.(2014) Convergence Rates of Distributed Nesterov-Like Gradient Methods on Random Networks. IEEE Transactions on Signal Processing, 62(4), 868–882. DOI.
Kim, D., & Haldar, J. P.(2016) Greedy algorithms for nonnegativity-constrained simultaneous sparse recovery. Signal Processing, 125, 274–289. DOI.
Klein, D. (2004) Lagrange multipliers without permanent scarring. University of California at Berkeley, Computer Science Division.
Li, Y., Liang, Y., & Risteski, A. (2016) Recovery Guarantee of Non-negative Matrix Factorization via Alternating Updates. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 4988–4996). Curran Associates, Inc.
Lin, H., Mairal, J., & Harchaoui, Z. (2016) QuickeNing: A Generic Quasi-Newton Algorithm for Faster Gradient-Based Optimization *. In arXiv:1610.00960 [math, stat].
Ma, C., Konečnỳ, J., Jaggi, M., Smith, V., Jordan, M. I., Richtárik, P., & Takáč, M. (2015) Distributed Optimization with Arbitrary Local Solvers. ArXiv Preprint ArXiv:1512.04039.
Ma, S., & Belkin, M. (2017) Diving into the shallows: a computational perspective on large-scale shallow learning. ArXiv:1703.10622 [Cs, Stat].
Maclaurin, D. (2016) Modeling, Inference and Optimization with Composable Differentiable Procedures.
Madsen, K., Nielsen, H. B., & Tingleff, O. (2004) Methods for non-linear least squares problems.
Mairal, J. (2013) Optimization with first-order surrogate functions. ArXiv Preprint ArXiv:1305.3120.
Mairal, J. (2015) Incremental Majorization-Minimization Optimization with Application to Large-Scale Machine Learning. SIAM Journal on Optimization, 25(2), 829–855. DOI.
Mairal, J., Bach, F., & Ponce, J. (2014) Sparse Modeling for Image and Vision Processing. Foundations and Trends® in Comput Graph. Vis., 8(2–3), 85–283. DOI.
Mattingley, J., & Boyd, S. (2010) Real-Time Convex Optimization in Signal Processing. IEEE Signal Processing Magazine, 27(3), 50–61. DOI.
Mcleod, D., Emmerson, G., Kohn, R., & (universit, G. K.(2008) Finding the invisible hand: an objective model of financial markets.
Mei, S., Bai, Y., & Montanari, A. (2016) The Landscape of Empirical Risk for Non-convex Losses. ArXiv:1607.06534 [Stat].
Nesterov, Y. (2004) Introductory Lectures on Convex Optimization. (Vol. 87). Boston, MA: Springer US
Nesterov, Y. (2007) Accelerating the cubic regularization of Newton’s method on convex problems. Mathematical Programming, 112(1), 159–181. DOI.
Nesterov, Y. (2012) Gradient methods for minimizing composite functions. Mathematical Programming, 140(1), 125–161. DOI.
Polyak, B. T., & Juditsky, A. B.(1992) Acceleration of Stochastic Approximation by Averaging. SIAM Journal on Control and Optimization, 30(4), 838–855. DOI.
Portnoy, S., & Koenker, R. (1997) The Gaussian hare and the Laplacian tortoise: computability of squared-error versus absolute-error estimators. Statistical Science, 12(4), 279–300. DOI.
Queipo, N. V., Haftka, R. T., Shyy, W., Goel, T., Vaidyanathan, R., & Kevin Tucker, P. (2005) Surrogate-based analysis and optimization. Progress in Aerospace Sciences, 41(1), 1–28. DOI.
Rosset, S., & Zhu, J. (2007) Piecewise linear regularized solution paths. The Annals of Statistics, 35(3), 1012–1030. DOI.
Sagun, L., Guney, V. U., Arous, G. B., & LeCun, Y. (2014) Explorations on high dimensional landscapes. ArXiv:1412.6615 [Cs, Stat].
Sashank J. Reddi, Suvrit Sra, Barnabás Póczós, & Alex Smola. (1995) Stochastic Frank-Wolfe Methods for Nonconvex Optimization.
Schmidt, M., Fung, G., & Rosales, R. (2009) Optimization methods for l1-regularization. University of British Columbia, Technical Report TR-2009, 19.
Shalev-Shwartz, S., & Tewari, A. (2011) Stochastic Methods for L1-regularized Loss Minimization. J. Mach. Learn. Res., 12, 1865–1892.
Shi, H.-J. M., Tu, S., Xu, Y., & Yin, W. (2016) A Primer on Coordinate Descent Algorithms. ArXiv:1610.00040 [Math, Stat].
Simon, N., Friedman, J., Hastie, T., & Tibshirani, R. (2011) Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent. Journal of Statistical Software, 39(5).
Smith, V., Forte, S., Jordan, M. I., & Jaggi, M. (2015) L1-Regularized Distributed Optimization: A Communication-Efficient Primal-Dual Framework. ArXiv:1512.04011 [Cs].
Straszak, D., & Vishnoi, N. K.(2016) IRLS and Slime Mold: Equivalence and Convergence. ArXiv:1601.02712 [Cs, Math, Stat].
Thisted, R. A.(1997) [The Gaussian Hare and the Laplacian Tortoise: Computability of Squared-Error versus Absolute-Error Estimators]: Comment. Statistical Science, 12(4), 296–298.
Townsend, J., Koep, N., & Weichwald, S. (2016) Pymanopt: A Python Toolbox for Optimization on Manifolds using Automatic Differentiation. Journal of Machine Learning Research, 17(137), 1–5.
Wainwright, M. J.(2014) Structured Regularizers for High-Dimensional Problems: Statistical and Computational Issues. Annual Review of Statistics and Its Application, 1(1), 233–253. DOI.
Wibisono, A., & Wilson, A. C.(2015) On Accelerated Methods in Optimization. ArXiv:1509.03616 [Math].
Wibisono, A., Wilson, A. C., & Jordan, M. I.(2016) A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47), E7351–E7358. DOI.
Wipf, D., & Nagarajan, S. (2016) Iterative Reweighted l1 and l2 Methods for Finding Sparse Solution. Microsoft Research.
Wright, S. J., Nowak, R. D., & Figueiredo, M. A. T.(2009) Sparse Reconstruction by Separable Approximation. IEEE Transactions on Signal Processing, 57(7), 2479–2493. DOI.
Wu, T. T., & Lange, K. (2008) Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, 2(1), 224–244. DOI.
Yang, J., Meng, X., & Mahoney, M. W.(2015) Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments. ArXiv:1502.03032 [Cs, Math, Stat].
Yun, S., & Toh, K.-C. (2009) A coordinate gradient descent method for ℓ 1-regularized convex minimization. Computational Optimization and Applications, 48(2), 273–307. DOI.
Zhang, L., Yang, T., Jin, R., & Zhou, Z.-H. (2015) Sparse Learning for Large-scale and High-dimensional Data: A Randomized Convex-concave Optimization Approach. ArXiv:1511.03766 [Cs].