Notes here iff I need ’em.

Newton-type optimization uses second-order gradient information (i.e. the Hessian matrix) to solve optimization problems.

Can one do higher-order optimisation? Of course, but in practice even second order is challenging, so let’s pause to examine it.

## Hessian free

Second-order optimisation that does not require the Hessian matrix to be given explicitly; instead it relies on Hessian-vector products, which can be computed at roughly the cost of a gradient evaluation.

Andrew Gibiansky’s worked example for coders.
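To make the “no explicit Hessian” idea concrete, here is a minimal sketch (my own illustration, not from any of the cited papers): a Hessian-vector product approximated by finite differences of the gradient, so the full Hessian is never formed.

```python
import numpy as np

def hessian_vector_product(grad_f, x, v, eps=1e-6):
    """Approximate H(x) @ v via a central difference of the gradient:
    Hv ≈ (∇f(x + eps·v) − ∇f(x − eps·v)) / (2·eps).
    Only two gradient evaluations are needed; H is never materialised."""
    return (grad_f(x + eps * v) - grad_f(x - eps * v)) / (2 * eps)

# Toy check: f(x) = 0.5 xᵀAx, so ∇f(x) = Ax and the Hessian is exactly A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad_f = lambda x: A @ x
x = np.array([1.0, -1.0])
v = np.array([0.5, 2.0])
hv = hessian_vector_product(grad_f, x, v)
# hv ≈ A @ v
```

In autodiff frameworks one would instead use Pearlmutter’s trick (differentiating the gradient-vector inner product), which gives the product exactly rather than to finite-difference accuracy.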

## Stochastic

LiSSA attempts to make second-order gradient descent methods practical (AgBH16):

> [a] linear time stochastic second order algorithm that achieves linear convergence for typical problems in machine learning while still maintaining run-times theoretically comparable to state-of-the-art first order algorithms. This relies heavily on the special structure of the optimization problem that allows our unbiased hessian estimator to be implemented efficiently, using only vector-vector products.
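The core trick in LiSSA is estimating the inverse-Hessian-vector product \(H^{-1}v\) with a truncated Neumann series, using only Hessian-vector products. A simplified deterministic sketch (a toy illustration, assuming the Hessian has been rescaled so its eigenvalues lie in \((0,1)\); the real algorithm replaces each product with a stochastic mini-batch estimate):

```python
import numpy as np

def lissa_inverse_hvp(hvp, v, steps=100):
    """Estimate H⁻¹ v via the truncated Neumann series
    H⁻¹ = Σⱼ (I − H)ʲ, valid when eigenvalues of H lie in (0, 1).
    Recursion: u ← v + u − H u, using only Hessian-vector products.
    In LiSSA, hvp would draw a fresh stochastic Hessian sample each call."""
    u = v.copy()
    for _ in range(steps):
        u = v + u - hvp(u)
    return u

# Toy quadratic: fixed H with eigenvalues in (0, 1), so the series converges.
H = np.array([[0.7, 0.1], [0.1, 0.5]])
v = np.array([1.0, 2.0])
u = lissa_inverse_hvp(lambda w: H @ w, v, steps=200)
# u approaches np.linalg.solve(H, v)
```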

David McAllester observes:

> Since \(H^{t+1}y^t\) can be computed efficiently whenever we can run backpropagation, the conditions under which the LiSSA algorithm can be run are actually much more general than the paper suggests. Backpropagation can be run on essentially any natural loss function.

What is Francis Bach’s latest project? Finite-sample guarantees for certain Newton-like treatments of SGD, for certain problems (BaMo11, BaMo13).

From his talk abstract, *Beyond stochastic gradient descent for large-scale machine learning*:

> Many machine learning and signal processing problems are traditionally cast as convex optimization problems. A common difficulty in solving these problems is the size of the data, where there are many observations (‘large n’) and each of these is large (‘large p’). In this setting, online algorithms such as stochastic gradient descent which pass over the data only once, are usually preferred over batch algorithms, which require multiple passes over the data. In this talk, I will show how the smoothness of loss functions may be used to design novel algorithms with improved behavior, both in theory and practice: in the ideal infinite-data setting, an efficient novel Newton-based stochastic approximation algorithm leads to a convergence rate of O(1/n) without strong convexity assumptions, while in the practical finite-data setting, an appropriate combination of batch and online algorithms leads to unexpected behaviors, such as a linear convergence rate for strongly convex problems, with an iteration cost similar to stochastic gradient descent. (joint work with Nicolas Le Roux, Eric Moulines and Mark Schmidt).
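A key ingredient in the Bach–Moulines line of work is iterate averaging: run plain single-pass SGD but report the average of the iterates rather than the last one. A rough sketch on least squares (my own toy illustration in that spirit, not their exact algorithm; the averaging step and constant step size are the point):

```python
import numpy as np

def averaged_sgd_least_squares(X, y, step=0.1, seed=0):
    """Single-pass constant-step SGD with Polyak–Ruppert averaging
    on a least-squares objective. Returns the running average of the
    iterates, which is what enjoys the improved O(1/n) behaviour."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    w_bar = np.zeros(p)
    for t, i in enumerate(rng.permutation(n), start=1):
        grad = (X[i] @ w - y[i]) * X[i]   # stochastic gradient on one sample
        w -= step * grad
        w_bar += (w - w_bar) / t          # running average of iterates
    return w_bar

# Toy problem: noiseless linear data, so the average should recover w_true.
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
w_hat = averaged_sgd_least_squares(X, y)
```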

## Secant conditions and update designs

Let’s say we are designing a second-order update method.

See e.g. Nocedal and Wright.

The BFGS update satisfies the secant condition \[H_k s = y,\] i.e. \[H_k(x^{(k)} − x^{(k−1)}) = \nabla f(x^{(k)}) − \nabla f(x^{(k−1)}),\] where \(s = x^{(k)} − x^{(k−1)}\) is the step and \(y\) is the corresponding change in gradient.
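The standard BFGS rank-two update (as in Nocedal and Wright) can be checked against the secant condition directly; the update is built so that the new approximation maps the step \(s\) exactly onto the gradient change \(y\). A small sketch:

```python
import numpy as np

def bfgs_update(H, s, y):
    """One BFGS update of the Hessian approximation H:
    H_new = H − (H s s' H)/(s' H s) + (y y')/(y' s).
    By construction, H_new @ s == y (the secant condition).
    Requires y @ s > 0 to preserve positive definiteness."""
    Hs = H @ s
    return H - np.outer(Hs, Hs) / (s @ Hs) + np.outer(y, y) / (y @ s)

# Check the secant condition on an arbitrary small example.
H = np.eye(2)                  # initial Hessian approximation
s = np.array([1.0, 2.0])       # step x_k − x_{k−1}
y = np.array([0.5, 1.5])       # gradient change (here y @ s > 0)
H_new = bfgs_update(H, s, y)
# H_new @ s equals y exactly
```

The condition holds exactly, not approximately: substituting \(s\) into the update, the first two terms cancel and the last contributes \(y\).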

TBC

## Refs

- ArWa17: Sébastien M. R. Arnold, Chunming Wang (2017) Accelerating SGD for Distributed Deep-Learning Using Approximated Hessian Matrix. *arXiv:1709.05069 [cs]*.
- Nest07: Yu. Nesterov (2007) Accelerating the cubic regularization of Newton’s method on convex problems. *Mathematical Programming*, 112(1), 159–181.
- DeSc89: J. E. Dennis, Robert B. Schnabel (1989) Chapter I: A view of unconstrained optimization. In *Handbooks in Operations Research and Management Science* (Vol. 1, pp. 1–72). Elsevier.
- Mart10: James Martens (2010) Deep Learning via Hessian-free Optimization. In *Proceedings of the 27th International Conference on Machine Learning* (pp. 735–742). Omnipress.
- BaGM16: Jimmy Ba, Roger Grosse, James Martens (2016) Distributed Second-Order Optimization using Kronecker-Factored Approximations.
- Schr02: Nicol N. Schraudolph (2002) Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent. *Neural Computation*, 14(7), 1723–1738.
- ChDL15: Minhyung Cho, Chandra Shekhar Dhir, Jaehyung Lee (2015) Hessian-free Optimization for Learning Deep Multidimensional Recurrent Neural Networks. In *Advances in Neural Information Processing Systems*.
- PiWa16: Mert Pilanci, Martin J. Wainwright (2016) Iterative Hessian sketch: Fast and accurate solution approximation for constrained least-squares. *Journal of Machine Learning Research*, 17(53), 1–38.
- MaSu11: James Martens, Ilya Sutskever (2011) Learning Recurrent Neural Networks with Hessian-free Optimization. In *Proceedings of the 28th International Conference on Machine Learning* (pp. 1033–1040). Omnipress.
- BMAS14: Nicolas Boumal, Bamdev Mishra, P.-A. Absil, Rodolphe Sepulchre (2014) Manopt, a Matlab Toolbox for Optimization on Manifolds. *Journal of Machine Learning Research*, 15, 1455–1459.
- MaNT04: K. Madsen, H. B. Nielsen, O. Tingleff (2004) Methods for non-linear least squares problems.
- BaMo11: Francis Bach, Eric Moulines (2011) Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning. In *Advances in Neural Information Processing Systems (NIPS)*. Spain.
- BaMo13: Francis R. Bach, Eric Moulines (2013) Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). *arXiv:1306.2119 [cs, math, stat]*, 773–781.
- ScFR09: Mark Schmidt, Glenn Fung, Romer Rosales (2009) Optimization methods for l1-regularization. *University of British Columbia, Technical Report TR-2009*, 19.
- BoCN16: Léon Bottou, Frank E. Curtis, Jorge Nocedal (2016) Optimization Methods for Large-Scale Machine Learning. *arXiv:1606.04838 [cs, math, stat]*.
- BJMO12: Francis Bach, Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski (2012) Optimization with Sparsity-Inducing Penalties. *Foundations and Trends® in Machine Learning*, 4(1), 1–106.
- AgBH16: Naman Agarwal, Brian Bullins, Elad Hazan (2016) Second Order Stochastic Optimization in Linear Time. *arXiv:1602.03943 [cs, stat]*.
- MaSu12: James Martens, Ilya Sutskever (2012) Training deep and recurrent networks with Hessian-free optimization. In *Neural networks: Tricks of the trade* (pp. 479–535). Springer.