The hip high-dimensionally-tractable version of classic offline optimisation, one step at a time.

Something I am learning about belatedly; be warned. These are not reference-grade notes.

Mini-batch and stochastic methods for minimising loss when you have a lot of data, or a lot of parameters, and using it all at once is silly, or when you want to iteratively improve your solution as data comes in, and you have access to a gradient for your loss, ideally automatically calculated. It’s not clear at all that it should work, except by collating all your data and optimising offline, except that much of modern machine learning shows that it does.

Sometimes this apparently stupid trick it might even be fast for small-dimensional cases, so you may as well try.

Technically, “online” optimisation in bandit/RL problems might imply that you have to “minimise regret online”, which has a slightly different meaning and, e.g. involves seeing each training only as it arrives along some notional arrow of time, yet wishing to make the “best” decision at the next time, and possibly choosing your next experiment in order to trade-off exploration versus exploitation etc.

In, e.g., SGD you can see your data as often as you want and in whatever order, but you only look at a bit at a time. Usually the data is given and predictions make no difference to what information is available to you.

Some of the same technology pops up in each of these notions of online optimisation, but I am really thinking about SGD here.

There are many more permutations and variations used in practice.

## To file

- Elad Hazan and Satyan Kale’s tutorial on online convex optimisation.
- Geoffrey Hinton’s slides are good overview from the artificial neural network perspective, where “good in messy circumstances” is the rule and convexity is not assumed.
- Elad Hazan’s Introduction to Online Convex Optimization.
- Suvrit Sra’s eye-bleeding ugly introduction to this stuff
- Francis Bach’s slides on practical ML SGD.

### Gradient Descent

The workhorse of many large-ish scale machine-learning techniques and especially deep neural networks. See Sebastian Ruder’s explanation and his comparison.

## Conditional Gradient

a.k.a. Frank-Wolfe algorithm: Don’t know much about this.

## Stochastic

## Variance-reduced

Zeyuan Allen-Zhu : Faster Than SGD 1: Variance Reduction:

SGD is well-known for large-scale optimization. In my mind, there are two (and only two) fundamental improvements since the original introduction of SGD: (1) variance reduction, and (2) acceleration. In this post I’d love to conduct a survey regarding (1),

## Second order (Quasi-newton at scale)

LiSSA attempts to make 2nd order gradient descent methods practical (AgBH16):

linear time stochastic second order algorithm that achieves linear convergence for typical problems in machine learning while still maintaining run-times theoretically comparable to state-of-the-art first order algorithms. This relies heavily on the special structure of the optimization problem that allows our unbiased hessian estimator to be implemented efficiently, using only vector-vector products.

David McAllester observes:

Since \(H^{t+1}y^t\) can be computed efficiently whenever we can run backpropagation, the conditions under which the LiSSA algorithm can be run are actually much more general than the paper suggests. Backpropagation can be run on essentially any natural loss function.

What is Francis Bach’s new baby?

Beyond stochastic gradient descent for large-scale machine learning

Many machine learning and signal processing problems are traditionally cast as convex optimization problems. A common difficulty in solving these problems is the size of the data, where there are many observations (‘large n’) and each of these is large (‘large p’). In this setting, online algorithms such as stochastic gradient descent which pass over the data only once, are usually preferred over batch algorithms, which require multiple passes over the data. In this talk, I will show how the smoothness of loss functions may be used to design novel algorithms with improved behavior, both in theory and practice: in the ideal infinite-data setting, an efficient novel Newton-based stochastic approximation algorithm leads to a convergence rate of O(1/n) without strong convexity assumptions, while in the practical finite-data setting, an appropriate combination of batch and online algorithms leads to unexpected behaviors, such as a linear convergence rate for strongly convex problems, with an iteration cost similar to stochastic gradient descent. (joint work with Nicolas Le Roux, Eric Moulines and Mark Schmidt).

### Momentum, Nesterov Accelerated

How is this different than Newton-esque?

Firstly there is momentum gradient descent (TBD)

then Nesterov, giving Magical high speed convergence for smooth functions very simply, but with an esoteric explanation: Improved Nesterov accelaration and a lovely explanation of what the hell it is.

The summary from Goodfellow and Bengios deep learning book:

With Nesterov momentum the gradient is evaluated after the current velocity is applied. Thus one can interpret Nesterov momentum as attempting to add a correction factor to the standard method of momentum.

Normally considered for Gradient Descent, but also works for coordinate descent (Nest12a) and second order (Nest07).

Do momentum methods in the SGD setting fit here?

### …With sparsity

Hmmm.

### Parallel

Classic, basic SGD takes walks through the data set example-wise or feature-wise - but this doesn’t work in parallel, so you tend to go for mini-batch gradient descent so that you can at least vectorize. Apparently you can make SGD work in “true” parallel across communication-constrained cores, but I don’t yet understand how.

### Implementations

Specialized optimisation software.

See also statistical software.

- vowpal wabbit solve a bunch of optimisations well in an online setting.
- keras.optimizers support varied SGD-type online/minibatch optimizers for python + Tensorflow/Theano.
- tensorflow.train is a very similar list of online/SGD optimisers

### Refs

- ACDL14
- Agarwal, A., Chapelle, O., Dudık, M., & Langford, J. (2014) A Reliable Effective Terascale Linear Learning System.
*Journal of Machine Learning Research*, 15(1), 1111–1133. - AgBH16
- Agarwal, N., Bullins, B., & Hazan, E. (2016) Second Order Stochastic Optimization in Linear Time.
*ArXiv:1602.03943 [Cs, Stat]*. - AlHa16
- Allen-Zhu, Z., & Hazan, E. (2016) Optimal Black-Box Reductions Between Optimization Objectives. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 1606–1614). Curran Associates, Inc.
- BaMo11
- Bach, F., & Moulines, E. (2011) Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning. In Advances in Neural Information Processing Systems (NIPS) (p. ). Spain
- BaMo00
- Bach, F. R., & Moulines, E. (n.d.) Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n).
- Batt92
- Battiti, R. (1992) First-and second-order methods for learning: between steepest descent and Newton’s method.
*Neural Computation*, 4(2), 141–166. DOI. - BoBG09
- Bordes, A., Bottou, L., & Gallinari, P. (2009) SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent.
*Journal of Machine Learning Research*, 10, 1737–1754. - BoLB16
- Botev, A., Lever, G., & Barber, D. (2016) Nesterov’s Accelerated Gradient and Momentum as approximations to Regularised Update Descent.
*ArXiv:1607.01981 [Cs, Stat]*. - Bott91
- Bottou, L. (1991) Stochastic Gradient Learning in Neural Networks. In Proceedings of Neuro-Nîmes 91. Nimes, France: EC2
- Bott98
- Bottou, L. (1998) Online Algorithms and Stochastic Approximations. In D. Saad (Ed.), Online Learning and Neural Networks (Vol. 17, p. 142). Cambridge, UK: Cambridge University Press
- Bott10
- Bottou, L. (2010) Large-scale machine learning with stochastic gradient descent. In Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT’2010) (pp. 177–186). Paris, France: Springer
- Bott12
- Bottou, L. (2012) Stochastic Gradient Descent Tricks. In Neural Networks: Tricks of the Trade (pp. 421–436). Springer, Berlin, Heidelberg DOI.
- BoBo08
- Bottou, L., & Bousquet, O. (2008) The Tradeoffs of Large Scale Learning. In J. C. Platt, D. Koller, Y. Singer, & S. Roweis (Eds.), Advances in Neural Information Processing Systems (Vol. 20, pp. 161–168). NIPS Foundation (http://books.nips.cc)
- BoCN16
- Bottou, L., Curtis, F. E., & Nocedal, J. (2016) Optimization Methods for Large-Scale Machine Learning.
*ArXiv:1606.04838 [Cs, Math, Stat]*. - BoLe04
- Bottou, L., & LeCun, Y. (2004) Large Scale Online Learning. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in Neural Information Processing Systems 16. Cambridge, MA: MIT Press
- Bube15
- Bubeck, S. (2015) Convex Optimization: Algorithms and Complexity.
*Foundations and Trends® in Machine Learning*, 8(3–4), 231–357. DOI. - CeBS14
- Cevher, V., Becker, S., & Schmidt, M. (2014) Convex Optimization for Big Data.
*IEEE Signal Processing Magazine*, 31(5), 32–43. DOI. - Chen12
- Chen, X. (2012) Smoothing methods for nonsmooth, nonconvex minimization.
*Mathematical Programming*, 134(1), 71–99. DOI. - CHMB15
- Choromanska, A., Henaff, Mi., Mathieu, M., Ben Arous, G., & LeCun, Y. (2015) The Loss Surfaces of Multilayer Networks. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (pp. 192–204).
- DPGC14
- Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014) Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems 27 (pp. 2933–2941). Curran Associates, Inc.
- DeBL14
- Defazio, A., Bach, F., & Lacoste-Julien, S. (2014) SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. In Advances in Neural Information Processing Systems 27.
- Devo98
- DeVore, R. A.(1998) Nonlinear approximation.
*Acta Numerica*, 7, 51–150. DOI. - DuHS11
- Duchi, J., Hazan, E., & Singer, Y. (2011) Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.
*Journal of Machine Learning Research*, 12(Jul), 2121–2159. - FrSc12
- Friedlander, M. P., & Schmidt, M. (2012) Hybrid Deterministic-Stochastic Methods for Data Fitting.
*SIAM Journal on Scientific Computing*, 34(3), A1380–A1405. DOI. - Frie02
- Friedman, J. H.(2002) Stochastic gradient boosting.
*Computational Statistics & Data Analysis*, 38(4), 367–378. DOI. - GhLa13a
- Ghadimi, S., & Lan, G. (2013a) Accelerated Gradient Methods for Nonconvex Nonlinear and Stochastic Programming.
*ArXiv:1310.3787 [Math]*. - GhLa13b
- Ghadimi, S., & Lan, G. (2013b) Stochastic First- and Zeroth-order Methods for Nonconvex Stochastic Programming.
*SIAM Journal on Optimization*, 23(4), 2341–2368. DOI. - Heyd74
- Heyde, C. C.(1974) On martingale limit theory and strong convergence results for stochastic approximation procedures.
*Stochastic Processes and Their Applications*, 2(4), 359–370. DOI. - HiSK00
- Hinton, G., Srivastava, N., & Kevin Swersky. (n.d.) Neural Networks for Machine Learning.
- HuPK09
- Hu, C., Pan, W., & Kwok, J. T.(2009) Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems (pp. 781–789). Curran Associates, Inc.
- JaFM14
- Jakovetic, D., Freitas Xavier, J. M., & Moura, J. M. F.(2014) Convergence Rates of Distributed Nesterov-Like Gradient Methods on Random Networks.
*IEEE Transactions on Signal Processing*, 62(4), 868–882. DOI. - KiBa15
- Kingma, D., & Ba, J. (2015) Adam: A Method for Stochastic Optimization.
*Proceeding of ICLR*. - Lai03
- Lai, T. L.(2003) Stochastic Approximation.
*The Annals of Statistics*, 31(2), 391–406. DOI. - LaLZ09
- Langford, J., Li, L., & Zhang, T. (2009) Sparse Online Learning via Truncated Gradient. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in Neural Information Processing Systems 21 (pp. 905–912). Curran Associates, Inc.
- Levy16
- Levy, K. Y.(2016) The Power of Normalization: Faster Evasion of Saddle Points.
*ArXiv:1611.04831 [Cs, Math, Stat]*. - LiLR16
- Li, Y., Liang, Y., & Risteski, A. (2016) Recovery Guarantee of Non-negative Matrix Factorization via Alternating Updates. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 4988–4996). Curran Associates, Inc.
- LuMH15
- Lucchi, A., McWilliams, B., & Hofmann, T. (2015) A Variance Reduced Stochastic Newton Method.
*ArXiv:1503.08316 [Cs]*. - MaBe17
- Ma, S., & Belkin, M. (2017) Diving into the shallows: a computational perspective on large-scale shallow learning.
*ArXiv:1703.10622 [Cs, Stat]*. - Mair13
- Mairal, J. (2013) Stochastic majorization-minimization algorithms for large-scale optimization. In Advances in Neural Information Processing Systems (pp. 2283–2291).
- MHSY13
- McMahan, H. B., Holt, G., Sculley, D., Young, M., Ebner, D., Grady, J., … Kubica, J. (2013) Ad Click Prediction: A View from the Trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1222–1230). New York, NY, USA: ACM DOI.
- MZHR16
- Mitliagkas, I., Zhang, C., Hadjis, S., & Ré, C. (2016) Asynchrony begets Momentum, with an Application to Deep Learning.
*ArXiv:1605.09774 [Cs, Math, Stat]*. - Nest07
- Nesterov, Y. (2007) Accelerating the cubic regularization of Newton’s method on convex problems.
*Mathematical Programming*, 112(1), 159–181. DOI. - Nest12a
- Nesterov, Y. (2012a) Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems.
*SIAM Journal on Optimization*, 22(2), 341–362. DOI. - Nest12b
- Nesterov, Y. (2012b) Gradient methods for minimizing composite functions.
*Mathematical Programming*, 140(1), 125–161. DOI. - NLST17
- Nguyen, L. M., Liu, J., Scheinberg, K., & Takáč, M. (2017) Stochastic Recursive Gradient Algorithm for Nonconvex Optimization.
*ArXiv:1705.07261 [Cs, Math, Stat]*. - Pate17
- Patel, V. (2017) On SGD’s Failure in Practice: Characterizing and Overcoming Stalling.
*ArXiv:1702.00317 [Cs, Math, Stat]*. - PoJu92
- Polyak, B. T., & Juditsky, A. B.(1992) Acceleration of Stochastic Approximation by Averaging.
*SIAM Journal on Control and Optimization*, 30(4), 838–855. DOI. - RHSP16
- Reddi, S. J., Hefny, A., Sra, S., Poczos, B., & Smola, A. (2016) Stochastic Variance Reduction for Nonconvex Optimization. In PMLR (Vol. 1603, pp. 314–323).
- RoMo51
- Robbins, H., & Monro, S. (1951) A Stochastic Approximation Method.
*The Annals of Mathematical Statistics*, 22(3), 400–407. DOI. - RoSi85
- Robbins, H., & Siegmund, D. (1985) A convergence theorem for non negative almost supermartingales and some applications. In Herbert Robbins Selected Papers (pp. 111–135). Springer DOI.
- Rude16
- Ruder, S. (2016) An overview of gradient descent optimization algorithms.
*ArXiv:1609.04747 [Cs]*. - Rupp85
- Ruppert, D. (1985) A Newton-Raphson Version of the Multivariate Robbins-Monro Procedure.
*The Annals of Statistics*, 13(1), 236–245. DOI. - SGAL14
- Sagun, L., Guney, V. U., Arous, G. B., & LeCun, Y. (2014) Explorations on high dimensional landscapes.
*ArXiv:1412.6615 [Cs, Stat]*. - SaKi16
- Salimans, T., & Kingma, D. P.(2016) Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 901–901). Curran Associates, Inc.
- ShTe11
- Shalev-Shwartz, S., & Tewari, A. (2011) Stochastic Methods for L1-regularized Loss Minimization.
*J. Mach. Learn. Res.*, 12, 1865–1892. - Spal00
- Spall, J. C.(2000) Adaptive stochastic approximation by the simultaneous perturbation method.
*IEEE Transactions on Automatic Control*, 45(10), 1839–1853. DOI. - VSSM06
- Vishwanathan, S. V. N., Schraudolph, N. N., Schmidt, M. W., & Murphy, K. P.(2006) Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods. In Proceedings of the 23rd International Conference on Machine Learning.
- Wain14
- Wainwright, M. J.(2014) Structured Regularizers for High-Dimensional Problems: Statistical and Computational Issues.
*Annual Review of Statistics and Its Application*, 1(1), 233–253. DOI. - Xu11
- Xu, W. (2011) Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent.
*ArXiv:1107.2490 [Cs]*. - ZhWG17
- Zhang, X., Wang, L., & Gu, Q. (2017) Stochastic Variance-reduced Gradient Descent for Low-rank Matrix Recovery from Linear Measurements.
*ArXiv:1701.00481 [Stat]*. - ZWLS10
- Zinkevich, M., Weimer, M., Li, L., & Smola, A. J.(2010) Parallelized Stochastic Gradient Descent. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, & A. Culotta (Eds.), Advances in Neural Information Processing Systems 23 (pp. 2595–2603). Curran Associates, Inc.