# Optimisation, continuous, online, possibly stochastic

### looking at random things to find the hidden thing

The hip high-dimensionally-tractable version of classic offline optimisation, one step at a time.

Something I am learning about belatedly; be warned. These are not reference-grade notes.

Mini-batch and stochastic methods for minimising loss when you have a lot of data, or a lot of parameters, and using it all at once is silly, or when you want to iteratively improve your solution as data comes in, and you have access to a gradient for your loss, ideally automatically calculated. It’s not clear at all that it should work, except by collating all your data and optimising offline, except that much of modern machine learning shows that it does.

Sometimes this apparently stupid trick it might even be fast for small-dimensional cases, so you may as well try.

Technically, “online” optimisation in bandit/RL problems might imply that you have to “minimise regret online”, which has a slightly different meaning and, e.g. involves seeing each training only as it arrives along some notional arrow of time, yet wishing to make the “best” decision at the next time, and possibly choosing your next experiment in order to trade-off exploration versus exploitation etc.

In SGD you can see your data as often as you want and in whatever order, but you only look at a bit at a time. Usually the data is given and predictions make no difference to what information is available to you.

Some of the same technology pops up in each of these notions of online optimisation, but I am really thinking about SGD here.

There are many more permutations and variations used in practice.

## To file

The workhorse of many large-ish scale machine-learning techniques and especially deep neural networks. See Sebastian Ruder’s explanation and his comparison.

a.k.a. Frank-Wolfe algorithm: Don’t know much about this.

## Variance-reduced

Zeyuan Allen-Zhu : Faster Than SGD 1: Variance Reduction:

SGD is well-known for large-scale optimization. In my mind, there are two (and only two) fundamental improvements since the original introduction of SGD: (1) variance reduction, and (2) acceleration. In this post I’d love to conduct a survey regarding (1),

## Second order (Quasi-newton at scale)

LiSSA attempts to make 2nd order gradient descent methods practical (AgBH16):

linear time stochastic second order algorithm that achieves linear convergence for typical problems in machine learning while still maintaining run-times theoretically comparable to state-of-the-art first order algorithms. This relies heavily on the special structure of the optimization problem that allows our unbiased hessian estimator to be implemented efficiently, using only vector-vector products.

David McAllester observes:

Since $H^{t+1}y^t$ can be computed efficiently whenever we can run backpropagation, the conditions under which the LiSSA algorithm can be run are actually much more general than the paper suggests. Backpropagation can be run on essentially any natural loss function.

What is Francis Bach’s new baby? Finite sample guarnatees for certain unusual treatemetns of SGD for certain problems: (BaMo11, BaMo13)

Beyond stochastic gradient descent for large-scale machine learning

Many machine learning and signal processing problems are traditionally cast as convex optimization problems. A common difficulty in solving these problems is the size of the data, where there are many observations (‘large n’) and each of these is large (‘large p’). In this setting, online algorithms such as stochastic gradient descent which pass over the data only once, are usually preferred over batch algorithms, which require multiple passes over the data. In this talk, I will show how the smoothness of loss functions may be used to design novel algorithms with improved behavior, both in theory and practice: in the ideal infinite-data setting, an efficient novel Newton-based stochastic approximation algorithm leads to a convergence rate of O(1/n) without strong convexity assumptions, while in the practical finite-data setting, an appropriate combination of batch and online algorithms leads to unexpected behaviors, such as a linear convergence rate for strongly convex problems, with an iteration cost similar to stochastic gradient descent. (joint work with Nicolas Le Roux, Eric Moulines and Mark Schmidt).

### Momentum, Nesterov Accelerated

How is this different than Newton-esque?

Firstly there is momentum gradient descent (TBD)

then Nesterov, giving Magical high speed convergence for smooth functions very simply, but with an esoteric explanation: Improved Nesterov accelaration and a lovely explanation of what the hell it is.

The summary from Goodfellow and Bengios deep learning book:

With Nesterov momentum the gradient is evaluated after the current velocity is applied. Thus one can interpret Nesterov momentum as attempting to add a correction factor to the standard method of momentum.

Normally considered for Gradient Descent, but also works for coordinate descent (Nest12a) and second order (Nest07).

Do momentum methods in the SGD setting fit here?

Hmmm.

### Sundry Hacks

Yellowfin an automatic SGD momentum tuner

### Parallel

Classic, basic SGD takes walks through the data set example-wise or feature-wise - but this doesn’t work in parallel, so you tend to go for mini-batch gradient descent so that you can at least vectorize. Apparently you can make SGD work in “true” parallel across communication-constrained cores, but I don’t yet understand how.

### Implementations

Specialized optimisation software.

• vowpal wabbit solve a bunch of optimisations well in an online setting.
• keras.optimizers support varied SGD-type online/minibatch optimizers for python + Tensorflow/Theano.
• tensorflow.train is a very similar list of online/SGD optimisers

### Refs

ACDL14
Agarwal, A., Chapelle, O., Dudık, M., & Langford, J. (2014) A Reliable Effective Terascale Linear Learning System. Journal of Machine Learning Research, 15(1), 1111–1133.
AgBH16
Agarwal, N., Bullins, B., & Hazan, E. (2016) Second Order Stochastic Optimization in Linear Time. ArXiv:1602.03943 [Cs, Stat].
AlHa16
Allen-Zhu, Z., & Hazan, E. (2016) Optimal Black-Box Reductions Between Optimization Objectives. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 1606–1614). Curran Associates, Inc.
BaMo11
Bach, F., & Moulines, E. (2011) Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning. In Advances in Neural Information Processing Systems (NIPS) (p. ). Spain
BaMo13
Bach, F. R., & Moulines, E. (2013) Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). ArXiv:1306.2119 [Cs, Math, Stat].
Batt92
Battiti, R. (1992) First-and second-order methods for learning: between steepest descent and Newton’s method. Neural Computation, 4(2), 141–166. DOI.
BoBG09
Bordes, A., Bottou, L., & Gallinari, P. (2009) SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent. Journal of Machine Learning Research, 10, 1737–1754.
BoLB16
Botev, A., Lever, G., & Barber, D. (2016) Nesterov’s Accelerated Gradient and Momentum as approximations to Regularised Update Descent. ArXiv:1607.01981 [Cs, Stat].
Bott91
Bottou, L. (1991) Stochastic Gradient Learning in Neural Networks. In Proceedings of Neuro-Nîmes 91. Nimes, France: EC2
Bott98
Bottou, L. (1998) Online Algorithms and Stochastic Approximations. In D. Saad (Ed.), Online Learning and Neural Networks (Vol. 17, p. 142). Cambridge, UK: Cambridge University Press
Bott10
Bottou, L. (2010) Large-scale machine learning with stochastic gradient descent. In Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT’2010) (pp. 177–186). Paris, France: Springer
Bott12
Bottou, L. (2012) Stochastic Gradient Descent Tricks. In Neural Networks: Tricks of the Trade (pp. 421–436). Springer, Berlin, Heidelberg DOI.
BoBo08
Bottou, L., & Bousquet, O. (2008) The Tradeoffs of Large Scale Learning. In J. C. Platt, D. Koller, Y. Singer, & S. Roweis (Eds.), Advances in Neural Information Processing Systems (Vol. 20, pp. 161–168). NIPS Foundation (http://books.nips.cc)
BoCN16
Bottou, L., Curtis, F. E., & Nocedal, J. (2016) Optimization Methods for Large-Scale Machine Learning. ArXiv:1606.04838 [Cs, Math, Stat].
BoLe04
Bottou, L., & LeCun, Y. (2004) Large Scale Online Learning. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in Neural Information Processing Systems 16. Cambridge, MA: MIT Press
Bube15
Bubeck, S. (2015) Convex Optimization: Algorithms and Complexity. Foundations and Trends® in Machine Learning, 8(3–4), 231–357. DOI.
CeBS14
Cevher, V., Becker, S., & Schmidt, M. (2014) Convex Optimization for Big Data. IEEE Signal Processing Magazine, 31(5), 32–43. DOI.
Chen12
Chen, X. (2012) Smoothing methods for nonsmooth, nonconvex minimization. Mathematical Programming, 134(1), 71–99. DOI.
CHMB15
Choromanska, A., Henaff, Mi., Mathieu, M., Ben Arous, G., & LeCun, Y. (2015) The Loss Surfaces of Multilayer Networks. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (pp. 192–204).
DPGC14
Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014) Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems 27 (pp. 2933–2941). Curran Associates, Inc.
DeBL14
Defazio, A., Bach, F., & Lacoste-Julien, S. (2014) SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. In Advances in Neural Information Processing Systems 27.
Devo98
DeVore, R. A.(1998) Nonlinear approximation. Acta Numerica, 7, 51–150. DOI.
DuHS11
Duchi, J., Hazan, E., & Singer, Y. (2011) Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12(Jul), 2121–2159.
FrSc12
Friedlander, M. P., & Schmidt, M. (2012) Hybrid Deterministic-Stochastic Methods for Data Fitting. SIAM Journal on Scientific Computing, 34(3), A1380–A1405. DOI.
Frie02
Friedman, J. H.(2002) Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4), 367–378. DOI.
GhLa13a
Ghadimi, S., & Lan, G. (2013a) Accelerated Gradient Methods for Nonconvex Nonlinear and Stochastic Programming. ArXiv:1310.3787 [Math].
GhLa13b
Ghadimi, S., & Lan, G. (2013b) Stochastic First- and Zeroth-order Methods for Nonconvex Stochastic Programming. SIAM Journal on Optimization, 23(4), 2341–2368. DOI.
HaLS15
Hazan, E., Levy, K., & Shalev-Shwartz, S. (2015) Beyond Convexity: Stochastic Quasi-Convex Optimization. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 28 (pp. 1594–1602). Curran Associates, Inc.
Heyd74
Heyde, C. C.(1974) On martingale limit theory and strong convergence results for stochastic approximation procedures. Stochastic Processes and Their Applications, 2(4), 359–370. DOI.
HiSK00
Hinton, G., Srivastava, N., & Kevin Swersky. (n.d.) Neural Networks for Machine Learning.
HuPK09
Hu, C., Pan, W., & Kwok, J. T.(2009) Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems (pp. 781–789). Curran Associates, Inc.
JaFM14
Jakovetic, D., Freitas Xavier, J. M., & Moura, J. M. F.(2014) Convergence Rates of Distributed Nesterov-Like Gradient Methods on Random Networks. IEEE Transactions on Signal Processing, 62(4), 868–882. DOI.
KiBa15
Kingma, D., & Ba, J. (2015) Adam: A Method for Stochastic Optimization. Proceeding of ICLR.
Lai03
Lai, T. L.(2003) Stochastic Approximation. The Annals of Statistics, 31(2), 391–406. DOI.
LaLZ09
Langford, J., Li, L., & Zhang, T. (2009) Sparse Online Learning via Truncated Gradient. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in Neural Information Processing Systems 21 (pp. 905–912). Curran Associates, Inc.
Levy16
Levy, K. Y.(2016) The Power of Normalization: Faster Evasion of Saddle Points. ArXiv:1611.04831 [Cs, Math, Stat].
LiLR16
Li, Y., Liang, Y., & Risteski, A. (2016) Recovery Guarantee of Non-negative Matrix Factorization via Alternating Updates. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 4988–4996). Curran Associates, Inc.
LuMH15
Lucchi, A., McWilliams, B., & Hofmann, T. (2015) A Variance Reduced Stochastic Newton Method. ArXiv:1503.08316 [Cs].
MaBe17
Ma, S., & Belkin, M. (2017) Diving into the shallows: a computational perspective on large-scale shallow learning. ArXiv:1703.10622 [Cs, Stat].
Mair13
Mairal, J. (2013) Stochastic majorization-minimization algorithms for large-scale optimization. In Advances in Neural Information Processing Systems (pp. 2283–2291).
MHSY13
McMahan, H. B., Holt, G., Sculley, D., Young, M., Ebner, D., Grady, J., … Kubica, J. (2013) Ad Click Prediction: A View from the Trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1222–1230). New York, NY, USA: ACM DOI.
MZHR16
Mitliagkas, I., Zhang, C., Hadjis, S., & Ré, C. (2016) Asynchrony begets Momentum, with an Application to Deep Learning. ArXiv:1605.09774 [Cs, Math, Stat].
Nest07
Nesterov, Y. (2007) Accelerating the cubic regularization of Newton’s method on convex problems. Mathematical Programming, 112(1), 159–181. DOI.
Nest12a
Nesterov, Y. (2012a) Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems. SIAM Journal on Optimization, 22(2), 341–362. DOI.
Nest12b
Nesterov, Y. (2012b) Gradient methods for minimizing composite functions. Mathematical Programming, 140(1), 125–161. DOI.
NLST17
Nguyen, L. M., Liu, J., Scheinberg, K., & Takáč, M. (2017) Stochastic Recursive Gradient Algorithm for Nonconvex Optimization. ArXiv:1705.07261 [Cs, Math, Stat].
Pate17
Patel, V. (2017) On SGD’s Failure in Practice: Characterizing and Overcoming Stalling. ArXiv:1702.00317 [Cs, Math, Stat].
PoJu92
Polyak, B. T., & Juditsky, A. B.(1992) Acceleration of Stochastic Approximation by Averaging. SIAM Journal on Control and Optimization, 30(4), 838–855. DOI.
RHSP16
Reddi, S. J., Hefny, A., Sra, S., Poczos, B., & Smola, A. (2016) Stochastic Variance Reduction for Nonconvex Optimization. In PMLR (Vol. 1603, pp. 314–323).
RoMo51
Robbins, H., & Monro, S. (1951) A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3), 400–407. DOI.
RoSi85
Robbins, H., & Siegmund, D. (1985) A convergence theorem for non negative almost supermartingales and some applications. In Herbert Robbins Selected Papers (pp. 111–135). Springer DOI.
Rude16
Ruder, S. (2016) An overview of gradient descent optimization algorithms. ArXiv:1609.04747 [Cs].
Rupp85
Ruppert, D. (1985) A Newton-Raphson Version of the Multivariate Robbins-Monro Procedure. The Annals of Statistics, 13(1), 236–245. DOI.
SGAL14
Sagun, L., Guney, V. U., Arous, G. B., & LeCun, Y. (2014) Explorations on high dimensional landscapes. ArXiv:1412.6615 [Cs, Stat].
SaKi16
Salimans, T., & Kingma, D. P.(2016) Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 901–901). Curran Associates, Inc.
ShTe11
Shalev-Shwartz, S., & Tewari, A. (2011) Stochastic Methods for L1-regularized Loss Minimization. J. Mach. Learn. Res., 12, 1865–1892.
Spal00
Spall, J. C.(2000) Adaptive stochastic approximation by the simultaneous perturbation method. IEEE Transactions on Automatic Control, 45(10), 1839–1853. DOI.
VSSM06
Vishwanathan, S. V. N., Schraudolph, N. N., Schmidt, M. W., & Murphy, K. P.(2006) Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods. In Proceedings of the 23rd International Conference on Machine Learning.
Wain14
Wainwright, M. J.(2014) Structured Regularizers for High-Dimensional Problems: Statistical and Computational Issues. Annual Review of Statistics and Its Application, 1(1), 233–253. DOI.
Xu11
Xu, W. (2011) Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent. ArXiv:1107.2490 [Cs].
ZhWG17
Zhang, X., Wang, L., & Gu, Q. (2017) Stochastic Variance-reduced Gradient Descent for Low-rank Matrix Recovery from Linear Measurements. ArXiv:1701.00481 [Stat].
ZWLS10
Zinkevich, M., Weimer, M., Li, L., & Smola, A. J.(2010) Parallelized Stochastic Gradient Descent. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, & A. Culotta (Eds.), Advances in Neural Information Processing Systems 23 (pp. 2595–2603). Curran Associates, Inc.