# The Living Thing / Notebooks : Optimisation, continuous, online, possibly stochastic

The hip high-dimensional version of classic offline optimisation.

Something I am learning about belatedly; be warned. These are not reference-grade notes.

Batch and stochastic methods for minimising loss when you have a lot of data, or a lot of parameters, and using it all at once is silly, or when you want to iteratively improve your solution as data comes in, and you have access to a gradient for your loss, ideally automatically calculated <{filename}autodiff.rst>_. It’s not clear at all that it should work, except by collating all your data and optimising offline, except that much of modern machine learning shows that it does.

Sometimes this apparently stupid trick it might even be fast for small-dimensional cases, so you may as well try.

Technically, “online” optimisation in bandit/RL problems might imply that you have to “minimise regret online”, which has a slightly different meaning and, e.g. involves seeing each training only as it arrives along some notional arrow of time, yet wishing to make the “best” decision at the next time, and possibly choosing your next prediction in order to trade-off exploration versus exploitation etc.

In, e.g., SGD you can see your data as often as you want and in whatever order, but you only look at a little bit at a time. Usually the data is given and predictions make no difference to what information is available to you.

Some of the same technology pops up in each of these, but I am really thinking about SGD here.

There are many more permutations and variations used in practice.

## Gradient Descent

The workhorse of many large-ish scale machine-learning techniques and especially deep neural networks. See Sebastian Ruder’s explanation.

### Conditional Gradient

a.k.a. Frank-Wolfe algorithm: Don’t know much about this.

### Variance-reduced

Zeyuan Allen-Zhu : Faster Than SGD 1: Variance Reduction:

SGD is well-known for large-scale optimization. In my mind, there are two (and only two) fundamental improvements since the original introduction of SGD: (1) variance reduction, and (2) acceleration. In this post I’d love to conduct a survey regarding (1),

### Second order (Quasi-newton at scale)

LiSSA attempts to make 2nd order gradient descent methods practical (AgBH16):

linear time stochastic second order algorithm that achieves linear convergence for typical problems in machine learning while still maintaining run-times theoretically comparable to state-of-the-art first order algorithms. This relies heavily on the special structure of the optimization problem that allows our unbiased hessian estimator to be implemented efficiently, using only vector-vector products.

David McAllester observes:

Since $$H^{t+1}y^t$$ can be computed efficiently whenever we can run backpropagation, the conditions under which the LiSSA algorithm can be run are actually much more general than the paper suggests. Backpropagation can be run on essentially any natural loss function.

## Momentum, Nesterov Accelerated

Magical high speed convergence for smooth functions: Improved Nesterov accelaration and a lovely explanation of what the hell it is.

Normally considered for Gradient Descent, but also works for coordinate descent (Nest12a) and second order (Nest07).

Do momentum methods in the SGD setting fit here?

Hmmm.

## Sundry Hacks

### Weight Normalization

Pragmatically, controlling for variability in your data can be very hard in, e.g. deep learning so you might normalise it by the batch variance. Salimans and Kingma (SaKi16) have a more satisfying approach to this.

We present weight normalization: a reparameterization of the weight vectors in a neural network that decouples the length of those weight vectors from their direction. By reparameterizing the weights in this way we improve the conditioning of the optimization problem and we speed up convergence of stochastic gradient descent. Our reparameterization is inspired by batch normalization but does not introduce any dependencies between the examples in a minibatch. This means that our method can also be applied successfully to recurrent models such as LSTMs and to noise-sensitive applications such as deep reinforcement learning or generative models, for which batch normalization is less well suited. Although our method is much simpler, it still provides much of the speed-up of full batch normalization. In addition, the computational overhead of our method is lower, permitting more optimization steps to be taken in the same amount of time.

They provide an open implemention for keras, Tensorflow and lasagne.

## Parallel

Classic, basic SGD takes walks through the data set example-wise or feature-wise - but this doesn’t work in parallel, so you tend to go for mini-batch gradient descent so that you can at least vectorize. Apparently you can make SGD work in “true” parallel across communication-constrained cores, but I don’t yet understand how.

## Implementations

Specialized optimisation software.

See also statistical software.

• vowpal wabbit solve a bunch of optimisations well in an online setting.
• keras.optimizers support varied SGD-type online/minibatch optimizers for python + Tensorflow/Theano.
• tensorflow.train is a very similar list of online/SGD optimisers

## Refs

ACDL14
Agarwal, A., Chapelle, O., Dudık, M., & Langford, J. (2014) A Reliable Effective Terascale Linear Learning System. Journal of Machine Learning Research, 15(1), 1111–1133.
AgBH16
Agarwal, N., Bullins, B., & Hazan, E. (2016) Second Order Stochastic Optimization in Linear Time. arXiv:1602.03943 [Cs, Stat].
AlHa16
Allen-Zhu, Z., & Hazan, E. (2016) Optimal Black-Box Reductions Between Optimization Objectives. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 1606–1614). Curran Associates, Inc.
BaMo11
Bach, F., & Moulines, E. (2011) Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning. In Advances in Neural Information Processing Systems (NIPS) (p. ). Spain
BaMo00
Bach, F. R., & Moulines, E. (n.d.) Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n).
Batt92
Battiti, R. (1992) First-and second-order methods for learning: between steepest descent and Newton’s method. Neural Computation, 4(2), 141–166. DOI.
BoBG09
Bordes, A., Bottou, L., & Gallinari, P. (2009) SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent. Journal of Machine Learning Research, 10, 1737–1754.
Bott91
Bottou, L. (1991) Stochastic Gradient Learning in Neural Networks. In Proceedings of Neuro-Nîmes 91. Nimes, France: EC2
Bott98
Bottou, L. (1998) Online Algorithms and Stochastic Approximations. In D. Saad (Ed.), Online Learning and Neural Networks. Cambridge, UK: Cambridge University Press
Bott10
Bottou, L. (2010) Large-scale machine learning with stochastic gradient descent. In Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT’2010) (pp. 177–186). Paris, France: Springer
BoBo08
Bottou, L., & Bousquet, O. (2008) The Tradeoffs of Large Scale Learning. In J. C. Platt, D. Koller, Y. Singer, & S. Roweis (Eds.), Advances in Neural Information Processing Systems (Vol. 20, pp. 161–168). NIPS Foundation (http://books.nips.cc)
BoCN16
Bottou, L., Curtis, F. E., & Nocedal, J. (2016) Optimization Methods for Large-Scale Machine Learning. arXiv:1606.04838 [Cs, Math, Stat].
BoLe04
Bottou, L., & LeCun, Y. (2004) Large Scale Online Learning. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in Neural Information Processing Systems 16. Cambridge, MA: MIT Press
Bube15
Bubeck, S. (2015) Convex Optimization: Algorithms and Complexity. Foundations and Trends® in Machine Learning, 8(3–4), 231–357. DOI.
CeBS14
Cevher, V., Becker, S., & Schmidt, M. (2014) Convex Optimization for Big Data. IEEE Signal Processing Magazine, 31(5), 32–43. DOI.
Chen12
Chen, X. (2012) Smoothing methods for nonsmooth, nonconvex minimization. Mathematical Programming, 134(1), 71–99. DOI.
CHMB15
Choromanska, A., Henaff, Mi., Mathieu, M., Ben Arous, G., & LeCun, Y. (2015) The Loss Surfaces of Multilayer Networks. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (pp. 192–204).
DeBL14
Defazio, A., Bach, F., & Lacoste-Julien, S. (2014) SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. In Advances in Neural Information Processing Systems 27.
Devo98
DeVore, R. A.(1998) Nonlinear approximation. Acta Numerica, 7, 51–150. DOI.
DuHS11
Duchi, J., Hazan, E., & Singer, Y. (2011) Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121–2159.
FrSc12
Friedlander, M. P., & Schmidt, M. (2012) Hybrid Deterministic-Stochastic Methods for Data Fitting. SIAM Journal on Scientific Computing, 34(3), A1380–A1405. DOI.
Frie02
Friedman, J. H.(2002) Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4), 367–378. DOI.
GhLa13a
Ghadimi, S., & Lan, G. (2013a) Accelerated Gradient Methods for Nonconvex Nonlinear and Stochastic Programming. arXiv:1310.3787 [Math].
GhLa13b
Ghadimi, S., & Lan, G. (2013b) Stochastic First- and Zeroth-order Methods for Nonconvex Stochastic Programming. SIAM Journal on Optimization, 23(4), 2341–2368. DOI.
HiSK00
Hinton, G., Srivastava, N., & Kevin Swersky. (n.d.) Neural Networks for Machine Learning.
HuPK09
Hu, C., Pan, W., & Kwok, J. T.(2009) Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems (pp. 781–789). Curran Associates, Inc.
JaFM14
Jakovetic, D., Freitas Xavier, J. M., & Moura, J. M. F.(2014) Convergence Rates of Distributed Nesterov-Like Gradient Methods on Random Networks. IEEE Transactions on Signal Processing, 62(4), 868–882. DOI.
KiBa14
Kingma, D., & Ba, J. (2014) Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [Cs].
LaLZ09
Langford, J., Li, L., & Zhang, T. (2009) Sparse Online Learning via Truncated Gradient. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in Neural Information Processing Systems 21 (pp. 905–912). Curran Associates, Inc.
Levy16
Levy, K. Y.(2016) The Power of Normalization: Faster Evasion of Saddle Points. arXiv:1611.04831 [Cs, Math, Stat].
LiLR16
Li, Y., Liang, Y., & Risteski, A. (2016) Recovery Guarantee of Non-negative Matrix Factorization via Alternating Updates. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 4988–4996). Curran Associates, Inc.
LuMH15
Lucchi, A., McWilliams, B., & Hofmann, T. (2015) A Variance Reduced Stochastic Newton Method. arXiv:1503.08316 [Cs].
Mair13
Mairal, J. (2013) Stochastic majorization-minimization algorithms for large-scale optimization. In Advances in Neural Information Processing Systems (pp. 2283–2291).
MHSY13
McMahan, H. B., Holt, G., Sculley, D., Young, M., Ebner, D., Grady, J., … Kubica, J. (2013) Ad Click Prediction: A View from the Trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1222–1230). New York, NY, USA: ACM DOI.
Nest12a
Nesterov, Y. (2012a) Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems. SIAM Journal on Optimization, 22(2), 341–362. DOI.
Nest12b
Nesterov, Yu. (2012b) Gradient methods for minimizing composite functions. Mathematical Programming, 140(1), 125–161. DOI.
Nest07
Nesterov, Yu. (2007) Accelerating the cubic regularization of Newton’s method on convex problems. Mathematical Programming, 112(1), 159–181. DOI.
PoJu92
Polyak, B. T., & Juditsky, A. B.(1992) Acceleration of Stochastic Approximation by Averaging. SIAM Journal on Control and Optimization, 30(4), 838–855. DOI.
SGAL14
Sagun, L., Guney, V. U., Arous, G. B., & LeCun, Y. (2014) Explorations on high dimensional landscapes. arXiv:1412.6615 [Cs, Stat].
SaKi16
Salimans, T., & Kingma, D. P.(2016) Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 901–901). Curran Associates, Inc.
ShTe11
Shalev-Shwartz, S., & Tewari, A. (2011) Stochastic Methods for L1-regularized Loss Minimization. J. Mach. Learn. Res., 12, 1865–1892.
VSSM06
Vishwanathan, S. V. N., Schraudolph, N. N., Schmidt, M. W., & Murphy, K. P.(2006) Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods. In Proceedings of the 23rd International Conference on Machine Learning.
Wain14
Wainwright, M. J.(2014) Structured Regularizers for High-Dimensional Problems: Statistical and Computational Issues. Annual Review of Statistics and Its Application, 1(1), 233–253. DOI.
Xu11
Xu, W. (2011) Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent. arXiv:1107.2490 [Cs].
ZhWG17
Zhang, X., Wang, L., & Gu, Q. (2017) Stochastic Variance-reduced Gradient Descent for Low-rank Matrix Recovery from Linear Measurements. arXiv:1701.00481 [Stat].
ZWLS10
Zinkevich, M., Weimer, M., Li, L., & Smola, A. J.(2010) Parallelized Stochastic Gradient Descent. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, & A. Culotta (Eds.), Advances in Neural Information Processing Systems 23 (pp. 2595–2603). Curran Associates, Inc.