The Living Thing / Notebooks : Optimisation, continuous, online, possibly stochastic

The hip high-dimensional version of classic offline optimisation.

Something I am learning about belatedly; be warned. These are not reference-grade notes.

Batch and stochastic methods for minimising loss when you have a lot of data, or a lot of parameters, and using it all at once is silly, or when you want to iteratively improve your solution as data comes in, and you have access to a gradient for your loss, ideally automatically calculated <{filename}autodiff.rst>_. It’s not clear at all that it should work, except by collating all your data and optimising offline, except that much of modern machine learning shows that it does.

Sometimes this apparently stupid trick it might even be fast for small-dimensional cases, so you may as well try.

Technically, “online” optimisation in bandit/RL problems might imply that you have to “minimise regret online”, which has a slightly different meaning and, e.g. involves seeing each training only as it arrives along some notional arrow of time, yet wishing to make the “best” decision at the next time, and possibly choosing your next prediction in order to trade-off exploration versus exploitation etc.

In, e.g., SGD you can see your data as often as you want and in whatever order, but you only look at a little bit at a time. Usually the data is given and predictions make no difference to what information is available to you.

Some of the same technology pops up in each of these, but I am really thinking about SGD here.

There are many more permutations and variations used in practice.

Gradient Descent

The workhorse of many large-ish scale machine-learning techniques and especially deep neural networks. See Sebastian Ruder’s explanation.

Conditional Gradient

a.k.a. Frank-Wolfe algorithm: Don’t know much about this.



Zeyuan Allen-Zhu : Faster Than SGD 1: Variance Reduction:

SGD is well-known for large-scale optimization. In my mind, there are two (and only two) fundamental improvements since the original introduction of SGD: (1) variance reduction, and (2) acceleration. In this post I’d love to conduct a survey regarding (1),

Second order (Quasi-newton at scale)

LiSSA attempts to make 2nd order gradient descent methods practical (AgBH16):

linear time stochastic second order algorithm that achieves linear convergence for typical problems in machine learning while still maintaining run-times theoretically comparable to state-of-the-art first order algorithms. This relies heavily on the special structure of the optimization problem that allows our unbiased hessian estimator to be implemented efficiently, using only vector-vector products.

David McAllester observes:

Since \(H^{t+1}y^t\) can be computed efficiently whenever we can run backpropagation, the conditions under which the LiSSA algorithm can be run are actually much more general than the paper suggests. Backpropagation can be run on essentially any natural loss function.

Momentum, Nesterov Accelerated

Magical high speed convergence for smooth functions: Improved Nesterov accelaration and a lovely explanation of what the hell it is.

Normally considered for Gradient Descent, but also works for coordinate descent (Nest12a) and second order (Nest07).

Do momentum methods in the SGD setting fit here?

…With sparsity


Sundry Hacks

Weight Normalization

Pragmatically, controlling for variability in your data can be very hard in, e.g. deep learning so you might normalise it by the batch variance. Salimans and Kingma (SaKi16) have a more satisfying approach to this.

We present weight normalization: a reparameterization of the weight vectors in a neural network that decouples the length of those weight vectors from their direction. By reparameterizing the weights in this way we improve the conditioning of the optimization problem and we speed up convergence of stochastic gradient descent. Our reparameterization is inspired by batch normalization but does not introduce any dependencies between the examples in a minibatch. This means that our method can also be applied successfully to recurrent models such as LSTMs and to noise-sensitive applications such as deep reinforcement learning or generative models, for which batch normalization is less well suited. Although our method is much simpler, it still provides much of the speed-up of full batch normalization. In addition, the computational overhead of our method is lower, permitting more optimization steps to be taken in the same amount of time.

They provide an open implemention for keras, Tensorflow and lasagne.


Classic, basic SGD takes walks through the data set example-wise or feature-wise - but this doesn’t work in parallel, so you tend to go for mini-batch gradient descent so that you can at least vectorize. Apparently you can make SGD work in “true” parallel across communication-constrained cores, but I don’t yet understand how.


Specialized optimisation software.

See also statistical software.


Agarwal, A., Chapelle, O., Dudık, M., & Langford, J. (2014) A Reliable Effective Terascale Linear Learning System. Journal of Machine Learning Research, 15(1), 1111–1133.
Agarwal, N., Bullins, B., & Hazan, E. (2016) Second Order Stochastic Optimization in Linear Time. arXiv:1602.03943 [Cs, Stat].
Allen-Zhu, Z., & Hazan, E. (2016) Optimal Black-Box Reductions Between Optimization Objectives. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 1606–1614). Curran Associates, Inc.
Bach, F., & Moulines, E. (2011) Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning. In Advances in Neural Information Processing Systems (NIPS) (p. ). Spain
Bach, F. R., & Moulines, E. (n.d.) Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n).
Battiti, R. (1992) First-and second-order methods for learning: between steepest descent and Newton’s method. Neural Computation, 4(2), 141–166. DOI.
Bordes, A., Bottou, L., & Gallinari, P. (2009) SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent. Journal of Machine Learning Research, 10, 1737–1754.
Bottou, L. (1991) Stochastic Gradient Learning in Neural Networks. In Proceedings of Neuro-Nîmes 91. Nimes, France: EC2
Bottou, L. (1998) Online Algorithms and Stochastic Approximations. In D. Saad (Ed.), Online Learning and Neural Networks. Cambridge, UK: Cambridge University Press
Bottou, L. (2010) Large-scale machine learning with stochastic gradient descent. In Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT’2010) (pp. 177–186). Paris, France: Springer
Bottou, L., & Bousquet, O. (2008) The Tradeoffs of Large Scale Learning. In J. C. Platt, D. Koller, Y. Singer, & S. Roweis (Eds.), Advances in Neural Information Processing Systems (Vol. 20, pp. 161–168). NIPS Foundation (
Bottou, L., Curtis, F. E., & Nocedal, J. (2016) Optimization Methods for Large-Scale Machine Learning. arXiv:1606.04838 [Cs, Math, Stat].
Bottou, L., & LeCun, Y. (2004) Large Scale Online Learning. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in Neural Information Processing Systems 16. Cambridge, MA: MIT Press
Bubeck, S. (2015) Convex Optimization: Algorithms and Complexity. Foundations and Trends® in Machine Learning, 8(3–4), 231–357. DOI.
Cevher, V., Becker, S., & Schmidt, M. (2014) Convex Optimization for Big Data. IEEE Signal Processing Magazine, 31(5), 32–43. DOI.
Chen, X. (2012) Smoothing methods for nonsmooth, nonconvex minimization. Mathematical Programming, 134(1), 71–99. DOI.
Choromanska, A., Henaff, Mi., Mathieu, M., Ben Arous, G., & LeCun, Y. (2015) The Loss Surfaces of Multilayer Networks. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (pp. 192–204).
Defazio, A., Bach, F., & Lacoste-Julien, S. (2014) SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. In Advances in Neural Information Processing Systems 27.
DeVore, R. A.(1998) Nonlinear approximation. Acta Numerica, 7, 51–150. DOI.
Duchi, J., Hazan, E., & Singer, Y. (2011) Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121–2159.
Friedlander, M. P., & Schmidt, M. (2012) Hybrid Deterministic-Stochastic Methods for Data Fitting. SIAM Journal on Scientific Computing, 34(3), A1380–A1405. DOI.
Friedman, J. H.(2002) Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4), 367–378. DOI.
Ghadimi, S., & Lan, G. (2013a) Accelerated Gradient Methods for Nonconvex Nonlinear and Stochastic Programming. arXiv:1310.3787 [Math].
Ghadimi, S., & Lan, G. (2013b) Stochastic First- and Zeroth-order Methods for Nonconvex Stochastic Programming. SIAM Journal on Optimization, 23(4), 2341–2368. DOI.
Hinton, G., Srivastava, N., & Kevin Swersky. (n.d.) Neural Networks for Machine Learning.
Hu, C., Pan, W., & Kwok, J. T.(2009) Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems (pp. 781–789). Curran Associates, Inc.
Jakovetic, D., Freitas Xavier, J. M., & Moura, J. M. F.(2014) Convergence Rates of Distributed Nesterov-Like Gradient Methods on Random Networks. IEEE Transactions on Signal Processing, 62(4), 868–882. DOI.
Kingma, D., & Ba, J. (2014) Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [Cs].
Langford, J., Li, L., & Zhang, T. (2009) Sparse Online Learning via Truncated Gradient. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in Neural Information Processing Systems 21 (pp. 905–912). Curran Associates, Inc.
Levy, K. Y.(2016) The Power of Normalization: Faster Evasion of Saddle Points. arXiv:1611.04831 [Cs, Math, Stat].
Li, Y., Liang, Y., & Risteski, A. (2016) Recovery Guarantee of Non-negative Matrix Factorization via Alternating Updates. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 4988–4996). Curran Associates, Inc.
Lucchi, A., McWilliams, B., & Hofmann, T. (2015) A Variance Reduced Stochastic Newton Method. arXiv:1503.08316 [Cs].
Mairal, J. (2013) Stochastic majorization-minimization algorithms for large-scale optimization. In Advances in Neural Information Processing Systems (pp. 2283–2291).
McMahan, H. B., Holt, G., Sculley, D., Young, M., Ebner, D., Grady, J., … Kubica, J. (2013) Ad Click Prediction: A View from the Trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1222–1230). New York, NY, USA: ACM DOI.
Nesterov, Y. (2012a) Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems. SIAM Journal on Optimization, 22(2), 341–362. DOI.
Nesterov, Yu. (2012b) Gradient methods for minimizing composite functions. Mathematical Programming, 140(1), 125–161. DOI.
Nesterov, Yu. (2007) Accelerating the cubic regularization of Newton’s method on convex problems. Mathematical Programming, 112(1), 159–181. DOI.
Polyak, B. T., & Juditsky, A. B.(1992) Acceleration of Stochastic Approximation by Averaging. SIAM Journal on Control and Optimization, 30(4), 838–855. DOI.
Sagun, L., Guney, V. U., Arous, G. B., & LeCun, Y. (2014) Explorations on high dimensional landscapes. arXiv:1412.6615 [Cs, Stat].
Salimans, T., & Kingma, D. P.(2016) Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 901–901). Curran Associates, Inc.
Shalev-Shwartz, S., & Tewari, A. (2011) Stochastic Methods for L1-regularized Loss Minimization. J. Mach. Learn. Res., 12, 1865–1892.
Vishwanathan, S. V. N., Schraudolph, N. N., Schmidt, M. W., & Murphy, K. P.(2006) Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods. In Proceedings of the 23rd International Conference on Machine Learning.
Wainwright, M. J.(2014) Structured Regularizers for High-Dimensional Problems: Statistical and Computational Issues. Annual Review of Statistics and Its Application, 1(1), 233–253. DOI.
Xu, W. (2011) Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent. arXiv:1107.2490 [Cs].
Zhang, X., Wang, L., & Gu, Q. (2017) Stochastic Variance-reduced Gradient Descent for Low-rank Matrix Recovery from Linear Measurements. arXiv:1701.00481 [Stat].
Zinkevich, M., Weimer, M., Li, L., & Smola, A. J.(2010) Parallelized Stochastic Gradient Descent. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, & A. Culotta (Eds.), Advances in Neural Information Processing Systems 23 (pp. 2595–2603). Curran Associates, Inc.