Q: Which of these tricks bring new insight that I can apply outside of deep-learning settings?

## General framing

How do we get generalisation from neural networks?

Here’s one interesting perspective. Is it correct? (ZBHR17)

> The effective capacity of neural networks is large enough for a brute-force memorization of the entire data set.
>
> Even optimization on random labels remains easy. In fact, training time increases only by a small constant factor compared with training on the true labels.
>
> Randomizing labels is solely a data transformation, leaving all other properties of the learning problem unchanged.
>
> […] Explicit regularization may improve generalization performance, but is neither necessary nor by itself sufficient for controlling generalization error. […] Appealing to linear models, we analyze how SGD acts as an implicit regularizer.

## Early stopping

See, e.g., Prec12. Don’t keep training your model once validation error stops improving. The one regularisation method that actually makes learning go faster, because you don’t bother to do as much of it.
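The patience rule behind most early-stopping implementations can be sketched in a few lines; `val_losses` here stands in for the per-epoch validation losses of any training loop:

```python
# A minimal sketch of patience-based early stopping.

def early_stop_epoch(val_losses, patience=3):
    """Return the epoch with the best validation loss, stopping the scan
    once `patience` epochs pass without improvement."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop
    return best_epoch

# losses that improve, then plateau and worsen:
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.74]
print(early_stop_epoch(losses))  # → 2
```

In practice you would also checkpoint the weights at the best epoch and restore them after stopping.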

## Noise layers

### Dropout

A very popular type of noise layer, multiplicative. Interesting because it has a rationale in terms of model averaging (Srivastava et al. 2014) and as a kind of implicit probabilistic learning (Gal and Ghahramani 2016).
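A minimal sketch of “inverted” dropout, the multiplicative Bernoulli noise of Srivastava et al. (2014); surviving activations are scaled by `1/keep_prob` at train time so that test time needs no rescaling:

```python
import numpy as np

def dropout(x, keep_prob=0.5, train=True, rng=None):
    """Multiplicative Bernoulli noise on activations (inverted dropout)."""
    if not train:
        return x  # test time: identity
    rng = rng or np.random.default_rng(0)
    mask = rng.random(x.shape) < keep_prob  # Bernoulli(keep_prob) mask
    return x * mask / keep_prob             # rescale to preserve expectation

x = np.ones((4, 3))
y = dropout(x, keep_prob=0.8)
# E[y] == x, since surviving units are scaled up by 1 / 0.8
```

The model-averaging reading: each mask selects a random sub-network, and the test-time identity pass approximates averaging over the exponentially many sub-networks trained.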

### Input perturbation

Parametric noise layer: perturb the training inputs themselves, e.g. with additive noise. If you are hip you will take this further and do it by…
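The plain (non-adversarial) version is just additive Gaussian noise on the inputs; `sigma` here is a hypothetical noise-scale hyperparameter:

```python
import numpy as np

def perturb_inputs(x, sigma=0.1, rng=None):
    """Add zero-mean Gaussian noise to each training input."""
    rng = rng or np.random.default_rng(0)
    return x + sigma * rng.standard_normal(x.shape)
```

Applied afresh at every epoch, this acts as a cheap data augmentation; for linear models it is classically equivalent to an L_2 penalty.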

### Adversarial training

See adversarial learning.

## Regularisation penalties

L_1, L_2, dropout… These seem to be applied to weights, but rarely to actual neurons.

See Compressing neural networks for that latter use.

This is attractive but comes with an expensive hyperparameter (the penalty strength) to choose.
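The penalties themselves are one-liners added to the base loss; `lam` is the regularisation-strength hyperparameter just mentioned:

```python
import numpy as np

def penalised_loss(base_loss, weights, lam=1e-2, kind="l2"):
    """Add an L1 or L2 penalty on a list of weight arrays to a base loss."""
    if kind == "l2":
        penalty = lam * sum(np.sum(w ** 2) for w in weights)
    elif kind == "l1":
        penalty = lam * sum(np.sum(np.abs(w)) for w in weights)
    else:
        raise ValueError(kind)
    return base_loss + penalty

w = [np.array([1.0, -2.0])]
penalised_loss(0.5, w, lam=0.1, kind="l2")  # 0.5 + 0.1 * (1 + 4) = 1.0
```

Penalising neurons rather than weights amounts to grouping the penalty over each unit’s incoming weight vector, as in the group-sparse penalties of Scardapane et al. (2016).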

### Reversible learning

An elegant autodiff hack, where you find the gradient of the (validation) loss with respect to the model hyperparameters, usually regularisation hyperparameters, although nothing requires that. Proposed by Bengio (Beng00) and by Baydin and Pearlmutter (BaPe14), and made feasible by Maclaurin et al. (MaDA15): differentiate your optimisation itself with respect to the hyperparameters. Non-trivial to implement, though.
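The quantity being computed, the hypergradient of validation loss with respect to a penalty strength, is easy to illustrate crudely; this is a finite-difference toy on ridge regression, not the memory-efficient reverse-through-SGD trick of MaDA15, and every name in it is a stand-in:

```python
import numpy as np

def val_loss(lam, X_tr, y_tr, X_va, y_va):
    """Validation loss after 'training to convergence' at penalty lam
    (closed-form ridge solution stands in for the inner optimisation)."""
    d = X_tr.shape[1]
    w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ y_tr)
    r = X_va @ w - y_va
    return 0.5 * np.mean(r ** 2)

def hypergradient(lam, data, eps=1e-5):
    """Central-difference estimate of d(val_loss)/d(lam)."""
    return (val_loss(lam + eps, *data) - val_loss(lam - eps, *data)) / (2 * eps)

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(40)
data = (X[:30], y[:30], X[30:], y[30:])
g = hypergradient(0.1, data)  # usable for gradient descent on lam itself
```

Reversible learning replaces the finite differences with exact reverse-mode differentiation through the entire SGD trajectory, which is what makes it non-trivial to implement.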

### Bayesian optimisation

Choose your regularisation hyperparameters well even without fancy reversible learning, by designing optimal experiments to find the minimum loss. See Bayesian optimisation.
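A toy version of the loop, assuming a single 1-D penalty strength and a hypothetical `objective` standing in for “train the model and return validation loss”; a real implementation would use a proper GP library and acquisition function:

```python
import numpy as np

def rbf(a, b, ls=0.3):
    """Squared-exponential kernel between two 1-D point sets."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_posterior(x_obs, y_obs, x_query, noise=1e-6):
    """GP posterior mean and variance at query points, zero-mean prior."""
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf(x_obs, x_query)
    Kss = rbf(x_query, x_query)
    mean = Ks.T @ np.linalg.solve(K, y_obs)
    var = np.diag(Kss - Ks.T @ np.linalg.solve(K, Ks))
    return mean, np.maximum(var, 0.0)

def bayes_opt(objective, lo=0.0, hi=1.0, n_init=3, n_iter=10, kappa=2.0):
    """Minimise objective over [lo, hi] via a lower-confidence-bound rule."""
    grid = np.linspace(lo, hi, 200)
    x = np.linspace(lo, hi, n_init)          # initial design
    y = np.array([objective(v) for v in x])
    for _ in range(n_iter):
        mean, var = gp_posterior(x, y, grid)
        lcb = mean - kappa * np.sqrt(var)    # optimism under uncertainty
        x_next = grid[np.argmin(lcb)]        # next experiment to run
        x = np.append(x, x_next)
        y = np.append(y, objective(x_next))
    return x[np.argmin(y)]                   # best hyperparameter found

best = bayes_opt(lambda lam: (lam - 0.3) ** 2)  # toy objective, min at 0.3
```

The point is sample efficiency: each “experiment” (a full training run) is expensive, so the surrogate decides where to spend the next one.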

## Normalization

### Weight Normalization

Pragmatically, controlling for variability in your data can be very hard in deep learning, so you might normalise it by the batch variance, i.e. batch normalization. Salimans and Kingma (SaKi16) have a more satisfying approach to this.

> We present weight normalization: a reparameterisation of the weight vectors in a neural network that decouples the length of those weight vectors from their direction. By reparameterizing the weights in this way we improve the conditioning of the optimization problem and we speed up convergence of stochastic gradient descent. Our reparameterisation is inspired by batch normalization but does not introduce any dependencies between the examples in a minibatch. This means that our method can also be applied successfully to recurrent models such as LSTMs and to noise-sensitive applications such as deep reinforcement learning or generative models, for which batch normalization is less well suited. Although our method is much simpler, it still provides much of the speed-up of full batch normalization. In addition, the computational overhead of our method is lower, permitting more optimization steps to be taken in the same amount of time.

They provide an open implementation for Keras, TensorFlow, and Lasagne.
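The reparameterisation itself is a one-liner: each weight vector `w` is written as `w = g * v / ||v||`, so the length `g` and direction `v / ||v||` are learned as separate parameters:

```python
import numpy as np

def weight_norm(v, g):
    """SaKi16 reparameterisation: w = g * v / ||v||."""
    return g * v / np.linalg.norm(v)

v = np.array([3.0, 4.0])
w = weight_norm(v, g=2.0)
# ||w|| == g, regardless of the scale of v
```

Because the norm of `w` is pinned to `g`, gradient steps on `v` can only change the direction, which is what improves the conditioning without any minibatch statistics.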

## Refs

Bach, Francis. 2014. “Breaking the Curse of Dimensionality with Convex Neural Networks,” December. http://arxiv.org/abs/1412.8690.

Bahadori, Mohammad Taha, Krzysztof Chalupka, Edward Choi, Robert Chen, Walter F. Stewart, and Jimeng Sun. 2017. “Neural Causal Regularization Under the Independence of Mechanisms Assumption,” February. http://arxiv.org/abs/1702.02604.

Baldi, Pierre, Peter Sadowski, and Zhiqin Lu. 2016. “Learning in the Machine: Random Backpropagation and the Learning Channel,” December. http://arxiv.org/abs/1612.02734.

Baydin, Atilim Gunes, and Barak A. Pearlmutter. 2014. “Automatic Differentiation of Algorithms for Machine Learning,” April. http://arxiv.org/abs/1404.7456.

Belkin, Mikhail, Siyuan Ma, and Soumik Mandal. 2018. “To Understand Deep Learning We Need to Understand Kernel Learning.” In *International Conference on Machine Learning*, 541–49. http://proceedings.mlr.press/v80/belkin18a.html.

Bengio, Yoshua. 2000. “Gradient-Based Optimization of Hyperparameters.” *Neural Computation* 12 (8): 1889–1900. https://doi.org/10.1162/089976600300015187.

Dasgupta, Sakyasingha, Takayuki Yoshizumi, and Takayuki Osogami. 2016. “Regularized Dynamic Boltzmann Machine with Delay Pruning for Unsupervised Learning of Temporal Sequences,” September. http://arxiv.org/abs/1610.01989.

Gal, Yarin, and Zoubin Ghahramani. 2016. “A Theoretically Grounded Application of Dropout in Recurrent Neural Networks.” In. http://arxiv.org/abs/1512.05287.

Golowich, Noah, Alexander Rakhlin, and Ohad Shamir. 2017. “Size-Independent Sample Complexity of Neural Networks,” December. http://arxiv.org/abs/1712.06541.

Graves, Alex. 2011. “Practical Variational Inference for Neural Networks.” In *Proceedings of the 24th International Conference on Neural Information Processing Systems*, 2348–56. NIPS’11. USA: Curran Associates Inc. https://papers.nips.cc/paper/4329-practical-variational-inference-for-neural-networks.pdf.

Hardt, Moritz, Benjamin Recht, and Yoram Singer. 2015. “Train Faster, Generalize Better: Stability of Stochastic Gradient Descent,” September. http://arxiv.org/abs/1509.01240.

Im, Daniel Jiwoong, Michael Tao, and Kristin Branson. 2016. “An Empirical Analysis of the Optimization of Deep Network Loss Surfaces,” December. http://arxiv.org/abs/1612.04010.

Kawaguchi, Kenji, Leslie Pack Kaelbling, and Yoshua Bengio. 2017. “Generalization in Deep Learning,” October. http://arxiv.org/abs/1710.05468.

Klambauer, Günter, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. 2017. “Self-Normalizing Neural Networks,” June. http://arxiv.org/abs/1706.02515.

Lobacheva, Ekaterina, Nadezhda Chirkova, and Dmitry Vetrov. 2017. “Bayesian Sparsification of Recurrent Neural Networks.” In *Workshop on Learning to Generate Natural Language*. http://arxiv.org/abs/1708.00077.

Maclaurin, Dougal, David K. Duvenaud, and Ryan P. Adams. 2015. “Gradient-Based Hyperparameter Optimization Through Reversible Learning.” In *ICML*, 2113–22. http://www.jmlr.org/proceedings/papers/v37/maclaurin15.pdf.

Molchanov, Dmitry, Arsenii Ashukha, and Dmitry Vetrov. 2017. “Variational Dropout Sparsifies Deep Neural Networks.” In *Proceedings of ICML*. http://arxiv.org/abs/1701.05369.

Nguyen Xuan Vinh, Sarah Erfani, Sakrapee Paisitkriangkrai, James Bailey, Christopher Leckie, and Kotagiri Ramamohanarao. 2016. “Training Robust Models Using Random Projection.” In, 531–36. IEEE. https://doi.org/10.1109/ICPR.2016.7899688.

Nøkland, Arild. 2016. “Direct Feedback Alignment Provides Learning in Deep Neural Networks.” In *Advances in Neural Information Processing Systems*. http://arxiv.org/abs/1609.01596.

Pan, Wei, Hao Dong, and Yike Guo. 2016. “DropNeuron: Simplifying the Structure of Deep Neural Networks,” June. http://arxiv.org/abs/1606.07326.

Perez, Carlos E. n.d. “Deep Learning: The Unreasonable Effectiveness of Randomness.” Medium. https://medium.com/intuitionmachine/deep-learning-the-unreasonable-effectiveness-of-randomness-14d5aef13f87#.g5sjhxjrn.

Prechelt, Lutz. 2012. “Early Stopping — but When?” In *Neural Networks: Tricks of the Trade*, edited by Grégoire Montavon, Geneviève B. Orr, and Klaus-Robert Müller, 53–67. Lecture Notes in Computer Science 7700. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-35289-8_5.

Salimans, Tim, and Diederik P Kingma. 2016. “Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks.” In *Advances in Neural Information Processing Systems 29*, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 901–1. Curran Associates, Inc. http://papers.nips.cc/paper/6114-weight-normalization-a-simple-reparameterization-to-accelerate-training-of-deep-neural-networks.pdf.

Scardapane, Simone, Danilo Comminiello, Amir Hussain, and Aurelio Uncini. 2016. “Group Sparse Regularization for Deep Neural Networks,” July. http://arxiv.org/abs/1607.00485.

Srinivas, Suraj, and R. Venkatesh Babu. 2016. “Generalized Dropout,” November. http://arxiv.org/abs/1611.06791.

Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting.” *The Journal of Machine Learning Research* 15 (1): 1929–58. http://www.jmlr.org/papers/volume15/srivastava14a.old/source/srivastava14a.pdf.

Xie, Bo, Yingyu Liang, and Le Song. 2016. “Diversity Leads to Generalization in Neural Networks,” November. http://arxiv.org/abs/1611.03131.

Zhang, Chiyuan, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2017. “Understanding Deep Learning Requires Rethinking Generalization.” In *Proceedings of ICLR*. http://arxiv.org/abs/1611.03530.