The Living Thing / Notebooks : Regularising and generalisation in neural networks

Q: Which of these tricks can I apply outside of deep settings?

General framing

From ZBHR17:

  1. The effective capacity of neural networks is large enough for a brute-force memorization of the entire data set.
  2. Even optimization on random labels remains easy. In fact, training time increases only by a small constant factor compared with training on the true labels.
  3. Randomizing labels is solely a data transformation, leaving all other properties of the learning problem unchanged.

[…] Explicit regularization may improve generalization performance, but is neither necessary nor by itself sufficient for controlling generalization error. […] Appealing to linear models, we analyze how SGD acts as an implicit regularizer.

Early stopping

See, e.g., Prec12. Don’t keep training your model once its held-out performance has stopped improving. The regularisation method that actually makes learning go faster, because you don’t bother to do as much of it.
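A minimal sketch of the idea in numpy, on a toy overparameterised least-squares problem rather than a neural network; the patience rule and its thresholds are illustrative choices, not prescriptions from Prec12.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 50))            # overparameterised toy regression
    y = X @ rng.normal(size=50) + 0.5 * rng.normal(size=100)
    X_tr, y_tr, X_va, y_va = X[:70], y[:70], X[70:], y[70:]

    w = np.zeros(50)
    lr, patience = 1e-2, 20
    best_val, best_w, stall = np.inf, w.copy(), 0
    for epoch in range(10_000):
        w -= lr * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)   # one full-batch step
        val = np.mean((X_va @ w - y_va) ** 2)              # held-out loss
        if val < best_val - 1e-8:
            best_val, best_w, stall = val, w.copy(), 0     # keep the best weights
        else:
            stall += 1
        if stall >= patience:   # no improvement for `patience` epochs: stop early
            break
    print(epoch, best_val)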

Noise layers

Dropout

A very popular kind of noise layer; the noise is multiplicative, randomly zeroing units during training.
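A sketch of the usual “inverted” formulation in numpy; rescaling the survivors by 1/(1-p) at train time is one common convention (the alternative is to rescale at test time instead).

    import numpy as np

    def dropout(h, p=0.5, train=True, rng=None):
        """Multiplicative noise: zero each unit with probability p, rescale the rest."""
        if not train or p == 0.0:
            return h
        rng = rng if rng is not None else np.random.default_rng()
        mask = rng.random(h.shape) >= p
        return h * mask / (1.0 - p)

    h = np.ones((2, 4))
    print(dropout(h, p=0.5))   # roughly half the entries zeroed, the rest scaled to 2.0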

Input perturbation

A parametric noise layer applied to the inputs rather than the hidden units, i.e. augment the training data with perturbed copies. If you are hip you will take this further and do it by…

Adversarial training

See adversarial learning.

Regularisation penalties

L_1, L_2, dropout… These seem usually to be applied to the weights, but rarely to the actual neurons.

See Compressing neural networks for that latter use.

This is attractive but introduces a penalty-weight hyperparameter which is expensive to choose.
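For concreteness, a weight-decay (L_2) sketch in numpy on the same kind of toy least-squares problem as above; lam is the penalty-weight hyperparameter that is expensive to choose, which is what the next two sections are about.

    import numpy as np

    def penalised_loss(w, X, y, lam):
        """Mean squared error plus an L_2 (ridge / weight-decay) penalty."""
        return 0.5 * np.mean((X @ w - y) ** 2) + lam * np.sum(w ** 2)

    def penalised_grad(w, X, y, lam):
        return X.T @ (X @ w - y) / len(y) + 2.0 * lam * w

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 50))
    y = X @ rng.normal(size=50) + 0.5 * rng.normal(size=100)

    w, lam = np.zeros(50), 1e-2
    for _ in range(2_000):
        w -= 1e-2 * penalised_grad(w, X, y, lam)
    print(penalised_loss(w, X, y, lam))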

Reversible learning

An elegant autodiff hack, where you find the gradient of the trained model’s (validation) loss with respect to the hyperparameters, usually regularisation hyperparameters, although the method does not require that. Proposed by Bengio (Beng00) and Baydin and Pearlmutter (BaPe14), and made feasible by Maclaurin et al. (MaDA15): differentiate your optimisation itself with respect to the hyperparameters. Non-trivial to implement, though.
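To make “differentiate your optimisation itself” concrete, here is a sketch in JAX that unrolls a short inner gradient-descent loop and takes the gradient of the held-out loss with respect to the (log) L_2 penalty. This is plain unrolled autodiff; the point of MaDA15 is to recover the same hypergradient by running the optimisation in reverse so the whole trajectory need not be stored, which this sketch does not attempt.

    import jax
    import jax.numpy as jnp

    def val_loss_after_training(log_lam, w0, X_tr, y_tr, X_va, y_va,
                                lr=1e-2, steps=300):
        lam = jnp.exp(log_lam)            # optimise the log-penalty for positivity
        def inner_step(w, _):             # one full-batch step on the penalised loss
            g = X_tr.T @ (X_tr @ w - y_tr) / len(y_tr) + lam * w
            return w - lr * g, None
        w, _ = jax.lax.scan(inner_step, w0, None, length=steps)
        return jnp.mean((X_va @ w - y_va) ** 2)

    hypergrad = jax.grad(val_loss_after_training)   # d(held-out loss) / d(log lam)

    key = jax.random.PRNGKey(0)
    kx, kw, kn = jax.random.split(key, 3)
    X = jax.random.normal(kx, (120, 30))
    y = X @ jax.random.normal(kw, (30,)) + 0.3 * jax.random.normal(kn, (120,))
    g = hypergrad(jnp.log(0.1), jnp.zeros(30), X[:80], y[:80], X[80:], y[80:])
    print(g)   # descend along -g to tune the penalty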

Bayesian optimisation

Choose your regularisation hyperparameters well even without fancy reversible learning, by designing (near-)optimal experiments to find the best held-out loss. See Bayesian optimisation.
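A sketch, assuming the scikit-optimize package (skopt) for the Gaussian-process search, with a closed-form ridge fit standing in for “train the network with penalty lam”; any training routine that returns a held-out loss slots in the same way.

    import numpy as np
    from skopt import gp_minimize       # assumes scikit-optimize is installed

    rng = np.random.default_rng(0)
    X = rng.normal(size=(120, 30))
    y = X @ rng.normal(size=30) + 0.3 * rng.normal(size=120)
    X_tr, y_tr, X_va, y_va = X[:80], y[:80], X[80:], y[80:]

    def held_out_loss(params):
        lam = params[0]
        # closed-form ridge fit in place of a full training run
        w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(30), X_tr.T @ y_tr)
        return float(np.mean((X_va @ w - y_va) ** 2))

    res = gp_minimize(held_out_loss, [(1e-6, 1e2, "log-uniform")],
                      n_calls=25, random_state=0)
    print(res.x[0], res.fun)            # chosen penalty and its held-out loss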

Refs

Bach14
Bach, F. (2014) Breaking the Curse of Dimensionality with Convex Neural Networks. arXiv:1412.8690 [Cs, Math, Stat].
BCCC17
Bahadori, M. T., Chalupka, K., Choi, E., Chen, R., Stewart, W. F., & Sun, J. (2017) Neural Causal Regularization under the Independence of Mechanisms Assumption. arXiv:1702.02604 [Cs, Stat].
BaSL16
Baldi, P., Sadowski, P., & Lu, Z. (2016) Learning in the Machine: Random Backpropagation and the Learning Channel. arXiv:1612.02734 [Cs].
BaPe14
Baydin, A. G., & Pearlmutter, B. A. (2014) Automatic Differentiation of Algorithms for Machine Learning. arXiv:1404.7456 [Cs, Stat].
Beng00
Bengio, Y. (2000) Gradient-Based Optimization of Hyperparameters. Neural Computation, 12(8), 1889–1900. DOI.
CBMF16
Cutajar, K., Bonilla, E. V., Michiardi, P., & Filippone, M. (2016) Practical Learning of Deep Gaussian Processes via Random Fourier Features. arXiv:1610.04386 [Stat].
DaYO16
Dasgupta, S., Yoshizumi, T., & Osogami, T. (2016) Regularized Dynamic Boltzmann Machine with Delay Pruning for Unsupervised Learning of Temporal Sequences. arXiv:1610.01989 [Cs, Stat].
Gal15
Gal, Y. (2015) A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. arXiv:1512.05287 [Stat].
HaRS15
Hardt, M., Recht, B., & Singer, Y. (2015) Train faster, generalize better: Stability of stochastic gradient descent. arXiv:1509.01240 [Cs, Math, Stat].
ImTB16
Im, D. J., Tao, M., & Branson, K. (2016) An empirical analysis of the optimization of deep network loss surfaces. arXiv:1612.04010 [Cs].
MaDA15
Maclaurin, D., Duvenaud, D. K., & Adams, R. P. (2015) Gradient-based Hyperparameter Optimization through Reversible Learning. In ICML (pp. 2113–2122).
MoAV17
Molchanov, D., Ashukha, A., & Vetrov, D. (2017) Variational Dropout Sparsifies Deep Neural Networks. arXiv:1701.05369 [Cs, Stat].
Nøkl16
Nøkland, A. (2016) Direct Feedback Alignment Provides Learning in Deep Neural Networks. In Advances In Neural Information Processing Systems.
PaDG16
Pan, W., Dong, H., & Guo, Y. (2016) DropNeuron: Simplifying the Structure of Deep Neural Networks. arXiv:1606.07326 [Cs, Stat].
Pere16
Perez, C. E. (2016, November 6) Deep Learning: The Unreasonable Effectiveness of Randomness. Medium.
Prec12
Prechelt, L. (2012) Early Stopping — But When? In G. Montavon, G. B. Orr, & K.-R. Müller (Eds.), Neural Networks: Tricks of the Trade (pp. 53–67). Springer Berlin Heidelberg. DOI.
SCHU16
Scardapane, S., Comminiello, D., Hussain, A., & Uncini, A. (2016) Group Sparse Regularization for Deep Neural Networks. arXiv:1607.00485 [Cs, Stat].
SrBa16
Srinivas, S., & Babu, R. V. (2016) Generalized Dropout. arXiv:1611.06791 [Cs].
XiLS16
Xie, B., Liang, Y., & Song, L. (2016) Diversity Leads to Generalization in Neural Networks. arXiv:1611.03131 [Cs, Stat].
ZBHR17
Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017) Understanding deep learning requires rethinking generalization. In Proceedings of ICLR.