Why does deep learning work?

despite the fact we are totally just making this shit up

No time to frame this well, but there are a lot of versions of the question, so just pick one. The essential idea is that we say: Oh my, that deep learning model I just trained had terribly good performance compared with some simpler thing I tried. Can I make my solution simpler? Can I know a decent error bound? Can I learn anything about underlying system parameters by looking at the parameters I learned?

And the answer is not “yes” in any satisfying general sense. Pfft.

The SGD fitting process looks a lot like simulated annealing, and it feels like there should be some tidy explanation from the statistical mechanics of annealing. But it is not quite the same thing, so fire up the paper mill!
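To make the resemblance (and the mismatch) concrete, here is a minimal toy sketch of my own, not taken from any of the papers below: both procedures stochastically wander over a rough loss surface, but SGD's noise comes implicitly from the gradient estimate at a roughly fixed scale, whereas simulated annealing explicitly proposes random moves and cools an acceptance temperature. The loss, step sizes and schedules are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    # A rough, multi-welled 1-D "loss" standing in for a training objective.
    def loss(w):
        return 0.5 * w**2 + 0.5 * np.sin(5 * w)

    def grad(w):
        return w + 2.5 * np.cos(5 * w)

    # SGD: follow the gradient plus noise of (roughly) fixed scale,
    # standing in for minibatch sampling noise.
    def sgd(w=3.0, lr=0.05, noise=0.5, steps=2000):
        for _ in range(steps):
            w -= lr * (grad(w) + noise * rng.normal())
        return w

    # Simulated annealing: propose random moves, accept uphill moves with
    # probability exp(-delta / T), and cool the temperature T over time.
    def anneal(w=3.0, T0=1.0, cooling=0.999, step=0.3, steps=2000):
        T = T0
        for _ in range(steps):
            w_new = w + step * rng.normal()
            delta = loss(w_new) - loss(w)
            if delta < 0 or rng.random() < np.exp(-delta / T):
                w = w_new
            T *= cooling
        return w

    print("SGD endpoint:      ", sgd())
    print("Annealing endpoint:", anneal())

The obvious gap: nothing in plain SGD corresponds to the cooling schedule or the Metropolis acceptance step, which is roughly where the analogy, and the papers, start to get interesting.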

Proceed with caution, since there is a lot of messy thinking here. Here are some things I’d like to read, but whose inclusion here should not be taken as a recommendation. The common theme is using ideas from physics to understand deep learning and other directed graph learning methods.

Charles H Martin. Why Deep Learning Works II: the Renormalization Group.

Henry Lin and Max Tegmark argue that statistical mechanics provides insight into deep learning, and into neuroscience (LiTe16a, LiTe16b).

Natalie Wolchover summarises Mehta and Schwab (MeSc14).

Wiatowski et al. (WiGB17) and Shwartz-Ziv and Tishby (ShTi17) argue that tracking how energy and information, respectively, propagate through the layers of a network provides some insight into how these models work.
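For a concrete, if crude, flavour of the information side of that claim, here is a toy sketch of my own (not the estimator used in ShTi17): discretize a hidden layer's activations and estimate the mutual information between the layer and the labels by counting. The random "trained" layer, the bin edges and the data are all invented for illustration, and binning estimators like this are famously sensitive to those choices.

    import numpy as np

    rng = np.random.default_rng(1)

    def discrete_entropy(codes):
        # Shannon entropy (in bits) of a 1-D array of discrete symbols.
        _, counts = np.unique(codes, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def mutual_information(codes, labels):
        # I(T; Y) = H(T) - H(T | Y) for discretized activations T, labels Y.
        h_cond = 0.0
        for y in np.unique(labels):
            mask = labels == y
            h_cond += mask.mean() * discrete_entropy(codes[mask])
        return discrete_entropy(codes) - h_cond

    # Toy data: binary labels, noisy inputs that carry the label, and a
    # fixed random "hidden layer" standing in for a trained one.
    n, d, width = 5000, 10, 4
    y = rng.integers(0, 2, size=n)
    x = y[:, None] + 0.5 * rng.normal(size=(n, d))
    hidden = np.tanh(x @ rng.normal(size=(d, width)))

    # Discretize each unit into a few bins and treat the joint bin pattern
    # of the whole layer as one discrete symbol per example.
    bins = np.digitize(hidden, np.linspace(-1, 1, 6))
    codes = np.array([hash(row.tobytes()) for row in bins])

    print("I(layer; label) = %.2f bits" % mutual_information(codes, y))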

There are other connections to physics-driven annealing methods, physics-inspired Boltzmann machines and so on. TBC. Compare and contrast the statistical mechanics of statistics.


Anderson, A. G., & Berg, C. P. (2017) The High-Dimensional Geometry of Binary Neural Networks. ArXiv:1705.07199 [Cs].
Baldassi, C., Borgs, C., Chayes, J. T., Ingrosso, A., Lucibello, C., Saglietti, L., & Zecchina, R. (2016) Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes. Proceedings of the National Academy of Sciences, 113(48), E7655–E7662. DOI.
Barron, A. R. (1993) Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3), 930–945. DOI.
Choromanska, A., Henaff, M., Mathieu, M., Ben Arous, G., & LeCun, Y. (2015) The Loss Surfaces of Multilayer Networks. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (pp. 192–204).
Lin, H. W., & Tegmark, M. (2016a) Critical Behavior from Deep Dynamics: A Hidden Dimension in Natural Language. ArXiv:1606.06737 [Cond-Mat].
Lin, H. W., & Tegmark, M. (2016b) Why does deep and cheap learning work so well? ArXiv:1608.08225 [Cond-Mat, Stat].
Lipton, Z. C. (2016) Stuck in a What? Adventures in Weight Space. ArXiv:1602.07320 [Cs].
Mehta, P., & Schwab, D. J. (2014) An exact mapping between the Variational Renormalization Group and Deep Learning. ArXiv:1410.3831 [Cond-Mat, Stat].
Perez, C. E. (2016, November 6) Deep Learning: The Unreasonable Effectiveness of Randomness. Medium.
Rolnick, D., & Tegmark, M. (2017) The power of deeper networks for expressing natural functions. ArXiv:1705.05502 [Cs, Stat].
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986) Learning representations by back-propagating errors. Nature, 323(6088), 533–536. DOI.
Sagun, L., Guney, V. U., Arous, G. B., & LeCun, Y. (2014) Explorations on high dimensional landscapes. ArXiv:1412.6615 [Cs, Stat].
Shwartz-Ziv, R., & Tishby, N. (2017) Opening the Black Box of Deep Neural Networks via Information. ArXiv:1703.00810 [Cs].
Wiatowski, T., Grohs, P., & Bölcskei, H. (2017) Energy Propagation in Deep Convolutional Neural Networks. ArXiv:1704.03636 [Cs, Math, Stat].
Xie, B., Liang, Y., & Song, L. (2016) Diversity Leads to Generalization in Neural Networks. ArXiv:1611.03131 [Cs, Stat].
Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017) Understanding deep learning requires rethinking generalization. In Proceedings of ICLR.