
Why does deep learning work?

despite the fact we are totally just making this shit up

No time to frame this well, but there are a lot of versions of the question, so just pick one. The essential idea is that we say: oh my, that deep learning model I just trained performed terribly well compared with some simpler thing I tried. Can I make my solution simpler? Can I get a decent error bound? Can I learn anything about the underlying system by looking at the parameters I learned?

And the answer is not “yes” in any satisfying general sense. Pfft.

The SGD fitting process looks a lot like simulated annealing, and it feels like there should be some tidy explanation from the statistical mechanics of annealing. But it's not quite the same thing, so fire up the paper mill! A toy contrast is sketched below.
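A minimal toy sketch of the resemblance, not taken from any of the papers below: the loss, the schedules, and the noise scale are all invented for illustration. Noisy gradient descent with a decaying step size sits next to a Metropolis random walk with a decaying temperature, both trying to escape the ripples of the same bumpy one-dimensional loss.

```python
# Toy contrast: noisy gradient descent with a decaying step size vs.
# simulated annealing with a decaying temperature, on the same bumpy
# 1-D loss. Everything here (loss, schedules, noise scale) is invented
# for illustration; it is not the procedure of any cited paper.
import numpy as np

rng = np.random.default_rng(0)


def loss(w):
    # Nonconvex toy loss: a quadratic bowl with ripples on top.
    return 0.5 * w ** 2 + np.sin(5 * w)


def grad(w):
    return w + 5 * np.cos(5 * w)


def noisy_sgd(w, steps=2000, lr0=0.1, noise=1.0):
    # True gradient plus Gaussian noise standing in for minibatch noise,
    # with a 1/sqrt(t) step-size decay.
    for t in range(1, steps + 1):
        lr = lr0 / np.sqrt(t)
        w = w - lr * (grad(w) + noise * rng.normal())
    return w


def simulated_annealing(w, steps=2000, temp0=2.0):
    # Metropolis random walk with a slowly decaying temperature.
    for t in range(1, steps + 1):
        temp = temp0 / np.log(t + 1)
        proposal = w + 0.5 * rng.normal()
        if rng.random() < np.exp(-(loss(proposal) - loss(w)) / temp):
            w = proposal
    return w


w_sgd = noisy_sgd(3.0)
w_sa = simulated_annealing(3.0)
print(f"noisy SGD:  w = {w_sgd:.3f}, loss = {loss(w_sgd):.3f}")
print(f"annealing:  w = {w_sa:.3f}, loss = {loss(w_sa):.3f}")
```

Both wander downhill while their "temperature" (step size, literal temperature) shrinks, which is exactly why the analogy is tempting; the dynamics, stationary distributions, and convergence guarantees are nonetheless different beasts.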

Proceed with caution, since there is a lot of messy thinking here. Here are some things I’d like to read, but whose inclusion here should not be taken as a recommendation. The common theme is using ideas from physics to understand deep learning and other directed graph learning methods.

Charles H. Martin. Why Deep Learning Works II: the Renormalization Group.

Max Tegmark argues that statistical mechanics provides insight into deep learning, and into neuroscience too (LiTe16a, LiTe16b).

Natalie Wolchover summarises Mehta and Schwab (MeSc14).

Wiatowski et al. (WiGB17) and Shwartz-Ziv and Tishby (ShTi17) argue that looking at neural networks as random fields, with energy and information propagating through the layers, provides some insight into how they work.
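A crude numerical caricature of the propagation picture, assuming nothing from WiGB17 or ShTi17 beyond the general idea: push random inputs through a random deep ReLU network and watch how much energy (mean squared activation norm) survives each layer. The width, depth, and Gaussian weight scale are arbitrary choices made for this sketch.

```python
# Caricature of layer-wise energy propagation in a random deep ReLU
# network: track the mean squared norm of the activations per layer.
# Width, depth, and weight scale are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
width, depth, n_samples = 128, 10, 256

x = rng.normal(size=(n_samples, width))
energy = [np.mean(np.sum(x ** 2, axis=1))]

for _ in range(depth):
    # He-style scaling (variance 2/width) roughly preserves the energy
    # under ReLU; try 1.0/width instead and watch it decay geometrically.
    W = rng.normal(scale=np.sqrt(2.0 / width), size=(width, width))
    x = np.maximum(x @ W, 0.0)  # ReLU layer, no biases
    energy.append(np.mean(np.sum(x ** 2, axis=1)))

for layer, e in enumerate(energy):
    print(f"layer {layer:2d}: mean squared norm = {e:8.2f}")
```

With the 2/width variance the energy stays roughly flat across depth; with 1/width it shrinks layer by layer, which is one blunt way to see why these propagation analyses (and initialisation scales) matter.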

There are other connections: physics-driven annealing methods, physics-inspired Boltzmann machines, etc. TBC. Compare and contrast the statistical mechanics of statistics.

Refs

AnBe17
Anderson, A. G., & Berg, C. P.(2017) The High-Dimensional Geometry of Binary Neural Networks. ArXiv:1705.07199 [Cs].
BBCI16
Baldassi, C., Borgs, C., Chayes, J. T., Ingrosso, A., Lucibello, C., Saglietti, L., & Zecchina, R. (2016) Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes. Proceedings of the National Academy of Sciences, 113(48), E7655–E7662. DOI.
Barr93
Barron, A. R.(1993) Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3), 930–945. DOI.
CHMB15
Choromanska, A., Henaff, M., Mathieu, M., Ben Arous, G., & LeCun, Y. (2015) The Loss Surfaces of Multilayer Networks. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (pp. 192–204).
LiTe16a
Lin, H. W., & Tegmark, M. (2016a) Critical Behavior from Deep Dynamics: A Hidden Dimension in Natural Language. ArXiv:1606.06737 [Cond-Mat].
LiTe16b
Lin, H. W., & Tegmark, M. (2016b) Why does deep and cheap learning work so well? ArXiv:1608.08225 [Cond-Mat, Stat].
Lipt16
Lipton, Z. C.(2016) Stuck in a What? Adventures in Weight Space. ArXiv:1602.07320 [Cs].
MeSc14
Mehta, P., & Schwab, D. J.(2014) An exact mapping between the Variational Renormalization Group and Deep Learning. ArXiv:1410.3831 [Cond-Mat, Stat].
Pere16
Perez, C. E.(2016, November 6) Deep Learning: The Unreasonable Effectiveness of Randomness. Medium.
RoTe17
Rolnick, D., & Tegmark, M. (2017) The power of deeper networks for expressing natural functions. ArXiv:1705.05502 [Cs, Stat].
Rume86
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986) Learning representations by back-propagating errors. Nature, 323(6088), 533–536. DOI.
SGAL14
Sagun, L., Guney, V. U., Arous, G. B., & LeCun, Y. (2014) Explorations on high dimensional landscapes. ArXiv:1412.6615 [Cs, Stat].
ShTi17
Shwartz-Ziv, R., & Tishby, N. (2017) Opening the Black Box of Deep Neural Networks via Information. ArXiv:1703.00810 [Cs].
WiGB17
Wiatowski, T., Grohs, P., & Bölcskei, H. (2017) Energy Propagation in Deep Convolutional Neural Networks. ArXiv:1704.03636 [Cs, Math, Stat].
XiLS16
Xie, B., Liang, Y., & Song, L. (2016) Diversity Leads to Generalization in Neural Networks. ArXiv:1611.03131 [Cs, Stat].
ZBHR17
Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017) Understanding deep learning requires rethinking generalization. In Proceedings of ICLR.