No time to frame this well, but there are a lot of versions of the question, so… pick one. The essential idea is that we say: Oh my, that deep learning model I just trained had terribly good performance compared with some simpler thing I tried. Can I make my model simpler and still get good results? Or is the overparameterization essential? Can I know a decent error bound? Can I learn anything about the underlying system by looking at the parameters I learned?
And the answer is not “yes” in any satisfying general sense. Pfft.
The SGD fitting process looks a lot like simulated annealing, and it feels like there should be some nice explanation from the statistical mechanics of simulated annealing. There are other connections to physics-driven annealing methods, physics-inspired Boltzmann machines etc. TBC. Compare and contrast with the statistical mechanics of statistics. But it's not the same, so fire up the paper mill!
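To make the annealing analogy concrete, here is a minimal sketch of gradient descent with injected Gaussian noise (a Langevin-style update) on a toy 1-D double-well loss. Everything in it is an assumption for illustration: the loss, the names `noisy_gradient_descent`, `temperature` and `cooling`, and the geometric cooling schedule. It is not how SGD noise actually arises from minibatching; it only shows the shape of the analogy.

```python
import math
import random


def loss(x):
    # Toy tilted double well (hypothetical): a shallow minimum near x = +1
    # and a deeper one near x = -1.
    return (x * x - 1.0) ** 2 + 0.3 * x


def grad(x):
    # Derivative of the toy loss above.
    return 4.0 * x * (x * x - 1.0) + 0.3


def noisy_gradient_descent(x0, steps, step_size, temperature, cooling=1.0, seed=0):
    """Gradient step plus Gaussian noise scaled by a 'temperature'.

    temperature = 0 gives plain gradient descent; cooling < 1 shrinks the
    noise geometrically, loosely mimicking a simulated-annealing schedule.
    """
    rng = random.Random(seed)
    x, t = x0, temperature
    for _ in range(steps):
        noise = rng.gauss(0.0, 1.0)
        x = x - step_size * grad(x) + math.sqrt(2.0 * step_size * t) * noise
        t *= cooling
    return x


# Plain gradient descent started in the shallow basin stays there.
x_plain = noisy_gradient_descent(1.5, 5000, 0.01, temperature=0.0)

# The annealed run may or may not hop to the deeper basin, depending on the seed.
x_annealed = noisy_gradient_descent(1.5, 5000, 0.01, temperature=1.0, cooling=0.999)

print(x_plain, loss(x_plain))
print(x_annealed, loss(x_annealed))
```

With the noise switched off this is ordinary gradient descent, which settles in whichever basin it starts in. The noisy version can cross the barrier while the temperature is high, which is the behaviour the statistical-mechanics story would need to explain.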
Proceed with caution, since there is a lot of messy thinking here. Here are some things I'd like to read, but whose inclusion here should not be taken as a recommendation. The common theme is using ideas from physics to understand deep learning and other directed graph learning methods.
Charles H Martin. Why Deep Learning Works II: the Renormalization Group.
Max Tegmark argues that statistical mechanics provides insight into deep learning and neuroscience (LiTe16a, LiTe16b), although there doesn't seem to be much actionable there?
Natalie Wolchover summarises Mehta and Schwab (MeSc14).
Wiatowski et al. (WiGB17) and Shwartz-Ziv and Tishby (ShTi17) argue that looking at neural networks as random fields with energy propagation dynamics provides some insight into how they work. More impressively, IMO, Haber and Ruthotto (HaRu18) leverage some similar insights to argue you can improve NNs by looking at them as Hamiltonian ODEs.
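The ODE reading of a network can be sketched in a few lines: a residual layer `x + h * tanh(K x + b)` is a forward Euler step of dx/dt = tanh(K x + b), so stacking layers integrates the ODE forward in time. The code below is only an illustration of that general idea, not the architecture from the papers. The helper names, the 2×2 antisymmetric `K`, and the sharing of weights across layers are all assumptions chosen to keep it short.

```python
import math


def matvec(K, x):
    # Dense matrix-vector product on plain Python lists.
    return [sum(k * xi for k, xi in zip(row, x)) for row in K]


def residual_layer(x, K, b, h):
    # One residual block, read as one forward Euler step of size h.
    z = matvec(K, x)
    return [xi + h * math.tanh(zi + bi) for xi, zi, bi in zip(x, z, b)]


def forward(x, K, b, h, n_layers):
    # Stack n_layers identical blocks (weights shared only for simplicity).
    for _ in range(n_layers):
        x = residual_layer(x, K, b, h)
    return x


# Antisymmetric weights (K[i][j] == -K[j][i]), a hypothetical choice here.
K = [[0.0, 1.0], [-1.0, 0.0]]
b = [0.0, 0.0]
x0 = [1.0, 0.5]

# Same total "time" h * n_layers = 1, at two resolutions.
coarse = forward(x0, K, b, h=0.01, n_layers=100)
fine = forward(x0, K, b, h=0.005, n_layers=200)
print(coarse, fine)
```

If the network really is a discretised ODE, doubling the depth while halving the step size should barely change the output, which is what comparing `coarse` and `fine` checks. That is the handle that lets numerical-analysis ideas about stability be applied to architecture design.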
Zeyuan Allen-Zhu, Yuanzhi Li and Zhao Song argue that the combination of overparameterization and SGD is the secret.
- MeSc14: (2014) An exact mapping between the Variational Renormalization Group and Deep Learning. ArXiv:1410.3831 [Cond-Mat, Stat].
- LiTe16a: (2016a) Critical Behavior from Deep Dynamics: A Hidden Dimension in Natural Language. ArXiv:1606.06737 [Cond-Mat].
- Pere16: (2016, November 6) Deep Learning: The Unreasonable Effectiveness of Randomness. Medium.
- RuHa18: (2018) Deep Neural Networks motivated by Partial Differential Equations. ArXiv:1804.04272 [Cs, Math, Stat].
- XiLS16: (2016) Diversity Leads to Generalization in Neural Networks. ArXiv:1611.03131 [Cs, Stat].
- WiGB17: (2017) Energy Propagation in Deep Convolutional Neural Networks. IEEE Transactions on Information Theory, 1–1.
- SGAL14: (2014) Explorations on high dimensional landscapes. ArXiv:1412.6615 [Cs, Stat].
- OlMS17: (2017) Feature Visualization. Distill, 2(11), e7.
- Dala17: (2017) Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. ArXiv:1704.04752 [Math, Stat].
- KaKB17: (2017) Generalization in Deep Learning. ArXiv:1710.05468 [Cs, Stat].
- PhSC17: (2017) Gradients explode - Deep Networks are shallow - ResNet explained. ArXiv:1712.05577 [Cs].
- RoTs18: (2018) Intriguing Properties of Randomly Weighted Networks: Generalizing while Learning Next to Nothing.
- HRHJ17: (2017) Learning across scales - A multiscale method for Convolution Neural Networks. ArXiv:1703.02009 [Cs].
- RuHW86: (1986) Learning representations by back-propagating errors. Nature, 323(6088), 533–536.
- SVWX17: (2017) On the Complexity of Learning Neural Networks. ArXiv:1707.04615 [Cs].
- ArCH18: (2018) On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization. ArXiv:1802.06509 [Cs].
- ShTi17: (2017) Opening the Black Box of Deep Neural Networks via Information. ArXiv:1703.00810 [Cs].
- CMHR18: (2018) Reversible Architectures for Arbitrarily Deep Residual Neural Networks. ArXiv:1709.03698 [Cs, Stat].
- GoRS17: (2017) Size-Independent Sample Complexity of Neural Networks. ArXiv:1712.06541 [Cs, Stat].
- HaRu18: (2018) Stable architectures for deep neural networks. Inverse Problems, 34(1), 014004.
- MaHB17: (2017) Stochastic Gradient Descent as Approximate Bayesian Inference. JMLR.
- Lipt16: (2016) Stuck in a What? Adventures in Weight Space. ArXiv:1602.07320 [Cs].
- AnBe17: (2017) The High-Dimensional Geometry of Binary Neural Networks. ArXiv:1705.07199 [Cs].
- CHMB15: (2015) The Loss Surfaces of Multilayer Networks. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (pp. 192–204).
- RoTe17: (2017) The power of deeper networks for expressing natural functions. ArXiv:1705.05502 [Cs, Stat].
- GZLZ17: (2017) Towards Understanding the Invertibility of Convolutional Neural Networks. ArXiv:1705.08664 [Cs, Stat].
- ZBHR17: (2017) Understanding deep learning requires rethinking generalization. In Proceedings of ICLR.
- Barr93: (1993) Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3), 930–945.
- BBCI16: (2016) Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes. Proceedings of the National Academy of Sciences, 113(48), E7655–E7662.
- LiTe16b: (2016b) Why does deep and cheap learning work so well? ArXiv:1608.08225 [Cond-Mat, Stat].