Modern computational neural network methods reascend the hype phase transition. A.k.a. *deep learning* or *double plus fancy brainbots* or *please give the department a bigger GPU budget it’s not to play video games I swear*.

I don’t intend to write an introduction to deep learning here; that ground has been tilled already.

But here are some handy links to resources I frequently use.

## What?

To be specific, deep learning is

- a library of incremental improvements in areas such as stochastic gradient descent, approximation theory, graphical models, and signal processing research, plus some handy advancements in SIMD architectures that, taken together, surprisingly elicit the kind of results from machine learning that everyone was hoping we’d get by at least 20 years ago, yet *without* requiring us to develop substantially more clever grad students to do so, or,
- the state-of-the-art in artificial kitten recognition, or,
- a rapidly metastasizing buzzword.

It’s a frothy (some might say foamy-mouthed) research bubble right now, with such cuteness at the extrema as, e.g., Inceptionising inceptionism (ADGH16), which learns to learn neural networks using neural networks. (Well, it sort of does that, but it is a long way from a bootstrapping general AI.) Stay tuned for more of this.

There is not much to do with “neurons” left in the paradigm at this stage. What there is, is a bundle of clever tricks for training deep constrained hierarchical predictors and classifiers on modern computer hardware. Something closer to a convenient technology stack than a single “theory”.

Some network methods hew closer to behaviour of real neurons, although not *that* close; simulating actual brains is a different discipline with only intermittent and indirect connection.

Subtopics of interest to me:

- recurrent networks for audio data
- compressing deep networks
- neural stack machines
- probabilistic learning
- generative models, esp for art

## Why bother?

There are many answers here.

A classic —

### The ultimate regression algorithm

…until the next ultimate regression algorithm.

It turns out that this particular learning model (class of learning models), together with its training technologies, is surprisingly good at getting ever better models out of ever more data. Why burn three grad students on a perfectly tractable and specific regression algorithm when you can use one algorithm to solve a whole bunch of regression problems, and which improves with the number of computers and the amount of data you have? How much of a relief is it to capital to decouple its effectiveness from the uncertainty and obstreperousness of human labour?

### Cool maths

Function approximations, interesting manifold inference. Weird product measure things, e.g. Mont14.

Even the stuff I’d assumed was trivial, like backpropagation, has a few wrinkles in practice. See Michael Nielsen’s chapter and Christopher Olah’s visual summary.

Yes, this is a regular paper mill. Not only are there probably new insights to be had here, but also you can recycle any old machine learning insight, replace a layer in a network with that and *poof* – new paper.

### Insight into the mind

TBD. Maybe.

There are claims of communication between real neurology and neural networks in computer vision, but elsewhere neural networks are driven by their similarities to other things, such as being differentiable relaxations of traditional models (differentiable stack machines!), or by being a license to fit hierarchical models without regard for statistical niceties.

There might be some kind of occasional “stylised fact”-type relationship here.

### Trippy art projects

See generative art and neural networks

## Hip keywords for NN models

Not necessarily mutually exclusive; some design patterns you can use.

There are many summaries floating around here. Some that I looked at are Tomasz Malisiewicz’s summary of Deep Learning Trends @ ICLR 2016, or the Neural network zoo or Simon Brugman’s deep learning papers.

Some of these are descriptions of topologies, others of training tricks or whatever. Recurrent and convolutional are two types of topologies you might have in your ANN. But there are so many other possible ones: “Grid”, “highway”, “Turing”, and others…

Many are mentioned in passing in David McAllester’s Cognitive Architectures post.

### Probabilistic/variational

See probabilistic Neural Networks.

### Convolutional

Signal processing baked in to neural networks. Not so complicated if you have ever done signal processing, apart from the abstruse use of “depth” to mean 2 different things in the literature.

Generally uses FIR filters plus some smudgy “pooling” (which is nonlinear downsampling), although IIR is also making an appearance by running RNN on multiple axes.
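To make the signal-processing reading concrete, here is a minimal NumPy sketch of one convolutional “layer” in one dimension: an FIR filter, a rectifier, then max pooling. The names `fir_filter` and `max_pool` and the hand-picked two-tap moving average are my own illustrative assumptions; in a real convnet the taps would be learned, and most frameworks actually compute cross-correlation rather than flipped convolution (identical here because the taps are symmetric).

```python
import numpy as np

def fir_filter(signal, taps):
    """Finite-impulse-response filter: a sliding weighted sum over the signal."""
    # mode="valid" keeps only positions where the taps fully overlap the signal.
    return np.convolve(signal, taps, mode="valid")

def max_pool(signal, width):
    """Nonlinear downsampling: keep the maximum of each non-overlapping window."""
    usable = len(signal) - len(signal) % width  # drop a ragged tail, if any
    return signal[:usable].reshape(-1, width).max(axis=1)

signal = np.array([0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0])
taps = np.array([0.5, 0.5])            # hypothetical hand-picked filter
filtered = fir_filter(signal, taps)    # length 9 - 2 + 1 = 8
rectified = np.maximum(filtered, 0.0)  # ReLU nonlinearity
pooled = max_pool(rectified, 2)        # length 4
```

Stacking several of these filter, nonlinearity, pool stages is most of what “deep” means in this setting.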

Terence Broad’s convnet visualizer

See the convnets entry.

### Generative Adversarial Networks

Train two networks to beat each other.

### Recurrent neural networks

Feedback neural network structures, which have memory and a notion of time and of “current” versus “past” state. See recurrent neural networks.
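A minimal sketch of the feedback idea, assuming the plainest possible (“vanilla”, tanh) recurrent cell rather than anything fancier such as an LSTM. The function name `rnn_step`, the sizes, and the random weights are illustrative assumptions only; nothing is trained here.

```python
import numpy as np

def rnn_step(h_prev, x, W_h, W_x, b):
    """One time step: the new state depends on the past state and the current input."""
    return np.tanh(W_h @ h_prev + W_x @ x + b)

rng = np.random.default_rng(0)
hidden, inputs = 4, 3
W_h = rng.normal(scale=0.5, size=(hidden, hidden))  # state-to-state feedback weights
W_x = rng.normal(scale=0.5, size=(hidden, inputs))  # input-to-state weights
b = np.zeros(hidden)

sequence = rng.normal(size=(5, inputs))  # five time steps of input
h = np.zeros(hidden)                     # the "past" starts out empty
states = []
for x in sequence:
    h = rnn_step(h, x, W_h, W_x, b)      # "current" state overwrites "past" state
    states.append(h)
states = np.stack(states)                # shape (time, hidden)
```

The same weights are reused at every step, which is what lets the network handle sequences of arbitrary length.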

#### GridRNN etc

A mini-genre. KaDG15 et al connect recurrent cells across multiple axes, leading to a higher-rank MIMO system; this is natural in many kinds of spatial random fields, and I am amazed it was uncommon enough to need formalizing in a paper; but apparently it was and it did.

### Partial training

A.k.a. *transfer learning*. Recycling someone else’s features. I don’t know why this has a special term. I think it’s so that you can claim to do “end-to-end” learning, but then actually do what everyone else has done forever and works totally OK, which is to re-use other people’s work like real scientists.

### Attention mechanism

What’s that now? Long story, but see transformer or Sparse Transformer for particularly developed examples and explanations of this sub-field.
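For orientation, the core operation in the transformer family is scaled dot-product attention: each query takes a softmax-weighted average of the values, weighted by how well the query matches each key. A minimal NumPy sketch follows; the function names and the random toy inputs are my own assumptions, and multi-head projections, masking and batching are all left out.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(queries, keys, values):
    """Return the attended values and the attention weights."""
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)  # query-key similarity
    weights = softmax(scores, axis=-1)        # each row sums to one
    return weights @ values, weights

rng = np.random.default_rng(0)
queries = rng.normal(size=(2, 8))  # 2 queries of dimension 8
keys = rng.normal(size=(5, 8))     # 5 keys of dimension 8
values = rng.normal(size=(5, 3))   # one 3-dimensional value per key
output, weights = scaled_dot_product_attention(queries, keys, values)
```

Here `output` has one row per query, and `weights` shows which keys each query attended to.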

### Spike-based

Most simulated neural networks are based on a continuous activation potential and discrete time, unlike spiking biological ones, which are driven by discrete events in continuous time. There are a great many other differences (to real biology). What difference does this in particular make? I suspect it means that time is handled differently.

### Kernel networks

Kernel trick + ANN = kernel ANNs.

Stay tuned for reframing more things as deep learning.

I think this is what *convex networks* are also?

Bengio, Le Roux, Vincent, Delalleau, and Marcotte, 2006.

### Extreme learning machines

Dunno. I think this is a flavour of random neural net?
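As far as I can tell from the Huang et al. papers in the refs, the recipe is: draw the hidden-layer weights at random, never train them, and fit only the linear output layer by least squares. A minimal sketch under that reading; the toy sine-fitting task, the layer width and the tanh nonlinearity are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem (an assumption for illustration): fit sin(x) on [-3, 3].
x = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(x).ravel()

n_hidden = 50
W = rng.normal(size=(1, n_hidden))  # random input weights, never trained
b = rng.normal(size=n_hidden)       # random biases, never trained

def hidden_features(x):
    """Fixed random nonlinear feature map."""
    return np.tanh(x @ W + b)

H = hidden_features(x)
# The only "training": solve a linear least-squares problem for the output weights.
beta, *_ = np.linalg.lstsq(H, y, rcond=None)
y_hat = H @ beta
mse = np.mean((y_hat - y) ** 2)
```

So it is a random-feature regression, with no backpropagation at all.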

### Autoencoding

TBD. Making a sparse encoding of something by demanding your network reproduces the input after passing the network activations through a narrow bottleneck. Many flavours.
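A minimal sketch of the bottleneck idea, assuming the simplest flavour: a purely linear autoencoder trained by plain gradient descent on synthetic data. Note this one is merely undercomplete (narrow), not sparse; the data, sizes and learning rate are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption): 5-dimensional observations that secretly
# live on a 2-dimensional subspace.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing

W_enc = rng.normal(scale=0.1, size=(5, 2))  # encoder: 5 inputs -> 2-unit bottleneck
W_dec = rng.normal(scale=0.1, size=(2, 5))  # decoder: bottleneck -> 5 outputs

def reconstruction_loss(X, W_enc, W_dec):
    """Mean squared error between the input and its reconstruction."""
    return np.mean((X @ W_enc @ W_dec - X) ** 2)

initial_loss = reconstruction_loss(X, W_enc, W_dec)
learning_rate = 0.01
for _ in range(2000):
    code = X @ W_enc                  # activations at the bottleneck
    error = code @ W_dec - X          # reconstruction error
    # Gradients of the squared error, up to a constant factor.
    grad_dec = code.T @ error / len(X)
    grad_enc = X.T @ (error @ W_dec.T) / len(X)
    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc
final_loss = reconstruction_loss(X, W_enc, W_dec)
```

With nonlinear layers and a sparsity penalty on `code` this turns into the fancier variants.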

## Optimisation methods

Backpropagation plus stochastic gradient descent rules at the moment.
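For concreteness, here is a minimal sketch of that combination on a one-hidden-layer tanh network, with the backward pass written out by hand. The toy target, layer width, learning rate and batch size are all assumptions chosen for illustration, not tuned recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(256, 2))
y = X[:, 0] * X[:, 1]  # toy regression target (an assumption)

params = {
    "W1": rng.normal(scale=0.5, size=(2, 16)),
    "b1": np.zeros(16),
    "W2": rng.normal(scale=0.5, size=16),
    "b2": 0.0,
}

def forward(params, X):
    hidden = np.tanh(X @ params["W1"] + params["b1"])
    return hidden, hidden @ params["W2"] + params["b2"]

def backward(params, X, y):
    """Backpropagation: chain rule from the mean squared error back to each parameter."""
    hidden, pred = forward(params, X)
    d_pred = 2.0 * (pred - y) / len(y)  # d loss / d prediction
    d_hidden = np.outer(d_pred, params["W2"]) * (1.0 - hidden ** 2)  # through tanh
    return {
        "W1": X.T @ d_hidden,
        "b1": d_hidden.sum(axis=0),
        "W2": hidden.T @ d_pred,
        "b2": d_pred.sum(),
    }

def mse(params, X, y):
    return np.mean((forward(params, X)[1] - y) ** 2)

initial_loss = mse(params, X, y)
learning_rate, batch_size = 0.1, 32
for epoch in range(200):
    order = rng.permutation(len(X))  # fresh shuffle each epoch: the "stochastic" part
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        grads = backward(params, X[idx], y[idx])
        for name in params:
            params[name] = params[name] - learning_rate * grads[name]
final_loss = mse(params, X, y)
```

In practice an automatic differentiation library writes `backward` for you, which is a large part of why this approach scales.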

Does anything else get performance at this scale? What other techniques can be extracted from variational inference or MC sampling, or particle filters, since there is no clear reason that shoving any of these in as intermediate layers in the network is any *less* well-posed than a classical backprop layer? Although it does require more nous from the enthusiastic grad student.

## Preventing overfitting

See regularising deep learning.

## Activations for neural networks

## practicalities

Various design niceties.

### Managing those dimensions

Practically a lot of the time managing deep learning is remembering which axis is which.

Alexander Rush argues you want a NamedTensor. Implementations:

- namedtensor pytorch
- labeledtensor tensorflow

Einsum does Einstein summation, which is also very helpful.
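A small illustration of why einsum helps with the which-axis-is-which problem, using `numpy.einsum`. The axis letters (`b` batch, `c` channel, `h` height, `w` width, `o` output channel) and the array shapes are just conventions I picked for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.normal(size=(4, 3, 8, 8))  # axes: batch, channel, height, width
channel_mix = rng.normal(size=(5, 3))   # axes: out_channel, in_channel

# Mix channels at every pixel; the subscript string documents each axis.
mixed = np.einsum("oc,bchw->bohw", channel_mix, images)

# The same computation with positional axis bookkeeping, for comparison.
same = np.tensordot(channel_mix, images, axes=([1], [1])).transpose(1, 0, 2, 3)

# Per-image, per-channel spatial mean: sum over h and w, then normalise.
channel_means = np.einsum("bchw->bc", images) / (8 * 8)
```

The einsum version states the intent in the subscript string; the `tensordot` version makes you track which axis ended up where.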

## Software stuff

For general purposes I use,

- Tensorflow, plus a side order of Keras or
- pytorch
- julia

I could use…

- Intel’s ngraph, which compiles neural nets, esp for CPUs
- Collaboratively build, visualize, and design neural nets in browser
- Python: Theano (now defunct) was the trailblazer for python
- Lua: Torch
- MATLAB/Python: Caffe claims to be a “de facto standard”
- Python/C++: Paddlepaddle is Baidu’s nonfancy NN machine
- Minimalist C++: tiny-dnn is a C++11 implementation of deep learning. It is suitable for deep learning on limited computational resource, embedded systems and IoT devices.
- NNPACK “is an acceleration package for neural network computations. NNPACK aims to provide high-performance implementations of convnet layers for multi-core CPUs.”

  NNPACK is not intended to be directly used by machine learning researchers; instead it provides low-level performance primitives to be leveraged by higher-level frameworks

USP: compiles to javascript amongst other things.

- javascript: see javascript machine learning
- iphone: DeepBeliefSDK
- julia: various

### pre-computed/trained models

Caffe format:

The Caffe Zoo has lots of nice pre-trained models, listed on their wiki.

Here’s a great CV one, Andrej Karpathy’s image captioner, Neuraltalk2

- for the NVC dataset: – pre-trained feature model here)
- Alexnet
- For lasagne: https://github.com/Lasagne/Recipes/tree/master/modelzoo
For Keras:

## Howtos

- diagramming convnets using matplotlib in python
- What’s wrong with deep learning? is a high speed diagrammatic introductory presentation with clickbait title, by one of the founding fathers, Yann LeCun
- Yarin Gal on uncertainty quantification
- Memkite’s Deep learning bibliography
- deeplearning.net’s reading list…
- and their tutorials are clear
- Michael Nielsen has a free online textbook with code examples in python
- Dürr’s tutorial
- The cat recogniser team lead, Quoc Le, lectures
- cute: srirajology’s energetic “demystifying” howtos

## Refs

- LiBE15: Zachary C. Lipton, John Berkowitz, Charles Elkan (2015) A Critical Review of Recurrent Neural Networks for Sequence Learning. *ArXiv:1506.00019 [Cs]*.
- HiOT06: G Hinton, S Osindero, Y Teh (2006) A Fast Learning Algorithm for Deep Belief Nets. *Neural Computation*, 18(7), 1527–1554. DOI
- MoRe12: Derek Monner, James A. Reggia (2012) A generalized LSTM-like training algorithm for second-order recurrent neural networks. *Neural Networks*, 25, 70–83. DOI
- WiBö15: Thomas Wiatowski, Helmut Bölcskei (2015) A Mathematical Theory of Deep Convolutional Neural Networks for Feature Extraction. In Proceedings of IEEE International Symposium on Information Theory.
- GaEB15: Leon A. Gatys, Alexander S. Ecker, Matthias Bethge (2015) A Neural Algorithm of Artistic Style. *ArXiv:1508.06576 [Cs, q-Bio]*.
- HeWH16: Kun He, Yan Wang, John Hopcroft (2016) A Powerful Generative Model Using Random Weights for the Deep Image Representation. In Advances in Neural Information Processing Systems.
- Hint10: Geoffrey Hinton (2010) A practical guide to training restricted Boltzmann machines. In Neural Networks: Tricks of the Trade (Vol. 9, p. 926). Springer Berlin Heidelberg
- GaGh16: Yarin Gal, Zoubin Ghahramani (2016) A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. In arXiv:1512.05287 [stat].
- LCHR06: Yann LeCun, Sumit Chopra, Raia Hadsell, M. Ranzato, F. Huang (2006) A tutorial on energy-based learning. *Predicting Structured Data*.
- MoDH12: A. r Mohamed, G. E. Dahl, G. Hinton (2012) Acoustic Modeling Using Deep Belief Networks. *IEEE Transactions on Audio, Speech, and Language Processing*, 20(1), 14–22. DOI
- Bose91: B. Boser (1991) An analog neural network processor with programmable topology. *J. Solid State Circuits*, 26, 2017–2025. DOI
- MeSc14: Pankaj Mehta, David J. Schwab (2014) An exact mapping between the Variational Renormalization Group and Deep Learning. *ArXiv:1410.3831 [Cond-Mat, Stat]*.
- Cybe89: G. Cybenko (1989) Approximation by superpositions of a sigmoidal function. *Mathematics of Control, Signals and Systems*, 2, 303–314. DOI
- Pink99: Allan Pinkus (1999) Approximation theory of the MLP model in neural networks. *Acta Numerica*, 8, 143–195. DOI
- LSLW15: Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, Ole Winther (2015) Autoencoding beyond pixels using a learned similarity metric. *ArXiv:1512.09300 [Cs, Stat]*.
- Bach14: Francis Bach (2014) Breaking the Curse of Dimensionality with Convex Neural Networks. *ArXiv:1412.8690 [Cs, Math, Stat]*.
- OKVE16: Aäron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, Koray Kavukcuoglu (2016) Conditional Image Generation with PixelCNN Decoders. *ArXiv:1606.05328 [Cs]*.
- Helm13: M. Helmstaedter (2013) Connectomic reconstruction of the inner plexiform layer in the mouse retina. *Nature*, 500, 168–174. DOI
- Dahl12: G. E. Dahl (2012) Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. *IEEE Transactions on Audio, Speech and Language Processing*, 20, 33–42. DOI
- BRVD05: Yoshua Bengio, Nicolas L. Roux, Pascal Vincent, Olivier Delalleau, Patrice Marcotte (2005) Convex neural networks. In Advances in neural information processing systems (pp. 123–130).
- LGRN09: Honglak Lee, Roger Grosse, Rajesh Ranganath, Andrew Y. Ng (2009) Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 609–616). New York, NY, USA: ACM DOI
- Garc04: C. Garcia (2004) Convolutional face finder: a neural architecture for fast and robust face detection. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 26, 1408–1423. DOI
- Tura10: S. C. Turaga (2010) Convolutional networks can learn to generate affinity graphs for image segmentation. *Neural Comput.*, 22, 511–538. DOI
- LiTe16a: Henry W. Lin, Max Tegmark (2016a) Critical Behavior from Deep Dynamics: A Hidden Dimension in Natural Language. *ArXiv:1606.06737 [Cond-Mat]*.
- JCOV16: Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, Koray Kavukcuoglu (2016) Decoupled Neural Interfaces using Synthetic Gradients. *ArXiv:1608.05343 [Cs]*.
- KWKT15: Tejas D. Kulkarni, Will Whitney, Pushmeet Kohli, Joshua B. Tenenbaum (2015) Deep Convolutional Inverse Graphics Network. *ArXiv:1503.03167 [Cs]*.
- LeBH15: Yann LeCun, Yoshua Bengio, Geoffrey Hinton (2015) Deep learning. *Nature*, 521(7553), 436–444. DOI
- YuDe11: D. Yu, L. Deng (2011) Deep Learning and Its Applications to Signal and Information Processing [Exploratory DSP]. *IEEE Signal Processing Magazine*, 28(1), 145–154. DOI
- Leun14: M. K. Leung (2014) Deep learning of the tissue-regulated splicing code. *Bioinformatics*, 30, i121–i129. DOI
- ZhCL15: Sixin Zhang, Anna Choromanska, Yann LeCun (2015) Deep learning with Elastic Averaging SGD. In Advances In Neural Information Processing Systems.
- ArRK10: I Arel, D C Rose, T P Karnowski (2010) Deep Machine Learning - A New Frontier in Artificial Intelligence Research [Research Frontier]. *IEEE Computational Intelligence Magazine*, 5(4), 13–18. DOI
- Ma15: J. Ma (2015) Deep neural nets as a method for quantitative structure-activity relationships. *J. Chem. Inf. Model.*, 55, 263–274. DOI
- HDYD12: G. Hinton, Li Deng, Dong Yu, G.E. Dahl, A. Mohamed, N. Jaitly, … B. Kingsbury (2012) Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. *IEEE Signal Processing Magazine*, 29(6), 82–97. DOI
- Cadi14: C. F. Cadieu (2014) Deep neural networks rival the representation of primate IT cortex for core visual object recognition. *PLoS Comp. Biol.*, 10, e1003963. DOI
- GiSB16: R. Giryes, G. Sapiro, A. M. Bronstein (2016) Deep Neural Networks with Random Gaussian Weights: A Universal Classification Strategy? *IEEE Transactions on Signal Processing*, 64(13), 3444–3457. DOI
- HaCL06: R. Hadsell, S. Chopra, Y. LeCun (2006) Dimensionality Reduction by Learning an Invariant Mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Vol. 2, pp. 1735–1742). DOI
- Nøkl16: Arild Nøkland (2016) Direct Feedback Alignment Provides Learning in Deep Neural Networks. In Advances In Neural Information Processing Systems.
- XiLS16: Bo Xie, Yingyu Liang, Le Song (2016) Diversity Leads to Generalization in Neural Networks. *ArXiv:1611.03131 [Cs, Stat]*.
- UGKA16: Gregor Urban, Krzysztof J. Geras, Samira Ebrahimi Kahou, Ozlem Aslan, Shengjie Wang, Rich Caruana, … Matt Richardson (2016) Do Deep Convolutional Nets Really Need to be Deep (Or Even Convolutional)? *ArXiv:1603.05691 [Cs, Stat]*.
- PaDG16: Wei Pan, Hao Dong, Yike Guo (2016) DropNeuron: Simplifying the Structure of Deep Neural Networks. *ArXiv:1606.07326 [Cs, Stat]*.
- LeBW96: Wee Sun Lee, Peter L. Bartlett, Robert C. Williamson (1996) Efficient agnostic learning of neural networks with bounded fan-in. *IEEE Transactions on Information Theory*, 42(6), 2118–2132. DOI
- MCCD13: Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean (2013) Efficient Estimation of Word Representations in Vector Space. *ArXiv:1301.3781 [Cs]*.
- OlFi96: Bruno A. Olshausen, David J. Field (1996) Emergence of simple-cell receptive field properties by learning a sparse code for natural images. *Nature*, 381(6583), 607–609. DOI
- DiSc14: Sander Dieleman, Benjamin Schrauwen (2014) End to end learning for music audio. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6964–6968). IEEE DOI
- WiGB17: Thomas Wiatowski, Philipp Grohs, Helmut Bölcskei (2017) Energy Propagation in Deep Convolutional Neural Networks. *IEEE Transactions on Information Theory*, 1–1. DOI
- GoSS14: Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy (2014) Explaining and Harnessing Adversarial Examples. *ArXiv:1412.6572 [Cs, Stat]*.
- MiLS13: Tomas Mikolov, Quoc V. Le, Ilya Sutskever (2013) Exploiting Similarities among Languages for Machine Translation. *ArXiv:1309.4168 [Cs]*.
- SGAL14: Levent Sagun, V. Ugur Guney, Gerard Ben Arous, Yann LeCun (2014) Explorations on high dimensional landscapes. *ArXiv:1412.6615 [Cs, Stat]*.
- SmTo17: Leslie N. Smith, Nicholay Topin (2017) Exploring loss function topology with cyclical learning rates. *ArXiv:1702.04283 [Cs]*.
- HuZS04: Guang-Bin Huang, Qin-Yu Zhu, Chee-Kheong Siew (2004) Extreme learning machine: a new learning scheme of feedforward neural networks. In 2004 IEEE International Joint Conference on Neural Networks, 2004. Proceedings (Vol. 2, pp. 985–990 vol.2). DOI
- HuZS06: Guang-Bin Huang, Qin-Yu Zhu, Chee-Kheong Siew (2006) Extreme learning machine: Theory and applications. *Neurocomputing*, 70(1–3), 489–501. DOI
- HuSi05: Guang-Bin Huang, Chee-Kheong Siew (2005) Extreme learning machine with randomly assigned RBF kernels. *International Journal of Information Technology*, 11(1), 16–24.
- HuWL11: Guang-Bin Huang, Dian Hui Wang, Yuan Lan (2011) Extreme learning machines: a survey. *International Journal of Machine Learning and Cybernetics*, 2(2), 107–122. DOI
- Lawr97: S. Lawrence (1997) Face recognition: a convolutional neural-network approach. *IEEE Transactions on Neural Networks*, 8, 98–113. DOI
- KaRL10: Koray Kavukcuoglu, Marc’Aurelio Ranzato, Yann LeCun (2010) Fast Inference in Sparse Coding Algorithms with Applications to Object Recognition. *ArXiv:1010.3467 [Cs]*.
- BLRW17: Andrew Brock, Theodore Lim, J. M. Ritchie, Nick Weston (2017) FreezeOut: Accelerate Training by Progressively Freezing Layers. *ArXiv:1706.04983 [Cs, Stat]*.
- GPMX14: Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, … Yoshua Bengio (2014) Generative Adversarial Networks. *ArXiv:1406.2661 [Cs, Stat]*.
- MaDA15: Dougal Maclaurin, David K. Duvenaud, Ryan P. Adams (2015) Gradient-based Hyperparameter Optimization through Reversible Learning. In ICML (pp. 2113–2122).
- Lecu98: Y. LeCun (1998) Gradient-based learning applied to document recognition. *Proceedings of the IEEE*, 86(11), 2278–2324. DOI
- Mall12: Stéphane Mallat (2012) Group Invariant Scattering. *Communications on Pure and Applied Mathematics*, 65(10), 1331–1398. DOI
- SCHU16: Simone Scardapane, Danilo Comminiello, Amir Hussain, Aurelio Uncini (2016) Group Sparse Regularization for Deep Neural Networks. *ArXiv:1607.00485 [Cs, Stat]*.
- Mnih15: V. Mnih (2015) Human-level control through deep reinforcement learning. *Nature*, 518, 529–533. DOI
- DPGC14: Yann Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, Yoshua Bengio (2014) Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems 27 (pp. 2933–2941). Curran Associates, Inc.
- KrSH12: Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097–1105).
- KSJC16: Diederik P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, Max Welling (2016) Improving Variational Inference with Inverse Autoregressive Flow. In Advances in Neural Information Processing Systems 29. Curran Associates, Inc.
- Beng09: Yoshua Bengio (2009) Learning deep architectures for AI. *Foundations and Trends® in Machine Learning*, 2(1), 1–127. DOI
- Fara13: C. Farabet (2013) Learning hierarchical features for scene labeling. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 35, 1915–1929. DOI
- GlLi16: Amir Globerson, Roi Livni (2016) Learning Infinite-Layer Networks: Beyond the Kernel Trick. *ArXiv:1606.05316 [Cs]*.
- RuHW86: David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams (1986) Learning representations by back-propagating errors. *Nature*, 323(6088), 533–536. DOI
- MoBa17: Ali Mousavi, Richard G. Baraniuk (2017) Learning to Invert: Signal Recovery via Deep Convolutional Networks. In ICASSP.
- ADGH16: Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, Nando de Freitas (2016) Learning to learn by gradient descent by gradient descent. *ArXiv:1606.04474 [Cs]*.
- FuAm00: K. Fukumizu, S. Amari (2000) Local minima and plateaus in hierarchical structures of multilayer perceptrons. *Neural Networks*, 13(3), 317–327. DOI
- Ranz13: M. Ranzato (2013) Modeling natural images using gated MRFs. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 35(9), 2206–2222. DOI
- Cire12: D. Ciresan (2012) Multi-column deep neural network for traffic sign classification. *Neural Networks*, 32, 333–338. DOI
- HoSW89: Kurt Hornik, Maxwell Stinchcombe, Halbert White (1989) Multilayer feedforward networks are universal approximators. *Neural Networks*, 2(5), 359–366. DOI
- Amar98: Shun-ichi Amari (1998) Natural Gradient Works Efficiently in Learning. *Neural Computation*, 10(2), 251–276. DOI
- FuMi82: Kunihiko Fukushima, Sei Miyake (1982) Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. *Pattern Recognition*, 15(6), 455–469. DOI
- ChGS15: Tianqi Chen, Ian Goodfellow, Jonathon Shlens (2015) Net2Net: Accelerating Learning via Knowledge Transfer. *ArXiv:1511.05641 [Cs]*.
- KaSu15: Łukasz Kaiser, Ilya Sutskever (2015) Neural GPUs Learn Algorithms. *ArXiv:1511.08228 [Cs]*.
- PaSa17: Emilio Parisotto, Ruslan Salakhutdinov (2017) Neural Map: Structured Memory for Deep Reinforcement Learning. *ArXiv:1702.08360 [Cs]*.
- CMBB14: Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, Yoshua Bengio (2014) On the properties of neural machine translation: Encoder-decoder approaches. *ArXiv Preprint ArXiv:1409.1259*.
- PDGB14: Razvan Pascanu, Yann N. Dauphin, Surya Ganguli, Yoshua Bengio (2014) On the saddle point problem for non-convex optimization. *ArXiv:1405.4604 [Cs]*.
- GiSB14: Raja Giryes, Guillermo Sapiro, Alex M. Bronstein (2014) On the Stability of Deep Networks. *ArXiv:1412.5896 [Cs, Math, Stat]*.
- Ciod12: T. Ciodaro (2012) Online particle detection with neural networks based on topological calorimetry information. *J. Phys. Conf. Series*, 368, 012030. DOI
- ShTi17: Ravid Shwartz-Ziv, Naftali Tishby (2017) Opening the Black Box of Deep Neural Networks via Information. *ArXiv:1703.00810 [Cs]*.
- SMMD17: Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean (2017) Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. *ArXiv:1701.06538 [Cs, Stat]*.
- OoKK16: Aäron van den Oord, Nal Kalchbrenner, Koray Kavukcuoglu (2016) Pixel Recurrent Neural Networks. *ArXiv:1601.06759 [Cs]*.
- GoVS14: Ian J. Goodfellow, Oriol Vinyals, Andrew M. Saxe (2014) Qualitatively characterizing neural network optimization problems. *ArXiv:1412.6544 [Cs, Stat]*.
- Hube62: D. H. Hubel (1962) Receptive fields, binocular interaction, and functional architecture in the cat’s visual cortex. *J. Physiol.*, 160, 106–154. DOI
- HiSa06: Geoffrey E. Hinton, Ruslan R. Salakhutdinov (2006) Reducing the dimensionality of data with neural networks. *Science*, 313(5786), 504–507. DOI
- Telg15: Matus Telgarsky (2015) Representation Benefits of Deep Feedforward Networks. *ArXiv:1509.08101 [Cs]*.
- BeCV13: Yoshua Bengio, Aaron Courville, Pascal Vincent (2013) Representation Learning: A Review and New Perspectives. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 35, 1798–1828. DOI
- BeLe07: Yoshua Bengio, Yann LeCun (2007) Scaling learning algorithms towards AI. *Large-Scale Kernel Machines*, 34, 1–41.
- AGMM15: Sanjeev Arora, Rong Ge, Tengyu Ma, Ankur Moitra (2015) Simple, efficient, and neural algorithms for sparse coding. In Proceedings of The 28th Conference on Learning Theory (Vol. 40, pp. 113–149). Paris, France: PMLR
- OlFi04: Bruno A Olshausen, David J Field (2004) Sparse coding of sensory inputs. *Current Opinion in Neurobiology*, 14(4), 481–487. DOI
- RaBC08: Marc’Aurelio Ranzato, Y-Lan Boureau, Yann LeCun (2008) Sparse Feature Learning for Deep Belief Networks. In Advances in Neural Information Processing Systems 20 (pp. 1185–1192). Curran Associates, Inc.
- SDBR14: Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, Martin Riedmiller (2014) Striving for Simplicity: The All Convolutional Net. In Proceedings of International Conference on Learning Representations (ICLR) 2015.
- Lipt16a: Zachary C. Lipton (2016a) Stuck in a What? Adventures in Weight Space. *ArXiv:1602.07320 [Cs]*.
- MAPE15: Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, … Xiaoqiang Zheng (2015) *TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems*
- StGa15: Greg Ver Steeg, Aram Galstyan (2015) The Information Sieve. *ArXiv:1507.02284 [Cs, Math, Stat]*.
- CHMB15: Anna Choromanska, Mikael Henaff, Michael Mathieu, Gerard Ben Arous, Yann LeCun (2015) The Loss Surfaces of Multilayer Networks. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (pp. 192–204).
- Lipt16b: Zachary C. Lipton (2016b) The Mythos of Model Interpretability. In arXiv:1606.03490 [cs, stat].
- Hint95: G. E. Hinton (1995) The wake-sleep algorithm for unsupervised neural networks. *Science*, 268(5214), 1158–1161. DOI
- Hint07: Geoffrey E. Hinton (2007) To recognize shapes, first learn to generate images. In Progress in Brain Research (Vol. 165, pp. 535–547). Elsevier
- Ning05: F. Ning (2005) Toward automatic phenotyping of developing embryos from videos. *IEEE Transactions on Image Processing*, 14, 1360–1371. DOI
- BaPS16: Atılım Güneş Baydin, Barak A. Pearlmutter, Jeffrey Mark Siskind (2016) Tricks from Deep Learning. *ArXiv:1611.03777 [Cs, Stat]*.
- Mall16: Stéphane Mallat (2016) Understanding Deep Convolutional Networks. *ArXiv:1601.04920 [Cs, Stat]*.
- ZBHR17: Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals (2017) Understanding deep learning requires rethinking generalization. In Proceedings of ICLR.
- Barr93: A.R. Barron (1993) Universal approximation bounds for superpositions of a sigmoidal function. *IEEE Transactions on Information Theory*, 39(3), 930–945. DOI
- BBCI16: Carlo Baldassi, Christian Borgs, Jennifer T. Chayes, Alessandro Ingrosso, Carlo Lucibello, Luca Saglietti, Riccardo Zecchina (2016) Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes. *Proceedings of the National Academy of Sciences*, 113(48), E7655–E7662. DOI
- RaMC15: Alec Radford, Luke Metz, Soumith Chintala (2015) Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. In arXiv:1511.06434 [cs].
- Oord16: Aäron van den Oord (2016) Wavenet: A Generative Model for Raw Audio
- SaKi16: Tim Salimans, Diederik P Kingma (2016) Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. In Advances in Neural Information Processing Systems 29 (pp. 901–901). Curran Associates, Inc.
- Mont14: G. Montufar (2014) When does a mixture of products contain a product of mixtures? *J. Discrete Math.*, 29, 321–347. DOI
- LiTe16b: Henry W. Lin, Max Tegmark (2016b) Why does deep and cheap learning work so well? *ArXiv:1608.08225 [Cond-Mat, Stat]*.
- PaVe14: Arnab Paul, Suresh Venkatasubramanian (2014) Why does Deep Learning work? - A perspective from Group Theory. *ArXiv:1412.6621 [Cs, Stat]*.
- EBCM10: Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, Samy Bengio (2010) Why does unsupervised pre-training help deep learning? *Journal of Machine Learning Research*, 11(Feb), 625–660.