
Recurrent neural networks

Feedback networks structured to have memory and a notion of “current” and “past” states, which can encode time (or whatever). Many wheels are re-invented with these, but the essential model is that we have a heavily nonlinear state filter inferred by gradient descent.
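For concreteness, here is that essential model as a bare numpy sketch; the names, shapes and the tanh nonlinearity are my own arbitrary choices, not anyone’s canonical architecture.

```python
import numpy as np

# A nonlinear state filter: h_t = tanh(W h_{t-1} + U x_t), read out by y_t = V h_t,
# with W, U, V fitted by gradient descent rather than classical system identification.
def rnn_filter(xs, W, U, V):
    h = np.zeros(W.shape[0])
    ys = []
    for x in xs:
        h = np.tanh(W @ h + U @ x)   # state update
        ys.append(V @ h)             # readout
    return np.array(ys)
```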

The connection between these and convolutional neural networks is suggestive for the same reason: both can be read as learned nonlinear filters over a signal.

Many different flavours and topologies. GridRNN (KaDG15) seems natural for note-based stuff, and for systems with dependencies in both space and time, such as the cochlea, and indeed the entire human visual apparatus.



As someone who does a lot of signal processing for music, I find the notion that these generalise linear systems theory suggestive of interesting DSP applications, e.g. generative music.

To Learn

Inverse Autoregressive Flow (KSJC16).



Vanilla RNNs

As seen in normal signal processing. The main problem here is that they are unstable in the training phase: under the wild and weird SGD schemes of the NN world, gradients propagated across many time steps tend to vanish or explode unless you are clever. See BeSF94. The next three flavours are proposed solutions for that; a toy demonstration of the instability follows.
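A minimal numpy sketch of the BeSF94 problem, under my own toy setup: back-propagating through T steps of a linear recurrence multiplies the gradient by the transposed transition matrix T times, so its norm scales roughly like the largest singular value of that matrix raised to the power T.

```python
import numpy as np

# Back-propagation through T steps of h_t = W h_{t-1} multiplies the gradient by W^T
# (transpose) T times, so its norm scales like (largest singular value of W)**T.
rng = np.random.default_rng(0)
T, d = 100, 32

for scale in (0.9, 1.0, 1.1):  # contractive, marginal, expansive recurrences
    W = scale * np.linalg.qr(rng.standard_normal((d, d)))[0]  # orthogonal matrix times a scale
    grad = rng.standard_normal(d)
    for _ in range(T):
        grad = W.T @ grad  # one step of back-propagation through the linear part
    print(f"scale={scale}: |grad| after {T} steps = {np.linalg.norm(grad):.3e}")
```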

Long Short Term Memory (LSTM)

On the border with deep automata.

As always, Christopher Olah wins the visual explanation prize: Understanding LSTM Networks. Also neat: LSTM Networks for Sentiment Analysis, and Alex Graves’ Generating Sequences With Recurrent Neural Networks, which generates handwriting.

In a traditional recurrent neural network, during the gradient back-propagation phase, the gradient signal can end up being multiplied a large number of times (as many as the number of timesteps) by the weight matrix associated with the connections between the neurons of the recurrent hidden layer. This means that the magnitude of weights in the transition matrix can have a strong impact on the learning process. […]

These issues are the main motivation behind the LSTM model which introduces a new structure called a memory cell […]. A memory cell is composed of four main elements: an input gate, a neuron with a self-recurrent connection (a connection to itself), a forget gate and an output gate. […] The gates serve to modulate the interactions between the memory cell itself and its environment.
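To make that anatomy concrete, here is one LSTM step in bare numpy, roughly the standard Hochreiter-Schmidhuber/Gers formulation with input, forget and output gates plus a candidate cell update; the parameter names and shapes are mine, and real implementations fuse the four matrix products.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step. params holds (W, U, b) for the input (i), forget (f),
    output (o) gates and the cell candidate (g).
    Shapes: W_* is (hidden, input), U_* is (hidden, hidden), b_* is (hidden,)."""
    i = sigmoid(params["W_i"] @ x + params["U_i"] @ h_prev + params["b_i"])  # input gate
    f = sigmoid(params["W_f"] @ x + params["U_f"] @ h_prev + params["b_f"])  # forget gate
    o = sigmoid(params["W_o"] @ x + params["U_o"] @ h_prev + params["b_o"])  # output gate
    g = np.tanh(params["W_g"] @ x + params["U_g"] @ h_prev + params["b_g"])  # candidate
    c = f * c_prev + i * g          # memory cell: gated self-recurrence
    h = o * np.tanh(c)              # exposed hidden state
    return h, c
```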

Gated Recurrent Unit (GRU)

Simpler than the LSTM. But how exactly does it work? Read CGCB15 and CMGB14.
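For side-by-side comparison with the LSTM sketch above, here is one GRU step as I understand CMGB14; again the names are mine, and note that papers disagree about which side of the final interpolation the update gate multiplies.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    """One GRU step: update gate z, reset gate r, no separate memory cell.
    Shapes as in the LSTM sketch: W_* (hidden, input), U_* (hidden, hidden), b_* (hidden,)."""
    z = sigmoid(params["W_z"] @ x + params["U_z"] @ h_prev + params["b_z"])  # update gate
    r = sigmoid(params["W_r"] @ x + params["U_r"] @ h_prev + params["b_r"])  # reset gate
    h_tilde = np.tanh(params["W_h"] @ x + params["U_h"] @ (r * h_prev) + params["b_h"])
    # Interpolate between the old state and the candidate; some papers swap z and (1 - z).
    return (1.0 - z) * h_prev + z * h_tilde
```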


Unitary RNNs

A charming connection with my other research into acoustics: what I would call “Gerzon allpass” filters are now hip for use in neural networks because of their favourable normalisation characteristics.

Here is an implementation that uses unnecessary complex numbers.
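The selling point, as I understand ArSB15 and friends, is that a transition matrix with all singular values equal to one (unitary, or orthogonal if you stay real) can neither amplify nor kill the back-propagated gradient through the linear part. A cheap complex-free way to build such matrices is to compose Householder reflections, roughly in the spirit of Mhammedi et al.; the function below is my own toy sketch, not anyone’s published code.

```python
import numpy as np

def householder_orthogonal(vs):
    """Compose Householder reflections H_v = I - 2 v v^T / (v^T v) into a single
    orthogonal matrix. Every singular value of the result is exactly 1, so repeated
    multiplication by it preserves gradient norm."""
    d = vs.shape[1]
    W = np.eye(d)
    for v in vs:
        W = W - 2.0 * np.outer(v, v @ W) / (v @ v)  # left-multiply by one reflection
    return W

rng = np.random.default_rng(1)
W = householder_orthogonal(rng.standard_normal((4, 16)))  # 4 reflections in R^16
print(np.allclose(W @ W.T, np.eye(16)))                   # True: W is orthogonal
```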


Neural Kalman filters

I.e. Kalman filters, but rebranded in the fine neural-networks tradition of taking something uncontroversial from another field and putting the word “neural” in front. In practice these are usually variational, but there are some random-sampling-based ones.

See Sequential Neural Models with Stochastic Layers (FSPW16).
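A minimal numpy sketch of the general idea, not FSPW16’s actual model: a state-space model whose transition and emission maps are small networks instead of the Kalman filter’s linear-Gaussian maps. The inference machinery, which is the actual hard part, is omitted, and all names and sizes are mine.

```python
import numpy as np

# A Kalman-filter-shaped generative model whose transition and emission maps are
# tiny networks. Placeholder names and sizes; inference is omitted.
rng = np.random.default_rng(2)
d_z, d_x, d_h = 4, 2, 16  # latent state, observation, hidden-layer widths

def mlp(W1, W2, v):
    return W2 @ np.tanh(W1 @ v)

params = {name: 0.3 * rng.standard_normal(shape) for name, shape in {
    "W1_trans": (d_h, d_z), "W2_trans": (d_z, d_h),   # transition network
    "W1_emit": (d_h, d_z), "W2_emit": (d_x, d_h),     # emission network
}.items()}

z = np.zeros(d_z)
for t in range(100):
    z = mlp(params["W1_trans"], params["W2_trans"], z) + 0.1 * rng.standard_normal(d_z)  # z_t | z_{t-1}
    x = mlp(params["W1_emit"], params["W2_emit"], z) + 0.1 * rng.standard_normal(d_x)    # x_t | z_t
```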


Grid RNNs

A mini-genre. Kalchbrenner et al. (KaDG15) connect recurrent cells across multiple axes, leading to a higher-rank MIMO system; this is natural for many kinds of spatial random fields, and I am amazed it was uncommon enough to need formalising in a paper, but it was, and it did, and good on Kalchbrenner et al. A toy sketch of the idea follows.
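The toy sketch: a 2-D recurrence in which the state at each grid point depends on its neighbours along both axes, so information propagates through, say, time and frequency at once. This is the bare idea only, not KaDG15’s gated architecture; names and shapes are mine.

```python
import numpy as np

def grid_recurrence(X, Wx, Wleft, Wup):
    """Toy 2-D recurrence: the state at grid point (i, j) depends on the input there
    plus the states immediately to its left and above, so information flows along
    both axes (e.g. time and frequency)."""
    I, J, _ = X.shape
    d = Wx.shape[0]
    H = np.zeros((I + 1, J + 1, d))  # zero-padded boundary states
    for i in range(I):
        for j in range(J):
            H[i + 1, j + 1] = np.tanh(Wx @ X[i, j] + Wleft @ H[i + 1, j] + Wup @ H[i, j + 1])
    return H[1:, 1:]

rng = np.random.default_rng(3)
H = grid_recurrence(rng.standard_normal((8, 8, 3)),      # an 8x8 grid of 3-dim inputs
                    0.5 * rng.standard_normal((16, 3)),
                    0.5 * rng.standard_normal((16, 16)),
                    0.5 * rng.standard_normal((16, 16)))
```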


Phased LSTMs

Long story, bro.

Keras implementation by Francesco Ferroni. TensorFlow implementation by Enea Ceolini.

Lasagne implementation by Danny Neil.


Other flavours

It’s still the wild west. Invent a category, name it and stake a claim. There’s publications in them thar hills.



Practicalities

Truncated back-propagation through time (TBPTT), state filters, filter stability. A sketch of the first of these follows.
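The TBPTT sketch, assuming pytorch: carry the hidden state across fixed-length chunks but detach() it at each chunk boundary so the autodiff graph never grows longer than the truncation window. The toy model, data and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

# Truncated BPTT: split a long sequence into chunks of length k, carry the hidden
# state forward between chunks, but detach() it so gradients never flow further
# back than one chunk.
k, batch, dim_in, dim_h = 32, 8, 5, 64
rnn = nn.GRU(dim_in, dim_h, batch_first=True)
readout = nn.Linear(dim_h, 1)
opt = torch.optim.Adam(list(rnn.parameters()) + list(readout.parameters()), lr=1e-3)

x = torch.randn(batch, 10 * k, dim_in)       # one long toy sequence
y = torch.randn(batch, 10 * k, 1)            # toy regression targets
h = None
for t in range(0, x.size(1), k):
    opt.zero_grad()
    out, h = rnn(x[:, t:t + k], h)
    loss = nn.functional.mse_loss(readout(out), y[:, t:t + k])
    loss.backward()
    opt.step()
    h = h.detach()                           # truncate the graph here
```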



Rolling your own RNN cell in pytorch

You have to do this manually if you are researching RNNs, since the current pytorch RNN implementation is not extensible: the API design favours cuDNN integration over user extensibility, to the point that you cannot even apply your own activation function if you use their implementation. However, it does not look like it will be too onerous. Here are some design patterns for that.
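A minimal sketch of the obvious workaround, assuming a recent-ish pytorch: write the cell as an ordinary nn.Module and unroll it over the time axis yourself. Slower than the cuDNN kernels, but you can swap in whatever activation or gate arithmetic you like; the class names and the clamp activation below are mine.

```python
import torch
import torch.nn as nn

class MyRNNCell(nn.Module):
    """A hand-rolled Elman-style cell with a swappable activation;
    slower than cuDNN but fully hackable."""
    def __init__(self, dim_in, dim_h, activation=torch.tanh):
        super().__init__()
        self.in2h = nn.Linear(dim_in, dim_h)
        self.h2h = nn.Linear(dim_h, dim_h, bias=False)
        self.activation = activation

    def forward(self, x, h):
        return self.activation(self.in2h(x) + self.h2h(h))

class MyRNN(nn.Module):
    """Unrolls the cell over the time axis of a (batch, time, dim_in) tensor."""
    def __init__(self, cell, dim_h):
        super().__init__()
        self.cell, self.dim_h = cell, dim_h

    def forward(self, x, h=None):
        if h is None:
            h = x.new_zeros(x.size(0), self.dim_h)
        outputs = []
        for t in range(x.size(1)):
            h = self.cell(x[:, t], h)
            outputs.append(h)
        return torch.stack(outputs, dim=1), h

# e.g. an RNN with a clamped activation the stock implementation won't give you:
rnn = MyRNN(MyRNNCell(5, 64, activation=lambda z: torch.clamp(z, -1.0, 1.0)), 64)
```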


Arisoy, E., Sainath, T. N., Kingsbury, B., & Ramabhadran, B. (2012) Deep Neural Network Language Models. In Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT (pp. 20–28). Stroudsburg, PA, USA: Association for Computational Linguistics
Arjovsky, M., Shah, A., & Bengio, Y. (2015) Unitary Evolution Recurrent Neural Networks. ArXiv:1511.06464 [Cs, Stat].
Auer, P., Burgsteiner, H., & Maass, W. (2008) A learning rule for very simple universal approximators consisting of a single layer of perceptrons. Neural Networks, 21(5), 786–795. DOI.
Balduzzi, D., Frean, M., Leary, L., Lewis, J. P., Ma, K. W.-D., & McWilliams, B. (2017) The Shattered Gradients Problem: If resnets are the answer, then what is the question?. ArXiv Preprint ArXiv:1702.08591.
Bengio, Y., Simard, P., & Frasconi, P. (1994) Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166. DOI.
Boulanger-Lewandowski, N., Bengio, Y., & Vincent, P. (2012) Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription. In 29th International Conference on Machine Learning.
Bown, O., & Lexer, S. (2006) Continuous-Time Recurrent Neural Networks for Generative and Interactive Musical Performance. In F. Rothlauf, J. Branke, S. Cagnoni, E. Costa, C. Cotta, R. Drechsler, … H. Takagi (Eds.), Applications of Evolutionary Computing (pp. 652–663). Springer Berlin Heidelberg
Buhusi, C. V., & Meck, W. H.(2005) What makes us tick? Functional and neural mechanisms of interval timing. Nature Reviews Neuroscience, 6(10), 755–765. DOI.
Charles, A., Yin, D., & Rozell, C. (2016) Distributed Sequence Memory of Multidimensional Inputs in Recurrent Networks. ArXiv:1605.08346 [Cs, Math, Stat].
Cho, K., van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014) On the properties of neural machine translation: Encoder-decoder approaches. ArXiv Preprint ArXiv:1409.1259.
Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014) Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. ArXiv:1406.1078 [Cs, Stat].
Chung, J., Ahn, S., & Bengio, Y. (2016) Hierarchical Multiscale Recurrent Neural Networks. ArXiv:1609.01704 [Cs].
Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014) Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. In NIPS.
Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2015) Gated Feedback Recurrent Neural Networks. ArXiv:1502.02367 [Cs, Stat].
Collins, J., Sohl-Dickstein, J., & Sussillo, D. (2016) Capacity and Trainability in Recurrent Neural Networks. In arXiv:1611.09913 [cs, stat].
Dasgupta, S., Yoshizumi, T., & Osogami, T. (2016) Regularized Dynamic Boltzmann Machine with Delay Pruning for Unsupervised Learning of Temporal Sequences. ArXiv:1610.01989 [Cs, Stat].
Doelling, K. B., & Poeppel, D. (2015) Cortical entrainment to music and its modulation by expertise. Proceedings of the National Academy of Sciences, 112(45), E6233–E6242. DOI.
Elman, J. L.(1990) Finding structure in time. Cognitive Science, 14, 179–211. DOI.
Fraccaro, M., Sønderby, S. K., Paquet, U., & Winther, O. (2016) Sequential Neural Models with Stochastic Layers. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 2199–2207). Curran Associates, Inc.
Gal, Y. (2015) A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. ArXiv:1512.05287 [Stat].
Gel, Y. R., Lyubchich, V., & Ramirez, L. L.(2016) Fast Patchwork Bootstrap for Quantifying Estimation Uncertainties in Sparse Random Networks.
Gers, F. A., Schmidhuber, J., & Cummins, F. (2000) Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10), 2451–2471. DOI.
Gers, F. A., Schraudolph, N. N., & Schmidhuber, J. (2002) Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research, 3(Aug), 115–143.
Graves, A. (2012) Supervised sequence labelling with recurrent neural networks. Heidelberg; New York: Springer
Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., & Wierstra, D. (2015) DRAW: A Recurrent Neural Network For Image Generation. ArXiv:1502.04623 [Cs].
Grzyb, B. J., Chinellato, E., Wojcik, G. M., & Kaminski, W. A.(2009) Which model to use for the Liquid State Machine?. In 2009 International Joint Conference on Neural Networks (pp. 1018–1024). DOI.
Hazan, H., & Manevitz, L. M.(2012) Topological constraints and robustness in liquid state machines. Expert Systems with Applications, 39(2), 1597–1606. DOI.
He, K., Wang, Y., & Hopcroft, J. (2016) A Powerful Generative Model Using Random Weights for the Deep Image Representation. ArXiv:1606.04801 [Cs].
Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., … Kingsbury, B. (2012) Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine, 29(6), 82–97. DOI.
Hochreiter, S., & Schmidhuber, J. (1997a) Long Short-Term Memory. Neural Computation, 9(8), 1735–1780. DOI.
Hochreiter, S., & Schmidhuber, J. (1997b) LSTM can solve hard long time lag problems. In Advances in Neural Information Processing Systems: Proceedings of the 1996 Conference (pp. 473–479).
Jing, L., Shen, Y., Dubček, T., Peurifoy, J., Skirlo, S., Tegmark, M., & Soljačić, M. (2016) Tunable Efficient Unitary Neural Networks (EUNN) and their application to RNN. ArXiv:1612.05231 [Cs, Stat].
Jozefowicz, R., Zaremba, W., & Sutskever, I. (2015) An empirical exploration of recurrent network architectures. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15) (pp. 2342–2350).
Kalchbrenner, N., Danihelka, I., & Graves, A. (2015) Grid Long Short-Term Memory. ArXiv:1507.01526 [Cs].
Karpathy, A., Johnson, J., & Fei-Fei, L. (2015) Visualizing and Understanding Recurrent Networks. ArXiv:1506.02078 [Cs].
Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Chen, X., Sutskever, I., & Welling, M. (2016) Improving Variational Autoencoders with Inverse Autoregressive Flow. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 4736–4744). Curran Associates, Inc.
LeCun, Y. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324. DOI.
Legenstein, R., Naeger, C., & Maass, W. (2005) What Can a Neuron Learn with Spike-Timing-Dependent Plasticity?. Neural Computation, 17(11), 2337–2382. DOI.
Lipton, Z. C., Berkowitz, J., & Elkan, C. (2015) A Critical Review of Recurrent Neural Networks for Sequence Learning. ArXiv:1506.00019 [Cs].
Lukoševičius, M., & Jaeger, H. (2009) Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3), 127–149. DOI.
Maass, W., Natschläger, T., & Markram, H. (2004) Computational Models for Generic Cortical Microcircuits. In Computational Neuroscience: A Comprehensive Approach (pp. 575–605). Chapman & Hall/CRC
Mhammedi, Z., Hellicar, A., Rahman, A., & Bailey, J. (2016) Efficient Orthogonal Parametrisation of Recurrent Neural Networks Using Householder Reflections. ArXiv:1612.00188 [Cs].
Mikolov, T., Karafiát, M., Burget, L., Černockỳ, J., & Khudanpur, S. (2010) Recurrent Neural Network Based Language Model. In Eleventh Annual Conference of the International Speech Communication Association.
Mnih, V. (2015) Human-level control through deep reinforcement learning. Nature, 518, 529–533. DOI.
Mohamed, A.-r., Dahl, G. E., & Hinton, G. (2012) Acoustic Modeling Using Deep Belief Networks. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 14–22. DOI.
Monner, D., & Reggia, J. A.(2012) A generalized LSTM-like training algorithm for second-order recurrent neural networks. Neural Networks, 25, 70–83. DOI.
Neil, D., Pfeiffer, M., & Liu, S.-C. (2016) Phased LSTM: Accelerating Recurrent Network Training for Long or Event-based Sequences. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 3882–3890). Curran Associates, Inc.
Neyshabur, B., Wu, Y., Salakhutdinov, R. R., & Srebro, N. (2016) Path-Normalized Optimization of Recurrent Neural Networks with ReLU Activations. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 3477–3485). Curran Associates, Inc.
Nussbaum-Thom, M., Cui, J., Ramabhadran, B., & Goel, V. (2016) Acoustic Modeling Using Bidirectional Gated Recurrent Convolutional Units. (pp. 390–394). DOI.
Owens, A., Isola, P., McDermott, J., Torralba, A., Adelson, E. H., & Freeman, W. T.(2015) Visually Indicated Sounds. ArXiv:1512.08512 [Cs].
Pascanu, R., Mikolov, T., & Bengio, Y. (2013) On the difficulty of training Recurrent Neural Networks. In arXiv:1211.5063 [cs] (pp. 1310–1318).
Patraucean, V., Handa, A., & Cipolla, R. (2015) Spatio-temporal video autoencoder with differentiable memory. ArXiv:1511.06309 [Cs].
Pillonetto, G. (2016) The interplay between system identification and machine learning. ArXiv:1612.09158 [Cs, Stat].
Ravanbakhsh, S., Schneider, J., & Poczos, B. (2016) Deep Learning with Sets and Point Clouds. ArXiv:1611.04500 [Cs, Stat].
Rohrbach, A., Rohrbach, M., & Schiele, B. (2015) The Long-Short Story of Movie Description. ArXiv:1506.01698 [Cs].
Steil, J. J.(2004) Backpropagation-decorrelation: online recurrent learning with O(N) complexity. In 2004 IEEE International Joint Conference on Neural Networks, 2004. Proceedings (Vol. 2, pp. 843–848 vol.2). DOI.
Surace, S. C., & Pfister, J.-P. (2016) Online Maximum Likelihood Estimation of the Parameters of Partially Observed Diffusion Processes.
Taylor, G. W., Hinton, G. E., & Roweis, S. T.(2006) Modeling human motion using binary latent variables. In Advances in neural information processing systems (pp. 1345–1352).
Theis, L., & Bethge, M. (2015) Generative Image Modeling Using Spatial LSTMs. ArXiv:1506.03478 [Cs, Stat].
Visin, F., Kastner, K., Cho, K., Matteucci, M., Courville, A., & Bengio, Y. (2015) ReNet: A Recurrent Neural Network Based Alternative to Convolutional Networks. ArXiv:1505.00393 [Cs].
Waibel, A. (1989) Phoneme recognition using time-delay neural networks. IEEE Trans. Acoustics Speech Signal Process., 37(3), 328–339. DOI.
Wisdom, S., Powers, T., Hershey, J., Le Roux, J., & Atlas, L. (2016) Full-capacity unitary recurrent neural networks. In Advances in Neural Information Processing Systems (pp. 4880–4888).
Wisdom, S., Powers, T., Pitton, J., & Atlas, L. (2016) Interpretable Recurrent Neural Networks Using Sequential Sparse Recovery. In Advances in Neural Information Processing Systems 29.
Wu, Y., Zhang, S., Zhang, Y., Bengio, Y., & Salakhutdinov, R. R.(2016) On Multiplicative Integration with Recurrent Neural Networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 2856–2864). Curran Associates, Inc.
Yao, L., Torabi, A., Cho, K., Ballas, N., Pal, C., Larochelle, H., & Courville, A. (2015) Describing Videos by Exploiting Temporal Structure. ArXiv:1502.08029 [Cs, Stat].