Feedback networks structured to have memory and a notion of “current” and “past” states, which can encode time (or whatever). Many wheels are re-invented with these, but the essential model is that we have a heavily nonlinear state filter inferred by gradient descent.
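
To make the "nonlinear state filter" reading concrete, here is a minimal sketch of a plain (Elman-style) recurrent update in pure Python. The weight values and the names `rnn_step` and `rnn_filter` are illustrative assumptions, not taken from any of the papers below; in practice the weights would be learned by gradient descent rather than written down by hand.

```python
import math

def rnn_step(h, x, W, U, b):
    """One state update: h_new = tanh(W h + U x + b)."""
    return [
        math.tanh(
            sum(W[i][j] * h[j] for j in range(len(h)))
            + sum(U[i][k] * x[k] for k in range(len(x)))
            + b[i]
        )
        for i in range(len(b))
    ]

def rnn_filter(xs, h0, W, U, b):
    """Run the state filter over a whole input sequence, keeping every state."""
    h, states = h0, []
    for x in xs:
        h = rnn_step(h, x, W, U, b)
        states.append(h)
    return states

# Hypothetical 2-unit network driven by a scalar input (an impulse).
W = [[0.5, -0.3], [0.2, 0.1]]   # state-to-state ("feedback") weights
U = [[1.0], [-1.0]]             # input-to-state weights
b = [0.0, 0.0]
states = rnn_filter([[1.0], [0.0], [0.0]], [0.0, 0.0], W, U, b)
```

Replacing `math.tanh` with the identity function turns this into an ordinary linear state-space filter, which is the sense in which these models generalise linear systems.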

The connection between these and convolutional neural networks is suggestive for the same reason.

Many different flavours and topologies. GridRNN – KaDG15 – seems natural for note-based stuff, and for systems, like the cochlea, and indeed the entire human visual apparatus, with dependencies in both space and time.

## Intro

As someone who does a lot of signal processing for music, I find the notion that these generalise linear systems theory suggestive of interesting DSP applications, e.g. generative music.

Awesome RNN is a curated links list of implementations.

Andrej Karpathy: The unreasonable effectiveness of RNN

Christopher Olah: Understanding LSTM RNNs

Jeff Donahue: Long term recurrent NN

## To Learn

Inverse Autoregressive Flow (KSJC16).

## Flavours

### Linear

As seen in normal signal processing. The main problem here is that they are unstable in the training phase in many of the wild and weird NN SGD phases, unless you are clever. See BeSF94. The next three types are proposed solutions for that.
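
A sketch of why the linear case is touchy, assuming the simplest possible model: a scalar recurrence `h[t] = w * h[t-1] + x[t]`. The sensitivity of the final state to the initial state is then `w ** T`, which is the quantity that gradients get multiplied by when backpropagating through `T` steps. The function names here are made up for illustration.

```python
def linear_recurrence(xs, w, h0=0.0):
    """Scalar linear state filter: h[t] = w * h[t-1] + x[t]."""
    h = h0
    for x in xs:
        h = w * h + x
    return h

def sensitivity_to_initial_state(w, steps):
    """d h[T] / d h[0] for the scalar recurrence above, i.e. w ** T."""
    return w ** steps

# The sensitivity collapses, stays put, or blows up depending on |w|.
for w in (0.9, 1.0, 1.1):
    print(w, sensitivity_to_initial_state(w, 100))
```

In the matrix case the same role is played by repeated products of the transition matrix, so its largest singular values decide whether gradients vanish or explode.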

### Long Short Term Memory (LSTM)

On the border with deep automata.

As always, Christopher Olah wins the visual explanation prize: Understanding LSTM Networks. Also neat: LSTM Networks for Sentiment Analysis, and Alex Graves’ Generating Sequences With Recurrent Neural Networks, which generates handwriting.

> In a traditional recurrent neural network, during the gradient back-propagation phase, the gradient signal can end up being multiplied a large number of times (as many as the number of timesteps) by the weight matrix associated with the connections between the neurons of the recurrent hidden layer. This means that, the magnitude of weights in the transition matrix can have a strong impact on the learning process.[…]
>
> These issues are the main motivation behind the LSTM model which introduces a new structure called a memory cell.[…] A memory cell is composed of four main elements: an input gate, a neuron with a self-recurrent connection (a connection to itself), a forget gate and an output gate. […]The gates serve to modulate the interactions between the memory cell itself and its environment.
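
The four elements named above can be sketched as a scalar memory cell. This is a minimal illustration under stated assumptions: one hidden unit, no peephole connections, and a hypothetical parameter layout `p[name] = (w_x, w_h, bias)`; real implementations are vectorised and differ in details.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h, c, p):
    """One step of a scalar LSTM cell; returns the new (h, c)."""
    def pre(name):
        w_x, w_h, bias = p[name]
        return w_x * x + w_h * h + bias

    i = sigmoid(pre("i"))    # input gate: how much new content to write
    f = sigmoid(pre("f"))    # forget gate: how much old cell state to keep
    o = sigmoid(pre("o"))    # output gate: how much of the cell to expose
    g = math.tanh(pre("g"))  # candidate content
    c_new = f * c + i * g    # the self-recurrent connection
    h_new = o * math.tanh(c_new)
    return h_new, c_new

# Hypothetical parameters: forget gate held open, input gate held shut,
# so the cell should simply remember its initial content.
params = {"i": (0.0, 0.0, -20.0), "f": (0.0, 0.0, 20.0),
          "o": (0.0, 0.0, 0.0), "g": (1.0, 0.0, 0.0)}
h, c = 0.0, 0.7
for x in [1.0, -2.0, 3.0] * 10:
    h, c = lstm_step(x, h, c, params)
```

With the forget gate near 1 the cell state is carried forward almost unchanged, so the gradient along that path is not repeatedly squashed by a weight matrix.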

### Gated Recurrent Unit (GRU)

Simpler than the LSTM. But how exactly does it work? Read CGCB15 and CMGB14.
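
As a rough answer to "how does it work", here is a scalar sketch of one GRU update. It assumes the convention in which the update gate `z` weights the candidate state (the papers differ on whether `z` or `1 - z` does this), and the parameter layout `p[name] = (w_x, w_h, bias)` is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x, h, p):
    """One step of a scalar GRU; returns the new hidden state."""
    def affine(name, h_in):
        w_x, w_h, bias = p[name]
        return w_x * x + w_h * h_in + bias

    z = sigmoid(affine("z", h))              # update gate
    r = sigmoid(affine("r", h))              # reset gate
    h_tilde = math.tanh(affine("h", r * h))  # candidate state, sees a reset copy of h
    return (1.0 - z) * h + z * h_tilde       # interpolate old state and candidate

# Hypothetical parameters with the update gate held shut: the state is copied forward.
keep = {"z": (0.0, 0.0, -20.0), "r": (0.0, 0.0, 0.0), "h": (1.0, 1.0, 0.0)}
h_kept = gru_step(5.0, 0.3, keep)
```

Compared with the LSTM there is no separate cell state and no output gate; a single gate decides between keeping and overwriting.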

### Unitary

Charming connection with my other research into acoustics: what I would call “Gerzon allpass” filters are now hip for use in neural networks because of their favourable normalisation characteristics.

Here is an implementation that uses unnecessary complex numbers.
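
The normalisation property in question is just norm preservation. A minimal sketch, assuming a real 2×2 rotation as the stand-in for a unitary transition matrix (the helper names are made up for this example): applying it a thousand times leaves the state norm where it started, while a slightly contractive or expansive matrix would shrink or inflate it geometrically.

```python
import math

def rotation(theta):
    """A real orthogonal (hence unitary) 2x2 matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def iterate(A, v, steps):
    """Apply the same transition matrix repeatedly, as a linear RNN would."""
    for _ in range(steps):
        v = matvec(A, v)
    return v

Q = rotation(0.3)
v_unitary = iterate(Q, [3.0, 4.0], 1000)                      # norm stays at 5
shrunk = [[0.99 * q for q in row] for row in Q]
v_shrunk = iterate(shrunk, [3.0, 4.0], 1000)                  # norm decays as 0.99 ** t
```

The same holds for the transposed map used in backpropagation, which is why gradients through such a layer neither vanish nor explode.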

### Probabilistic

i.e. Kalman filters, but rebranded in the fine neural networks tradition of taking something uncontroversial from another field and putting the word “neural” in front. Practically these are usually variational, but there are some random sampling based ones.

See Sequential Neural Models with Stochastic Layers. (FSPW16)
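
For orientation, this is the uncontroversial thing being rebranded: a scalar linear-Gaussian Kalman filter step. It is a sketch of the classical special case only, with made-up default parameters (`a`, `q`, `c`, `r`); the neural versions replace the linear maps with networks and the exact update with approximate (often variational) inference.

```python
def kalman_step(mean, var, y, a=1.0, q=0.1, c=1.0, r=0.5):
    """One predict/update cycle for x[t] = a*x[t-1] + noise(q), y[t] = c*x[t] + noise(r)."""
    # Predict: push the state belief through the dynamics.
    mean_pred = a * mean
    var_pred = a * a * var + q
    # Update: correct the prediction using the observation y.
    gain = var_pred * c / (c * c * var_pred + r)
    mean_new = mean_pred + gain * (y - c * mean_pred)
    var_new = (1.0 - gain * c) * var_pred
    return mean_new, var_new

# Filter a short, hypothetical observation sequence.
mean, var = 0.0, 1.0
history = []
for y in [2.0, 1.8, 2.2, 2.1]:
    mean, var = kalman_step(mean, var, y)
    history.append((mean, var))
```

The recurrent state here is the pair `(mean, var)`, a distribution over the hidden state rather than a point estimate.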

### GridRNN

A mini-genre. Kalchbrenner et al (KaDG15) connect recurrent cells across multiple axes, leading to a higher-rank MIMO system. This is natural in many kinds of spatial random fields, and I am amazed it was uncommon enough to need formalizing in a paper; but it was and it did and good on Kalchbrenner et al.

### Phased

Long story, bro.

Keras implementation by Francesco Ferroni. Tensorflow implementation by Enea Ceolini.

Lasagne implementation by Danny Neil.

### Other

It’s still the wild west. Invent a category, name it and stake a claim. There’s publications in them thar hills.

## Practicalities

### General

Truncated backpropagation through time (TBPTT), state filters, filter stability.

### Tensorflow

See tensorflow.

### Pytorch

## Refs

- BeAt16: Souhaib Ben Taieb, Amir F. Atiya (2016) A Bias and Variance Analysis for Multistep-Ahead Time Series Forecasting. *IEEE Transactions on Neural Networks and Learning Systems*, 27(1), 62–76.
- KGGS14: Jan Koutník, Klaus Greff, Faustino Gomez, Jürgen Schmidhuber (2014) A Clockwork RNN. *ArXiv:1402.3511 [Cs]*.
- LiBE15: Zachary C. Lipton, John Berkowitz, Charles Elkan (2015) A Critical Review of Recurrent Neural Networks for Sequence Learning. *ArXiv:1506.00019 [Cs]*.
- MoRe12: Derek Monner, James A. Reggia (2012) A generalized LSTM-like training algorithm for second-order recurrent neural networks. *Neural Networks*, 25, 70–83.
- RERH18: Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, Douglas Eck (2018) A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music. *ArXiv:1803.05428 [Cs, Eess, Stat]*.
- WiZi89: Ronald J. Williams, David Zipser (1989) A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. *Neural Computation*, 1(2), 270–280.
- AuBM08: Peter Auer, Harald Burgsteiner, Wolfgang Maass (2008) A learning rule for very simple universal approximators consisting of a single layer of perceptrons. *Neural Networks*, 21(5), 786–795.
- WeTN17: Ruofeng Wen, Kari Torkkola, Balakrishnan Narayanaswamy (2017) A Multi-Horizon Quantile Recurrent Forecaster. *ArXiv:1711.11053 [Stat]*.
- HeWH16: Kun He, Yan Wang, John Hopcroft (2016) A Powerful Generative Model Using Random Weights for the Deep Image Representation. In Advances in Neural Information Processing Systems.
- CKDG15: Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, Yoshua Bengio (2015) A Recurrent Latent Variable Model for Sequential Data. In Advances in Neural Information Processing Systems 28 (pp. 2980–2988). Curran Associates, Inc.
- LaBr16: Thomas Laurent, James von Brecht (2016) A recurrent neural network without chaos. *ArXiv:1612.06212 [Cs]*.
- GaGh16: Yarin Gal, Zoubin Ghahramani (2016) A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. In arXiv:1512.05287 [stat].
- NCRG16: Markus Nussbaum-Thom, Jia Cui, Bhuvana Ramabhadran, Vaibhava Goel (2016) Acoustic Modeling Using Bidirectional Gated Recurrent Convolutional Units. (pp. 390–394).
- MoDH12: A. Mohamed, G. E. Dahl, G. Hinton (2012) Acoustic Modeling Using Deep Belief Networks. *IEEE Transactions on Audio, Speech, and Language Processing*, 20(1), 14–22.
- WiPe90: Ronald J. Williams, Jing Peng (1990) An Efficient Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories. *Neural Computation*, 2(4), 490–501.
- JoZS15: Rafal Jozefowicz, Wojciech Zaremba, Ilya Sutskever (2015) An empirical exploration of recurrent network architectures. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15) (pp. 2342–2350).
- Werb90: P. J. Werbos (1990) Backpropagation through time: what it does and how to do it. *Proceedings of the IEEE*, 78(10), 1550–1560.
- Stei04: J. J. Steil (2004) Backpropagation-decorrelation: online recurrent learning with O(N) complexity. In 2004 IEEE International Joint Conference on Neural Networks, 2004. Proceedings (Vol. 2, pp. 843–848 vol.2).
- FoBV17: Meire Fortunato, Charles Blundell, Oriol Vinyals (2017) Bayesian Recurrent Neural Networks. *ArXiv:1704.02798 [Cs, Stat]*.
- RGMP18: Thomas Ryder, Andrew Golightly, A. Stephen McGough, Dennis Prangle (2018) Black-box Variational Inference for Stochastic Differential Equations. *ArXiv:1802.03335 [Stat]*.
- CoSS16: Jasmine Collins, Jascha Sohl-Dickstein, David Sussillo (2016) Capacity and Trainability in Recurrent Neural Networks. In arXiv:1611.09913 [cs, stat].
- MaNM04: W. Maass, T. Natschläger, H. Markram (2004) Computational Models for Generic Cortical Microcircuits. In Computational Neuroscience: A Comprehensive Approach (pp. 575–605). Chapman & Hall/CRC
- BoLe06: Oliver Bown, Sebastian Lexer (2006) Continuous-Time Recurrent Neural Networks for Generative and Interactive Musical Performance. In Applications of Evolutionary Computing (pp. 652–663). Springer Berlin Heidelberg
- DoPo15: Keith B. Doelling, David Poeppel (2015) Cortical entrainment to music and its modulation by expertise. *Proceedings of the National Academy of Sciences*, 112(45), E6233–E6242.
- KrSS15: Rahul G. Krishnan, Uri Shalit, David Sontag (2015) Deep kalman filters. *ArXiv Preprint ArXiv:1511.05121*.
- Mart10: James Martens (2010) Deep Learning via Hessian-free Optimization. In Proceedings of the 27th International Conference on International Conference on Machine Learning (pp. 735–742). USA: Omnipress
- RaSP16: Siamak Ravanbakhsh, Jeff Schneider, Barnabas Poczos (2016) Deep Learning with Sets and Point Clouds. In arXiv:1611.04500 [cs, stat].
- ASKR12: Ebru Arisoy, Tara N. Sainath, Brian Kingsbury, Bhuvana Ramabhadran (2012) Deep Neural Network Language Models. In Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT (pp. 20–28). Stroudsburg, PA, USA: Association for Computational Linguistics
- HDYD12: G. Hinton, Li Deng, Dong Yu, G.E. Dahl, A. Mohamed, N. Jaitly, … B. Kingsbury (2012) Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. *IEEE Signal Processing Magazine*, 29(6), 82–97.
- YTCB15: Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron Courville (2015) Describing Videos by Exploiting Temporal Structure. *ArXiv:1502.08029 [Cs, Stat]*.
- Chev07: Guillaume Chevillon (2007) Direct Multi-Step Estimation and Forecasting. *Journal of Economic Surveys*, 21(4), 746–785.
- ChYR16: Adam Charles, Dong Yin, Christopher Rozell (2016) Distributed Sequence Memory of Multidimensional Inputs in Recurrent Networks. *ArXiv:1605.08346 [Cs, Math, Stat]*.
- GDGR15: Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, Daan Wierstra (2015) DRAW: A Recurrent Neural Network For Image Generation. *ArXiv:1502.04623 [Cs]*.
- MHRB17: Zakaria Mhammedi, Andrew Hellicar, Ashfaqur Rahman, James Bailey (2017) Efficient Orthogonal Parametrisation of Recurrent Neural Networks Using Householder Reflections. In PMLR (pp. 2401–2409).
- CGCB14: Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, Yoshua Bengio (2014) Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. In NIPS.
- MLTH17: Chris J. Maddison, Dieterich Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, … Yee Whye Teh (2017) Filtering Variational Objectives. *ArXiv Preprint ArXiv:1705.09279*.
- Elma90: Jeffrey L Elman (1990) Finding structure in time. *Cognitive Science*, 14, 179–211.
- WPHL16: Scott Wisdom, Thomas Powers, John Hershey, Jonathan Le Roux, Les Atlas (2016) Full-capacity unitary recurrent neural networks. In Advances in Neural Information Processing Systems (pp. 4880–4888).
- CGCB15: Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, Yoshua Bengio (2015) Gated Feedback Recurrent Neural Networks. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 (pp. 2067–2075). JMLR.org
- Werb88: Paul J. Werbos (1988) Generalization of backpropagation with application to a recurrent gas market model. *Neural Networks*, 1(4), 339–356.
- ThBe15: Lucas Theis, Matthias Bethge (2015) Generative Image Modeling Using Spatial LSTMs. *ArXiv:1506.03478 [Cs, Stat]*.
- HBFS01: Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Jürgen Schmidhuber (2001) Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies. In A field guide to dynamical recurrent neural networks. IEEE Press
- Lecu98: Y. LeCun (1998) Gradient-based learning applied to document recognition. *Proceedings of the IEEE*, 86(11), 2278–2324.
- ChAB16: Junyoung Chung, Sungjin Ahn, Yoshua Bengio (2016) Hierarchical Multiscale Recurrent Neural Networks. *ArXiv:1609.01704 [Cs]*.
- Husz15: Ferenc Huszár (2015) How (not) to Train your Generative Model: Scheduled Sampling, Likelihood, Adversary? *ArXiv:1511.05101 [Cs, Math, Stat]*.
- Mnih15: V. Mnih (2015) Human-level control through deep reinforcement learning. *Nature*, 518, 529–533.
- KSJC16: Diederik P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, Max Welling (2016) Improving Variational Inference with Inverse Autoregressive Flow. In Advances in Neural Information Processing Systems 29. Curran Associates, Inc.
- WPPA16: Scott Wisdom, Thomas Powers, James Pitton, Les Atlas (2016) Interpretable Recurrent Neural Networks Using Sequential Sparse Recovery. In Advances in Neural Information Processing Systems 29.
- HaSZ17: Elad Hazan, Karan Singh, Cyril Zhang (2017) Learning Linear Dynamical Systems via Spectral Filtering. In NIPS.
- BeSF94: Y. Bengio, P. Simard, P. Frasconi (1994) Learning long-term dependencies with gradient descent is difficult. *IEEE Transactions on Neural Networks*, 5(2), 157–166.
- CMGB14: Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio (2014) Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In EMNLP 2014.
- GeSS02: Felix A. Gers, Nicol N. Schraudolph, Jürgen Schmidhuber (2002) Learning precise timing with LSTM recurrent networks. *Journal of Machine Learning Research*, 3(Aug), 115–143.
- MaSu11: James Martens, Ilya Sutskever (2011) Learning Recurrent Neural Networks with Hessian-free Optimization. In Proceedings of the 28th International Conference on International Conference on Machine Learning (pp. 1033–1040). USA: Omnipress
- RuHW86: David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams (1986) Learning representations by back-propagating errors. *Nature*, 323(6088), 533–536.
- GeSC00: Felix A. Gers, Jürgen Schmidhuber, Fred Cummins (2000) Learning to Forget: Continual Prediction with LSTM. *Neural Computation*, 12(10), 2451–2471.
- HoSc97a: Sepp Hochreiter, Jürgen Schmidhuber (1997a) Long Short-Term Memory. *Neural Computation*, 9(8), 1735–1780.
- HoSc97b: Sepp Hochreiter, Jürgen Schmidhuber (1997b) LSTM can solve hard long time lag problems. In Advances in Neural Information Processing Systems: Proceedings of the 1996 Conference (pp. 473–479).
- GMDL16: Audrunas Gruslys, Remi Munos, Ivo Danihelka, Marc Lanctot, Alex Graves (2016) Memory-Efficient Backpropagation Through Time. In Advances in Neural Information Processing Systems 29 (pp. 4125–4133). Curran Associates, Inc.
- TaHR06: Graham W. Taylor, Geoffrey E. Hinton, Sam T. Roweis (2006) Modeling human motion using binary latent variables. In Advances in neural information processing systems (pp. 1345–1352).
- BoBV12: Nicolas Boulanger-Lewandowski, Yoshua Bengio, Pascal Vincent (2012) Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription. In 29th International Conference on Machine Learning.
- SZLB95: Jonas Sjöberg, Qinghua Zhang, Lennart Ljung, Albert Benveniste, Bernard Delyon, Pierre-Yves Glorennec, … Anatoli Juditsky (1995) Nonlinear black-box modeling in system identification: a unified overview. *Automatica*, 31(12), 1691–1724.
- WZZB16: Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, Ruslan R Salakhutdinov (2016) On Multiplicative Integration with Recurrent Neural Networks. In Advances in Neural Information Processing Systems 29 (pp. 2856–2864). Curran Associates, Inc.
- PaMB13: Razvan Pascanu, Tomas Mikolov, Yoshua Bengio (2013) On the difficulty of training Recurrent Neural Networks. In arXiv:1211.5063 [cs] (pp. 1310–1318).
- CMBB14: Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, Yoshua Bengio (2014) On the properties of neural machine translation: Encoder-decoder approaches. *ArXiv Preprint ArXiv:1409.1259*.
- SuPf16: Simone Carlo Surace, Jean-Pascal Pfister (2016) Online Maximum Likelihood Estimation of the Parameters of Partially Observed Diffusion Processes.
- NWSS16: Behnam Neyshabur, Yuhuai Wu, Ruslan R Salakhutdinov, Nati Srebro (2016) Path-Normalized Optimization of Recurrent Neural Networks with ReLU Activations. In Advances in Neural Information Processing Systems 29 (pp. 3477–3485). Curran Associates, Inc.
- NePL16: Daniel Neil, Michael Pfeiffer, Shih-Chii Liu (2016) Phased LSTM: Accelerating Recurrent Network Training for Long or Event-based Sequences. In Advances in Neural Information Processing Systems 29 (pp. 3882–3890). Curran Associates, Inc.
- Grav11: Alex Graves (2011) Practical Variational Inference for Neural Networks. In Proceedings of the 24th International Conference on Neural Information Processing Systems (pp. 2348–2356). USA: Curran Associates Inc.
- LGZZ16: Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron Courville, Yoshua Bengio (2016) Professor Forcing: A New Algorithm for Training Recurrent Networks. In Advances In Neural Information Processing Systems.
- CBLG16: Tim Cooijmans, Nicolas Ballas, César Laurent, Çağlar Gülçehre, Aaron Courville (2016) Recurrent batch normalization. *ArXiv Preprint ArXiv:1603.09025*.
- MKBČ10: Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černockỳ, Sanjeev Khudanpur (2010) Recurrent Neural Network Based Language Model. In Eleventh Annual Conference of the International Speech Communication Association.
- DaYO16: Sakyasingha Dasgupta, Takayuki Yoshizumi, Takayuki Osogami (2016) Regularized Dynamic Boltzmann Machine with Delay Pruning for Unsupervised Learning of Temporal Sequences. *ArXiv:1610.01989 [Cs, Stat]*.
- VKCM15: Francesco Visin, Kyle Kastner, Kyunghyun Cho, Matteo Matteucci, Aaron Courville, Yoshua Bengio (2015) ReNet: A Recurrent Neural Network Based Alternative to Convolutional Networks. *ArXiv:1505.00393 [Cs]*.
- LuJa09: Mantas Lukoševičius, Herbert Jaeger (2009) Reservoir computing approaches to recurrent neural network training. *Computer Science Review*, 3(3), 127–149.
- BVJS15: Samy Bengio, Oriol Vinyals, Navdeep Jaitly, Noam Shazeer (2015) Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems 28 (pp. 1171–1179). Cambridge, MA, USA: Curran Associates, Inc.
- FSPW16: Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, Ole Winther (2016) Sequential Neural Models with Stochastic Layers. In Advances in Neural Information Processing Systems 29 (pp. 2199–2207). Curran Associates, Inc.
- PaHC15: Viorica Patraucean, Ankur Handa, Roberto Cipolla (2015) Spatio-temporal video autoencoder with differentiable memory. *ArXiv:1511.06309 [Cs]*.
- Grav12: Alex Graves (2012) *Supervised sequence labelling with recurrent neural networks*. Heidelberg; New York: Springer
- AnBe17: Alexander G. Anderson, Cory P. Berg (2017) The High-Dimensional Geometry of Binary Neural Networks. *ArXiv:1705.07199 [Cs]*.
- Pill16: Gianluigi Pillonetto (2016) The interplay between system identification and machine learning. *ArXiv:1612.09158 [Cs, Stat]*.
- RoRS15: Anna Rohrbach, Marcus Rohrbach, Bernt Schiele (2015) The Long-Short Story of Movie Description. *ArXiv:1506.01698 [Cs]*.
- BFLL17: David Balduzzi, Marcus Frean, Lennox Leary, J. P. Lewis, Kurt Wan-Duo Ma, Brian McWilliams (2017) The Shattered Gradients Problem: If resnets are the answer, then what is the question? In PMLR (pp. 342–350).
- OlPS17: Junier B. Oliva, Barnabas Poczos, Jeff Schneider (2017) The Statistical Recurrent Unit. *ArXiv:1703.00381 [Cs, Stat]*.
- Hoch98: Sepp Hochreiter (1998) The vanishing gradient problem during learning recurrent neural nets and problem solutions. *International Journal of Uncertainty Fuzziness and Knowledge Based Systems*, 6, 107–115.
- HaMa12: Hananel Hazan, Larry M. Manevitz (2012) Topological constraints and robustness in liquid state machines. *Expert Systems with Applications*, 39(2), 1597–1606.
- MaSu12: James Martens, Ilya Sutskever (2012) Training deep and recurrent networks with hessian-free optimization. In Neural networks: Tricks of the trade (pp. 479–535). Springer
- JSDP17: Li Jing, Yichen Shen, Tena Dubcek, John Peurifoy, Scott Skirlo, Yann LeCun, … Marin Soljačić (2017) Tunable Efficient Unitary Neural Networks (EUNN) and their application to RNNs. In PMLR (pp. 1733–1741).
- Jaeg02: Herbert Jaeger (2002) *Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach* (Vol. 5). GMD-Forschungszentrum Informationstechnik
- TaOl17: Corentin Tallec, Yann Ollivier (2017) Unbiasing Truncated Backpropagation Through Time. *ArXiv:1705.08209 [Cs]*.
- ArSB16: Martin Arjovsky, Amar Shah, Yoshua Bengio (2016) Unitary Evolution Recurrent Neural Networks. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48 (pp. 1120–1128). New York, NY, USA: JMLR.org
- KaJF15: Andrej Karpathy, Justin Johnson, Li Fei-Fei (2015) Visualizing and Understanding Recurrent Networks. *ArXiv:1506.02078 [Cs]*.
- LeNM05: Robert Legenstein, Christian Naeger, Wolfgang Maass (2005) What Can a Neuron Learn with Spike-Timing-Dependent Plasticity? *Neural Computation*, 17(11), 2337–2382.
- BuMe05: Catalin V. Buhusi, Warren H. Meck (2005) What makes us tick? Functional and neural mechanisms of interval timing. *Nature Reviews Neuroscience*, 6(10), 755–765.
- MiHa18: John Miller, Moritz Hardt (2018) When Recurrent Models Don’t Need To Be Recurrent. *ArXiv:1805.10369 [Cs, Stat]*.
- GCWK09: B. J. Grzyb, E. Chinellato, G. M. Wojcik, W. A. Kaminski (2009) Which model to use for the Liquid State Machine? In 2009 International Joint Conference on Neural Networks (pp. 1018–1024).