Feedback networks structured to have memory and a notion of “current” and “past” states, which can encode time (or whatever). Many wheels are reinvented with these, but the essential model is that we have a heavily nonlinear state filter inferred by gradient descent.
The connection between these and convolutional neural networks is suggestive for the same reason.
Many different flavours and topologies. GridRNN (KaDG15) seems natural for note-based stuff, and for systems with dependencies in both space and time, such as the cochlea, and indeed the entire human visual apparatus.
Intro
As someone who does a lot of signal processing for music, the notion that these generalise linear systems theory is suggestive of interesting DSP applications, e.g. generative music.
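The "nonlinear state filter" framing fits in a few lines of numpy. A toy sketch (dimensions arbitrary, weights random and untrained — this is the shape of the model, not a working one):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative sizes.
n_hidden, n_input = 4, 2
W = rng.normal(scale=0.5, size=(n_hidden, n_hidden))  # recurrent weights
U = rng.normal(scale=0.5, size=(n_hidden, n_input))   # input weights
b = np.zeros(n_hidden)

def step(h, x):
    """One update of the nonlinear state filter: h' = tanh(W h + U x + b)."""
    return np.tanh(W @ h + U @ x + b)

h = np.zeros(n_hidden)                      # the "past" state
for x in rng.normal(size=(10, n_input)):    # a short input sequence
    h = step(h, x)                          # the "current" state encodes the history
```

Gradient descent then infers `W`, `U` and `b` by backpropagating a loss through that recurrence.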

Awesome RNN is a curated list of links to implementations.

Andrej Karpathy: The unreasonable effectiveness of RNN

Christopher Olah: Understanding LSTM RNNs

Jeff Donahue: Long-term recurrent NN
To Learn
Inverse Autoregressive Flow (KSJC16).
Flavours
Linear
As seen in normal signal processing. The main problem here is that they are unstable in the training phase under many of the wild and weird NN SGD regimes, unless you are clever. See BeSF94. The next three types are proposed solutions to that.
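The instability is easy to demonstrate: the backpropagated gradient is multiplied by the transition matrix once per timestep, so powers of that matrix govern everything, and the spectral radius decides between vanishing and exploding. A toy numpy illustration (sizes and spectral radii arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

def spectral_radius(W):
    return np.abs(np.linalg.eigvals(W)).max()

W = rng.normal(size=(8, 8))
W_shrink = W * (0.5 / spectral_radius(W))   # rescaled to spectral radius 0.5
W_grow = W * (1.5 / spectral_radius(W))     # rescaled to spectral radius 1.5

# The gradient through t steps picks up a factor of W^t, whose norm
# vanishes or explodes geometrically with the spectral radius.
t = 50
vanish = np.linalg.norm(np.linalg.matrix_power(W_shrink, t))
explode = np.linalg.norm(np.linalg.matrix_power(W_grow, t))
```

With radius 0.5 the factor is astronomically small after 50 steps; with radius 1.5 it is astronomically large. The gated architectures below are workarounds for exactly this.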
Long Short Term Memory (LSTM)
On the border with deep automata.
As always, Christopher Olah wins the visual explanation prize: Understanding LSTM Networks. Also neat: LSTM Networks for Sentiment Analysis. Alex Graves' Generating Sequences With Recurrent Neural Networks generates handwriting.
In a traditional recurrent neural network, during the gradient backpropagation phase, the gradient signal can end up being multiplied a large number of times (as many as the number of timesteps) by the weight matrix associated with the connections between the neurons of the recurrent hidden layer. This means that the magnitude of weights in the transition matrix can have a strong impact on the learning process. […]
These issues are the main motivation behind the LSTM model, which introduces a new structure called a memory cell. […] A memory cell is composed of four main elements: an input gate, a neuron with a self-recurrent connection (a connection to itself), a forget gate and an output gate. […] The gates serve to modulate the interactions between the memory cell itself and its environment.
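The gating described above is only a few lines of code. A toy numpy sketch of a single LSTM step (biases omitted for brevity, weights random and untrained):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, params):
    """One LSTM step: input, forget and output gates modulate the memory cell c."""
    Wi, Wf, Wo, Wg = params      # each maps concat(h, x) -> n_hidden
    z = np.concatenate([h, x])
    i = sigmoid(Wi @ z)          # input gate: admit new information
    f = sigmoid(Wf @ z)          # forget gate: decay the old cell state
    o = sigmoid(Wo @ z)          # output gate: expose the cell
    g = np.tanh(Wg @ z)          # candidate cell update
    c_new = f * c + i * g        # additive cell update (the key to gradient flow)
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(2)
n_hidden, n_input = 3, 2
params = [rng.normal(scale=0.3, size=(n_hidden, n_hidden + n_input))
          for _ in range(4)]
h = c = np.zeros(n_hidden)
for x in rng.normal(size=(5, n_input)):
    h, c = lstm_step(x, h, c, params)
```

The additive update of `c` is what lets gradients flow across many timesteps without being repeatedly squashed through the transition matrix.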
Gated Recurrent Unit (GRU)
Simpler than the LSTM. But how exactly does it work? Read CGCB15 and CMGB14.
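For concreteness, a toy numpy sketch of one GRU step, roughly following CMGB14 (biases omitted, weights random and untrained). Two gates and no separate cell state, versus the LSTM's three gates plus cell:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h, params):
    """One GRU step: an update gate z and a reset gate r replace the LSTM's
    three gates and separate cell state."""
    Wz, Wr, Wh = params
    hx = np.concatenate([h, x])
    z = sigmoid(Wz @ hx)                                # update gate
    r = sigmoid(Wr @ hx)                                # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h, x]))  # candidate state
    return (1 - z) * h + z * h_tilde                    # convex blend of old and new

rng = np.random.default_rng(3)
n_hidden, n_input = 3, 2
params = [rng.normal(scale=0.3, size=(n_hidden, n_hidden + n_input))
          for _ in range(3)]
h = np.zeros(n_hidden)
for x in rng.normal(size=(5, n_input)):
    h = gru_step(x, h, params)
```

The convex blend in the last line plays the same gradient-preserving role as the LSTM's additive cell update.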
Unitary
Charming connection with my other research into acoustics: what I would call “Gerzon allpass” filters are now hip for use in neural networks because of their favourable normalisation characteristics.
Here is an implementation that uses unnecessary complex numbers.
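The point of the unitary constraint is norm preservation: all eigenvalues sit on the unit circle, so repeated application neither amplifies nor attenuates the state. A real orthogonal matrix (here the Q factor of a Gaussian matrix) demonstrates this without any complex numbers:

```python
import numpy as np

rng = np.random.default_rng(4)

# Orthogonal transition matrix: Q.T @ Q = I, eigenvalues on the unit circle.
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))

h = rng.normal(size=8)
norm0 = np.linalg.norm(h)
for _ in range(1000):
    h = Q @ h
# After 1000 steps the state norm is unchanged: no vanishing, no exploding.
```

The research problem is then parametrising such matrices so that gradient descent keeps them (approximately) orthogonal.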
Probabilistic
i.e. Kalman filters, but rebranded in the fine neural-networks tradition of taking something uncontroversial from another field and putting the word “neural” in front. Practically these are usually variational, but there are some random-sampling-based ones.
See Sequential Neural Models with Stochastic Layers. (FSPW16)
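For reference, the uncontroversial original: a scalar Kalman filter, which is exactly a linear-Gaussian state filter with a closed-form predict/correct recursion (model parameters here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)

# Scalar linear-Gaussian state-space model:
#   x_t = a * x_{t-1} + process noise,   y_t = x_t + observation noise.
a, q, r = 0.9, 0.1, 0.5   # transition, process variance, observation variance

# Simulate a trajectory.
T = 200
x = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    x[t] = a * x[t - 1] + rng.normal(scale=np.sqrt(q))
    y[t] = x[t] + rng.normal(scale=np.sqrt(r))

# Kalman filter: recursively track mean m and variance p of x_t | y_{1:t}.
m, p = 0.0, 1.0
means = np.zeros(T)
for t in range(T):
    m_pred, p_pred = a * m, a * a * p + q     # predict
    k = p_pred / (p_pred + r)                 # Kalman gain
    m = m_pred + k * (y[t] - m_pred)          # correct
    p = (1 - k) * p_pred
    means[t] = m

# The filtered mean tracks the latent state better than the raw observations.
err_filter = np.mean((means - x) ** 2)
err_obs = np.mean((y - x) ** 2)
```

The "neural" versions replace the linear predict/correct maps with learned nonlinear ones and the exact posterior with a variational approximation.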
GridRNN
A mini-genre. Kalchbrenner et al (KaDG15) connect recurrent cells across multiple axes, leading to a higher-rank MIMO system. This is natural in many kinds of spatial random fields, and I am amazed it was uncommon enough to need formalizing in a paper; but it was, and it did, and good on them.
Phased
Long story, bro.
Keras implementation by Francesco Ferroni. Tensorflow implementation by Enea Ceolini.
Lasagne implementation by Danny Neil.
Other
It's still the wild west. Invent a category, name it, and stake a claim. There are publications in them thar hills.
Practicalities
General
Truncated backpropagation through time (TBPTT), state filters, filter stability.
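On TBPTT specifically: truncation keeps only the most recent few steps of the gradient recursion, trading a little bias for bounded compute and memory. A scalar toy example where full and truncated gradients can be compared directly (all constants arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)

# Scalar linear recurrence h_t = w * h_{t-1} + x_t, with loss L = h_T.
w = 0.7
T, trunc = 100, 20
x = rng.normal(size=T)

h = np.zeros(T + 1)
for t in range(T):
    h[t + 1] = w * h[t] + x[t]

# dL/dw = sum_t h[t] * w**(T-1-t): each past step contributes, discounted
# by how many times it passes through w on the way to h_T.
contrib = np.array([h[t] * w ** (T - 1 - t) for t in range(T)])
grad_full = contrib.sum()                # full BPTT: all T terms
grad_tbptt = contrib[-trunc:].sum()      # TBPTT: only the last `trunc` terms
```

When the recurrence is stable (|w| < 1) the dropped terms are geometrically discounted, so the truncated gradient is close to the full one; near instability the bias grows, which is where the subtlety lives.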
Tensorflow
See tensorflow.
Pytorch
Refs
 BeAt16: Souhaib Ben Taieb, Amir F. Atiya (2016) A Bias and Variance Analysis for Multistep-Ahead Time Series Forecasting. IEEE Transactions on Neural Networks and Learning Systems, 27(1), 62–76. DOI
 KGGS14: Jan Koutník, Klaus Greff, Faustino Gomez, Jürgen Schmidhuber (2014) A Clockwork RNN. ArXiv:1402.3511 [Cs].
 LiBE15: Zachary C. Lipton, John Berkowitz, Charles Elkan (2015) A Critical Review of Recurrent Neural Networks for Sequence Learning. ArXiv:1506.00019 [Cs].
 MoRe12: Derek Monner, James A. Reggia (2012) A generalized LSTM-like training algorithm for second-order recurrent neural networks. Neural Networks, 25, 70–83. DOI
 RERH18: Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, Douglas Eck (2018) A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music. ArXiv:1803.05428 [Cs, Eess, Stat].
 WiZi89: Ronald J. Williams, David Zipser (1989) A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural Computation, 1(2), 270–280. DOI
 AuBM08: Peter Auer, Harald Burgsteiner, Wolfgang Maass (2008) A learning rule for very simple universal approximators consisting of a single layer of perceptrons. Neural Networks, 21(5), 786–795. DOI
 WeTN17: Ruofeng Wen, Kari Torkkola, Balakrishnan Narayanaswamy (2017) A Multi-Horizon Quantile Recurrent Forecaster. ArXiv:1711.11053 [Stat].
 HeWH16: Kun He, Yan Wang, John Hopcroft (2016) A Powerful Generative Model Using Random Weights for the Deep Image Representation. In Advances in Neural Information Processing Systems.
 CKDG15: Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, Yoshua Bengio (2015) A Recurrent Latent Variable Model for Sequential Data. In Advances in Neural Information Processing Systems 28 (pp. 2980–2988). Curran Associates, Inc.
 LaBr16: Thomas Laurent, James von Brecht (2016) A recurrent neural network without chaos. ArXiv:1612.06212 [Cs].
 GaGh16: Yarin Gal, Zoubin Ghahramani (2016) A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. In arXiv:1512.05287 [stat].
 NCRG16: Markus NussbaumThom, Jia Cui, Bhuvana Ramabhadran, Vaibhava Goel (2016) Acoustic Modeling Using Bidirectional Gated Recurrent Convolutional Units. (pp. 390–394). DOI
 MoDH12: A. Mohamed, G. E. Dahl, G. Hinton (2012) Acoustic Modeling Using Deep Belief Networks. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 14–22. DOI
 WiPe90: Ronald J. Williams, Jing Peng (1990) An Efficient Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories. Neural Computation, 2(4), 490–501. DOI
 JoZS15: Rafal Jozefowicz, Wojciech Zaremba, Ilya Sutskever (2015) An empirical exploration of recurrent network architectures. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15) (pp. 2342–2350).
 Werb90: P. J. Werbos (1990) Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10), 1550–1560. DOI
 Stei04: J. J. Steil (2004) Backpropagation-decorrelation: online recurrent learning with O(N) complexity. In 2004 IEEE International Joint Conference on Neural Networks, 2004. Proceedings (Vol. 2, pp. 843–848 vol.2). DOI
 FoBV17: Meire Fortunato, Charles Blundell, Oriol Vinyals (2017) Bayesian Recurrent Neural Networks. ArXiv:1704.02798 [Cs, Stat].
 RGMP18: Thomas Ryder, Andrew Golightly, A. Stephen McGough, Dennis Prangle (2018) Black-box Variational Inference for Stochastic Differential Equations. ArXiv:1802.03335 [Stat].
 CoSS16: Jasmine Collins, Jascha Sohl-Dickstein, David Sussillo (2016) Capacity and Trainability in Recurrent Neural Networks. In arXiv:1611.09913 [cs, stat].
 MaNM04: W. Maass, T. Natschläger, H. Markram (2004) Computational Models for Generic Cortical Microcircuits. In Computational Neuroscience: A Comprehensive Approach (pp. 575–605). Chapman & Hall/CRC
 BoLe06: Oliver Bown, Sebastian Lexer (2006) Continuous-Time Recurrent Neural Networks for Generative and Interactive Musical Performance. In Applications of Evolutionary Computing (pp. 652–663). Springer Berlin Heidelberg
 DoPo15: Keith B. Doelling, David Poeppel (2015) Cortical entrainment to music and its modulation by expertise. Proceedings of the National Academy of Sciences, 112(45), E6233–E6242. DOI
 KrSS15: Rahul G. Krishnan, Uri Shalit, David Sontag (2015) Deep kalman filters. ArXiv Preprint ArXiv:1511.05121.
 Mart10: James Martens (2010) Deep Learning via Hessian-free Optimization. In Proceedings of the 27th International Conference on International Conference on Machine Learning (pp. 735–742). USA: Omnipress
 RaSP16: Siamak Ravanbakhsh, Jeff Schneider, Barnabas Poczos (2016) Deep Learning with Sets and Point Clouds. In arXiv:1611.04500 [cs, stat].
 ASKR12: Ebru Arisoy, Tara N. Sainath, Brian Kingsbury, Bhuvana Ramabhadran (2012) Deep Neural Network Language Models. In Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT (pp. 20–28). Stroudsburg, PA, USA: Association for Computational Linguistics
 HDYD12: G. Hinton, Li Deng, Dong Yu, G.E. Dahl, A. Mohamed, N. Jaitly, … B. Kingsbury (2012) Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine, 29(6), 82–97. DOI
 YTCB15: Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron Courville (2015) Describing Videos by Exploiting Temporal Structure. ArXiv:1502.08029 [Cs, Stat].
 Chev07: Guillaume Chevillon (2007) Direct Multi-Step Estimation and Forecasting. Journal of Economic Surveys, 21(4), 746–785. DOI
 ChYR16: Adam Charles, Dong Yin, Christopher Rozell (2016) Distributed Sequence Memory of Multidimensional Inputs in Recurrent Networks. ArXiv:1605.08346 [Cs, Math, Stat].
 GDGR15: Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, Daan Wierstra (2015) DRAW: A Recurrent Neural Network For Image Generation. ArXiv:1502.04623 [Cs].
 MHRB17: Zakaria Mhammedi, Andrew Hellicar, Ashfaqur Rahman, James Bailey (2017) Efficient Orthogonal Parametrisation of Recurrent Neural Networks Using Householder Reflections. In PMLR (pp. 2401–2409).
 CGCB14: Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, Yoshua Bengio (2014) Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. In NIPS.
 MLTH17: Chris J. Maddison, Dieterich Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, … Yee Whye Teh (2017) Filtering Variational Objectives. ArXiv Preprint ArXiv:1705.09279.
 Elma90: Jeffrey L Elman (1990) Finding structure in time. Cognitive Science, 14, 179–211. DOI
 WPHL16: Scott Wisdom, Thomas Powers, John Hershey, Jonathan Le Roux, Les Atlas (2016) Full-capacity unitary recurrent neural networks. In Advances in Neural Information Processing Systems (pp. 4880–4888).
 CGCB15: Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, Yoshua Bengio (2015) Gated Feedback Recurrent Neural Networks. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 (pp. 2067–2075). JMLR.org
 Werb88: Paul J. Werbos (1988) Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4), 339–356. DOI
 ThBe15: Lucas Theis, Matthias Bethge (2015) Generative Image Modeling Using Spatial LSTMs. ArXiv:1506.03478 [Cs, Stat].
 HBFS01: Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Jürgen Schmidhuber (2001) Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies. In A field guide to dynamical recurrent neural networks. IEEE Press
 Lecu98: Y. LeCun (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324. DOI
 ChAB16: Junyoung Chung, Sungjin Ahn, Yoshua Bengio (2016) Hierarchical Multiscale Recurrent Neural Networks. ArXiv:1609.01704 [Cs].
 Husz15: Ferenc Huszár (2015) How (not) to Train your Generative Model: Scheduled Sampling, Likelihood, Adversary? ArXiv:1511.05101 [Cs, Math, Stat].
 Mnih15: V. Mnih (2015) Human-level control through deep reinforcement learning. Nature, 518, 529–533. DOI
 KSJC16: Diederik P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, Max Welling (2016) Improving Variational Inference with Inverse Autoregressive Flow. In Advances in Neural Information Processing Systems 29. Curran Associates, Inc.
 WPPA16: Scott Wisdom, Thomas Powers, James Pitton, Les Atlas (2016) Interpretable Recurrent Neural Networks Using Sequential Sparse Recovery. In Advances in Neural Information Processing Systems 29.
 HaSZ17: Elad Hazan, Karan Singh, Cyril Zhang (2017) Learning Linear Dynamical Systems via Spectral Filtering. In NIPS.
 BeSF94: Y. Bengio, P. Simard, P. Frasconi (1994) Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166. DOI
 CMGB14: Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio (2014) Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In EMNLP 2014.
 GeSS02: Felix A. Gers, Nicol N. Schraudolph, Jürgen Schmidhuber (2002) Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research, 3(Aug), 115–143.
 MaSu11: James Martens, Ilya Sutskever (2011) Learning Recurrent Neural Networks with Hessian-free Optimization. In Proceedings of the 28th International Conference on International Conference on Machine Learning (pp. 1033–1040). USA: Omnipress
 RuHW86: David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams (1986) Learning representations by back-propagating errors. Nature, 323(6088), 533–536. DOI
 GeSC00: Felix A. Gers, Jürgen Schmidhuber, Fred Cummins (2000) Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10), 2451–2471. DOI
 HoSc97a: Sepp Hochreiter, Jürgen Schmidhuber (1997a) Long Short-Term Memory. Neural Computation, 9(8), 1735–1780. DOI
 HoSc97b: Sepp Hochreiter, Jürgen Schmidhuber (1997b) LSTM can solve hard time lag problems. In Advances in Neural Information Processing Systems: Proceedings of the 1996 Conference (pp. 473–479).
 GMDL16: Audrunas Gruslys, Remi Munos, Ivo Danihelka, Marc Lanctot, Alex Graves (2016) Memory-Efficient Backpropagation Through Time. In Advances in Neural Information Processing Systems 29 (pp. 4125–4133). Curran Associates, Inc.
 TaHR06: Graham W. Taylor, Geoffrey E. Hinton, Sam T. Roweis (2006) Modeling human motion using binary latent variables. In Advances in neural information processing systems (pp. 1345–1352).
 BoBV12: Nicolas Boulanger-Lewandowski, Yoshua Bengio, Pascal Vincent (2012) Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription. In 29th International Conference on Machine Learning.
 SZLB95: Jonas Sjöberg, Qinghua Zhang, Lennart Ljung, Albert Benveniste, Bernard Delyon, Pierre-Yves Glorennec, … Anatoli Juditsky (1995) Nonlinear black-box modeling in system identification: a unified overview. Automatica, 31(12), 1691–1724. DOI
 WZZB16: Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, Ruslan R Salakhutdinov (2016) On Multiplicative Integration with Recurrent Neural Networks. In Advances in Neural Information Processing Systems 29 (pp. 2856–2864). Curran Associates, Inc.
 PaMB13: Razvan Pascanu, Tomas Mikolov, Yoshua Bengio (2013) On the difficulty of training Recurrent Neural Networks. In arXiv:1211.5063 [cs] (pp. 1310–1318).
 CMBB14: Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, Yoshua Bengio (2014) On the properties of neural machine translation: Encoder-decoder approaches. ArXiv Preprint ArXiv:1409.1259.
 SuPf16: Simone Carlo Surace, JeanPascal Pfister (2016) Online Maximum Likelihood Estimation of the Parameters of Partially Observed Diffusion Processes.
 NWSS16: Behnam Neyshabur, Yuhuai Wu, Ruslan R Salakhutdinov, Nati Srebro (2016) Path-Normalized Optimization of Recurrent Neural Networks with ReLU Activations. In Advances in Neural Information Processing Systems 29 (pp. 3477–3485). Curran Associates, Inc.
 NePL16: Daniel Neil, Michael Pfeiffer, Shih-Chii Liu (2016) Phased LSTM: Accelerating Recurrent Network Training for Long or Event-based Sequences. In Advances in Neural Information Processing Systems 29 (pp. 3882–3890). Curran Associates, Inc.
 Grav11: Alex Graves (2011) Practical Variational Inference for Neural Networks. In Proceedings of the 24th International Conference on Neural Information Processing Systems (pp. 2348–2356). USA: Curran Associates Inc.
 LGZZ16: Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron Courville, Yoshua Bengio (2016) Professor Forcing: A New Algorithm for Training Recurrent Networks. In Advances In Neural Information Processing Systems.
 CBLG16: Tim Cooijmans, Nicolas Ballas, César Laurent, Çağlar Gülçehre, Aaron Courville (2016) Recurrent batch normalization. ArXiv Preprint ArXiv:1603.09025.
 MKBČ10: Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černockỳ, Sanjeev Khudanpur (2010) Recurrent Neural Network Based Language Model. In Eleventh Annual Conference of the International Speech Communication Association.
 DaYO16: Sakyasingha Dasgupta, Takayuki Yoshizumi, Takayuki Osogami (2016) Regularized Dynamic Boltzmann Machine with Delay Pruning for Unsupervised Learning of Temporal Sequences. ArXiv:1610.01989 [Cs, Stat].
 VKCM15: Francesco Visin, Kyle Kastner, Kyunghyun Cho, Matteo Matteucci, Aaron Courville, Yoshua Bengio (2015) ReNet: A Recurrent Neural Network Based Alternative to Convolutional Networks. ArXiv:1505.00393 [Cs].
 LuJa09: Mantas Lukoševičius, Herbert Jaeger (2009) Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3), 127–149. DOI
 BVJS15: Samy Bengio, Oriol Vinyals, Navdeep Jaitly, Noam Shazeer (2015) Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems 28 (pp. 1171–1179). Cambridge, MA, USA: Curran Associates, Inc.
 FSPW16: Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, Ole Winther (2016) Sequential Neural Models with Stochastic Layers. In Advances in Neural Information Processing Systems 29 (pp. 2199–2207). Curran Associates, Inc.
 PaHC15: Viorica Patraucean, Ankur Handa, Roberto Cipolla (2015) Spatiotemporal video autoencoder with differentiable memory. ArXiv:1511.06309 [Cs].
 Grav12: Alex Graves (2012) Supervised sequence labelling with recurrent neural networks. Heidelberg; New York: Springer
 AnBe17: Alexander G. Anderson, Cory P. Berg (2017) The High-Dimensional Geometry of Binary Neural Networks. ArXiv:1705.07199 [Cs].
 Pill16: Gianluigi Pillonetto (2016) The interplay between system identification and machine learning. ArXiv:1612.09158 [Cs, Stat].
 RoRS15: Anna Rohrbach, Marcus Rohrbach, Bernt Schiele (2015) The Long-Short Story of Movie Description. ArXiv:1506.01698 [Cs].
 BFLL17: David Balduzzi, Marcus Frean, Lennox Leary, J. P. Lewis, Kurt Wan-Duo Ma, Brian McWilliams (2017) The Shattered Gradients Problem: If resnets are the answer, then what is the question? In PMLR (pp. 342–350).
 OlPS17: Junier B. Oliva, Barnabas Poczos, Jeff Schneider (2017) The Statistical Recurrent Unit. ArXiv:1703.00381 [Cs, Stat].
 Hoch98: Sepp Hochreiter (1998) The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty Fuzziness and Knowledge Based Systems, 6, 107–115.
 HaMa12: Hananel Hazan, Larry M. Manevitz (2012) Topological constraints and robustness in liquid state machines. Expert Systems with Applications, 39(2), 1597–1606. DOI
 MaSu12: James Martens, Ilya Sutskever (2012) Training deep and recurrent networks with Hessian-free optimization. In Neural networks: Tricks of the trade (pp. 479–535). Springer
 JSDP17: Li Jing, Yichen Shen, Tena Dubcek, John Peurifoy, Scott Skirlo, Yann LeCun, … Marin Soljačić (2017) Tunable Efficient Unitary Neural Networks (EUNN) and their application to RNNs. In PMLR (pp. 1733–1741).
 Jaeg02: Herbert Jaeger (2002) Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach (Vol. 5). GMD-Forschungszentrum Informationstechnik
 TaOl17: Corentin Tallec, Yann Ollivier (2017) Unbiasing Truncated Backpropagation Through Time. ArXiv:1705.08209 [Cs].
 ArSB16: Martin Arjovsky, Amar Shah, Yoshua Bengio (2016) Unitary Evolution Recurrent Neural Networks. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48 (pp. 1120–1128). New York, NY, USA: JMLR.org
 KaJF15: Andrej Karpathy, Justin Johnson, Li Fei-Fei (2015) Visualizing and Understanding Recurrent Networks. ArXiv:1506.02078 [Cs].
 LeNM05: Robert Legenstein, Christian Naeger, Wolfgang Maass (2005) What Can a Neuron Learn with Spike-Timing-Dependent Plasticity? Neural Computation, 17(11), 2337–2382. DOI
 BuMe05: Catalin V. Buhusi, Warren H. Meck (2005) What makes us tick? Functional and neural mechanisms of interval timing. Nature Reviews Neuroscience, 6(10), 755–765. DOI
 MiHa18: John Miller, Moritz Hardt (2018) When Recurrent Models Don’t Need To Be Recurrent. ArXiv:1805.10369 [Cs, Stat].
 GCWK09: B. J. Grzyb, E. Chinellato, G. M. Wojcik, W. A. Kaminski (2009) Which model to use for the Liquid State Machine? In 2009 International Joint Conference on Neural Networks (pp. 1018–1024). DOI