The Living Thing / Notebooks :

Deep learning

designing the fanciest usable differentiable loss surface

Bjorn Stenger:

Bjorn Stenger’s brief history of machine learning.
Bjorn Stenger’s brief history of machine learning.

Modern computational neural network methods reascend the hype phase transition. a.k.a deep learning or double plus fancy brainbots or please give the department have a bigger GPU budget it’s not to play video games I swear.

I don’t intend to write an introduction to deep learning here; that ground has been tilled already.

But here are some handy links to resources I frequently use.


To be specific, deep learning is

It’s a frothy (some might say foamy-mouthed) research bubble right now, with such cuteness at the extrema as, e.g. Inceptionising inceptionism (ADGH16) which learns to learn neural networks using neural networks. (well, it sort of does that, but is a long way from a bootstrapping general AI) Stay tuned for more of this.

There is not much to do with “neurons” left in the paradigm at this stage. What there is, is a bundle of clever tricks for training deep constrained hierarchical predictors and classifiers on modern computer hardware. Something closer to a convenient technology stack than a single “theory”.

Some network methods hew closer to behaviour of real neurons, although not that close; simulating actual brains is a different discipline with only intermittent and indirect connection.

Subtopics of interest to me:

Why bother?

There are many answers here.

A classic —

The ultimate regression algorithm

…until the next ultimate regression algorithm.

It turns out that this particular learning model (class of learning models) and training technologies is surprisingly good at getting every better models out of ever more data. Why burn three grad students on a perfect tractable and specific regression algorithm when you can use one algorithm to solve a whole bunch of regression problems, and which improves with the number of computers and the amount of data you have? How much of a relief is it to capital to decouple its effectiveness from the uncertainty and obstreperousness of human labour?

Cool maths

Function approximations, interesting manifold inference. Weird product measure things, e.g. Mont14.

Even the stuff I’d assumed was trivial, like backpropagation, has a few wrinkles in practice. See Michael Nielson’s chapter and Chrisopher Olah’s visual summary.

Yes, this is a regular paper mill. Not only are there probably new insights to be had here, but also you can recycle any old machine learning insight, replace a layer in a network with that and poof – new paper.

Insight into the mind

TBD. Maybe.

There claims to be communication between real neurology and neural networks in computer vision, but elsewhere neural networks are driven by their similarities to other things, such as being differentiable relaxations of traditional models, (differentiable stack machines!) or of being license to fit hierarchical models without regard for statistical niceties.

There might be some kind of occasional “stylised fact”-type relationship here.

Trippy art projects

See generative art and neural networks

Hip keywords for NN models

Not necessarily mutually exclusive; some design patterns you can use.

There are many summaries floating around here. Some that I looked at are Tomasz Malisiewicz’s summary of Deep Learning Trends @ ICLR 2016, or the Neural network zoo or Simon Brugman’s deep learning papers.

Some of these are descriptions of topologies, others of training tricks or whatever. Recurrent and convolutional are two types of topologies you might have in your ANN. But there are so many other possible ones: “Grid”, “highway”, “Turing” others…

Many are mentioned in passing in David Mcallester’s Cognitive Architectures post.


See probabilistic Neural Networks.


Signal processing baked in to neural networks. Not so complicated if you have ever done signal processing, apart from the abstruse use of “depth” to mean 2 different things in the literature.

Generally uses FIR filters plus some smudgy “pooling” (which is nonlinear downsampling), although IIR is also making an appearance by running RNN on multiple axes.

Terence Broad’s convnet visualizer

See the convenets entry.

Generative Adversarial Networks

Train two networks to beat each other.

Recurrent neural networks

Feedback neural networks structures to have with memory and a notion of time and “current” versus “past” state. See recurrent neural networks.

GridRNN etc

A mini-genre. KaDG15 et al connect recurrent cells across multiple axes, leading to a higher-rank MIMO system; This is natural in many kinds of spatial random fields, and I am amazed it was uncommon enough to need formalizing in a paper; but apparently it was and it did.

Partial training

A.k.a. transfer learning. Recycling someone else’s features. I don’t know why this has a special term - I think it’s so that you can claim to do “end-to-end” learning, but then actually do what everyone else as done forever and works totally OK, which is to re-use other people’s work like real scientists.

Attention mechanism

What’s that now? Long story, but see transformer or Sparse Transformer for particularly developed examples and explanations of this sub-field.


Most simulated neural networks are based on a continuous activation potential and discrete time, unlike spiking biological ones, which are driven by discrete events in continuous time. There are a great many other differences (to real biology). What difference does this in particular make? I suspect it means that time is handled different.

Kernel networks

Kernel trick + ANN = kernel ANNs.

Stay tuned for reframing more things as deep learning.

I think this is what convex networks are also?

Francis Bach:

I’m sure the brain totes does this
I’m sure the brain totes does this

Bengio, Le Roux, Vincent, Delalleau, and Marcotte, 2006.

Extreme learning machines

Dunno. I think this is a flavour of random neural net?


TBD. Making a sparse encoding of something by demanding your network reproduces the after passing the network activations through a narrow bottleneck. Many flavours.

Optimisation methods

Backpropagation plus stochastic gradient descent rules at the moment.

Does anything else get performance at this scale? What other techniques can be extracted from variational inference or MC sampling, or particle filters, since there is no clear reason that shoving any of these in as intermediate layers in the network is any less well-posed than a classical backprop layer? Although it does require more nous from the enthusiastic grad student.

Preventing overfitting

See regularising deep learning.

Activations for neural networks

See activation functions


Various design niceties.

Managing those dimensions

Practically a lot of the time managing deep learning is remembering which axis is which.

Alexander Rush argues you want a NamedTensor. Implementations:

Einsum does einstein summation, whcih is also very helpful.

Software stuff

For general purposes I use,

I could use…

pre-computed/trained models