
Compressing artificial neural networks

It seems we should be able to do better than a gigantic network with millions of parameters.

CWZZ17 is a popular summary article of the state-of-the-art theory, and LoWK17 of the practice, at their respective publication dates.

Question: how do you do this with recurrent neural networks? To read: NaUD17.

Put plainly: once we have trained the graph, how can we simplify it, compress it, or prune it? One model here is the “Student-Teacher” network, where you use one big network to train a little network. RBKC14, UGKA16… Summarised by Tomasz Malisiewicz:

we now have teacher-student training algorithms which you can use to have a shallower network “mimic” the teacher’s responses on a large dataset. These shallower networks are able to learn much better using a teacher and in fact, such shallow networks produce inferior results when they are trained on the teacher’s training set. So it seems you can go [Data to MegaDeep], and [MegaDeep to MiniDeep], but you cannot directly go from [Data to MiniDeep].

NB these networks are still bloody big, much bigger than we might hope.
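To make the teacher-student idea concrete, here is a minimal sketch of a distillation-style objective, assuming we already have teacher logits on a transfer set; the temperature and the loss weighting are illustrative assumptions, not the specific recipes of RBKC14 or UGKA16:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T gives softer targets."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Blend of (a) cross-entropy against the teacher's softened outputs and
    (b) ordinary cross-entropy against the hard labels."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student_T = np.log(softmax(student_logits, T) + 1e-12)
    soft_loss = -(p_teacher * log_p_student_T).sum(axis=-1).mean()

    log_p_student = np.log(softmax(student_logits) + 1e-12)
    hard_loss = -log_p_student[np.arange(len(labels)), labels].mean()

    return alpha * soft_loss + (1 - alpha) * hard_loss

# toy usage: 5 examples, 10 classes
rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(5, 10))
student_logits = rng.normal(size=(5, 10))
labels = rng.integers(0, 10, size=5)
print(distillation_loss(student_logits, teacher_logits, labels))
```

The student is trained by minimising this loss over the transfer set; the teacher only ever runs forward.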

This all seems intuitive, for the following hand-wavy reason: overparameterization is demonstrably important, and the network seems to need some “slack variables” for assimilating all the data it receives. However, when the network has reached a “good” optimum, some of those parameters are no longer needed; a much smaller representation of the manifold that each layer learned is probably available. But how much smaller?

This is suggestive of using some of the dimension reduction ideas such as mixture models, or whatever function approximation / matrix factorisation takes your fancy, to learn a “good” approximation of each layer, once the overall network is trained.
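One crude post-hoc version of this, for a single dense layer, is a truncated SVD of its weight matrix, replacing one m×n matrix with two thin factors; the rank below is a made-up tuning knob, not a prescription from any of the cited papers:

```python
import numpy as np

def low_rank_factorise(W, rank):
    """Replace a dense weight matrix W (m x n) with factors A (m x r), B (r x n).
    Storage drops from m*n to r*(m+n) parameters."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # m x r
    B = Vt[:rank, :]             # r x n
    return A, B

rng = np.random.default_rng(1)
W = rng.normal(size=(512, 256))          # stand-in for a trained layer
A, B = low_rank_factorise(W, rank=32)
print("params before:", W.size, "after:", A.size + B.size)
print("relative error:", np.linalg.norm(W - A @ B) / np.linalg.norm(W))
```

In practice one would fine-tune the factored network afterwards to recover lost accuracy.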

Song Han gives a presentation that probably touches on some of this.

Quantizing to fewer bits is another popular approach. (8 bits, 1 bit…)
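A hedged sketch of the simplest version, uniform 8-bit quantisation of a weight tensor; real schemes (and the 1-bit variants) are more careful about clipping ranges, per-channel scales, and retraining:

```python
import numpy as np

def quantise_uint8(W):
    """Map float weights onto 256 evenly spaced levels; keep scale/offset
    so the weights can be approximately reconstructed at inference time."""
    lo, hi = W.min(), W.max()
    scale = (hi - lo) / 255.0
    q = np.round((W - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantise(q, scale, lo):
    return q.astype(np.float32) * scale + lo

rng = np.random.default_rng(2)
W = rng.normal(scale=0.1, size=(256, 256)).astype(np.float32)
q, scale, lo = quantise_uint8(W)
W_hat = dequantise(q, scale, lo)
print("max abs error:", np.abs(W - W_hat).max())  # bounded by about scale / 2
```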

How about, as presaged, matrix-sketching-type approaches? There is a suggestive link with compressive sensing and low-rank matrix factorisation. HaMD15 looks like this, and it combines other approaches too, but there must be more?
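To give the flavour of the genre, here is a hashing-trick weight-sharing layer, where a large virtual weight matrix is backed by a small vector of shared parameters; this is an illustration of sketching-style compression under my own assumptions, not the specific construction of HaMD15:

```python
import numpy as np

def hashed_layer(x, buckets, shape, seed=0):
    """Weight-sharing via the hashing trick: the virtual m x n weight matrix
    is never stored; each entry is looked up in a small parameter vector."""
    m, n = shape
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(buckets), size=(m, n))  # fixed random "hash"
    sign = rng.choice([-1.0, 1.0], size=(m, n))       # sign hash reduces bias
    W_virtual = sign * buckets[idx]                   # materialised here only for clarity
    return x @ W_virtual.T

rng = np.random.default_rng(3)
buckets = rng.normal(size=1000)                 # ~1k real parameters ...
x = rng.normal(size=(4, 256))
y = hashed_layer(x, buckets, shape=(64, 256))   # ... standing in for a 64*256 matrix
print(y.shape)                                  # (4, 64)
```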

Here is an interesting attempt to examine a related problem in reverse, and connect it to recurrent neural networks:

Hypernetworks (HaDL16):

Most modern neural network architectures are either a deep ConvNet, or a long RNN, or some combination of the two. These two architectures seem to be at opposite ends of a spectrum. Recurrent Networks can be viewed as a really deep feed forward network with the identical weights at each layer (this is called weight-tying). A deep ConvNet allows each layer to be different. But perhaps the two are related somehow. Every year, the winning ImageNet models get deeper and deeper. Think about a deep 110-layer, or even 1001-layer Residual Network architectures we keep hearing about. Do all 110 layers have to be unique? Are most layers even useful?

People have already thought of forcing a deep ConvNet to be like an RNN, i.e. with identical weights at every layer. However, if we force a deep ResNet to have its weight tied, the performance would be embarrassing. In our paper, we use HyperNetworks to explore a middle ground - to enforce a relaxed version of weight-tying. A HyperNetwork is just a small network that generates the weights of a much larger network, like the weights of a deep ResNet, effectively parameterizing the weights of each layer of the ResNet. We can use a hypernetwork to explore the tradeoff between the model’s expressivity versus how much we tie the weights of a deep ConvNet. It is kind of like applying compression to an image, and being able to adjust how much compression we want to use, except here, the images are the weights of a deep ConvNet.
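A toy forward pass of that idea: a small generator emits the weight matrix of each layer of the main network from a per-layer embedding. The layer sizes and the purely linear generator below are arbitrary assumptions for illustration, not the architecture of HaDL16:

```python
import numpy as np

rng = np.random.default_rng(4)
d_embed, d_hidden, n_layers = 4, 64, 16

# the "hypernetwork": here just a single linear map from a small per-layer
# embedding to a full d_hidden x d_hidden weight matrix for that layer
G = rng.normal(scale=0.05, size=(d_embed, d_hidden * d_hidden))
layer_embeddings = rng.normal(size=(n_layers, d_embed))  # learned jointly in practice

def main_network(x):
    """Relaxed weight tying: every layer shares the generator G, and
    differs only through its small embedding vector."""
    h = x
    for z in layer_embeddings:
        W = (z @ G).reshape(d_hidden, d_hidden)  # weights are generated, not stored
        h = np.maximum(h @ W, 0.0)               # plain ReLU layer
    return h

x = rng.normal(size=(2, d_hidden))
print(main_network(x).shape)  # (2, 64)

# parameters to store: generator + embeddings, vs. storing every layer directly
print(G.size + layer_embeddings.size, "vs", n_layers * d_hidden * d_hidden)
```

Shrinking the embedding dimension tightens the weight tying; growing it moves back towards fully independent layers.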

Interestingly, the authors are keen on applications in generative art neural networks, and have posted a (somewhat abstruse) implementation.

They also mention a version by Schmidhuber’s omnipresent lab: Compressed Network Search

Regularisation

Normally regularisation penalties are not used to reduce the overall size of a neural network. In matrix terms, they seem to do matrix sparsification but not matrix sketching.

See PaDG16 for one attempt to drop neurons:

DropNeuron is aimed at training a small model from a large, randomly initialized model, rather than compressing or reducing a large trained model. DropNeuron can be used in combination with other regularization techniques, e.g. Dropout, L1, L2.
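The flavour of penalty involved is a group-sparsity one: penalise whole rows or columns of a weight matrix so that entire neurons are driven to zero and can then be removed. The sketch below is the generic group-lasso idea with a made-up pruning threshold, not PaDG16’s exact regulariser:

```python
import numpy as np

def group_lasso_penalty(W, axis=1):
    """Sum of L2 norms of weight groups; each group is one neuron's
    incoming (axis=1) or outgoing (axis=0) weights. Added to the training
    loss, it drives whole groups towards exactly zero."""
    return np.sqrt((W ** 2).sum(axis=axis)).sum()

def prune_dead_neurons(W, tol=1e-3):
    """After training with the penalty, delete rows whose norm has collapsed."""
    norms = np.sqrt((W ** 2).sum(axis=1))
    keep = norms > tol
    return W[keep], keep

rng = np.random.default_rng(5)
W = rng.normal(size=(128, 64))
W[::2] *= 1e-6                      # pretend the penalty zeroed half the neurons
print("penalty:", group_lasso_penalty(W))
W_small, keep = prune_dead_neurons(W)
print("neurons kept:", keep.sum(), "of", len(keep))
```

Unlike plain L1 on individual weights, this actually shrinks the layer dimensions, which is what reduces the overall size of the network.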

Refs