The Living Thing / Notebooks : Compressing artificial neural networks

It seems we should be able to do better than a gigantic network with millions of parameters. Once we have trained the graph, how can we simplify it, compress it, or prune it? One model here is the “Student-Teacher” network, where you use one big network to train a little one (RBKC14, UGKA16…). Summarised by Tomasz Malisiewicz:

we now have teacher-student training algorithms which you can use to have a shallower network “mimic” the teacher’s responses on a large dataset. These shallower networks are able to learn much better using a teacher; in fact, such shallow networks produce inferior results when they are trained directly on the teacher’s training set. So it seems you can go [Data to MegaDeep], and [MegaDeep to MiniDeep], but you cannot directly go from [Data to MiniDeep].
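For concreteness, here is a minimal sketch of the mimic idea in PyTorch (my own toy version, not the recipe of any particular paper): a small student is trained to match the softened outputs of a big, already-trained teacher. The layer sizes, the temperature T and the optimiser settings are arbitrary placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins: `teacher` would be a big pretrained network, `student` a small
# untrained one; both map the same inputs to logits over the same classes.
teacher = nn.Sequential(nn.Linear(784, 1200), nn.ReLU(), nn.Linear(1200, 10))
student = nn.Sequential(nn.Linear(784, 30), nn.ReLU(), nn.Linear(30, 10))

T = 4.0  # softening temperature (free hyperparameter)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

def distill_step(x):
    with torch.no_grad():
        soft_targets = F.softmax(teacher(x) / T, dim=1)
    # The student matches the teacher's softened outputs, not hard labels.
    loss = F.kl_div(F.log_softmax(student(x) / T, dim=1),
                    soft_targets, reduction="batchmean") * T * T
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

distill_step(torch.randn(32, 784))  # one step on a random batch
```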

I haven’t actually read the papers here yet, but this seems intuitive, for the following hand-wavy reason: large neural networks stay far from the optimum for a long time, so they assimilate a lot of data over the course of stochastic gradient descent; they use a simulated-annealing-like process to explore many possible configurations, much as a Monte Carlo sampler explores a large parameter space. Since they are still seeking optima, in a sense we need an excess of parameters to give them enough “maladaptation” to escape bad optima. Once the network has reached a “good” optimum, though, those extra parameters are no longer needed; a much lower-dimensional representation of the manifold that each layer has learned is probably available.

This suggests using dimension-reduction ideas (mixture models, or whatever function approximation / matrix factorisation takes your fancy) to learn a “good” approximation of each layer once the overall network is trained.
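As a toy version of the matrix-factorisation flavour of this, here is a post-hoc truncated SVD of a single layer’s weight matrix in numpy (a random matrix stands in for trained weights, and the target rank k is a free parameter):

```python
import numpy as np

# Post-hoc low-rank approximation of one trained dense layer.
rng = np.random.default_rng(0)
W = rng.normal(size=(1024, 512))   # stand-in for a trained weight matrix

k = 64                             # target rank: trades size against fidelity
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :k] * s[:k]               # (1024, k)
B = Vt[:k, :]                      # (k, 512)

# The layer x -> W @ x becomes two thinner layers x -> A @ (B @ x);
# storage drops from 1024*512 to k*(1024+512) parameters.
print(np.linalg.norm(W - A @ B) / np.linalg.norm(W))  # relative error
```

On a genuinely trained layer the spectrum usually decays much faster than for this random stand-in, so the error at a given rank should be smaller.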

Question: how do you do this with recurrent neural networks?

Song Han gives a presentation that probably touches on some of this.

Quantising to fewer bits is another popular approach (8 bits: Dett15; 1 bit…).
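As a toy illustration, a naive affine quantisation to 8 bits (my own simplification, not the scheme of Dett15) looks like this:

```python
import numpy as np

def quantise_8bit(w):
    # One scale per tensor; per-row or per-channel scales are a common refinement.
    scale = max(np.abs(w).max(), 1e-12) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantise(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).normal(scale=0.05, size=(256, 256)).astype(np.float32)
q, scale = quantise_8bit(w)
print("max abs error:", np.abs(w - dequantise(q, scale)).max())  # about scale/2
```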

How about, as presaged, matrix-sketching-type approaches? There is a suggestive link with compressive sensing and low-rank matrix factorisation. HaMD15 looks like this, and it combines other approaches too, but there must be more?
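One concrete member of that family is random weight sharing via hashing, as in CWTW15b and ShFZ16; here is a toy rendition (details mine, not theirs) in which a large virtual weight matrix is backed by a much smaller parameter vector through a hash of each index pair:

```python
import numpy as np

rows, cols, budget = 512, 512, 4096          # 262,144 virtual weights, 4,096 real ones
params = np.random.default_rng(2).normal(size=budget)

i, j = np.indices((rows, cols))
idx = (i * 1_000_003 + j * 7919) % budget    # cheap stand-in for a proper hash function
W_virtual = params[idx]                      # shape (512, 512), backed by 4,096 floats

# Gradients with respect to W_virtual accumulate back onto the shared
# entries of `params`, which is what keeps the model small during training.
print(W_virtual.shape, params.size)
```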

Here is an interesting attempt to examine the problem in reverse, and connect it to recurrent neural networks:

Hypernetworks (HaDL16):

Most modern neural network architectures are either a deep ConvNet, or a long RNN, or some combination of the two. These two architectures seem to be at opposite ends of a spectrum. Recurrent networks can be viewed as a really deep feedforward network with identical weights at each layer (this is called weight-tying). A deep ConvNet allows each layer to be different. But perhaps the two are related somehow. Every year, the winning ImageNet models get deeper and deeper. Think about the deep 110-layer, or even 1001-layer, Residual Network architectures we keep hearing about. Do all 110 layers have to be unique? Are most layers even useful?

People have already thought of forcing a deep ConvNet to be like an RNN, i.e. with identical weights at every layer. However, if we force a deep ResNet to have its weights tied, the performance would be embarrassing. In our paper, we use HyperNetworks to explore a middle ground: to enforce a relaxed version of weight-tying. A HyperNetwork is just a small network that generates the weights of a much larger network, like the weights of a deep ResNet, effectively parameterizing the weights of each layer of the ResNet. We can use a hypernetwork to explore the tradeoff between the model’s expressivity and how much we tie the weights of a deep ConvNet. It is kind of like applying compression to an image, and being able to adjust how much compression we want to use, except here, the images are the weights of a deep ConvNet.
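A minimal caricature of the static-hypernetwork idea (dimensions and architecture invented for illustration; the paper’s actual setup is more elaborate): one small generator, shared across layers, emits each main layer’s weight matrix from a per-layer embedding, which is the relaxed weight-tying described above.

```python
import torch
import torch.nn as nn

class ToyHyperNet(nn.Module):
    def __init__(self, n_layers=16, dim=64, z_dim=4):
        super().__init__()
        self.dim = dim
        self.z = nn.Parameter(torch.randn(n_layers, z_dim))   # one embedding per layer
        self.generator = nn.Linear(z_dim, dim * dim)           # shared weight generator

    def forward(self, x):
        for z_l in self.z:                                     # generate each layer's weights
            W = self.generator(z_l).view(self.dim, self.dim)
            x = torch.relu(x @ W.t())
        return x

net = ToyHyperNet()
print(net(torch.randn(8, 64)).shape)   # torch.Size([8, 64])
# ~20k learnable parameters (generator + embeddings) stand in for the
# ~66k entries of 16 untied 64x64 weight matrices.
```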

Interestingly, the authors are keen on applications in generative art neural networks, and have posted a (somewhat abstruse) implementation.

They also mention a version from Schmidhuber’s omnipresent lab: Compressed Network Search.

Regularisation

Normally regularisation penalties are not used to reduce the overall size of a neural network. In matrix terms, they seem to do matrix sparsification but not matrix sketching.

See PaDG16 for one attempt to drop neurons:

DropNeuron is aimed at training a small model from a large, randomly initialized model, rather than compressing or reducing a large trained model. DropNeuron can be mixed with other regularization techniques, e.g. Dropout, L1, L2.
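For flavour, a penalty in this family might look like the following: a group-sparsity term on each neuron’s incoming weight row, in the spirit of PaDG16 and SCHU16 but not their exact formulation. Rows driven to (near) zero correspond to neurons that can be deleted outright after training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def neuron_group_penalty(layer: nn.Linear, lam: float = 1e-3) -> torch.Tensor:
    # One group per output neuron: the L2 norm of its incoming weight row.
    return lam * layer.weight.norm(dim=1).sum()

layer = nn.Linear(256, 128)
x, y = torch.randn(32, 256), torch.randn(32, 128)
loss = F.mse_loss(layer(x), y) + neuron_group_penalty(layer)
loss.backward()   # one step of an ordinary training loop

# After training, neurons whose row norm stays ~0 can be pruned:
keep = layer.weight.norm(dim=1) > 1e-4
print(int(keep.sum()), "of", layer.out_features, "neurons kept")
```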

Refs

ChGS15
Chen, T., Goodfellow, I., & Shlens, J. (2015) Net2Net: Accelerating Learning via Knowledge Transfer. arXiv:1511.05641 [Cs].
CWTW15a
Chen, W., Wilson, J. T., Tyree, S., Weinberger, K. Q., & Chen, Y. (2015a) Compressing Convolutional Neural Networks. arXiv:1506.04449 [Cs].
CWTW15b
Chen, W., Wilson, J. T., Tyree, S., Weinberger, K. Q., & Chen, Y. (2015b) Compressing Neural Networks with the Hashing Trick. arXiv:1504.04788 [Cs].
Dett15
Dettmers, T. (2015) 8-Bit Approximations for Parallelism in Deep Learning. arXiv:1511.04561 [Cs].
GlLi16
Globerson, A., & Livni, R. (2016) Learning Infinite-Layer Networks: Beyond the Kernel Trick. arXiv:1606.05316 [Cs].
HaDL16
Ha, D., Dai, A., & Le, Q. V. (2016) HyperNetworks. arXiv:1609.09106 [Cs].
HaMD15
Han, S., Mao, H., & Dally, W. J. (2015) Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv:1510.00149 [Cs].
LCMB15
Lin, Z., Courbariaux, M., Memisevic, R., & Bengio, Y. (2015) Neural Networks with Few Multiplications. arXiv:1510.03009 [Cs].
PaDG16
Pan, W., Dong, H., & Guo, Y. (2016) DropNeuron: Simplifying the Structure of Deep Neural Networks. arXiv:1606.07326 [Cs, Stat].
RBKC14
Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., & Bengio, Y. (2014) FitNets: Hints for Thin Deep Nets. arXiv:1412.6550 [Cs].
SCHU16
Scardapane, S., Comminiello, D., Hussain, A., & Uncini, A. (2016) Group Sparse Regularization for Deep Neural Networks. arXiv:1607.00485 [Cs, Stat].
ShFZ16
Shi, L., Feng, S., & Zhu, Z. (2016) Functional Hashing for Compressing Neural Networks. arXiv:1605.06560 [Cs].
StGa15
Steeg, G. V., & Galstyan, A. (2015) The Information Sieve. arXiv:1507.02284 [Cs, Math, Stat].
UGKA16
Urban, G., Geras, K. J., Kahou, S. E., Aslan, O., Wang, S., Caruana, R., … Richardson, M. (2016) Do Deep Convolutional Nets Really Need to be Deep (Or Even Convolutional)? arXiv:1603.05691 [Cs, Stat].
WCLH16
Wang, Z., Chang, S., Ling, Q., Huang, S., Hu, X., Shi, H., & Huang, T. S. (2016) Stacked Approximated Regression Machine: A Simple Deep Learning Approach. Presented at NIPS.