
Compressing artificial neural networks

It seems we should be able to do better than a gigantic network with millions of parameters.

CWZZ17 is a popular summary article of the state of the art in theory, and LoWK17 of the practice.

Question: how do you do this with recurrent neural networks? To read: NaUD17.

Put plainly: once we have trained the graph, how can we simplify it, compress it, or prune it? One model here is the “Student-Teacher” network, where you use one big network to train a little network (RBKC14, UGKA16…). Summarised by Tomasz Malisiewicz:

we now have teacher-student training algorithms which you can use to have a shallower network “mimic” the teacher’s responses on a large dataset. These shallower networks are able to learn much better using a teacher, and in fact such shallow networks produce inferior results when they are trained on the teacher’s training set. So it seems you can go [Data to MegaDeep], and [MegaDeep to MiniDeep], but you cannot directly go from [Data to MiniDeep].
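Here is a minimal sketch of the mimic-training idea (my own toy illustration in numpy, not the exact recipe of RBKC14 or UGKA16): a narrow student network is fit by gradient descent to the teacher’s logits on unlabelled inputs, rather than to hard labels.

    import numpy as np

    rng = np.random.default_rng(0)

    # Stand-in "teacher": a frozen wide random network. In practice this is
    # the big trained model whose outputs the student should mimic.
    W1_t = rng.standard_normal((256, 20)) * 0.1
    W2_t = rng.standard_normal((10, 256)) * 0.1
    def teacher_logits(X):
        return np.maximum(X @ W1_t.T, 0.0) @ W2_t.T

    # Narrow student, trained on a squared-error "mimic" loss against the
    # teacher's logits, with no hard labels involved.
    W1 = rng.standard_normal((32, 20)) * 0.1
    W2 = rng.standard_normal((10, 32)) * 0.1
    lr = 1e-2
    for step in range(2000):
        X = rng.standard_normal((64, 20))      # unlabelled "transfer" inputs
        T = teacher_logits(X)                  # teacher's soft targets
        H = np.maximum(X @ W1.T, 0.0)          # student hidden layer (ReLU)
        Y = H @ W2.T                           # student logits
        G = (Y - T) / len(X)                   # grad of batch-mean squared error w.r.t. Y
        dW2 = G.T @ H
        dW1 = ((G @ W2) * (H > 0)).T @ X
        W2 -= lr * dW2
        W1 -= lr * dW1
    print("final mimic loss:", 0.5 * np.mean(np.sum((Y - T) ** 2, axis=1)))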

I haven’t actually read the papers here yet, but this seems intuitive, for the following hand-wavy reason: large neural networks stay far from the optimum for a long time, so they assimilate a lot of data during stochastic gradient descent; they use a simulated-annealing-like process to explore many possible configurations, much as a Monte Carlo sampler explores a large parameter space. In a sense, we need the surplus parameters so that the network has enough “maladaptation” to escape bad optima while it is still searching. Once the network has settled into a “good” optimum, though, those extra parameters are no longer needed; a much lower-dimensional representation of the manifold each layer has learned is probably available.

This suggests using some dimension reduction idea, such as mixture models or whatever function approximation / matrix factorisation takes your fancy, to learn a “good” approximation of each layer once the overall network is trained, as in the sketch below.
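For instance, a minimal sketch of that post-hoc idea (my own illustration, not taken from any of the cited papers): take a trained layer’s weight matrix and replace it with a truncated SVD, so one dense layer becomes two thin ones with far fewer parameters.

    import numpy as np

    def low_rank_factorise(W, rank):
        """Approximate a trained weight matrix W (n_out x n_in) by two thin
        factors, so the dense layer y = W @ x becomes y = A @ (B @ x)."""
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        A = U[:, :rank] * s[:rank]     # n_out x rank
        B = Vt[:rank, :]               # rank  x n_in
        return A, B

    # Toy example: a 512 x 1024 layer truncated to rank 32.
    # (A random Gaussian W is the worst case here; the hope is that trained
    # layers have fast-decaying spectra, so the error is much smaller.)
    rng = np.random.default_rng(0)
    W = rng.standard_normal((512, 1024))
    A, B = low_rank_factorise(W, rank=32)
    x = rng.standard_normal(1024)
    print("parameters before:", W.size, "after:", A.size + B.size)
    print("relative error on one input:",
          np.linalg.norm(W @ x - A @ (B @ x)) / np.linalg.norm(W @ x))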

Song Han gives a presentation that probably touches on some of this.

Quantizing the weights to fewer bits (8 bits, 1 bit, …) is another popular approach.
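A toy illustration of the simplest variant (uniform affine quantisation, not any particular paper’s scheme): map each weight to an 8-bit code plus a shared scale and offset.

    import numpy as np

    def quantise_uniform(W, n_bits=8):
        """Round W onto a uniform grid of 2**n_bits levels spanning
        [W.min(), W.max()]; store uint8 codes plus a scale and offset.
        (uint8 storage assumes n_bits <= 8.)"""
        lo, hi = float(W.min()), float(W.max())
        scale = (hi - lo) / (2 ** n_bits - 1)
        codes = np.round((W - lo) / scale).astype(np.uint8)
        return codes, scale, lo

    def dequantise(codes, scale, lo):
        return codes.astype(np.float32) * scale + lo

    rng = np.random.default_rng(0)
    W = rng.standard_normal((256, 256)).astype(np.float32)
    codes, scale, lo = quantise_uniform(W)
    W_hat = dequantise(codes, scale, lo)
    print("max rounding error:", float(np.abs(W - W_hat).max()))
    print("bytes: float32", W.nbytes, "-> uint8 codes", codes.nbytes)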

How about, as presaged, matrix-sketching-type approaches? There is a suggestive link with compressive sensing and low-rank matrix factorisation. HaMD15 looks like this, and it combines several other approaches too (pruning, quantization and Huffman coding), but there must be more?
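The pruning stage of that pipeline is easy to sketch. A hedged illustration of magnitude pruning (only one ingredient of HaMD15’s Deep Compression, and not their code): zero out the smallest-magnitude weights and keep the survivors.

    import numpy as np

    def magnitude_prune(W, keep_fraction=0.1):
        """Zero out all but the largest-magnitude entries of W."""
        k = max(1, int(round(keep_fraction * W.size)))
        threshold = np.partition(np.abs(W), W.size - k, axis=None)[W.size - k]
        return np.where(np.abs(W) >= threshold, W, 0.0)

    rng = np.random.default_rng(0)
    W = rng.standard_normal((512, 512)).astype(np.float32)
    W_pruned = magnitude_prune(W, keep_fraction=0.1)
    print("weights kept:", int(np.count_nonzero(W_pruned)), "of", W.size)
    # Deep Compression then retrains the surviving weights, clusters them into
    # a small shared codebook (k-means quantisation) and Huffman-codes the
    # result; the pruned matrix itself would be stored in a sparse format.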

Here is an interesting attempt to examine the problem in reverse, and connect it to recurrent neural networks:

Hypernetworks (HaDL16):

Most modern neural network architectures are either a deep ConvNet, or a long RNN, or some combination of the two. These two architectures seem to be at opposite ends of a spectrum. Recurrent Networks can be viewed as a really deep feed forward network with the identical weights at each layer (this is called weight-tying). A deep ConvNet allows each layer to be different. But perhaps the two are related somehow. Every year, the winning ImageNet models get deeper and deeper. Think about a deep 110-layer, or even 1001-layer Residual Network architectures we keep hearing about. Do all 110 layers have to be unique? Are most layers even useful?

People have already thought of forcing a deep ConvNet to be like an RNN, i.e. with identical weights at every layer. However, if we force a deep ResNet to have its weight tied, the performance would be embarrassing. In our paper, we use HyperNetworks to explore a middle ground - to enforce a relaxed version of weight-tying. A HyperNetwork is just a small network that generates the weights of a much larger network, like the weights of a deep ResNet, effectively parameterizing the weights of each layer of the ResNet. We can use hypernetwork to explore the tradeoff between the model’s expressivity versus how much we tie the weights of a deep ConvNet. It is kind of like applying compression to an image, and being able to adjust how much compression we want to use, except here, the images are the weights of a deep ConvNet.

Interestingly, the authors are keen on applications to generative art with neural networks, and have posted a (somewhat abstruse) implementation.
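To make the weight-generation idea concrete, here is a minimal static-hypernetwork sketch (my own toy numpy version, not their implementation): a single linear generator maps a small per-layer embedding to that layer’s weight matrix, so the trainable parameters are the generator and the embeddings rather than every layer’s full weight tensor.

    import numpy as np

    rng = np.random.default_rng(0)
    n_layers, d, embed_dim = 40, 64, 8

    # The hypernetwork: one linear map, shared across layers, from a small
    # per-layer embedding to that layer's flattened d x d weight matrix.
    G = rng.standard_normal((d * d, embed_dim)) * 0.01
    layer_embeddings = rng.standard_normal((n_layers, embed_dim))

    def layer_weights(i):
        return (G @ layer_embeddings[i]).reshape(d, d)

    def forward(x):
        # A toy residual stack whose weights are all emitted by the generator.
        for i in range(n_layers):
            x = x + np.tanh(layer_weights(i) @ x)
        return x

    x = rng.standard_normal(d)
    print("output norm:", np.linalg.norm(forward(x)))
    print("directly parameterised weights:", n_layers * d * d)
    print("hypernetwork parameters:", G.size + layer_embeddings.size)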

They also mention a version by Schmidhuber’s omnipresent lab: Compressed Network Search.

Regularisation

Normally, regularisation penalties are not used to reduce the overall size of a neural network; in matrix terms, they do matrix sparsification but not matrix sketching.

See PaDG16 for one attempt to drop neurons:

DropNeuron aims to train a small model from a large, randomly initialized model, rather than to compress or reduce a large trained model. DropNeuron can be mixed with other regularization techniques, e.g. Dropout, L1, L2.
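A hedged sketch of how a penalty can remove whole neurons rather than individual weights (a group-lasso-style proximal step on each neuron’s column of outgoing weights, in the spirit of SCHU16 and PaDG16 but not their exact regulariser):

    import numpy as np

    def prox_group_lasso(W, tau):
        """Proximal step for tau * sum_j ||W[:, j]||_2, where column j holds
        the weights leaving input neuron j: shrink each column's norm by tau
        and set columns whose norm falls below tau exactly to zero, i.e.
        delete that neuron."""
        norms = np.linalg.norm(W, axis=0, keepdims=True)
        return W * np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)

    rng = np.random.default_rng(0)
    W = rng.standard_normal((32, 64)) * 0.05
    W[:, :8] *= 20.0                    # pretend only 8 neurons carry signal
    W_shrunk = prox_group_lasso(W, tau=0.5)
    dead = np.all(W_shrunk == 0.0, axis=0)
    print("neurons removed exactly:", int(dead.sum()), "of", W.shape[1])
    # A plain L1 penalty would instead zero scattered individual entries,
    # sparsifying the matrix without necessarily freeing any whole neuron.

In practice this proximal shrinkage would alternate with ordinary gradient steps on the data loss; columns whose norm stays below the threshold are zeroed exactly, which is what lets the architecture itself shrink.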

Refs

AgNR16
Aghasi, A., Nguyen, N., & Romberg, J. (2016) Net-Trim: A Layer-wise Convex Pruning of Deep Neural Networks. ArXiv:1611.05162 [Cs, Stat].
BoSc16
Borgerding, M., & Schniter, P. (2016) Onsager-Corrected Deep Networks for Sparse Linear Inverse Problems. ArXiv:1612.01183 [Cs, Math].
ChGS15
Chen, T., Goodfellow, I., & Shlens, J. (2015) Net2Net: Accelerating Learning via Knowledge Transfer. ArXiv:1511.05641 [Cs].
CWTW15a
Chen, W., Wilson, J. T., Tyree, S., Weinberger, K. Q., & Chen, Y. (2015a) Compressing Convolutional Neural Networks. ArXiv:1506.04449 [Cs].
CWTW15b
Chen, W., Wilson, J. T., Tyree, S., Weinberger, K. Q., & Chen, Y. (2015b) Compressing Neural Networks with the Hashing Trick. ArXiv:1504.04788 [Cs].
CWZZ17
Cheng, Y., Wang, D., Zhou, P., & Zhang, T. (2017) A Survey of Model Compression and Acceleration for Deep Neural Networks. ArXiv:1710.09282 [Cs].
CBMF16
Cutajar, K., Bonilla, E. V., Michiardi, P., & Filippone, M. (2016) Practical Learning of Deep Gaussian Processes via Random Fourier Features. ArXiv:1610.04386 [Stat].
CBMF17
Cutajar, K., Bonilla, E. V., Michiardi, P., & Filippone, M. (2017) Random Feature Expansions for Deep Gaussian Processes. In PMLR.
Dani17
Daniely, A. (2017) Depth Separation for Neural Networks. ArXiv:1702.08489 [Cs, Stat].
GRCL17
Garg, S., Rish, I., Cecchi, G., & Lozano, A. (2017) Neurogenesis-Inspired Dictionary Learning: Online Model Adaption in a Changing World. In arXiv:1701.06106 [cs, stat].
Ghos17
Ghosh, T. (2017) QuickNet: Maximizing Efficiency and Efficacy in Deep Architectures. ArXiv:1701.02291 [Cs, Stat].
GlLi16
Globerson, A., & Livni, R. (2016) Learning Infinite-Layer Networks: Beyond the Kernel Trick. ArXiv:1606.05316 [Cs].
GrRK00
Gray, S., Radford, A., & Kingma, D. P. (n.d.) GPU Kernels for Block-Sparse Weights.
HaDL16
Ha, D., Dai, A., & Le, Q. V.(2016) HyperNetworks. ArXiv:1609.09106 [Cs].
HaMD15
Han, S., Mao, H., & Dally, W. J.(2015) Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. ArXiv:1510.00149 [Cs].
HaRS15
Hardt, M., Recht, B., & Singer, Y. (2015) Train faster, generalize better: Stability of stochastic gradient descent. ArXiv:1509.01240 [Cs, Math, Stat].
HZCK17
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., … Adam, H. (2017) MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. ArXiv:1704.04861 [Cs].
IHMA16
Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., & Keutzer, K. (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. ArXiv:1602.07360 [Cs].
LGMR17
Lee, H., Ge, R., Ma, T., Risteski, A., & Arora, S. (2017) On the ability of neural nets to express distributions. In arXiv:1702.07028 [cs].
LCMB15
Lin, Z., Courbariaux, M., Memisevic, R., & Bengio, Y. (2015) Neural Networks with Few Multiplications. ArXiv:1510.03009 [Cs].
LoCV17
Lobacheva, E., Chirkova, N., & Vetrov, D. (2017) Bayesian Sparsification of Recurrent Neural Networks. In Workshop on Learning to Generate Natural Language.
LoWK17
Louizos, C., Welling, M., & Kingma, D. P.(2017) Learning Sparse Neural Networks through $L_0$ Regularization. ArXiv:1712.01312 [Cs, Stat].
MoAV17
Molchanov, D., Ashukha, A., & Vetrov, D. (2017) Variational Dropout Sparsifies Deep Neural Networks. In Proceedings of ICML.
NaUD17
Narang, S., Undersander, E., & Diamos, G. (2017) Block-Sparse Recurrent Neural Networks. ArXiv:1711.02782 [Cs, Stat].
PaDG16
Pan, W., Dong, H., & Guo, Y. (2016) DropNeuron: Simplifying the Structure of Deep Neural Networks. ArXiv:1606.07326 [Cs, Stat].
RBKC14
Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., & Bengio, Y. (2014) FitNets: Hints for Thin Deep Nets. ArXiv:1412.6550 [Cs].
SCHU16
Scardapane, S., Comminiello, D., Hussain, A., & Uncini, A. (2016) Group Sparse Regularization for Deep Neural Networks. ArXiv:1607.00485 [Cs, Stat].
ShFZ16
Shi, L., Feng, S., & Zhu, Z. (2016) Functional Hashing for Compressing Neural Networks. ArXiv:1605.06560 [Cs].
SrBa16
Srinivas, S., & Babu, R. V.(2016) Generalized Dropout. ArXiv:1611.06791 [Cs].
StGa15
Steeg, G. V., & Galstyan, A. (2015) The Information Sieve. ArXiv:1507.02284 [Cs, Math, Stat].
UlMW17
Ullrich, K., Meeds, E., & Welling, M. (2017) Soft Weight-Sharing for Neural Network Compression. ArXiv:1702.04008.
UGKA16
Urban, G., Geras, K. J., Kahou, S. E., Aslan, O., Wang, S., Caruana, R., … Richardson, M. (2016) Do Deep Convolutional Nets Really Need to be Deep (Or Even Convolutional)? ArXiv:1603.05691 [Cs, Stat].
WXYT16
Wang, Y., Xu, C., You, S., Tao, D., & Xu, C. (2016) CNNpack: Packing Convolutional Neural Networks in the Frequency Domain. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 253–261). Curran Associates, Inc.
WCLH16
Wang, Z., Chang, S., Ling, Q., Huang, S., Hu, X., Shi, H., & Huang, T. S. (2016) Stacked Approximated Regression Machine: A Simple Deep Learning Approach. Presented at NIPS.
Zhao17
Zhao, L. (2017) Fast Algorithms on Random Matrices and Structured Matrices.