It seems we should be able to do better than a gigantic network with millions of parameters.

CWZZ17 is a popular summary article of the state-of-the-art theory, and LoWK17 of the practice.

Question: how do you do this with recurrent neural networks? To read: NaUD17.

Plainly, once we have trained the graph, how can we simplify it, compress it, or prune it? One model here is the “Student-Teacher” network, where you use one big network to train a little network (RBKC14, UGKA16…). Summarised by Tomasz Malisiewicz:

we now have teacher-student training algorithms which you can use to have a shallower network “mimic” the teacher’s responses on a large dataset. These shallower networks are able to learn much better using a teacher and in fact, such shallow networks produce inferior results when they are trained on the teacher’s training set. So it seems you can go [Data to MegaDeep], and [MegaDeep to MiniDeep], but you cannot directly go from [Data to MiniDeep].
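A minimal sketch of the “mimic” idea, assuming the student is trained to match the teacher's temperature-softened class probabilities. The function names, the temperature, and the logits are hypothetical illustrations; RBKC14 and UGKA16 each define their own training objectives, which this does not reproduce.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mimic_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's softened outputs and the student's."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))

teacher = [4.0, 1.0, -2.0]        # hypothetical teacher logits for one input
good_student = [3.5, 1.2, -1.8]   # roughly agrees with the teacher
bad_student = [-2.0, 1.0, 4.0]    # ranks the classes in the opposite order

loss_good = mimic_loss(good_student, teacher)
loss_bad = mimic_loss(bad_student, teacher)
```

The small network is fitted by minimising this loss over a large dataset labelled by the teacher, so it sees the teacher's full output distribution rather than only the hard labels.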

I haven’t actually read the papers here yet, but the teacher-student result seems intuitive, for the following hand-wavy reason. Large neural networks stay far from an optimum for a long time, so they assimilate a lot of data during stochastic gradient descent; they use a simulated-annealing-like process to explore many possible configurations, much as a Monte Carlo method can explore a large parameter space. They are still seeking optima, though, so in a sense we need too many parameters to give them enough “maladaptation” to escape bad optima. Once the network has reached a “good” optimum, those extra parameters are no longer needed; a much lower-dimensional representation of the manifold that each layer learned is probably available.

This suggests using dimension reduction ideas, such as mixture models or whatever function approximation / matrix factorisation takes your fancy, to learn a “good” approximation of each layer once the overall network is trained.

Song Han gives a presentation that probably touches on some of this.
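As one hedged illustration of approximating a trained layer by matrix factorisation, here is a truncated SVD of a weight matrix. The layer sizes, the synthetic nearly-low-rank weights, and the helper name `low_rank_factors` are assumptions made for this sketch; a real trained layer need not be this compressible.

```python
import numpy as np

def low_rank_factors(W, rank):
    """Return A (m x rank) and B (rank x n) such that A @ B approximates W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # scale the kept left singular vectors
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
# Hypothetical trained layer: nearly rank 8, plus a little noise.
W = rng.standard_normal((256, 8)) @ rng.standard_normal((8, 128))
W += 0.01 * rng.standard_normal(W.shape)

A, B = low_rank_factors(W, rank=8)
rel_error = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
params_before = W.size            # 256 * 128 = 32768
params_after = A.size + B.size    # 256 * 8 + 8 * 128 = 3072
```

Replacing the layer `x -> W @ x` with `x -> A @ (B @ x)` keeps the input and output sizes but stores the two thin factors instead of the full matrix.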

Quantizing to fewer bits is another popular approach. (8 bits, 1 bit…)
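A minimal sketch of uniform quantisation, assuming the simplest scheme: map the range of the weights onto evenly spaced integer codes and store only the codes plus a scale and an offset. The weight values and function names are hypothetical, and practical schemes (trained codebooks, per-channel scales) are more elaborate.

```python
def quantize(weights, num_bits=8):
    """Map floats onto evenly spaced integer codes in [0, 2**num_bits - 1]."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Recover approximate float weights from the integer codes."""
    return [lo + c * scale for c in codes]

weights = [-0.82, -0.11, 0.0, 0.27, 0.93]   # hypothetical layer weights
codes, scale, lo = quantize(weights, num_bits=8)
restored = dequantize(codes, scale, lo)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The rounding error per weight is at most half of `scale`. Setting `num_bits=1` leaves only two codes, so every weight collapses to either the minimum or the maximum value.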

How about, as presaged, matrix-sketching type approaches? There is a suggestive link with compressive sensing and low-rank matrix factorisation. HaMD15 looks like this, and it combines other approaches too, but there must be more?

Here is an interesting attempt to examine the problem in reverse, and connect it to recurrent neural networks:

Most modern neural network architectures are either a deep ConvNet, or a long RNN, or some combination of the two. These two architectures seem to be at opposite ends of a spectrum. Recurrent Networks can be viewed as a really deep feed forward network with the identical weights at each layer (this is called weight-tying). A deep ConvNet allows each layer to be different. But perhaps the two are related somehow. Every year, the winning ImageNet models get deeper and deeper. Think about a deep 110-layer, or even 1001-layer Residual Network architectures we keep hearing about. Do all 110 layers have to be unique? Are most layers even useful?

People have already thought of forcing a deep ConvNet to be like an RNN, i.e. with identical weights at every layer. However, if we force a deep ResNet to have its weight tied, the performance would be embarrassing. In our paper, we use HyperNetworks to explore a middle ground - to enforce a relaxed version of weight-tying. A HyperNetwork is just a small network that generates the weights of a much larger network, like the weights of a deep ResNet, effectively parameterizing the weights of each layer of the ResNet. We can use hypernetwork to explore the tradeoff between the model’s expressivity versus how much we tie the weights of a deep ConvNet. It is kind of like applying compression to an image, and being able to adjust how much compression we want to use, except here, the images are the weights of a deep ConvNet.
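A toy sketch of the relaxed weight-tying described in the quote, assuming the simplest possible generator: one small embedding per layer and a single shared linear map that turns an embedding into that layer's weight matrix. The sizes, names, and residual-style forward pass are hypothetical choices for illustration and are not the architecture of HaDL16.

```python
import numpy as np

rng = np.random.default_rng(0)
num_layers, width, embed_dim = 12, 32, 4   # hypothetical sizes

# Learnable parameters: one small embedding per layer plus one shared generator.
layer_embeddings = rng.standard_normal((num_layers, embed_dim))
generator = rng.standard_normal((embed_dim, width * width)) / np.sqrt(embed_dim * width)

def layer_weights(j):
    """Generate the weight matrix of layer j from its embedding."""
    return (layer_embeddings[j] @ generator).reshape(width, width)

def forward(x):
    """Run a residual-style stack whose weights all come from the generator."""
    for j in range(num_layers):
        x = x + np.tanh(layer_weights(j) @ x)
    return x

y = forward(rng.standard_normal(width))

params_hyper = layer_embeddings.size + generator.size   # 12*4 + 4*1024 = 4144
params_untied = num_layers * width * width               # 12*1024 = 12288
params_tied = width * width                              # 1024
```

In this sketch `embed_dim` plays the role of the compression dial: with `embed_dim = 1` every layer is a scalar multiple of one shared matrix, which is close to strict weight-tying, while `embed_dim = num_layers` is enough to let every layer be independent.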

Interestingly, the authors are keen on applications in generative art neural networks, and have posted a (somewhat abstruse) implementation.

They also mention a version by Schmidhuber’s omnipresent lab: Compressed Network Search.

## Regularisation

Normally regularisation penalties are not used to reduce the overall size of a neural network.
In matrix terms, they seem to do matrix *sparsification*
but not matrix *sketching*.

See PaDG16 for one attempt to drop neurons:

DropNeuron is aimed to train a small model from a large random initialized model, rather than compress or reduce a large trained model. DropNeuron can be mixed used with other regularization techniques, e.g. Dropout, L1, L2.
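To make the contrast between zeroing individual weights and removing whole neurons concrete, here is a sketch of row-wise group soft-thresholding, the standard proximal step for a group-lasso penalty. It is an assumption chosen for illustration, not the regulariser defined in PaDG16, and the weight matrix and threshold are made up.

```python
import math

def group_soft_threshold(W, lam):
    """Row-wise group-lasso proximal step: shrinks each neuron's weights as a block."""
    out = []
    for row in W:
        norm = math.sqrt(sum(w * w for w in row))
        factor = max(1.0 - lam / norm, 0.0) if norm > 0 else 0.0
        out.append([factor * w for w in row])
    return out

def drop_dead_rows(W):
    """Remove neurons whose incoming weights are all zero."""
    return [row for row in W if any(w != 0.0 for w in row)]

# Hypothetical 4-neuron layer, one row of incoming weights per neuron.
W = [[0.9, -0.05, 0.4],
     [0.02, 0.03, -0.01],
     [-0.6, 0.7, 0.04],
     [0.01, -0.02, 0.02]]

pruned = drop_dead_rows(group_soft_threshold(W, 0.1))
```

Here the two neurons with small weight norms vanish entirely, so the layer shrinks from four rows to two, and the matching columns of the next layer's weight matrix could be deleted as well. An entry-wise penalty would instead leave scattered zeros inside a matrix of the original shape.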

## Refs

- AgNR16: Aghasi, A., Nguyen, N., & Romberg, J. (2016) Net-Trim: A Layer-wise Convex Pruning of Deep Neural Networks. *ArXiv:1611.05162 [Cs, Stat]*.
- BoSc16: Borgerding, M., & Schniter, P. (2016) Onsager-Corrected Deep Networks for Sparse Linear Inverse Problems. *ArXiv:1612.01183 [Cs, Math]*.
- ChGS15: Chen, T., Goodfellow, I., & Shlens, J. (2015) Net2Net: Accelerating Learning via Knowledge Transfer. *ArXiv:1511.05641 [Cs]*.
- CWTW15a: Chen, W., Wilson, J. T., Tyree, S., Weinberger, K. Q., & Chen, Y. (2015a) Compressing Convolutional Neural Networks. *ArXiv:1506.04449 [Cs]*.
- CWTW15b: Chen, W., Wilson, J. T., Tyree, S., Weinberger, K. Q., & Chen, Y. (2015b) Compressing Neural Networks with the Hashing Trick. *ArXiv:1504.04788 [Cs]*.
- CWZZ17: Cheng, Y., Wang, D., Zhou, P., & Zhang, T. (2017) A Survey of Model Compression and Acceleration for Deep Neural Networks. *ArXiv:1710.09282 [Cs]*.
- CBMF16: Cutajar, K., Bonilla, E. V., Michiardi, P., & Filippone, M. (2016) Practical Learning of Deep Gaussian Processes via Random Fourier Features. *ArXiv:1610.04386 [Stat]*.
- CBMF17: Cutajar, K., Bonilla, E. V., Michiardi, P., & Filippone, M. (2017) Random Feature Expansions for Deep Gaussian Processes. In PMLR.
- Dani17: Daniely, A. (2017) Depth Separation for Neural Networks. *ArXiv:1702.08489 [Cs, Stat]*.
- GRCL17: Garg, S., Rish, I., Cecchi, G., & Lozano, A. (2017) Neurogenesis-Inspired Dictionary Learning: Online Model Adaption in a Changing World. *ArXiv:1701.06106 [Cs, Stat]*.
- Ghos17: Ghosh, T. (2017) QuickNet: Maximizing Efficiency and Efficacy in Deep Architectures. *ArXiv:1701.02291 [Cs, Stat]*.
- GlLi16: Globerson, A., & Livni, R. (2016) Learning Infinite-Layer Networks: Beyond the Kernel Trick. *ArXiv:1606.05316 [Cs]*.
- GrRK00: Gray, S., Radford, A., & Kingma, D. P. (n.d.) GPU Kernels for Block-Sparse Weights, 12.
- HaDL16: Ha, D., Dai, A., & Le, Q. V. (2016) HyperNetworks. *ArXiv:1609.09106 [Cs]*.
- HaMD15: Han, S., Mao, H., & Dally, W. J. (2015) Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. *ArXiv:1510.00149 [Cs]*.
- HaRS15: Hardt, M., Recht, B., & Singer, Y. (2015) Train faster, generalize better: Stability of stochastic gradient descent. *ArXiv:1509.01240 [Cs, Math, Stat]*.
- HZCK17: Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., … Adam, H. (2017) MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. *ArXiv:1704.04861 [Cs]*.
- IHMA16: Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., & Keutzer, K. (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. *ArXiv:1602.07360 [Cs]*.
- LGMR17: Lee, H., Ge, R., Ma, T., Risteski, A., & Arora, S. (2017) On the ability of neural nets to express distributions. *ArXiv:1702.07028 [Cs]*.
- LCMB15: Lin, Z., Courbariaux, M., Memisevic, R., & Bengio, Y. (2015) Neural Networks with Few Multiplications. *ArXiv:1510.03009 [Cs]*.
- LoCV17: Lobacheva, E., Chirkova, N., & Vetrov, D. (2017) Bayesian Sparsification of Recurrent Neural Networks. In Workshop on Learning to Generate Natural Language.
- LoWK17: Louizos, C., Welling, M., & Kingma, D. P. (2017) Learning Sparse Neural Networks through $L_0$ Regularization. *ArXiv:1712.01312 [Cs, Stat]*.
- MoAV17: Molchanov, D., Ashukha, A., & Vetrov, D. (2017) Variational Dropout Sparsifies Deep Neural Networks. In Proceedings of ICML.
- NaUD17: Narang, S., Undersander, E., & Diamos, G. (2017) Block-Sparse Recurrent Neural Networks. *ArXiv:1711.02782 [Cs, Stat]*.
- PaDG16: Pan, W., Dong, H., & Guo, Y. (2016) DropNeuron: Simplifying the Structure of Deep Neural Networks. *ArXiv:1606.07326 [Cs, Stat]*.
- RBKC14: Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., & Bengio, Y. (2014) FitNets: Hints for Thin Deep Nets. *ArXiv:1412.6550 [Cs]*.
- SCHU16: Scardapane, S., Comminiello, D., Hussain, A., & Uncini, A. (2016) Group Sparse Regularization for Deep Neural Networks. *ArXiv:1607.00485 [Cs, Stat]*.
- ShFZ16: Shi, L., Feng, S., & Zhu, Z. (2016) Functional Hashing for Compressing Neural Networks. *ArXiv:1605.06560 [Cs]*.
- SrBa16: Srinivas, S., & Babu, R. V. (2016) Generalized Dropout. *ArXiv:1611.06791 [Cs]*.
- StGa15: Steeg, G. V., & Galstyan, A. (2015) The Information Sieve. *ArXiv:1507.02284 [Cs, Math, Stat]*.
- UlMW17: Ullrich, K., Meeds, E., & Welling, M. (2017) Soft Weight-Sharing for Neural Network Compression. *ArXiv Preprint ArXiv:1702.04008*.
- UGKA16: Urban, G., Geras, K. J., Kahou, S. E., Aslan, O., Wang, S., Caruana, R., … Richardson, M. (2016) Do Deep Convolutional Nets Really Need to be Deep (Or Even Convolutional)? *ArXiv:1603.05691 [Cs, Stat]*.
- WXYT16: Wang, Y., Xu, C., You, S., Tao, D., & Xu, C. (2016) CNNpack: Packing Convolutional Neural Networks in the Frequency Domain. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 253–261). Curran Associates, Inc.
- WCLH16: Wang, Z., Chang, S., Ling, Q., Huang, S., Hu, X., Shi, H., & Huang, T. S. (2016) Stacked Approximated Regression Machine: A Simple Deep Learning Approach. Presented at the NIPS.
- Zhao17: Zhao, L. (2017) Fast Algorithms on Random Matrices and Structured Matrices.