
Compressing artificial neural networks

It seems we should be able to do better than a gigantic network with millions of parameters.

CWZZ17 is a popular summary article of the state-of-the-art theory, and LoWK17 of the practice, at their respective publication dates.

Question: how do you do this with recurrent neural networks? To read: NaUD17.

Put plainly: once we have trained the graph, how can we simplify it, compress it, or prune it? One model here is the “Student-Teacher” network, where you use one big network to train a little network. RBKC14, UGKA16… Summarised by Tomasz Malisiewicz:

we now have teacher-student training algorithms which you can use to have a shallower network “mimic” the teacher’s responses on a large dataset. These shallower networks are able to learn much better using a teacher and in fact, such shallow networks produce inferior results when they are trained on the teacher’s training set. So it seems you can go [Data to MegaDeep], and [MegaDeep to MiniDeep], but you cannot directly go from [Data to MiniDeep].

NB these networks are still bloody big, much bigger than we might hope.
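To make the teacher-student idea concrete, here is a minimal sketch of a distillation-style objective, assuming we already have teacher logits on a transfer set; the temperature and the loss weighting are illustrative assumptions, not the specific recipes of RBKC14 or UGKA16:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T gives softer targets."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Blend of (a) cross-entropy against the teacher's softened outputs and
    (b) ordinary cross-entropy against the hard labels."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student_T = np.log(softmax(student_logits, T) + 1e-12)
    soft_loss = -(p_teacher * log_p_student_T).sum(axis=-1).mean()

    log_p_student = np.log(softmax(student_logits) + 1e-12)
    hard_loss = -log_p_student[np.arange(len(labels)), labels].mean()

    return alpha * soft_loss + (1 - alpha) * hard_loss

# toy usage: 5 examples, 10 classes
rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(5, 10))
student_logits = rng.normal(size=(5, 10))
labels = rng.integers(0, 10, size=5)
print(distillation_loss(student_logits, teacher_logits, labels))
```

The student is trained by minimising this loss over the transfer set; the teacher only ever runs forward.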

This all seems intuitive, for the following hand-wavy reason: overparameterization is demonstrably important, and the network seems to need some “slack variables” for assimilating all the data it receives. However, when the network has reached a “good” optimum, some of those parameters are no longer needed; a much smaller representation of the manifold that each layer learned is probably available. But how much smaller?

This is suggestive of using some of the dimension reduction ideas such as mixture models, or whatever function approximation / matrix factorisation takes your fancy, to learn a “good” approximation of each layer, once the overall network is trained.
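One crude post-hoc version of this, for a single dense layer, is a truncated SVD of its weight matrix, replacing one m×n matrix with two thin factors; the rank below is a made-up tuning knob, not a prescription from any of the cited papers:

```python
import numpy as np

def low_rank_factorise(W, rank):
    """Replace a dense weight matrix W (m x n) with factors A (m x r), B (r x n).
    Storage drops from m*n to r*(m+n) parameters."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # m x r
    B = Vt[:rank, :]             # r x n
    return A, B

rng = np.random.default_rng(1)
W = rng.normal(size=(512, 256))          # stand-in for a trained layer
A, B = low_rank_factorise(W, rank=32)
print("params before:", W.size, "after:", A.size + B.size)
print("relative error:", np.linalg.norm(W - A @ B) / np.linalg.norm(W))
```

In practice one would fine-tune the factored network afterwards to recover lost accuracy.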

Song Han gives a presentation that probably touches on some of this.

Quantizing to fewer bits is another popular approach. (8 bits, 1 bit…)
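A hedged sketch of the simplest version, uniform 8-bit quantisation of a weight tensor; real schemes (and the 1-bit variants) are more careful about clipping ranges, per-channel scales, and retraining:

```python
import numpy as np

def quantise_uint8(W):
    """Map float weights onto 256 evenly spaced levels; keep scale/offset
    so the weights can be approximately reconstructed at inference time."""
    lo, hi = W.min(), W.max()
    scale = (hi - lo) / 255.0
    q = np.round((W - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantise(q, scale, lo):
    return q.astype(np.float32) * scale + lo

rng = np.random.default_rng(2)
W = rng.normal(scale=0.1, size=(256, 256)).astype(np.float32)
q, scale, lo = quantise_uint8(W)
W_hat = dequantise(q, scale, lo)
print("max abs error:", np.abs(W - W_hat).max())  # bounded by about scale / 2
```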

How about, as presaged, matrix-sketching-type approaches? There is a suggestive link with compressive sensing and low-rank matrix factorisation. HaMD15 looks like this, and it combines other approaches too, but there must be more?
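To give the flavour of the genre, here is a hashing-trick weight-sharing layer, where a large virtual weight matrix is backed by a small vector of shared parameters; this is an illustration of sketching-style compression under my own assumptions, not the specific construction of HaMD15:

```python
import numpy as np

def hashed_layer(x, buckets, shape, seed=0):
    """Weight-sharing via the hashing trick: the virtual m x n weight matrix
    is never stored; each entry is looked up in a small parameter vector."""
    m, n = shape
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(buckets), size=(m, n))  # fixed random "hash"
    sign = rng.choice([-1.0, 1.0], size=(m, n))       # sign hash reduces bias
    W_virtual = sign * buckets[idx]                   # materialised here only for clarity
    return x @ W_virtual.T

rng = np.random.default_rng(3)
buckets = rng.normal(size=1000)                 # ~1k real parameters ...
x = rng.normal(size=(4, 256))
y = hashed_layer(x, buckets, shape=(64, 256))   # ... standing in for a 64*256 matrix
print(y.shape)                                  # (4, 64)
```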

Here is an interesting attempt to examine a related problem in reverse, and connect it to recurrent neural networks:

Hypernetworks (HaDL16):

Most modern neural network architectures are either a deep ConvNet, or a long RNN, or some combination of the two. These two architectures seem to be at opposite ends of a spectrum. Recurrent Networks can be viewed as a really deep feed forward network with the identical weights at each layer (this is called weight-tying). A deep ConvNet allows each layer to be different. But perhaps the two are related somehow. Every year, the winning ImageNet models get deeper and deeper. Think about a deep 110-layer, or even 1001-layer Residual Network architectures we keep hearing about. Do all 110 layers have to be unique? Are most layers even useful?

People have already thought of forcing a deep ConvNet to be like an RNN, i.e. with identical weights at every layer. However, if we force a deep ResNet to have its weight tied, the performance would be embarrassing. In our paper, we use HyperNetworks to explore a middle ground - to enforce a relaxed version of weight-tying. A HyperNetwork is just a small network that generates the weights of a much larger network, like the weights of a deep ResNet, effectively parameterizing the weights of each layer of the ResNet. We can use a hypernetwork to explore the tradeoff between the model’s expressivity versus how much we tie the weights of a deep ConvNet. It is kind of like applying compression to an image, and being able to adjust how much compression we want to use, except here, the images are the weights of a deep ConvNet.
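A toy forward pass of that idea: a small generator emits the weight matrix of each layer of the main network from a per-layer embedding. The layer sizes and the purely linear generator below are arbitrary assumptions for illustration, not the architecture of HaDL16:

```python
import numpy as np

rng = np.random.default_rng(4)
d_embed, d_hidden, n_layers = 4, 64, 16

# the "hypernetwork": here just a single linear map from a small per-layer
# embedding to a full d_hidden x d_hidden weight matrix for that layer
G = rng.normal(scale=0.05, size=(d_embed, d_hidden * d_hidden))
layer_embeddings = rng.normal(size=(n_layers, d_embed))  # learned jointly in practice

def main_network(x):
    """Relaxed weight tying: every layer shares the generator G, and
    differs only through its small embedding vector."""
    h = x
    for z in layer_embeddings:
        W = (z @ G).reshape(d_hidden, d_hidden)  # weights are generated, not stored
        h = np.maximum(h @ W, 0.0)               # plain ReLU layer
    return h

x = rng.normal(size=(2, d_hidden))
print(main_network(x).shape)  # (2, 64)

# parameters to store: generator + embeddings, vs. storing every layer directly
print(G.size + layer_embeddings.size, "vs", n_layers * d_hidden * d_hidden)
```

Shrinking the embedding dimension tightens the weight tying; growing it moves back towards fully independent layers.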

Interestingly, the authors are keen on applications in generative art neural networks, and have posted a (somewhat abstruse) implementation.

They also mention a version by Schmidhuber’s omnipresent lab: Compressed Network Search

Regularisation

Normally regularisation penalties are not used to reduce the overall size of a neural network. In matrix terms, they seem to do matrix sparsification but not matrix sketching.

See PaDG16 for one attempt to drop neurons:

DropNeuron is aimed at training a small model from a large, randomly initialized model, rather than compressing or reducing a large trained model. DropNeuron can be used in combination with other regularization techniques, e.g. Dropout, L1, L2.
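The flavour of penalty involved is a group-sparsity one: penalise whole rows or columns of a weight matrix so that entire neurons are driven to zero and can then be removed. The sketch below is the generic group-lasso idea with a made-up pruning threshold, not PaDG16’s exact regulariser:

```python
import numpy as np

def group_lasso_penalty(W, axis=1):
    """Sum of L2 norms of weight groups; each group is one neuron's
    incoming (axis=1) or outgoing (axis=0) weights. Added to the training
    loss, it drives whole groups towards exactly zero."""
    return np.sqrt((W ** 2).sum(axis=axis)).sum()

def prune_dead_neurons(W, tol=1e-3):
    """After training with the penalty, delete rows whose norm has collapsed."""
    norms = np.sqrt((W ** 2).sum(axis=1))
    keep = norms > tol
    return W[keep], keep

rng = np.random.default_rng(5)
W = rng.normal(size=(128, 64))
W[::2] *= 1e-6                      # pretend the penalty zeroed half the neurons
print("penalty:", group_lasso_penalty(W))
W_small, keep = prune_dead_neurons(W)
print("neurons kept:", keep.sum(), "of", len(keep))
```

Unlike plain L1 on individual weights, this actually shrinks the layer dimensions, which is what reduces the overall size of the network.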

Refs