The Living Thing / Notebooks : Model selection and learning theory for densities

Model selection / learning theory meets density estimation, especially for mixture densities.

There are various convergence results, depending on your assumptions about the mixing distribution, the component distributions, the true distribution, the alignment of the stars, and so on.

Everyone either assumes fixed component scales or requires a free regularisation parameter. In particular, I would like sample complexity results that let me bound the number of components before seeing the data. AFAICT no-one can give me a good, general model selection procedure that does not, in practice, come down to cross-validation.

Please, prove me wrong.
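For concreteness, the cross-validation fallback I keep landing on looks something like the following: pick the number of mixture components by held-out log-likelihood. This is a minimal sketch, with toy data and a candidate range that are my own illustrative assumptions, not anything from the papers below.

```python
# Minimal sketch: choose the number of Gaussian mixture components by
# held-out (validation-set) log-likelihood, i.e. plain cross-validation.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# toy data: a well-separated two-component mixture (illustrative assumption)
X = np.concatenate([rng.normal(0.0, 1.0, 400),
                    rng.normal(5.0, 1.0, 400)]).reshape(-1, 1)
X_train, X_val = train_test_split(X, test_size=0.3, random_state=0)

scores = {k: GaussianMixture(n_components=k, random_state=0)
                .fit(X_train)
                .score(X_val)  # mean held-out log-likelihood per sample
          for k in range(1, 6)}
k_hat = max(scores, key=scores.get)  # component count with best held-out fit
```

Note the objection stands: this tells you nothing before seeing the data, and spends a chunk of the sample on validation.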

TODO: explain why this is fiddly, choice of loss function etc.

Large sample results for mixtures

I would prefer not to rely on asymptotic distributions, but maybe the AIC asymptotics would be smooth enough for me to just deal with it.

McRa14 claims the AIC doesn’t work because ‘certain regularity assumptions’ necessary for it are not satisfied (which ones?), but that I may try the BIC, as per BaCo91 or BHLL08. Contrariwise, Mass07 gives explicit cases where the AIC does work in penalised density estimation, so… yeah. In any case, it’s certainly true that deriving and calculating the GIC is tedious.
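In practice the BIC recipe for mixtures is mechanical, whatever its theoretical status here. A hedged sketch, computing both BIC and AIC over a candidate range on toy data of my own invention (the data and range are assumptions, not from McRa14 or BaCo91):

```python
# Sketch: BIC (and AIC for comparison) over candidate component counts
# for a Gaussian mixture; both criteria are minimised.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# toy data: two clearly separated components (illustrative assumption)
X = np.concatenate([rng.normal(-2.0, 0.5, 300),
                    rng.normal(3.0, 1.0, 200)]).reshape(-1, 1)

fits = [GaussianMixture(n_components=k, random_state=0).fit(X)
        for k in range(1, 6)]
bic = [m.bic(X) for m in fits]  # lower is better
aic = [m.aic(X) for m in fits]
k_bic = 1 + int(np.argmin(bic))
k_aic = 1 + int(np.argmin(aic))
```

The catch McRa14 worries about is exactly what this hides: the likelihood-ratio regularity conditions behind the criteria’s penalty terms are not obviously satisfied at mixture-order boundaries.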

GhVa01 give MLE risk bounds under various assumptions, which I could maybe use independently.

Finite sample results for mixtures

In a penalised least-squares framework, see BePR11, Geer08, BuTW07a. This generally seems much more reasonable to me, and I’m reading up on it. However, I suspect basis selection will turn out to be nasty. Maugis and Michel (MaMi11) have some results based on Massart’s lecture notes (Mass07), which, by the way, are a really accessible introduction.

Massart (Mass07) explains some of the complications here and gives example applications of both least-squares and ML model selection procedures for general density model selection. However, his results are restricted to mixtures of orthonormal bases, which, realistically, means histograms with bins of disjoint support if you want your density to be non-negative. I wonder how much work it would take to handle overlapping support.
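The histogram case at least is concrete: choose the number of regular bins by penalised maximum likelihood, in the spirit of Birgé and Rozenholc (BiRo06). A sketch, assuming their proposed penalty pen(D) = D − 1 + (log D)^2.5 (the exponent is their specific recommendation; treat the details as my reading of the paper, not gospel):

```python
# Sketch: penalised-ML choice of bin count for a regular histogram,
# in the spirit of Birgé & Rozenholc (2006).
import numpy as np

def penalised_histogram_bins(x, d_max=40):
    """Return the bin count D maximising log-likelihood minus penalty."""
    # rescale data to [0, 1] so "regular histogram with D bins" makes sense
    x = (x - x.min()) / (x.max() - x.min() + 1e-12)
    n = len(x)
    best_d, best_crit = 1, -np.inf
    for d in range(1, d_max + 1):
        counts, _ = np.histogram(x, bins=d, range=(0.0, 1.0))
        nz = counts[counts > 0]
        # log-likelihood of the histogram MLE: sum_j N_j log(D * N_j / n)
        loglik = float(np.sum(nz * np.log(d * nz / n)))
        crit = loglik - (d - 1 + np.log(d) ** 2.5)  # assumed BiRo06 penalty
        if crit > best_crit:
            best_d, best_crit = d, crit
    return best_d

rng = np.random.default_rng(1)
d_hat = penalised_histogram_bins(rng.normal(size=1000))
```

This is exactly the disjoint-support setting; the open question above is what replaces the clean orthonormality argument once the basis functions overlap.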


Barron, A., Birgé, L., & Massart, P. (1999) Risk bounds for model selection via penalization. Probability Theory and Related Fields, 113(3), 301–413.
Barron, A. R. (1994) Approximation and Estimation Bounds for Artificial Neural Networks. Mach. Learn., 14(1), 115–133. DOI.
Barron, A. R., & Cover, T. M. (1991) Minimum complexity density estimation. IEEE Transactions on Information Theory, 37(4), 1034–1054. DOI.
Barron, A. R., Huang, C., Li, J. Q., & Luo, X. (2008) MDL, penalized likelihood, and statistical risk. In Information Theory Workshop, 2008 (ITW ’08) (pp. 247–257). IEEE. DOI.
Barron, A., Rissanen, J., & Yu, B. (1998) The minimum description length principle in coding and modeling. IEEE Transactions on Information Theory, 44(6), 2743–2760. DOI.
Bertin, K., Pennec, E. L., & Rivoirard, V. (2011) Adaptive Dantzig density estimation. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 47(1), 43–74. DOI.
Birgé, L., & Rozenholc, Y. (2006) How many bins should be put in a regular histogram. ESAIM: Probability and Statistics, 10, 24–45. DOI.
Bishop, C. (1991) Improving the Generalization Properties of Radial Basis Function Neural Networks. Neural Computation, 3(4), 579–588. DOI.
Boyd, N., Hastie, T., Boyd, S., Recht, B., & Jordan, M. (2016) Saturating Splines and Feature Selection. arXiv:1609.06764 [Stat].
Bunea, F., Tsybakov, A. B., & Wegkamp, M. H. (2007a) Sparse Density Estimation with ℓ1 Penalties. In N. H. Bshouty & C. Gentile (Eds.), Learning Theory (pp. 530–543). Springer Berlin Heidelberg. DOI.
Bunea, F., Tsybakov, A., & Wegkamp, M. (2007b) Sparsity oracle inequalities for the Lasso. Electronic Journal of Statistics, 1, 169–194. DOI.
Genovese, C. R., & Wasserman, L. (2000) Rates of convergence for the Gaussian mixture sieve. Annals of Statistics, 1105–1127.
Ghosal, S., & van der Vaart, A. W. (2001) Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. The Annals of Statistics, 29(5), 1233–1263. DOI.
Huang, C., Cheang, G. L. H., & Barron, A. R. (2008) Risk of penalized least squares, greedy selection and l1 penalization for flexible function libraries.
Massart, P. (2000) Some applications of concentration inequalities to statistics. In Annales de la Faculté des sciences de Toulouse: Mathématiques (Vol. 9, pp. 245–303).
Massart, P. (2007) Concentration inequalities and model selection: Ecole d’Eté de Probabilités de Saint-Flour XXXIII - 2003. Berlin; New York: Springer-Verlag.
Maugis, C., & Michel, B. (2011) A non asymptotic penalized criterion for Gaussian mixture model selection. ESAIM: Probability and Statistics, 15, 41–68. DOI.
McLachlan, G. J., & Rathnayake, S. (2014) On the number of components in a Gaussian mixture model. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(5), 341–355. DOI.
Orr, M. J. (1996) Introduction to radial basis function networks. Technical Report, Center for Cognitive Science, University of Edinburgh.
Que, Q., & Belkin, M. (2016) Back to the future: Radial Basis Function networks revisited. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) 2016.
Rakhlin, A., Panchenko, D., & Mukherjee, S. (2005) Risk bounds for mixture density estimation. ESAIM: Probability and Statistics, 9, 220–229. DOI.
Reynaud-Bouret, P. (2003) Adaptive estimation of the intensity of inhomogeneous Poisson processes via concentration inequalities. Probability Theory and Related Fields, 126(1). DOI.
Reynaud-Bouret, P., & Schbath, S. (2010) Adaptive estimation for Hawkes processes; application to genome analysis. The Annals of Statistics, 38(5), 2781–2822. DOI.
Shimazaki, H., & Shinomoto, S. (2010) Kernel bandwidth optimization in spike rate estimation. Journal of Computational Neuroscience, 29(1–2), 171–182. DOI.
van de Geer, S. (1996) Rates of convergence for the maximum likelihood estimator in mixture models. Journal of Nonparametric Statistics, 6(4), 293–310. DOI.
van de Geer, S. (1997) Asymptotic normality in mixture models. ESAIM: Probability and Statistics, 1, 17–33.
van de Geer, S. (2003) Asymptotic theory for maximum likelihood in nonparametric mixture models. Computational Statistics & Data Analysis, 41(3–4), 453–464. DOI.
van de Geer, S. A. (2008) High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36(2), 614–645. DOI.
Zeevi, A. J., Meir, R., & Maiorov, V. (1998) Error bounds for functional approximation and estimation using mixtures of experts. IEEE Transactions on Information Theory, 44(3), 1010–1025. DOI.