The Living Thing / Notebooks : Model selection and learning theory for densities

Model selection/learning theory meets density estimation, especially for mixture densities.

There are various convergence results, depending on your assumptions about the mixing distribution, the component distributions, the true distribution, the stars, and so on.

Everyone assumes fixed component scales, or requires a free regularisation parameter. In particular, I would like sample-complexity results that allow me to bound the number of components before seeing the data. AFAICT no one can give me a good, general model selection procedure that does not, in practice, boil down to cross-validation.

Please, prove me wrong.
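In the meantime, to be concrete about what “cross-validation in practice” means here: fit over a grid of component counts, score held-out log-likelihood, keep the winner. A minimal sketch using scikit-learn’s GaussianMixture on toy data (the data, grid, and fold count are placeholders, not recommendations):

```python
# Choose the number of mixture components by held-out log-likelihood,
# i.e. plain K-fold cross-validation.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
# Toy data: two well-separated Gaussian clumps.
X = np.concatenate([rng.normal(-2.0, 1.0, 500),
                    rng.normal(3.0, 0.5, 500)]).reshape(-1, 1)

def cv_loglik(X, k, n_splits=5):
    """Mean held-out log-likelihood per point for a k-component mixture."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = []
    for train, test in kf.split(X):
        gmm = GaussianMixture(n_components=k, random_state=0).fit(X[train])
        scores.append(gmm.score(X[test]))  # average held-out log-likelihood
    return np.mean(scores)

best_k = max(range(1, 8), key=lambda k: cv_loglik(X, k))
print(best_k)  # hopefully 2
```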

TODO: explain why this is fiddly; the choice of loss function, etc.

Large-sample results for mixtures

I would prefer not to rely on asymptotic distributions, but maybe the AIC would be smooth enough for me to just deal with it.

McRa14 claims the AIC doesn’t work for mixtures because ‘certain regularity assumptions’ necessary for it are not satisfied (which ones?), but that I may try the BIC, as per BaCo91 or BHLL08. Contrariwise, Mass07 gives explicit cases where the AIC does work in penalised density estimation, so… yeah. In any case, it’s certainly true that deriving and calculating the GIC is tedious.
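The mechanics are at least cheap to try regardless. A sketch of criterion-based selection using scikit-learn, whose GaussianMixture exposes aic and bic methods (with the same caveat as above about whether the criteria are even valid for mixtures):

```python
# AIC/BIC selection of the number of Gaussian mixture components.
# Both criteria are -2 log-likelihood plus a complexity penalty;
# BIC's penalty grows with log(n), AIC's does not.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2.0, 1.0, 500),
                    rng.normal(3.0, 0.5, 500)]).reshape(-1, 1)

fits = [GaussianMixture(n_components=k, random_state=0).fit(X)
        for k in range(1, 8)]
print(min(fits, key=lambda g: g.aic(X)).n_components,  # AIC's pick
      min(fits, key=lambda g: g.bic(X)).n_components)  # BIC's pick
```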

GhVa01 give risk bounds for the MLE under various assumptions, which I could maybe use independently.

Finite-sample results for mixtures

In a penalised least-squares framework, see BePR11, Geer08, and BuTW07a. This generally seems much more reasonable to me, and I’m reading up on it; the basic contrast is sketched below. However, I suspect basis selection is going to turn out to be nasty. Maugis and Michel (MaMi11) have some results building on Massart’s lecture notes (Mass07), which, by the way, are really accessible as an introduction.
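The trick that makes the least-squares framework go, as I understand it from BuTW07a: for a candidate density $f_\lambda = \sum_j \lambda_j \phi_j$, the $L_2$ risk expands as

$$\|f_\lambda - p\|_2^2 = \|f_\lambda\|_2^2 - 2\int f_\lambda p + \|p\|_2^2,$$

and the only term involving the unknown density $p$ is $\int f_\lambda p = \mathbb{E}\, f_\lambda(X)$, which the sample estimates directly. Dropping the constant $\|p\|_2^2$, one minimises the penalised empirical contrast

$$\hat\gamma_n(\lambda) = \Big\|\sum_j \lambda_j \phi_j\Big\|_2^2 - \frac{2}{n}\sum_{i=1}^n \sum_j \lambda_j \phi_j(X_i) + \operatorname{pen}(\lambda),$$

where taking $\operatorname{pen}(\lambda)$ proportional to $\|\lambda\|_1$ gives the sparse estimators of BuTW07a; BePR11 instead constrain the empirical correlations, Dantzig-selector style.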

Massart (Mass07) explains some of the complications here and gives example applications of both least-squares and ML model selection procedures for general density model selection. However, his results are restricted to mixtures of orthonormal bases, which, realistically, means histograms with bins of disjoint support if you want your density to be non-negative. I wonder how much work it would be to get overlapping support.
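For the histogram special case the story does close without cross-validation: BiRo06 (in the refs below) choose the number of bins by penalised likelihood. A sketch of my reading of their procedure for data on [0, 1], with the penalty pen(D) = D − 1 + (log D)^2.5 taken on trust from the paper:

```python
# Penalised-likelihood choice of bin count for a regular histogram
# on [0, 1], after Birgé & Rozenholc (BiRo06).
import numpy as np

def birge_rozenholc_bins(x, d_max=None):
    """Return the bin count D maximising log-likelihood minus pen(D)."""
    n = len(x)
    if d_max is None:
        d_max = max(2, int(n / np.log(n)))  # heuristic upper bound on D
    best_d, best_crit = 1, -np.inf
    for d in range(1, d_max + 1):
        counts, _ = np.histogram(x, bins=d, range=(0.0, 1.0))
        nz = counts[counts > 0]
        # Log-likelihood of the histogram density p(t) = d * N_j / n on bin j.
        loglik = np.sum(nz * np.log(d * nz / n))
        crit = loglik - (d - 1 + np.log(d) ** 2.5)
        if crit > best_crit:
            best_d, best_crit = d, crit
    return best_d

rng = np.random.default_rng(2)
x = rng.beta(2.0, 5.0, size=1000)  # toy data supported on [0, 1]
print(birge_rozenholc_bins(x))
```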

Refs

BaBM99
Barron, A., Birgé, L., & Massart, P. (1999) Risk bounds for model selection via penalization. Probability Theory and Related Fields, 113(3), 301–413.
Barr94
Barron, A. R. (1994) Approximation and Estimation Bounds for Artificial Neural Networks. Machine Learning, 14(1), 115–133. DOI.
BaCo91
Barron, A. R., & Cover, T. M. (1991) Minimum complexity density estimation. IEEE Transactions on Information Theory, 37(4), 1034–1054. DOI.
BHLL08
Barron, A. R., Huang, C., Li, J. Q., & Luo, X. (2008) MDL, penalized likelihood, and statistical risk. In Information Theory Workshop, 2008. ITW’08 (pp. 247–257). IEEE. DOI.
BaRY98
Barron, A., Rissanen, J., & Yu, B. (1998) The minimum description length principle in coding and modeling. IEEE Transactions on Information Theory, 44(6), 2743–2760. DOI.
BePR11
Bertin, K., Pennec, E. L., & Rivoirard, V. (2011) Adaptive Dantzig density estimation. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 47(1), 43–74. DOI.
BiRo06
Birgé, L., & Rozenholc, Y. (2006) How many bins should be put in a regular histogram. ESAIM: Probability and Statistics, 10, 24–45. DOI.
Bish91
Bishop, C. (1991) Improving the Generalization Properties of Radial Basis Function Neural Networks. Neural Computation, 3(4), 579–588. DOI.
BHBR16
Boyd, N., Hastie, T., Boyd, S., Recht, B., & Jordan, M. (2016) Saturating Splines and Feature Selection. arXiv:1609.06764 [Stat].
BuTW07a
Bunea, F., Tsybakov, A. B., & Wegkamp, M. H. (2007a) Sparse Density Estimation with ℓ1 Penalties. In N. H. Bshouty & C. Gentile (Eds.), Learning Theory (pp. 530–543). Springer Berlin Heidelberg. DOI.
BuTW07b
Bunea, F., Tsybakov, A., & Wegkamp, M. (2007b) Sparsity oracle inequalities for the Lasso. Electronic Journal of Statistics, 1, 169–194. DOI.
GeWa00
Genovese, C. R., & Wasserman, L. (2000) Rates of convergence for the Gaussian mixture sieve. Annals of Statistics, 1105–1127.
GhVa01
Ghosal, S., & van der Vaart, A. W. (2001) Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. The Annals of Statistics, 29(5), 1233–1263. DOI.
HuCB08
Huang, C., Cheang, G. L. H., & Barron, A. R. (2008) Risk of penalized least squares, greedy selection and ℓ1 penalization for flexible function libraries.
Mass00
Massart, P. (2000) Some applications of concentration inequalities to statistics. In Annales de la Faculté des sciences de Toulouse: Mathématiques (Vol. 9, pp. 245–303).
Mass07
Massart, P. (2007) Concentration inequalities and model selection: Ecole d’Eté de Probabilités de Saint-Flour XXXIII, 2003. Berlin; New York: Springer-Verlag.
MaMi11
Maugis, C., & Michel, B. (2011) A non asymptotic penalized criterion for Gaussian mixture model selection. ESAIM: Probability and Statistics, 15, 41–68. DOI.
McRa14
McLachlan, G. J., & Rathnayake, S. (2014) On the number of components in a Gaussian mixture model. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(5), 341–355. DOI.
Orr96
Orr, M. J. (1996) Introduction to radial basis function networks. Technical Report, Center for Cognitive Science, University of Edinburgh.
QuBe16
Que, Q., & Belkin, M. (2016) Back to the future: Radial Basis Function networks revisited. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) 2016.
RaPM05
Rakhlin, A., Panchenko, D., & Mukherjee, S. (2005) Risk bounds for mixture density estimation. ESAIM: Probability and Statistics, 9, 220–229. DOI.
Reyn03
Reynaud-Bouret, P. (2003) Adaptive estimation of the intensity of inhomogeneous Poisson processes via concentration inequalities. Probability Theory and Related Fields, 126(1). DOI.
ReSc10
Reynaud-Bouret, P., & Schbath, S. (2010) Adaptive estimation for Hawkes processes; application to genome analysis. The Annals of Statistics, 38(5), 2781–2822. DOI.
ShSh10
Shimazaki, H., & Shinomoto, S. (2010) Kernel bandwidth optimization in spike rate estimation. Journal of Computational Neuroscience, 29(1–2), 171–182. DOI.
Geer96
van de Geer, S. (1996) Rates of convergence for the maximum likelihood estimator in mixture models. Journal of Nonparametric Statistics, 6(4), 293–310. DOI.
Geer97
van de Geer, S. (1997) Asymptotic normality in mixture models. ESAIM: Probability and Statistics, 1, 17–33.
Geer03
van de Geer, S. (2003) Asymptotic theory for maximum likelihood in nonparametric mixture models. Computational Statistics & Data Analysis, 41(3–4), 453–464. DOI.
Geer08
van de Geer, S. A. (2008) High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36(2), 614–645. DOI.
ZeMM98
Zeevi, A. J., Meir, R., & Maiorov, V. (1998) Error bounds for functional approximation and estimation using mixtures of experts. IEEE Transactions on Information Theory, 44(3), 1010–1025. DOI.