Model selection and learning theory for densities

Model selection and learning theory meet density estimation, especially for mixture densities.

There are various convergence results, depending on your assumptions on mixing distributions and component distributions and true distributions and the stars and so on.

Everyone either assumes fixed component scales or requires a free regularisation parameter. In particular, I would like sample complexity results that let me bound the number of components before seeing the data. AFAICT no one can give me a good, general model selection procedure that does not involve cross-validation in practice.

TODO: explain why this is fiddly, choice of loss function etc.

Large sample results for mixtures

I would prefer not to rely on asymptotic distributions, but maybe the use of the AIC would be smooth enough for me to just deal with it.

McRa14 claims the AIC doesn't work because ‘certain regularity assumptions’ necessary for it are not satisfied (which?) but that I may try the BIC, as per BaCo91 or BHLL08. Contrariwise, Mass07 gives explicit cases where the AIC does work in penalised density estimation, so… yeah. In any case, it's certainly true that the derivation and calculation of the GIC is tedious.
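For concreteness, here is a minimal sketch of the textbook AIC and BIC as they would be applied to choosing the number of components in a univariate Gaussian mixture. The function names are my own, and the sketch assumes you already have the maximised log-likelihood for each candidate order (from EM, say). It shows only the arithmetic; whether these criteria are justified for mixtures at all is exactly the point in dispute above.

```python
import math


def n_params_gaussian_mixture_1d(k):
    """Free parameters of a k-component univariate Gaussian mixture:
    k means + k variances + (k - 1) free mixing weights."""
    return 3 * k - 1


def aic(loglik, n_params):
    """Akaike information criterion (smaller is better)."""
    return 2 * n_params - 2 * loglik


def bic(loglik, n_params, n_obs):
    """Bayesian (Schwarz) information criterion (smaller is better)."""
    return n_params * math.log(n_obs) - 2 * loglik


def select_order(logliks, n_obs, criterion="bic"):
    """Pick the number of components minimising the chosen criterion.

    logliks: dict mapping number of components k to the maximised
    log-likelihood of the fitted k-component mixture (assumed given).
    """
    def score(k):
        p = n_params_gaussian_mixture_1d(k)
        if criterion == "aic":
            return aic(logliks[k], p)
        return bic(logliks[k], p, n_obs)

    return min(sorted(logliks), key=score)
```

The only difference between the two is the per-parameter price: 2 for the AIC versus log(n) for the BIC, so the BIC penalises extra components more heavily once n exceeds about 7.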

GhVa01 give MLE risk bounds under various assumptions, which I could maybe use independently.

Finite sample results for mixtures

In a penalised least-squares framework, see BePR11, Geer08, BuTW07a. This generally seems much more reasonable to me, and I'm reading up on it. However, I think that basis selection is going to turn out to be nasty. Massart et al (MaMi11) have some results based on Massart's lecture notes (Mass07), which, by the way, are really accessible as an introduction.
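As a reminder of what "penalised least-squares" means for densities, here is the generic criterion in the form I understand it from Mass07. The notation is my own choice: S_m is a candidate model (a set of functions), s is the true density, and pen(m) is a penalty to be designed; the norm is the L2 norm with respect to the reference measure.

```latex
\gamma_n(t) = \|t\|^2 - \frac{2}{n} \sum_{i=1}^{n} t(X_i),
\qquad
\hat{s}_m = \operatorname*{arg\,min}_{t \in S_m} \gamma_n(t),
\qquad
\hat{m} = \operatorname*{arg\,min}_{m} \left\{ \gamma_n(\hat{s}_m) + \operatorname{pen}(m) \right\}.
```

The contrast is sensible because its expectation is \|t\|^2 - 2\langle t, s\rangle = \|t - s\|^2 - \|s\|^2, which is minimised over t at the true density s. All the difficulty is in choosing pen(m) so that the selected estimator satisfies a finite-sample risk bound.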

Massart (Mass07) explains some of the complications here and gives example applications of both least-squares and ML model selection procedures for general density model selection. However, his results are restricted to mixtures of orthonormal bases, which, realistically, means histograms with bins of disjoint support, if you want your density to be non-negative. I wonder how much work it would take to allow overlapping support.
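To make the histogram case concrete, here is a minimal sketch of penalised least-squares selection of the number of equal-width bins on [0, 1]. For a regular histogram with D bins and bin proportions p_j, the least-squares contrast at the fitted histogram reduces to -D times the sum of p_j squared. The penalty c * D / n with c = 2 is a placeholder of my own, not a constant taken from Mass07, and the function names are hypothetical too.

```python
def ls_contrast(xs, n_bins):
    """Least-squares contrast of the regular histogram estimator on [0, 1].

    With bin proportions p_j and bin width 1 / n_bins, the contrast
    ||s_hat||^2 - (2 / n) * sum_i s_hat(x_i) equals -n_bins * sum_j p_j^2.
    """
    n = len(xs)
    counts = [0] * n_bins
    for x in xs:
        # Clamp so that x == 1.0 falls into the last bin.
        j = min(int(x * n_bins), n_bins - 1)
        counts[j] += 1
    return -n_bins * sum((c / n) ** 2 for c in counts)


def select_bins(xs, candidates, c=2.0):
    """Choose the bin count minimising contrast + c * D / n.

    The linear penalty shape and the default c are assumptions made
    for illustration only.
    """
    n = len(xs)
    return min(candidates, key=lambda d: ls_contrast(xs, d) + c * d / n)
```

Something like `select_bins(data, range(1, 51))` would then pick a bin count for data rescaled to [0, 1]. The disjoint bin supports are what make the contrast collapse to a sum over bins; with overlapping components the Gram matrix is no longer diagonal and that shortcut disappears.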