Model selection and learning theory meet density estimation, especially for mixture densities.

There are various convergence results, depending on your assumptions on mixing distributions and component distributions and true distributions and the stars and so on.

Everyone either assumes fixed component scales or requires a free regularisation parameter. In particular, I would like sample complexity results that let me bound the number of components before seeing the data. AFAICT no one can give me a good, general model selection procedure that does not involve cross-validation in practice.

Please, prove me wrong.

TODO: explain why this is fiddly, choice of loss function etc.

## Large sample results for mixtures

I would prefer not to rely on asymptotic distributions, but maybe the use of the AIC would be smooth enough for me to just deal with it.

McRa14 claims the AIC doesn't work because ‘certain regularity assumptions’ necessary for it are not satisfied (which?) but that I may try the BIC, as per BaCo91 or BHLL08. Contrariwise, Mass07 gives explicit cases where the AIC does work in penalised density estimation, so… yeah. In any case, it's certainly true that the derivation and calculation of the GIC is tedious.
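For reference, the two criteria under discussion are cheap to compute once you have a maximised log-likelihood; the hard part is whether they mean anything for mixtures. Here is a minimal sketch, assuming univariate Gaussian components with free means, variances and weights (so 3K - 1 free parameters for K components). The function names are my own, and the log-likelihoods are assumed to come from whatever fitting routine you trust.

```python
import math


def gaussian_mixture_param_count(n_components: int) -> int:
    """Free parameters of a univariate Gaussian mixture (an assumed model)."""
    # K means + K variances + (K - 1) free mixing weights
    return 3 * n_components - 1


def aic(log_lik: float, n_params: int) -> float:
    """Akaike information criterion: -2 log L + 2 k."""
    return -2.0 * log_lik + 2.0 * n_params


def bic(log_lik: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion: -2 log L + k log n."""
    return -2.0 * log_lik + n_params * math.log(n_obs)


def best_k(log_liks, n_obs, criterion="bic"):
    """Pick the component count K minimising the chosen criterion.

    log_liks maps K -> maximised log-likelihood of the K-component fit.
    """
    def score(k):
        p = gaussian_mixture_param_count(k)
        if criterion == "aic":
            return aic(log_liks[k], p)
        return bic(log_liks[k], p, n_obs)

    return min(log_liks, key=score)
```

Calling `best_k` with a dictionary of fitted log-likelihoods returns the component count with the smallest BIC; pass `criterion="aic"` for the AIC. Neither choice settles the regularity question above.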

GhVa01 give MLE risk bounds under various assumptions, which I could maybe use independently.

## Finite sample results for mixtures

In a penalised least-squares framework, see BePR11, Geer08, BuTW07a. This generally seems much more reasonable to me, and I'm reading up on it. However, I think that basis selection is going to turn out to be nasty. Massart et al (MaMi11) have some results based on Massart's lecture notes (Mass07), which, by the way, are really accessible as an introduction.

Massart (Mass07) explains some of the complications here and gives example applications of both least-squares and ML model selection procedures for general density model selection. However, his results are restricted to mixtures of orthonormal bases, which, realistically, means histograms with bins of disjoint support, if you want your density to be non-negative. Wondering how much work it is to get overlapping support.
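To make the histogram case concrete, here is a minimal sketch of penalised least-squares selection of the number of bins, assuming data on [0, 1) and regular bins. The names `ls_contrast` and `select_bins` are mine, and the penalty `c * D / n` is an illustrative assumption, not Massart's exact penalty. The constant `c` is precisely the kind of free tuning parameter I was complaining about above.

```python
def ls_contrast(sample, n_bins):
    """Empirical least-squares contrast of the regular histogram estimator.

    Assumes every observation lies in [0, 1]. For the histogram s_hat with
    D equal-width bins and counts N_j, the contrast
    ||s_hat||^2 - (2/n) * sum_i s_hat(X_i) simplifies to -D * sum_j (N_j/n)^2.
    """
    n = len(sample)
    counts = [0] * n_bins
    for x in sample:
        j = min(int(x * n_bins), n_bins - 1)  # put x == 1.0 in the last bin
        counts[j] += 1
    return -n_bins * sum((c / n) ** 2 for c in counts)


def select_bins(sample, max_bins, c=2.0):
    """Choose the bin count D in 1..max_bins minimising contrast + c * D / n.

    c is a hypothetical tuning constant; 2.0 is an arbitrary default.
    """
    n = len(sample)

    def criterion(d):
        return ls_contrast(sample, d) + c * d / n

    return min(range(1, max_bins + 1), key=criterion)
```

On an evenly spread sample this picks a single bin, and on a sample concentrated on [0, 0.25) and [0.75, 1) it picks four, which is the coarsest partition aligned with the support. Everything here leans on the bins being disjoint, so it says nothing about the overlapping-support case.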

## Refs

- Wahb85: Grace Wahba (1985) A Comparison of GCV and GML for Choosing the Smoothing Parameter in the Generalized Spline Smoothing Problem. *The Annals of Statistics*, 13(4), 1378–1402.
- ThCl13: Kukatharmini Tharmaratnam, Gerda Claeskens (2013) A comparison of robust versions of the AIC based on M-, S- and MM-estimators. *Statistics*, 47(1), 216–235.
- BuNo95: P. Burman, D. Nolan (1995) A general Akaike-type criterion for model selection in robust regression. *Biometrika*, 82(4), 877–886.
- BeGa09: Yoav Benjamini, Yulia Gavrilov (2009) A simple forward selection procedure based on false discovery rate control. *The Annals of Applied Statistics*, 3(1), 179–198.
- RaWu89: Radhakrishna Rao, Yuehua Wu (1989) A strongly consistent procedure for model selection in a regression problem. *Biometrika*, 76(2), 369–374.
- KlRB10: Marius Kloft, Ulrich Rückert, Peter L. Bartlett (2010) A Unifying View of Multiple Kernel Learning. In Machine Learning and Knowledge Discovery in Databases (pp. 66–81). Springer Berlin Heidelberg.
- ShYe02: Xiaotong Shen, Jianming Ye (2002) Adaptive Model Selection. *Journal of the American Statistical Association*, 97(457), 210–221.
- ShHY04: Xiaotong Shen, Hsin-Cheng Huang, Jimmy Ye (2004) Adaptive Model Selection and Assessment for Exponential Family Distributions. *Technometrics*, 46(3), 306–317.
- Ston77: M. Stone (1977) An Asymptotic Equivalence of Choice of Model by Cross-Validation and Akaike’s Criterion. *Journal of the Royal Statistical Society. Series B (Methodological)*, 39(1), 44–47.
- GuEl03: Isabelle Guyon, André Elisseeff (2003) An Introduction to Variable and Feature Selection. *Journal of Machine Learning Research*, 3(Mar), 1157–1182.
- Li87: Ker-Chau Li (1987) Asymptotic Optimality for Cp, CL, Cross-Validation and Generalized Cross-Validation: Discrete Index Set. *The Annals of Statistics*, 15(3), 958–975.
- Andr91: Donald W. K. Andrews (1991) Asymptotic optimality of generalized CL, cross-validation, and generalized cross-validation in regression with heteroskedastic errors. *Journal of Econometrics*, 47(2), 359–377.
- Broe06: Petrus MT Broersen (2006) *Automatic autocorrelation and spectral analysis*. Secaucus, NJ, USA: Springer Science & Business Media.
- BüKü99: Peter Bühlmann, Hans R Künsch (1999) Block length selection in the bootstrap for time series. *Computational Statistics & Data Analysis*, 31(3), 295–310.
- Shao96: Jun Shao (1996) Bootstrap Model Selection. *Journal of the American Statistical Association*, 91(434), 655–665.
- PaSa14: Efstathios Paparoditis, Theofanis Sapatinas (2014) Bootstrap-based testing for functional data. *ArXiv:1409.4317 [Math, Stat]*.
- Mass07: Pascal Massart (2007) *Concentration inequalities and model selection: Ecole d’Eté de Probabilités de Saint-Flour XXXIII - 2003*. Berlin ; New York: Springer-Verlag.
- BaCa15: Rina Foygel Barber, Emmanuel J. Candès (2015) Controlling the false discovery rate via knockoffs. *The Annals of Statistics*, 43(5), 2055–2085.
- TiTa12: Ryan J. Tibshirani, Jonathan Taylor (2012) Degrees of freedom in lasso problems. *The Annals of Statistics*, 40(2), 1198–1232.
- Take76: Kei Takeuchi (1976) Distribution of informational statistics and a criterion of model fitting. *Suri-Kagaku (Mathematical Sciences)*, 153(1), 12–18.
- JaFH15: Lucas Janson, William Fithian, Trevor J. Hastie (2015) Effective degrees of freedom: a flawed metaphor. *Biometrika*, 102(2), 479–485.
- LJDR16: Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, Ameet Talwalkar (2016) Efficient Hyperparameter Optimization and Infinitely Many Armed Bandits. *ArXiv:1603.06560 [Cs, Stat]*.
- CaWB08: Emmanuel J. Candès, Michael B. Wakin, Stephen P. Boyd (2008) Enhancing Sparsity by Reweighted ℓ1 Minimization. *Journal of Fourier Analysis and Applications*, 14(5–6), 877–905.
- AnKo85: Craig F. Ansley, Robert Kohn (1985) Estimation, Filtering, and Smoothing in State Space Models with Incompletely Specified Initial Conditions. *The Annals of Statistics*, 13(4), 1286–1316.
- Stei81: Charles M. Stein (1981) Estimation of the Mean of a Multivariate Normal Distribution. *The Annals of Statistics*, 9(6), 1135–1151.
- TLTT14: Jonathan Taylor, Richard Lockhart, Ryan J. Tibshirani, Robert Tibshirani (2014) Exact Post-selection Inference for Forward Stepwise and Least Angle Regression. *ArXiv:1401.3889 [Stat]*.
- ChLY16: Ngai Hang Chan, Ye Lu, Chun Yip Yau (2016) Factor Modelling for High-Dimensional Time Series: Inference and Model Selection. *Journal of Time Series Analysis*, n/a-n/a.
- KoKi96: Sadanori Konishi, Genshiro Kitagawa (1996) Generalised information criteria in model selection. *Biometrika*, 83(4), 875–890.
- ZhRY06: Peng Zhao, Guilherme Rocha, Bin Yu (2006) Grouped and hierarchical model selection through composite absolute penalties.
- Efro86: Bradley Efron (1986) How biased is the apparent error rate of a prediction rule? *Journal of the American Statistical Association*, 81(394), 461–470.
- KoKi08: Sadanori Konishi, G. Kitagawa (2008) *Information criteria and statistical modeling*. New York: Springer.
- CoBa17: D. R. Cox, H. S. Battey (2017) Large numbers of explanatory variables, a semi-descriptive analysis. *Proceedings of the National Academy of Sciences*, 114(32), 8592–8595.
- CaSu17: T. Tony Cai, Wenguang Sun (2017) Large-Scale Global and Simultaneous Inference: Estimation and Testing in Very High Dimensions. *Annual Review of Economics*, 9(1), 411–439.
- BLZS15: Adam Bloniarz, Hanzhong Liu, Cun-Hui Zhang, Jasjeet Sekhon, Bin Yu (2015) Lasso adjustments of treatment effect estimates in randomized experiments. *ArXiv:1507.03652 [Math, Stat]*.
- MeYu09: Nicolai Meinshausen, Bin Yu (2009) Lasso-type recovery of sparse representations for high-dimensional data. *The Annals of Statistics*, 37(1), 246–270.
- BiMa06: Lucien Birgé, Pascal Massart (2006) Minimal Penalties for Gaussian Model Selection. *Probability Theory and Related Fields*, 138(1–2), 33–73.
- Roya86: Richard M. Royall (1986) Model Robust Confidence Intervals Using Maximum Likelihood Estimators. *International Statistical Review / Revue Internationale de Statistique*, 54(2), 221–226.
- ClHj08: Gerda Claeskens, Nils Lid Hjort (2008) *Model selection and model averaging*. Cambridge ; New York: Cambridge University Press.
- BuAn02: Kenneth P. Burnham, David Raymond Anderson (2002) *Model selection and multimodel inference: a practical information-theoretic approach*. New York: Springer.
- HMCH08: X. Hong, R. J. Mitchell, S. Chen, C. J. Harris, K. Li, G. W. Irwin (2008) Model selection approaches for non-linear system identification: a review. *International Journal of Systems Science*, 39(10), 925–946.
- Birg08: Lucien Birgé (2008) Model selection for density estimation with L2-loss. *ArXiv:0808.1416 [Math, Stat]*.
- AlWi12: Pierre Alquier, Olivier Wintenberger (2012) Model selection for weakly dependent time series forecasting. *Bernoulli*.
- JoOm04: Jerald B. Johnson, Kristian S. Omland (2004) Model selection in ecology and evolution. *Trends in Ecology & Evolution*, 19(2), 101–108.
- AgNR16: Alireza Aghasi, Nam Nguyen, Justin Romberg (2016) Net-Trim: A Layer-wise Convex Pruning of Deep Neural Networks. *ArXiv:1611.05162 [Cs, Stat]*.
- GeHw82: Stuart Geman, Chii-Ruey Hwang (1982) Nonparametric Maximum Likelihood Estimation by the Method of Sieves. *The Annals of Statistics*, 10(2), 401–414.
- JaTa15: Kevin Jamieson, Ameet Talwalkar (2015) Non-stochastic Best Arm Identification and Hyperparameter Optimization. *ArXiv:1502.07943 [Cs, Stat]*.
- RaWu01: C. R. Rao, Y. Wu (2001) On model selection. In Institute of Mathematical Statistics Lecture Notes - Monograph Series (Vol. 38, pp. 1–57). Beachwood, OH: Institute of Mathematical Statistics.
- VaBC12: Stijn Vansteelandt, Maarten Bekaert, Gerda Claeskens (2012) On model selection and model misspecification in causal inference. *Statistical Methods in Medical Research*, 21(1), 7–30.
- ZhYu06: Peng Zhao, Bin Yu (2006) On model selection consistency of Lasso. *Journal of Machine Learning Research*, 7(Nov), 2541–2563.
- Qian96: Guoqi Qian (1996) On model selection in robust linear regression.
- QiKü98: Guoqi Qian, Hans R. Künsch (1998) On model selection via stochastic complexity in robust linear regression. *Journal of Statistical Planning and Inference*, 75(1), 91–116.
- CaTa10: Gavin C. Cawley, Nicola L. C. Talbot (2010) On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. *Journal of Machine Learning Research*, 11, 2079–2107.
- ZoLi08: Hui Zou, Runze Li (2008) One-step sparse estimates in nonconcave penalized likelihood models. *The Annals of Statistics*, 36(4), 1509–1533.
- ShHu06: Xiaotong Shen, Hsin-Cheng Huang (2006) Optimal Model Assessment, Selection, and Combination. *Journal of the American Statistical Association*, 101(474), 554–568.
- CFJL16: Emmanuel J. Candès, Yingying Fan, Lucas Janson, Jinchi Lv (2016) Panning for Gold: Model-free Knockoffs for High-dimensional Controlled Variable Selection. *ArXiv Preprint ArXiv:1610.02351*.
- BLTG06: Peter J. Bickel, Bo Li, Alexandre B. Tsybakov, Sara A. van de Geer, Bin Yu, Teófilo Valdés, … Aad van der Vaart (2006) Regularization in statistics. *Test*, 15(2), 271–344.
- Mach93: José A.F. Machado (1993) Robust Model Selection and M-Estimation. *Econometric Theory*, 9(03), 478–493.
- Ronc00: E. Ronchetti (2000) Robust Regression Methods and Model Selection. In Data Segmentation and Model Selection for Computer Vision (pp. 31–40). Springer New York.
- QiHa96: Guoqi Qian, R. K. Hans (1996) Some notes on Rissanen’s stochastic complexity.
- ElVi13: E. Elhamifar, R. Vidal (2013) Sparse Subspace Clustering: Algorithm, Theory, and Applications. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 35(11), 2765–2781.
- Shib89: Ritei Shibata (1989) Statistical Aspects of Model Selection. In From Data to Model (pp. 215–240). Springer Berlin Heidelberg.
- FaLv08: Jianqing Fan, Jinchi Lv (2008) Sure independence screening for ultrahigh dimensional feature space. *Journal of the Royal Statistical Society: Series B (Statistical Methodology)*, 70(5), 849–911.
- ZhRY09: Peng Zhao, Guilherme Rocha, Bin Yu (2009) The composite absolute penalties family for grouped and hierarchical variable selection. *The Annals of Statistics*, 37(6A), 3468–3497.
- DaBa16: Ran Dai, Rina Foygel Barber (2016) The knockoff filter for FDR control in group-sparse and multitask regression. *ArXiv Preprint ArXiv:1602.03589*.
- TRTW15: Ryan J. Tibshirani, Alessandro Rinaldo, Robert Tibshirani, Larry Wasserman (2015) Uniform Asymptotic Inference and the Bootstrap After Model Selection. *ArXiv:1506.06266 [Math, Stat]*.
- FaLi01: Jianqing Fan, Runze Li (2001) Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties. *Journal of the American Statistical Association*, 96(456), 1348–1360.