When you have a number of predictors or regularisation terms in your model, you need to choose how many to use, based on the data you have and on a comparison of the candidate models. This is a kind of complement to statistical learning theory, where you hope to quantify how complicated a model you should bother fitting to a given amount of data.

If your predictors are discrete and small in number, you can do this in the
traditional fashion,
by *stepwise model selection*, and
you might discuss the degrees of freedom
of the model and the data.
If you are in the luxurious position of having a small tractable number of
parameters and the ability to perform controlled trials, then you do
ANOVA.
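
A minimal sketch of forward stepwise selection, assuming ordinary least squares with Gaussian errors and AIC as the stopping criterion. The function names (`gaussian_aic`, `forward_stepwise`) and the simulated data are illustrative assumptions, not taken from any of the references.

```python
import numpy as np

def gaussian_aic(y, X):
    """AIC, up to an additive constant, of an OLS fit with Gaussian errors."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    k = X.shape[1] + 1  # regression coefficients plus the noise variance
    return n * np.log(rss / n) + 2 * k

def forward_stepwise(y, X):
    """Greedily add the predictor that lowers AIC most; stop when none does."""
    n, p = X.shape
    intercept = np.ones((n, 1))
    selected = []
    best = gaussian_aic(y, intercept)  # start from the intercept-only model
    while len(selected) < p:
        scores = {}
        for j in range(p):
            if j in selected:
                continue
            design = np.hstack([intercept, X[:, selected + [j]]])
            scores[j] = gaussian_aic(y, design)
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:
            break  # no remaining candidate improves the criterion
        selected.append(j_best)
        best = scores[j_best]
    return selected, best

# Simulated data (an assumption): only columns 0 and 3 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=200)
selected, aic = forward_stepwise(y, X)
print(sorted(selected), round(aic, 1))
```

The greedy search may also pick up a spurious predictor or two, which is one reason the post-selection inference literature in the references exists.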

When the model has regularisation parameters, we tend to phrase this as
smoothing
and talk about *smoothing parameter selection*, which we can do in various ways.
I'm fond of degrees-of-freedom penalties
because they are no worse than cross-validation, and much quicker.
However, I'm not yet sure how to make that work in
sparse regression.
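
As one concrete instance of a degrees-of-freedom penalty, here is a sketch of generalised cross-validation (GCV) for ridge regression. It assumes centred data with no intercept, and uses the trace of the hat matrix as the effective degrees of freedom. The function name `ridge_gcv` and the simulated data are assumptions for illustration.

```python
import numpy as np

def ridge_gcv(X, y, lambdas):
    """Return (lambda, effective df, GCV score) for each candidate lambda."""
    n = len(y)
    # One SVD serves every lambda, which is why this is cheap next to refitting.
    U, d, _ = np.linalg.svd(X, full_matrices=False)
    uty = U.T @ y
    results = []
    for lam in lambdas:
        shrink = d**2 / (d**2 + lam)   # eigenvalues of the ridge hat matrix
        df = float(shrink.sum())       # effective degrees of freedom, trace(H)
        fitted = U @ (shrink * uty)
        rss = float(np.sum((y - fitted) ** 2))
        gcv = n * rss / (n - df) ** 2  # equals (RSS / n) / (1 - df / n)^2
        results.append((float(lam), df, gcv))
    return results

# Simulated data (an assumption): 5 active coefficients out of 20.
rng = np.random.default_rng(1)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta = np.concatenate([rng.normal(size=5), np.zeros(p - 5)])
y = X @ beta + rng.normal(size=n)

results = ridge_gcv(X, y, np.logspace(-3, 3, 25))
best_lam, best_df, best_gcv = min(results, key=lambda r: r[2])
print(f"lambda={best_lam:.3g}, df={best_df:.1f}, GCV={best_gcv:.3f}")
```

No data is held out and no model is refitted per fold; the whole regularisation path costs a single decomposition.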

Multiple testing is model selection writ large, where you are considering many hypothesis tests, possibly effectively infinitely many, or you have a combinatorial explosion of possible predictors to include.
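
For flavour, a sketch of the Benjamini-Hochberg step-up procedure, a standard way to control the false discovery rate across many tests. It is swapped in here as a familiar example rather than drawn from a particular reference, and the p-values below are made up for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the sorted indices of hypotheses rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    # Find the largest rank k whose ordered p-value sits under the line q * k / m.
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    return sorted(order[:k_max])

# Made-up p-values, purely illustrative.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p_values, q=0.05)
print(rejected)
```

Note that several p-values under 0.05 survive here: the threshold adapts to how many tests are in play.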

TODO: document connection with graphical models and thus conditional independence tests.

Bayesian model selection is also a thing, although the framing must be a little different: in the Bayesian method I in principle keep all my models about and weight them. We still might wish to discard some models for reasons of computational tractability or what-have-you.
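
A rough sketch of the "keep everything, weight it, then discard the stragglers" idea. It assumes equal prior probabilities over models and uses the BIC approximation to the marginal likelihood, which is only a crude stand-in for a proper Bayesian computation. The function names, the BIC values, and the pruning threshold are all hypothetical.

```python
import math

def bic_weights(bics):
    """Approximate posterior model probabilities from BIC values.

    Assumes equal prior model probabilities and the approximation
    marginal likelihood ~ exp(-BIC / 2).
    """
    best = min(bics)  # subtract the minimum for numerical stability
    unnormalised = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

def prune(weights, threshold=0.01):
    """Drop models with negligible weight and renormalise the rest."""
    keep = [i for i, w in enumerate(weights) if w >= threshold]
    total = sum(weights[i] for i in keep)
    return {i: weights[i] / total for i in keep}

# Hypothetical BIC values for three candidate models.
bics = [210.3, 212.1, 225.0]
weights = bic_weights(bics)
pruned = prune(weights)
print([round(w, 4) for w in weights], sorted(pruned))
```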

## Consistency

If the model order *itself* is the parameter of interest, how do you do
consistent inference of that?

An exhausting, exhaustive review of various model selection procedures, with an eye to consistency, is given in RaWu01.
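
To pin down what consistency means here, a sketch in notation of my own (the symbols $k_0$, $\hat{k}_n$, $L_n$ and $C_n$ are assumptions for illustration, not taken from the references). Suppose the candidate models are indexed by order $k$, the true order is $k_0$, and we select by a penalised criterion:

$$
\hat{k}_n = \operatorname*{arg\,min}_{k} \left\{ -2 L_n(k) + C_n\, k \right\},
\qquad
\Pr\left(\hat{k}_n = k_0\right) \to 1 \quad \text{as } n \to \infty .
$$

Here $L_n(k)$ is the maximised log-likelihood of the order-$k$ model and $C_n$ is the per-parameter penalty. The limit on the right is (weak) consistency; strong consistency asks that $\hat{k}_n \to k_0$ almost surely. AIC corresponds to $C_n = 2$ and BIC to $C_n = \log n$. In classical fixed-dimension settings where a true finite order is among the candidates, the growing penalty of BIC is typically what buys consistency, while AIC retains a non-vanishing probability of overfitting.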

## Cross validation

See cross validation.

## For densities

## Under sparsity

## Hyperparameter search

How do you choose your hyperparameters? NB hyperparameters might not always be
about model selection *per se*; there are also ones that are about, e.g.
convergence rate. Anyway. Also one could well regard hyperparameters as normal
parameters.

Turns out you can cast this as a bandit problem, or a sequential Bayesian optimisation problem.
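
A sketch of the bandit framing, using successive halving: every configuration is an arm, pulling an arm means training it for some budget, and each round the worse-looking arms are dropped while the survivors get more budget. The loss function, the candidate grid, and all names here are toy assumptions, not a real training setup.

```python
import math
import random

def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the best 1/eta of configs each round, growing the budget by eta."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        # Lower loss is better; evaluate each survivor at the current budget.
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))
        survivors = ranked[:max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0]

rng = random.Random(0)

def toy_validation_loss(learning_rate, budget):
    """Hypothetical noisy loss, minimised at learning_rate = 0.1.

    The noise shrinks as the budget grows, mimicking longer training runs.
    """
    noise = rng.gauss(0.0, 0.1 / math.sqrt(budget))
    return (math.log10(learning_rate) + 1.0) ** 2 + noise

candidates = [10.0 ** e for e in range(-4, 4)]  # 1e-4 up to 1e3
best = successive_halving(candidates, toy_validation_loss)
print(best)
```

The appeal is that most of the budget goes to configurations that still look promising, instead of being spread evenly over a grid.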

## Bayesian

…means not *quite* the same thing:
Bayesian model selection.

## Refs

- Wahb85: Grace Wahba (1985) A Comparison of GCV and GML for Choosing the Smoothing Parameter in the Generalized Spline Smoothing Problem. *The Annals of Statistics*, 13(4), 1378–1402.
- ThCl13: Kukatharmini Tharmaratnam, Gerda Claeskens (2013) A comparison of robust versions of the AIC based on M-, S- and MM-estimators. *Statistics*, 47(1), 216–235.
- BuNo95: P. Burman, D. Nolan (1995) A general Akaike-type criterion for model selection in robust regression. *Biometrika*, 82(4), 877–886.
- BeGa09: Yoav Benjamini, Yulia Gavrilov (2009) A simple forward selection procedure based on false discovery rate control. *The Annals of Applied Statistics*, 3(1), 179–198.
- RaWu89: Radhakrishna Rao, Yuehua Wu (1989) A strongly consistent procedure for model selection in a regression problem. *Biometrika*, 76(2), 369–374.
- KlRB10: Marius Kloft, Ulrich Rückert, Peter L. Bartlett (2010) A Unifying View of Multiple Kernel Learning. In Machine Learning and Knowledge Discovery in Databases (pp. 66–81). Springer Berlin Heidelberg.
- ShYe02: Xiaotong Shen, Jianming Ye (2002) Adaptive Model Selection. *Journal of the American Statistical Association*, 97(457), 210–221.
- ShHY04: Xiaotong Shen, Hsin-Cheng Huang, Jimmy Ye (2004) Adaptive Model Selection and Assessment for Exponential Family Distributions. *Technometrics*, 46(3), 306–317.
- Ston77: M. Stone (1977) An Asymptotic Equivalence of Choice of Model by Cross-Validation and Akaike’s Criterion. *Journal of the Royal Statistical Society. Series B (Methodological)*, 39(1), 44–47.
- GuEl03: Isabelle Guyon, André Elisseeff (2003) An Introduction to Variable and Feature Selection. *Journal of Machine Learning Research*, 3(Mar), 1157–1182.
- Li87: Ker-Chau Li (1987) Asymptotic Optimality for $C_p$, $C_L$, Cross-Validation and Generalized Cross-Validation: Discrete Index Set. *The Annals of Statistics*, 15(3), 958–975.
- Andr91: Donald W. K. Andrews (1991) Asymptotic optimality of generalized CL, cross-validation, and generalized cross-validation in regression with heteroskedastic errors. *Journal of Econometrics*, 47(2), 359–377.
- Broe06: Petrus MT Broersen (2006) *Automatic autocorrelation and spectral analysis*. Secaucus, NJ, USA: Springer Science & Business Media.
- BüKü99: Peter Bühlmann, Hans R Künsch (1999) Block length selection in the bootstrap for time series. *Computational Statistics & Data Analysis*, 31(3), 295–310.
- Shao96: Jun Shao (1996) Bootstrap Model Selection. *Journal of the American Statistical Association*, 91(434), 655–665.
- PaSa14: Efstathios Paparoditis, Theofanis Sapatinas (2014) Bootstrap-based testing for functional data. *ArXiv:1409.4317 [Math, Stat]*.
- Mass07: Pascal Massart (2007) *Concentration inequalities and model selection: Ecole d’Eté de Probabilités de Saint-Flour XXXIII - 2003*. Berlin ; New York: Springer-Verlag.
- BaCa15: Rina Foygel Barber, Emmanuel J. Candès (2015) Controlling the false discovery rate via knockoffs. *The Annals of Statistics*, 43(5), 2055–2085.
- TiTa12: Ryan J. Tibshirani, Jonathan Taylor (2012) Degrees of freedom in lasso problems. *The Annals of Statistics*, 40(2), 1198–1232.
- Take76: Kei Takeuchi (1976) Distribution of informational statistics and a criterion of model fitting. *Suri-Kagaku (Mathematical Sciences)*, 153(1), 12–18.
- JaFH15: Lucas Janson, William Fithian, Trevor J. Hastie (2015) Effective degrees of freedom: a flawed metaphor. *Biometrika*, 102(2), 479–485.
- LJDR16: Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, Ameet Talwalkar (2016) Efficient Hyperparameter Optimization and Infinitely Many Armed Bandits. *ArXiv:1603.06560 [Cs, Stat]*.
- CaWB08: Emmanuel J. Candès, Michael B. Wakin, Stephen P. Boyd (2008) Enhancing Sparsity by Reweighted ℓ1 Minimization. *Journal of Fourier Analysis and Applications*, 14(5–6), 877–905.
- AnKo85: Craig F. Ansley, Robert Kohn (1985) Estimation, Filtering, and Smoothing in State Space Models with Incompletely Specified Initial Conditions. *The Annals of Statistics*, 13(4), 1286–1316.
- Stei81: Charles M. Stein (1981) Estimation of the Mean of a Multivariate Normal Distribution. *The Annals of Statistics*, 9(6), 1135–1151.
- TLTT14: Jonathan Taylor, Richard Lockhart, Ryan J. Tibshirani, Robert Tibshirani (2014) Exact Post-selection Inference for Forward Stepwise and Least Angle Regression. *ArXiv:1401.3889 [Stat]*.
- ChLY16: Ngai Hang Chan, Ye Lu, Chun Yip Yau (2016) Factor Modelling for High-Dimensional Time Series: Inference and Model Selection. *Journal of Time Series Analysis*.
- KoKi96: Sadanori Konishi, Genshiro Kitagawa (1996) Generalised information criteria in model selection. *Biometrika*, 83(4), 875–890.
- ZhRY06: Peng Zhao, Guilherme Rocha, Bin Yu (2006) Grouped and hierarchical model selection through composite absolute penalties.
- Efro86: Bradley Efron (1986) How biased is the apparent error rate of a prediction rule? *Journal of the American Statistical Association*, 81(394), 461–470.
- KoKi08: Sadanori Konishi, G. Kitagawa (2008) *Information criteria and statistical modeling*. New York: Springer.
- CoBa17: D. R. Cox, H. S. Battey (2017) Large numbers of explanatory variables, a semi-descriptive analysis. *Proceedings of the National Academy of Sciences*, 114(32), 8592–8595.
- CaSu17: T. Tony Cai, Wenguang Sun (2017) Large-Scale Global and Simultaneous Inference: Estimation and Testing in Very High Dimensions. *Annual Review of Economics*, 9(1), 411–439.
- BLZS15: Adam Bloniarz, Hanzhong Liu, Cun-Hui Zhang, Jasjeet Sekhon, Bin Yu (2015) Lasso adjustments of treatment effect estimates in randomized experiments. *ArXiv:1507.03652 [Math, Stat]*.
- MeYu09: Nicolai Meinshausen, Bin Yu (2009) Lasso-type recovery of sparse representations for high-dimensional data. *The Annals of Statistics*, 37(1), 246–270.
- BiMa06: Lucien Birgé, Pascal Massart (2006) Minimal Penalties for Gaussian Model Selection. *Probability Theory and Related Fields*, 138(1–2), 33–73.
- Roya86: Richard M. Royall (1986) Model Robust Confidence Intervals Using Maximum Likelihood Estimators. *International Statistical Review / Revue Internationale de Statistique*, 54(2), 221–226.
- ClHj08: Gerda Claeskens, Nils Lid Hjort (2008) *Model selection and model averaging*. Cambridge ; New York: Cambridge University Press.
- BuAn02: Kenneth P. Burnham, David Raymond Anderson (2002) *Model selection and multimodel inference: a practical information-theoretic approach*. New York: Springer.
- HMCH08: X. Hong, R. J. Mitchell, S. Chen, C. J. Harris, K. Li, G. W. Irwin (2008) Model selection approaches for non-linear system identification: a review. *International Journal of Systems Science*, 39(10), 925–946.
- Birg08: Lucien Birgé (2008) Model selection for density estimation with L2-loss. *ArXiv:0808.1416 [Math, Stat]*.
- AlWi12: Pierre Alquier, Olivier Wintenberger (2012) Model selection for weakly dependent time series forecasting. *Bernoulli*.
- JoOm04: Jerald B. Johnson, Kristian S. Omland (2004) Model selection in ecology and evolution. *Trends in Ecology & Evolution*, 19(2), 101–108.
- AgNR16: Alireza Aghasi, Nam Nguyen, Justin Romberg (2016) Net-Trim: A Layer-wise Convex Pruning of Deep Neural Networks. *ArXiv:1611.05162 [Cs, Stat]*.
- GeHw82: Stuart Geman, Chii-Ruey Hwang (1982) Nonparametric Maximum Likelihood Estimation by the Method of Sieves. *The Annals of Statistics*, 10(2), 401–414.
- JaTa15: Kevin Jamieson, Ameet Talwalkar (2015) Non-stochastic Best Arm Identification and Hyperparameter Optimization. *ArXiv:1502.07943 [Cs, Stat]*.
- RaWu01: C. R. Rao, Y. Wu (2001) On model selection. In Institute of Mathematical Statistics Lecture Notes - Monograph Series (Vol. 38, pp. 1–57). Beachwood, OH: Institute of Mathematical Statistics.
- VaBC12: Stijn Vansteelandt, Maarten Bekaert, Gerda Claeskens (2012) On model selection and model misspecification in causal inference. *Statistical Methods in Medical Research*, 21(1), 7–30.
- ZhYu06: Peng Zhao, Bin Yu (2006) On model selection consistency of Lasso. *Journal of Machine Learning Research*, 7(Nov), 2541–2563.
- Qian96: Guoqi Qian (1996) On model selection in robust linear regression.
- QiKü98: Guoqi Qian, Hans R. Künsch (1998) On model selection via stochastic complexity in robust linear regression. *Journal of Statistical Planning and Inference*, 75(1), 91–116.
- CaTa10: Gavin C. Cawley, Nicola L. C. Talbot (2010) On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. *Journal of Machine Learning Research*, 11, 2079–2107.
- ZoLi08: Hui Zou, Runze Li (2008) One-step sparse estimates in nonconcave penalized likelihood models. *The Annals of Statistics*, 36(4), 1509–1533.
- ShHu06: Xiaotong Shen, Hsin-Cheng Huang (2006) Optimal Model Assessment, Selection, and Combination. *Journal of the American Statistical Association*, 101(474), 554–568.
- CFJL16: Emmanuel J. Candès, Yingying Fan, Lucas Janson, Jinchi Lv (2016) Panning for Gold: Model-free Knockoffs for High-dimensional Controlled Variable Selection. *ArXiv Preprint ArXiv:1610.02351*.
- BLTG06: Peter J. Bickel, Bo Li, Alexandre B. Tsybakov, Sara A. van de Geer, Bin Yu, Teófilo Valdés, … Aad van der Vaart (2006) Regularization in statistics. *Test*, 15(2), 271–344.
- Mach93: José A.F. Machado (1993) Robust Model Selection and M-Estimation. *Econometric Theory*, 9(03), 478–493.
- Ronc00: E. Ronchetti (2000) Robust Regression Methods and Model Selection. In Data Segmentation and Model Selection for Computer Vision (pp. 31–40). Springer New York.
- QiHa96: Guoqi Qian, R. K. Hans (1996) Some notes on Rissanen’s stochastic complexity.
- ElVi13: E. Elhamifar, R. Vidal (2013) Sparse Subspace Clustering: Algorithm, Theory, and Applications. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 35(11), 2765–2781.
- Shib89: Ritei Shibata (1989) Statistical Aspects of Model Selection. In From Data to Model (pp. 215–240). Springer Berlin Heidelberg.
- FaLv08: Jianqing Fan, Jinchi Lv (2008) Sure independence screening for ultrahigh dimensional feature space. *Journal of the Royal Statistical Society: Series B (Statistical Methodology)*, 70(5), 849–911.
- ZhRY09: Peng Zhao, Guilherme Rocha, Bin Yu (2009) The composite absolute penalties family for grouped and hierarchical variable selection. *The Annals of Statistics*, 37(6A), 3468–3497.
- DaBa16: Ran Dai, Rina Foygel Barber (2016) The knockoff filter for FDR control in group-sparse and multitask regression. *ArXiv Preprint ArXiv:1602.03589*.
- TRTW15: Ryan J. Tibshirani, Alessandro Rinaldo, Robert Tibshirani, Larry Wasserman (2015) Uniform Asymptotic Inference and the Bootstrap After Model Selection. *ArXiv:1506.06266 [Math, Stat]*.
- FaLi01: Jianqing Fan, Runze Li (2001) Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties. *Journal of the American Statistical Association*, 96(456), 1348–1360.