When you have a number of predictors or regularisation terms in your model, you need to choose how many to use, based on how much data you have and how well the various candidate models explain it. This is a kind of complement to statistical learning theory, where you hope to quantify how complicated a model you should bother fitting to a given amount of data.

If your predictors are discrete and small in number, you can do this in the traditional fashion,
by *stepwise model selection*, and
you might discuss the degrees of freedom
of the model and the data.
If you are in the luxurious position of having a small tractable number of parameters and the ability to perform controlled trials, then you do
ANOVA.
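As a rough illustration of stepwise selection, here is a minimal sketch of greedy forward selection for a Gaussian linear model, scored by AIC. The choice of AIC as the stopping criterion, the function names `gaussian_aic` and `forward_stepwise`, and the synthetic data are all assumptions made for this example, not anything the classical procedure mandates.

```python
import numpy as np


def gaussian_aic(y, X):
    """AIC of an ordinary least-squares fit, up to an additive constant."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    k = X.shape[1] + 1  # regression coefficients plus the noise variance
    return n * np.log(rss / n) + 2 * k


def forward_stepwise(y, X):
    """Greedily add the column of X that most lowers AIC; stop when none does."""
    n, p = X.shape
    intercept = np.ones((n, 1))
    selected = []
    best = gaussian_aic(y, intercept)  # start from the intercept-only model
    while len(selected) < p:
        scores = {}
        for j in range(p):
            if j in selected:
                continue
            design = np.hstack([intercept, X[:, selected + [j]]])
            scores[j] = gaussian_aic(y, design)
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:
            break  # no remaining predictor improves the criterion
        selected.append(j_best)
        best = scores[j_best]
    return selected, best


# Synthetic data (an assumption): only columns 0 and 3 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(size=200)
selected, aic = forward_stepwise(y, X)
print(selected, aic)
```

The greedy search only ever looks one predictor ahead, which is why this is feasible for a small number of discrete predictors and hopeless as a general answer to the combinatorial problem.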

When the model has regularisation parameters, we tend to phrase this as
smoothing
and talk about *smoothing parameter selection*, which we can do in various ways.
I’m fond of degrees-of-freedom penalties because they seem to do no worse than cross-validation and are much quicker to compute.
However, I’m not yet sure how to make that work in
sparse regression.
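To make the degrees-of-freedom idea concrete, here is a minimal sketch for ridge regression, where the effective degrees of freedom are the trace of the hat matrix and generalised cross-validation (GCV) uses that trace as a cheap stand-in for leave-one-out cross-validation. Ridge and GCV are swapped in here purely as a familiar example; the function name `ridge_gcv`, the penalty grid, and the synthetic data are assumptions.

```python
import numpy as np


def ridge_gcv(X, y, lambdas):
    """Return (lambda, effective df, GCV score) for each candidate penalty.

    Assumes X and y are already centred, so no intercept is fitted.
    """
    n = len(y)
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    Uty = U.T @ y
    results = []
    for lam in lambdas:
        shrink = s**2 / (s**2 + lam)   # eigenvalues of the ridge hat matrix
        df = float(np.sum(shrink))     # effective df = trace of the hat matrix
        fitted = U @ (shrink * Uty)
        rss = float(np.sum((y - fitted) ** 2))
        gcv = (rss / n) / (1.0 - df / n) ** 2
        results.append((float(lam), df, gcv))
    return results


# Synthetic data (an assumption): 20 predictors, 3 of them informative.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))
beta = np.zeros(20)
beta[:3] = [1.5, -2.0, 1.0]
y = X @ beta + rng.normal(size=100)
X = X - X.mean(axis=0)
y = y - y.mean()

lambdas = np.logspace(-3, 3, 25)
results = ridge_gcv(X, y, lambdas)
best_lam, best_df, best_gcv = min(results, key=lambda r: r[2])
print(best_lam, best_df, best_gcv)
```

One SVD serves the whole penalty grid, whereas naive k-fold cross-validation would refit the model for every fold and every penalty; that is the speed advantage in this linear-smoother setting.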

Multiple testing is model selection writ large, where you are considering many hypothesis tests, possibly effectively infinitely many, or you have a combinatorial explosion of possible predictors to include.
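For a feel of what controlling error across many tests looks like in practice, here is a minimal sketch of the Benjamini–Hochberg step-up procedure for false discovery rate control. It is offered as one well-known example rather than as the method these notes settle on; the function name `benjamini_hochberg` and the p-values are made up for illustration.

```python
import numpy as np


def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean mask of hypotheses rejected by the BH step-up procedure."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m  # alpha * rank / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest rank meeting its threshold
        reject[order[: k + 1]] = True     # reject everything up to that rank
    return reject


# Made-up p-values, deliberately unsorted.
p_values = [0.039, 0.001, 0.216, 0.008, 0.041, 0.205, 0.042, 0.06, 0.074, 0.212]
rejected = benjamini_hochberg(p_values, alpha=0.05)
print(rejected)
```

Note that four of these p-values are below 0.05 yet are not rejected, which is the whole point: the threshold for each test depends on how many tests were run.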

TODO: document connection with graphical models and thus conditional independence tests.

Bayesian model selection is also a thing, although the framing must be a little different: in the Bayesian method I keep, in principle, all my models about and weight them. We still might wish to discard some models for reasons of computational tractability or what-have-you.

## Consistency

If the model order *itself* is the parameter of interest, how do you do consistent inference of that?

An exhausting, exhaustive review of various model selection procedures, with an eye to consistency, is given in RaWu01.
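One common way to make the consistency question concrete (my gloss, not a summary of RaWu01) is to choose the model order by minimising a penalised log-likelihood:

$$
\hat{k} = \operatorname*{arg\,min}_{k} \left\{ -2 \log L_k(\hat{\theta}_k) + c_n \, d_k \right\}
$$

Here $L_k$ is the likelihood of candidate model $k$ at its fitted parameters $\hat{\theta}_k$, $d_k$ is its number of free parameters, $n$ is the sample size, and $c_n$ is a penalty weight; all of this notation is introduced for this sketch. AIC corresponds to $c_n = 2$ and BIC to $c_n = \log n$. In classical settings such as nested regression models, with the true model among finitely many candidates and the usual regularity conditions, $\hat{k}$ is consistent when $c_n \to \infty$ and $c_n / n \to 0$. BIC satisfies this, whereas AIC, with its constant penalty, keeps a positive asymptotic probability of picking too large a model.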

## Cross validation

See cross validation.

## Under sparsity

Fiddly. See sparse model selection.

## Hyperparameter search

How do you choose your hyperparameters? NB hyperparameters might not always be about model selection *per se*; some are about, e.g., convergence rate. One could also well regard hyperparameters as normal parameters.

Turns out you can cast this as a bandit problem, or a sequential Bayesian optimisation problem.
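A minimal sketch of the bandit view is successive halving: treat each hyperparameter configuration as an arm, give every surviving arm a small training budget, discard the worse fraction, and repeat with a larger budget. The function name `successive_halving`, the `evaluate(config, budget)` interface, and the toy loss are all hypothetical, chosen only to make the example run.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the best 1/eta of the configurations each round, growing the budget.

    `evaluate(config, budget)` must return a loss (lower is better) obtained
    after spending `budget` units of training effort on `config`.
    """
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        scored = sorted(survivors, key=lambda c: evaluate(c, budget))
        keep = max(1, len(survivors) // eta)
        survivors = scored[:keep]
        budget *= eta  # survivors earn a larger budget next round
    return survivors[0]


def toy_loss(config, budget):
    """Toy stand-in for validation loss: best at 0.3, improving with budget."""
    return (config - 0.3) ** 2 + 1.0 / budget


candidates = [i / 10 for i in range(10)]
best_config = successive_halving(candidates, toy_loss)
print(best_config)
```

In this toy the ranking of configurations is the same at every budget, so early elimination is harmless. With real training curves a configuration that starts slowly can be discarded too soon, which is the trade-off the bandit analyses are about.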

## Bayesian

…means not *quite* the same thing: Bayesian model selection.

## Reads

- BeGa09: Benjamini, Y., & Gavrilov, Y. (2009) A simple forward selection procedure based on false discovery rate control. *The Annals of Applied Statistics*, 3(1), 179–198.
- BLZS15: Bloniarz, A., Liu, H., Zhang, C.-H., Sekhon, J., & Yu, B. (2015) Lasso adjustments of treatment effect estimates in randomized experiments. *arXiv:1507.03652 [math, Stat]*.
- BüKü99: Bühlmann, P., & Künsch, H. R. (1999) Block length selection in the bootstrap for time series. *Computational Statistics & Data Analysis*, 31(3), 295–310.
- BuNo95: Burman, P., & Nolan, D. (1995) A general Akaike-type criterion for model selection in robust regression. *Biometrika*, 82(4), 877–886.
- BuAn02: Burnham, K. P., & Anderson, D. R. (2002) Model selection and multimodel inference: a practical information-theoretic approach (2nd ed.). New York: Springer.
- CaTa10: Cawley, G. C., & Talbot, N. L. C. (2010) On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. *Journal of Machine Learning Research*, 11, 2079–2107.
- ClHj08: Claeskens, G., & Hjort, N. L. (2008) Model selection and model averaging. Cambridge; New York: Cambridge University Press.
- GeHw82: Geman, S., & Hwang, C.-R. (1982) Nonparametric Maximum Likelihood Estimation by the Method of Sieves. *The Annals of Statistics*, 10(2), 401–414.
- GuEl03: Guyon, I., & Elisseeff, A. (2003) An Introduction to Variable and Feature Selection. *Journal of Machine Learning Research*, 3(Mar), 1157–1182.
- HMCH08: Hong, X., Mitchell, R. J., Chen, S., Harris, C. J., Li, K., & Irwin, G. W. (2008) Model selection approaches for non-linear system identification: a review. *International Journal of Systems Science*, 39(10), 925–946.
- JaTa15: Jamieson, K., & Talwalkar, A. (2015) Non-stochastic Best Arm Identification and Hyperparameter Optimization. *arXiv:1502.07943 [cs, Stat]*.
- JaFH13: Janson, L., Fithian, W., & Hastie, T. (2013) Effective Degrees of Freedom: A Flawed Metaphor. *arXiv:1312.7851 [stat]*.
- KlRB10: Kloft, M., Rückert, U., & Bartlett, P. L. (2010) A Unifying View of Multiple Kernel Learning. In J. L. Balcázar, F. Bonchi, A. Gionis, & M. Sebag (Eds.), Machine Learning and Knowledge Discovery in Databases (pp. 66–81). Springer Berlin Heidelberg.
- KoKi96: Konishi, S., & Kitagawa, G. (1996) Generalised information criteria in model selection. *Biometrika*, 83(4), 875–890.
- KoKi08: Konishi, S., & Kitagawa, G. (2008) Information criteria and statistical modeling. New York: Springer.
- LJDR16: Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., & Talwalkar, A. (2016) Efficient Hyperparameter Optimization and Infinitely Many Armed Bandits. *arXiv:1603.06560 [cs, Stat]*.
- PaSa14: Paparoditis, E., & Sapatinas, T. (2014) Bootstrap-based testing for functional data. *arXiv:1409.4317 [math, Stat]*.
- Qian96: Qian, G. (1996) On model selection in robust linear regression.
- QiHa96: Qian, G., & Hans, R. K. (1996) Some notes on Rissanen’s stochastic complexity.
- QiKü98: Qian, G., & Künsch, H. R. (1998) On model selection via stochastic complexity in robust linear regression. *Journal of Statistical Planning and Inference*, 75(1), 91–116.
- RaWu01: Rao, C. R., & Wu, Y. (2001) On model selection. In Institute of Mathematical Statistics Lecture Notes - Monograph Series (Vol. 38, pp. 1–57). Beachwood, OH: Institute of Mathematical Statistics.
- RaWu89: Rao, R., & Wu, Y. (1989) A strongly consistent procedure for model selection in a regression problem. *Biometrika*, 76(2), 369–374.
- Ronc00: Ronchetti, E. (2000) Robust Regression Methods and Model Selection. In A. Bab-Hadiashar & D. Suter (Eds.), Data Segmentation and Model Selection for Computer Vision (pp. 31–40). Springer New York.
- Roya86: Royall, R. M. (1986) Model Robust Confidence Intervals Using Maximum Likelihood Estimators. *International Statistical Review / Revue Internationale de Statistique*, 54(2), 221–226.
- Shao96: Shao, J. (1996) Bootstrap Model Selection. *Journal of the American Statistical Association*, 91(434), 655–665.
- Shib89: Shibata, R. (1989) Statistical Aspects of Model Selection. In P. J. C. Willems (Ed.), From Data to Model (pp. 215–240). Springer Berlin Heidelberg.
- Ston77: Stone, M. (1977) An Asymptotic Equivalence of Choice of Model by Cross-Validation and Akaike’s Criterion. *Journal of the Royal Statistical Society. Series B (Methodological)*, 39(1), 44–47.
- Take76: Takeuchi, K. (1976) Distribution of informational statistics and a criterion of model fitting. *Suri-Kagaku (Mathematical Sciences)*, 153(1), 12–18.
- TLTT14: Taylor, J., Lockhart, R., Tibshirani, R. J., & Tibshirani, R. (2014) Exact Post-selection Inference for Forward Stepwise and Least Angle Regression. *arXiv:1401.3889 [stat]*.
- ThCl13: Tharmaratnam, K., & Claeskens, G. (2013) A comparison of robust versions of the AIC based on M-, S- and MM-estimators. *Statistics*, 47(1), 216–235.
- TRTW15: Tibshirani, R. J., Rinaldo, A., Tibshirani, R., & Wasserman, L. (2015) Uniform Asymptotic Inference and the Bootstrap After Model Selection. *arXiv:1506.06266 [math, Stat]*.
- VaBC12: Vansteelandt, S., Bekaert, M., & Claeskens, G. (2012) On model selection and model misspecification in causal inference. *Statistical Methods in Medical Research*, 21(1), 7–30.