# Model/hyperparameter selection

Usefulness: 🔧
Novelty: 💡
Uncertainty: 🤪 🤪 🤪
Incompleteness: 🚧 🚧 🚧

Choosing which of an ensemble of models to use, or, what amounts to more or less the same thing, how many predictors to include, or how much regularisation to apply. This is a kind of complement to statistical learning theory, where you hope to quantify how complicated a model you should bother fitting to a given amount of data.

If your predictors are discrete and small in number, you can do this in the traditional fashion, by stepwise model selection, and you might discuss the degrees of freedom of the model and the data. If you are in the luxurious position of having a small tractable number of parameters and the ability to perform controlled trials, then you do ANOVA.
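Forward stepwise selection, for instance, greedily adds whichever predictor most improves a criterion and stops when nothing helps. A minimal sketch, on synthetic data and using AIC for Gaussian least squares (the data, names and stopping rule here are purely illustrative):

```python
# Toy forward stepwise selection by AIC: greedily add whichever predictor
# most improves the criterion, stopping when no addition helps.
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=n)  # only cols 0, 2 matter

def aic(X_sub, y):
    # AIC for Gaussian OLS: n*log(RSS/n) + 2k, up to an additive constant
    n = len(y)
    beta, *_ = np.linalg.lstsq(X_sub, y, rcond=None)
    rss = np.sum((y - X_sub @ beta) ** 2)
    return n * np.log(rss / n) + 2 * X_sub.shape[1]

selected, remaining = [], list(range(p))
current = np.inf
while remaining:
    scores = {j: aic(X[:, selected + [j]], y) for j in remaining}
    best = min(scores, key=scores.get)
    if scores[best] >= current:
        break  # no candidate improves AIC; stop
    current = scores[best]
    selected.append(best)
    remaining.remove(best)

print(sorted(selected))  # should include the true support {0, 2}
```

Note that post-selection inference on the chosen model is a whole further can of worms (see e.g. Taylor et al. below).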

When you have penalisation parameters, we sometimes phrase this as regularisation and talk about regularisation parameter selection, or hyperparameter selection, which we can do in various ways. Methods for this include degrees-of-freedom penalties, cross-validation and so on. However, I’m not yet sure how to make these work in sparse regression.
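The cross-validation version is simple enough to sketch: hold out each fold in turn, fit at every candidate penalty, and keep the penalty with the best average held-out error. Here a plain-numpy version for ridge regression (the grid, fold count and data are arbitrary choices for illustration):

```python
# A minimal sketch of regularisation-parameter selection by K-fold
# cross-validation, for ridge regression in plain numpy.
import numpy as np

rng = np.random.default_rng(1)
n, p = 120, 10
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

def ridge_fit(X, y, lam):
    # Closed-form ridge solution (X'X + lam I)^{-1} X'y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_error(lam, k=5):
    folds = np.array_split(np.arange(n), k)
    errs = []
    for held in folds:
        train = np.setdiff1d(np.arange(n), held)
        beta = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((y[held] - X[held] @ beta) ** 2))
    return np.mean(errs)

grid = np.logspace(-3, 3, 13)
best_lam = min(grid, key=cv_error)
print(best_lam)
```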

Multiple testing is model selection writ large, where you are considering many hypothesis tests, possibly effectively infinitely many hypothesis tests, or you have a combinatorial explosion of possible predictors to include.
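The workhorse in that setting is false-discovery-rate control, e.g. the Benjamini-Hochberg step-up procedure: sort the p-values and reject the largest initial block whose ordered values stay below a linearly growing threshold. A sketch (the signal/null split below is synthetic):

```python
# Benjamini-Hochberg step-up: given m p-values, reject the largest set
# {p_(1), ..., p_(k)} with p_(k) <= k * q / m.
import numpy as np

def benjamini_hochberg(pvals, q=0.1):
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    thresh = q * np.arange(1, m + 1) / m
    below = pvals[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True
    return rejected

# 90 nulls (uniform p-values) plus 10 strong signals
rng = np.random.default_rng(2)
p = np.concatenate([rng.uniform(size=90), rng.uniform(0, 1e-4, size=10)])
print(benjamini_hochberg(p, q=0.1).sum())
```

The knockoff filter (Barber and Candès, below) plays an analogous FDR-controlling role when the "tests" are which predictors to include in a regression.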

🚧 document connection with graphical models and thus conditional independence tests.

## Bayesian

Bayesian model selection is also a thing, although the framing is a little different. In the classic Bayesian method I keep all my models around, although some might become very unlikely. But apparently I can also throw some out entirely? Presumably for reasons of computational tractability or what-have-you.

## Consistency

If the model order itself is the parameter of interest, how do you do consistent inference on it? AIC, for example, is derived to optimise prediction loss, not model selection. (Doesn’t BIC do better?)
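BIC does, under classical conditions: its log(n) penalty per parameter eventually dominates the chance improvements in fit that lead AIC to overshoot. A toy illustration, selecting a polynomial degree on synthetic data (true degree 2):

```python
# AIC vs BIC for choosing a polynomial degree. BIC's log(n) penalty is
# harsher than AIC's 2 per parameter, so it is consistent for the order
# under classical conditions, while AIC tends to overselect.
import numpy as np

rng = np.random.default_rng(3)
n = 2000
x = rng.uniform(-1, 1, size=n)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.5, size=n)  # true degree 2

def criteria(degree):
    X = np.vander(x, degree + 1)  # polynomial design matrix
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    k = degree + 1
    aic = n * np.log(rss / n) + 2 * k
    bic = n * np.log(rss / n) + np.log(n) * k
    return aic, bic

degrees = range(8)
aic_pick = min(degrees, key=lambda d: criteria(d)[0])
bic_pick = min(degrees, key=lambda d: criteria(d)[1])
print(aic_pick, bic_pick)  # BIC typically recovers 2; AIC may overshoot
```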

An exhausting, exhaustive review of various model selection procedures, with an eye to consistency, is given in Rao and Wu (2001).

## Cross validation

See cross validation.

## Under sparsity

How do you choose your hyperparameters? NB hyperparameters are not always about model selection per se; some govern, e.g., convergence rate. Anyway. One could also well regard hyperparameters as ordinary parameters.

Turns out you can cast this as a bandit problem, or a sequential Bayesian optimisation problem.
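The bandit framing leads to procedures like successive halving (Jamieson and Talwalkar; Li et al., below): give every candidate configuration a small training budget, keep the best-scoring half, double the budget, repeat. A minimal sketch, where the "training run" is a stand-in noisy loss whose noise shrinks with budget:

```python
# A minimal successive-halving sketch, the bandit view of hyperparameter
# search: small budget for everyone, keep the best half, double the
# budget, repeat until one configuration survives.
import numpy as np

rng = np.random.default_rng(4)

def noisy_loss(config, budget):
    # Stand-in for "train with this hyperparameter for `budget` steps":
    # the noise shrinks as the budget grows, mimicking a learning curve.
    true = (config - 0.3) ** 2  # hypothetical optimum at 0.3
    return true + rng.normal(scale=1.0 / np.sqrt(budget))

configs = list(rng.uniform(0, 1, size=16))
budget = 8
while len(configs) > 1:
    scores = sorted((noisy_loss(c, budget), c) for c in configs)
    configs = [c for _, c in scores[: len(configs) // 2]]  # keep best half
    budget *= 2  # survivors get a doubled budget next round

print(configs[0])  # hopefully near the optimum
```

The appeal over grid search is that most of the budget is spent on promising configurations; the risk is culling a slow starter too early, which is what the doubling schedule (and, in Hyperband, multiple brackets) hedges against.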

## Refs

Aghasi, Alireza, Nam Nguyen, and Justin Romberg. 2016. “Net-Trim: A Layer-Wise Convex Pruning of Deep Neural Networks,” November. http://arxiv.org/abs/1611.05162.

Alquier, Pierre, and Olivier Wintenberger. 2012. “Model Selection for Weakly Dependent Time Series Forecasting.” Bernoulli. http://arxiv.org/abs/0902.2924.

Andersen, Per Kragh, Ornulf Borgan, Richard D. Gill, and Niels Keiding. 1997. Statistical Models Based on Counting Processes. Corr. 2. print. Springer Series in Statistics. New York, NY: Springer.

Andrews, Donald W. K. 1991. “Asymptotic Optimality of Generalized CL, Cross-Validation, and Generalized Cross-Validation in Regression with Heteroskedastic Errors.” Journal of Econometrics 47 (2): 359–77. https://doi.org/10.1016/0304-4076(91)90107-O.

Ansley, Craig F., and Robert Kohn. 1985. “Estimation, Filtering, and Smoothing in State Space Models with Incompletely Specified Initial Conditions.” The Annals of Statistics 13 (4): 1286–1316. https://doi.org/10.1214/aos/1176349739.

Barber, Rina Foygel, and Emmanuel J. Candès. 2015. “Controlling the False Discovery Rate via Knockoffs.” The Annals of Statistics 43 (5): 2055–85. https://doi.org/10.1214/15-AOS1337.

Benjamini, Yoav, and Yulia Gavrilov. 2009. “A Simple Forward Selection Procedure Based on False Discovery Rate Control.” The Annals of Applied Statistics 3 (1): 179–98. https://doi.org/10.1214/08-AOAS194.

Bickel, Peter J., Bo Li, Alexandre B. Tsybakov, Sara A. van de Geer, Bin Yu, Teófilo Valdés, Carlos Rivero, Jianqing Fan, and Aad van der Vaart. 2006. “Regularization in Statistics.” Test 15 (2): 271–344. https://doi.org/10.1007/BF02607055.

Birgé, Lucien. 2008. “Model Selection for Density Estimation with L2-Loss,” August. http://arxiv.org/abs/0808.1416.

Birgé, Lucien, and Pascal Massart. 2006. “Minimal Penalties for Gaussian Model Selection.” Probability Theory and Related Fields 138 (1-2): 33–73. https://doi.org/10.1007/s00440-006-0011-8.

Bloniarz, Adam, Hanzhong Liu, Cun-Hui Zhang, Jasjeet Sekhon, and Bin Yu. 2015. “Lasso Adjustments of Treatment Effect Estimates in Randomized Experiments,” July. http://arxiv.org/abs/1507.03652.

Broersen, Petrus MT. 2006. Automatic Autocorrelation and Spectral Analysis. Secaucus, NJ, USA: Springer. http://dsp-book.narod.ru/AASA.pdf.

Burman, P., and D. Nolan. 1995. “A General Akaike-Type Criterion for Model Selection in Robust Regression.” Biometrika 82 (4): 877–86. https://doi.org/10.1093/biomet/82.4.877.

Burnham, Kenneth P., and David Raymond Anderson. 2002. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. 2nd ed. New York: Springer.

Bühlmann, Peter, and Hans R Künsch. 1999. “Block Length Selection in the Bootstrap for Time Series.” Computational Statistics & Data Analysis 31 (3): 295–310. https://doi.org/10.1016/S0167-9473(99)00014-6.

Cai, T. Tony, and Wenguang Sun. 2017. “Large-Scale Global and Simultaneous Inference: Estimation and Testing in Very High Dimensions.” Annual Review of Economics 9 (1): 411–39. https://doi.org/10.1146/annurev-economics-063016-104355.

Candès, Emmanuel J., Yingying Fan, Lucas Janson, and Jinchi Lv. 2016. “Panning for Gold: Model-Free Knockoffs for High-Dimensional Controlled Variable Selection.” arXiv Preprint arXiv:1610.02351. https://arxiv.org/abs/1610.02351.

Candès, Emmanuel J., Michael B. Wakin, and Stephen P. Boyd. 2008. “Enhancing Sparsity by Reweighted ℓ 1 Minimization.” Journal of Fourier Analysis and Applications 14 (5-6): 877–905. https://doi.org/10.1007/s00041-008-9045-x.

Cawley, Gavin C., and Nicola L. C. Talbot. 2010. “On over-Fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation.” Journal of Machine Learning Research 11 (July): 2079−2107. http://jmlr.csail.mit.edu/papers/v11/cawley10a.html.

Chan, Ngai Hang, Ye Lu, and Chun Yip Yau. 2016. “Factor Modelling for High-Dimensional Time Series: Inference and Model Selection.” Journal of Time Series Analysis, January. https://doi.org/10.1111/jtsa.12207.

Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2016. “Double/Debiased Machine Learning for Treatment and Causal Parameters,” July. http://arxiv.org/abs/1608.00060.

Chernozhukov, Victor, Christian Hansen, Yuan Liao, and Yinchu Zhu. 2018. “Inference for Heterogeneous Effects Using Low-Rank Estimations,” December. http://arxiv.org/abs/1812.08089.

Chernozhukov, Victor, Whitney K. Newey, and Rahul Singh. 2018. “Learning L2 Continuous Regression Functionals via Regularized Riesz Representers,” September. http://arxiv.org/abs/1809.05224.

Claeskens, Gerda, and Nils Lid Hjort. 2008. Model Selection and Model Averaging. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge ; New York: Cambridge University Press.

Cox, D. R., and H. S. Battey. 2017. “Large Numbers of Explanatory Variables, a Semi-Descriptive Analysis.” Proceedings of the National Academy of Sciences 114 (32): 8592–5. https://doi.org/10.1073/pnas.1703764114.

Dai, Ran, and Rina Foygel Barber. 2016. “The Knockoff Filter for FDR Control in Group-Sparse and Multitask Regression.” arXiv Preprint arXiv:1602.03589. https://arxiv.org/abs/1602.03589.

Ding, J., V. Tarokh, and Y. Yang. 2018. “Model Selection Techniques: An Overview.” IEEE Signal Processing Magazine 35 (6): 16–34. https://doi.org/10.1109/MSP.2018.2867638.

Efron, Bradley. 1986. “How Biased Is the Apparent Error Rate of a Prediction Rule?” Journal of the American Statistical Association 81 (394): 461–70. https://doi.org/10.1080/01621459.1986.10478291.

Elhamifar, E., and R. Vidal. 2013. “Sparse Subspace Clustering: Algorithm, Theory, and Applications.” IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (11): 2765–81. https://doi.org/10.1109/TPAMI.2013.57.

Fan, Jianqing, and Runze Li. 2001. “Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties.” Journal of the American Statistical Association 96 (456): 1348–60. https://doi.org/10.1198/016214501753382273.

Fan, Jianqing, and Jinchi Lv. 2008. “Sure Independence Screening for Ultrahigh Dimensional Feature Space.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70 (5): 849–911. https://doi.org/10.1111/j.1467-9868.2008.00674.x.

Geman, Stuart, and Chii-Ruey Hwang. 1982. “Nonparametric Maximum Likelihood Estimation by the Method of Sieves.” The Annals of Statistics 10 (2): 401–14. https://doi.org/10.1214/aos/1176345782.

Guyon, Isabelle, and André Elisseeff. 2003. “An Introduction to Variable and Feature Selection.” Journal of Machine Learning Research 3 (Mar): 1157–82. http://www.jmlr.org/papers/v3/guyon03a.html.

Hong, X., R. J. Mitchell, S. Chen, C. J. Harris, K. Li, and G. W. Irwin. 2008. “Model Selection Approaches for Non-Linear System Identification: A Review.” International Journal of Systems Science 39 (10): 925–46. https://doi.org/10.1080/00207720802083018.

Ishwaran, Hemant, and J. Sunil Rao. 2005. “Spike and Slab Variable Selection: Frequentist and Bayesian Strategies.” The Annals of Statistics 33 (2): 730–73. https://doi.org/10.1214/009053604000001147.

Jamieson, Kevin, and Ameet Talwalkar. 2015. “Non-Stochastic Best Arm Identification and Hyperparameter Optimization,” February. http://arxiv.org/abs/1502.07943.

Janson, Lucas, William Fithian, and Trevor J. Hastie. 2015. “Effective Degrees of Freedom: A Flawed Metaphor.” Biometrika 102 (2): 479–85. https://doi.org/10.1093/biomet/asv019.

Johnson, Jerald B., and Kristian S. Omland. 2004. “Model Selection in Ecology and Evolution.” Trends in Ecology & Evolution 19 (2): 101–8. https://doi.org/10.1016/j.tree.2003.10.013.

Kloft, Marius, Ulrich Rückert, and Peter L. Bartlett. 2010. “A Unifying View of Multiple Kernel Learning.” In Machine Learning and Knowledge Discovery in Databases, edited by José Luis Balcázar, Francesco Bonchi, Aristides Gionis, and Michèle Sebag, 66–81. Lecture Notes in Computer Science. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-15883-4_5.

Konishi, Sadanori, and G. Kitagawa. 2008. Information Criteria and Statistical Modeling. Springer Series in Statistics. New York: Springer.

Konishi, Sadanori, and Genshiro Kitagawa. 1996. “Generalised Information Criteria in Model Selection.” Biometrika 83 (4): 875–90. https://doi.org/10.1093/biomet/83.4.875.

Li, Ker-Chau. 1987. “Asymptotic Optimality for $C_p, C_L$, Cross-Validation and Generalized Cross-Validation: Discrete Index Set.” The Annals of Statistics 15 (3): 958–75. https://doi.org/10.1214/aos/1176350486.

Li, Lisha, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. 2016. “Efficient Hyperparameter Optimization and Infinitely Many Armed Bandits,” March. http://arxiv.org/abs/1603.06560.

Machado, José A. F. 1993. “Robust Model Selection and M-Estimation.” Econometric Theory 9 (03): 478–93. https://doi.org/10.1017/S0266466600007775.

Massart, Pascal. 2007. Concentration Inequalities and Model Selection: Ecole d’Eté de Probabilités de Saint-Flour XXXIII - 2003. Lecture Notes in Mathematics 1896. Berlin ; New York: Springer-Verlag. http://www.cmap.polytechnique.fr/~merlet/articles/probas_massart_stf03.pdf.

Meinshausen, Nicolai, and Bin Yu. 2009. “Lasso-Type Recovery of Sparse Representations for High-Dimensional Data.” The Annals of Statistics 37 (1): 246–70. https://doi.org/10.1214/07-AOS582.

Paparoditis, Efstathios, and Theofanis Sapatinas. 2014. “Bootstrap-Based Testing for Functional Data,” September. http://arxiv.org/abs/1409.4317.

Qian, Guoqi, and Hans R. Künsch. 1998. “On Model Selection via Stochastic Complexity in Robust Linear Regression.” Journal of Statistical Planning and Inference 75 (1): 91–116. https://doi.org/10.1016/S0378-3758(98)00138-4.

Rao, C. R., and Y. Wu. 2001. “On Model Selection.” In Institute of Mathematical Statistics Lecture Notes - Monograph Series, 38:1–57. Beachwood, OH: Institute of Mathematical Statistics. http://projecteuclid.org/euclid.lnms/1215540960.

Rao, Radhakrishna, and Yuehua Wu. 1989. “A Strongly Consistent Procedure for Model Selection in a Regression Problem.” Biometrika 76 (2): 369–74. https://doi.org/10.1093/biomet/76.2.369.

Ročková, Veronika, and Edward I. George. 2018. “The Spike-and-Slab LASSO.” Journal of the American Statistical Association 113 (521): 431–44. https://doi.org/10.1080/01621459.2016.1260469.

Ronchetti, E. 2000. “Robust Regression Methods and Model Selection.” In Data Segmentation and Model Selection for Computer Vision, edited by Alireza Bab-Hadiashar and David Suter, 31–40. Springer New York. https://doi.org/10.1007/978-0-387-21528-0_2.

Royall, Richard M. 1986. “Model Robust Confidence Intervals Using Maximum Likelihood Estimators.” International Statistical Review / Revue Internationale de Statistique 54 (2): 221–26. https://doi.org/10.2307/1403146.

Shao, Jun. 1996. “Bootstrap Model Selection.” Journal of the American Statistical Association 91 (434): 655–65. https://doi.org/10.2307/2291661.

Shen, Xiaotong, and Hsin-Cheng Huang. 2006. “Optimal Model Assessment, Selection, and Combination.” Journal of the American Statistical Association 101 (474): 554–68. https://doi.org/10.1198/016214505000001078.

Shen, Xiaotong, Hsin-Cheng Huang, and Jimmy Ye. 2004. “Adaptive Model Selection and Assessment for Exponential Family Distributions.” Technometrics 46 (3): 306–17. https://doi.org/10.1198/004017004000000338.

Shen, Xiaotong, and Jianming Ye. 2002. “Adaptive Model Selection.” Journal of the American Statistical Association 97 (457): 210–21. https://doi.org/10.1198/016214502753479356.

Shibata, Ritei. 1989. “Statistical Aspects of Model Selection.” In From Data to Model, edited by Professor Jan C. Willems, 215–40. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-75007-6_5.

Stein, Charles M. 1981. “Estimation of the Mean of a Multivariate Normal Distribution.” The Annals of Statistics 9 (6): 1135–51. https://doi.org/10.1214/aos/1176345632.

Stone, M. 1977. “An Asymptotic Equivalence of Choice of Model by Cross-Validation and Akaike’s Criterion.” Journal of the Royal Statistical Society. Series B (Methodological) 39 (1): 44–47. http://www.stat.washington.edu/courses/stat527/s14/readings/Stone1977.pdf.

Takeuchi, Kei. 1976. “Distribution of informational statistics and a criterion of model fitting.” Suri-Kagaku (Mathematical Sciences) 153 (1): 12–18.

Taylor, Jonathan, Richard Lockhart, Ryan J. Tibshirani, and Robert Tibshirani. 2014. “Exact Post-Selection Inference for Forward Stepwise and Least Angle Regression,” January. http://arxiv.org/abs/1401.3889.

Tharmaratnam, Kukatharmini, and Gerda Claeskens. 2013. “A Comparison of Robust Versions of the AIC Based on M-, S- and MM-Estimators.” Statistics 47 (1): 216–35. https://doi.org/10.1080/02331888.2011.568120.

Tibshirani, Ryan J., Alessandro Rinaldo, Robert Tibshirani, and Larry Wasserman. 2015. “Uniform Asymptotic Inference and the Bootstrap After Model Selection,” June. http://arxiv.org/abs/1506.06266.

Tibshirani, Ryan J., and Jonathan Taylor. 2012. “Degrees of Freedom in Lasso Problems.” The Annals of Statistics 40 (2): 1198–1232. https://doi.org/10.1214/12-AOS1003.

Vansteelandt, Stijn, Maarten Bekaert, and Gerda Claeskens. 2012. “On Model Selection and Model Misspecification in Causal Inference.” Statistical Methods in Medical Research 21 (1): 7–30. https://doi.org/10.1177/0962280210387717.

Wahba, Grace. 1985. “A Comparison of GCV and GML for Choosing the Smoothing Parameter in the Generalized Spline Smoothing Problem.” The Annals of Statistics 13 (4): 1378–1402. https://doi.org/10.1214/aos/1176349743.

Zhao, Peng, Guilherme Rocha, and Bin Yu. 2006. “Grouped and Hierarchical Model Selection Through Composite Absolute Penalties.” http://digitalassets.lib.berkeley.edu/sdtr/ucb/text/703.pdf.

———. 2009. “The Composite Absolute Penalties Family for Grouped and Hierarchical Variable Selection.” The Annals of Statistics 37 (6A): 3468–97. https://doi.org/10.1214/07-AOS584.

Zhao, Peng, and Bin Yu. 2006. “On Model Selection Consistency of Lasso.” Journal of Machine Learning Research 7 (Nov): 2541–63. http://www.jmlr.org/papers/volume7/zhao06a/zhao06a.

Zou, Hui, and Runze Li. 2008. “One-Step Sparse Estimates in Nonconcave Penalized Likelihood Models.” The Annals of Statistics 36 (4): 1509–33. https://doi.org/10.1214/009053607000000802.