Understanding the “Degrees of freedom” of a model. Estimating that trace penalty matrix.
TODO: Explain AIC, \(C_p\), SURE, and BIC-type degrees of freedom, and whatever variants there are out there.
Complexity penalties crop up in model selection (i.e. choosing the complexity of model appropriate to your data), as seen in robust estimation and AIC/BIC.
Efron (Efro04) is an excellent introduction, compressing 30 years of theory into 2 pages. Massart (Mass00) seems more modern in flavour to me:
The reader who is not familiar with model selection via [complexity] penalization can legitimately ask the question: where does the idea of penalization come from? It is possible to answer this question at two different levels:
at some intuitive level by presenting the heuristics of one of the first criterion of this kind which has been introduced by Akaike (1973);
at some technical level by explaining why such a strategy of model selection has some chances to succeed.
Yuan and Lin (YuLi06) are an example of the kind of argument I need in order to use linear-model approximations for general application of DOF in sparse model selection.
Zou et al (ZoHT07):
Degrees of freedom is a familiar phrase for many statisticians. In linear regression the degrees of freedom is the number of estimated predictors. Degrees of freedom is often used to quantify the model complexity of a statistical modeling procedure (Hastie and Tibshirani HaTi90). However, generally speaking, there is no exact correspondence between the degrees of freedom and the number of parameters in the model (Ye, Ye98). […] Stein’s unbiased risk estimation (SURE) theory (Stei81) gives a rigorous definition of the degrees of freedom for any fitting procedure. […] Efron (Efro04) showed that \(C_p\) is an unbiased estimator of the true prediction error, and in some settings it offers substantially better accuracy than cross-validation and related nonparametric methods. Thus degrees of freedom plays an important role in model assessment and selection. Donoho and Johnstone (DoJo95) used the SURE theory to derive the degrees of freedom of soft thresholding and showed that it leads to an adaptive wavelet shrinkage procedure called SureShrink. Ye (Ye98) and Shen and Ye (ShYe02) showed that the degrees of freedom can capture the inherent uncertainty in modeling and frequentist model selection. Shen and Ye (ShYe02) and Shen, Huang and Ye (ShHY04) further proved that the degrees of freedom provides an adaptive model selection criterion that performs better than the fixed-penalty model selection criteria.
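To make the SURE definition concrete: Stein’s degrees of freedom for a fitting procedure is \(\operatorname{df} = \sum_i \operatorname{cov}(\hat{y}_i, y_i)/\sigma^2\), and ZoHT07’s result is that for the lasso the number of nonzero coefficients in the fit is an unbiased estimate of it. Here is a minimal Monte Carlo sketch of that claim (assuming scikit-learn is available; the design, sparsity and penalty level are arbitrary toy choices of mine):

```python
# Sketch: check ZoHT07's claim that the number of nonzero lasso coefficients
# estimates the Stein degrees of freedom
#   df = sum_i cov(yhat_i, y_i) / sigma^2.
# Toy design and penalty; assumes scikit-learn.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, sigma, lam = 100, 20, 1.0, 0.1
X = rng.standard_normal((n, p))
beta = np.concatenate([np.ones(5), np.zeros(p - 5)])
mu = X @ beta  # fixed "true" mean

n_rep = 2000
yhats = np.empty((n_rep, n))
ys = np.empty((n_rep, n))
nonzeros = np.empty(n_rep)
for r in range(n_rep):
    y = mu + sigma * rng.standard_normal(n)
    fit = Lasso(alpha=lam, fit_intercept=False).fit(X, y)
    ys[r], yhats[r] = y, fit.predict(X)
    nonzeros[r] = np.count_nonzero(fit.coef_)

# Monte Carlo estimate of Stein's df: sum of per-coordinate covariances.
df_stein = sum(np.cov(yhats[:, i], ys[:, i])[0, 1] for i in range(n)) / sigma**2
print(f"Stein df ~ {df_stein:.1f}, mean #nonzero ~ {nonzeros.mean():.1f}")
```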
Information Criteria
Akaike and friends. With M-estimation (e.g. maximum likelihood estimation and robust estimation), these are marvelous and general shortcuts for model selection (i.e. choosing the complexity of model appropriate to your data) without resorting to computationally expensive cross-validation.
For all of these, a thing called the number of effective degrees of freedom is important. There are several different definitions for that, and they only sometimes coincide, so I leave that for a different notebook.
Information criteria can then effectively do the same thing as cross-validation at a small fraction of the computational cost. In fact, they are asymptotically equivalent; see below.
Estimated cross entropy (KL divergence between model and… what?) seems to be the flavour of the minute in machine learning. (Because cross-validation is unbearably slow?) Here is Christopher Olah’s excellent visual explanation of it. These do relate, no? Actually I haven’t read this.
To learn:
- How this interacts with robust estimators
- How to use AIC with nonparametric or high-dimensional methods (GIC)
- How it relates to minimum description length (e.g. BHLL08)
Influential current English-language texts in this area are BuAn02, ClHj08 and KoKi08. The first of these is highly cited and brought the AIC method into the mainstream in the West from the specialised fringes where it had been. The latter two focus on extensions such as TIC and GIC.
TBD: general description.
TBD: clarify relationship to Minimum Description Length, Rissanen-style.
In the literature, selection criteria are usually classified into two categories: consistent (e.g., the Bayesian information criterion BIC, Schwarz, 1978) and efficient (e.g., the Akaike information criterion AIC, Akaike, 1974; the generalized cross-validation GCV, Craven and Wahba, 1979). A consistent criterion identifies the true model with a probability that approaches 1 in large samples when a set of candidate models contains the true model. An efficient criterion selects the model so that its average squared error is asymptotically equivalent to the minimum offered by the candidate models when the true model is approximated by a family of candidate models. Detailed discussions on efficiency and consistency can be found in Shibata (1981, 1984), Li (1987), Shao (1997) and McQuarrie and Tsai (1998).
Non-KL degrees of freedom
… is what I actually am interested in atm.
Efficient: AIC etc
Akaike Information Criterion (AIC)
The classic.
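As a reminder of the mechanics, a minimal sketch (toy data of my own, not from any cited source): \(\mathrm{AIC} = 2k - 2\log\hat{L}\), and for a Gaussian linear model with unknown noise variance \(-2\log\hat{L}\) reduces to \(n\log(\mathrm{RSS}/n)\) plus a constant, so candidate polynomial orders can be compared directly:

```python
# Sketch: AIC = 2k - 2 log Lhat for nested Gaussian linear models.
# With unknown noise variance, -2 log Lhat = n*log(RSS/n) up to a constant,
# and k counts the regression coefficients plus the variance parameter.
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(-2, 2, n)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.standard_normal(n)  # true order 2

def aic_polyfit(x, y, degree):
    X = np.vander(x, degree + 1)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    k = degree + 2  # coefficients + noise variance
    return n * np.log(rss / n) + 2 * k

aics = {d: aic_polyfit(x, y, d) for d in range(6)}
print(min(aics, key=aics.get), aics)  # expect degree 2 most of the time
```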
Takeuchi Information Criterion (TIC)
Apparently this one was influential in Japan, but untranslated into English, so only belatedly common in the West. Good explanations are in ClHj08 and KoKi08. Relaxes the assumption that the model is Fisher efficient (i.e. that the true generating process is included in your model, and with enough data you’d discover that).
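The resulting criterion is \(\mathrm{TIC} = -2\log\hat{L} + 2\operatorname{tr}(\hat{J}^{-1}\hat{K})\), where \(\hat{J}\) is the average negative per-observation Hessian of the log-likelihood and \(\hat{K}\) the average outer product of per-observation scores, both at the MLE; when the model is correct the trace reduces to the parameter count and TIC collapses to AIC. A toy sketch of that calculation, fitting a Gaussian model to deliberately non-Gaussian data (my own example, not from the cited texts):

```python
# Sketch: TIC = -2 log Lhat + 2 tr(Jhat^{-1} Khat) for a Gaussian model
# (parameters mu, s2) fitted by MLE to Laplace data, so the model is
# misspecified and tr(Jhat^{-1} Khat) exceeds the parameter count 2.
import numpy as np

rng = np.random.default_rng(2)
x = rng.laplace(size=5000)  # heavy-tailed, not Gaussian

mu, s2 = x.mean(), x.var()  # Gaussian MLE
r = x - mu

# Per-observation scores of log N(x; mu, s2) w.r.t. (mu, s2)
scores = np.stack([r / s2, -0.5 / s2 + r**2 / (2 * s2**2)], axis=1)
K = scores.T @ scores / len(x)

# Average negative Hessian (entries averaged over observations)
H_mumu = -1.0 / s2
H_mus2 = np.mean(-r / s2**2)
H_s2s2 = np.mean(0.5 / s2**2 - r**2 / s2**3)
J = -np.array([[H_mumu, H_mus2], [H_mus2, H_s2s2]])

loglik = np.sum(-0.5 * np.log(2 * np.pi * s2) - r**2 / (2 * s2))
pen = np.trace(np.linalg.solve(J, K))
print(f"tr(J^-1 K) = {pen:.2f}, TIC = {-2*loglik + 2*pen:.1f}, AIC = {-2*loglik + 4:.1f}")
```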
Konishi and Kitagawa’s Generalised Information Criterion (GIC)
Taking information criteria to general (e.g. robust, penalised) M-estimation instead of purely ML estimation, and relaxing the assumption that we even have the “true” model in our class (KoKi96); cf. BuNo95, probably others. In particular, you are no longer trying to fit the model by minimising least-squares errors, for example. ClHj08 mention the “Robustified Information Criterion” in passing, which may relate?
TBD: Explain my laborious reasoning that generalised Akaike information criteria for penalised regression don’t seem to work when the penalty term is not differentiable (cross-validation works fine though, and possibly also BIC), and the issues that therefore arise in model selection for such models in the sparse case.
Focussed information criterion (FIC)
Claeskens and Hjort define this (ClHj08, chapter 6):
The model selection methods presented earlier (such as AIC and the BIC) have one thing in common: they select one single ‘best model’, which should then be used to explain all aspects of the mechanisms underlying the data and predict all future data points. The tolerance discussion in chapter 5 showed that sometimes one model is best for estimating one type of estimand, whereas another model is best for another estimand. The point of view expressed via the [FIC] is that a ‘best model’ should depend on the parameter under focus, such as the mean, or the variance, or the particular covariate values, etc. Thus the FIC allows and encourages different models to be selected for different parameters of interest.
This sounds very logical; of course, then one must do more work to make it go.
Network information criterion
MuYA94: “an estimator of the expected loss of a loss function \(\ell(\theta)+\lambda H(\theta)\) where \(H(\theta)\) is a regularisation term”.
Regularization information criterion
Shib89: is this distinct from GIC?
Bootstrap information criterion
A compromise between the computational cheapness of information criteria and the practical simplicity of cross-validation.
KoKi08 ch. 8. See ClHj08 6.3 for a bootstrap-FIC.
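One common form of that compromise, sketched here from memory of the flavour of KoKi08 ch. 8 (so treat the details as my assumption rather than their exact estimator): estimate the optimism of the maximised log-likelihood by refitting to bootstrap resamples, and plug that in where AIC would use the parameter count \(k\).

```python
# Sketch: bootstrap estimate of the AIC-type bias correction.
# Optimism is estimated as
#   bhat = mean_b [ loglik(x*_b; theta(x*_b)) - loglik(x; theta(x*_b)) ],
# and the criterion is -2 loglik(x; theta(x)) + 2 bhat.
# For a correctly specified Gaussian model, bhat should be close to k = 2.
import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_normal(500)

def loglik(data, mu, s2):
    return np.sum(-0.5 * np.log(2 * np.pi * s2) - (data - mu) ** 2 / (2 * s2))

def mle(data):
    return data.mean(), data.var()

mu_hat, s2_hat = mle(x)
B, bias = 500, 0.0
for _ in range(B):
    xb = rng.choice(x, size=len(x), replace=True)
    mu_b, s2_b = mle(xb)
    bias += (loglik(xb, mu_b, s2_b) - loglik(x, mu_b, s2_b)) / B

eic = -2 * loglik(x, mu_hat, s2_hat) + 2 * bias
aic = -2 * loglik(x, mu_hat, s2_hat) + 2 * 2
print(f"bootstrap bias ~ {bias:.2f} (k = 2), criterion = {eic:.1f}, AIC = {aic:.1f}")
```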
Consistency of model order selected by AIC
Akaike information criteria are not asymptotically consistent (see KoKi08), in the sense that even if there is a true model among the candidates, you do not select it with probability approaching 1 in the large-sample limit. However, the distribution of selected model orders does not get worse as n increases. BuAn02 6.3 and KoKi08 3.5.2 discuss this. In a sense it would be surprising if AIC did do especially well at selecting model order, since the criterion is designed to minimise prediction error, not model-selection error. Model order is more or less a nuisance parameter in this framework.
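A quick simulation sketch of that point (my own toy setup, not from the cited texts): with polynomial regression and the true order among the candidates, BIC’s hit rate climbs towards 1 as \(n\) grows, while AIC keeps a roughly constant probability of overshooting the order.

```python
# Sketch: AIC is not consistent for model order; BIC is (true model in the
# candidate set). Count how often each criterion picks the true polynomial
# degree as n grows.
import numpy as np

rng = np.random.default_rng(3)
true_deg, max_deg, n_rep = 2, 6, 300

def pick(x, y, penalty):
    n = len(y)
    best, best_score = None, np.inf
    for d in range(max_deg + 1):
        X = np.vander(x, d + 1)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = np.sum((y - X @ beta) ** 2)
        k = d + 2  # coefficients + noise variance
        score = n * np.log(rss / n) + penalty(k, n)
        if score < best_score:
            best, best_score = d, score
    return best

for n in (50, 200, 1000, 5000):
    hits = {"AIC": 0, "BIC": 0}
    for _ in range(n_rep):
        x = rng.uniform(-2, 2, n)
        y = 1 + 2 * x - 0.5 * x**2 + rng.standard_normal(n)
        hits["AIC"] += pick(x, y, lambda k, n: 2 * k) == true_deg
        hits["BIC"] += pick(x, y, lambda k, n: k * np.log(n)) == true_deg
    print(n, {c: h / n_rep for c, h in hits.items()})
```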
TBC.
Cross-validation equivalence
KoKi08 10.1.4 discusses the asymptotic equivalence of AIC/TIC/GIC and cross-validation under various circumstances, attributing the equivalence results to Ston77 and Shib89. ClHj08 prove a similar result.
Automatic GIC
TBD. I know that KoKi96 give formulae for loss functions for ANY M-estimation and penalisation procedure, but in general the degrees-of-freedom matrix trace calculation is nasty, and only in principle estimable from the data, requiring a matrix product of the Hessian at every data point. This is not necessarily computationally tractable; I know of formulae only for GLMs and robust regression with \(\ell_2\) penalties. Can we get such penalties for more general ML fits?
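For contrast, the easy case: for an \(\ell_2\)-penalised linear smoother the trace is explicit, \(\operatorname{df}(\lambda) = \operatorname{tr}\bigl(X(X^\top X + \lambda I)^{-1}X^\top\bigr)\), computable from the singular values of \(X\) alone. A minimal sketch (toy design matrix of my own):

```python
# Sketch: effective degrees of freedom of ridge regression,
#   yhat = H(lam) y,  H(lam) = X (X'X + lam I)^{-1} X',  df(lam) = tr H(lam),
# computed as sum_j s_j^2 / (s_j^2 + lam) from the singular values of X.
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 10
X = rng.standard_normal((n, p))

def ridge_df(X, lam):
    s = np.linalg.svd(X, compute_uv=False)
    return np.sum(s**2 / (s**2 + lam))

for lam in (0.0, 1.0, 10.0, 100.0):
    print(f"lambda={lam:6.1f}  df={ridge_df(X, lam):5.2f}")  # shrinks from p toward 0
```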
GIC & the LASSO
I thought this didn’t work because we needed the second derivative of the penalty; but see ZhLT10.
Information criteria at scale
Big-data information criteria. AIC is already computationally cheaper than cross-validation. What about when my data is so large that I would like to select my model before looking at all of it, with such-and-such a guarantee of goodness? Can I do AIC at scale? If I am fitting a model using SGD, can I estimate my model order using partial data? How? I’m interested in doing this in a way that preserves the property of being computationally cheaper than cross-validating.
Here’s an example… Bondell et al (BoKG10):
In order to avoid complete enumeration of all possible \(2^{p+q}\) models, Wolfinger (1993) and Diggle, Liang and Zeger (1994) recommended the Restricted Information Criterion (denoted by REML.IC), in that, by using the most complex mean structure, selection is first performed on the variance-covariance structure by computing the AIC and/or BIC. Given the best covariance structure, selection is then performed on the fixed effects. Alternatively, Pu and Niu (2006) proposed the EGIC (Extended GIC), where using the BIC, selection is first performed on the fixed effects by including all of the random effects into the model. Once the fixed effect structure is chosen, selection is then performed on the random effects.
In general I’d like to avoid enumerating the models as much as possible and simply select relevant predictors with high probability, compressive-sensing style.
Consistent: Bayesian Information Criteria
a.k.a. the Schwarz Information Criterion. Also co-invented by the unstoppable Akaike (Schw78, Akai78).
This is a different family to the original AIC. This has a justification in terms of MDL and of Bayes risk? Different regularity conditions, something something…
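The Bayes-risk story, roughly (the usual textbook derivation, stated here without the regularity conditions): a Laplace approximation of the log marginal likelihood of a model \(M\) with \(k\) parameters gives
\[
\log \int p(y \mid \theta, M)\,\pi(\theta \mid M)\,d\theta = \log p(y \mid \hat\theta, M) - \tfrac{k}{2}\log n + O_p(1),
\]
so choosing the model with the smallest \(\mathrm{BIC} = k\log n - 2\log p(y \mid \hat\theta, M)\) approximately maximises the posterior model probability under a flat prior over models. The \(k\log n\) penalty, growing with \(n\), is what buys consistency, in contrast to AIC’s fixed \(2k\).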
How would this work with regularisation? Apparently Mach93 gives extensions for robust fitting, much as the GIC does for the AIC. ClHj08 give an easy summary and more general settings.
Consistent and/or efficient: Nishii’s Generalised Information Criterion
Nish84, commended by ZhLT10 as a unifying formalism for these efficient/consistent criteria, which includes efficient- and consistent-type information penalties as special cases. I don’t know much about this.
Refs
 ThCl13: (2013) A comparison of robust versions of the AIC based on M-, S- and MM-estimators. Statistics, 47(1), 216–235. DOI
 BuNo95: (1995) A general Akaiketype criterion for model selection in robust regression. Biometrika, 82(4), 877–886. DOI
 HMXZ09: (2009) A group bridge approach for variable selection. Biometrika, 96(2), 339–355. DOI
 Akai78: (1978) A new look at the Bayes procedure. Biometrika, 65(1), 53–59. DOI
 ChLW14: (2014) A Practical Scheme and Fast Algorithm to Tune the Lasso With Optimality Guarantees. ArXiv:1410.0247 [Math, Stat].
 RaWu89: (1989) A strongly consistent procedure for model selection in a regression problem. Biometrika, 76(2), 369–374. DOI
 SKWG14: (2014) A unifying approach to the estimation of the conditional Akaike information in generalized linear mixed models. Electronic Journal of Statistics, 8(1), 201–225. DOI
 DoJo95: (1995) Adapting to Unknown Smoothness via Wavelet Shrinkage. Journal of the American Statistical Association, 90(432), 1200–1224. DOI
 ShYe02: (2002) Adaptive Model Selection. Journal of the American Statistical Association, 97(457), 210–221. DOI
 ShHY04: (2004) Adaptive Model Selection and Assessment for Exponential Family Distributions. Technometrics, 46(3), 306–317. DOI
 CaSh98: (1998) An Akaike information criterion for model selection in the presence of incomplete data. Journal of Statistical Planning and Inference, 67(1), 45–65. DOI
 Ston77: (1977) An Asymptotic Equivalence of Choice of Model by Cross-Validation and Akaike’s Criterion. Journal of the Royal Statistical Society. Series B (Methodological), 39(1), 44–47.
 Li87: (1987) Asymptotic Optimality for \(C_p, C_L\), Cross-Validation and Generalized Cross-Validation: Discrete Index Set. The Annals of Statistics, 15(3), 958–975. DOI
 Nish84: (1984) Asymptotic Properties of Criteria for Selection of Variables in Multiple Regression. The Annals of Statistics, 12(2), 758–765. DOI
 ClKO09: (2009) Asymptotic properties of penalized spline estimators. Biometrika, 96(3), 529–544. DOI
 KoKi03: (2003) Asymptotic theory for information criteria in model selection—functional approach. Journal of Statistical Planning and Inference, 114(1–2), 45–61. DOI
 BaHy01: (2001) Bandwidth selection for kernel conditional density estimation. Computational Statistics & Data Analysis, 36(3), 279–298. DOI
 Ston79: (1979) Comments on Model Selection Criteria of Akaike and Schwarz. Journal of the Royal Statistical Society. Series B (Methodological), 41(2), 276–278.
 Mass07: (2007) Concentration inequalities and model selection: Ecole d’Eté de Probabilités de Saint-Flour XXXIII, 2003. Berlin; New York: Springer-Verlag
 Bune04: (2004) Consistent covariate selection and post model selection inference in semiparametric regression. The Annals of Statistics, 32(3), 898–927. DOI
 Tibs15: (2015) Degrees of Freedom and Model Search. Statistica Sinica, 25(3), 1265–1296.
 JaFH15: (2015) Effective degrees of freedom: a flawed metaphor. Biometrika, 102(2), 479–485. DOI
 LiLe16: (2016) Efficient Feature Selection With Large and High-dimensional Data. ArXiv:1609.07195 [Stat].
 Barr86: (1986) Entropy and the Central Limit Theorem. The Annals of Probability, 14(1), 336–342. DOI
 Schw78: (1978) Estimating the Dimension of a Model. The Annals of Statistics, 6(2), 461–464. DOI
 ImKo99: (1999) Estimation of B-spline Nonparametric Regression Models using Information Criteria.
 Stei81: (1981) Estimation of the Mean of a Multivariate Normal Distribution. The Annals of Statistics, 9(6), 1135–1151. DOI
 ChCh08: (2008) Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3), 759–771. DOI
 Sugi78: (1978) Further analysts of the data by Akaike’s Information Criterion and the finite corrections. Communications in Statistics - Theory and Methods, 7(1), 13–26. DOI
 KoKi96: (1996) Generalised information criteria in model selection. Biometrika, 83(4), 875–890. DOI
 HaTi90: (1990) Generalized additive models (Vol. 43). CRC Press
 Efro86: (1986) How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association, 81(394), 461–470. DOI
 KoKi08: (2008) Information criteria and statistical modeling. New York: Springer
 Akai73: (1973) Information Theory and an Extension of the Maximum Likelihood Principle. In Proceeding of the Second International Symposium on Information Theory (pp. 199–213). Budapest: Akademiai Kiado
 LeBa06: (2006) Information Theory and Mixing Least-Squares Regressions. IEEE Transactions on Information Theory, 52(8), 3396–3410. DOI
 BoKG10: (2010) Joint Variable Selection for Fixed and Random Effects in Linear MixedEffects Models. Biometrics, 66(4), 1069–1077. DOI
 Akai81: (1981) Likelihood of a model and information criteria. Journal of Econometrics, 16(1), 3–14. DOI
 Akai73: (1973) Maximum likelihood identification of Gaussian autoregressive moving average models. Biometrika, 60(2), 255–265. DOI
 BHLL08: (2008) MDL, penalized likelihood, and statistical risk. In Information Theory Workshop, 2008. ITW’08. IEEE (pp. 247–257). IEEE DOI
 BiMa06: (2006) Minimal Penalties for Gaussian Model Selection. Probability Theory and Related Fields, 138(1–2), 33–73. DOI
 LiBa00: (2000) Mixture Density Estimation. In Advances in Neural Information Processing Systems 12 (pp. 279–285). MIT Press
 BuBA97: (1997) Model Selection: An Integral Part of Inference. Biometrics, 53(2), 603–618. DOI
 YuLi06: (2006) Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49–67. DOI
 ClHj08: (2008) Model selection and model averaging. Cambridge ; New York: Cambridge University Press
 BuAn02: (2002) Model selection and multimodel inference: a practical informationtheoretic approach. New York: Springer
 Riss78: (1978) Modeling by shortest data description. Automatica, 14(5), 465–471. DOI
 BuAn04: (2004) Multimodel Inference: Understanding AIC and BIC in Model Selection. Sociological Methods & Research, 33(2), 261–304. DOI
 MuYA94: (1994) Network information criterion: determining the number of hidden units for an artificial neural network model. IEEE Transactions on Neural Networks, 5(6), 865–872. DOI
 AnKI08: (2008) Nonlinear regression modeling via regularized radial basis function networks. Journal of Statistical Planning and Inference, 138(11), 3616–3633. DOI
 Ye98: (1998) On Measuring and Correcting the Effects of Data Mining and Model Selection. Journal of the American Statistical Association, 93(441), 120–131. DOI
 RaWu01: (2001) On model selection. In Institute of Mathematical Statistics Lecture Notes  Monograph Series (Vol. 38, pp. 1–57). Beachwood, OH: Institute of Mathematical Statistics
 QiKü98: (1998) On model selection via stochastic complexity in robust linear regression. Journal of Statistical Planning and Inference, 75(1), 91–116. DOI
 Kato09: (2009) On the degrees of freedom in shrinkage estimation. Journal of Multivariate Analysis, 100(7), 1338–1352. DOI
 ZoHT07: (2007) On the “degrees of freedom” of the lasso. The Annals of Statistics, 35(5), 2173–2192. DOI
 Tadd13: (2013) One-step estimator paths for concave regularization. ArXiv:1308.5623 [Stat].
 ShHu06: (2006) Optimal Model Assessment, Selection, and Combination. Journal of the American Statistical Association, 101(474), 554–568. DOI
 HuTs89: (1989) Regression and time series model selection in small samples. Biometrika, 76(2), 297–307. DOI
 Tibs96: (1996) Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1), 267–288.
 BLTG06: (2006) Regularization in statistics. Test, 15(2), 271–344. DOI
 ZhLT10: (2010) Regularization Parameter Selections via Generalized Information Criterion. Journal of the American Statistical Association, 105(489), 312–323. DOI
 HuCB08: (2008) Risk of penalized least squares, greedy selection and l1 penalization for flexible function libraries.
 Mach93: (1993) Robust Model Selection and MEstimation. Econometric Theory, 9(03), 478–493. DOI
 HuST98: (1998) Smoothing Parameter Selection in Nonparametric Regression Using an Improved Akaike Information Criterion. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 60(2), 271–293.
 Mass00: (2000) Some applications of concentration inequalities to statistics. In Annales de la Faculté des sciences de Toulouse: Mathématiques (Vol. 9, pp. 245–303).
 QiHa96: (1996) Some notes on Rissanen’s stochastic complexity
 Shib89: (1989) Statistical Aspects of Model Selection. In From Data to Model (pp. 215–240). Springer Berlin Heidelberg DOI
 DKFP11: (2011) The degrees of freedom of the Lasso for general design matrix. ArXiv:1111.1162 [Cs, Math, Stat].
 Efro04: (2004) The Estimation of Prediction Error. Journal of the American Statistical Association, 99(467), 619–632. DOI
 BaRY98: (1998) The minimum description length principle in coding and modeling. IEEE Transactions on Information Theory, 44(6), 2743–2760. DOI
 Cava97: (1997) Unifying the derivations for the Akaike and corrected Akaike information criteria. Statistics & Probability Letters, 33(2), 201–208. DOI
 LiLi08: (2008) Variable selection in semiparametric regression modeling. The Annals of Statistics, 36(1), 261–286. DOI
 FaLi01: (2001) Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties. Journal of the American Statistical Association, 96(456), 1348–1360. DOI
 KaRo14: (2014) When does more regularization imply fewer degrees of freedom? Sufficient conditions and counterexamples. Biometrika, 101(4), 771–784. DOI