Penalised/regularised regression

On regression estimation with penalties on the model parameters. I am especially interested in the case where the penalties are sparsifying, and I have more notes on that under sparse regression.

Here I consider general penalties: ridge etc. At least in principle; I have no active projects using non-sparsifying penalties at the moment.

Why might I use such penalties? One reason would be that $L_2$ penalties have simple forms for their information criteria, as shown by Konishi and Kitagawa (KoKi08 5.2.4).
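To make that concrete, here is a sketch of the standard ridge quantities that make the $L_2$ case tractable. The notation is my own assumption, not taken from KoKi08: design matrix $X$ with singular values $d_j$, response $y$, penalty weight $\lambda \ge 0$.

$$
\hat\beta_\lambda
= \arg\min_\beta \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2
= (X^\top X + \lambda I)^{-1} X^\top y
$$

$$
\hat y = H_\lambda y, \qquad
H_\lambda = X (X^\top X + \lambda I)^{-1} X^\top, \qquad
\operatorname{tr} H_\lambda = \sum_j \frac{d_j^2}{d_j^2 + \lambda}
$$

The fit is linear in $y$, and $\operatorname{tr} H_\lambda$ is the usual "effective number of parameters" that stands in for the parameter count in AIC-type criteria. For the exact criterion, see KoKi08 rather than this sketch.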

To discuss:

Ridge penalties, relationship with robust regression etc.

In nonparametric statistics we might estimate simultaneously what look like many, many parameters, which we constrain in some clever fashion, which usually boils down to something we can interpret as a “penalty” on the parameters.
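A minimal sketch of that idea, assuming a Whittaker-style smoother as the example (my choice of illustration, not something from the references below): there is one parameter per observation, and the "clever constraint" is a penalty on squared second differences. The function names and the value of `lam` are hypothetical.

```python
import numpy as np

def second_difference_matrix(n):
    """(n-2) x n matrix D with (D f)[i] = f[i] - 2 f[i+1] + f[i+2]."""
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    return D

def penalised_smooth(y, lam):
    """Minimise ||y - f||^2 + lam * ||D f||^2 over f (one parameter per point)."""
    n = len(y)
    D = second_difference_matrix(n)
    # Setting the gradient to zero gives (I + lam * D^T D) f = y.
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 100)
y = np.sin(2 * np.pi * t) + 0.3 * rng.standard_normal(t.size)  # noisy curve
f_hat = penalised_smooth(y, lam=50.0)

D = second_difference_matrix(t.size)
print("roughness of data:", np.sum((D @ y) ** 2))
print("roughness of fit: ", np.sum((D @ f_hat) ** 2))
```

There are as many parameters as data points, so without the penalty the fit would simply reproduce `y`. The penalty is what makes the problem worth solving.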

“Penalization” has a genealogy unknown to me, but is probably the least abstruse term for common, general usage.

The “regularisation” nomenclature claims descent from Tikhonov (e.g. TiGl65), who wanted to solve ill-conditioned integral and differential equations, so it’s somewhat more general. “Smoothing” seems to be common in the spline and kernel estimate communities of Wahba (Wahb90), Silverman (Silv82) et al., who usually actually want to smooth curves. When you say “smoothing” you usually mean that you can express your predictions via a “linear smoother”/hat matrix, which has certain nice properties in generalised cross validation.
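As a hedged illustration of the hat-matrix point, here is a sketch using ridge regression as the linear smoother and the usual generalised cross validation score, $\mathrm{GCV}(\lambda) = \frac{1}{n}\|y - H_\lambda y\|^2 / (1 - \operatorname{tr}(H_\lambda)/n)^2$. The simulated data, function names and grid of penalty weights are all my own assumptions.

```python
import numpy as np

def ridge_hat_matrix(X, lam):
    """Hat matrix H with y_hat = H y for ridge regression with penalty lam."""
    p = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

def gcv_score(X, y, lam):
    """Generalised cross validation score; needs only residuals and tr(H)."""
    n = X.shape[0]
    H = ridge_hat_matrix(X, lam)
    resid = y - H @ y
    edf = np.trace(H)  # effective degrees of freedom of the smoother
    return (resid @ resid / n) / (1.0 - edf / n) ** 2

# Simulated data (an assumption for illustration only).
rng = np.random.default_rng(1)
n, p = 60, 10
X = rng.standard_normal((n, p))
beta = np.concatenate([np.ones(3), np.zeros(p - 3)])
y = X @ beta + rng.standard_normal(n)

lams = np.logspace(-3, 3, 25)
scores = np.array([gcv_score(X, y, lam) for lam in lams])
best_lam = lams[np.argmin(scores)]
print("GCV-selected lambda:", best_lam)
```

The nice property is that no refitting on held-out folds is needed: because the predictions are linear in `y`, the score depends only on the residuals and the trace of the hat matrix.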

“Smoothing” is not a great general term, since penalisation does not necessarily cause “smoothness”. For example, some penalties cause the coefficients to become sparse and therefore, from the perspective of the coefficients, promote non-smooth vectors.
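A small sketch of the contrast, assuming an orthonormal design ($X^\top X = I$) so that both penalised estimates have closed forms in terms of $z = X^\top y$. Minimising $\tfrac12\|y - X\beta\|^2 + \lambda\|\beta\|_1$ gives soft thresholding of $z$, while minimising $\tfrac12\|y - X\beta\|^2 + \tfrac{\lambda}{2}\|\beta\|_2^2$ gives uniform shrinkage. The numbers in `z` are made up for illustration.

```python
import numpy as np

def soft_threshold(z, lam):
    """Lasso solution under an orthonormal design: shrink towards 0, clip at 0."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ridge_shrink(z, lam):
    """Ridge solution under an orthonormal design: rescale every coefficient."""
    return z / (1.0 + lam)

z = np.array([3.0, -0.4, 0.1, -2.0, 0.7])  # hypothetical least-squares coefficients
lam = 0.5

lasso_coef = soft_threshold(z, lam)
ridge_coef = ridge_shrink(z, lam)
print("lasso:", lasso_coef)
print("ridge:", ridge_coef)
print("exact zeros (lasso, ridge):",
      np.sum(lasso_coef == 0), np.sum(ridge_coef == 0))
```

The $L_1$ penalty sets the small coefficients exactly to zero, giving a spiky vector. The $L_2$ penalty only rescales them, so no coefficient is removed.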

In every case, you wish to solve an ill-conditioned inverse problem, so you tame it by adding a penalty to solutions you feel you should be reluctant to accept.
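A minimal sketch of the taming, assuming a small Hilbert matrix as the ill-conditioned operator and a plain Tikhonov ($L_2$) penalty on the solution norm. The matrix, noise level and `lam` are my own illustrative choices, not tuned in any principled way.

```python
import numpy as np

n = 8
idx = np.arange(n)
A = 1.0 / (idx[:, None] + idx[None, :] + 1.0)  # Hilbert matrix: badly conditioned
x_true = np.ones(n)

rng = np.random.default_rng(2)
b = A @ x_true + 1e-6 * rng.standard_normal(n)  # slightly noisy observations

# Naive inversion: the tiny noise is amplified by the small singular values.
x_naive = np.linalg.solve(A, b)

# Tikhonov: minimise ||A x - b||^2 + lam * ||x||^2 via the normal equations.
lam = 1e-8
x_reg = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

err_naive = np.linalg.norm(x_naive - x_true)
err_reg = np.linalg.norm(x_reg - x_true)
print("condition number:", np.linalg.cond(A))
print("error, naive:    ", err_naive)
print("error, Tikhonov: ", err_reg)
```

The penalty expresses reluctance to accept solutions with a large norm, which are exactly the ones the naive inverse produces when it amplifies noise.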

TODO: specifics

What should we regularize to attain specific kinds of solutions?

Here’s one thing I saw recently:

Venkat Chandrasekaran, Learning Semidefinite Regularizers via Matrix Factorization

Abstract: Regularization techniques are widely employed in the solution of inverse problems in data analysis and scientific computing due to their effectiveness in addressing difficulties due to ill-posedness. In their most common manifestation, these methods take the form of penalty functions added to the objective in optimization-based approaches for solving inverse problems. The purpose of the penalty function is to induce a desired structure in the solution, and these functions are specified based on prior domain-specific expertise. We consider the problem of learning suitable regularization functions from data in settings in which prior domain knowledge is not directly available. Previous work under the title of ‘dictionary learning’ or ‘sparse coding’ may be viewed as learning a polyhedral regularizer from data. We describe generalizations of these methods to learn semidefinite regularizers by computing structured factorizations of data matrices. Our algorithmic approach for computing these factorizations combines recent techniques for rank minimization problems along with operator analogs of Sinkhorn scaling. The regularizers obtained using our framework can be employed effectively in semidefinite programming relaxations for solving inverse problems. (Joint work with Yong Sheng Soh)

Refs

Akai73a
Akaike, H. (1973a) Information Theory and an Extension of the Maximum Likelihood Principle. In P. F. Caski (Ed.), Proceedings of the Second International Symposium on Information Theory (pp. 199–213). Budapest: Akademiai Kiado
Akai73b
Akaike, H. (1973b) Maximum likelihood identification of Gaussian autoregressive moving average models. Biometrika, 60(2), 255–265. DOI.
AzKS15
Azizyan, M., Krishnamurthy, A., & Singh, A. (2015) Extreme Compressive Sampling for Covariance Estimation. arXiv:1506.00898 [Cs, Math, Stat].
BCFS14
Banerjee, A., Chen, S., Fazayeli, F., & Sivakumar, V. (2014) Estimation with Norm Regularization. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 27 (pp. 1556–1564). Curran Associates, Inc.
BHLL08
Barron, A. R., Huang, C., Li, J. Q., & Luo, X. (2008) MDL, penalized likelihood, and statistical risk. In Information Theory Workshop, 2008. ITW’08. IEEE (pp. 247–257). IEEE DOI.
Batt92
Battiti, R. (1992) First-and second-order methods for learning: between steepest descent and Newton’s method. Neural Computation, 4(2), 141–166. DOI.
BüGe11
Bühlmann, P., & Geer, S. van de. (2011) Additive models and many smooth univariate functions. In Statistics for High-Dimensional Data (pp. 77–97). Springer Berlin Heidelberg
BüGe15
Bühlmann, P., & van de Geer, S. (2015) High-dimensional inference in misspecified linear models. arXiv:1503.06426 [Stat], 9(1), 1449–1473. DOI.
BuNo95
Burman, P., & Nolan, D. (1995) A general Akaike-type criterion for model selection in robust regression. Biometrika, 82(4), 877–886. DOI.
CaFe13
Candès, E. J., & Fernandez-Granda, C. (2013) Super-Resolution from Noisy Data. Journal of Fourier Analysis and Applications, 19(6), 1229–1254. DOI.
CaPl10
Candès, E. J., & Plan, Y. (2010) Matrix Completion With Noise. Proceedings of the IEEE, 98(6), 925–936. DOI.
Cava97
Cavanaugh, J. E. (1997) Unifying the derivations for the Akaike and corrected Akaike information criteria. Statistics & Probability Letters, 33(2), 201–208. DOI.
ChWa00
Chen, Y.-C., & Wang, Y.-X. (n.d.) Discussion on “Confidence Intervals and Hypothesis Testing for High-Dimensional Regression”.
Efro04
Efron, B. (2004) The Estimation of Prediction Error. Journal of the American Statistical Association, 99(467), 619–632. DOI.
FlHS13
Flynn, C. J., Hurvich, C. M., & Simonoff, J. S. (2013) Efficiency for Regularization Parameter Selection in Penalized Likelihood Estimation of Misspecified Models. arXiv:1302.2068 [Stat].
GiSB14
Giryes, R., Sapiro, G., & Bronstein, A. M. (2014) On the Stability of Deep Networks. arXiv:1412.5896 [Cs, Math, Stat].
GuLi05
Gui, J., & Li, H. (2005) Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics, 21(13), 3001–3008. DOI.
HaTi90
Hastie, T. J., & Tibshirani, R. J. (1990) Generalized additive models (Vol. 43). CRC Press
HaTW15
Hastie, T. J., Tibshirani, R., & Wainwright, M. J. (2015) Statistical Learning with Sparsity: The Lasso and Generalizations. Boca Raton: Chapman and Hall/CRC
HaKD13
Hawe, S., Kleinsteuber, M., & Diepold, K. (2013) Analysis operator learning and its application to image reconstruction. IEEE Transactions on Image Processing, 22(6), 2138–2150. DOI.
HeIS15
Hegde, C., Indyk, P., & Schmidt, L. (2015) A nearly-linear time framework for graph-structured sparsity. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15) (pp. 928–937).
HoKe70
Hoerl, A. E., & Kennard, R. W. (1970) Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 12(1), 55–67. DOI.
JaFH15
Janson, L., Fithian, W., & Hastie, T. J. (2015) Effective degrees of freedom: a flawed metaphor. Biometrika, 102(2), 479–485. DOI.
JaMo14
Javanmard, A., & Montanari, A. (2014) Confidence Intervals and Hypothesis Testing for High-dimensional Regression. Journal of Machine Learning Research, 15(1), 2869–2909.
KaRo14
Kaufman, S., & Rosset, S. (2014) When does more regularization imply fewer degrees of freedom? Sufficient conditions and counterexamples. Biometrika, 101(4), 771–784. DOI.
KoKi96
Konishi, S., & Kitagawa, G. (1996) Generalised information criteria in model selection. Biometrika, 83(4), 875–890. DOI.
KoKi08
Konishi, S., & Kitagawa, G. (2008) Information criteria and statistical modeling. New York: Springer
Mont12
Montanari, A. (2012) Graphical models concepts in compressed sensing. Compressed Sensing: Theory and Applications, 394–438.
NeTr08
Needell, D., & Tropp, J. A. (2008) CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. arXiv:0803.2392 [Cs, Math].
RaRe09
Rahimi, A., & Recht, B. (2009) Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning. In Advances in neural information processing systems (pp. 1313–1320). Curran Associates, Inc.
ShHu06
Shen, X., & Huang, H.-C. (2006) Optimal Model Assessment, Selection, and Combination. Journal of the American Statistical Association, 101(474), 554–568. DOI.
ShHY04
Shen, X., Huang, H.-C., & Ye, J. (2004) Adaptive Model Selection and Assessment for Exponential Family Distributions. Technometrics, 46(3), 306–317. DOI.
ShYe02
Shen, X., & Ye, J. (2002) Adaptive Model Selection. Journal of the American Statistical Association, 97(457), 210–221. DOI.
Silv82
Silverman, B. W. (1982) On the Estimation of a Probability Density Function by the Maximum Penalized Likelihood Method. The Annals of Statistics, 10(3), 795–810. DOI.
SFHT11
Simon, N., Friedman, J., Hastie, T., & Tibshirani, R. (2011) Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent. Journal of Statistical Software, 39(5).
Stei81
Stein, C. M. (1981) Estimation of the Mean of a Multivariate Normal Distribution. The Annals of Statistics, 9(6), 1135–1151. DOI.
TiGl65
Tikhonov, A. N., & Glasko, V. B. (1965) Use of the regularization method in non-linear problems. USSR Computational Mathematics and Mathematical Physics, 5(3), 93–107. DOI.
Uema15
Uematsu, Y. (2015) Penalized Likelihood Estimation in High-Dimensional Time Series Models and its Application. arXiv:1504.06706 [Math, Stat].
Geer14
van de Geer, S. (2014) Statistical Theory for High-Dimensional Models. arXiv:1409.8557 [Math, Stat].
Wahb90
Wahba, G. (1990) Spline Models for Observational Data. SIAM
WuLa08
Wu, T. T., & Lange, K. (2008) Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, 2(1), 224–244. DOI.
Ye98
Ye, J. (1998) On Measuring and Correcting the Effects of Data Mining and Model Selection. Journal of the American Statistical Association, 93(441), 120–131. DOI.
ZhZh14
Zhang, C.-H., & Zhang, S. S. (2014) Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1), 217–242. DOI.
ZhLT10
Zhang, Y., Li, R., & Tsai, C.-L. (2010) Regularization Parameter Selections via Generalized Information Criterion. Journal of the American Statistical Association, 105(489), 312–323. DOI.
ZoHa05
Zou, H., & Hastie, T. (2005) Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320. DOI.