# Sparse regression

Penalised regression where the penalties are sparsifying. The prediction losses could be anything – likelihood, least-squares, robust Huberised losses, absolute deviation etc.

I will play fast and loose with terminology here regarding theoretical and empirical losses, and the statistical models we attempt to fit.

In nonparametric statistics we might estimate simultaneously what look like many, many parameters, which we constrain in some clever fashion, which usually boils down to something we can interpret as a smoothing parameters, controlling how many factors we still have to consider, from a subset of the original.

I will usually discuss our intent to minimise prediction error, but one could also try to minimise model selection error too.

Then we have a simultaneous estimation and model selection procedure, probably a specific sparse model selection procedure and we possibly have to choose clever optimisation method to do the whole thing fast. Related to compressed sensing, but here we consider sampling complexity and measurement error.

TODO: make comprehensible

TODO: examples

TODO: disambiguate the optimisation technologies at play – iteratively reweighted least squares etc.

Now! A set of headings under which I will try to understand some things, mostly the LASSO variants.

## LASSO

Quadratic loss penalty, absolute coefficient penalty.

The original, and the one actually included in all the statistics software.

This is not a model, per se, but a fitting procedure. We can, however, try to see what models and designs this procedure happens to work with.

TBD. This is the one with famous oracle properties. Hsi Zou's paper on this (Zou06) is very readable. I am having trouble digesting Sara van de Geer's paper (Geer08) on the Generalised Lasso, but it seems to offer me guarantees for something very similar to the Adaptive Lasso, but with far more general assumptions on the model and loss functions, and some finite sample guarnatees.

## LARS

A confusing one; LASSO and LARS are not the same thing but you can use one to calculate the other? Something like that? I need to work this one through with a pencil and paper.

## Graph LASSO

As used in graphical models. TBD

## Elastic net

Combination of $L_1$ and $L_2$ penalties. TBD.

## Grouped LASSO

AFAICT this is the usual LASSO but with grouped factors. See Yuan and Lin (YuLi06).

## Model selection

Can be fiddly with sparse regression, which couples variable selection tightly with parameter estimation. See sparse model selection.

## Debiassed LASSO

See GBRD14 and Geer14c. Same as Geer08, or not?

## Sparse basis expansions

Wavelets etc; mostly handled under sparse dictionary bases.

## Sparse neural nets

That is, sparse regressions as the layers in a neural network? Sure thing. (WCLH16)

“Smoothly Clipped Absolute Deviation”. TBD.

TBD

## Other prediction losses

This should work for heavy-tailed noise. MAD prediction penalty, lasso-coefficient penalty. What do you call this even?

See PoKo97 and WaLJ07 for some implementations using maximum prediction error. I believe the Dantzig selector uses the infinity-norm. Robust prediction losses? How do they relate?

## Bayesian Lasso

Laplace priors on linear regression coefficients, includes normal lasso as a MAP estimate. Not particularly sparse. I have not need for this right now, but I did read Dan Simpson's critique

## Implementations

Hastie, Friedman eta's glmnet for R is fast and well-regarded, and has a MATLAB version. Here's how to use it for adaptive lasso.

SPAMS (C++, MATLAB, R, python) by Mairal, looks interesting. It's an optimisation library for many, many sparse problems.

liblinear also include lasso-type solvers, as well as support-vector regression.

## Tidbits

Sparse regression as a universal classifier explainer? Local Interpretable Model-agnostic Explanations might do it (RiSG16).