# Sparse regression

Penalised regression where the penalties are sparsifying. The prediction losses could be anything – likelihood, least-squares, robust Huberised losses, absolute deviation etc.

I will play fast and loose with terminology here regarding theoretical and empirical losses, and the statistical models we attempt to fit.

In nonparametric statistics we might estimate simultaneously what look like many, many parameters, which we constrain in some clever fashion, which usually boils down to something we can interpret as a smoothing parameters, controlling how many factors we still have to consider, from a subset of the original.

I will usually discuss our intent to minimise prediction error, but one could also try to minimise model selection error too.

Then we have a simultaneous estimation and model selection procedure, probably a specific sparse model selection procedure and we possibly have to choose clever optimisation method to do the whole thing fast. Related to compressed sensing, but here we consider sampling complexity and measurement error.

TODO: make comprehensible

TODO: examples

TODO: disambiguate the optimisation technologies at play – iteratively reweighted least squares etc.

Now! A set of headings under which I will try to understand some things, mostly the LASSO variants.

## LASSO

Quadratic loss penalty, absolute coefficient penalty.

The original, and the one actually included in all the statistics software.

This is not a model, per se, but a fitting procedure. We can, however, try to see what models and designs this procedure happens to work with.

## Adaptive LASSO

TBD. This is the one with famous oracle properties. Hsi Zou’s paper on this (Zou06) is very readable. I am having trouble digesting Sara van de Geer’s paper (Geer08) on the Generalised Lasso, but it seems to offer me guarantees for something very similar to the Adaptive Lasso, but with far more general assumptions on the model and loss functions, and some finite sample guarnatees.

## LARS

A confusing one; LASSO and LARS are not the same thing but you can use one to calculate the other? Something like that? I need to work this one through with a pencil and paper.

## Graph LASSO

As used in graphical models. TBD

## Elastic net

Combination of $$L_1$$ and $$L_2$$ penalties. TBD.

## Grouped LASSO

AFAICT this is the usual LASSO but with grouped factors. See Yuan and Lin (YuLi06).

## Model selection

Can be fiddly with sparse regression, which couples variable selection tightly with parameter estimation. See sparse model selection.

## Debiassed LASSO

There exist a few versions, but the one I have needed is van de Geer and Bühlmann 2014, section 2.1. See also and Geer14c. (TODO: relation to Geer08?)

## Sparse basis expansions

Wavelets etc; mostly handled under sparse dictionary bases.

## Sparse neural nets

That is, sparse regressions as the layers in a neural network? Sure thing. (WCLH16)

## Other coefficient penalties

Put a weird penalty on the coefficients! E.g. “Smoothly Clipped Absolute Deviation” (SCAD). TBD.

## Other prediction losses

Put a weird penalty on the error! MAD prediction penalty, lasso-coefficient penalty, etc.

See PoKo97 and WaLJ07 for e.g. some implementations using maximum prediction error.

## Implementations

Hastie, Friedman eta’s glmnet for R is fast and well-regarded, and has a MATLAB version. Here’s how to use it for adaptive lasso.

SPAMS (C++, MATLAB, R, python) by Mairal, looks interesting. It’s an optimisation library for many, many sparse problems.

liblinear also include lasso-type solvers, as well as support-vector regression.

## Tidbits

Sparse regression as a universal classifier explainer? Local Interpretable Model-agnostic Explanations uses LASSO for this. (See the blog post, or the source.