# Density estimation

A statistical estimation problem where you are not trying to estimate a function of a distribution of random observations, but the distribution itself. In a sense, all of statistics implicitly does density estimation, but this is usually purely instrumental in the course of discovering the real parameter of interest.

So we estimate distributions nonparametrically much as we do in the usual function-approximation setting. We might also set ourselves different loss functions than in regression: instead of, e.g., expected $L_p$ prediction error, we might use a traditional function-approximation $L_p$ loss, or a probability divergence measure.

Unlike general function approximation, the problem is constrained: our density must integrate to unity by definition, and it may not be negative.

The most common estimate, one we use implicitly all the time, is simply the empirical distribution; that is, taking the data as a model for itself. This has various unhelpful features, such as being rough and rather hard to visualise, unless your spectacle prescription can resolve Dirac deltas.
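Concretely, since the empirical "density" is a sum of Dirac deltas, the only thing we can sensibly evaluate pointwise is the empirical CDF. A minimal sketch (function names are mine, not a standard API):

```python
import numpy as np

def ecdf(data):
    """Return the empirical CDF: a step function jumping 1/n at each observation.

    The corresponding density is a sum of Dirac deltas, so we can only
    sensibly evaluate the CDF, not the density itself.
    """
    xs = np.sort(data)

    def F(t):
        # Number of observations <= t, divided by n.
        return np.searchsorted(xs, t, side="right") / xs.size

    return F

F = ecdf(np.array([3.0, 1.0, 2.0]))
# F(2.5) is the fraction of observations <= 2.5
```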

Two common smoother methods are kernel density estimates and mixture models (a.k.a. radial basis functions), but you might also like to try splines, arbitrary functional bases such as dictionary/wavelet bases, product kernel methods, or log/non-negative regression, or, relatedly, Gaussian process methods (I am told they are related, but I know little of them).
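As a concrete baseline, the kernel density estimate is just an average of kernel bumps centred on the observations. A minimal numpy sketch with a fixed, hand-picked bandwidth (an illustration, not a production estimator):

```python
import numpy as np

def gaussian_kde(data, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on a grid.

    The estimate is the average of Gaussian bumps centred on each
    observation, scaled by a user-supplied bandwidth.
    """
    # (n_grid, n_data) matrix of scaled distances
    u = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
data = rng.normal(size=500)
grid = np.linspace(-4, 4, 201)
density = gaussian_kde(data, grid, bandwidth=0.4)
```

Note that by construction the estimate is non-negative and integrates to one (up to the truncation of the grid), so the density constraints come for free.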

Questions:

• When would I actually want to estimate, specifically, a density?

  • Visualisation
  • Nonparametric regression without any better ideas
  • …?

• What are the appropriate purposes for each of the probability metrics?

• What about nonparametric conditional density estimation? Are there any general ways to do this?

• What does the kernel trick get me here?

• Can I use spectral methods to do this outside of kernel estimation?

## Divergence measures/contrasts

There are many choices of loss function between densities; any of the probability metrics will do. For reasons of tradition or convenience, certain choices dominate when the object of interest is the density itself:

• Integrated squared error, i.e. $L_2$ distance with respect to Lebesgue measure on the state space; its expectation is the MISE.

• KL-divergence. (May not do what you want if you care about performance near 0. See Hall87.)

• Hellinger distance

But there are others.
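For concreteness, the first of these, the MISE of an estimator $\hat{f}$ of a density $f$, is the expected integrated squared error:

$$\operatorname{MISE}(\hat{f}) = \mathbb{E}\left[\int \bigl(\hat{f}(x) - f(x)\bigr)^2\,\mathrm{d}x\right]$$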

## Risk Bounds

But having chosen the divergence you wish to minimise, you now have to choose with respect to which criterion you wish to minimise it. Minimax? In probability? In expectation? …? Every combination is a different publication. Hmf.

## Minimising Expected (or whatever) MISE

Surprisingly complex. This works out fine for kernel density estimators, where it turns out to be essentially a Wiener filter for which you must choose a bandwidth. How do you do this for parametric estimators, though?
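One pragmatic answer to the bandwidth question is a plug-in rule such as Silverman's rule of thumb, which minimises the asymptotic MISE under the (strong) assumption that the true density is Gaussian. A sketch:

```python
import numpy as np

def silverman_bandwidth(data):
    """Silverman's rule-of-thumb bandwidth for a Gaussian kernel.

    Optimal for the asymptotic MISE *if* the true density is Gaussian;
    a common default, not an endorsement. Uses the robust scale
    min(sd, IQR/1.349) to guard against heavy tails.
    """
    n = data.size
    iqr = np.percentile(data, 75) - np.percentile(data, 25)
    sigma = min(data.std(ddof=1), iqr / 1.349)
    return 0.9 * sigma * n ** (-1 / 5)

rng = np.random.default_rng(1)
data = rng.normal(size=1000)
h = silverman_bandwidth(data)
```

Cross-validation is the obvious data-driven alternative when the Gaussian reference assumption is implausible.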

## Connection to point processes

There is a connection between spatial point process intensity estimation and density estimation. See Densities and intensities.

## k-NN estimates.

Filed here because the topic is too small to file elsewhere.

To use nearest-neighbour methods, the integer $k$ must be selected. This is similar to bandwidth selection, although here $k$ is discrete, not continuous. K.C. Li (Annals of Statistics, 1987) showed that for the k-NN regression estimator under conditional homoskedasticity, it is asymptotically optimal to pick $k$ by Mallows, generalized CV, or CV. Andrews (Journal of Econometrics, 1991) generalized this result to the heteroskedastic case and showed that CV is asymptotically optimal.
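The CV selection of $k$ can be sketched by brute force. This is my own toy leave-one-out implementation for 1-d covariates, not Li's or Andrews's construction:

```python
import numpy as np

def loocv_knn_k(x, y, ks):
    """Pick k for k-NN regression by leave-one-out cross-validation.

    Brute force: for each candidate k, predict each y_i from the k
    nearest *other* points and score mean squared error.
    """
    dists = np.abs(x[:, None] - x[None, :])
    np.fill_diagonal(dists, np.inf)  # leave each point out of its own fit
    order = np.argsort(dists, axis=1)
    scores = {}
    for k in ks:
        preds = y[order[:, :k]].mean(axis=1)
        scores[k] = np.mean((y - preds) ** 2)
    return min(scores, key=scores.get)

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=200)
k_best = loocv_knn_k(x, y, ks=[1, 3, 5, 10, 20, 40])
```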

## kernel density estimators

### Fancy ones

HT Gery Geenens for a lecture he just gave on convolution kernel density estimation, in which he drew a parallel between additive noise in KDE and multiplicative noise for non-negative-valued variables.
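I have not implemented Geenens's estimator, but the multiplicative-noise viewpoint is easy to illustrate with the classic transform trick for non-negative data: estimate the density of $\log X$ with an ordinary Gaussian kernel (where the noise is additive), then change variables back. A sketch:

```python
import numpy as np

def log_transform_kde(data, grid, bandwidth):
    """KDE for positive-valued data via a log transform.

    Estimate the density of log(X) with a Gaussian kernel, then map
    back with the Jacobian 1/x. Not Geenens's convolution estimator,
    just the classic transformation trick it is related to.
    """
    logs = np.log(data)
    u = (np.log(grid)[:, None] - logs[None, :]) / bandwidth
    kde_log = np.exp(-0.5 * u**2).mean(axis=1) / (bandwidth * np.sqrt(2 * np.pi))
    return kde_log / grid  # Jacobian of the log transform

rng = np.random.default_rng(3)
data = rng.lognormal(size=500)
grid = np.linspace(0.01, 8, 400)
density = log_transform_kde(data, grid, bandwidth=0.3)
```

Unlike a naive Gaussian KDE on the raw data, this puts no mass below zero, at the price of a boundary-dependent effective bandwidth.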

## Refs

Necessarily scattershot; you probably want to see the refs for a particular density estimation method.