A statistical estimation problem where you are not trying to estimate a function of a distribution of random observations, but the distribution itself. In a sense, all of statistics implicitly does density estimation, but this is usually purely instrumental in the course of discovering the real parameter of interest.
Estimating distributions nonparametrically proceeds much as in the usual function-approximation setting, although we might set ourselves different loss functions than in regression: e.g. expected L_p loss, or a probability divergence measure.
Unlike general function approximation, though, the problem is constrained: a density must integrate to unity by definition, and it may not be negative.
The most common estimate, which we use implicitly all the time, is to simply take the empirical distribution as the distribution estimate; that is, taking the data as a model for itself. This has various unhelpful features, such as being rough and rather hard to visualise, unless your spectacle prescription can resolve Dirac deltas.
Two common smoother methods are kernel density estimates and mixture models (a.k.a. radial basis functions), but you might also try splines, arbitrary functional bases such as dictionary/wavelet bases, product kernel methods, or log-/nonnegative regression, or, relatedly, Gaussian process methods (I am told they are related, but I know little of them).
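As a concrete sketch of the first of those, here is a minimal Gaussian kernel density estimate from scratch, assuming only numpy; `gaussian_kde` and the Scott-style rule-of-thumb bandwidth are illustrative choices, not the only ones:

```python
import numpy as np

def gaussian_kde(data, grid, bandwidth):
    """KDE: average of Gaussian bumps of width `bandwidth` centred on the data."""
    u = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
# A bimodal sample, so that smoothing has some structure to reveal.
data = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(1, 1.0, 500)])

# Scott-style rule-of-thumb bandwidth for 1-D data.
h = data.std() * len(data) ** (-1 / 5)

grid = np.linspace(-6, 6, 400)
fhat = gaussian_kde(data, grid, h)

# Unlike the empirical measure, this is a bona fide smooth density:
# nonnegative, and (numerically) integrating to unity over the grid.
dx = grid[1] - grid[0]
print(fhat.min() >= 0.0, (fhat * dx).sum())
```

Note that the two density constraints mentioned above come for free here: the kernel is nonnegative and each bump integrates to one.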
Questions:

- When would I actually want to estimate, specifically, a density?
  - Visualisation
  - Nonparametric regression without any better ideas
  - …?
- What are appropriate purposes for each of the probability metrics?
- What about nonparametric conditional density estimation? Are there any general ways to do this?
- What does the kernel trick get me here?
- Can I use spectral methods to do this outside of kernel estimation?
Divergence measures/contrasts
There are many choices for loss functions between densities here; any of the probability metrics will do. For reasons of tradition or convenience, when the object of interest is the density itself, certain choices dominate:

L_2 distance, i.e. the integrated squared error ∫ (f̂(x) − f(x))² dx of the estimate with respect to the density, over Lebesgue measure on the state space; its expectation is what we call the MISE.

KL divergence. (May not do what you want if you care about performance near 0. See Hall87.)

Hellinger distance
But there are others.
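On a discretization grid all three contrasts above are easy to compute. A sketch (numpy only), with two Gaussians standing in for an estimate and a truth, for which the KL and Hellinger values are known in closed form:

```python
import numpy as np

# Two densities on a common grid: N(0, 1) and N(0.5, 1).
grid = np.linspace(-8, 8, 2000)
dx = grid[1] - grid[0]
f = np.exp(-0.5 * grid**2) / np.sqrt(2 * np.pi)
g = np.exp(-0.5 * (grid - 0.5) ** 2) / np.sqrt(2 * np.pi)

# Integrated squared error (the quantity whose expectation is the MISE).
ise = np.sum((f - g) ** 2) * dx

# KL divergence KL(f || g); blows up where g vanishes but f does not.
kl = np.sum(f * np.log(f / g)) * dx

# Squared Hellinger distance: 0.5 * integral of (sqrt(f) - sqrt(g))^2.
hellinger2 = 0.5 * np.sum((np.sqrt(f) - np.sqrt(g)) ** 2) * dx

# For equal-variance Gaussians, KL(f||g) = (mu_f - mu_g)^2 / 2 = 0.125,
# and H^2 = 1 - exp(-(mu_f - mu_g)^2 / 8) ≈ 0.0308.
print(ise, kl, hellinger2)
```

The comment about Hall87 is visible numerically: the KL integrand `f * log(f / g)` is dominated by regions where `g` is small relative to `f`, which is why KL-based estimation cares so much about tail behaviour.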
Risk Bounds
But having chosen the divergence you wish to minimise, you now have to choose with respect to which criterion you wish to minimise it. Minimax? In probability? In expectation? …? Every combination is a different publication. Hmf.
Minimising Expected (or whatever) MISE
Surprisingly complex. For kernel density estimators this works fine: it turns out to just be a Wiener filter where you have to choose a bandwidth. How do you do this for parametric estimators, though?
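One generic bandwidth-selection recipe that does not need a closed-form risk expression is likelihood cross-validation: pick the bandwidth maximising the leave-one-out log-likelihood of the KDE. A sketch, assuming numpy only; `loo_log_likelihood` is my name for the helper, and the bandwidth grid is arbitrary:

```python
import numpy as np

def loo_log_likelihood(data, h):
    """Leave-one-out log-likelihood of a Gaussian KDE with bandwidth h."""
    n = len(data)
    u = (data[:, None] - data[None, :]) / h
    k = np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * h)
    np.fill_diagonal(k, 0.0)  # drop each point's own kernel bump
    loo_density = k.sum(axis=1) / (n - 1)
    return np.log(loo_density).sum()

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, 300)

bandwidths = np.geomspace(0.05, 2.0, 40)
scores = [loo_log_likelihood(data, h) for h in bandwidths]
h_star = bandwidths[int(np.argmax(scores))]
print(h_star)
```

For standard normal data the selected bandwidth typically lands in the same ballpark as the 1.06 σ n^(−1/5) rule of thumb. Note this is the KL-flavoured criterion; cross-validating the integrated squared error instead gives least-squares CV, which targets the MISE directly.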
Connection to point processes
There is a connection between spatial point process intensity estimation and density estimation. See Densities and intensities.
kNN estimates.
Filed here because too small to do elsewhere
To use nearest-neighbour methods, the integer k must be selected. This is similar to bandwidth selection, although here k is discrete, not continuous. Li (Li87) showed that for the k-NN regression estimator under conditional homoskedasticity, it is asymptotically optimal to pick k by Mallows' C_L, generalized cross-validation, or cross-validation. Andrews (Andr91) generalized this result to the case of heteroskedasticity, and showed that cross-validation remains asymptotically optimal.
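For orientation, the 1-D k-NN density estimate itself is simple: f̂(x) = k / (2 n r_k(x)), where r_k(x) is the distance from x to its k-th nearest sample. A sketch (numpy only; `knn_density` is an illustrative helper, and the brute-force distance computation is for clarity, not efficiency):

```python
import numpy as np

def knn_density(data, x, k):
    """1-D k-NN density estimate: f(x) = k / (2 n r_k(x)),
    with r_k(x) the distance from x to its k-th nearest sample."""
    r_k = np.sort(np.abs(data[None, :] - x[:, None]), axis=1)[:, k - 1]
    return k / (2 * len(data) * r_k)

rng = np.random.default_rng(2)
data = rng.normal(0.0, 1.0, 500)

# k plays the role of a bandwidth: small k tracks the sample closely
# (rough estimate), large k smooths aggressively (biased estimate).
est = knn_density(data, np.array([0.0, 2.0]), k=25)
print(est)  # more mass near the mode than in the tail
```

Unlike the KDE, this estimator adapts its effective bandwidth to the local sample density, but it does not integrate to one and has heavy tails, which is part of why k-selection results are usually stated for the regression estimator.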
kernel density estimators
Fancy ones
HT Gery Geenens for a lecture he just gave on convolution kernel density estimation, where he drew a parallel between additive noise in KDE and multiplicative noise in nonnegative-valued variables.
Refs
Necessarily scattershot; you probably want to see the refs for a particular density estimation method.
 ReRT11: Patricia Reynaud-Bouret, Vincent Rivoirard, Christine Tuleau-Malot (2011) Adaptive density estimation: A curse of support? Journal of Statistical Planning and Inference, 141(1), 115–139. DOI
 Nore10: Andriy Norets (2010) Approximation of conditional densities by smooth mixtures of regressions. The Annals of Statistics, 38(3), 1733–1766. DOI
 Li87: Ker-Chau Li (1987) Asymptotic Optimality for C_p, C_L, Cross-Validation and Generalized Cross-Validation: Discrete Index Set. The Annals of Statistics, 15(3), 958–975. DOI
 Andr91: Donald W. K. Andrews (1991) Asymptotic optimality of generalized C_L, cross-validation, and generalized cross-validation in regression with heteroskedastic errors. Journal of Econometrics, 47(2), 359–377. DOI
 DeLu01: Luc Devroye, Gábor Lugosi (2001) Combinatorial methods in density estimation. New York: Springer
 Efro07: Sam Efromovich (2007) Conditional density estimation in a regression setting. The Annals of Statistics, 35(6), 2504–2535. DOI
 STSK10: Masashi Sugiyama, Ichiro Takeuchi, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, Daisuke Okanohara (2010) Conditional density estimation via least-squares density ratio estimation. In International Conference on Artificial Intelligence and Statistics (pp. 781–788).
 Scho05: Frederic Paik Schoenberg (2005) Consistent parametric estimation of the intensity of a spatial–temporal point process. Journal of Statistical Planning and Inference, 128(1), 79–93. DOI
 SaTs10: Sylvain Sardy, Paul Tseng (2010) Density Estimation by Total Variation Penalized Likelihood Driven by the Sparsity ℓ1 Information Criterion. Scandinavian Journal of Statistics, 37(2), 321–337. DOI
 KoMi06: Roger Koenker, Ivan Mizera (2006) Density estimation by total variation regularization. Advances in Statistical Modeling and Inference, 613–634.
 Elli91: Steven P. Ellis (1991) Density estimation for point processes. Stochastic Processes and Their Applications, 39(2), 345–358. DOI
 ZeMe97: Assaf J. Zeevi, Ronny Meir (1997) Density Estimation Through Convex Combinations of Densities: Approximation and Estimation Bounds. Neural Networks: The Official Journal of the International Neural Network Society, 10(1), 99–109. DOI
 BeDi89: Mark Berman, Peter Diggle (1989) Estimating Weighted Integrals of the Second-Order Intensity of a Spatial Point Process. Journal of the Royal Statistical Society. Series B (Methodological), 51(1), 81–92.
 Ibra01: I. Ibragimov (2001) Estimation of analytic functions. In Institute of Mathematical Statistics Lecture Notes – Monograph Series (pp. 359–383). Beachwood, OH: Institute of Mathematical Statistics
 CuSS08: John P. Cunningham, Krishna V. Shenoy, Maneesh Sahani (2008) Fast Gaussian process methods for point process intensity estimation. (pp. 192–199). ACM Press DOI
 DHNN16: Vu Dinh, Lam Si Tung Ho, Duy Nguyen, Binh T. Nguyen (2016) Fast learning rates with heavy-tailed losses. In NIPS.
 EiMa96: Paul H. C. Eilers, Brian D. Marx (1996) Flexible smoothing with B-splines and penalties. Statistical Science, 11(2), 89–121. DOI
 ShSh10: Hideaki Shimazaki, Shigeru Shinomoto (2010) Kernel bandwidth optimization in spike rate estimation. Journal of Computational Neuroscience, 29(1–2), 171–182. DOI
 Birg08: Lucien Birgé (2008) Model selection for density estimation with L2-loss. ArXiv:0808.1416 [Math, Stat].
 Bosq98: Denis Bosq (1998) Nonparametric statistics for stochastic processes: estimation and prediction. New York: Springer
 HaIb90: Rafael Hasminskii, Ildar Ibragimov (1990) On Density Estimation in the View of Kolmogorov’s Ideas in Approximation Theory. The Annals of Statistics, 18(3), 999–1010. DOI
 Lies11: Marie-Colette N. M. van Lieshout (2011) On Estimation of the Intensity Function of a Point Process. Methodology and Computing in Applied Probability, 14(3), 567–578. DOI
 Hall87: Peter Hall (1987) On Kullback-Leibler Loss and Density Estimation. The Annals of Statistics, 15(4), 1491–1519. DOI
 LGMR17: Holden Lee, Rong Ge, Tengyu Ma, Andrej Risteski, Sanjeev Arora (2017) On the ability of neural nets to express distributions. In arXiv:1702.07028 [cs].
 Cox65: D. R. Cox (1965) On the Estimation of the Intensity Function of a Stationary Point Process. Journal of the Royal Statistical Society. Series B (Methodological), 27(2), 332–337.
 PaZe16: Victor M. Panaretos, Yoav Zemel (2016) Separation of Amplitude and Phase Variation in Point Processes. The Annals of Statistics, 44(2), 771–812. DOI
 GiKM08: K. Giesecke, H. Kakavand, M. Mousavi (2008) Simulating point processes by intensity projection. In Simulation Conference, 2008. WSC 2008. Winter (pp. 560–568). DOI
 BaLi13: Heather Battey, Han Liu (2013) Smooth projected density estimation. ArXiv:1308.3968 [Stat].
 Gu93: Chong Gu (1993) Smoothing Spline Density Estimation: A Dimensionless Automatic Algorithm. Journal of the American Statistical Association, 88(422), 495–504. DOI
 Papa74: F. Papangelou (1974) The conditional intensity of general point processes and an application to line processes. Zeitschrift Für Wahrscheinlichkeitstheorie Und Verwandte Gebiete, 28(3), 207–226. DOI