A method of semiparametric density estimation.
This is also very close to clustering, and indeed there are lots of papers noting the connection.
We start from “classic” flavoured Gaussian Mixture Models, and pausing for a rest stop at expectation maximisation, with a moment to muse how we can unify this with kernel density estimation, and the suggestive smoothing connection of functional regression, terminating at adaptive mixture sieves, wondering momentarily if orthogonally decomposable tensors have anything to add. But we are not done, because we have a knotty model selection problem.
To learn:
regularity conditions for ML asymptotics and which ML results you can use, computational complexity.
Connections
Connection with kernel PCA ( SKSB98 ) and metric multidimensional scaling ( Will01) and such are explored in kernel approximation.
Mixture zoo
The following categories are not mutually exclusive; In fact I'm mentioning them all to work out what exactly the intersections are.
BaLi13, ZeMe97 and Geer96 discuss some useful results common to various convex combination estimators.
Dasg08 ch 33 is a highspeed nofiller allkiller summary of various convergence results and mixture types, including a connection to DonohoJin “Higher criticism” and nonparametric deconvolution and multiple testing (ch34).
Chee11 goes into dissertationdepth.
“Classic Mixtures”
Finite locationscale mixtures.
Your data are vectors .
Your density looks like this:
Traditionally, is given, and given as the normal density, but any “nice” unimodal density will do, and we can appeal to, e.g. ZeMe97 or LiBa00 to argue that we can get “close” to any density in large classes by choosing the Gaussian for big .
Also traditionally, is given by magic.
Fitting the parameters of this model, then, involves choosing
Why would you bother? We know that this class is dense in the space of all densities under the total variation metric (ChLi09), which is a rationale if not a guarantee of its usefulness. Moreover, we know that it's computationally tractable.

Dustin Mixon does a beautiful explanation of his paper and a few others in the field, in The Voronoi Means Conjecture
As I mentioned in a previous blog post, I have a recent paper with Soledad Villar and Rachel Ward that analyzes an SDP relaxation of the kmeans problem. It turns out that the solution to the relaxation can be interpreted as a denoised version of the original dataset, sending many of the data points very close to what appear to be kmeansoptimal centroids. This is similar in spirit to Dasgupta’s random projection, and we use a similar rounding scheme to estimate the optimal kmeanscentroids. Using these centroid estimates as Gaussian center estimates, we are able to prove performance bounds of the form when , meaning the performance doesn’t depend on the dimension, but rather the model order.
But does it make sense that the performance should even depend on the model order?

Also reading: Geer96, which leverages the convex combination issue to get bounds relating KL and Hellinger convergence of density estimators (including kernel density estimators)
Radial basis functions
Finite mixtures by another name, from the function approximation literature. When your component density has an assumed form
Here is a scale parameter for the density function. This corresponds to a spherical approximating function, rather than estimating a full multidimensional bandwidth.
Kernel density estimators
If you have as many mixture components as data points, you have kernel density estimate. This is clearly also a finite mixture model, just a limiting case. To keep the number of parameters manageable you usually assume that the mixture components themselves all share the same scale parameters, although
Normal Variance mixtures
Mixtures of normals where you only vary scale parameters. I clearly don't know enough about this to write the entry; this is just a note to myself. TBD. zdistributions and generalized hyperbolic distributions are the keywords. Various interesting properties about infinitely divisible distributions, and they include many other distributions as special cases.
nonparametric mixtures
Noticing that a classic location mixture is a convolution between a continuous density and an atomic density, the question arises whether you can convolve two more general densities. Yes, you can. Estimate a nonparametric mixing density. Now you have a nonparametric estimation problem, something like, estimate . See, e.g. Chee11 who didn't invent it, but did collect a large literature on this into one place.
Bayesian dirichlet mixtures
TBD; Something about using a Dirichlet process for the… weights of the mixture components? Giving you a posterior distribution over coutnably finite mixture parameters? something like that?
Nonaffine mixtures
Do the mixture components have to have location/scale parameterisations? Not necessarily.
For, e.g. a tail estimate where we want to divine properties of heavy tails, this might not do what we want. Estimating the scale parameters is already a PITA; what can we say about more general shape parameters? This is probably not a theoretical issue, in that the asymptotic behaviour of the Mestimators of a Betamixture don't change. Practically, however, the specific optimisation problem might get very hard. Mind you, it's notoriously not that easy even with locationscale parameters/ Can we actually find our global maximum?
Convex neural networks
Maybe? What are these? See BRVD05 and let me know.
Matrix factorization approximations
Surely someone has done this, since it looks like an obvious idea at the intersection of kernel methods, matrix factorisation and matrix concentration inequalities. Maybe it got filed in the clustering literature.
Dasg99 probably counts, and MiVW16 introduces others, but most of these seem not to address the approximation problem. but the clustering problem. Clustering doesn't fit perfectly with our purpose here; we don't necessarily care about assigning things correctly to clusters; rather, we want to approximate the overall density well.
The restrictedisometrylike property here seems to be that component centres may not coincide; can we avoid that?
See MiVW16 for some interesting connections at least:
The study of theoretical guarantees for learning mixtures of Gaussians started with the work of Dasgupta [Dasg99]. His work presented an algorithm based on random projections and showed this algorithm approximates the centers of Gaussians […] in separated by[…] the biggest singular value among all [covariance matrices]. After this work, several generalizations and improvements appeared. […] To date, techniques used for learning mixtures of Gaussians include expectation maximization [DaSc07], spectral methods [VeWa04, KuKa10, AwSh12], projections (random and deterministic) [Dasg99, MoVa10, ArKa01], and the method of moments [MoVa10].
Every existing performance guarantee exhibits one of the following forms:
the algorithm correctly clusters all points according to Gaussian mixture component, or
the algorithm wellapproximates the center of each Gaussian (a la Dasgupta [Dasg99]).
Results of type (1), which include [VeWa04, KuKa10, AwSh12, AcMc07], require the minimum separation between the Gaussians centers to have a multiplicative factor of polylog N, where N is the number of points. This stems from a requirement that every point be closer to their Gaussian’s center (in some sense) than the other centers, so that the problem of cluster recovery is wellposed. Note that in the case of spherical Gaussians, the Gaussian components can be truncated to match the stochastic ball model in this regime, where the semidefinite program we present is already known to be tight with high probability [ABCK15, IMPV15]. Results of type (2) tend to be specifically tailored to exploit unique properties of the Gaussians, and as such are not easily generalizable to other data models. […] For instance, if , then . In high dimensions, since the entries of the Gaussians are independent, concentration of measure implies that most of the points will reside in a thin shell with center μ and radius about . This property allows algorithms to cluster even concentric Gaussians as long as the covariances are sufficiently different. However, algorithms that allow for no separation between the Gaussian centers require a sample complexity which is exponential in k [MoVa10].
Hmm.
Estimation methods
(local) maximum likelihood
A classic method; There are some subtleties here since the global maximum can be badly behaved; you have to mess around with local roots of the likelihood equation and thereby lose some of the the lovely asymptotics of MLE methods in exponential families.
However, I am not sure which properties you lose. McRa14, for example, makes hte sweeping assertion that the AIC condtions don't hold, but the BIC ones (whatever they are) do hold. BIC is “feels” unsatisfying, however.
Method of moments
Particularly popular in recent times for mixtures. Have not yet divined why.
Minimum distance
Minimise the distance between the empirical PDF and the estimated PDF in some metric. For reasons I have not yet digested, one is probably best to do this in the Hellinger metric if one wishes convenient convergence, (Bera77), although kernel density estimates tend to prefer , as with e.g, regression smoothing problems. How on earth you numerically minimise Hellinger distance from data is something I won't think about for now, although I admit to being curious.
Regression smoothing formulation
Not quite mixture models; you have a smoothness penalty on the quadratic componened, a logquadratic regression spline, then the results are “nearly” guassian mixture models. See EiMa96.
Adaptive mixtures
Here's one latelypopular extension to the finite mixture mode: Choosing the number of mixture components adaptively, using some kind of model selection procedure, as per Prie94 with the “sieve”  MuYA94 uses an information criterion.
Sieve method
Argh! so many variants. What I would like for my mixture sieve
Akaike Information criterion
Use an Akaiketype information criterion
See BaRY98, BHLL08, AnKI08, MuYA94…
KoKi08 6.1. is a compressed summary intro to general regularized basis expansion in a regression setting, i.e. we approximate an arbitrary function. Density approximation is more constrained, since know that our mixture must integrate to 1. Also we don't have a separate error term; rather, we assume our components completely summarise the randomness. Usually, although not always, we further require the components be nonnegative functions with nonnegative weights to give us specifically a convex combination of functions. Anyway, presumably we can extract a similar result from that?
McRa14 claims this doesn't work here; but the BIC/MDL approach does. I'm curious which regularity conditions are violated?
quantization and coding theory
The information theoretic cousin. Nonuniform quantisation in communication theory is when you optimally dsitribute the density of your quantization symbols according to the density of the signal, in order to compress a signal while still reconstructing it as precisely as possible. This connection is most commonly raised in the multidimensional case, when it is “Vector quantization” or VQ to its friends. See, e.g. PaDi51, NaKi88, Gray84, GeGr12. This is then a coding theory problem.
From reading the literature it is not immediately apparent how, precisely, vector quantisation is related to mixture density estimation, although there is a family resemblance. In vector quantisation you do something like reduce the signal to a list of Voronoi cells and the coordinates of their centres, then code a signal to the nearest centre; Squinting right makes this look like a mixture problem. LeSe01 and LeSe99 make this connection precise. Investigate.
Now, how do you choose this optimal code from measurements of the signal? THAT is the statistical question.
Minimum description length//BIC
Related to both the previous previous, in some way I do not yet understand.
Rissanen's Minimum Description length, as applied to mixture density estimation? Putatively related to the information criteria method in the form of the Bayesian Information Criterion, which is an purportedly an MDL measure. (Should look into that, eh?) Andrew Barron and coworkers seem to own the statistical MDL approach to mixture estimation. See, Barr91, BaCo91, BaRY98, with literature reviews in BHLL08. BHLL08 constructs discretized mixture models as “two stage codes”, and achieves prediction risk bounds for finite samples using them.
Unsatifactory thing #1: Model selection
Grrr. See model selection in mixtures.
Unsatisfactory thing #2: scale parameter selection theory
All the really good results take the scale parameter as given.
What if, as in the original GMM, we are happy to have our mixture components parameters vary? This is fine, as far as it goes, but the scale parameter selection is the reason why we are bothering with this class of model, typically, otherwise this is simply a weird convex deconvolution problem, which is not so interesting. In particular, how to handle the scale parameter selection in model selection?
Unsatisfactory thing #3: approximation loss
There's a lot of theory about how well we can learn things about the centres of clusters, but less theory about how we can approximate an overall density. In particular, identifying the centres and scales is subject to usual ML results on identifiability and asymptotic estimator distribution, but if those are all just nuisance parameters, what do you have left?
Miscellaney
http://blog.sigfpe.com/2016/10/expectationmaximizationwithless.html
Refs
A method of semiparametric density estimation.
This is also very close to clustering, and indeed there are lots of papers noting the connection.
We start from “classic” flavoured Gaussian Mixture Models, and pausing for a rest stop at expectation maximisation, with a moment to muse how we can unify this with kernel density estimation, and the suggestive smoothing connection of functional regression, terminating at adaptive mixture sieves, wondering momentarily if orthogonally decomposable tensors have anything to add. But we are not done, because we have a knotty model selection problem.
To learn:
regularity conditions for ML asymptotics and which ML results you can use, computational complexity.
Connections
Connection with kernel PCA ( SKSB98 ) and metric multidimensional scaling ( Will01) and such are explored in kernel approximation.
Mixture zoo
The following categories are not mutually exclusive; In fact I'm mentioning them all to work out what exactly the intersections are.
BaLi13, ZeMe97 and Geer96 discuss some useful results common to various convex combination estimators.
Dasg08 ch 33 is a highspeed nofiller allkiller summary of various convergence results and mixture types, including a connection to DonohoJin “Higher criticism” and nonparametric deconvolution and multiple testing (ch34).
Chee11 goes into dissertationdepth.
“Classic Mixtures”
Finite locationscale mixtures.
Your data are vectors .
Your density looks like this:
Traditionally, is given, and given as the normal density, but any “nice” unimodal density will do, and we can appeal to, e.g. ZeMe97 or LiBa00 to argue that we can get “close” to any density in large classes by choosing the Gaussian for big .
Also traditionally, is given by magic.
Fitting the parameters of this model, then, involves choosing
Why would you bother? We know that this class is dense in the space of all densities under the total variation metric (ChLi09), which is a rationale if not a guarantee of its usefulness. Moreover, we know that it's computationally tractable.

Dustin Mixon does a beautiful explanation of his paper and a few others in the field, in The Voronoi Means Conjecture
As I mentioned in a previous blog post, I have a recent paper with Soledad Villar and Rachel Ward that analyzes an SDP relaxation of the kmeans problem. It turns out that the solution to the relaxation can be interpreted as a denoised version of the original dataset, sending many of the data points very close to what appear to be kmeansoptimal centroids. This is similar in spirit to Dasgupta’s random projection, and we use a similar rounding scheme to estimate the optimal kmeanscentroids. Using these centroid estimates as Gaussian center estimates, we are able to prove performance bounds of the form when , meaning the performance doesn’t depend on the dimension, but rather the model order.
But does it make sense that the performance should even depend on the model order?

Also reading: Geer96, which leverages the convex combination issue to get bounds relating KL and Hellinger convergence of density estimators (including kernel density estimators)
Radial basis functions
Finite mixtures by another name, from the function approximation literature. When your component density has an assumed form
Here is a scale parameter for the density function. This corresponds to a spherical approximating function, rather than estimating a full multidimensional bandwidth.
Kernel density estimators
If you have as many mixture components as data points, you have kernel density estimate. This is clearly also a finite mixture model, just a limiting case. To keep the number of parameters manageable you usually assume that the mixture components themselves all share the same scale parameters, although
Normal Variance mixtures
Mixtures of normals where you only vary scale parameters. I clearly don't know enough about this to write the entry; this is just a note to myself. TBD. zdistributions and generalized hyperbolic distributions are the keywords. Various interesting properties about infinitely divisible distributions, and they include many other distributions as special cases.
nonparametric mixtures
Noticing that a classic location mixture is a convolution between a continuous density and an atomic density, the question arises whether you can convolve two more general densities. Yes, you can. Estimate a nonparametric mixing density. Now you have a nonparametric estimation problem, something like, estimate . See, e.g. Chee11 who didn't invent it, but did collect a large literature on this into one place.
Bayesian dirichlet mixtures
TBD; Something about using a Dirichlet process for the… weights of the mixture components? Giving you a posterior distribution over coutnably finite mixture parameters? something like that?
Nonaffine mixtures
Do the mixture components have to have location/scale parameterisations? Not necessarily.
For, e.g. a tail estimate where we want to divine properties of heavy tails, this might not do what we want. Estimating the scale parameters is already a PITA; what can we say about more general shape parameters? This is probably not a theoretical issue, in that the asymptotic behaviour of the Mestimators of a Betamixture don't change. Practically, however, the specific optimisation problem might get very hard. Mind you, it's notoriously not that easy even with locationscale parameters/ Can we actually find our global maximum?
Convex neural networks
Maybe? What are these? See BRVD05 and let me know.
Matrix factorization approximations
Surely someone has done this, since it looks like an obvious idea at the intersection of kernel methods, matrix factorisation and matrix concentration inequalities. Maybe it got filed in the clustering literature.
Dasg99 probably counts, and MiVW16 introduces others, but most of these seem not to address the approximation problem. but the clustering problem. Clustering doesn't fit perfectly with our purpose here; we don't necessarily care about assigning things correctly to clusters; rather, we want to approximate the overall density well.
The restrictedisometrylike property here seems to be that component centres may not coincide; can we avoid that?
See MiVW16 for some interesting connections at least:
The study of theoretical guarantees for learning mixtures of Gaussians started with the work of Dasgupta [Dasg99]. His work presented an algorithm based on random projections and showed this algorithm approximates the centers of Gaussians […] in separated by[…] the biggest singular value among all [covariance matrices]. After this work, several generalizations and improvements appeared. […] To date, techniques used for learning mixtures of Gaussians include expectation maximization [DaSc07], spectral methods [VeWa04, KuKa10, AwSh12], projections (random and deterministic) [Dasg99, MoVa10, ArKa01], and the method of moments [MoVa10].
Every existing performance guarantee exhibits one of the following forms:
the algorithm correctly clusters all points according to Gaussian mixture component, or
the algorithm wellapproximates the center of each Gaussian (a la Dasgupta [Dasg99]).
Results of type (1), which include [VeWa04, KuKa10, AwSh12, AcMc07], require the minimum separation between the Gaussians centers to have a multiplicative factor of polylog N, where N is the number of points. This stems from a requirement that every point be closer to their Gaussian’s center (in some sense) than the other centers, so that the problem of cluster recovery is wellposed. Note that in the case of spherical Gaussians, the Gaussian components can be truncated to match the stochastic ball model in this regime, where the semidefinite program we present is already known to be tight with high probability [ABCK15, IMPV15]. Results of type (2) tend to be specifically tailored to exploit unique properties of the Gaussians, and as such are not easily generalizable to other data models. […] For instance, if , then . In high dimensions, since the entries of the Gaussians are independent, concentration of measure implies that most of the points will reside in a thin shell with center μ and radius about . This property allows algorithms to cluster even concentric Gaussians as long as the covariances are sufficiently different. However, algorithms that allow for no separation between the Gaussian centers require a sample complexity which is exponential in k [MoVa10].
Hmm.
Estimation methods
(local) maximum likelihood
A classic method; There are some subtleties here since the global maximum can be badly behaved; you have to mess around with local roots of the likelihood equation and thereby lose some of the the lovely asymptotics of MLE methods in exponential families.
However, I am not sure which properties you lose. McRa14, for example, makes hte sweeping assertion that the AIC condtions don't hold, but the BIC ones (whatever they are) do hold. BIC is “feels” unsatisfying, however.
Method of moments
Particularly popular in recent times for mixtures. Have not yet divined why.
Minimum distance
Minimise the distance between the empirical PDF and the estimated PDF in some metric. For reasons I have not yet digested, one is probably best to do this in the Hellinger metric if one wishes convenient convergence, (Bera77), although kernel density estimates tend to prefer , as with e.g, regression smoothing problems. How on earth you numerically minimise Hellinger distance from data is something I won't think about for now, although I admit to being curious.
Regression smoothing formulation
Not quite mixture models; you have a smoothness penalty on the quadratic componened, a logquadratic regression spline, then the results are “nearly” guassian mixture models. See EiMa96.
Adaptive mixtures
Here's one latelypopular extension to the finite mixture mode: Choosing the number of mixture components adaptively, using some kind of model selection procedure, as per Prie94 with the “sieve”  MuYA94 uses an information criterion.
Sieve method
Argh! so many variants. What I would like for my mixture sieve
Akaike Information criterion
Use an Akaiketype information criterion
See BaRY98, BHLL08, AnKI08, MuYA94…
KoKi08 6.1. is a compressed summary intro to general regularized basis expansion in a regression setting, i.e. we approximate an arbitrary function. Density approximation is more constrained, since know that our mixture must integrate to 1. Also we don't have a separate error term; rather, we assume our components completely summarise the randomness. Usually, although not always, we further require the components be nonnegative functions with nonnegative weights to give us specifically a convex combination of functions. Anyway, presumably we can extract a similar result from that?
McRa14 claims this doesn't work here; but the BIC/MDL approach does. I'm curious which regularity conditions are violated?
quantization and coding theory
The information theoretic cousin. Nonuniform quantisation in communication theory is when you optimally dsitribute the density of your quantization symbols according to the density of the signal, in order to compress a signal while still reconstructing it as precisely as possible. This connection is most commonly raised in the multidimensional case, when it is “Vector quantization” or VQ to its friends. See, e.g. PaDi51, NaKi88, Gray84, GeGr12. This is then a coding theory problem.
From reading the literature it is not immediately apparent how, precisely, vector quantisation is related to mixture density estimation, although there is a family resemblance. In vector quantisation you do something like reduce the signal to a list of Voronoi cells and the coordinates of their centres, then code a signal to the nearest centre; Squinting right makes this look like a mixture problem. LeSe01 and LeSe99 make this connection precise. Investigate.
Now, how do you choose this optimal code from measurements of the signal? THAT is the statistical question.
Minimum description length//BIC
Related to both the previous previous, in some way I do not yet understand.
Rissanen's Minimum Description length, as applied to mixture density estimation? Putatively related to the information criteria method in the form of the Bayesian Information Criterion, which is an purportedly an MDL measure. (Should look into that, eh?) Andrew Barron and coworkers seem to own the statistical MDL approach to mixture estimation. See, Barr91, BaCo91, BaRY98, with literature reviews in BHLL08. BHLL08 constructs discretized mixture models as “two stage codes”, and achieves prediction risk bounds for finite samples using them.
Unsatifactory thing #1: Model selection
Grrr. See model selection in mixtures.
Unsatisfactory thing #2: scale parameter selection theory
All the really good results take the scale parameter as given.
What if, as in the original GMM, we are happy to have our mixture components parameters vary? This is fine, as far as it goes, but the scale parameter selection is the reason why we are bothering with this class of model, typically, otherwise this is simply a weird convex deconvolution problem, which is not so interesting. In particular, how to handle the scale parameter selection in model selection?
Unsatisfactory thing #3: approximation loss
There's a lot of theory about how well we can learn things about the centres of clusters, but less theory about how we can approximate an overall density. In particular, identifying the centres and scales is subject to usual ML results on identifiability and asymptotic estimator distribution, but if those are all just nuisance parameters, what do you have left?
Miscellaney
http://blog.sigfpe.com/2016/10/expectationmaximizationwithless.html
Refs
 ChLi09: Elliott Ward Cheney, William Allan Light (2009) A Course in Approximation Theory. American Mathematical Soc.
 AnHK12: Animashree Anandkumar, Daniel Hsu, Sham M. Kakade (2012) A Method of Moments for Mixture Models and Hidden Markov Models. In Journal of Machine Learning Research.
 DaSc07: Sanjoy Dasgupta, Leonard Schulman (2007) A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians. Journal of Machine Learning Research, 8(Feb), 203–226.
 XPPP08: JianWu Xu, A.R.C. Paiva, Il Park, J.C. Principe (2008) A Reproducing Kernel Hilbert Space Framework for InformationTheoretic Learning. IEEE Transactions on Signal Processing, 56(12), 5891–5902. DOI
 VeWa04: Santosh Vempala, Grant Wang (2004) A spectral algorithm for learning mixture models. Journal of Computer and System Sciences, 68(4), 841–860. DOI
 SiGo08: Ajit P. Singh, Geoffrey J. Gordon (2008) A unified view of matrix factorization models. In Machine Learning and Knowledge Discovery in Databases (pp. 358–373). Springer
 BePR11: K. Bertin, E. Le Pennec, V. Rivoirard (2011) Adaptive Dantzig density estimation. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 47(1), 43–74. DOI
 Prie94: Carey E. Priebe (1994) Adaptive Mixtures. Journal of the American Statistical Association, 89(427), 796–806. DOI
 LeSe01: Daniel D. Lee, H. Sebastian Seung (2001) Algorithms for Nonnegative Matrix Factorization. In Advances in Neural Information Processing Systems 13 (pp. 556–562). MIT Press
 PrMa00: Carey E. Priebe, David J. Marchette (2000) Alternating kernel and mixture density estimates. Computational Statistics & Data Analysis, 35(1), 43–65. DOI
 PeWe07: J. Peng, Y. Wei (2007) Approximating K‐means‐type Clustering via Semidefinite Programming. SIAM Journal on Optimization, 18(1), 186–205. DOI
 Barr94: Andrew R. Barron (1994) Approximation and Estimation Bounds for Artificial Neural Networks. Machine Learning, 14(1), 115–133. DOI
 Nore10: Andriy Norets (2010) Approximation of conditional densities by smooth mixtures of regressions. The Annals of Statistics, 38(3), 1733–1766. DOI
 Geer97: Sara van de Geer (1997) Asymptotic normality in mixture models. ESAIM: Probability and Statistics, 1, 17–33.
 Geer03: Sara van de Geer (2003) Asymptotic theory for maximum likelihood in nonparametric mixture models. Computational Statistics & Data Analysis, 41(3–4), 453–464. DOI
 Dasg08: Anirban DasGupta (2008) Asymptotic Theory of Statistics and Probability. New York: Springer New York
 QuBe16: Qichao Que, Mikhail Belkin (2016) Back to the future: Radial Basis Function networks revisited. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) 2016.
 BeRV16: Mikhail Belkin, Luis Rademacher, James Voss (2016) Basis Learning as an Algorithmic Primitive. In Journal of Machine Learning Research (pp. 446–487).
 Bach14: Francis Bach (2014) Breaking the Curse of Dimensionality with Convex Neural Networks. ArXiv:1412.8690 [Cs, Math, Stat].
 DaLS12: Amit Daniely, Nati Linial, Michael Saks (2012) Clustering is difficult only when it does not matter. ArXiv:1205.4891 [Cs].
 MiVW16: Dustin G. Mixon, Soledad Villar, Rachel Ward (2016) Clustering subgaussian mixtures by semidefinite programming. ArXiv:1602.06612 [Cs, Math, Stat].
 KuKa10: A. Kumar, R. Kannan (2010) Clustering with Spectral Norm and the kMeans Algorithm. In 2010 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS) (pp. 299–308). DOI
 Barr91: Andrew R. Barron (1991) Complexity Regularization with Application to Artificial Neural Networks. In Nonparametric Functional Estimation and Related Topics (pp. 561–576). Springer Netherlands DOI
 BaSa13: Heather Battey, Alessio Sancetta (2013) Conditional estimation for dependent functional data. Journal of Multivariate Analysis, 120, 1–17. DOI
 BRVD05: Yoshua Bengio, Nicolas L. Roux, Pascal Vincent, Olivier Delalleau, Patrice Marcotte (2005) Convex neural networks. In Advances in neural information processing systems (pp. 123–130).
 HDYD12: G. Hinton, Li Deng, Dong Yu, G.E. Dahl, A. Mohamed, N. Jaitly, … B. Kingsbury (2012) Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine, 29(6), 82–97. DOI
 DJKP96: David L. Donoho, Iain M. Johnstone, Gerard Kerkyacharian, Dominique Picard (1996) Density Estimation by Wavelet Thresholding. The Annals of Statistics, 24(2), 508–539. DOI
 ZeMe97: Assaf J. Zeevi, Ronny Meir (1997) Density Estimation Through Convex Combinations of Densities: Approximation and Estimation Bounds. Neural Networks: The Official Journal of the International Neural Network Society, 10(1), 99–109. DOI
 LeSc12: Gyemin Lee, Clayton Scott (2012) EM algorithms for multivariate Gaussian mixture models with truncated and censored data. Computational Statistics & Data Analysis, 56(9), 2816–2829. DOI
 GhVa01: Subhashis Ghosal, Aad W. van der Vaart (2001) Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. The Annals of Statistics, 29(5), 1233–1263. DOI
 ZeMM98: A. J. Zeevi, R. Meir, V. Maiorov (1998) Error bounds for functional approximation and estimation using mixtures of experts. IEEE Transactions on Information Theory, 44(3), 1010–1025. DOI
 Ibra01: I. Ibragimov (2001) Estimation of analytic functions. In Institute of Mathematical Statistics Lecture Notes  Monograph Series (pp. 359–383). Beachwood, OH: Institute of Mathematical Statistics
 Deha16: Guillaume P. Dehaene (2016) Expectation Propagation performs a smoothed gradient descent. ArXiv:1612.05053 [Stat].
 SKSB98: Bernhard Schölkopf, Phil Knirsch, Alex Smola, Chris Burges (1998) Fast Approximation of Support Vector Kernel Expansions, and an Interpretation of Clustering as Approximation in Feature Spaces. In Mustererkennung 1998 (pp. 125–132). Springer Berlin Heidelberg DOI
 AcMc07: Dimitris Achlioptas, Frank Mcsherry (2007) Fast Computation of Lowrank Matrix Approximations. J. ACM, 54(2). DOI
 McJo88: G. J. McLachlan, P. N. Jones (1988) Fitting Mixture Models to Grouped and Truncated Data via the EM Algorithm. Biometrics, 44(2), 571–578. DOI
 EiMa96: Paul H. C. Eilers, Brian D. Marx (1996) Flexible smoothing with Bsplines and penalties. Statistical Science, 11(2), 89–121. DOI
 LeNP06: Youngjo. Lee, John A. Nelder, Yudi Pawitan (2006) Generalized linear models with random effects. Boca Raton, FL: Chapman & Hall/CRC
 BiRo06: Lucien Birgé, Yves Rozenholc (2006) How many bins should be put in a regular histogram. ESAIM: Probability and Statistics, 10, 24–45. DOI
 AwSh12: Pranjal Awasthi, Or Sheffet (2012) Improved SpectralNorm Bounds for Clustering. ArXiv Preprint ArXiv:1206.3204.
 Bish91: Chris Bishop (1991) Improving the Generalization Properties of Radial Basis Function Neural Networks. Neural Computation, 3(4), 579–588. DOI
 KoKi08: Sadanori Konishi, G. Kitagawa (2008) Information criteria and statistical modeling. New York: Springer
 LeBa06: G. Leung, A.R. Barron (2006) Information Theory and Mixing LeastSquares Regressions. IEEE Transactions on Information Theory, 52(8), 3396–3410. DOI
 Orr96: Mark JL Orr (1996) Introduction to radial basis function networks. Technical Report, Center for Cognitive Science, University of Edinburgh
 ShSh10: Hideaki Shimazaki, Shigeru Shinomoto (2010) Kernel bandwidth optimization in spike rate estimation. Journal of Computational Neuroscience, 29(1–2), 171–182. DOI
 ScSM97: Bernhard Schölkopf, Alexander Smola, KlausRobert Müller (1997) Kernel principal component analysis. In Artificial Neural Networks — ICANN’97 (pp. 583–588). Springer Berlin Heidelberg DOI
 ShSa06: Fei Sha, Lawrence K. Saul (2006) Large margin hidden Markov models for automatic speech recognition. In Advances in neural information processing systems (pp. 1249–1256).
 RaKJ17: Aditi Raghunathan, Ravishankar Krishnaswamy, Prateek Jain (2017) Learning Mixture of Gaussians with Streaming Data. In Advances in Neural Information Processing Systems (pp. 6608–6617).
 ArKa01: Sanjeev Arora, Ravi Kannan (2001) Learning Mixtures of Arbitrary Gaussians. In Proceedings of the Thirtythird Annual ACM Symposium on Theory of Computing (pp. 247–257). New York, NY, USA: ACM DOI
 Dasg99: Sanjoy Dasgupta (1999) Learning mixtures of Gaussians. In Foundations of Computer Science, 1999. 40th Annual Symposium on (pp. 634–644). IEEE DOI
 HsKa13: Daniel Hsu, Sham M. Kakade (2013) Learning Mixtures of Spherical Gaussians: Moment Methods and Spectral Decompositions. In Proceedings of the 4th Conference on Innovations in Theoretical Computer Science (pp. 11–20). New York, NY, USA: ACM DOI
 RaDe14: Guillaume Rabusseau, François Denis (2014) Learning Negative Mixture Models by Tensor Decompositions. ArXiv:1403.4224 [Cs].
 LeSe99: Daniel D. Lee, H. Sebastian Seung (1999) Learning the parts of objects by nonnegative matrix factorization. Nature, 401(6755), 788–791. DOI
 Chri15: Christian Bauckhage (2015) Lecture Notes on Data Science: Soft kMeans Clustering. DOI
 BHLL08: Andrew R. Barron, Cong Huang, Jonathan Q. Li, Xi Luo (2008) MDL, penalized likelihood, and statistical risk. In Information Theory Workshop, 2008. ITW’08. IEEE (pp. 247–257). IEEE DOI
 BaCo91: A. R. Barron, T. M. Cover (1991) Minimum complexity density estimation. IEEE Transactions on Information Theory, 37(4), 1034–1054. DOI
 Bera77: Rudolf Beran (1977) Minimum Hellinger Distance Estimates for Parametric Models. The Annals of Statistics, 5(3), 445–463. DOI
 ReWa84: R. Redner, H. Walker (1984) Mixture Densities, Maximum Likelihood and the EM Algorithm. SIAM Review, 26(2), 195–239. DOI
 LiBa00: Jonathan Q. Li, Andrew R. Barron (2000) Mixture Density Estimation. In Advances in Neural Information Processing Systems 12 (pp. 279–285). MIT Press
 Bish94: Christopher Bishop (1994) Mixture Density Networks. Microsoft Research.
 WuYZ07: Qiang Wu, Yiming Ying, DingXuan Zhou (2007) Multikernel regularized classifiers. Journal of Complexity, 23(1), 108–134. DOI
 MuYA94: N. Murata, S. Yoshizawa, S. Amari (1994) Network information criteriondetermining the number of hidden units for an artificial neural network model. IEEE Transactions on Neural Networks, 5(6), 865–872. DOI
 TsBö16: Michael Tschannen, Helmut Bölcskei (2016) Noisy subspace clustering via matching pursuits. ArXiv:1612.03450 [Cs, Math, Stat].
 AnKI08: Tomohiro Ando, Sadanori Konishi, Seiya Imoto (2008) Nonlinear regression modeling via regularized radial basis function networks. Journal of Statistical Planning and Inference, 138(11), 3616–3633. DOI
 BaLi14: Heather Battey, Oliver Linton (2014) Nonparametric estimation of multivariate elliptic densities via finite mixture sieves. Journal of Multivariate Analysis, 123, 43–67. DOI
 GeHw82: Stuart Geman, ChiiRuey Hwang (1982) Nonparametric Maximum Likelihood Estimation by the Method of Sieves. The Annals of Statistics, 10(2), 401–414. DOI
 BaLi16: Heather Battey, Han Liu (2016) Nonparametrically filtered parametric density estimation.
 Will01: Christopher K. I. Williams (2001) On a Connection between Kernel PCA and Metric Multidimensional Scaling. In Advances in Neural Information Processing Systems 13 (Vol. 46, pp. 675–681). MIT Press DOI
 XuJo96: Lei Xu, Michael I. Jordan (1996) On Convergence Properties of the EM Algorithm for Gaussian Mixtures. Neural Computation, 8(1), 129–151. DOI
 Hall87: Peter Hall (1987) On KullbackLeibler Loss and Density Estimation. The Annals of Statistics, 15(4), 1491–1519. DOI
 AcMc05: Dimitris Achlioptas, Frank McSherry (2005) On Spectral Learning of Mixtures of Distributions. In Learning Theory (pp. 458–469). Springer Berlin Heidelberg DOI
 HeSN07: Georg Heigold, Ralf Schlüter, Hermann Ney (2007) On the Equivalence of Gaussian HMM and Gaussian HMMLike Hidden Conditional Random Fields. In Eighth Annual Conference of the International Speech Communication Association.
 McRa14: Geoffrey J. McLachlan, Suren Rathnayake (2014) On the number of components in a Gaussian mixture model. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(5), 341–355. DOI
 Fan91: Jianqing Fan (1991) On the Optimal Rates of Convergence for Nonparametric Deconvolution Problems. The Annals of Statistics, 19(3), 1257–1272. DOI
 LiXG12: Dawei Li, Lihong Xu, Erik Goodman (2012) Online EM Variants for Multivariate Normal Mixture Model in Background Learning and Moving Foreground Detection. Journal of Mathematical Imaging and Vision, 48(1), 114–133. DOI
 Chen95: Jiahua Chen (1995) Optimal Rate of Convergence for Finite Mixture Models. The Annals of Statistics, 23(1), 221–233. DOI
 RoWa97: Kathryn Roeder, Larry Wasserman (1997) Practical Bayesian Density Estimation Using Mixtures of Normals. Journal of the American Statistical Association, 92(439), 894–902. DOI
 WoSh95: Wing Hung Wong, Xiaotong Shen (1995) Probability Inequalities for Likelihood Ratios and Convergence Rates of Sieve MLES. The Annals of Statistics, 23(2), 339–362. DOI
 IMPV15: Takayuki Iguchi, Dustin G. Mixon, Jesse Peterson, Soledad Villar (2015) Probably certifiably correct kmeans clustering. ArXiv:1509.07983 [Cs, Math, Stat].
 WeVe12: Claudio Weidmann, Martin Vetterli (2012) Rate Distortion Behavior of Sparse Sources. IEEE Transactions on Information Theory, 58(8), 4969–4992. DOI
 GeWa00: Christopher R. Genovese, Larry Wasserman (2000) Rates of convergence for the Gaussian mixture sieve. Annals of Statistics, 1105–1127.
 Geer96: Sara van de Geer (1996) Rates of convergence for the maximum likelihood estimator in mixture models. Journal of Nonparametric Statistics, 6(4), 293–310. DOI
 Orr99: Mark J. L. Orr (1999) Recent Advances in Radial Basis Function Networks. Technical Report www.ed.ac.uk/ mjo/papers/recad.ps, Institute for Adaptive and Neural Computation
 RuYZ11: Lingyan Ruan, Ming Yuan, Hui Zou (2011) Regularized Parameter Estimation in HighDimensional Gaussian Mixture Models. Neural Computation, 23(6), 1605–1622. DOI
 ABCK15: Pranjal Awasthi, Afonso S. Bandeira, Moses Charikar, Ravishankar Krishnaswamy, Soledad Villar, Rachel Ward (2015) Relax, No Need to Round: Integrality of Clustering Formulations. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science (pp. 191–200). New York, NY, USA: ACM DOI
 RaPM05: Alexander Rakhlin, Dmitry Panchenko, Sayan Mukherjee (2005) Risk bounds for mixture density estimation. ESAIM: Probability and Statistics, 9, 220–229. DOI
 HuCB08: Cong Huang, G. L. H. Cheang, Andrew R. Barron (2008) Risk of penalized least squares, greedy selection and l1 penalization for flexible function libraries.
 BHBR16: Nicholas Boyd, Trevor Hastie, Stephen Boyd, Benjamin Recht, Michael Jordan (2016) Saturating Splines and Feature Selection. ArXiv:1609.06764 [Stat].
 ToAi15: Panos Toulis, Edoardo M. Airoldi (2015) Scalable estimation strategies based on stochastic approximations: classical results and new insights. Statistics and Computing, 25(4), 781–795. DOI
 MoVa10: A. Moitra, G. Valiant (2010) Settling the Polynomial Learnability of Mixtures of Gaussians. In 2010 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS) (pp. 93–102). DOI
 QiCh94: Shie Qian, Dapang Chen (1994) Signal representation using adaptive normalized Gaussian functions. Signal Processing, 36(1), 1–11. DOI
 KBGP16: Nicolas Keriven, Anthony Bourrier, Rémi Gribonval, Patrick Pérez (2016) Sketching for LargeScale Learning of Mixture Models. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6190–6194). DOI
 BaLi13: Heather Battey, Han Liu (2013) Smooth projected density estimation. ArXiv:1308.3968 [Stat].
 Tipp01: Michael E. Tipping (2001) Sparse Bayesian learning and the relevance vector machine. The Journal of Machine Learning Research, 1, 211–244. DOI
 TiNh01: Michael E. Tipping, Cambridge Cb Nh (2001) Sparse Kernel Principal Component Analysis. In Advances in Neural Information Processing Systems 13 (pp. 633–639). MIT Press
 BaRY98: A. Barron, J. Rissanen, Bin Yu (1998) The minimum description length principle in coding and modeling. IEEE Transactions on Information Theory, 44(6), 2743–2760. DOI
 LiZJ05: Haifeng Li, Keshu Zhang, Tao Jiang (2005) The regularized EM algorithm. In Proceedings of the national conference on artificial intelligence (Vol. 20, p. 807). Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999
 Tipp00: Michael E. Tipping (2000) The Relevance Vector Machine. In Advances in Neural Information Processing Systems (pp. 652–658). MIT Press
 RNSS16: Avik Ray, Joe Neeman, Sujay Sanghavi, Sanjay Shakkottai (2016) The Search Problem in Mixture Models. ArXiv:1610.00843 [Cs, Stat].
 Barr93: A.R. Barron (1993) Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3), 930–945. DOI
 Riss84: J. Rissanen (1984) Universal coding, information, prediction, and estimation. IEEE Transactions on Information Theory, 30(4), 629–636. DOI
 Gray84: R. Gray (1984) Vector quantization. IEEE ASSP Magazine, 1(2), 4–29. DOI
 GeGr12: Allen Gersho, Robert M. Gray (2012) Vector Quantization and Signal Compression. Springer Science & Business Media
 VaWe96: AW van der Vaart, Jon Wellner (1996) Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Science & Business Media