
Covariance functions

Mercer kernels, positive definite operators, spare reproducing kernels for that Hilbert space I bought on eBay real cheap

Usefulness: 🔧 🔧
Novelty: 💡
Uncertainty: 🤪 🤪 🤪
Incompleteness: 🚧 🚧

In spatial statistics, Gaussian processes, kernel machines and covariance estimation we are concerned with covariances between different indices of stochastic processes.

Covariance kernels apply to much more general creatures than just Gaussian processes; any process with finite second moments has a covariance function. They are especially useful for Gaussian process methods, since, basically, Gaussian processes are uniquely specified by their mean and covariance kernels, and also have convenient algebraic properties by virtue of being Gaussian, and there are some famous special cases where you can do elegant things with them in this context. More generally, they arise in pretty much anything where representer theorems hold. (Schölkopf, Herbrich, and Smola 2001; Yu et al. 2013)

What follows are some useful kernels to have in my toolkit, mostly over \(\mathbb{R}^n\) or at least some space with a metric. There are many more than I could fit here, over many more spaces than I need. (Real vectors, strings, other kernels, probability distributions etc.)

For these I have freely raided David Duvenaud’s crib notes, which became a thesis chapter (Duvenaud 2014). Also Wikipedia and (Abrahamsen 1997; Genton 2002).

General real covariance kernels

A function \(K:\mathcal{T}\times\mathcal{T}\to\mathbb{R}\) can be a covariance kernel if

  1. It is symmetric in its arguments \(K(s,t)=K(t,s)\) (more generally, conjugate symmetric – \(K(s,t)=K^*(t,s)\), but I think maybe my life will be simpler if I ignore the complex case for the moment.)
  2. It is positive semidefinite.

That positive semidefiniteness means that for arbitrary real numbers \(c_1,\dots,c_k\) and arbitrary indices \(t_1,\dots,t_k\)

\[ \sum_{i=1}^{k} \sum_{j=1}^{k} c_{i} c_{j} K(t_{i}, t_{j}) \geq 0 \]

The interpretation is that \(K(s,t)\) is the covariance between the values of the process at indices \(s\) and \(t\); for this to be consistent, the variance of any linear combination of process values must be non-negative, so it is necessary that

\[ \operatorname{Var}\left\{c_{1} X_{\mathbf{t}_{1}}+\cdots+c_{k} X_{\mathbf{t}_{k}}\right\}=\sum_{i=1}^{k} \sum_{j=1}^{k} c_{i} c_{j} K\left(\mathbf{t}_{i}, \mathbf{t}_{j}\right) \geq 0 \]


Amazingly (to me), this necessary condition is also sufficient for something to be a covariance kernel. Which is nice in principle. In practice, designing covariance functions directly from the positive-definiteness condition is tricky; the space of positive definite kernels is only implicitly defined. What we normally do is find a fun class that guarantees positive definiteness and riffle through that. Most of the rest of the article is devoted to such classes.
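To make the condition concrete, here is a minimal numpy sketch (function names are mine) that checks positive semidefiniteness of a candidate kernel numerically, via the eigenvalues of a Gram matrix on a finite index set — a finite-dimensional shadow of the full condition rather than a proof:

```python
import numpy as np

def gram(k, ts):
    """Gram matrix K[i, j] = k(ts[i], ts[j]) for a candidate kernel k."""
    return np.array([[k(s, t) for t in ts] for s in ts])

def looks_psd(k, ts, tol=1e-9):
    """Check that the Gram matrix on the index set ts has no negative eigenvalues."""
    return np.linalg.eigvalsh(gram(k, ts)).min() > -tol

ts = np.linspace(0.1, 3.0, 20)
print(looks_psd(lambda s, t: np.exp(-(s - t) ** 2), ts))   # True: squared exponential
print(looks_psd(lambda s, t: np.cos(s) + np.cos(t), ts))   # False: symmetric but not PSD
```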

Bonus: complex covariance kernels

I talked in terms of real kernels above because I generally observe real measurements of processes. But complex covariances often arise in a natural way too.

A function \(K:\mathcal{T}\times\mathcal{T}\to\mathbb{C}\) can be a covariance kernel if

  1. It is conjugate symmetric in its arguments – \(K(s,t)=K^*(t,s)\),
  2. It is positive semidefinite.

That positive semidefiniteness means that for arbitrary complex numbers \(c_1,\dots,c_k\) and arbitrary indices \(t_1,\dots,t_k\)

\[ \sum_{i=1}^{k} \sum_{j=1}^{k} c_{i} \overline{c_{j}} K(t_{i}, t_{j}) \geq 0. \]

Every analytic real kernel should also be a complex kernel, right? 🚧.

Wiener process kernel

From the naming, we might suspect that a Gaussian process would also describe a standard Wiener process, which is after all a process with Gaussian increments, which is a certain type of dependence. It is over a boring index space, time \(t\geq 0\), but there is indeed nothing stopping us.

We can read the covariance right off the Wiener process Wikipedia page. For the standard Wiener process \(\{W_t\}_{t\geq 0},\)

\[ \operatorname{cov}(W_{s},W_{t})=s \wedge t \]

Here \(s \wedge t\) means “the minimum of \(s\) and \(t\)”.

That result is standard. From it we can immediately construct the kernel \(K(s,t):=s \wedge t\).
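As a sanity check, here is a small numpy sketch (names and the jitter value are my choices) that builds the Gram matrix of \(K(s,t)=s \wedge t\) on a grid and draws an approximate Brownian path from it by Cholesky factorisation:

```python
import numpy as np

def k_wiener(s, t):
    """Wiener process covariance K(s, t) = min(s, t), for s, t >= 0."""
    return np.minimum(s, t)

rng = np.random.default_rng(0)
ts = np.linspace(0.01, 1.0, 200)                      # strictly positive time grid
K = k_wiener(ts[:, None], ts[None, :])                # Gram matrix by broadcasting
L = np.linalg.cholesky(K + 1e-12 * np.eye(len(ts)))   # tiny jitter for numerical safety
path = L @ rng.standard_normal(len(ts))               # one approximate Brownian path
print(path[:5])
```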

Causal kernels

Time-indexed processes can be more general than a standard Wiener process. We can construct kernels that are more general than this, right? Any positive-definite kernel of the form \(k(s,t)=g(s\wedge t)\). Is this class interesting? Maybe with other covariates.

🚧

Markov kernels

How can we know from inspecting a kernel whether it implies an independence structure of some kind? The Wiener process and causal kernels clearly imply certain independences. 🚧

Squared exponential

The classic, the default; analytically convenient because it is proportional to a Gaussian density and therefore cancels out with it at opportune times.

\[k_{\textrm{SE}}(x, x') = \sigma^2\exp\left(-\frac{(x - x')^2}{2\ell^2}\right)\]
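A one-liner, really; here is a minimal numpy version (parameter names are mine):

```python
import numpy as np

def k_se(x, xp, sigma2=1.0, ell=1.0):
    """Squared exponential kernel: sigma^2 exp(-(x - x')^2 / (2 ell^2))."""
    return sigma2 * np.exp(-((x - xp) ** 2) / (2.0 * ell ** 2))

print(k_se(0.0, 1.0))
```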

Rational Quadratic

Duvenaud reckons this is everywhere but TBH I have not seen it. Included for completeness.

\[k_{\textrm{RQ}}(x, x') = \sigma^2 \left( 1 + \frac{(x - x')^2}{2 \alpha \ell^2} \right)^{-\alpha}\]

Note that \(\lim_{\alpha\to\infty} k_{\textrm{RQ}}= k_{\textrm{SE}}\).
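A quick numerical check of that limit, as a sketch (names mine), comparing the rational quadratic at increasing \(\alpha\) against the squared exponential value:

```python
import numpy as np

def k_rq(x, xp, sigma2=1.0, ell=1.0, alpha=1.0):
    """Rational quadratic kernel."""
    return sigma2 * (1.0 + (x - xp) ** 2 / (2.0 * alpha * ell ** 2)) ** (-alpha)

tau = 0.7
for alpha in (1.0, 10.0, 1000.0):
    print(alpha, k_rq(0.0, tau, alpha=alpha))
print("SE limit:", np.exp(-tau ** 2 / 2.0))
```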

Matérn

The Matérn stationary (and in the Euclidean case, isotropic) covariance function is one model for covariance. See Carl Edward Rasmussen’s Gaussian Process lecture notes for a readable explanation, or chapter 4 of his textbook (Rasmussen and Williams 2006).

\[ k_{\textrm{Mat}}(x, x')=\sigma^{2} \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\sqrt{2 \nu} \frac{\|x - x'\|}{\rho}\right)^{\nu} K_{\nu}\left(\sqrt{2 \nu} \frac{\|x - x'\|}{\rho}\right) \]

where \(\Gamma\) is the gamma function, \(K_{\nu}\) is the modified Bessel function of the second kind, and \(\rho,\nu> 0\).

AFAICT you use this for covariances hypothesised to be less smooth than the squared exponential; the parameter \(\nu\) controls how many times the process is mean-square differentiable, and as \(\nu\to\infty\) we recover the squared exponential kernel. And other things?
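A numpy/scipy sketch of the formula above (my parameterisation); the only wrinkle is that \(K_\nu\) diverges at zero distance, where the kernel’s limiting value is \(\sigma^2\), so that case is patched in explicitly:

```python
import numpy as np
from scipy.special import gamma, kv

def k_matern(x, xp, sigma2=1.0, rho=1.0, nu=1.5):
    """Matérn kernel with lengthscale rho and smoothness nu."""
    r = np.abs(np.asarray(x, dtype=float) - np.asarray(xp, dtype=float))
    scaled = np.sqrt(2.0 * nu) * r / rho
    safe = np.where(scaled > 0.0, scaled, 1.0)   # placeholder where r == 0
    k = sigma2 * (2.0 ** (1.0 - nu) / gamma(nu)) * safe ** nu * kv(nu, safe)
    return np.where(scaled > 0.0, k, sigma2)     # limit at zero distance is sigma2

print(k_matern(0.0, np.array([0.0, 0.5, 2.0])))
```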

Periodic

\[ k_{\textrm{Per}}(x, x') = \sigma^2\exp\left(-\frac{2\sin^2(\pi|x - x'|/p)}{\ell^2}\right) \]

Locally periodic

This is an example of a composed kernel, explained below.

\[\begin{aligned} k_{\textrm{LocPer}}(x, x') &= k_{\textrm{Per}}(x, x')k_{\textrm{SE}}(x, x') \\ &= \sigma^2\exp\left(-\frac{2\sin^2(\pi|x - x'|/p)}{\ell^2}\right) \exp\left(-\frac{(x - x')^2}{2\ell^2}\right) \end{aligned}\]

Obviously there are other possible localisations of a periodic kernel; this particular product is the one conventionally called the locally periodic kernel.
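A minimal sketch of that product (I give the two factors separate lengthscales, which is common practice; the values are arbitrary):

```python
import numpy as np

def k_per(x, xp, ell=1.0, p=1.0):
    """Periodic kernel with period p."""
    return np.exp(-2.0 * np.sin(np.pi * np.abs(x - xp) / p) ** 2 / ell ** 2)

def k_locper(x, xp, p=1.0, ell_per=1.0, ell_se=3.0):
    """Locally periodic kernel: periodic factor times a squared exponential envelope."""
    envelope = np.exp(-((x - xp) ** 2) / (2.0 * ell_se ** 2))
    return k_per(x, xp, ell=ell_per, p=p) * envelope

print(k_locper(0.0, np.array([0.5, 1.0, 10.0])))
```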

With atoms

🚧 Is this feasible? What are the constructions that allow discontinuity in the process, and what difficulties do they engender?

“Integral” kernel

I just noticed the ambiguously named Integral kernel:

I’ve called the kernel the ‘integral kernel’ as we use it when we know observations of the integrals of a function, and want to estimate the function itself.


I would argue that all kernels are naturally defined in terms of integrals, but the author seems to mean something particular. I suspect I would call this a sampling kernel, but that name is also overloaded. Anyway, what is actually going on here? Where is it introduced? Possibly one of (Smith, Alvarez, and Lawrence 2018; O’Callaghan and Ramos 2011; Murray-Smith and Pearlmutter 2005).

Composing kernels

A sum or product (or outer sum, or tensor product) of kernels is still a kernel. For other transforms YMMV.

For example, in the case of Gaussian processes, suppose that, independently,

\[\begin{aligned} f_{1} &\sim \mathcal{GP}\left(\mu_{1}, k_{1}\right)\\ f_{2} &\sim \mathcal{GP}\left(\mu_{2}, k_{2}\right) \end{aligned}\] then

\[ f_{1}+f_{2} \sim \mathcal{GP} \left(\mu_{1}+\mu_{2}, k_{1}+k_{2}\right) \] so \(k_{1}+k_{2}\) is also a kernel.

More generally, if \(k_{1}\) and \(k_{2}\) are two kernels, and \(c_{1}\), and \(c_{2}\) are two positive real numbers, then:

\[ K(x, x')=c_{1} k_{1}(x, x')+c_{2} k_{2}(x, x') \] is again a kernel. Since products of kernels are also kernels, any polynomial of kernels with positive coefficients is in turn a kernel (Genton 2002).
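A sketch of this closure property in action (the component kernels are arbitrary choices of mine): combine two kernels by positive-weighted sum and product, and confirm numerically that the result still yields a positive semidefinite Gram matrix:

```python
import numpy as np

def gram(k, xs):
    return np.array([[k(a, b) for b in xs] for a in xs])

k1 = lambda x, y: np.exp(-(x - y) ** 2 / 2.0)   # squared exponential
k2 = lambda x, y: (1.0 + x * y) ** 2            # polynomial kernel
k_combo = lambda x, y: 2.0 * k1(x, y) + 0.5 * k2(x, y) + k1(x, y) * k2(x, y)

xs = np.linspace(-2.0, 2.0, 30)
print(np.linalg.eigvalsh(gram(k_combo, xs)).min() >= -1e-9)   # True
```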

Note that the additivity in terms of kernels is not the same as additivity in terms of induced feature spaces. The induced feature map of \(k_{1}+k_{2}\) is their concatenation rather than their sum. Suppose \(\phi_{1}(x)\) gives us the feature map of \(k_{1}\) for \(x\) and likewise \(\phi_{2}(x)\).

\[\begin{aligned} k_{1}(x, x') &=\phi_{1}(x)^{\top} \phi_{1}(x') \\ k_{2}(x, x') &=\phi_{2}(x)^{\top} \phi_{2}(x')\\ k_{1}(x, x')+k_{2}(x, x') &=\phi_{1}(x)^{\top} \phi_{1}(x')+\phi_{2}(x)^{\top} \phi_{2}(x')\\ &=\left[\begin{array}{c}{\phi_{1}(x)} \\ {\phi_{2}(x)}\end{array}\right]^{\top} \left[\begin{array}{c}{\phi_{1}(x')} \\ {\phi_{2}(x')}\end{array}\right] \end{aligned}\]

If \(k_{y}:\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}\) is a kernel and \(\psi: \mathcal{X}\to\mathcal{Y}\) is any map, then the following is also a kernel:

\[\begin{aligned} k_{x}:&\mathcal{X}\times\mathcal{X}\to\mathbb{R}\\ & (x,x')\mapsto k_{y}(\psi(x), \psi(x')) \end{aligned}\]

which apparently is now called a deep kernel.
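A sketch with a fixed (not learned) warping; in the “deep kernel” setting \(\psi\) would be, say, a neural network whose parameters are fitted alongside the kernel hyperparameters:

```python
import numpy as np

def k_se(y, yp, ell=1.0):
    return np.exp(-((y - yp) ** 2) / (2.0 * ell ** 2))

def k_warped(x, xp, psi=np.tanh, ell=1.0):
    """k_x(x, x') = k_y(psi(x), psi(x')) for a warping psi: X -> Y."""
    return k_se(psi(x), psi(xp), ell=ell)

print(k_warped(0.5, np.array([0.5, 2.0, 50.0])))   # saturates for large inputs
```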

Also if \(A\) is a positive definite operator, then of course it defines a kernel \(k_A(x,x'):=x^{\top}Ax'\)

(Genton 2002) uses the properties of covariance to construct some other nifty ones:

Let \(h:\mathcal{X}\to\mathbb{R}^{+}\) be non-negative with \(h(\mathbf{0})=0\). Then, using the identity for RVs

\[ \mathop{\textrm{Cov}}\left(Y_{1}, Y_{2}\right)=\left[\mathop{\textrm{Var}}\left(Y_{1}+Y_{2}\right)-\mathop{\textrm{Var}}\left(Y_{1}-Y_{2}\right)\right] / 4 \]

we find that the following is a kernel

\[ K(x, x')=\frac{1}{4}[h(x+x')-h(x-x')] \]

All these allow various cunning combination strategies, which I will likely return to discuss. 🚧 Some of them are in the references. For example, (Duvenaud et al. 2013) position their work in the wider field:

There is a large body of work attempting to construct a rich kernel through a weighted sum of base kernels (e.g. (Bach 2008; Christoudias, Urtasun, and Darrell 2009)). While these approaches find the optimal solution in polynomial time, speed comes at a cost: the component kernels, as well as their hyperparameters, must be specified in advance[…]

(Hinton and Salakhutdinov 2008) use a deep neural network to learn an embedding; this is a flexible approach to kernel learning but relies upon finding structure in the input density, p(x). Instead we focus on domains where most of the interesting structure is in f(x).

(Wilson and Adams 2013) derive kernels of the form \(SE \times \cos(x - x_0)\), forming a basis for stationary kernels. These kernels share similarities with \(SE \times Per\) but can express negative prior correlation, and could usefully be included in our grammar.

Stationary spectral kernels

(Sun et al. 2018; Bochner 1959; Kom Samo and Roberts 2015; Yaglom 1987b) construct spectral kernels, in the sense that they design the kernel via its spectral representation, using Bochner’s theorem to guarantee that it is positive definite and stationary:

Bochner’s theorem: A complex-valued function \(K\) on \(\mathbb{R}^{d}\) is the covariance function of a weakly stationary mean square continuous complex-valued random process on \(\mathbb{R}^{d}\) if and only if it can be represented as

\[ K(\boldsymbol{\tau})=\int_{\mathbb{R}^{d}} \exp \left(2 \pi i \boldsymbol{w}^{\top} \boldsymbol{\tau}\right) \psi(\mathrm{d} \boldsymbol{w}) \] where \(\psi\) is a positive and finite measure. If \(\psi\) has a density \(S(\boldsymbol{w})\), then \(S\) is called the spectral density or power spectrum of \(K\), i.e. \(S\) and \(K\) are Fourier duals.
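One way to see the theorem in action is Monte Carlo: sample frequencies from a spectral measure and average cosines. Below is a sketch (using, as a worked example, the Gaussian spectral density that corresponds to the squared exponential kernel with lengthscale \(\ell\)); the estimate should approximately match \(\exp(-\tau^2/2\ell^2)\):

```python
import numpy as np

rng = np.random.default_rng(0)
ell = 1.0
# Frequencies drawn from the spectral density of the SE kernel: w ~ N(0, (1 / (2 pi ell))^2).
w = rng.normal(0.0, 1.0 / (2.0 * np.pi * ell), size=5000)

def k_spectral(tau):
    """Monte Carlo estimate of K(tau) = E[cos(2 pi w tau)] under the spectral measure."""
    return np.cos(2.0 * np.pi * w * tau).mean()

tau = 0.8
print(k_spectral(tau), np.exp(-tau ** 2 / (2.0 * ell ** 2)))
```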

Non-stationary spectral kernels

(Sun et al. 2018; Remes, Heinonen, and Kaski 2017; Kom Samo and Roberts 2015) use a generalised Bochner Theorem (Yaglom 1987b) which does not presume anything about stationarity:

A complex-valued bounded continuous function \(K\) on \(\mathbb{R}^{d}\) is the covariance function of a mean square continuous complex-valued random process on \(\mathbb{R}^{d}\) if and only if it can be represented as

\[ K(\boldsymbol{s}, \boldsymbol{t})=\int_{\mathbb{R}^{d} \times \mathbb{R}^{d}} e^{2 \pi i\left(\boldsymbol{w}_{1}^{\top} \boldsymbol{s}-\boldsymbol{w}_{2}^{\top} \boldsymbol{t}\right)} \psi\left(\mathrm{d} \boldsymbol{w}_{\mathbf{1}}, \mathrm{d} \boldsymbol{w}_{\mathbf{2}}\right) \]

This is clearly more general, but it is not immediately clear how to use this extra potential; spectral representations are not an intuitive way of constructing things.

Locally stationary

(Genton 2002) defines these as kernels with a particular structure, specifically ones that factor into a stationary kernel \(K_2\) and a non-negative function \(K_1\) in the following way:

\[ K(\mathbf{s}, \mathbf{t})=K_{1}\left(\frac{\mathbf{s}+\mathbf{t}}{2}\right) K_{2}(\mathbf{s}-\mathbf{t}) \]

Global structure then depends on the mean location \(\frac{\mathbf{s}+\mathbf{t}}{2}\). (Genton 2002) describes some nifty spectral properties of these kernels.
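A sketch of the factored form (the particular \(K_1\), \(K_2\) here are arbitrary illustrative choices; the factorisation by itself is just a structural property, so positive definiteness still has to be checked for a given pair):

```python
import numpy as np

def k_locally_stationary(s, t,
                         k1=lambda u: np.exp(-u ** 2),             # non-negative K1, of (s + t)/2
                         k2=lambda tau: np.exp(-tau ** 2 / 2.0)):  # stationary K2, of s - t
    """K(s, t) = K1((s + t) / 2) * K2(s - t)."""
    return k1((s + t) / 2.0) * k2(s - t)

print(k_locally_stationary(0.1, np.array([0.1, 0.5, 3.0])))
```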

Other constructions might vie for the title of “locally stationary”. To check. 🚧

Genton kernels

That’s my name for them because they seem to originate in (Genton 2002).

For any non-negative function \(h:\mathcal{T}\to\mathbb{R}^+\) with \(h(\mathbf{0})=0,\) the following is a kernel:

\[ K(\mathbf{s}, \mathbf{t})=\frac{1}{4}[h(\mathbf{s}+\mathbf{t})-h(\mathbf{s}-\mathbf{t})] \] Genton gives the example \(h:\mathbf{x}\mapsto \|\mathbf{x}\|^{2},\) which recovers the dot-product kernel \(K(\mathbf{s},\mathbf{t})=\mathbf{s}^{\top}\mathbf{t}.\)

The motivation is the identity

\[ \operatorname{Cov}\left(Y_{1}, Y_{2}\right)= \left[\operatorname{Var}\left(Y_{1}+Y_{2}\right)-\operatorname{Var}\left(Y_{1}-Y_{2}\right)\right] / 4. \]
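A tiny numerical check of the construction (names are mine), using \(h(\mathbf{x})=\|\mathbf{x}\|^2\), which should reproduce the dot-product kernel:

```python
import numpy as np

def genton_kernel(h):
    """Return K(s, t) = (h(s + t) - h(s - t)) / 4 for a given h with h(0) = 0."""
    return lambda s, t: (h(s + t) - h(s - t)) / 4.0

h = lambda x: float(np.dot(x, x))      # h(x) = ||x||^2
K = genton_kernel(h)
s, t = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(K(s, t), float(np.dot(s, t)))    # both 1.0
```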

Compactly supported

We usually think about these in the stationary isotropic case, where we mean kernels that vanish whenever the distance between two observations \(s,t\) is larger than a certain cut-off distance \(L,\) i.e. \(\|s-t\|>L\Rightarrow K(s,t)=0\). These are great because they make the Gram matrix sparse (for example, if the cut-off is much smaller than the diameter of the observations and most observations have few covariance neighbours) and so can lead to computational efficiency even for exact inference without any special tricks. They don’t seem to be popular? Statisticians are generally nervous around inferring the support of a parameter, or assigning zero weight to any region of a prior without very good reason, and I imagine this carries over to the analogous problem of covariance kernel support?

Despite feeling that qualm intuitively myself, I think there are good cases for this kind of kernel; for example, with hierarchical kernels you can get correlation between kernels of disjoint support.

(Genton 2002) mentions

\[ \max \left\{\left(1-\frac{\|\mathbf{s}-\mathbf{t}\|}{\tilde{\theta}}\right)^{\tilde{\nu}}, 0\right\} \] and handballs us to (Gneiting 2002) for a bigger smörgåsbord of stationary compactly supported kernels. Gneiting has a couple of methods designed to produce certain smoothness properties at the boundary and the origin, but is mostly about producing compactly supported kernels via clever integral transforms.
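A sketch of the truncated-power kernel above and the sparsity it buys (the cut-off and exponent are arbitrary values of mine; this family is only positive definite when the exponent is large enough relative to the input dimension, which is part of what (Gneiting 2002) is careful about):

```python
import numpy as np

def k_truncated_power(s, t, theta=0.3, nu=2.0):
    """max{(1 - |s - t| / theta)^nu, 0}: exactly zero beyond the cut-off theta."""
    u = np.maximum(1.0 - np.abs(s - t) / theta, 0.0)
    return u ** nu

xs = np.linspace(0.0, 10.0, 500)
K = k_truncated_power(xs[:, None], xs[None, :])
print("fraction of nonzero Gram entries:", np.count_nonzero(K) / K.size)
```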

Kernels with desired symmetry

(Duvenaud 2014, chap. 2) summarises Ginsbourger et al.’s work on kernels with desired symmetries / invariances. 🚧 This produces, for example, the periodic kernel above, but also such cute tricks as priors over Möbius strips.

Learning kernels

This is usually in the context of Gaussian processes, where everything can work out nicely if you are lucky. The goal in all of these seems to be to maximise the marginal likelihood, a.k.a. the model evidence, which will be familiar from every Bayesian ML method ever.

Learning kernel hyperparameters

🚧

Learning kernel composition

Automating kernel design by some composition of simpler atomic kernels. AFAICT this started from summaries like (Genton 2002) and went via Duvenaud’s aforementioned notes to become a small industry (Lloyd et al. 2014; Duvenaud, Nickisch, and Rasmussen 2011; Duvenaud et al. 2013; Grosse et al. 2012). A prominent example was the Automatic Statistician project by David Duvenaud, James Robert Lloyd, Roger Grosse and colleagues, which works by greedy combinatorial search over possible compositions.

More fashionable, presumably, are the differentiable search methods. For example, the AutoGP system (Krauth et al. 2016; Bonilla, Krauth, and Dezfouli 2016) incorporates tricks like these to use gradient descent to design kernels for Gaussian processes. (Sun et al. 2018) construct deep networks of composed kernels. I imagine the Deep Gaussian Process literature is also of this kind, but have not read it.

Hyperkernels

Kernels on kernels, for learning kernels. 🚧 (Ong, Smola, and Williamson 2005, 2002; Ong and Smola 2003; Kondor and Jebara 2006)

Non-positive kernels

As in, kernels which are not positive-definite. (Ong et al. 2004) 🚧

Covariance kernels of non-negative fields

Positive semidefiniteness of the covariance of a random vector \(\mathbf X\in \mathbb{R}_+^d\) requires that for all \(\mathbf {b}\in \mathbb{R}^d\)

\[ \operatorname {var} (\mathbf {b} ^{\rm {T}}\mathbf {X} )=\mathbf {b} ^{\rm {T}}\operatorname {var} (\mathbf {X} )\mathbf {b} \geq 0. \]

What more can we say about this covariance if every element of \(\mathbf X\) is non-negative? 🚧

See also [non-negative matrix factorization]({filename}

Refs

Abrahamsen, Petter. 1997. “A Review of Gaussian Random Fields and Correlation Functions.” http://publications.nr.no/publications.nr.no/directdownload/publications.nr.no/rask/old/917_Rapport.pdf.

Agarwal, Arvind, and Hal Daumé III. 2011. “Generative Kernels for Exponential Families.” In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 85–92. http://proceedings.mlr.press/v15/agarwal11b.html.

Aronszajn, N. 1950. “Theory of Reproducing Kernels.” Transactions of the American Mathematical Society 68 (3): 337–404. https://doi.org/10.2307/1990404.

Álvarez, Mauricio A., Lorenzo Rosasco, and Neil D. Lawrence. 2012. “Kernels for Vector-Valued Functions: A Review.” Foundations and Trends® in Machine Learning 4 (3): 195–266. https://doi.org/10.1561/2200000036.

Bach, Francis. 2008. “Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning.” In Proceedings of the 21st International Conference on Neural Information Processing Systems, 105–12. NIPS’08. USA: Curran Associates Inc. http://papers.nips.cc/paper/3418-exploring-large-feature-spaces-with-hierarchical-multiple-kernel-learning.pdf.

Balog, Matej, Balaji Lakshminarayanan, Zoubin Ghahramani, Daniel M. Roy, and Yee Whye Teh. 2016. “The Mondrian Kernel,” June. http://arxiv.org/abs/1606.05241.

Bochner, Salomon. 1959. Lectures on Fourier Integrals. Princeton University Press.

Bonilla, Edwin V., Karl Krauth, and Amir Dezfouli. 2016. “Generic Inference in Latent Gaussian Process Models,” September. http://arxiv.org/abs/1609.00577.

Christoudias, Mario, Raquel Urtasun, and Trevor Darrell. 2009. “Bayesian Localized Multiple Kernel Learning.” UCB/EECS-2009-96. EECS Department, University of California, Berkeley. http://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-96.html.

Cortes, Corinna, Patrick Haffner, and Mehryar Mohri. 2004. “Rational Kernels: Theory and Algorithms.” Journal of Machine Learning Research 5 (December): 1035–62. http://dl.acm.org/citation.cfm?id=1005332.1016793.

Cressie, Noel, and Hsin-Cheng Huang. 1999. “Classes of Nonseparable, Spatio-Temporal Stationary Covariance Functions.” Journal of the American Statistical Association 94 (448): 1330–9. https://doi.org/10.1080/01621459.1999.10473885.

Delft, Anne van, and Michael Eichler. 2016. “Locally Stationary Functional Time Series,” February. http://arxiv.org/abs/1602.05125.

Duvenaud, David. 2014. “Automatic Model Construction with Gaussian Processes.” PhD Thesis, University of Cambridge. https://github.com/duvenaud/phd-thesis.

Duvenaud, David K., Hannes Nickisch, and Carl E. Rasmussen. 2011. “Additive Gaussian Processes.” In Advances in Neural Information Processing Systems, 226–34. http://papers.nips.cc/paper/4221-additive-gaussian-processes.pdf.

Duvenaud, David, James Lloyd, Roger Grosse, Joshua Tenenbaum, and Ghahramani Zoubin. 2013. “Structure Discovery in Nonparametric Regression Through Compositional Kernel Search.” In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 1166–74. http://machinelearning.wustl.edu/mlpapers/papers/icml2013_duvenaud13.

Genton, Marc G. 2002. “Classes of Kernels for Machine Learning: A Statistics Perspective.” Journal of Machine Learning Research 2 (March): 299–312. http://jmlr.org/papers/volume2/genton01a/genton01a.pdf.

Girolami, Mark, and Simon Rogers. 2005. “Hierarchic Bayesian Models for Kernel Learning.” In Proceedings of the 22nd International Conference on Machine Learning - ICML ’05, 241–48. Bonn, Germany: ACM Press. https://doi.org/10.1145/1102351.1102382.

Gneiting, Tilmann. 2002. “Compactly Supported Correlation Functions.” Journal of Multivariate Analysis 83 (2): 493–508. https://doi.org/10.1006/jmva.2001.2056.

Gneiting, Tilmann, William Kleiber, and Martin Schlather. 2010. “Matérn Cross-Covariance Functions for Multivariate Random Fields.” Journal of the American Statistical Association 105 (491): 1167–77. https://doi.org/10.1198/jasa.2010.tm09420.

Grosse, Roger, Ruslan R. Salakhutdinov, William T. Freeman, and Joshua B. Tenenbaum. 2012. “Exploiting Compositionality to Explore a Large Space of Model Structures.” In Proceedings of the Conference on Uncertainty in Artificial Intelligence. http://arxiv.org/abs/1210.4856.

Hartikainen, J., and S. Särkkä. 2010. “Kalman Filtering and Smoothing Solutions to Temporal Gaussian Process Regression Models.” In 2010 IEEE International Workshop on Machine Learning for Signal Processing, 379–84. Kittila, Finland: IEEE. https://doi.org/10.1109/MLSP.2010.5589113.

Hinton, Geoffrey E, and Ruslan R Salakhutdinov. 2008. “Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes.” In Advances in Neural Information Processing Systems 20, edited by J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, 1249–56. Curran Associates, Inc. http://papers.nips.cc/paper/3211-using-deep-belief-nets-to-learn-covariance-kernels-for-gaussian-processes.pdf.

Hofmann, Thomas, Bernhard Schölkopf, and Alexander J. Smola. 2008. “Kernel Methods in Machine Learning.” The Annals of Statistics 36 (3): 1171–1220. https://doi.org/10.1214/009053607000000677.

Kom Samo, Yves-Laurent, and Stephen Roberts. 2015. “Generalized Spectral Kernels,” June. http://arxiv.org/abs/1506.02236.

Kondor, Risi, and Tony Jebara. 2006. “Gaussian and Wishart Hyperkernels.” In Proceedings of the 19th International Conference on Neural Information Processing Systems, 729–36. NIPS’06. Cambridge, MA, USA: MIT Press. http://dl.acm.org/citation.cfm?id=2976456.2976548.

Krauth, Karl, Edwin V. Bonilla, Kurt Cutajar, and Maurizio Filippone. 2016. “AutoGP: Exploring the Capabilities and Limitations of Gaussian Process Models.” In UAI17. http://arxiv.org/abs/1610.05392.

Lawrence, Neil. 2005. “Probabilistic Non-Linear Principal Component Analysis with Gaussian Process Latent Variable Models.” Journal of Machine Learning Research 6 (Nov): 1783–1816. http://www.jmlr.org/papers/v6/lawrence05a.html.

Lloyd, James Robert, David Duvenaud, Roger Grosse, Joshua Tenenbaum, and Zoubin Ghahramani. 2014. “Automatic Construction and Natural-Language Description of Nonparametric Regression Models.” In Twenty-Eighth AAAI Conference on Artificial Intelligence. http://arxiv.org/abs/1402.4304.

Mercer, J. 1909. “Functions of Positive and Negative Type, and Their Connection with the Theory of Integral Equations.” Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 209 (441-458): 415–46. https://doi.org/10.1098/rsta.1909.0016.

Micchelli, Charles A., and Massimiliano Pontil. 2005a. “Learning the Kernel Function via Regularization.” Journal of Machine Learning Research 6 (Jul): 1099–1125. http://www.jmlr.org/papers/v6/micchelli05a.html.

———. 2005b. “On Learning Vector-Valued Functions.” Neural Computation 17 (1): 177–204. https://doi.org/10.1162/0899766052530802.

Minasny, Budiman, and Alex. B. McBratney. 2005. “The Matérn Function as a General Model for Soil Variograms.” Geoderma, Pedometrics 2003, 128 (3–4): 192–207. https://doi.org/10.1016/j.geoderma.2005.04.003.

Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. 1 edition. Cambridge, MA: The MIT Press.

Murray-Smith, Roderick, and Barak A. Pearlmutter. 2005. “Transformations of Gaussian Process Priors.” In Deterministic and Statistical Methods in Machine Learning, edited by Joab Winkler, Mahesan Niranjan, and Neil Lawrence, 110–23. Lecture Notes in Computer Science. Springer Berlin Heidelberg. http://bcl.hamilton.ie/~barak/papers/MLW-Jul-2005.pdf.

O’Callaghan, Simon Timothy, and Fabio T. Ramos. 2011. “Continuous Occupancy Mapping with Integral Kernels.” In Twenty-Fifth AAAI Conference on Artificial Intelligence. https://www.aaai.org/ocs/index.php/AAAI/AAAI11/paper/view/3784.

Ong, Cheng Soon, Xavier Mary, Stéphane Canu, and Alexander J. Smola. 2004. “Learning with Non-Positive Kernels.” In Twenty-First International Conference on Machine Learning - ICML ’04, 81. Banff, Alberta, Canada: ACM Press. https://doi.org/10.1145/1015330.1015443.

Ong, Cheng Soon, and Alexander J. Smola. 2003. “Machine Learning Using Hyperkernels.” In Proceedings of the Twentieth International Conference on International Conference on Machine Learning, 568–75. ICML’03. AAAI Press. http://dl.acm.org/citation.cfm?id=3041838.3041910.

Ong, Cheng Soon, Alexander J. Smola, and Robert C. Williamson. 2002. “Hyperkernels.” In Proceedings of the 15th International Conference on Neural Information Processing Systems, 495–502. NIPS’02. Cambridge, MA, USA: MIT Press. http://dl.acm.org/citation.cfm?id=2968618.2968680.

———. 2005. “Learning the Kernel with Hyperkernels.” Journal of Machine Learning Research 6 (Jul): 1043–71. http://www.jmlr.org/papers/v6/ong05a.html.

Pfaffel, Oliver. 2012. “Wishart Processes,” January. http://arxiv.org/abs/1201.3256.

Rakotomamonjy, Alain, Francis R. Bach, Stéphane Canu, and Yves Grandvalet. 2008. “SimpleMKL.” Journal of Machine Learning Research 9 (Nov): 2491–2521. http://www.jmlr.org/papers/v9/rakotomamonjy08a.html.

Rasmussen, Carl Edward, and Christopher K. I. Williams. 2006. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. Cambridge, Mass: MIT Press. http://www.gaussianprocess.org/gpml/.

Remes, Sami, Markus Heinonen, and Samuel Kaski. 2017. “Non-Stationary Spectral Kernels.” In Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 4642–51. Curran Associates, Inc. http://papers.nips.cc/paper/7050-non-stationary-spectral-kernels.pdf.

Särkkä, Simo, and Jouni Hartikainen. 2012. “Infinite-Dimensional Kalman Filtering Approach to Spatio-Temporal Gaussian Process Regression.” In Artificial Intelligence and Statistics. http://www.jmlr.org/proceedings/papers/v22/sarkka12.html.

Särkkä, Simo, A. Solin, and J. Hartikainen. 2013. “Spatiotemporal Learning via Infinite-Dimensional Bayesian Filtering and Smoothing: A Look at Gaussian Process Regression Through Kalman Filtering.” IEEE Signal Processing Magazine 30 (4): 51–61. https://doi.org/10.1109/MSP.2013.2246292.

Schölkopf, Bernhard, Ralf Herbrich, and Alex J. Smola. 2001. “A Generalized Representer Theorem.” In Computational Learning Theory, edited by David Helmbold and Bob Williamson, 416–26. Lecture Notes in Computer Science. Springer Berlin Heidelberg. https://doi.org/10.1007/3-540-44581-1.

Schölkopf, Bernhard, and Alexander J. Smola. 2003. “A Short Introduction to Learning with Kernels.” In Advanced Lectures on Machine Learning, edited by Shahar Mendelson and Alexander J. Smola, 41–64. Lecture Notes in Computer Science 2600. Springer Berlin Heidelberg. https://doi.org/10.1007/3-540-36434-X_2.

———. 2002. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.

Smith, Michael Thomas, Mauricio A. Alvarez, and Neil D. Lawrence. 2018. “Gaussian Process Regression for Binned Data,” September. http://arxiv.org/abs/1809.02010.

Stein, Michael L. 2005. “Space-Time Covariance Functions.” Journal of the American Statistical Association 100 (469): 310–21. https://doi.org/10.1198/016214504000000854.

Sun, Shengyang, Guodong Zhang, Chaoqi Wang, Wenyuan Zeng, Jiaman Li, and Roger Grosse. 2018. “Differentiable Compositional Kernel Learning for Gaussian Processes.” arXiv Preprint arXiv:1806.04326.

Székely, Gábor J., and Maria L. Rizzo. 2009. “Brownian Distance Covariance.” The Annals of Applied Statistics 3 (4): 1236–65. https://doi.org/10.1214/09-AOAS312.

Vedaldi, A., and A. Zisserman. 2012. “Efficient Additive Kernels via Explicit Feature Maps.” IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (3): 480–92. https://doi.org/10.1109/TPAMI.2011.153.

Vert, Jean-Philippe, Koji Tsuda, and Bernhard Schölkopf. 2004. “A Primer on Kernel Methods.” In Kernel Methods in Computational Biology. MIT Press. http://kyb.tuebingen.mpg.de/fileadmin/user_upload/files/publications/pdfs/pdf2549.pdf.

Vishwanathan, S. V. N., Nicol N. Schraudolph, Risi Kondor, and Karsten M. Borgwardt. 2010. “Graph Kernels.” Journal of Machine Learning Research 11 (August): 1201–42. http://authors.library.caltech.edu/20528/1/Vishwanathan2010p11646J_Mach_Learn_Res.pdf.

Wilk, Mark van der, Andrew G. Wilson, and Carl E. Rasmussen. 2014. “Variational Inference for Latent Variable Modelling of Correlation Structure.” In NIPS 2014 Workshop on Advances in Variational Inference.

Wilson, Andrew Gordon, and Ryan Prescott Adams. 2013. “Gaussian Process Kernels for Pattern Discovery and Extrapolation.” In International Conference on Machine Learning. http://arxiv.org/abs/1302.4245.

Wilson, Andrew Gordon, Christoph Dann, Christopher G. Lucas, and Eric P. Xing. 2015. “The Human Kernel,” October. http://arxiv.org/abs/1510.07389.

Wilson, Andrew Gordon, and Zoubin Ghahramani. 2011. “Generalised Wishart Processes.” In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, 736–44. UAI’11. Arlington, Virginia, United States: AUAI Press. http://dl.acm.org/citation.cfm?id=3020548.3020633.

———. 2012. “Modelling Input Varying Correlations Between Multiple Responses.” In Machine Learning and Knowledge Discovery in Databases, edited by Peter A. Flach, Tijl De Bie, and Nello Cristianini, 858–61. Lecture Notes in Computer Science. Springer Berlin Heidelberg.

Wu, Zongmin. 1995. “Compactly Supported Positive Definite Radial Functions.” Advances in Computational Mathematics 4 (1): 283–92. https://doi.org/10.1007/BF03177517.

Yaglom, A. M. 1987a. Correlation Theory of Stationary and Related Random Functions. Springer Series in Statistics. New York: Springer-Verlag.

———. 1987b. Correlation Theory of Stationary and Related Random Functions: Supplementary Notes and References. New York, NY: Springer Science & Business Media.

———. 1987c. Correlation Theory of Stationary and Related Random Functions. Volume II: Supplementary Notes and References. Springer Series in Statistics. New York: Springer-Verlag.

Yu, Yaoliang, Hao Cheng, Dale Schuurmans, and Csaba Szepesvári. 2013. “Characterizing the Representer Theorem.” In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 570–78. http://www.jmlr.org/proceedings/papers/v28/yu13.pdf.

Zhang, Aonan, and John Paisley. 2019. “Random Function Priors for Correlation Modeling.” In International Conference on Machine Learning, 7424–33. http://proceedings.mlr.press/v97/zhang19k.html.