Covariance functions

Variograms, Mercer kernels, positive definite operators, spare reproducing kernels for that Hilbert space I bought on eBay real cheap

September 16, 2019 — January 5, 2021

Hilbert space
kernel tricks
metrics
signal processing
statistics
stochastic processes
Figure 1: A realisation of a nonstationary rough covariance process (partially observed)

On the interpretation of kernels as the covariance functions of stochastic processes, which is one way to define stochastic processes.

Suppose we have a real-valued stochastic process

\[\{\mathsf{x}(t)\}_{t\in \mathcal{T}} \] For now we may as well take the index to be \(\mathcal{T}\subseteq\mathbb{R}^D\), or at least a nice metric space.

The covariance kernel of \(\mathsf{x}\) is a function

\[\begin{aligned}\kappa:&\mathcal{T}\times \mathcal{T}&\to& \mathbb{R}\\ &s,t&\mapsto&\operatorname{Cov}(\mathsf{x}(s),\mathsf{x}(t)). \end{aligned}\]

This is covariance in the usual sense, to wit,

\[\begin{aligned} \operatorname{Cov}(\mathsf{x}(s),\mathsf{x}(t)) &:=\mathbb{E}\left[(\mathsf{x}(s)-\mathbb{E}[\mathsf{x}(s)])(\mathsf{x}(t)-\mathbb{E}[\mathsf{x}(t)])\right]\\ &=\mathbb{E}[\mathsf{x}(s)\mathsf{x}(t)]- \mathbb{E}[\mathsf{x}(s)]\mathbb{E}[\mathsf{x}(t)]. \end{aligned}\]
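
To make this concrete, here is a minimal numpy sketch (mine, not from the original post) that estimates a covariance kernel empirically from simulated realisations, using standard Brownian motion, whose covariance kernel is \(\kappa(s,t)=\min(s,t)\), as the test case.

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_times = 20_000, 50
ts = np.linspace(0.0, 1.0, n_times + 1)[1:]   # skip t = 0, where the variance is zero
dt = ts[1] - ts[0]

# Brownian motion: cumulative sums of independent Gaussian increments.
increments = rng.normal(scale=np.sqrt(dt), size=(n_paths, n_times))
paths = np.cumsum(increments, axis=1)          # shape (n_paths, n_times)

# Empirical covariance kernel evaluated on the grid: one entry per pair (s, t).
emp_cov = np.cov(paths, rowvar=False)          # shape (n_times, n_times)

# Theoretical covariance of Brownian motion is min(s, t).
theory = np.minimum.outer(ts, ts)
print(np.abs(emp_cov - theory).max())          # small, shrinking as n_paths grows
```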

These are useful objects. In spatial statistics, Gaussian processes, kernel machines and covariance estimation we are concerned with such covariances between values of stochastic processes at different values of their indices. The Karhunen–Loève transform decomposes stochastic processes into a basis of eigenfunctions of the covariance kernel operator. We might consider them in terms of processes defined through convolution instead.
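
As a concrete illustration of that decomposition (my sketch, not the post's): discretise the covariance operator of Brownian motion on a grid and eigendecompose it; the eigenvalues approximate the known Karhunen–Loève eigenvalues \(\big((k-\tfrac12)\pi\big)^{-2}\).

```python
import numpy as np

n = 500
ts = np.linspace(0.0, 1.0, n + 1)[1:]
K = np.minimum.outer(ts, ts)   # Gram matrix of the Brownian-motion kernel min(s, t)

# Discretised covariance operator: kernel matrix times the quadrature weight dt.
dt = 1.0 / n
eigvals = np.linalg.eigvalsh(K * dt)[::-1]   # sorted descending

# Known Karhunen-Loève eigenvalues of Brownian motion on [0, 1].
k = np.arange(1, 4)
print(eigvals[:3])
print(1.0 / ((k - 0.5) * np.pi) ** 2)        # these should roughly agree
```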

Any process with finite second moments has a covariance function. They are especially renowned for Gaussian process methods, since Gaussian processes are uniquely specified by their mean functions and covariance kernels, and also have the usual convenient algebraic properties by virtue of being Gaussian.
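
Since a zero-mean Gaussian process is pinned down by its covariance kernel, drawing realisations is just linear algebra: form the Gram matrix at the inputs of interest, factor it, and transform i.i.d. standard normals. A minimal sketch, assuming a squared-exponential kernel (my choice of kernel, purely for illustration):

```python
import numpy as np

def se_kernel(s, t, ell=1.0):
    """Squared-exponential covariance kernel."""
    return np.exp(-0.5 * ((s[:, None] - t[None, :]) / ell) ** 2)

rng = np.random.default_rng(1)
ts = np.linspace(0.0, 10.0, 200)
K = se_kernel(ts, ts)

# Add jitter for numerical positive definiteness, then sample x ~ N(0, K).
L = np.linalg.cholesky(K + 1e-9 * np.eye(len(ts)))
samples = L @ rng.normal(size=(len(ts), 3))    # three independent GP draws, one per column
```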

TODO: relate to representer theorems.

Figure 2: Han Lin SHANG’s slides, relating this to PCA in functional data

1 Covariance kernels of some example processes

1.1 A simple Markov chain

Consider a homogeneous continuous time Markov process taking values in \(\{0,1\}\). Suppose it has transition rate matrix

\[\left[\begin{array}{cc} -\lambda & \lambda\\ \lambda & -\lambda \end{array}\right] \] and moreover, that we start the chain from the stationary distribution, \([\frac 1 2\; \frac 1 2]^\top,\) which implies that \(\operatorname{Cov}(0, t)=\operatorname{Cov}(s, s+t)\) for all \(s\), and further, that \(\mathbb{E}[\mathsf{x}(t)]=\frac 1 2 \,\forall t\). So we know that \(\operatorname{Cov}(s,s+t)=\mathbb{E}[\mathsf{x}(0)\mathsf{x}(t)]- \frac 1 4.\) What is \(\mathbb{E}[\mathsf{x}(0)\mathsf{x}(t)]\)?

\[\begin{aligned} \mathbb{E}[\mathsf{x}(0)\mathsf{x}(t)] &=\mathbb{P}[\{\mathsf{x}(0)=1\}\cap\{\mathsf{x}(t)=1\}]\\ &=\mathbb{P}[\mathsf{x}(0)=1]\,\mathbb{P}[\text{number of jumps on $[0,t]$ is even}]\\ &=\tfrac 1 2\,\mathbb{P}[\mathsf{z}\text{ is even}]\text{ where } \mathsf{z}\sim\operatorname{Poisson}(\lambda t)\\ &=\tfrac 1 2\sum_{k=0}^{\infty}\frac{(\lambda t)^{2k} \exp(-\lambda t)}{(2k)!}\\ &= \tfrac 1 2\exp(-\lambda t) \sum_{k=0}^{\infty}\frac{(\lambda t)^{2k}}{(2k)!}\\ &= \tfrac 1 2\exp(-\lambda t)\cdot\frac{\exp(\lambda t) + \exp(-\lambda t)}{2} &&\text{Taylor expansion of $\cosh$}\\ &= \frac{1}{4} + \frac{\exp(-2\lambda t)}{4} \end{aligned}\]

From this we deduce that \(\operatorname{Cov}(s,s+t)=\frac{\exp(-2\lambda t)}{4}.\)
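
As a quick sanity check (my addition): for this symmetric chain the number of jumps on \([0,t]\) is \(\operatorname{Poisson}(\lambda t)\) whatever the current state, so we can simulate lagged pairs directly and compare the empirical covariance with \(\exp(-2\lambda t)/4\).

```python
import numpy as np

rng = np.random.default_rng(2)
lam, t = 1.5, 0.7
n = 500_000

x0 = rng.integers(0, 2, size=n)           # stationary start: state 1 with probability 1/2
n_jumps = rng.poisson(lam * t, size=n)    # jump count on [0, t] is Poisson(lam * t)
xt = x0 ^ (n_jumps % 2)                   # an odd number of jumps flips the state

emp_cov = np.mean(x0 * xt) - np.mean(x0) * np.mean(xt)
print(emp_cov, np.exp(-2 * lam * t) / 4)  # these should agree to a few decimal places
```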

Question: what functions are admissible as covariance kernels for Markov chains?

1.2 The Hawkes process

Covariance kernels are also important in various point processes. Notably, the Hawkes process was introduced in terms of its covariance. 🚧

1.3 Gaussian processes

Handling certain processes in terms of their covariances is particularly convenient, and Gaussian processes are the simplest case: they are completely specified by their mean function and covariance kernel, so working with the kernel is, in effect, working with the whole process.

2 General real covariance kernels

A function \(K:\mathcal{T}\times\mathcal{T}\to\mathbb{R}\) can be a covariance kernel if

  1. It is symmetric in its arguments \(K(s,t)=K(t,s)\) (more generally, conjugate symmetric — \(K(s,t)=K^*(t,s)\), but I think maybe my life will be simpler if I ignore the complex case for the moment.)
  2. It is positive semidefinite.

That positive semidefiniteness means that for arbitrary real numbers \(c_1,\dots,c_k\) and arbitrary indices \(t_1,\dots,t_k\)

\[ \sum_{i=1}^{k} \sum_{j=1}^{k} c_{i} c_{j} K(t_{i}, t_{j}) \geq 0 \]

The interpretation here is that the variance of any finite linear combination of process values must be non-negative, so it is necessary that

\[ \operatorname{Var}\left\{c_{1} X_{\mathbf{t}_{1}}+\cdots+c_{k} X_{\mathbf{t}_{k}}\right\}= \sum_{i=1}^{k} \sum_{j=1}^{k} c_{i} c_{j} K\left(\mathbf{t}_{i}, \mathbf{t}_{j}\right) \geq 0 \]

This is the same constraint that holds for the covariance matrix of any random vector \(\mathbf{X}\in \mathbb{R}^d\), which requires that for all \(\mathbf{b}\in \mathbb{R}^d\)

\[ \operatorname{Var} (\mathbf {b} ^{\top}\mathbf {X} ) =\mathbf {b} ^{\top}\operatorname{Var} (\mathbf {X} )\mathbf {b}\geq 0. \]

Question: What can we say about this covariance if every element of \(\mathbf X\) is non-negative?

Amazingly (to me), this necessary condition is also sufficient for a function to be a covariance kernel. In practice, designing covariance functions directly from the positive-definiteness condition is tricky, since the constraint is implicit and hard to verify for an arbitrary function. What we normally do is find a fun class that guarantees positive definiteness and riffle through that. Most of the rest of this notebook is devoted to such classes.
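
For instance (my illustration, not the post's): sums and products of positive semidefinite kernels are again positive semidefinite, so we can assemble new covariances from a known-good stock and spot-check the condition numerically on random inputs.

```python
import numpy as np

def se(s, t, ell=1.0):
    # Squared-exponential kernel: positive semidefinite.
    return np.exp(-0.5 * ((s[:, None] - t[None, :]) / ell) ** 2)

def periodic(s, t, period=2.0, ell=1.0):
    # Standard periodic kernel: also positive semidefinite.
    d = np.abs(s[:, None] - t[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / ell ** 2)

def combined(s, t):
    # Sums and products of PSD kernels are PSD, so this is a valid covariance kernel.
    return se(s, t, ell=3.0) * periodic(s, t) + 0.1 * se(s, t, ell=0.3)

rng = np.random.default_rng(3)
ts = rng.uniform(0.0, 10.0, size=300)
K = combined(ts, ts)

# Every Gram matrix of a PSD kernel has non-negative eigenvalues (up to float error).
print(np.linalg.eigvalsh(K).min())
```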

3 Bonus: complex covariance kernels

I talked in terms of real kernels above because I generally observe real-valued measurements of processes. But often complex covariances arise in a natural way too.

A function \(K:\mathcal{T}\times\mathcal{T}\to\mathbb{C}\) can be a covariance kernel if

  1. It is conjugate symmetric (Hermitian) in its arguments — \(K(s,t)=K^*(t,s)\),
  2. It is positive semidefinite.

That positive semidefiniteness means that for arbitrary complex numbers \(c_1,\dots,c_k\) and arbitrary indices \(t_1,\dots,t_k\)

\[ \sum_{i=1}^{k} \sum_{j=1}^{k} c_{i} \overline{c_{j}} K(t_{i}, t_{j}) \geq 0. \]
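
A concrete example (my addition): \(K(s,t)=\exp(i\omega(s-t))\) is Hermitian and positive semidefinite, since the quadratic form above collapses to \(\left|\sum_i c_i e^{i\omega t_i}\right|^2\). A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(4)
omega = 2.0
ts = rng.uniform(0.0, 5.0, size=100)
cs = rng.normal(size=100) + 1j * rng.normal(size=100)   # arbitrary complex coefficients

K = np.exp(1j * omega * (ts[:, None] - ts[None, :]))    # Hermitian: K equals K.conj().T

# The quadratic form sum_ij c_i * conj(c_j) * K_ij is real and non-negative,
# because it equals |sum_i c_i exp(i * omega * t_i)|^2.
quad = cs @ K @ cs.conj()
print(quad.real, np.abs(np.sum(cs * np.exp(1j * omega * ts))) ** 2)
```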

Every analytic real kernel should also be a complex kernel, right? 🏗

4 Kernel zoo

For some examples of covariance kernels, see the kernel zoo.

5 Learning kernels

See learning kernels.

6 Non-positive kernels

As in, kernels which are not positive semidefinite; see for example (Ong et al. 2004; Saha and Balamurugan 2020). 🏗

7 References

Abrahamsen. 1997. “A Review of Gaussian Random Fields and Correlation Functions.”
Agarwal, and Iii. 2011. “Generative Kernels for Exponential Families.” In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics.
Alexanderian. 2015. “A Brief Note on the Karhunen-Loève Expansion.” arXiv:1509.07526 [Math].
Álvarez, Rosasco, and Lawrence. 2012. “Kernels for Vector-Valued Functions: A Review.” Foundations and Trends® in Machine Learning.
Aronszajn. 1950. “Theory of Reproducing Kernels.” Transactions of the American Mathematical Society.
Bach. 2008. “Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning.” In Proceedings of the 21st International Conference on Neural Information Processing Systems. NIPS’08.
Bacry, and Muzy. 2016. “First- and Second-Order Statistics Characterization of Hawkes Processes and Non-Parametric Estimation.” IEEE Transactions on Information Theory.
Balog, Lakshminarayanan, Ghahramani, et al. 2016. “The Mondrian Kernel.” arXiv:1606.05241 [Stat].
Bochner. 1959. Lectures on Fourier Integrals.
Bohn, Griebel, and Rieger. 2018. “A Representer Theorem for Deep Kernel Learning.” arXiv:1709.10441 [Cs, Math].
Bonilla, Krauth, and Dezfouli. 2019. “Generic Inference in Latent Gaussian Process Models.” Journal of Machine Learning Research.
Christoudias, Urtasun, and Darrell. 2009. “Bayesian Localized Multiple Kernel Learning.” UCB/EECS-2009-96.
Cortes, Haffner, and Mohri. 2004. “Rational Kernels: Theory and Algorithms.” Journal of Machine Learning Research.
Cressie, and Huang. 1999. “Classes of Nonseparable, Spatio-Temporal Stationary Covariance Functions.” Journal of the American Statistical Association.
Duvenaud, David. 2014. “Automatic Model Construction with Gaussian Processes.”
Duvenaud, David, Lloyd, Grosse, et al. 2013. “Structure Discovery in Nonparametric Regression Through Compositional Kernel Search.” In Proceedings of the 30th International Conference on Machine Learning (ICML-13).
Duvenaud, David K., Nickisch, and Rasmussen. 2011. “Additive Gaussian Processes.” In Advances in Neural Information Processing Systems.
Gaspari, and Cohn. 1999. “Construction of Correlation Functions in Two and Three Dimensions.” Quarterly Journal of the Royal Meteorological Society.
Genton. 2001. “Classes of Kernels for Machine Learning: A Statistics Perspective.” Journal of Machine Learning Research.
Genton, and Perrin. 2004. “On a Time Deformation Reducing Nonstationary Stochastic Processes to Local Stationarity.” Journal of Applied Probability.
Girolami, and Rogers. 2005. “Hierarchic Bayesian Models for Kernel Learning.” In Proceedings of the 22nd International Conference on Machine Learning - ICML ’05.
Gneiting. 2002a. “Nonseparable, Stationary Covariance Functions for Space–Time Data.” Journal of the American Statistical Association.
———. 2002b. “Compactly Supported Correlation Functions.” Journal of Multivariate Analysis.
Gneiting, Kleiber, and Schlather. 2010. “Matérn Cross-Covariance Functions for Multivariate Random Fields.” Journal of the American Statistical Association.
Grosse, Salakhutdinov, Freeman, et al. 2012. “Exploiting Compositionality to Explore a Large Space of Model Structures.” In Proceedings of the Conference on Uncertainty in Artificial Intelligence.
Hartikainen, and Särkkä. 2010. “Kalman Filtering and Smoothing Solutions to Temporal Gaussian Process Regression Models.” In 2010 IEEE International Workshop on Machine Learning for Signal Processing.
Hawkes. 1971. “Spectra of Some Self-Exciting and Mutually Exciting Point Processes.” Biometrika.
Hinton, and Salakhutdinov. 2008. “Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes.” In Advances in Neural Information Processing Systems 20.
Hofmann, Schölkopf, and Smola. 2008. “Kernel Methods in Machine Learning.” The Annals of Statistics.
Kom Samo, and Roberts. 2015. “Generalized Spectral Kernels.” arXiv:1506.02236 [Stat].
Kondor, and Jebara. 2006. “Gaussian and Wishart Hyperkernels.” In Proceedings of the 19th International Conference on Neural Information Processing Systems. NIPS’06.
Krauth, Bonilla, Cutajar, et al. 2016. “AutoGP: Exploring the Capabilities and Limitations of Gaussian Process Models.” In UAI17.
Lawrence. 2005. “Probabilistic Non-Linear Principal Component Analysis with Gaussian Process Latent Variable Models.” Journal of Machine Learning Research.
Lloyd, Duvenaud, Grosse, et al. 2014. “Automatic Construction and Natural-Language Description of Nonparametric Regression Models.” In Twenty-Eighth AAAI Conference on Artificial Intelligence.
Mercer. 1909. “Functions of Positive and Negative Type, and Their Connection with the Theory of Integral Equations.” Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character.
Micchelli, and Pontil. 2005a. “Learning the Kernel Function via Regularization.” Journal of Machine Learning Research.
———. 2005b. “On Learning Vector-Valued Functions.” Neural Computation.
Minasny, and McBratney. 2005. “The Matérn Function as a General Model for Soil Variograms.” Geoderma, Pedometrics 2003.
Murphy. 2012. Machine Learning: A Probabilistic Perspective. Adaptive Computation and Machine Learning series.
Murray-Smith, and Pearlmutter. 2005. “Transformations of Gaussian Process Priors.” In Deterministic and Statistical Methods in Machine Learning. Lecture Notes in Computer Science.
O’Callaghan, and Ramos. 2011. “Continuous Occupancy Mapping with Integral Kernels.” In Twenty-Fifth AAAI Conference on Artificial Intelligence.
Ong, Mary, Canu, et al. 2004. “Learning with Non-Positive Kernels.” In Twenty-First International Conference on Machine Learning - ICML ’04.
Ong, and Smola. 2003. “Machine Learning Using Hyperkernels.” In Proceedings of the Twentieth International Conference on International Conference on Machine Learning. ICML’03.
Ong, Smola, and Williamson. 2002. “Hyperkernels.” In Proceedings of the 15th International Conference on Neural Information Processing Systems. NIPS’02.
———. 2005. “Learning the Kernel with Hyperkernels.” Journal of Machine Learning Research.
Pérez-Abreu, and Rocha-Arteaga. 2005. “Covariance-Parameter Lévy Processes in the Space of Trace-Class Operators.” Infinite Dimensional Analysis, Quantum Probability and Related Topics.
Perrin, and Senoussi. 1999. “Reducing Non-Stationary Stochastic Processes to Stationarity by a Time Deformation.” Statistics & Probability Letters.
———. 2000. “Reducing Non-Stationary Random Fields to Stationarity and Isotropy Using a Space Deformation.” Statistics & Probability Letters.
Pfaffel. 2012. “Wishart Processes.” arXiv:1201.3256 [Math].
Rakotomamonjy, Bach, Canu, et al. 2008. “SimpleMKL.” Journal of Machine Learning Research.
Rasmussen, and Williams. 2006. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning.
Remes, Heinonen, and Kaski. 2017. “Non-Stationary Spectral Kernels.” In Advances in Neural Information Processing Systems 30.
———. 2018. “Neural Non-Stationary Spectral Kernel.” arXiv:1811.10978 [Cs, Stat].
Saha, and Balamurugan. 2020. “Learning with Operator-Valued Kernels in Reproducing Kernel Krein Spaces.” In Advances in Neural Information Processing Systems.
Sampson, and Guttorp. 1992. “Nonparametric Estimation of Nonstationary Spatial Covariance Structure.” Journal of the American Statistical Association.
Särkkä, and Hartikainen. 2012. “Infinite-Dimensional Kalman Filtering Approach to Spatio-Temporal Gaussian Process Regression.” In Artificial Intelligence and Statistics.
Särkkä, Solin, and Hartikainen. 2013. “Spatiotemporal Learning via Infinite-Dimensional Bayesian Filtering and Smoothing: A Look at Gaussian Process Regression Through Kalman Filtering.” IEEE Signal Processing Magazine.
Schölkopf, Herbrich, and Smola. 2001. “A Generalized Representer Theorem.” In Computational Learning Theory. Lecture Notes in Computer Science.
Schölkopf, and Smola. 2002. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond.
———. 2003. “A Short Introduction to Learning with Kernels.” In Advanced Lectures on Machine Learning. Lecture Notes in Computer Science 2600.
Sinha, and Duchi. 2016. “Learning Kernels with Random Features.” In Advances in Neural Information Processing Systems 29.
Smith, Alvarez, and Lawrence. 2018. “Gaussian Process Regression for Binned Data.” arXiv:1809.02010 [Cs, Stat].
Stein. 2005. “Space-Time Covariance Functions.” Journal of the American Statistical Association.
Sun, Zhang, Wang, et al. 2018. “Differentiable Compositional Kernel Learning for Gaussian Processes.” arXiv Preprint arXiv:1806.04326.
Székely, and Rizzo. 2009. “Brownian Distance Covariance.” The Annals of Applied Statistics.
Uziel. 2020. “Nonparametric Sequential Prediction While Deep Learning the Kernel.” In International Conference on Artificial Intelligence and Statistics.
van Delft, and Eichler. 2016. “Locally Stationary Functional Time Series.” arXiv:1602.05125 [Math, Stat].
van der Wilk, Wilson, and Rasmussen. 2014. “Variational Inference for Latent Variable Modelling of Correlation Structure.” In NIPS 2014 Workshop on Advances in Variational Inference.
Vedaldi, and Zisserman. 2012. “Efficient Additive Kernels via Explicit Feature Maps.” IEEE Transactions on Pattern Analysis and Machine Intelligence.
Vert, Tsuda, and Schölkopf. 2004. “A Primer on Kernel Methods.” In Kernel Methods in Computational Biology.
Vishwanathan, Schraudolph, Kondor, et al. 2010. “Graph Kernels.” Journal of Machine Learning Research.
Wilson, and Adams. 2013. “Gaussian Process Kernels for Pattern Discovery and Extrapolation.” In International Conference on Machine Learning.
Wilson, Dann, Lucas, et al. 2015. “The Human Kernel.” arXiv:1510.07389 [Cs, Stat].
Wilson, and Ghahramani. 2011. “Generalised Wishart Processes.” In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence. UAI’11.
———. 2012. “Modelling Input Varying Correlations Between Multiple Responses.” In Machine Learning and Knowledge Discovery in Databases. Lecture Notes in Computer Science.
Wilson, Hu, Salakhutdinov, et al. 2016. “Deep Kernel Learning.” In Artificial Intelligence and Statistics.
Wu. 1995. “Compactly Supported Positive Definite Radial Functions.” Advances in Computational Mathematics.
Yaglom. 1961. “Second-Order Homogeneous Random Fields.” Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Contributions to Probability Theory.
———. 1987. Correlation Theory of Stationary and Related Random Functions. Volume II: Supplementary Notes and References. Springer Series in Statistics.
Yu, Cheng, Schuurmans, et al. 2013. “Characterizing the Representer Theorem.” In Proceedings of the 30th International Conference on Machine Learning (ICML-13).
Zhang, and Paisley. 2019. “Random Function Priors for Correlation Modeling.” In International Conference on Machine Learning.