The Living Thing / Notebooks :

(Reproducing) kernel tricks

Kernel in the sense of the (Mercer-)“kernel trick”, which augments covariance estiamtion with clever distance measures. Not to be confused with density-estimation-type convolution kernels, nor the dozens of other clashing definitions of kernel; they can have their respective own pages.

Kernel machine use not-necessarily-Euclidean “reproducing” kernels, aka Mercer kernels to implicitly define convenient Hilbert “feature” spaces for your purpose. Alternatively you might like to make your Hilbert space basis explicit by doing a basis transform, or by taking your implicit feature map and approximating it but you don’t need to. In fact, the implicit space induced by your reproducing kernels might (in general will) look odd indeed, something with no finite dimensional representation. That’s the “trick” part.

TODO: clear explanation, less blather. Until then, see ScSm02, which is a very well-written textbook covering an unbelievable amount of ground without pausing to make a fuss fuss, or MaAm14, which is more narrowly focussed on just the Mercer-kernel part, or ChLi09 from an approximation-theory perspective.

Spoiler: you upgrade your old boring linear algebra on finite (usually low-) dimensional spaces to sexy new curvy algebra on potentially-infinite-dimensional manifolds. Or, if you’d like, to apply statistical learning methods based on things with an obvious finite vector space representation (\(\mathbb{R}^n\)) to things without one (Sentences, piano-rolls, \(\mathcal{C}^d_\ell\)).

The other use that I’ve seen of the kernel trick is to show something clever about your learning method in terms of fancy kernel representations through formal mathematics. This one is nice to keep in mind, because practically, kernel methods have serious problems with scalability - problems that even afflict me with a mere \(N\simeq 10^5\) data points, since the Gram matrix of inner products might not admit of any accurate representation in less than the \(N(N-1)/2\) dimensions.

TODO: discuss covariance matrices, positive definite matrices, mention convenient nexus of SVMs and learning theory etc.

I’m especially interested in

  1. Nonparametric kernel independence tests
  2. efficient kernel pre-image approximation.
  3. connection between kernel PCA and clustering (SKSB98 and Will01)
  4. kernel regression with rbfs
  5. kernel layers in neural networks

Automating kernel design has some weird hacks. See the Automated statistician project by David Duvenaud, James Robert Lloyd, Roger Grosse and colleagues. For traditionalists, one of the co-authors has written a page on doing kernel design by hand. See the GSFT12 for a mind-melting compositional kernel diagram.

Grosse, Salakhutdinov,Freeman and Joshua B. Tenenbaum, Exploiting compositionality to explore a large space of model structures

From GSFT12: “Examples of existing machine learning models which fall under our framework. Arrows represent models reachable using a single production rule. Only a small fraction of the 2496 models reachable within 3 steps are shown, and not all possible arrows are shown.”

Alex Smola (who with, Bernhard Schölkopf) has his name on a terrifying proportion of publications in this area, also has all his publications online.

The oft-cited grandfather of all the reproducing kernel stuff is Aronszajn’s 1950 work (Aron50), although this didn’t percolate into machine-learning for decades.

Kernel approximation

See kernel approximation.

RKHS distribution embedding

See that page.


Alaoui, A. E., & Mahoney, M. W.(2014) Fast Randomized Kernel Methods With Statistical Guarantees. arXiv:1411.0306 [Cs, Stat].
Altun, Y., Smola, A. J., & Hofmann, T. (2004) Exponential Families for Conditional Random Fields. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (pp. 2–9). Arlington, Virginia, United States: AUAI Press
Aronszajn, N. (1950) Theory of Reproducing Kernels. Transactions of the American Mathematical Society, 68(3), 337–404. DOI.
Bach, F. (2015) On the Equivalence between Kernel Quadrature Rules and Random Feature Expansions. arXiv Preprint arXiv:1502.06800.
Bach, F. (n.d.) Exploring large feature spaces with hierarchical multiple kernel learning. In In Advances in Neural Information Processing Systems (NIPS (p. 2008).
Bach, F. R.(2013) Sharp analysis of low-rank kernel matrix approximations. In COLT (Vol. 30, pp. 185–209).
Bakır, G. H., Zien, A., & Tsuda, K. (2004) Learning to Find Graph Pre-images. In C. E. Rasmussen, H. H. Bülthoff, B. Schölkopf, & M. A. Giese (Eds.), Pattern Recognition (pp. 253–261). Springer Berlin Heidelberg
Balle Pigem, B. de. (2013, July 12) Learning finite-state machines: statistical and algorithmic aspects (info:eu-repo/semantics/doctoralThesis). . Universitat Politècnica de Catalunya. Departament de Llenguatges i Sistemes Informàtics, Barcelona
Balog, M., Lakshminarayanan, B., Ghahramani, Z., Roy, D. M., & Teh, Y. W.(2016) The Mondrian Kernel. arXiv:1606.05241 [Stat].
Baxter, B., & Roussos, G. (2002) A New Error Estimate of the Fast Gauss Transform. SIAM Journal on Scientific Computing, 24(1), 257–259. DOI.
Ben-Hur, A., Ong, C. S., Sonnenburg, S., Schölkopf, B., & Rätsch, G. (2008) Support Vector Machines and Kernels for Computational Biology. PLoS Comput Biol, 4(10), e1000173. DOI.
Burges, C. J. C.(1998) Geometry and Invariance in Kernel Based Methods. In B. Schölkopf, C. J. Burges, & A. J. Smola (Eds.), Advances in Kernel Methods - Support Vector Learning. Cambridge, MA: MIT Press
Carrasco, R. C., Oncina, J., & Calera-Rubio, J. (2001) Stochastic Inference of Regular Tree Languages. Machine Learning, 44(1–2), 185–197. DOI.
Chatfield, K., Lempitsky, V., Vedaldi, A., & Zisserman, A. (2011) The devil is in the details: an evaluation of recent feature encoding methods.
Cheney, E. W., & Light, W. A.(2009) A Course in Approximation Theory. . American Mathematical Soc.
Choromanski, K., & Sindhwani, V. (2016) Recycling Randomness with Structure for Sublinear time Kernel Expansions. arXiv:1605.09049 [Cs, Stat].
Clark, A., Florêncio, C. C., & Watkins, C. (2006) Languages as Hyperplanes: Grammatical Inference with String Kernels. In J. Fürnkranz, T. Scheffer, & M. Spiliopoulou (Eds.), Machine Learning: ECML 2006 (pp. 90–101). Springer Berlin Heidelberg
Clark, A., Florêncio, C. C., Watkins, C., & Serayet, M. (2006) Planar Languages and Learnability. In Y. Sakakibara, S. Kobayashi, K. Sato, T. Nishino, & E. Tomita (Eds.), Grammatical Inference: Algorithms and Applications (pp. 148–160). Springer Berlin Heidelberg
Clark, A., & Watkins, C. (2008) Some Alternatives to Parikh Matrices Using String Kernels. Fundamenta Informaticae, 84(3), 291–303.
Collins, M., & Duffy, N. (2002) Convolution Kernels for Natural Language. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14 (pp. 625–632). MIT Press
Cortes, C., Haffner, P., & Mohri, M. (2004) Rational Kernels: Theory and Algorithms. J. Mach. Learn. Res., 5, 1035–1062.
Cucker, F., & Smale, S. (2002) On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1), 1–49. DOI.
Cunningham, J. P., Shenoy, K. V., & Sahani, M. (2008) Fast Gaussian process methods for point process intensity estimation. (pp. 192–199). ACM Press DOI.
Danafar, S., Fukumizu, K., & Gomez, F. (2014) Kernel-based Information Criterion. arXiv:1408.5810 [Stat].
Drineas, P., & Mahoney, M. W.(2005) On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6, 2153–2175.
Duvenaud, D., Lloyd, J. R., Grosse, R., Tenenbaum, J. B., & Ghahramani, Z. (2013) Structure Discovery in Nonparametric Regression through Compositional Kernel Search. arXiv:1302.4922 [Cs, Stat].
Elgammal, A., Duraiswami, R., & Davis, L. S.(2003) Efficient kernel density estimation using the fast gauss transform with applications to color modeling and tracking. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 25(11), 1499–1504. DOI.
FitzGerald, D., Liukus, A., Rafii, Z., Pardo, B., & Daudet, L. (2013) Harmonic/percussive separation using kernel additive modelling. In Irish Signals & Systems Conference 2014 and 2014 China-Ireland International Conference on Information and Communications Technologies (ISSC 2014/CIICT 2014). 25th IET (pp. 35–40). IET
Florêncio, C. C.(2007) Unsupervised learning of graph languages using kernels.
Genton, M. G.(2002) Classes of Kernels for Machine Learning: A Statistics Perspective. J. Mach. Learn. Res., 2, 299–312.
Gianola, D., & Kaam, J. B. C. H. M. van. (2008) Reproducing Kernel Hilbert Spaces Regression Methods for Genomic Assisted Prediction of Quantitative Traits. Genetics, 178(4), 2289–2303. DOI.
Globerson, A., & Livni, R. (2016) Learning Infinite-Layer Networks: Beyond the Kernel Trick. arXiv:1606.05316 [Cs].
Grauman, K., & Darrell, T. (2005) The pyramid match kernel: discriminative classification with sets of image features. In Tenth IEEE International Conference on Computer Vision, 2005. ICCV 2005 (Vol. 2, p. 1458–1465 Vol. 2). DOI.
Greengard, L., & Strain, J. (1991) The Fast Gauss Transform. SIAM Journal on Scientific and Statistical Computing, 12(1), 79–94. DOI.
Gretton, A., Borgwardt, K., Fukumizu, K., Schölkopf, B., & Smola, E. (n.d.) Kernel Algorithms Distribution Embeddings in Reproducing Kernel Hilbert Spaces.
Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Schölkopf, B., & Smola, A. J.(2008) A Kernel Statistical Test of Independence. In Advances in Neural Information Processing Systems 20: Proceedings of the 2007 Conference. Cambridge, MA: MIT Press
Grosse, R., Salakhutdinov, R. R., Freeman, W. T., & Tenenbaum, J. B.(2012) Exploiting compositionality to explore a large space of model structures. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.
Haussler, D. (1999) Convolution kernels on discrete structures. . Technical report, UC Santa Cruz
Heinonen, M., & d’Alché-Buc, F. (2014) Learning nonparametric differential equations with operator-valued kernels and gradient matching. arXiv:1411.5172 [Cs, Stat].
Hofmann, T., Schölkopf, B., & Smola, A. J.(2008) Kernel methods in machine learning. The Annals of Statistics, 36(3), 1171–1220. DOI.
Kanagawa, M., & Fukumizu, K. (n.d.) Recovering Distributions from Gaussian RKHS Embeddings.
Keriven, N., Bourrier, A., Gribonval, R., & Pérez, P. (2016) Sketching for Large-Scale Learning of Mixture Models. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6190–6194). DOI.
Kimeldorf, G. S., & Wahba, G. (1970) A Correspondence Between Bayesian Estimation on Stochastic Processes and Smoothing by Splines. The Annals of Mathematical Statistics, 41(2), 495–502. DOI.
Kloft, M., Rückert, U., & Bartlett, P. L.(2010) A Unifying View of Multiple Kernel Learning. In J. L. Balcázar, F. Bonchi, A. Gionis, & M. Sebag (Eds.), Machine Learning and Knowledge Discovery in Databases (pp. 66–81). Springer Berlin Heidelberg DOI.
Kontorovich, L. (Aryeh), Cortes, C., & Mohri, M. (2008) Kernel methods for learning languages. Theoretical Computer Science, 405(3), 223–236. DOI.
Kontorovich, L., Cortes, C., & Mohri, M. (2006) Learning Linearly Separable Languages. In J. L. Balcázar, P. M. Long, & F. Stephan (Eds.), Algorithmic Learning Theory (pp. 288–303). Springer Berlin Heidelberg
Lawrence, N., Seeger, M., & Herbrich, R. (2003) Fast sparse Gaussian process methods: The informative vector machine. In Proceedings of the 16th Annual Conference on Neural Information Processing Systems (pp. 609–616).
Li, F., Ionescu, C., & Sminchisescu, C. (2010) Random Fourier Approximations for Skewed Multiplicative Histogram Kernels. In M. Goesele, S. Roth, A. Kuijper, B. Schiele, & K. Schindler (Eds.), Pattern Recognition (pp. 262–271). Springer Berlin Heidelberg
Liutkus, A., Rafii, Z., Pardo, B., Fitzgerald, D., & Daudet, L. (2014) Kernel spectrogram models for source separation. (pp. 6–10). IEEE DOI.
Lloyd, J. R., Duvenaud, D., Grosse, R., Tenenbaum, J. B., & Ghahramani, Z. (2014) Automatic Construction and Natural-Language Description of Nonparametric Regression Models. arXiv:1402.4304 [Cs, Stat].
Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., & Watkins, C. (2002) Text Classification Using String Kernels. J. Mach. Learn. Res., 2, 419–444. DOI.
Lopez-Paz, D., Nishihara, R., Chintala, S., Schölkopf, B., & Bottou, L. (2016) Discovering Causal Signals in Images. arXiv:1605.08179 [Cs, Stat].
Lu, Z., Leen, T. K., Huang, Y., & Erdogmus, D. (2008) A Reproducing Kernel Hilbert Space Framework for Pairwise Time Series Distances. In Proceedings of the 25th International Conference on Machine Learning (pp. 624–631). New York, NY, USA: ACM DOI.
Manton, J. H., & Amblard, P.-O. (2014) A Primer on Reproducing Kernel Hilbert Spaces. arXiv:1408.0952 [Math].
McFee, B., & Ellis, D. P.(2011) Analyzing song structure with spectral clustering. In IEEE conference on Computer Vision and Pattern Recognition (CVPR).
Muandet, K., Fukumizu, K., Sriperumbudur, B., Gretton, A., & Schölkopf, B. (2014) Kernel Mean Shrinkage Estimators. arXiv:1405.5505 [Cs, Stat].
Muandet, K., Fukumizu, K., Sriperumbudur, B., & Schölkopf, B. (2016) Kernel Mean Embedding of Distributions: A Review and Beyonds. arXiv:1605.09522 [Cs, Stat].
Muller, K., Mika, S., Ratsch, G., Tsuda, K., & Scholkopf, B. (2001) An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2), 181–201. DOI.
Rahimi, A., & Recht, B. (2007) Random features for large-scale kernel machines. In Advances in neural information processing systems (pp. 1177–1184). Curran Associates, Inc.
Rahimi, A., & Recht, B. (2009) Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning. In Advances in neural information processing systems (pp. 1313–1320). Curran Associates, Inc.
Ramdas, A., & Wehbe, L. (2014) Stein Shrinkage for Cross-Covariance Operators and Kernel Independence Testing. arXiv:1406.1922 [Stat].
Raykar, V. C., & Duraiswami, R. (n.d.) The improved fast Gauss transform with applications to machine learning.
RAYKAR, V. C., YANG, C., & DURAISWAMI, R. (2005) Improved fast Gauss transform User manual.
Schölkopf, B., Herbrich, R., & Smola, A. J.(2001) A Generalized Representer Theorem. In D. Helmbold & B. Williamson (Eds.), Computational Learning Theory (pp. 416–426). Springer Berlin Heidelberg
Schölkopf, B., Knirsch, P., Smola, A., & Burges, C. (1998) Fast Approximation of Support Vector Kernel Expansions, and an Interpretation of Clustering as Approximation in Feature Spaces. In P. Levi, M. Schanz, R.-J. Ahlers, & F. May (Eds.), Mustererkennung 1998 (pp. 125–132). Springer Berlin Heidelberg
Schölkopf, B., Mika, S., Burges, C. J. C., Knirsch, P., Müller, K.-R., Rätsch, G., & Smola, A. J.(1999) Input Space Versus Feature Space in Kernel-Based Methods. IEEE TRANSACTIONS ON NEURAL NETWORKS, 10, 1000–1017.
Schölkopf, B., Muandet, K., Fukumizu, K., & Peters, J. (2015) Computing Functions of Random Variables via Reproducing Kernel Hilbert Space Representations. arXiv:1501.06794 [Cs, Stat].
Schölkopf, B., & Smola, A. J.(2002) Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. . MIT Press
Schölkopf, B., & Smola, A. J.(2003) A Short Introduction to Learning with Kernels. In S. Mendelson & A. J. Smola (Eds.), Advanced Lectures on Machine Learning (pp. 41–64). Springer Berlin Heidelberg
Schölkopf, B., Smola, A., & Müller, K.-R. (1997) Kernel principal component analysis. In W. Gerstner, A. Germond, M. Hasler, & J.-D. Nicoud (Eds.), Artificial Neural Networks — ICANN’97 (pp. 583–588). Springer Berlin Heidelberg DOI.
Smola, A. J., & Schölkopf, B. (1998) On a Kernel-Based Method for Pattern Recognition, Regression, Approximation, and Operator Inversion. Algorithmica, 22(1–2), 211–231. DOI.
Smola, A. J., & Schölkopf, B. (2000) Sparse greedy matrix approximation for machine learning.
Smola, A. J., & Schölkopf, B. (2004) A tutorial on support vector regression. Statistics and Computing, 14(3), 199–222. DOI.
Smola, A. J., Schölkopf, B., & Müller, K.-R. (1998) The connection between regularization operators and support vector kernels. Neural Networks, 11(4), 637–649. DOI.
Snelson, E., & Ghahramani, Z. (2005) Sparse Gaussian processes using pseudo-inputs. In Advances in neural information processing systems (pp. 1257–1264).
Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Lanckriet, G., & Schölkopf, B. (2008) Injective Hilbert Space Embeddings of Probability Measures. In Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008).
Székely, G. J., & Rizzo, M. L.(2009) Brownian distance covariance. The Annals of Applied Statistics, 3(4), 1236–1265. DOI.
Székely, G. J., Rizzo, M. L., & Bakirov, N. K.(2007) Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6), 2769–2794. DOI.
Tipping, M. E., & Nh, C. C.(2001) Sparse Kernel Principal Component Analysis.
Vedaldi, A., & Zisserman, A. (2010) Efficient additive kernels via explicit feature maps. In 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Vol. 34, pp. 3539–3546). DOI.
Vedaldi, A., & Zisserman, A. (2012) Efficient Additive Kernels via Explicit Feature Maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3), 480–492. DOI.
Vempati, S., Vedaldi, A., Zisserman, A., & Jawahar, C. (2010) Generalized RBF feature maps for efficient detection. In BMVC (pp. 1–11).
Vert, J.-P., Tsuda, K., & Schölkopf, B. (2004) A primer on kernel methods. In Kernel Methods in Computational Biology. MIT Press
Vishwanathan, S. V. N., Schraudolph, N. N., Kondor, R., & Borgwardt, K. M.(2010) Graph Kernels. J. Mach. Learn. Res., 11, 1201–1242.
Walder, C., Kim, K. I., & Schölkopf, B. (2008) Sparse Multiscale Gaussian Process Regression. In Proceedings of the 25th International Conference on Machine Learning (pp. 1112–1119). New York, NY, USA: ACM DOI.
Walder, C., Schölkopf, B., & Chapelle, O. (2006) Implicit Surface Modelling with a Globally Regularised Basis of Compact Support. Computer Graphics Forum, 25(3), 635–644. DOI.
Wang, Y.-X., Smola, A., & Tibshirani, R. J.(2014) The Falling Factorial Basis and Its Statistical Applications. arXiv:1405.0558 [Stat].
Williams, C. K. I.(2001) On a Connection between Kernel PCA and Metric Multidimensional Scaling. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in Neural Information Processing Systems 13 (Vol. 46, pp. 675–681). MIT Press DOI.
Wilson, A. G., & Adams, R. P.(2013) Gaussian Process Kernels for Pattern Discovery and Extrapolation. arXiv:1302.4245 [Cs, Stat].
Wilson, A. G., Dann, C., Lucas, C. G., & Xing, E. P.(2015) The Human Kernel. arXiv:1510.07389 [Cs, Stat].
Wu, Q., & Zhou, D.-X. (2008) Learning with sample dependent hypothesis spaces. Computers & Mathematics with Applications, 56(11), 2896–2907. DOI.
Xu, J.-W., Paiva, A. R. C., Park, I., & Principe, J. C.(2008) A Reproducing Kernel Hilbert Space Framework for Information-Theoretic Learning. IEEE Transactions on Signal Processing, 56(12), 5891–5902. DOI.
Yang, C., Duraiswami, R., & Davis, L. S.(2004) Efficient kernel machines using the improved fast Gauss transform. In Advances in neural information processing systems (pp. 1561–1568).
Yang, C., Duraiswami, R., Gumerov, N. A., & Davis, L. (2003) Improved Fast Gauss Transform and Efficient Kernel Density Estimation. In Proceedings of the Ninth IEEE International Conference on Computer Vision - Volume 2 (p. 464–). Washington, DC, USA: IEEE Computer Society DOI.
Yang, J., Sindhwani, V., Avron, H., & Mahoney, M. (2014) Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels. arXiv:1412.8293 [Cs, Math, Stat].
Yang, T., Li, Y.-F., Mahdavi, M., Jin, R., & Zhou, Z.-H. (2012) Nyström method vs random fourier features: A theoretical and empirical comparison. In Advances in neural information processing systems (pp. 476–484).
Zhang, K., Peters, J., Janzing, D., & Schölkopf, B. (2012) Kernel-based Conditional Independence Test and Application in Causal Discovery. arXiv:1202.3775 [Cs, Stat].
Zhang, Q., Filippi, S., Gretton, A., & Sejdinovic, D. (2016) Large-Scale Kernel Methods for Independence Testing. arXiv:1606.07892 [Stat].