
(Reproducing) kernel tricks

Kernel in the sense of the (Mercer-)“kernel trick”, which augments covariance estimation with clever distance measures. Not to be confused with density-estimation-type convolution kernels, nor the dozens of other clashing definitions of kernel; those get their own pages.

Kernel machines use not-necessarily-Euclidean “reproducing” kernels, a.k.a. Mercer kernels, to implicitly define convenient Hilbert “feature” spaces for your purpose. Alternatively, you might like to make your Hilbert-space basis explicit by doing a basis transform, or by approximating your implicit feature map, but you don’t need to. In fact, the implicit space induced by your reproducing kernel might (in general will) look odd indeed, with no finite-dimensional representation. That’s the “trick” part.
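
To make that concrete, here is a minimal numpy sketch (mine, not from any of the references below): the homogeneous polynomial kernel \(k(x,z)=(x^\top z)^2\) on \(\mathbb{R}^2\) computes exactly the inner product of an explicit degree-2 feature map, without ever constructing that map.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for k(x, z) = (x . z)**2 on R^2."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, z):
    """The same inner product, computed implicitly in input space."""
    return np.dot(x, z) ** 2

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
assert np.allclose(phi(x) @ phi(z), k(x, z))  # identical, up to float error
```

For richer kernels (the Gaussian RBF, say) the implicit feature space is infinite-dimensional, so no analogue of `phi` can be written down at all; only the `k` side survives.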

TODO: clear explanation, less blather. Until then, see ScSm02, which is a very well-written textbook covering an unbelievable amount of ground without pausing to make a fuss, or MaAm14, which is more narrowly focussed on just the Mercer-kernel part, or ChLi09 from an approximation-theory perspective.

Spoiler: you upgrade your old boring linear algebra on finite (usually low-) dimensional spaces to sexy new curvy algebra on potentially-infinite-dimensional manifolds. Or, if you’d like, you apply statistical learning methods designed for things with an obvious finite vector space representation (\(\mathbb{R}^n\)) to things without one (sentences, piano-rolls, \(\mathcal{C}^d_\ell\)).
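
For instance, a toy string kernel (a hedged sketch in the spirit of the classic “spectrum” kernel; not drawn from the refs below): sentences get compared via the inner product of their character n-gram counts, so linear methods apply without ever fixing a vector representation up front.

```python
from collections import Counter

def spectrum_kernel(s, t, n=3):
    """Inner product of the character n-gram count vectors of s and t.

    A valid Mercer kernel, since it is literally an inner product in the
    (implicit, huge) space of all possible n-gram counts.
    """
    cs = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    ct = Counter(t[i:i + n] for i in range(len(t) - n + 1))
    return sum(cs[g] * ct[g] for g in cs.keys() & ct.keys())

print(spectrum_kernel("the kernel trick", "kernel machines"))
```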

The other use I’ve seen of the kernel trick is theoretical: showing something clever about your learning method in terms of fancy kernel representations through formal mathematics. That one is nice to keep in mind because, practically, kernel methods have serious problems with scalability - problems that afflict me even with a mere \(N\simeq 10^5\) data points, since the Gram matrix of inner products might not admit any accurate representation in fewer than its \(N(N-1)/2\) distinct off-diagonal entries.
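
Some back-of-envelope arithmetic on that (my numbers; dense float64 storage assumed):

```python
N = 10 ** 5
b = 8                                     # bytes per float64 entry
dense = N * N * b                         # full N x N Gram matrix
tri = N * (N - 1) // 2 * b                # distinct off-diagonal entries only
print(f"dense: {dense / 2**30:.0f} GiB, triangular: {tri / 2**30:.0f} GiB")
# -> dense: 75 GiB, triangular: 37 GiB; and exact solves are O(N^3) on top.
```

Hence the industry of low-rank and random-feature approximations in the refs (Nyström methods, random kitchen sinks, etc.).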

TODO: discuss covariance matrices, positive definite matrices, mention convenient nexus of SVMs and learning theory etc.

I’m especially interested in

  1. nonparametric kernel independence tests
  2. efficient kernel pre-image approximation
  3. the connection between kernel PCA and clustering (SKSB98 and Will01)
  4. kernel regression with RBFs (see the sketch after this list)
  5. kernel layers in neural networks
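
Apropos item 4, a minimal kernel ridge regression with an RBF (Gaussian) kernel, assuming nothing beyond numpy; bandwidth and ridge penalty here are illustrative, not tuned:

```python
import numpy as np

def rbf_gram(A, B, bandwidth=0.5):
    """Gram matrix K[i, j] = exp(-||A[i] - B[j]||^2 / (2 * bandwidth^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * bandwidth ** 2))

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# Representer theorem: the solution lives in the span of the training
# kernels, so fitting reduces to one N x N linear solve.
lam = 1e-3
K = rbf_gram(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

X_test = np.linspace(-3, 3, 5)[:, None]
y_hat = rbf_gram(X_test, X) @ alpha
print(y_hat)
```

Note the `np.linalg.solve` against a dense \(N \times N\) matrix: this is exactly the scalability wall complained about above.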

Automating kernel design has some weird hacks. See the Automated Statistician project by David Duvenaud, James Robert Lloyd, Roger Grosse and colleagues. For traditionalists, one of the co-authors has written a page on doing kernel design by hand. See GSFT12 for a mind-melting compositional kernel diagram.

Figure from GSFT12 (Grosse, Salakhutdinov, Freeman and Tenenbaum, Exploiting compositionality to explore a large space of model structures): “Examples of existing machine learning models which fall under our framework. Arrows represent models reachable using a single production rule. Only a small fraction of the 2496 models reachable within 3 steps are shown, and not all possible arrows are shown.”
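
The compositional idea is easy to sketch: sums and products of positive-definite kernels are again positive definite, so a grammar over {+, ×} and a few base kernels spans a large model space. A toy version (function names mine, hypothetical; base-kernel formulas are the standard RBF and periodic ones):

```python
import numpy as np

def rbf(ls):
    """Gaussian RBF kernel with lengthscale ls, on scalars."""
    return lambda x, z: np.exp(-(x - z) ** 2 / (2 * ls ** 2))

def periodic(period, ls):
    """Standard periodic kernel (MacKay form), on scalars."""
    return lambda x, z: np.exp(-2 * np.sin(np.pi * (x - z) / period) ** 2 / ls ** 2)

def add(k1, k2):
    return lambda x, z: k1(x, z) + k2(x, z)   # sum of PSD kernels is PSD

def mul(k1, k2):
    return lambda x, z: k1(x, z) * k2(x, z)   # product of PSD kernels is PSD

# e.g. "locally periodic" structure: RBF x periodic
k = mul(rbf(ls=2.0), periodic(period=1.0, ls=0.7))
print(k(0.3, 1.4))
```

The automated-search papers essentially explore this grammar greedily, scoring each composite kernel by marginal likelihood.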

Alex Smola, who (with Bernhard Schölkopf) has his name on a terrifying proportion of publications in this area, also has all his publications online.

The oft-cited grandfather of all the reproducing-kernel stuff is Aronszajn’s 1950 work (Aron50), although it didn’t percolate into machine learning for decades.

Kernel approximation

See kernel approximation.

RKHS distribution embedding

See that page.

Refs

AlMa14
Alaoui, A. E., & Mahoney, M. W.(2014) Fast Randomized Kernel Methods With Statistical Guarantees. arXiv:1411.0306 [Cs, Stat].
AlSH04
Altun, Y., Smola, A. J., & Hofmann, T. (2004) Exponential Families for Conditional Random Fields. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (pp. 2–9). Arlington, Virginia, United States: AUAI Press
Aron50
Aronszajn, N. (1950) Theory of Reproducing Kernels. Transactions of the American Mathematical Society, 68(3), 337–404. DOI.
Bach15
Bach, F. (2015) On the Equivalence between Kernel Quadrature Rules and Random Feature Expansions. arXiv Preprint arXiv:1502.06800.
Bach00
Bach, F. (2008) Exploring large feature spaces with hierarchical multiple kernel learning. In Advances in Neural Information Processing Systems (NIPS 2008).
Bach13
Bach, F. R.(2013) Sharp analysis of low-rank kernel matrix approximations. In COLT (Vol. 30, pp. 185–209).
BaZT04
Bakır, G. H., Zien, A., & Tsuda, K. (2004) Learning to Find Graph Pre-images. In C. E. Rasmussen, H. H. Bülthoff, B. Schölkopf, & M. A. Giese (Eds.), Pattern Recognition (pp. 253–261). Springer Berlin Heidelberg
Ball13
Balle Pigem, B. de. (2013) Learning finite-state machines: statistical and algorithmic aspects (doctoral thesis). Universitat Politècnica de Catalunya, Departament de Llenguatges i Sistemes Informàtics, Barcelona
BLGR16
Balog, M., Lakshminarayanan, B., Ghahramani, Z., Roy, D. M., & Teh, Y. W.(2016) The Mondrian Kernel. arXiv:1606.05241 [Stat].
BaRo02
Baxter, B., & Roussos, G. (2002) A New Error Estimate of the Fast Gauss Transform. SIAM Journal on Scientific Computing, 24(1), 257–259. DOI.
BOSS08
Ben-Hur, A., Ong, C. S., Sonnenburg, S., Schölkopf, B., & Rätsch, G. (2008) Support Vector Machines and Kernels for Computational Biology. PLoS Comput Biol, 4(10), e1000173. DOI.
Burg98
Burges, C. J. C.(1998) Geometry and Invariance in Kernel Based Methods. In B. Schölkopf, C. J. Burges, & A. J. Smola (Eds.), Advances in Kernel Methods - Support Vector Learning. Cambridge, MA: MIT Press
CaOC01
Carrasco, R. C., Oncina, J., & Calera-Rubio, J. (2001) Stochastic Inference of Regular Tree Languages. Machine Learning, 44(1–2), 185–197. DOI.
CLVZ11
Chatfield, K., Lempitsky, V., Vedaldi, A., & Zisserman, A. (2011) The devil is in the details: an evaluation of recent feature encoding methods.
ChLi09
Cheney, E. W., & Light, W. A. (2009) A Course in Approximation Theory. American Mathematical Soc.
ChSi16
Choromanski, K., & Sindhwani, V. (2016) Recycling Randomness with Structure for Sublinear time Kernel Expansions. arXiv:1605.09049 [Cs, Stat].
ClFW06
Clark, A., Florêncio, C. C., & Watkins, C. (2006) Languages as Hyperplanes: Grammatical Inference with String Kernels. In J. Fürnkranz, T. Scheffer, & M. Spiliopoulou (Eds.), Machine Learning: ECML 2006 (pp. 90–101). Springer Berlin Heidelberg
CFWS06
Clark, A., Florêncio, C. C., Watkins, C., & Serayet, M. (2006) Planar Languages and Learnability. In Y. Sakakibara, S. Kobayashi, K. Sato, T. Nishino, & E. Tomita (Eds.), Grammatical Inference: Algorithms and Applications (pp. 148–160). Springer Berlin Heidelberg
ClWa08
Clark, A., & Watkins, C. (2008) Some Alternatives to Parikh Matrices Using String Kernels. Fundamenta Informaticae, 84(3), 291–303.
CoDu02
Collins, M., & Duffy, N. (2002) Convolution Kernels for Natural Language. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14 (pp. 625–632). MIT Press
CoHM04
Cortes, C., Haffner, P., & Mohri, M. (2004) Rational Kernels: Theory and Algorithms. J. Mach. Learn. Res., 5, 1035–1062.
CuSm02
Cucker, F., & Smale, S. (2002) On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1), 1–49. DOI.
CuSS08
Cunningham, J. P., Shenoy, K. V., & Sahani, M. (2008) Fast Gaussian process methods for point process intensity estimation. In Proceedings of the 25th International Conference on Machine Learning (pp. 192–199). ACM Press DOI.
DaFG14
Danafar, S., Fukumizu, K., & Gomez, F. (2014) Kernel-based Information Criterion. arXiv:1408.5810 [Stat].
DrMa05
Drineas, P., & Mahoney, M. W.(2005) On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6, 2153–2175.
DLGT13
Duvenaud, D., Lloyd, J. R., Grosse, R., Tenenbaum, J. B., & Ghahramani, Z. (2013) Structure Discovery in Nonparametric Regression through Compositional Kernel Search. arXiv:1302.4922 [Cs, Stat].
ElDD03
Elgammal, A., Duraiswami, R., & Davis, L. S.(2003) Efficient kernel density estimation using the fast gauss transform with applications to color modeling and tracking. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 25(11), 1499–1504. DOI.
FLRP13
FitzGerald, D., Liutkus, A., Rafii, Z., Pardo, B., & Daudet, L. (2013) Harmonic/percussive separation using kernel additive modelling. In Irish Signals & Systems Conference 2014 and 2014 China-Ireland International Conference on Information and Communications Technologies (ISSC 2014/CIICT 2014) (pp. 35–40). IET
Flor07
Florêncio, C. C.(2007) Unsupervised learning of graph languages using kernels.
Gent02
Genton, M. G.(2002) Classes of Kernels for Machine Learning: A Statistics Perspective. J. Mach. Learn. Res., 2, 299–312.
GiKa08
Gianola, D., & Kaam, J. B. C. H. M. van. (2008) Reproducing Kernel Hilbert Spaces Regression Methods for Genomic Assisted Prediction of Quantitative Traits. Genetics, 178(4), 2289–2303. DOI.
GlLi16
Globerson, A., & Livni, R. (2016) Learning Infinite-Layer Networks: Beyond the Kernel Trick. arXiv:1606.05316 [Cs].
GrDa05
Grauman, K., & Darrell, T. (2005) The pyramid match kernel: discriminative classification with sets of image features. In Tenth IEEE International Conference on Computer Vision, 2005. ICCV 2005 (Vol. 2, pp. 1458–1465). DOI.
GrSt91
Greengard, L., & Strain, J. (1991) The Fast Gauss Transform. SIAM Journal on Scientific and Statistical Computing, 12(1), 79–94. DOI.
GBFS00
Gretton, A., Borgwardt, K., Fukumizu, K., Schölkopf, B., & Smola, A. (n.d.) Kernel Algorithms and Distribution Embeddings in Reproducing Kernel Hilbert Spaces.
GFTS08
Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Schölkopf, B., & Smola, A. J.(2008) A Kernel Statistical Test of Independence. In Advances in Neural Information Processing Systems 20: Proceedings of the 2007 Conference. Cambridge, MA: MIT Press
GSFT12
Grosse, R., Salakhutdinov, R. R., Freeman, W. T., & Tenenbaum, J. B.(2012) Exploiting compositionality to explore a large space of model structures. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.
Haus99
Haussler, D. (1999) Convolution kernels on discrete structures. Technical report, UC Santa Cruz
HeBu14
Heinonen, M., & d’Alché-Buc, F. (2014) Learning nonparametric differential equations with operator-valued kernels and gradient matching. arXiv:1411.5172 [Cs, Stat].
HoSS08
Hofmann, T., Schölkopf, B., & Smola, A. J.(2008) Kernel methods in machine learning. The Annals of Statistics, 36(3), 1171–1220. DOI.
KaFu00
Kanagawa, M., & Fukumizu, K. (n.d.) Recovering Distributions from Gaussian RKHS Embeddings.
KBGP16
Keriven, N., Bourrier, A., Gribonval, R., & Pérez, P. (2016) Sketching for Large-Scale Learning of Mixture Models. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6190–6194). DOI.
KiWa70
Kimeldorf, G. S., & Wahba, G. (1970) A Correspondence Between Bayesian Estimation on Stochastic Processes and Smoothing by Splines. The Annals of Mathematical Statistics, 41(2), 495–502. DOI.
KlRB10
Kloft, M., Rückert, U., & Bartlett, P. L.(2010) A Unifying View of Multiple Kernel Learning. In J. L. Balcázar, F. Bonchi, A. Gionis, & M. Sebag (Eds.), Machine Learning and Knowledge Discovery in Databases (pp. 66–81). Springer Berlin Heidelberg DOI.
KoCM08
Kontorovich, L. (Aryeh), Cortes, C., & Mohri, M. (2008) Kernel methods for learning languages. Theoretical Computer Science, 405(3), 223–236. DOI.
KoCM06
Kontorovich, L., Cortes, C., & Mohri, M. (2006) Learning Linearly Separable Languages. In J. L. Balcázar, P. M. Long, & F. Stephan (Eds.), Algorithmic Learning Theory (pp. 288–303). Springer Berlin Heidelberg
LaSH03
Lawrence, N., Seeger, M., & Herbrich, R. (2003) Fast sparse Gaussian process methods: The informative vector machine. In Proceedings of the 16th Annual Conference on Neural Information Processing Systems (pp. 609–616).
LiIS10
Li, F., Ionescu, C., & Sminchisescu, C. (2010) Random Fourier Approximations for Skewed Multiplicative Histogram Kernels. In M. Goesele, S. Roth, A. Kuijper, B. Schiele, & K. Schindler (Eds.), Pattern Recognition (pp. 262–271). Springer Berlin Heidelberg
LRPF14
Liutkus, A., Rafii, Z., Pardo, B., Fitzgerald, D., & Daudet, L. (2014) Kernel spectrogram models for source separation. (pp. 6–10). IEEE DOI.
LDGT14
Lloyd, J. R., Duvenaud, D., Grosse, R., Tenenbaum, J. B., & Ghahramani, Z. (2014) Automatic Construction and Natural-Language Description of Nonparametric Regression Models. arXiv:1402.4304 [Cs, Stat].
LSSC02
Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., & Watkins, C. (2002) Text Classification Using String Kernels. J. Mach. Learn. Res., 2, 419–444. DOI.
LNCS16
Lopez-Paz, D., Nishihara, R., Chintala, S., Schölkopf, B., & Bottou, L. (2016) Discovering Causal Signals in Images. arXiv:1605.08179 [Cs, Stat].
LLHE08
Lu, Z., Leen, T. K., Huang, Y., & Erdogmus, D. (2008) A Reproducing Kernel Hilbert Space Framework for Pairwise Time Series Distances. In Proceedings of the 25th International Conference on Machine Learning (pp. 624–631). New York, NY, USA: ACM DOI.
MaAm14
Manton, J. H., & Amblard, P.-O. (2014) A Primer on Reproducing Kernel Hilbert Spaces. arXiv:1408.0952 [Math].
McEl11
McFee, B., & Ellis, D. P.(2011) Analyzing song structure with spectral clustering. In IEEE conference on Computer Vision and Pattern Recognition (CVPR).
MFSG14
Muandet, K., Fukumizu, K., Sriperumbudur, B., Gretton, A., & Schölkopf, B. (2014) Kernel Mean Shrinkage Estimators. arXiv:1405.5505 [Cs, Stat].
MFSS16
Muandet, K., Fukumizu, K., Sriperumbudur, B., & Schölkopf, B. (2016) Kernel Mean Embedding of Distributions: A Review and Beyond. arXiv:1605.09522 [Cs, Stat].
MMRT01
Müller, K., Mika, S., Rätsch, G., Tsuda, K., & Schölkopf, B. (2001) An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2), 181–201. DOI.
RaRe07
Rahimi, A., & Recht, B. (2007) Random features for large-scale kernel machines. In Advances in neural information processing systems (pp. 1177–1184). Curran Associates, Inc.
RaRe09
Rahimi, A., & Recht, B. (2009) Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning. In Advances in neural information processing systems (pp. 1313–1320). Curran Associates, Inc.
RaWe14
Ramdas, A., & Wehbe, L. (2014) Stein Shrinkage for Cross-Covariance Operators and Kernel Independence Testing. arXiv:1406.1922 [Stat].
RaDu00
Raykar, V. C., & Duraiswami, R. (n.d.) The improved fast Gauss transform with applications to machine learning.
RaYD05
Raykar, V. C., Yang, C., & Duraiswami, R. (2005) Improved fast Gauss transform user manual.
ScHS01
Schölkopf, B., Herbrich, R., & Smola, A. J.(2001) A Generalized Representer Theorem. In D. Helmbold & B. Williamson (Eds.), Computational Learning Theory (pp. 416–426). Springer Berlin Heidelberg
SKSB98
Schölkopf, B., Knirsch, P., Smola, A., & Burges, C. (1998) Fast Approximation of Support Vector Kernel Expansions, and an Interpretation of Clustering as Approximation in Feature Spaces. In P. Levi, M. Schanz, R.-J. Ahlers, & F. May (Eds.), Mustererkennung 1998 (pp. 125–132). Springer Berlin Heidelberg
SMBK99
Schölkopf, B., Mika, S., Burges, C. J. C., Knirsch, P., Müller, K.-R., Rätsch, G., & Smola, A. J.(1999) Input Space Versus Feature Space in Kernel-Based Methods. IEEE Transactions on Neural Networks, 10, 1000–1017.
SMFP15
Schölkopf, B., Muandet, K., Fukumizu, K., & Peters, J. (2015) Computing Functions of Random Variables via Reproducing Kernel Hilbert Space Representations. arXiv:1501.06794 [Cs, Stat].
ScSm02
Schölkopf, B., & Smola, A. J.(2002) Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press
ScSm03
Schölkopf, B., & Smola, A. J.(2003) A Short Introduction to Learning with Kernels. In S. Mendelson & A. J. Smola (Eds.), Advanced Lectures on Machine Learning (pp. 41–64). Springer Berlin Heidelberg
ScSM97
Schölkopf, B., Smola, A., & Müller, K.-R. (1997) Kernel principal component analysis. In W. Gerstner, A. Germond, M. Hasler, & J.-D. Nicoud (Eds.), Artificial Neural Networks — ICANN’97 (pp. 583–588). Springer Berlin Heidelberg DOI.
SmSc98
Smola, A. J., & Schölkopf, B. (1998) On a Kernel-Based Method for Pattern Recognition, Regression, Approximation, and Operator Inversion. Algorithmica, 22(1–2), 211–231. DOI.
SmSc00
Smola, A. J., & Schölkopf, B. (2000) Sparse greedy matrix approximation for machine learning.
SmSc04
Smola, A. J., & Schölkopf, B. (2004) A tutorial on support vector regression. Statistics and Computing, 14(3), 199–222. DOI.
SmSM98
Smola, A. J., Schölkopf, B., & Müller, K.-R. (1998) The connection between regularization operators and support vector kernels. Neural Networks, 11(4), 637–649. DOI.
SnGh05
Snelson, E., & Ghahramani, Z. (2005) Sparse Gaussian processes using pseudo-inputs. In Advances in neural information processing systems (pp. 1257–1264).
SGFL08
Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Lanckriet, G., & Schölkopf, B. (2008) Injective Hilbert Space Embeddings of Probability Measures. In Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008).
SzRi09
Székely, G. J., & Rizzo, M. L.(2009) Brownian distance covariance. The Annals of Applied Statistics, 3(4), 1236–1265. DOI.
SzRB07
Székely, G. J., Rizzo, M. L., & Bakirov, N. K.(2007) Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6), 2769–2794. DOI.
TiNh01
Tipping, M. E. (2001) Sparse Kernel Principal Component Analysis.
VeZi10
Vedaldi, A., & Zisserman, A. (2010) Efficient additive kernels via explicit feature maps. In 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Vol. 34, pp. 3539–3546). DOI.
VeZi12
Vedaldi, A., & Zisserman, A. (2012) Efficient Additive Kernels via Explicit Feature Maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3), 480–492. DOI.
VVZJ10
Vempati, S., Vedaldi, A., Zisserman, A., & Jawahar, C. (2010) Generalized RBF feature maps for efficient detection. In BMVC (pp. 1–11).
VeTS04
Vert, J.-P., Tsuda, K., & Schölkopf, B. (2004) A primer on kernel methods. In Kernel Methods in Computational Biology. MIT Press
VSKB10
Vishwanathan, S. V. N., Schraudolph, N. N., Kondor, R., & Borgwardt, K. M.(2010) Graph Kernels. J. Mach. Learn. Res., 11, 1201–1242.
WaKS08
Walder, C., Kim, K. I., & Schölkopf, B. (2008) Sparse Multiscale Gaussian Process Regression. In Proceedings of the 25th International Conference on Machine Learning (pp. 1112–1119). New York, NY, USA: ACM DOI.
WaSC06
Walder, C., Schölkopf, B., & Chapelle, O. (2006) Implicit Surface Modelling with a Globally Regularised Basis of Compact Support. Computer Graphics Forum, 25(3), 635–644. DOI.
WaST14
Wang, Y.-X., Smola, A., & Tibshirani, R. J.(2014) The Falling Factorial Basis and Its Statistical Applications. arXiv:1405.0558 [Stat].
Will01
Williams, C. K. I.(2001) On a Connection between Kernel PCA and Metric Multidimensional Scaling. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in Neural Information Processing Systems 13 (Vol. 46, pp. 675–681). MIT Press DOI.
WiAd13
Wilson, A. G., & Adams, R. P.(2013) Gaussian Process Kernels for Pattern Discovery and Extrapolation. arXiv:1302.4245 [Cs, Stat].
WDLX15
Wilson, A. G., Dann, C., Lucas, C. G., & Xing, E. P.(2015) The Human Kernel. arXiv:1510.07389 [Cs, Stat].
WuZh08
Wu, Q., & Zhou, D.-X. (2008) Learning with sample dependent hypothesis spaces. Computers & Mathematics with Applications, 56(11), 2896–2907. DOI.
XPPP08
Xu, J.-W., Paiva, A. R. C., Park, I., & Principe, J. C.(2008) A Reproducing Kernel Hilbert Space Framework for Information-Theoretic Learning. IEEE Transactions on Signal Processing, 56(12), 5891–5902. DOI.
YaDD04
Yang, C., Duraiswami, R., & Davis, L. S.(2004) Efficient kernel machines using the improved fast Gauss transform. In Advances in neural information processing systems (pp. 1561–1568).
YDGD03
Yang, C., Duraiswami, R., Gumerov, N. A., & Davis, L. (2003) Improved Fast Gauss Transform and Efficient Kernel Density Estimation. In Proceedings of the Ninth IEEE International Conference on Computer Vision - Volume 2 (p. 464). Washington, DC, USA: IEEE Computer Society DOI.
YSAM14
Yang, J., Sindhwani, V., Avron, H., & Mahoney, M. (2014) Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels. arXiv:1412.8293 [Cs, Math, Stat].
YLMJ12
Yang, T., Li, Y.-F., Mahdavi, M., Jin, R., & Zhou, Z.-H. (2012) Nyström method vs random fourier features: A theoretical and empirical comparison. In Advances in neural information processing systems (pp. 476–484).
ZPJS12
Zhang, K., Peters, J., Janzing, D., & Schölkopf, B. (2012) Kernel-based Conditional Independence Test and Application in Causal Discovery. arXiv:1202.3775 [Cs, Stat].
ZFGS16
Zhang, Q., Filippi, S., Gretton, A., & Sejdinovic, D. (2016) Large-Scale Kernel Methods for Independence Testing. arXiv:1606.07892 [Stat].