
(Reproducing) kernel tricks

Kernel in the sense of the “kernel trick”. Not to be confused with density-estimation-type convolution kernels, nor the dozens of other related-but-slightly-different clashing definitions of kernel; those get their own pages.

Kernel tricks use not-necessarily-Euclidean “reproducing” kernels, a.k.a. Mercer kernels (Merc09), to implicitly define convenient Hilbert “feature” spaces for your purpose. Alternatively, you might like to make your Hilbert-space basis explicit by doing a basis transform, or by taking your implicit feature map and approximating it, but you don’t need to. In fact, the implicit space induced by your reproducing kernel might (and in general will) look odd indeed, something with no finite-dimensional representation. That’s the “trick” part: you only ever evaluate inner products in that space through the kernel itself.
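To make the trick concrete, here is a minimal numpy sketch (my own illustration, not code from any of the references here): the degree-2 polynomial kernel evaluates exactly the inner product of an explicit quadratic feature map, without ever constructing that map. For an RBF kernel the analogous feature space is infinite-dimensional, so implicit evaluation is the only option.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)

def phi(v):
    """Explicit feature map matching the kernel k(x, y) = (x @ y + 1) ** 2."""
    d = len(v)
    feats = [1.0]
    feats += [np.sqrt(2) * v[i] for i in range(d)]
    feats += [v[i] * v[j] * (np.sqrt(2) if i < j else 1.0)
              for i in range(d) for j in range(i, d)]
    return np.array(feats)

k_implicit = (x @ y + 1) ** 2   # "kernel trick": O(d) work, no feature map
k_explicit = phi(x) @ phi(y)    # same number, via the explicit feature space
assert np.isclose(k_implicit, k_explicit)
```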

TODO: clear explanation, less blather. Until then, see ScSm02, which is a very well-written textbook covering an unbelievable amount of ground without pausing to make a fuss, or MaAm15, which is more narrowly focussed on just the Mercer-kernel part, or ChLi09 for an approximation-theory perspective.

Spoiler: you upgrade your old boring linear algebra on finite- (usually low-) dimensional spaces to sexy new curvy algebra on potentially infinite-dimensional function spaces, which nonetheless still admits a finite-dimensional representation when you need one. Or, if you’d like, you apply statistical learning methods built for things with an obvious finite vector-space representation (\(\mathbb{R}^n\)) to things without one (sentences, piano-rolls, \(\mathcal{C}^d_\ell\)).

With smallish data you get lots of sexy guarantees and clever models. Practically, though, kernel methods have problems scaling to large data sets, problems that afflict even little old me with a mere \(N\simeq 10^5\) data points. Since the Gram matrix of pairwise inner products does not in general admit an accurate representation in fewer than \(N(N-1)/2\) entries, nor inversion in less than \(\mathcal{O}(N^3)\) time, you basically can’t handle big data naïvely.
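To see where the cost bites, a toy kernel ridge regression in numpy (my own sketch, not code from ScSm02): you must form the \(N\times N\) Gram matrix and solve a dense linear system against it, hence the \(\mathcal{O}(N^2)\) memory and \(\mathcal{O}(N^3)\) time.

```python
import numpy as np

def rbf_gram(X, Z, lengthscale=1.0):
    """Squared-exponential Gram matrix between the rows of X and Z."""
    sqdist = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sqdist / lengthscale ** 2)

def kernel_ridge_fit(X, y, lam=1e-3):
    """Solve (K + lam*I) alpha = y: O(N^2) memory for K, O(N^3) time to solve."""
    K = rbf_gram(X, X)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, alpha, X_test):
    return rbf_gram(X_test, X_train) @ alpha

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)
alpha = kernel_ridge_fit(X, y)
y_hat = kernel_ridge_predict(X, alpha, X)   # fine at N=500; grim at N=10^5
```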

OTOH, see the inducing-set methods and random-projection approximations, which make this more tractable in principle, e.g. for Gaussian process learning.
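A hedged sketch of the inducing-set flavour of fix, in the same toy setting as above: a Nyström / subset-of-regressors approximation built from \(m \ll N\) landmark points, which replaces the \(N\times N\) solve with an \(m\times m\) one. My own toy code, reusing rbf_gram and the data from the previous sketch, not any particular library’s API.

```python
import numpy as np

def nystrom_ridge_fit(X, y, m=20, lam=1e-3, seed=2):
    """Subset-of-regressors / Nystrom kernel ridge with m inducing points.

    Approximates K ~= K_nm K_mm^{-1} K_mn, so the dominant costs become
    O(N m^2) time and O(N m) memory instead of O(N^3) / O(N^2).
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)
    Z = X[idx]                          # landmark / inducing inputs
    K_nm = rbf_gram(X, Z)               # N x m  (rbf_gram as defined above)
    K_mm = rbf_gram(Z, Z)               # m x m
    A = K_nm.T @ K_nm + lam * K_mm      # regularised normal equations, m x m
    w = np.linalg.solve(A + 1e-10 * np.eye(m), K_nm.T @ y)
    return Z, w

def nystrom_predict(Z, w, X_test):
    return rbf_gram(X_test, Z) @ w

Z, w = nystrom_ridge_fit(X, y, m=30)    # X, y from the previous sketch
y_tilde = nystrom_predict(Z, w, X)
```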

The oft-cited grandfather of all the reproducing kernel stuff is Aronszajn’s 1950 work (Aron50) – see also Mercer (Merc09) – although this didn’t percolate into machine learning for decades.

I’m especially interested in

  1. nonparametric kernel independence tests
  2. efficient kernel pre-image approximation
  3. the connection between kernel PCA and clustering (SKSB98, Will01)
  4. kernel regression with RBFs
  5. kernel layers in neural networks

Kernel design

Automating kernel design involves some weird hacks. See the Automated Statistician project by David Duvenaud, James Robert Lloyd, Roger Grosse and colleagues. The AutoGP approach of Krauth, Bonilla et al. (KBCF16) instead uses gradient descent to design kernels for Gaussian processes. For traditionalists, one of the Automated Statisticians has written a page on doing kernel design by hand. See GSFT12 for a mind-melting compositional kernel diagram, and the small composition sketch below.

Grosse, Salakhutdinov, Freeman and Joshua B. Tenenbaum (GSFT12), “Exploiting compositionality to explore a large space of model structures”, figure caption:

“Examples of existing machine learning models which fall under our framework. Arrows represent models reachable using a single production rule. Only a small fraction of the 2496 models reachable within 3 steps are shown, and not all possible arrows are shown.”
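In code, the compositional grammar boils down to the closure of positive-definite kernels under addition and multiplication. A minimal sketch of that idea (mine, not the Automated Statistician’s actual implementation), building a locally-periodic-plus-trend kernel from SE, periodic, and linear pieces:

```python
import numpy as np

# Base kernels on scalars; each returns k(x, x').
def k_se(x, xp, ell=1.0):
    return np.exp(-0.5 * (x - xp) ** 2 / ell ** 2)

def k_per(x, xp, period=1.0, ell=1.0):
    return np.exp(-2.0 * np.sin(np.pi * np.abs(x - xp) / period) ** 2 / ell ** 2)

def k_lin(x, xp, c=0.0):
    return (x - c) * (xp - c)

# Positive-definite kernels are closed under + and *, so composite
# structures like SE * PER + LIN (locally periodic plus a trend) are
# themselves valid kernels; that is the search space GSFT12 explores.
def k_composite(x, xp):
    return k_se(x, xp, ell=5.0) * k_per(x, xp, period=1.0) + 0.01 * k_lin(x, xp)

X = np.linspace(0, 10, 50)
K = np.array([[k_composite(a, b) for b in X] for a in X])
# K should be (numerically) positive semi-definite:
assert np.min(np.linalg.eigvalsh(K)) > -1e-8
```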

Alex Smola (who, with Bernhard Schölkopf, has his name on a terrifying proportion of the publications in this area) has all his publications online.

Non-scalar kernels

Operator-valued kernels (Micchelli & Pontil, 2005) generalise the scalar case to maps \(k:\mathcal{X}\times \mathcal{X}\to \mathcal{L}(H_Y)\), taking values in the bounded linear operators on an output Hilbert space \(H_Y\), as seen in multi-task learning.
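The workhorse special case is the separable (intrinsic coregionalisation) kernel \(K(x,x') = k(x,x')\,B\), with \(B\) a positive semi-definite matrix coupling the output tasks; the joint Gram matrix is then a Kronecker product. A small illustrative sketch, in my own notation and toy code rather than Micchelli & Pontil’s:

```python
import numpy as np

def rbf(x, xp, ell=1.0):
    return np.exp(-0.5 * np.sum((x - xp) ** 2) / ell ** 2)

def separable_gram(X, B, ell=1.0):
    """Block Gram matrix for the matrix-valued kernel K(x, x') = rbf(x, x') * B.

    B is a T x T positive semi-definite 'task covariance'; the result is an
    (N*T) x (N*T) matrix, the Kronecker product of the scalar Gram with B.
    """
    N = len(X)
    K_scalar = np.array([[rbf(X[i], X[j], ell) for j in range(N)] for i in range(N)])
    return np.kron(K_scalar, B)

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 2))           # 10 inputs in R^2
A = rng.normal(size=(3, 3))
B = A @ A.T                            # PSD coupling between 3 output tasks
K_big = separable_gram(X, B)           # 30 x 30, jointly models all tasks
assert np.min(np.linalg.eigvalsh(K_big)) > -1e-8
```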

Kernel approximation

See kernel approximation.

RKHS distribution embedding

See that page.

Refs