
Kernel approximation

A page where I document what I don’t know about kernel approximation. A page about what I do know would be empty.

What I mean is: approximating implicit Mercer kernel feature maps with explicit features; equivalently, approximating the Gram matrix. This is also related to mixture model inference and clustering. I’m especially interested in how it might be done using randomised linear algebra. I am mostly interested in translation-invariant kernels, so assume I’m talking about those unless I say otherwise.
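
Concretely, in my notation: given a Mercer kernel \(k\) with implicit feature map \(\varphi\) into some RKHS \(\mathcal{H}\), we want an explicit finite-dimensional map \(z: \mathcal{X} \to \mathbb{R}^D\) such that

\[ k(x, y) = \langle \varphi(x), \varphi(y) \rangle_{\mathcal{H}} \approx z(x)^\top z(y), \]

so that the \(n \times n\) Gram matrix \(K\) can be replaced by \(Z Z^\top\) for some \(Z \in \mathbb{R}^{n \times D}\) with \(D \ll n\).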

This is not the related but distinct problem (a terminological collision) of approximating functions by mixtures of kernels. Also note that the Fast Gauss Transform and other fast multipole methods, while commonly used to approximate convolution kernels, can also approximate certain Mercer kernels, but they are not what I mean here: fast multipole methods give you an approximation to the field generated by all the kernels at all the support points. In kernel approximation we look for approximations, in some sense, to the component kernels themselves.

There is a short introduction at the scikit-learn kernel approximation page.
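
To fix ideas, here is a minimal sketch of that page’s API on toy data (the data, gamma and n_components are arbitrary choices of mine, not anything canonical):

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler, Nystroem
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))     # toy data; any design matrix will do
gamma = 0.5                        # RBF bandwidth parameter (assumed)

K = rbf_kernel(X, gamma=gamma)     # exact Gram matrix, O(n^2) memory

# Random Fourier features (Rahimi & Recht style)
Z_rff = RBFSampler(gamma=gamma, n_components=200, random_state=0).fit_transform(X)

# Nyström approximation built from a random subset of landmark points
Z_nys = Nystroem(gamma=gamma, n_components=200, random_state=0).fit_transform(X)

# Both give K ≈ Z @ Z.T; compare relative Frobenius error
for name, Z in [("RFF", Z_rff), ("Nystroem", Z_nys)]:
    err = np.linalg.norm(K - Z @ Z.T, "fro") / np.linalg.norm(K, "fro")
    print(name, err)
```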

DrMa05, YLMJ12, VeZi12, LiIS10, RaRe07, AlMa14, Bach15, Bach13 and VVZJ10 have relevant work here.

The approximations might be random-projection based or random-sampling based, e.g. the Nyström method, which is reportedly effectively Monte Carlo integration, although I understand there is an optimisation step in there too?
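
As far as I can tell, the plain Nyström construction is only a few lines; here is a rough sketch with uniform landmark sampling (the function name and the eigenvalue floor are my own choices, not from any particular paper):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def nystrom_features(X, m=100, gamma=1.0, rng=None):
    """Rank-m Nyström feature map Z with K ≈ Z @ Z.T (uniform landmark sampling)."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(X), size=m, replace=False)   # landmark points
    C = rbf_kernel(X, X[idx], gamma=gamma)            # n x m cross-kernel block
    W = C[idx]                                        # m x m landmark block
    vals, vecs = np.linalg.eigh(W)                    # eigendecompose W
    vals = np.maximum(vals, 1e-12)                    # guard tiny/negative eigenvalues
    return C @ vecs / np.sqrt(vals)                   # Z = C @ W^{-1/2}

X = np.random.default_rng(0).normal(size=(500, 10))
Z = nystrom_features(X, m=100, gamma=0.5, rng=0)
K = rbf_kernel(X, gamma=0.5)
print(np.linalg.norm(K - Z @ Z.T) / np.linalg.norm(K))
```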

I need to work out the differences between random Fourier features, random kitchen sinks, Nyström methods and whatever Smola et al (SKSB98) call their special-case Gaussian approximation. I think random Fourier features are the same as random kitchen sinks (RaRe07), and Nyström is different (WiSe01). When can we exploit (data- or kernel-) structure to do better (say, LeSS13, VVZJ10)? Can quasi Monte Carlo improve on random Monte Carlo? (Update: someone already had that idea: YSAM14.) Or better matrix factorisations?
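
For my own reference, the random Fourier feature / random kitchen sink construction of RaRe07 for the Gaussian kernel \(k(x,y)=\exp(-\gamma\|x-y\|^2)\) looks roughly like this (my notation and defaults; frequencies are drawn from the kernel’s spectral measure, as Bochner’s theorem suggests):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rff_features(X, D=200, gamma=1.0, rng=None):
    """Random Fourier features z(x) with E[z(x) @ z(y)] = exp(-gamma ||x - y||^2)."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    # exp(-gamma r^2) has spectral measure N(0, 2*gamma*I)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))
    b = rng.uniform(0, 2 * np.pi, size=D)             # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

X = np.random.default_rng(0).normal(size=(500, 10))
Z = rff_features(X, D=500, gamma=0.5, rng=1)
K = rbf_kernel(X, gamma=0.5)
print(np.linalg.norm(K - Z @ Z.T) / np.linalg.norm(K))
```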

Also, I guess we need to know the trade-offs of computational time/space cost versus approximation bounds, so that we can decide when to bother. When is it enough to reduce computational cost merely with support vectors, or to evaluate the kernels efficiently using, e.g., the Fast Gauss Transform and related methods, rather than coming up with alternative kernel bases? (For example, you might not want to do the fiddly coding for the Fast Gauss Transform.)
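
To make the trade-off concrete: exact kernel ridge regression costs \(O(n^2)\) memory and \(O(n^3)\) time in the Gram matrix, while ridge regression on \(D\) explicit random features costs \(O(nD)\) memory and roughly \(O(nD^2 + D^3)\) time. A toy comparison (sizes and parameters are arbitrary choices of mine):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge
from sklearn.kernel_approximation import RBFSampler

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=2000)

# Exact kernel ridge: builds and factorises the full 2000 x 2000 Gram matrix.
krr = KernelRidge(kernel="rbf", gamma=0.5, alpha=1e-3).fit(X, y)

# Approximate: D = 300 random Fourier features, then an ordinary linear ridge fit.
Z = RBFSampler(gamma=0.5, n_components=300, random_state=0).fit_transform(X)
rr = Ridge(alpha=1e-3).fit(Z, y)

print(krr.score(X, y), rr.score(Z, y))   # in-sample R^2 of exact vs approximate fit
```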

Does this help? Chris Ding’s Principal Component Analysis and Matrix Factorizations for Learning.

VVZJ10:

Recently, Maji and Berg and Vedaldi and Zisserman proposed explicit feature maps to approximate the additive kernels (intersection, \(\chi^2\), etc.) by linear ones, thus enabling the use of fast machine learning technique in a non-linear context. An analogous technique was proposed by Rahimi and Recht (RaRe07) for the translation invariant RBF kernels. In this paper, we complete the construction and combine the two techniques to obtain explicit feature maps for the generalized RBF kernels.
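
As it happens, scikit-learn ships implementations of both halves of this construction; a minimal sketch on toy histogram-like data (parameters arbitrary):

```python
import numpy as np
from sklearn.kernel_approximation import AdditiveChi2Sampler, SkewedChi2Sampler

rng = np.random.default_rng(0)
X = rng.random((100, 32))                 # non-negative histogram-style features
X /= X.sum(axis=1, keepdims=True)

# Vedaldi & Zisserman style sampled feature map for the additive chi^2 kernel
Z_chi2 = AdditiveChi2Sampler(sample_steps=2).fit_transform(X)

# Li, Ionescu & Sminchisescu's map for the skewed chi^2 kernel (LiIS10)
Z_skewed = SkewedChi2Sampler(skewedness=1.0, n_components=256,
                             random_state=0).fit_transform(X)

print(Z_chi2.shape, Z_skewed.shape)
```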

Lectures: Fastfood etc

Implementations

libaskit:

This is the LIBASKIT set of scalable machine learning and data analysis tools. Currently we provide codes for kernel sums, nearest-neighbors, kmeans clustering, kernel regression, and multiclass kernel logistic regression. All codes use OpenMP and MPI for shared memory and distributed memory parallelism.

[…] PNYSTR : (Parallel Nystrom method) Code for kernel summation using the Nystrom method.

Connections

MDS, kernel PCA and mixture models are all related in a way I should try to understand when I have a moment.

For the connection with kernel PCA, see Smola et al (SKSB98); for the connection with metric multidimensional scaling, see Williams (Will01).
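
Williams’ observation is easy to check numerically: classical MDS applied to the kernel-induced squared distances \(d(x,y)^2 = k(x,x) + k(y,y) - 2k(x,y)\) leads to exactly the same eigenproblem as kernel PCA on the doubly-centred Gram matrix. A rough sketch, assuming an RBF kernel and toy data:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
K = rbf_kernel(X, gamma=0.5)

n = len(K)
H = np.eye(n) - np.ones((n, n)) / n        # centring matrix

# Kernel PCA works on the doubly-centred Gram matrix
Kc = H @ K @ H

# Classical MDS: double-centre the kernel-induced squared distances
D2 = np.diag(K)[:, None] + np.diag(K)[None, :] - 2 * K
B = -0.5 * H @ D2 @ H

print(np.allclose(Kc, B))                  # True: the two eigenproblems coincide
```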

Refs

AlMa14
Alaoui, A. E., & Mahoney, M. W. (2014) Fast Randomized Kernel Methods With Statistical Guarantees. arXiv:1411.0306 [Cs, Stat].
Bach15
Bach, F. (2015) On the Equivalence between Kernel Quadrature Rules and Random Feature Expansions. arXiv Preprint arXiv:1502.06800.
Bach13
Bach, F. R. (2013) Sharp analysis of low-rank kernel matrix approximations. In COLT (Vol. 30, pp. 185–209).
BaWS04
Bakır, G. H., Weston, J., & Schölkopf, B. (2004) Learning to find pre-images. Advances in Neural Information Processing Systems, 16(7), 449–456.
BeMo10
Beylkin, G., & Monzón, L. (2010) Approximation by exponential sums revisited. Applied and Computational Harmonic Analysis, 28(2), 131–149. DOI.
BrBH16
Brault, R., d’Alché-Buc, F., & Heinonen, M. (2016) Random Fourier Features for Operator-Valued Kernels. arXiv:1605.02536 [Cs].
BrLB00
Brault, R., Lim, N., & d’Alché-Buc, F. (n.d.) Scaling up Vector Autoregressive Models With Operator-Valued Random Fourier Features.
ChLi09
Cheney, E. W., & Light, W. A. (2009) A Course in Approximation Theory. American Mathematical Soc.
ChSi16
Choromanski, K., & Sindhwani, V. (2016) Recycling Randomness with Structure for Sublinear time Kernel Expansions. arXiv:1605.09049 [Cs, Stat].
CuSS08
Cunningham, J. P., Shenoy, K. V., & Sahani, M. (2008) Fast Gaussian process methods for point process intensity estimation. (pp. 192–199). ACM Press DOI.
CBMF16
Cutajar, K., Bonilla, E. V., Michiardi, P., & Filippone, M. (2016) Practical Learning of Deep Gaussian Processes via Random Fourier Features. arXiv:1610.04386 [Stat].
DrMa05
Drineas, P., & Mahoney, M. W. (2005) On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6, 2153–2175.
GlLi16
Globerson, A., & Livni, R. (2016) Learning Infinite-Layer Networks: Beyond the Kernel Trick. arXiv:1606.05316 [Cs].
KwTs04
Kwok, J. T.-Y., & Tsang, I. W.-H. (2004) The Pre-Image Problem in Kernel Methods. IEEE Transactions on Neural Networks, 15(6), 1517–1525. DOI.
LeSS13
Le, Q., Sarlós, T., & Smola, A. (2013) Fastfood-approximating kernel expansions in loglinear time. In Proceedings of the international conference on machine learning.
LiIS10
Li, F., Ionescu, C., & Sminchisescu, C. (2010) Random Fourier Approximations for Skewed Multiplicative Histogram Kernels. In M. Goesele, S. Roth, A. Kuijper, B. Schiele, & K. Schindler (Eds.), Pattern Recognition (pp. 262–271). Springer Berlin Heidelberg DOI.
MiNY06
Minh, H. Q., Niyogi, P., & Yao, Y. (2006) Mercer’s theorem, feature maps, and smoothing. In International Conference on Computational Learning Theory (pp. 154–168). Springer DOI.
PoBe16a
Pourkamali-Anaraki, F., & Becker, S. (2016a) A Randomized Approach to Efficient Kernel Clustering. arXiv:1608.07597 [Stat].
PoBe16b
Pourkamali-Anaraki, F., & Becker, S. (2016b) Randomized Clustered Nystrom for Large-Scale Kernel Machines. arXiv:1612.06470 [Cs, Stat].
RaRe07
Rahimi, A., & Recht, B. (2007) Random features for large-scale kernel machines. In Advances in neural information processing systems (pp. 1177–1184). Curran Associates, Inc.
RaRe09
Rahimi, A., & Recht, B. (2009) Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning. In Advances in neural information processing systems (pp. 1313–1320). Curran Associates, Inc.
SKSB98
Schölkopf, B., Knirsch, P., Smola, A., & Burges, C. (1998) Fast Approximation of Support Vector Kernel Expansions, and an Interpretation of Clustering as Approximation in Feature Spaces. In P. Levi, M. Schanz, R.-J. Ahlers, & F. May (Eds.), Mustererkennung 1998 (pp. 125–132). Springer Berlin Heidelberg DOI.
ScSm02
Schölkopf, B., & Smola, A. J. (2002) Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
ScSM97
Schölkopf, B., Smola, A., & Müller, K.-R. (1997) Kernel principal component analysis. In W. Gerstner, A. Germond, M. Hasler, & J.-D. Nicoud (Eds.), Artificial Neural Networks — ICANN’97 (pp. 583–588). Springer Berlin Heidelberg DOI.
VeZi12
Vedaldi, A., & Zisserman, A. (2012) Efficient Additive Kernels via Explicit Feature Maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3), 480–492. DOI.
VVZJ10
Vempati, S., Vedaldi, A., Zisserman, A., & Jawahar, C. (2010) Generalized RBF feature maps for efficient detection. In BMVC (pp. 1–11).
Will01
Williams, C. K. I. (2001) On a Connection between Kernel PCA and Metric Multidimensional Scaling. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in Neural Information Processing Systems 13 (Vol. 46, pp. 675–681). MIT Press DOI.
WiSe01
Williams, C. K., & Seeger, M. (2001) Using the Nyström Method to Speed Up Kernel Machines. In Advances in Neural Information Processing Systems (pp. 682–688).
YSAM14
Yang, J., Sindhwani, V., Avron, H., & Mahoney, M. (2014) Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels. arXiv:1412.8293 [Cs, Math, Stat].
YLMJ12
Yang, T., Li, Y.-F., Mahdavi, M., Jin, R., & Zhou, Z.-H. (2012) Nyström method vs random Fourier features: A theoretical and empirical comparison. In Advances in neural information processing systems (pp. 476–484).
YuMB17
Yu, C. D., March, W. B., & Biros, G. (2017) An \(N \log N\) Parallel Fast Direct Solver for Kernel Matrices. arXiv:1701.02324 [Cs].
ZFGS16
Zhang, Q., Filippi, S., Gretton, A., & Sejdinovic, D. (2016) Large-Scale Kernel Methods for Independence Testing. arXiv:1606.07892 [Stat].