
Approximate matrix factorisation, sparse/low-rank

Forget QR and LU decompositions; there are now so many ways of factorising matrices that there are not enough acronyms in the alphabet to hold them, especially if you suspect your matrix is sparse, or could be made sparse because of some underlying constraint, or could probably, if squinted at in the right fashion, be seen as a graph transition matrix, or a Laplacian, or a noisy transform of some smooth object, or at least would be very close to sparse, or…

Your big matrix is the product (or sum, or…) of two matrices that are in some way simple (small-rank, small dimension, sparse), possibly with additional constraints. Can you find these simple matrices?
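
To make that concrete, here is a minimal numpy sketch (my own illustration, not taken from any of the references below): the best rank-\(r\) approximation in the least-squares sense comes straight from the truncated SVD, which already hands you a tall-skinny factor times a short-fat factor.

```python
import numpy as np

def low_rank_factors(A, r):
    """Best rank-r approximation A ~= B @ C (in the least-squares sense),
    read off from the truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    B = U[:, :r] * s[:r]   # tall-and-skinny factor
    C = Vt[:r]             # short-and-fat factor
    return B, C

rng = np.random.default_rng(0)
A = rng.standard_normal((300, 5)) @ rng.standard_normal((5, 200))  # genuinely rank 5
A = A + 0.01 * rng.standard_normal(A.shape)                        # plus a little noise
B, C = low_rank_factors(A, 5)
print(np.linalg.norm(A - B @ C) / np.linalg.norm(A))               # small, noise-level error
```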

Here’s an example: Godec, a decomposition into low-rank and sparse components (it looks like combined multidimensional factorisation and outlier detection). Implementations exist for MATLAB and R.
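
For flavour, here is a deliberately naive numpy sketch of that low-rank-plus-sparse idea. It alternates a truncated-SVD fit with hard-thresholding of the residual; this is only the rough idea, not the actual Godec algorithm (which accelerates the rank-\(r\) step with bilateral random projections), and the parameters \(r\), \(k\) and the iteration count are arbitrary illustrative choices.

```python
import numpy as np

def low_rank_plus_sparse(A, r, k, n_iter=50):
    """Alternate a rank-r SVD fit with hard-thresholding of the residual,
    keeping only the k largest-magnitude entries as the sparse part S.
    Just the idea, not the real (accelerated) Godec algorithm."""
    S = np.zeros_like(A)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(A - S, full_matrices=False)
        L = (U[:, :r] * s[:r]) @ Vt[:r]      # best rank-r fit to A - S
        R = A - L                            # residual
        S = np.zeros_like(A)
        idx = np.unravel_index(np.argsort(np.abs(R), axis=None)[-k:], A.shape)
        S[idx] = R[idx]                      # keep the k biggest residual entries
    return L, S

rng = np.random.default_rng(0)
A = rng.standard_normal((80, 6)) @ rng.standard_normal((6, 60))  # low-rank background
A[rng.integers(0, 80, 20), rng.integers(0, 60, 20)] += 10.0      # a few gross outliers
L, S = low_rank_plus_sparse(A, r=6, k=20)
```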

A hip one from the previous decade is Non-negative matrix factorisation (NMF), which I think is a classic. I should explain why.

There are so many of these things, depending on your preferred choice of loss function, free variables and such.

Keywords: Matrix sketching, low-rank approximation, traditional dimensionality reduction.

Matrix concentration inequalities turn out to be useful in making this work.

I would like to learn more about these.

Igor Carron’s Matrix Factorization Jungle classifies the following problems as matrix-factorisation type.

Kernel Factorizations: …
Spectral clustering: \(A = DX\) with unknown \(D\) and \(X\); solve for sparse \(X\), \(X_i = 0\) or \(1\)
K-Means / K-Median clustering: \(A = DX\) with unknown \(D\) and \(X\); solve for \(XX^T = I\), \(X_i = 0\) or \(1\)
Subspace clustering: \(A = AX\) with unknown \(X\); solve for \(X\) sparse or satisfying other conditions
Graph Matching: \(A = XBX^T\) with unknown \(X\) and \(B\); solve for \(B\) and for \(X\) a permutation
NMF: \(A = DX\) with unknown \(D\) and \(X\); solve for the elements of \(D\) and \(X\) positive
Generalized Matrix Factorization: \(W \odot L - W \odot UV'\) with \(W\) a known mask and \(U, V\) unknown (\(\odot\) the elementwise product); solve for \(U, V\), with \(L\) of lowest possible rank
Matrix Completion: \(A = H \odot L\) with \(H\) a known mask and \(L\) unknown; solve for \(L\) of lowest possible rank (see the sketch below)
Stable Principal Component Pursuit (SPCP) / Noisy Robust PCA: \(A = L + S + N\) with \(L, S, N\) unknown; solve for \(L\) low rank, \(S\) sparse, \(N\) noise
Robust PCA: \(A = L + S\) with \(L, S\) unknown; solve for \(L\) low rank and \(S\) sparse
Sparse PCA: \(A = DX\) with unknown \(D\) and \(X\); solve for sparse \(D\)
Dictionary Learning: \(A = DX\) with unknown \(D\) and \(X\); solve for sparse \(X\)
Archetypal Analysis: \(A = DX\) with unknown \(D\) and \(X\); solve for \(D = AB\) with \(D, B\) positive
Matrix Compressive Sensing (MCS): find a rank-\(r\) matrix \(L\) such that \(\mathcal{A}(L) \approx b\), or \(\mathcal{A}(L + S) = b\)
Multiple Measurement Vector (MMV): \(Y = AX\) with unknown \(X\) whose rows are sparse
Compressed sensing: \(Y = AX\) with unknown \(X\) whose rows are sparse and \(X\) has one column
Blind Source Separation (BSS): \(Y = AX\) with unknown \(A\) and \(X\), with statistical independence between the columns of \(X\), or between subspaces of columns of \(X\)
Partial and Online SVD/PCA: …
Tensor Decomposition: … (not sure about this one, but see orthogonally decomposable tensors)

Truncated Classic PCA is clearly also an example of this, but is excluded from the list for some reason. Boringness? The fact that it’s a special case of Sparse PCA?
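
To make one entry of that list concrete, here is a naive matrix-completion sketch in numpy in the spirit of SoftImpute, i.e. iterative soft-thresholding of singular values as a nuclear-norm relaxation of "lowest rank possible". The regularisation weight and iteration count are arbitrary illustrative assumptions, not tuned values.

```python
import numpy as np

def soft_impute(A, mask, lam=1.0, n_iter=200):
    """Fill in the unobserved entries of A (mask is True where observed)
    with a low-rank estimate, by iterative singular-value soft-thresholding."""
    X = np.where(mask, A, 0.0)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        s = np.maximum(s - lam, 0.0)      # shrink the singular values
        L = (U * s) @ Vt                  # current low-rank estimate
        X = np.where(mask, A, L)          # keep observed entries, impute the rest
    return L

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 80))  # rank-5 ground truth
mask = rng.random(A.shape) < 0.5                                  # observe half the entries
L = soft_impute(A, mask)
print(np.linalg.norm((A - L)[~mask]) / np.linalg.norm(A[~mask]))  # error on held-out entries
```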

See also learning on manifolds, compressed sensing, optimisation, random linear algebra, clustering, and sparse regression.

Why does it ever work?

For certain types of data matrix, here is a possibly plausible explanation:

[#UdTo19] ask “Why Are Big Data Matrices Approximately Low Rank?”

Matrices of (approximate) low rank are pervasive in data science, appearing in movie preferences, text documents, survey data, medical records, and genomics. While there is a vast literature on how to exploit low rank structure in these datasets, there is less attention paid to explaining why the low rank structure appears in the first place. Here, we explain the effectiveness of low rank models in data science by considering a simple generative model for these matrices: we suppose that each row or column is associated to a (possibly high dimensional) bounded latent variable, and entries of the matrix are generated by applying a piecewise analytic function to these latent variables. These matrices are in general full rank. However, we show that we can approximate every entry of an \(m\times n\) matrix drawn from this model to within a fixed absolute error by a low rank matrix whose rank grows as \(\mathcal{O}(\log(m+n))\). Hence any sufficiently large matrix from such a latent variable model can be approximated, up to a small entrywise error, by a low rank matrix.
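
A quick numerical illustration of that claim (my own toy example, not from the paper): generate a matrix whose entries are an analytic function of low-dimensional latent variables attached to each row and each column, then count how many singular values you actually need to keep.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 500, 400, 3
X = rng.uniform(-1, 1, size=(m, d))   # latent variable attached to each row
Y = rng.uniform(-1, 1, size=(n, d))   # latent variable attached to each column

# Each entry is an analytic function of the row and column latent variables.
A = np.exp(-((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1))

s = np.linalg.svd(A, compute_uv=False)
# Rank needed for 1% relative error in Frobenius norm:
tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1] / np.linalg.norm(A)
print(int(np.argmax(tail < 0.01)))    # small compared with min(m, n) == 400
```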

Overviews

Non-negative matrix factorisations

David Austin gives a simple introduction to the classic Non-negative matrix factorization for the American Mathematical Society.

This method is famous for decomposing things into parts using the \(l_2\) loss. It can be made even sparser by choosing other losses; TBC.
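
A minimal usage sketch with scikit-learn’s NMF, which minimises the Frobenius (\(l_2\)) loss by default and also exposes a beta_loss option (e.g. Kullback-Leibler) for other divergences. The data here is synthetic, chosen only to show the shapes of the factors.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = rng.random((100, 6)) @ rng.random((6, 40))   # non-negative, approximately rank 6

model = NMF(n_components=6, init="nndsvd", max_iter=500)
W = model.fit_transform(V)        # non-negative weights, shape (100, 6)
H = model.components_             # non-negative "parts",  shape (6, 40)
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))   # relative reconstruction error
```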

Sketching

“Sketching” is what I am going to use to describe the subset of factorisations which reduce the dimensionality of the matrices in question in a way I will make clear shortly.

[#Mart16] mentions CUR and interpolative decompositions. Does preconditioning fit here?
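
As a hedged example of what I mean, here is the basic randomised range-finder ("QB") sketch in numpy: multiply \(A\) by a thin random test matrix, orthonormalise, and you get a small factorisation \(A \approx QB\); the SVD of the small factor \(B\) then lifts to an approximate truncated SVD of \(A\). The sizes and oversampling amount here are illustrative assumptions.

```python
import numpy as np

def qb_sketch(A, r, oversample=10, seed=0):
    """Randomised range finder: A ~= Q @ B with Q orthonormal and thin."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((A.shape[1], r + oversample))  # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)    # orthonormal basis for the sketched column space
    B = Q.T @ A                       # small matrix: A projected onto that basis
    return Q, B

rng = np.random.default_rng(1)
A = rng.standard_normal((3000, 12)) @ rng.standard_normal((12, 400))
Q, B = qb_sketch(A, r=12)
print(np.linalg.norm(A - Q @ B) / np.linalg.norm(A))   # small for (near) low-rank A
```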

\(\mathcal{H}\)-matrix methods

There seems to be little contact with the related area of \(\mathcal{H}\)-matrix methods, as seen in, e.g., covariance matrices. Why is that?

(See hmatrix.org for one lab’s backgrounder and their implementation h2lib, or hlibpro for a black-box closed-source one.)

Randomised methods

Most of these algorithms have multiple optima and use a greedy search to find solutions; that is, they are deterministic up to choice of starting parameters.

There are also randomised versions.
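
For example, a randomised truncated SVD is available off the shelf in scikit-learn; a quick sketch with arbitrary illustrative sizes:

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 30)) @ rng.standard_normal((30, 1000))  # low-rank test matrix

U, s, Vt = randomized_svd(A, n_components=30, n_iter=5, random_state=0)
print(np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A))   # small
```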

Implementations

“Enough theory! Plug the hip new toy into my algorithm!”

OK.

Vowpal Wabbit does this.

libpmf (HPC, for MATLAB, R, Python, C++):

LIBPMF implements the CCD++ algorithm, which aims to solve large-scale matrix factorization problems such as the low-rank factorization problems for recommender systems.

NMF (R): TBD

laplacians.jl (Julia):

Laplacians is a package containing graph algorithms, with an emphasis on tasks related to spectral and algebraic graph theory. It contains (and will contain more) code for solving systems of linear equations in graph Laplacians, low stretch spanning trees, sparsification, clustering, local clustering, and optimization on graphs.

All graphs are represented by sparse adjacency matrices. This is both for speed, and because our main concerns are algebraic tasks. It does not handle dynamic graphs. It would be very slow to implement dynamic graphs this way.

Laplacians.jl was started by Daniel A. Spielman. Other contributors include Rasmus Kyng, Xiao Shi, Sushant Sachdeva, Serban Stan and Jackson Thea.

Matlab: Chih-Jen Lin’s nmf.m - “This tool solves NMF by alternative non-negative least squares using projected gradients. It converges faster than the popular multiplicative update approach.”

distributed nmf: offers both MPI and OpenMP implementations of the MU, HALS and ANLS/BPP-based NMF algorithms. These run off the shelf and are easy to integrate into other source code. They are highly tuned to run on supercomputers, and have been tested on NERSC and OLCF clusters; the OpenMP implementation is tested on many Linux variants with Intel processors. The library works well for both sparse and dense matrices. ([#KaBP16], [#Kann16], [#FKPB15])

Spams (C++/MATLAB/python) includes some matrix factorisations in its sparse approx toolbox. (see optimisation)

scikit-learn (python) does a few matrix factorisations in its inimitable batteries-in-the-kitchen-sink way.

nimfa (python) - “Nimfa is a Python library for nonnegative matrix factorization. It includes implementations of several factorization methods, initialization approaches, and quality scoring. Both dense and sparse matrix representation are supported.”

Tapkee (C++). Pro-tip – even without coding C++, tapkee does a long list of dimensionality reductions from the CLI.

Refs