Fast non-negative matrix factorization on large sparse matrix - python

Using scikit-learn (v0.15.2) for non-negative matrix factorization on a large sparse matrix (less than 1% of values > 0). I want to find factors by minimizing error only on the non-zero values of the matrix (i.e., not computing error for entries that are zero), and to favor sparsity. I'm not sure if something's wrong with what I'm trying. The scikit-learn package's NMF and ProjectedGradientNMF have worked well for me before, but as the matrix size increases, the factorization becomes terribly slow.
I'm talking about matrices with > 10^10 cells. For a matrix with ~10^7 cells, I find the execution time to be good.
The parameters I've used are as follows: nmf_model = NMF(n_components=100, init='nndsvd', random_state=0, tol=0.01, sparseness='data').
When I tried slightly different parameters (changing to init='random'), I get the following warning, after which execution of the script halts:
/lib/python2.7/site-packages/sklearn/decomposition/nmf.py:252: UserWarning: Iteration limit reached in nls subproblem.
warnings.warn("Iteration limit reached in nls subproblem.")
Is there a way to make this faster and solve the above problem? I've tried using scipy sparse matrices (column- and row-compressed), but surprisingly it was slower on the test I did with a smaller matrix (~10^7 cells).
Considering that one would have to run many such factorizations (to choose an ideal number of factors, and for k-fold cross-validation), a faster way to solve this problem is highly desirable.
I'm also open to suggestions of packages/tools that aren't based on sklearn or Python. I understand questions about package/tool choices are not encouraged, but for such a specific use case, knowing what techniques others in the field use would be very helpful.

Maybe a few words on what the initial problem is about could enable us to give better answers.
Matrix Factorization on a very large matrix is always going to be slow due to the nature of the problem.
Suggestions:
Reducing n_components to < 20 will speed it up somewhat. However, the only real improvement in speed will be achieved by limiting the size of the matrix.
With a matrix like the one you describe, one might guess that you are trying to factorize a term-frequency matrix. If so, you could use the vectorization functions in scikit-learn to limit the size of the matrix. Most of them have a max_features parameter. Example:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(data)  # data: an iterable of raw documents
This will significantly speed up the problem solving.
Should I be completely wrong and this is not a term frequency problem, I would still look into ways to limit the initial matrix you are trying to factorize.

You might want to take a look at this article which discusses more recent techniques on NMF: http://www.cc.gatech.edu/~hpark/papers/nmf_blockpivot.pdf
The idea is to work only on the nonzero entries for the factorization, which reduces computation time, especially when the matrices involved are very sparse.
Also, one of the authors from the same article created NMF implementations on github including the ones mentioned in their article. Here's the link: https://github.com/kimjingu/nonnegfac-python
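For reference, a minimal usage sketch, assuming the NMF class and run() interface shown in that repository's README (verify against the current code):

import numpy as np
from nonnegfac.nmf import NMF

A = np.random.rand(300, 200)   # toy non-negative matrix
W, H, info = NMF().run(A, 10)  # rank-10 factorization (see the README for the exact convention of W and H)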
Hope that helps.

Old question, new answer.
The OP asks for "zero-masked" NMF, where zeros are treated as missing values. This will never be faster than normal NMF. Consider NMF by alternating least squares: there, the left-hand side of the systems of equations is generally constant (it is simply the tcrossprod of W or H), but in zero-masked NMF it needs to be re-calculated for every single sample or feature.
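To make that cost concrete, here is an illustrative Python sketch of one masked ALS half-step (my own illustration, not RcppML's implementation): in ordinary ALS the matrix W.T @ W is formed once per sweep, while with zero-masking the normal equations must be rebuilt per column.

import numpy as np

def masked_als_update_H(A, W, H):
    """One ALS half-step updating H, treating zeros in A as missing values."""
    k = W.shape[1]
    for j in range(A.shape[1]):
        obs = A[:, j] != 0                   # rows observed in column j
        Wj = W[obs]
        lhs = Wj.T @ Wj + 1e-9 * np.eye(k)   # rebuilt for every single column
        rhs = Wj.T @ A[obs, j]
        H[:, j] = np.maximum(np.linalg.solve(lhs, rhs), 0.0)  # crude nonnegativity
    return H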
I've implemented zero-masked NMF in the RcppML R package. You can install it from CRAN and use the nmf function setting the mask_zeros argument to TRUE:
install.packages("RcppML")
library(Matrix)                      # provides rsparsematrix()
A <- rsparsematrix(1000, 1000, 0.1)  # simulate a random sparse matrix
model <- RcppML::nmf(A, k = 10, mask_zeros = TRUE)
My NMF implementation is faster than scikit-learn without masking zeros, and shouldn't be impossibly slow for 99% sparse matrices.

Related

What is the most efficient way to calculate the eigen values of a large covariance matrix?

I have been trying for some days to compute the nearest positive semi-definite matrix to a very large covariance matrix, so that I can sample from it.
I have tried MATLAB for this, but the memory usage is insane and it always crashes eventually, with no error message or log file as far as I could find. The function used for the calculation can be found here: https://www.mathworks.com/matlabcentral/fileexchange/42885-nearestspd. Optimizing the function to remove intermediate matrices seemed to reduce the memory usage, but it eventually crashes in much the same way.
I found this approach for doing the calculation https://stackoverflow.com/a/63131309/18660401 and switched to Python, in hopes of finding GPU libraries to accelerate the calculations, but it seems I cannot find an up-to-date library that supports the eigenvector calculation the way the numpy function does. This is the function I am using:
import numpy as np

def get_near_psd(A):
    C = (A + A.T) / 2                  # symmetrize
    eigval, eigvec = np.linalg.eig(C)
    eigval[eigval < 0] = 0             # clip negative eigenvalues
    return eigvec.dot(np.diag(eigval)).dot(eigvec.T)
I am currently trying to run the same function with numba, in hopes that the translation to LLVM is enough to make the calculation finish in reasonable time; I only modified the above version to add the @jit decorator from numba.
There does not seem to be a well-optimized way to do this as far as I can find on my own, so any suggestion to crack this is very much appreciated.
Edit: The matrix is a 60416x60416 covariance matrix, and it is to be used to generate new samples from the distribution given by the mean and covariance calculated from a set of samples, within a GAN. For training purposes, samples also need to be generated by randomly sampling the distribution, for which I intend to use numpy's multivariate_normal function.
A very up-to-date library that does have these capabilities, including GPU support, is pytorch; check out the examples for the torch.linalg.eig function and the corresponding accelerated function torch.linalg.eigh for symmetric/Hermitian matrices. You do have to convert these matrices from numpy arrays to pytorch tensors first to do the computation (and then convert the result back), but you can definitely use it in a very similar way.
Of course this library can't just magically give you more memory either, but it is highly optimized.
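A minimal sketch of that suggestion (the helper name get_near_psd_torch is mine):

import numpy as np
import torch

def get_near_psd_torch(A_np):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    C = torch.from_numpy(A_np).to(device)
    C = (C + C.T) / 2                       # symmetrize
    eigval, eigvec = torch.linalg.eigh(C)   # real eigenpairs for symmetric input
    eigval = torch.clamp(eigval, min=0)     # clip negative eigenvalues
    out = (eigvec * eigval) @ eigvec.T      # eigvec @ diag(eigval) @ eigvec.T
    return out.cpu().numpy()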

Difference in results with sparse solver

I'm solving a non-linear elliptic PDE via linearization + iteration and a finite difference method: basically it comes down to solving a matrix equation Ax = b. A is a banded matrix. Due to the large size of A (typically ~8 billion elements) I have been using a sparse solver (scipy.sparse.linalg.spsolve) to do this. In my code, I compute a residual value which measures deviation from the true non-linear solution and lowers it with successive iterations. It turns out that there is a difference between the values that the sparse solver produces in comparison to what scipy.linalg.solve does.
[Residual outputs of the normal and sparse solvers omitted.]
The only difference in my code is the replacement of the solver. I don't think this is down to floating-point error, since the discrepancy creeps up to the 2nd decimal place (in the last iteration; but the order of magnitude also decreases, so I'm not sure). Any insights on why this might be happening? The final solution doesn't seem to be affected qualitatively, but I wonder whether this can create problems.
(No code has been included, since the only difference lies in the creation of the sparse matrix and the sparse solver. However, if you feel you need to check some part of it, please ask and I will include it.)
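In the absence of the original code, a hedged sketch of how one might quantify the discrepancy: solve the same synthetic banded system with both solvers and compare residuals.

import numpy as np
import scipy.linalg
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 2000  # toy tridiagonal system standing in for the real banded matrix
A = (np.diag(4.0 * np.ones(n))
     + np.diag(-1.0 * np.ones(n - 1), 1)
     + np.diag(-1.0 * np.ones(n - 1), -1))
b = np.random.rand(n)

x_dense = scipy.linalg.solve(A, b)
x_sparse = spla.spsolve(sp.csc_matrix(A), b)

print(np.linalg.norm(A @ x_dense - b))     # residual of the dense solver
print(np.linalg.norm(A @ x_sparse - b))    # residual of the sparse solver
print(np.max(np.abs(x_dense - x_sparse)))  # elementwise gap between solutions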

Efficient way to solve matrix equation in Python

Right now I am using numpy.linalg.solve, but using it on a problem built from a 5000x17956 matrix makes it really time-consuming: it runs really slowly and has already taken me more than an hour. The running time is probably O(n^3) for solving a matrix equation, but I never thought it would be this slow. Is there any way to solve it faster in Python?
My code is something like the following; it solves for a in the equation B.T * U.T = B.T * B * a, where m is the number of test cases (in my case over 5000), B is an m x 17956 data matrix, and U is 1 x m.
import numpy as np

# B (m x 17956) and U (1 x m) are assumed to be defined already
C = 0.005                      # hyperparameter term for regularization
I = np.identity(17956)         # 17956 x 17956 identity matrix
rhs = np.dot(B.T, U.T)         # (17956 x m) * (m x 1) = 17956 x 1
lhs = np.dot(B.T, B) + C * I   # (17956 x m) * (m x 17956) = 17956 x 17956
a = np.linalg.solve(lhs, rhs)  # B.T U.T = B.T B a; solve for a (17956 x 1)
Update (2 July 2018): The updated question asks about the impact of a regularization term and of the type of data in the matrices. In general, the datatype can make a large impact depending on what a particular CPU is most optimized for (as a rough rule of thumb, AMD is better with vectorized integer math and Intel is better with vectorized floating-point math, all other things being equal), and the presence of a large number of zero values can allow the use of sparse matrix libraries. In this particular case though, the change on the main diagonal (well under 1% of all the values in consideration) will have a negligible impact on runtime.
TLDR;
An hour is reasonable (a cubic regression suggests that this would take around 83 minutes on my machine, a low-end Chromebook).
The pre-processing to generate lhs and rhs accounts for almost none of that time.
You won't be able to solve that exact problem much faster than with numpy.linalg.solve.
If m is small as you suggest, and if B is invertible, you can instead solve the equation U.T = B a in a minute or less.
If this is part of a larger problem, this costly intermediate step might be able to be simplified away from a mathematical framework.
Performance bottlenecks really should be addressed with profiling to figure out which step is causing the issues.
Since this comes from real-world data, you might be able to get away with fewer features (either directly or through a reduction step like PCA, NMF, or LLE), depending on the end goal.
As mentioned in another answer, if the matrix is sufficiently sparse you can get away with sparse linear algebra routines to great effect (many natural language processing data sources are like this).
Since the output is a 1D vector, I would use np.dot(U, B).T instead of np.dot(B.T, U.T). Transposes are neat that way. This avoids doing the transpose on a big matrix like B, though since you have a cubic operation as the dominant step this doesn't matter much for your problem.
Depending on whether you need the original data anymore, and whether the matrices involved have any other special properties, you might be able to fiddle with the parameters in scipy.linalg.solve instead for a gain (see the sketch after this list).
I've had mixed success replacing large matrix equations with block matrix equations falling back on numpy routines. That approach typically saves 5-20% over numpy approaches and takes 1% or so off scipy approaches on my system. I haven't fully explored the reason for the discrepancy.
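On the scipy.linalg.solve point above: since lhs = B.T @ B + C*I is symmetric positive definite, passing assume_a='pos' lets scipy use a Cholesky factorization instead of a general LU solve, which is roughly twice as fast in flop count. A sketch with random stand-in data:

import numpy as np
import scipy.linalg

m, n = 500, 1796                 # scaled-down stand-ins for 5000 and 17956
B = np.random.rand(m, n)
U = np.random.rand(1, m)
C = 0.005

lhs = B.T @ B + C * np.eye(n)    # symmetric positive definite by construction
rhs = B.T @ U.T
a = scipy.linalg.solve(lhs, rhs, assume_a='pos')  # Cholesky-based solve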
Assuming your matrix is sparse, the scipy.sparse.linalg module will be useful. Here is the documentation for the whole module, and here is the documentation for spsolve.
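A sketch of that sparse route for this problem, again with random stand-in data:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

m, n = 500, 1796                              # scaled-down stand-ins
B = sp.random(m, n, density=0.01, format='csr')
u = np.random.rand(m)                         # the 1 x m vector as a flat array
C = 0.005

lhs = (B.T @ B + C * sp.identity(n)).tocsc()  # sparse normal equations
rhs = B.T @ u                                 # dense vector of length n
a = spsolve(lhs, rhs)                         # solution vector of length n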

How to efficiently calculate huge matrix multiplication (tfidf features) in Python?

I currently want to calculate all-pairs document similarity using cosine similarity and tfidf features in Python. My basic approach is the following:
from sklearn.feature_extraction.text import TfidfVectorizer
#c = [doc1, doc2, ..., docn]
vec = TfidfVectorizer()
X = vec.fit_transform(c)
del vec
Y = X * X.T
This works perfectly fine, but unfortunately not for my very large datasets. X has dimensions (350363, 2526183) and hence the output matrix Y should have shape (350363, 350363). X is very sparse thanks to the tfidf features and hence easily fits into memory (around 2GB only). Yet the multiplication gives me a memory error after running for some time (even though memory is not full; I suppose scipy is clever enough to anticipate the eventual memory usage).
I have already tried playing around with the dtypes, without success. I have also made sure that numpy and scipy have their BLAS libraries linked, although this has no effect on the csr_matrix dot functionality, as it is implemented in C. I thought of maybe using things like memmap, but I am not sure about that.
Does anyone have an idea of how to best approach this?
Even though X is sparse, X * X.T probably won't be: note that a pair of rows needs only one nonzero element in common to produce a nonzero entry in the product. You are working on an NLP task, so I am pretty sure there are huge numbers of words that occur in nearly all documents (and, as said before, it does not have to be one word shared by all pairs, just one, possibly different, word per pair). As a result you get a matrix of 350363^2, about 122,000,000,000 elements; if you don't have hundreds of gigabytes of RAM, it does not look computable. Try to perform much more aggressive filtering of the words in order to force X * X.T to be sparse (remove very common words).
In general you won't be able to compute the Gram matrix of big data unless you enforce the sparsity of X * X.T, so that most of your pairs of vectors (documents) have 0 "similarity". It can be done in numerous ways; the easiest is to set some threshold T under which you treat <a,b> as 0, compute the dot products yourself, and create an entry in the resulting sparse matrix iff <a,b> > T, as in the sketch below.
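A block-wise sketch of that idea (the threshold and block size are illustrative):

import scipy.sparse as sp

def thresholded_gram(X, T=0.2, block=1000):
    """Compute X @ X.T in row blocks, keeping only entries above T."""
    blocks = []
    for start in range(0, X.shape[0], block):
        Y = X[start:start + block] @ X.T   # one sparse block of similarities
        Y.data[Y.data <= T] = 0            # drop weak similarities in place
        Y.eliminate_zeros()                # free the dropped entries
        blocks.append(Y)
    return sp.vstack(blocks).tocsr()

Note that each intermediate block can still be fairly dense before thresholding, so the block size has to be tuned to the available memory.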
You may want to look at the random_projection module in scikit-learn. The Johnson-Lindenstrauss lemma says that a random projection matrix is guaranteed to preserve pairwise distances up to some tolerance eta, which is a hyperparameter when you calculate the number of random projections needed.
To cut a long story short, the scikit-learn class SparseRandomProjection seen here is a transformer to do this for you. If you run it on X after vec.fit_transform you should see a fairly large reduction in feature size.
The formula in sklearn.random_projection.johnson_lindenstrauss_min_dim shows that to preserve distances up to a 10% tolerance, you only need johnson_lindenstrauss_min_dim(350363, eps=0.1) = 10942 features. This is an upper bound, so you may be able to get away with much less. Even a 1% tolerance would only need johnson_lindenstrauss_min_dim(350363, eps=0.01) = 1028192 features, which is still significantly less than you have right now.
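A sketch of what that looks like in code, using X from the question:

from sklearn.random_projection import SparseRandomProjection

# eps=0.1 lets sklearn pick n_components from the JL bound (~10942 here)
srp = SparseRandomProjection(n_components='auto', eps=0.1, random_state=0)
X_small = srp.fit_transform(X)   # (350363, 2526183) -> (350363, ~10942)
Y = X_small @ X_small.T          # approximate similarities in the reduced space

Keep in mind the projection preserves distances only approximately, and Y is still 350363 x 350363; the projection only shrinks the feature dimension.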
EDIT:
Simple thing to try - if your data is dtype='float64', try using 'float32'. That alone can save a massive amount of space, especially if you do not need double precision.
If the issue is that you cannot store the "final matrix" in memory either, I would recommend working with the data in an HDFStore (as seen in pandas using PyTables). This link has some good starter code, and you could iteratively calculate chunks of your dot product and write them to disk. I have been using this extensively in a recent project on a 45GB dataset, and could provide more help if you decide to go this route.
What you could do is slice one row of X at a time, multiply it by X.T, and save the resulting row of the product to a file. Then move on to the next row.
It is still the same amount of calculation work, but you wouldn't run out of memory.
Using multiprocessing.Pool.map() or multiprocessing.Pool.map_async() you might be able to speed it up, provided you use numpy.memmap() to read the matrix in the mapped function. You would probably have to write each of the calculated rows to a separate file and merge them later. If you were to return the rows from the mapped function, they would have to be transferred back to the original process, which would take a lot of memory and IPC bandwidth.
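A single-process sketch of the block-wise computation (the block size is illustrative and must be tuned so one dense row block fits in memory):

import numpy as np

def blockwise_product_to_disk(X, out_path, block=500):
    """Stream X @ X.T to disk one block of rows at a time."""
    with open(out_path, 'wb') as f:
        for start in range(0, X.shape[0], block):
            Y = (X[start:start + block] @ X.T).toarray()  # one dense row block
            Y.astype(np.float32).tofile(f)                # float32 halves the size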

The fastest way to calculate eigenvalues of large matrices

Until now I have used numpy.linalg.eigvals to calculate the eigenvalues of square matrices with at least 1000 rows/columns where, in most cases, about a fifth of the entries are non-zero (I don't know if that should be considered a sparse matrix). I found another topic indicating that scipy may possibly do a better job.
However, since I have to calculate the eigenvalues of hundreds of thousands of large matrices of increasing size (possibly up to 20000 rows/columns, and yes, I need ALL of their eigenvalues), this will always take awfully long. If I can speed things up even just the tiniest bit, it would most likely be worth the effort.
So my question is: Is there a faster way to calculate the eigenvalues when not restricting myself to python?
@HighPerformanceMark is correct in the comments, in that the algorithms behind numpy (LAPACK and the like) are some of the best, but perhaps not state-of-the-art, numerical algorithms for diagonalizing full matrices. However, you can substantially speed things up if you have:
Sparse matrices
If your matrix is sparse, i.e. the number of filled entries k is such that k << N**2, then you should look at scipy.sparse.
Banded matrices
There are numerous algorithms for working with matrices of a specific banded structure.
Check out the solvers in scipy.linalg.solve_banded (and, for banded eigenvalue problems, scipy.linalg.eig_banded).
Largest Eigenvalues
Most of the time, you don't really need all of the eigenvalues. In fact, most of the physical information comes from the largest eigenvalues, and the rest are simply high-frequency oscillations that are only transient. In that case you should look into eigenvalue solvers that quickly converge to those largest eigenvalues/vectors, such as the Lanczos algorithm; see the example below.
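A minimal example with scipy's Lanczos-based solver (for symmetric matrices; scipy.sparse.linalg.eigs handles the general case):

import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

A = sp.random(20000, 20000, density=0.001, format='csr')
A = (A + A.T) / 2   # eigsh expects a symmetric matrix
# 20 largest-algebraic eigenvalues, without ever forming the dense matrix
vals = eigsh(A, k=20, which='LA', return_eigenvectors=False)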
An easy way to maybe get a decent speedup with no code changes (especially on a many-core machine) is to link numpy to a faster linear algebra library, like MKL, ACML, or OpenBLAS. If you're associated with an academic institution, the excellent Anaconda python distribution will let you easily link to MKL for free; otherwise, you can shell out $30 (in which case you should try the 30-day trial of the optimizations first) or do it yourself (a mildly annoying but definitely doable process).
I'd definitely try a sparse eigenvalue solver as well, though.
