I have to run massive similarity computations between vectors in a sparse matrix. Which is currently the better tool for this task, scipy.sparse or pandas?
After some research I found that both pandas and SciPy have structures to represent sparse matrices efficiently in memory, but neither has out-of-the-box support for computing similarities between vectors, such as cosine, adjusted cosine, or Euclidean distance. SciPy supports these on dense matrices only; for sparse matrices, SciPy supports dot products and other basic linear algebra operations.
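That said, pairwise cosine similarity can be assembled from the sparse operations SciPy does support (element-wise products, row norms, dot products). A minimal sketch, assuming CSR input; the helper name sparse_cosine_similarity and the random test matrix are my own, not from any library:

```python
import numpy as np
import scipy.sparse as sp

def sparse_cosine_similarity(A):
    """Pairwise cosine similarity between the rows of a sparse matrix.

    Uses only sparse element-wise products and sparse dot products, so no
    dense intermediate of shape A.shape is created. Note the (n_rows x n_rows)
    result can still be fairly dense if many rows share nonzero columns.
    """
    A = sp.csr_matrix(A, dtype=np.float64)
    # Row norms: sqrt of the sum of squared entries per row.
    norms = np.sqrt(np.asarray(A.multiply(A).sum(axis=1)).ravel())
    norms[norms == 0] = 1.0              # avoid division by zero for empty rows
    A_unit = sp.diags(1.0 / norms) @ A   # scale each row to unit length
    return A_unit @ A_unit.T             # Gram matrix of unit rows = cosines

# Example on a random sparse matrix
X = sp.random(5, 8, density=0.3, format="csr", random_state=0)
S = sparse_cosine_similarity(X)
```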
I have a sparse CSR matrix, sparse.csr_matrix(A), for which I would like to compute the matrix rank.
There are two options that I am aware of. I could convert it to a numpy matrix or array (.todense() or .toarray()) and then use np.linalg.matrix_rank(A), but that defeats my purpose of using the sparse matrix format, since I have extremely large matrices. The other option is to compute an SVD of the matrix (sparse matrix svd in python) and deduce the matrix rank from that.
Are there any other options for this? Is there currently a standard, most efficient way for me to compute the rank of a sparse matrix? I am relatively new to doing linear algebra in python, so any alternatives and suggestions with that in mind would be most helpful.
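For reference, a sketch of the second option (partial sparse SVD, then counting singular values above a tolerance). The function name estimate_rank and the default cutoff, borrowed from the one np.linalg.matrix_rank uses, are my own choices:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

def estimate_rank(A, k=100, tol=None):
    """Estimate the rank of a sparse matrix from its k largest singular values.

    This gives only a lower bound if the true rank exceeds k, because svds
    computes a partial SVD. tol defaults to the same kind of cutoff that
    np.linalg.matrix_rank uses.
    """
    A = sp.csr_matrix(A, dtype=np.float64)
    k = min(k, min(A.shape) - 1)          # svds requires 1 <= k < min(A.shape)
    s = svds(A, k=k, return_singular_vectors=False)
    if tol is None:
        tol = max(A.shape) * np.finfo(np.float64).eps * s.max()
    return int((s > tol).sum())
```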
I have been using the .todense() method and numpy's matrix rank function to calculate the answer.
It has given me satisfactory results so far.
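That dense route boils down to something like the following sketch (the random test matrix is my own); .toarray() materialises the full matrix, so it is only viable when the dense form fits in memory:

```python
import numpy as np
from scipy import sparse

A = sparse.random(500, 800, density=0.01, format="csr", random_state=0)
rank = np.linalg.matrix_rank(A.toarray())   # densifies A before computing the rank
```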
Refer to documentation section 2.3.5:
https://scikit-learn.org/stable/modules/clustering.html#spectral-clustering
The documentation says that spectral clustering is especially efficient if the affinity matrix is sparse (and the pyamg module is installed). My understanding of a sparse matrix is that there are more zeros than nonzeros within the matrix.
When an affinity matrix is 'sparse', does this mean there are fewer calculations overall to find the affinity matrix, and is that what makes it more efficient?
My reasoning is that these zero entries correspond to a distance of 0 and therefore to high similarity.
If this is true, would it be better to prepare the data with preprocessing.normalize rather than preprocessing.scale before doing a spectral cluster analysis?
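For context, one common way to obtain a genuinely sparse affinity matrix is a k-nearest-neighbours graph passed in as a precomputed affinity. A rough sketch; the blob data and parameter values are arbitrary examples of mine:

```python
from sklearn.datasets import make_blobs
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import SpectralClustering

# A kNN graph is sparse: most pairwise affinities are exactly zero, which is
# what lets the sparse eigensolvers (and pyamg, if installed) speed things up.
X, _ = make_blobs(n_samples=1000, centers=3, random_state=0)
affinity = kneighbors_graph(X, n_neighbors=10, include_self=True)
affinity = 0.5 * (affinity + affinity.T)     # symmetrise the graph

labels = SpectralClustering(
    n_clusters=3,
    affinity="precomputed",                  # hand the sparse affinity in directly
    assign_labels="discretize",
    random_state=0,
).fit_predict(affinity)
```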
I'm currently using sklearn's ProjectedGradientNMF and nimfa's Lsnmf solvers to factor a very sparse matrix. ProjectedGradientNMF runs more slowly but converges to a closer solution, while Lsnmf runs about twice as fast but converges to a solution that is further away (measured by the Frobenius norm).
I'm curious which are the fastest or most accurate solvers currently available to the Python community, or whether there is a better option for a sparse matrix (the matrix is sparse, but not scipy.sparse)?
There is a benchmark here: https://github.com/scikit-learn/scikit-learn/pull/4852. It is a pull request that includes the coordinate descent solver from mblondel: https://gist.github.com/mblondel/09648344984565f9477a
What do you mean by sparse not scipy.sparse? Which library is it from?
Version 0.17 of scikit-learn has a solver based on Coordinate Descent, which is considerably faster than the previous Projected Gradient.
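For example, a rough sketch of using that solver on a scipy.sparse matrix (the random matrix and parameter choices are mine, not a recommendation):

```python
import scipy.sparse as sp
from sklearn.decomposition import NMF

# scipy.sparse.random draws non-negative values by default, so it is valid NMF input.
X = sp.random(1000, 500, density=0.01, format="csr", random_state=0)

model = NMF(n_components=20, solver="cd", init="nndsvd", max_iter=300)
W = model.fit_transform(X)          # dense (1000 x 20) factor
H = model.components_               # dense (20 x 500) factor
print(model.reconstruction_err_)    # Frobenius-norm distance, as compared above
```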
I'm trying to decompose signals into components (matrix factorization) of a large sparse matrix in Python using the sklearn library.
I made use of scipy's scipy.sparse.csc_matrix to construct my matrix of data. However, I'm unable to perform any analysis such as factor analysis or independent component analysis. The only things I can do are use TruncatedSVD or scipy's scipy.sparse.linalg.svds and perform PCA.
Does anyone know any work-arounds to doing ICA or FA on a sparse matrix in python? Any help would be much appreciated! Thanks.
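For completeness, the TruncatedSVD workaround mentioned above looks roughly like this (the dimensions and density are placeholders of mine):

```python
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

# TruncatedSVD accepts scipy sparse input directly, whereas FastICA and
# FactorAnalysis would require densifying the matrix first.
X = sp.random(10000, 2000, density=0.001, format="csc", random_state=0)

svd = TruncatedSVD(n_components=50, random_state=0)
X_reduced = svd.fit_transform(X)    # dense (10000 x 50) array of component scores
```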
Given:
M = UΣV^T
The drawback of SVD is that the matrices U and V^T are dense. It doesn't really matter that the input matrix is sparse; U and V^T will be dense. Also, the computational complexity of SVD is O(n^2·m) or O(m^2·n), where n is the number of rows and m the number of columns of the input matrix M, depending on which dimension is larger.
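A quick illustration of that point with scipy's partial sparse SVD (the matrix here is a random placeholder of mine):

```python
import scipy.sparse as sp
from scipy.sparse.linalg import svds

M = sp.random(1000, 800, density=0.01, format="csr", random_state=0)
U, s, Vt = svds(M, k=10)

print(M.getformat(), M.nnz)   # sparse CSR input with few nonzeros
print(U.shape, Vt.shape)      # (1000, 10) and (10, 800) dense numpy arrays
```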
It is worth mentioning that SVD gives you the optimal low-rank approximation; if you can live with a somewhat larger loss, measured by the Frobenius norm, you might want to consider the CUR decomposition instead. It scales to larger datasets, with O(n·m) complexity.
M ≈ CUR
where C and R are now sparse matrices, since they are built from actual columns and rows of M.
If you want to look at a Python implementation, take a look at pymf. But be careful with that particular implementation, since at the time of writing there appears to be an open issue with it.
Even if the input matrix is sparse, the output will not be a sparse matrix. If the system cannot hold a dense matrix, it will not be able to hold the result either.
It is usually best practice to build the matrix as a coo_matrix and then convert it with .tocsc() in order to manipulate it.
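A minimal sketch of that pattern (the toy entries are arbitrary):

```python
import numpy as np
import scipy.sparse as sp

# Build incrementally in COO (coordinate) form, then convert for fast column ops.
rows = np.array([0, 0, 1, 3])
cols = np.array([0, 2, 2, 1])
vals = np.array([4.0, 1.0, 2.0, 7.0])

M = sp.coo_matrix((vals, (rows, cols)), shape=(4, 3))
M = M.tocsc()   # CSC supports efficient arithmetic and column slicing
```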
I have a large (500k by 500k), sparse matrix. I would like to get its principal components (in fact, even computing just the largest PC would be fine). Randomized PCA works great, except that it essentially finds the eigenvectors of the covariance matrix instead of the correlation matrix. Any ideas of a package that will compute PCA using the correlation matrix of a large, sparse matrix? Preferably in Python, though MATLAB and R work too.
(For reference, a similar question was asked here, but the methods there refer to the covariance matrix.)
Are they not the same thing? As far as I understand it, the correlation matrix is just the covariance matrix normalised by the product of each variable's standard deviation. And, if I recall correctly, isn't there a scaling ambiguity in PCA anyway?
Have you tried the irlba package in R? "The IRLBA package is the R language implementation of the method. With it, you can compute partial SVDs and principal component analyses of very large scale data. The package works well with sparse matrices and with other matrix classes like those provided by the Bigmemory package." You can check here for details.
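In Python, one possible workaround (not a packaged solution, just a sketch; the function name correlation_pca is my own) is to wrap the correlation matrix as an implicit LinearOperator and hand it to scipy's eigsh, so that neither the data matrix nor the correlation matrix is ever densified:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, eigsh

def correlation_pca(X, n_components=1):
    """Top principal components of the correlation matrix of a sparse X,
    computed without densifying X or forming the correlation matrix.

    Hypothetical helper, not a library function: it wraps the matrix-vector
    product C v = D^{-1} Cov D^{-1} v as a LinearOperator for eigsh.
    """
    X = sp.csr_matrix(X, dtype=np.float64)
    n, p = X.shape
    mean = np.asarray(X.mean(axis=0)).ravel()                    # column means
    sq_mean = np.asarray(X.multiply(X).mean(axis=0)).ravel()     # E[x^2] per column
    var = (sq_mean - mean ** 2) * n / (n - 1)                    # unbiased variances
    std = np.sqrt(var)
    std[std == 0] = 1.0                                          # guard constant columns

    def matvec(v):
        w = np.asarray(v).ravel() / std                          # D^{-1} v
        cov_w = (X.T @ (X @ w) - n * mean * (mean @ w)) / (n - 1)
        return cov_w / std                                       # D^{-1} Cov D^{-1} v

    C = LinearOperator((p, p), matvec=matvec, dtype=np.float64)
    vals, vecs = eigsh(C, k=n_components, which="LA")            # largest eigenpairs
    order = np.argsort(vals)[::-1]                               # sort largest first
    return vals[order], vecs[:, order]
```

With that, something like vals, vecs = correlation_pca(X, n_components=1) would give the largest correlation-matrix eigenpair while keeping everything in sparse or vector form.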