Dissimilarity matrix of a scipy.sparse.csc.csc_matrix in Python

I am searching for a Python implementation of dissimilarity measures for a sparse matrix. I tried using scipy.spatial.distance.pdist, but I get an error:
ValueError: setting an array element with a sequence.
I think this is because pdist does not work on a scipy.sparse.csc.csc_matrix; it works fine on a dense matrix. So far I have been using
Dismat = A.T*A
to compute the Euclidean distance, where A is the sparse matrix and Dismat is the dissimilarity matrix (strictly, A.T*A is the Gram matrix, from which squared Euclidean distances can be derived). But I would like to compute other distances such as Manhattan, Jaccard, shortest path, and so on.
I am wondering if anyone knows of a Python package that calculates dissimilarity measures on a scipy.sparse.csc.csc_matrix. That would be great!
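One possibility, as a minimal sketch assuming scikit-learn is available (the data here is a placeholder): sklearn.metrics.pairwise_distances accepts SciPy sparse matrices directly for metrics such as Euclidean and Manhattan, though Jaccard requires dense input.

import scipy.sparse as sp
from sklearn.metrics import pairwise_distances

A = sp.random(100, 50, density=0.05, format='csc')  # stand-in for the real data
D_euc = pairwise_distances(A, metric='euclidean')   # dense (100, 100) distance matrix
D_man = pairwise_distances(A, metric='manhattan')   # Manhattan also handles sparse input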

Related

Rank of a sparse matrix in Python

I have a sparse csr matrix, sparse.csr_matrix(A), for which I would like to compute the matrix rank.
There are two options that I am aware of: I could convert it to a NumPy matrix or array (.todense() or .toarray()) and then use np.linalg.matrix_rank(A), which defeats my purpose of using the sparse matrix format, since I have extremely large matrices. The other option is to compute an SVD (sparse matrix svd in python) of the matrix and then deduce the rank from the singular values.
Are there any other options for this? Is there currently a standard, most efficient way to compute the rank of a sparse matrix? I am relatively new to doing linear algebra in Python, so any alternatives and suggestions with that in mind would be most helpful.
I have been using the .todense() method and then np.linalg.matrix_rank from NumPy to calculate the answers.
It has given me satisfactory answers so far.
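If densifying ever becomes infeasible, the SVD route mentioned in the question can stay sparse. A rough sketch with scipy.sparse.linalg.svds, where k is an assumed upper bound on the rank and the matrix is a placeholder:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

A = sp.random(2000, 2000, density=0.01, format='csr')  # placeholder matrix
k = 100                                                # guessed rank bound; must be < min(A.shape)
s = svds(A, k=k, return_singular_vectors=False)        # k largest singular values
tol = max(A.shape) * np.finfo(float).eps * s.max()     # tolerance rule mirroring np.linalg.matrix_rank
rank_estimate = int((s > tol).sum())

If rank_estimate comes back equal to k, the guess was too small: increase k and repeat.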

Why is SpectralClustering from sklearn efficient when affinity matrix is sparse?

Refer to documentation section 2.3.5:
https://scikit-learn.org/stable/modules/clustering.html#spectral-clustering
The documentation says that spectral clustering is especially efficient if the affinity matrix is sparse (and the pyamg module is installed). My understanding of a sparse matrix is that it has more zeros than nonzeros.
When an affinity matrix is 'sparse', does this mean there are fewer calculations overall to find the affinity matrix, and that this is what makes it more efficient?
My reasoning is that these 'sparse zero values' will have a distance of 0 and hence high similarity.
If this is true, would it be better to preprocess the data with preprocessing.normalize rather than preprocessing.scale before doing spectral cluster analysis?
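For concreteness, a minimal sketch of the sparse code path (the data and parameter values are placeholders): with affinity='nearest_neighbors' the estimator builds a sparse k-NN affinity graph, and eigen_solver='amg' then runs pyamg on that sparse graph.

import numpy as np
from sklearn.cluster import SpectralClustering

X = np.random.rand(500, 8)            # placeholder data
sc = SpectralClustering(
    n_clusters=3,
    affinity='nearest_neighbors',     # builds a sparse k-NN affinity graph
    n_neighbors=10,
    eigen_solver='amg',               # requires the pyamg module
)
labels = sc.fit_predict(X)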

Compute Cholesky decomposition of Sparse Matrix in Python

I'm trying to implement Reinsch's algorithm (p. 4).
Since the working matrices are sparse, I'm using the scipy.sparse module, but as you can see, Reinsch's algorithm needs the Cholesky decomposition of a sparse matrix (let's call it my_matrix) in order to solve a certain system, and I couldn't find anything related to this.
Of course, within the same algorithm I can solve the sparse system using, for instance, scipy.sparse.linalg.spsolve, and then at the end of the algorithm use something like:
R = numpy.linalg.cholesky( my_matrix.A )
But in my application my_matrix is usually about 800×800, so this last step is very inefficient.
So, my question is: where can I find such a decomposition?
Thanks in advance.
For a fast decomposition, you can try:
from scikits.sparse.cholmod import cholesky  # newer releases ship this as sksparse.cholmod
factor = cholesky(A.tocsc())  # CHOLMOD expects a sparse matrix in CSC format
x = factor(b)                 # solves A x = b, reusing the cached factorization
A is your sparse, symmetric, positive-definite matrix.
Note that cholesky operates on the sparse matrix directly, so there is no need to convert it to a dense array; and since your matrix is not huge (about 800×800), the factorization is fast in any case.
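If installing scikit-sparse is a hurdle, a rough alternative using only SciPy (a sketch, and not a Cholesky factor): a sparse LU factorization fills the same solve-many-right-hand-sides role.

from scipy.sparse.linalg import splu

lu = splu(my_matrix.tocsc())  # sparse LU factorization; splu wants CSC format
x = lu.solve(b)               # reuse lu for each new right-hand side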

How to invert a matrix that contains all-zero rows?

I am running this algorithm, which computes the MLE of the matrix normal distribution.
One part of the algorithm requires computing the inverse of a matrix that contains all-zero rows. How can we solve this problem?
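A matrix with an all-zero row is singular, so a true inverse does not exist. One common workaround, sketched below, is the Moore-Penrose pseudoinverse; whether it is appropriate inside this particular MLE iteration is an assumption.

import numpy as np

M = np.array([[1.0, 2.0],
              [0.0, 0.0]])   # the all-zero second row makes M singular
M_pinv = np.linalg.pinv(M)   # the pseudoinverse is defined for any matrix
# np.linalg.inv(M) would raise LinAlgError: Singular matrix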

PCA on large Sparse matrix using Correlation matrix

I have a large (500k by 500k) sparse matrix. I would like to get its principal components (in fact, even computing just the largest PC would be fine). Randomized PCA works great, except that it essentially finds the eigenvectors of the covariance matrix rather than the correlation matrix. Any ideas for a package that will compute PCA using the correlation matrix of a large, sparse matrix? Preferably in Python, though MATLAB and R work too.
(For reference, a similar question was asked here, but the methods refer to the covariance matrix.)
Are they not the same thing? As far as I understand it, the correlation matrix is just the covariance matrix normalised by the product of the two variables' standard deviations. And, if I recall correctly, isn't there a scaling ambiguity in PCA anyway?
Have you ever tried the irlba package in R? "The IRLBA package is the R language implementation of the method. With it, you can compute partial SVDs and principal component analyses of very large scale data. The package works well with sparse matrices and with other matrix classes like those provided by the Bigmemory package." You can check here for details.
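In Python, one way to get correlation-matrix PCA without densifying is to wrap the implicitly standardized matrix in a LinearOperator and hand it to svds. A sketch under assumptions (placeholder dimensions and density; population-variance formula for the column scales):

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, svds

A = sp.random(100000, 1000, density=0.001, format='csr')  # placeholder data
mu = np.asarray(A.mean(axis=0)).ravel()                   # column means
sd = np.sqrt(np.asarray(A.multiply(A).mean(axis=0)).ravel() - mu**2)
sd[sd == 0] = 1.0                                         # guard constant columns

# Z = (A - 1*mu^T) @ diag(1/sd) is the standardized matrix; its top right
# singular vector is the top PC of the correlation matrix. Z is never formed.
def matvec(x):
    xs = x / sd
    return A @ xs - mu @ xs            # the scalar mu @ xs broadcasts over rows

def rmatvec(y):
    return (A.T @ y - mu * y.sum()) / sd

Z = LinearOperator(A.shape, matvec=matvec, rmatvec=rmatvec, dtype=np.float64)
u, s, vt = svds(Z, k=1)                # largest principal component only
pc1 = vt[0]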
