I'm currently using sklearn's ProjectedGradientNMF and nimfa's Lsnmf solvers to factorize a very sparse matrix. ProjectedGradientNMF runs slower but converges to a closer solution (Frobenius-norm distance measure), while Lsnmf runs about twice as fast but converges to a solution that is further away.
I'm curious what the fastest or most accurate solvers currently available to the Python community are, and whether there is a better option for a sparse matrix (the matrix is sparse, not scipy.sparse)?
There is a benchmark here: https://github.com/scikit-learn/scikit-learn/pull/4852 which is a pull request including the coordinate descent solver from mblondel https://gist.github.com/mblondel/09648344984565f9477a
What do you mean by sparse not scipy.sparse? Which library is it from?
Version 0.17 of scikit-learn has a solver based on coordinate descent, which is considerably faster than the previous Projected Gradient solver.
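For reference, here is a minimal sketch of selecting the coordinate descent solver (assuming scikit-learn >= 0.17; the matrix shape, density, and n_components are just placeholders):

```python
import scipy.sparse as sp
from sklearn.decomposition import NMF

# Toy sparse, non-negative matrix as a stand-in for the real data.
X = sp.random(1000, 500, density=0.01, format='csr')

model = NMF(n_components=20, solver='cd', random_state=0, max_iter=200)
W = model.fit_transform(X)   # shape (1000, 20)
H = model.components_        # shape (20, 500)

print(model.reconstruction_err_)  # Frobenius-norm reconstruction error
```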
Related
I am trying to solve a statistics-related real-world problem in Python and am looking for input on my ideas. I have N random vectors drawn from an m-dimensional normal distribution. I have no information about the mean vector or the covariance matrix of the underlying distribution; in fact, even that it is a normal distribution is only an assumption, though a very plausible one. I want to compute an approximation of the mean vector and covariance matrix of the distribution. The number of random vectors is on the order of 100 to 300, and the dimensionality of the distribution is somewhere between 2 and 5. The computation should ideally take no more than 1 minute on a standard home computer.
I am currently considering three approaches and would be happy about any suggestions for other approaches or preferences among these three:
Fitting: Make a multidimensional histogram of all the random vectors and fit a multidimensional normal distribution to the histogram. Problem with this approach: the covariance matrix has many entries, which could be a problem for the fitting process.
Invert the cumulative distribution function: Make a multidimensional histogram as an approximation of the density of the random vectors, then integrate it to get a multidimensional cumulative distribution function. In one dimension this is invertible, and one could use the CDF to generate random numbers distributed like the original sample. Problem: in the multidimensional case the CDF is not invertible (?), and I don't know whether this approach still works then.
Bayesian: Use Bayesian statistics with some normal distribution as the prior and update it for every observation. The result should always be a normal distribution again. Problem: I think this is computationally expensive? Also, I don't want the later updates to have more impact on the resulting distribution than the earlier ones.
Also, maybe there is some library that already implements this task? I did not find exactly this in NumPy or SciPy; maybe someone has an idea where else to look?
If the simple estimators described in the Parameter estimation section of the Wikipedia article on the multivariate normal distribution are sufficient for your needs, you can use numpy.mean to compute the sample mean and numpy.cov to compute the sample covariance matrix.
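For example (a minimal sketch, assuming the N observations are stacked as rows of an (N, m) array; the toy data below is just for illustration):

```python
import numpy as np

# X holds N observations of an m-dimensional vector, one observation per row.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0, 0.5],
                            cov=np.diag([1.0, 2.0, 0.5]),
                            size=200)

mean_est = X.mean(axis=0)          # sample mean vector, shape (m,)
cov_est = np.cov(X, rowvar=False)  # sample covariance matrix, shape (m, m)
```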
How can one generate a random sparse orthogonal matrix?
I know there are sparse matrices in the scipy library, but they are generally not orthogonal. One can exploit QR factorization, but it does not necessarily preserve sparsity.
As a preliminary thought, you could partition the matrix into diagonal blocks, fill those blocks with orthogonal matrices obtained via QR, and then permute rows/columns. The resulting matrix remains orthogonal. Alternatively, you could define some sparsity pattern for Q and try to minimize f(Q, xi) subject to QQ^T = I, where f is some (preferably) convex function that adds entropy through the random variable xi. I can't say anything about the efficacy of either method since I haven't actually tried them.
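A rough sketch of the first idea (block size, matrix size, and the dense-block generator are arbitrary choices, not a prescription):

```python
import numpy as np
import scipy.sparse as sp

def random_sparse_orthogonal(n, block_size, rng):
    """Permuted block-diagonal of small dense orthogonal blocks
    (n is assumed to be a multiple of block_size)."""
    blocks = []
    for _ in range(n // block_size):
        # The Q factor of a random Gaussian block is orthogonal.
        q, _ = np.linalg.qr(rng.standard_normal((block_size, block_size)))
        blocks.append(q)
    Q = sp.block_diag(blocks, format='csr')
    # Permuting rows and columns preserves orthogonality.
    return Q[rng.permutation(n), :][:, rng.permutation(n)]

rng = np.random.default_rng(0)
Q = random_sparse_orthogonal(120, block_size=4, rng=rng)
# Sanity check: Q Q^T should be (numerically) the identity.
err = np.abs((Q @ Q.T).toarray() - np.eye(120)).max()
```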
EDIT: A bit more about the second method. f can really be any function. One choice might be similarity of the non-zero elements to a random Gaussian vector (or any other random variate): f = ||vec(Q) - x||_2^2, x ~ N(0, sigma * I). You could handle this with any general constrained optimizer. The problem, of course, is that not every sparsity pattern S is guaranteed to admit a (full-rank) orthogonal filling. If you have the memory, L1 regularization (or a smooth approximation of it) could encourage sparsity in a dense matrix variable: g(Q) = f(Q) + P(Q), where P is any sparsity-inducing penalty function. Check out Wen & Yin (2010), "A Feasible Method for Optimization with Orthogonality Constraints", for an algorithm specifically designed for optimizing general (differentiable) functions over (dense) orthogonal matrices, and Liu, Wu, So (2015), "Quadratic Optimization with Orthogonality Constraints", for a more theoretical evaluation of several line/arc search algorithms for quadratic functions. If memory is a problem, you could generate each row/column separately using sparse basis pursuit, for which there are many algorithms depending on the nature of your problem. See Qu, Sun and Wright (2015), "Finding a sparse vector in a subspace: linear sparsity using alternating directions", and Bian et al. (2015), "Sparse null space basis pursuit and analysis dictionary learning for high-dimensional data analysis", for algorithm details, though in both cases you will have to incorporate/replace constraints to promote orthogonality to all previously generated vectors.
It's also worth noting that there are sparse QR algorithms that return Q as a product of sparse/structured matrices. If you are concerned about storage space alone, this might be the simplest way to create large, efficient orthogonal operators.
I'm trying to decompose signals into components (matrix factorization) of a large sparse matrix in Python using the sklearn library.
I made use of scipy's scipy.sparse.csc_matrix to construct my data matrix. However, I'm unable to perform any analysis such as factor analysis or independent component analysis. The only things I'm able to do are use TruncatedSVD or scipy's scipy.sparse.linalg.svds and perform PCA.
Does anyone know of any workarounds for doing ICA or FA on a sparse matrix in Python? Any help would be much appreciated! Thanks.
Given:
M = UΣV^t
The drawback of SVD is that the matrices U and V^t are dense. It doesn't really matter that the input matrix is sparse; U and V^t will be dense anyway. Also, the computational complexity of SVD is O(n^2*m) or O(m^2*n), where n is the number of rows and m the number of columns of the input matrix M, depending on which one is bigger.
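To make the density point concrete, here is a small sketch using scipy.sparse.linalg.svds (shapes, density, and rank are placeholders): the input stays sparse, but the returned factors are plain dense ndarrays.

```python
import scipy.sparse as sp
from scipy.sparse.linalg import svds

M = sp.random(5000, 2000, density=0.001, format='csc')  # sparse input

# Truncated SVD keeping the k largest singular values.
U, s, Vt = svds(M, k=50)

print(type(M))   # sparse CSC matrix
print(type(U))   # numpy.ndarray -- dense, shape (5000, 50)
print(type(Vt))  # numpy.ndarray -- dense, shape (50, 2000)
```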
It is worth mentioning that SVD will give you the optimal solution; if you can live with some loss in accuracy, measured by the Frobenius norm, you might want to consider the CUR decomposition. It will scale to larger datasets, with O(n*m) complexity.
M ≈ CUR
where C and R are now SPARSE matrices.
If you want to look at a Python implementation, take a look at pymf. But be a bit careful about that exact implementation, since it seems that, at this point in time, there is an open issue with it.
Even if the input matrix is sparse, the output will not be sparse. If the system cannot handle a dense matrix, it will not be able to handle the results either.
It is usually best practice to use coo_matrix to build the matrix and then convert it with .tocsc() for manipulation.
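For example (a minimal sketch with made-up row/column indices and values):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Triplet (COO) format is cheap and convenient for assembling the matrix...
rows = np.array([0, 3, 1, 0])
cols = np.array([0, 3, 1, 2])
vals = np.array([4.0, 5.0, 7.0, 9.0])
A = coo_matrix((vals, (rows, cols)), shape=(4, 4))

# ...then convert to CSC (or CSR) for fast arithmetic, slicing, and solvers.
A_csc = A.tocsc()
```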
Until now I have used numpy.linalg.eigvals to calculate the eigenvalues of square matrices with at least 1000 rows/columns and, in most cases, about a fifth of their entries non-zero (I don't know whether that should be considered a sparse matrix). I found another topic indicating that scipy may do a better job.
However, since I have to calculate the eigenvalues of hundreds of thousands of large matrices of increasing size (possibly up to 20000 rows/columns, and yes, I need ALL of their eigenvalues), this will always take awfully long. If I can speed things up even just the tiniest bit, it would most likely be worth the effort.
So my question is: Is there a faster way to calculate the eigenvalues when not restricting myself to python?
@HighPerformanceMark is correct in the comments, in that the algorithms behind numpy (LAPACK and the like) are some of the best, but perhaps not state-of-the-art, numerical algorithms out there for diagonalizing full matrices. However, you can substantially speed things up if you have:
Sparse matrices
If your matrix is sparse, i.e. the number of filled entries k is such that k << N**2, then you should look at scipy.sparse.
Banded matrices
There are numerous algorithms for working with matrices of a specific banded structure.
Check out the solvers in scipy.linalg.solve_banded.
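Since the question is about eigenvalues, the symmetric banded eigensolver scipy.linalg.eig_banded may be the more directly relevant routine; a minimal sketch for a symmetric tridiagonal matrix (size and values are placeholders):

```python
import numpy as np
from scipy.linalg import eig_banded

n = 1000
# Upper banded storage for a symmetric tridiagonal matrix:
# row 0 holds the superdiagonal (first entry unused), row 1 the main diagonal.
a_band = np.zeros((2, n))
a_band[0, 1:] = -1.0   # superdiagonal
a_band[1, :] = 2.0     # main diagonal

w = eig_banded(a_band, lower=False, eigvals_only=True)  # all eigenvalues
```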
Largest Eigenvalues
Most of the time, you don't really need all of the eigenvalues. In fact, most of the physical information comes from the largest eigenvalues, and the rest are simply high-frequency oscillations that are only transient. In that case you should look into eigenvalue solvers that quickly converge to those largest eigenvalues/eigenvectors, such as the Lanczos algorithm.
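If a few extreme eigenvalues ever do suffice, scipy exposes an ARPACK-based Lanczos-type solver as scipy.sparse.linalg.eigsh; a minimal sketch (matrix size, density, and k are placeholders):

```python
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

n = 2000
A = sp.random(n, n, density=0.01, format='csr')  # toy sparse matrix
A = (A + A.T) * 0.5                              # symmetrize

# k largest (algebraic) eigenvalues and eigenvectors via Lanczos/ARPACK.
vals, vecs = eigsh(A, k=10, which='LA')
```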
An easy way to maybe get a decent speedup with no code changes (especially on a many-core machine) is to link numpy to a faster linear algebra library, like MKL, ACML, or OpenBLAS. If you're associated with an academic institution, the excellent Anaconda python distribution will let you easily link to MKL for free; otherwise, you can shell out $30 (in which case you should try the 30-day trial of the optimizations first) or do it yourself (a mildly annoying process but definitely doable).
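For what it's worth, you can check which BLAS/LAPACK libraries numpy is actually linked against with numpy's built-in helper (the output format varies between numpy versions):

```python
import numpy as np

# Prints the BLAS/LAPACK build configuration; MKL or OpenBLAS
# should show up here if the link worked.
np.show_config()
```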
I'd definitely try a sparse eigenvalue solver as well, though.
I have a large (500k by 500k), sparse matrix. I would like to get its principal components (in fact, even computing just the largest PC would be fine). Randomized PCA works great, except that it essentially finds the eigenvectors of the covariance matrix instead of the correlation matrix. Any ideas of a package that will find PCA using the correlation matrix of a large, sparse matrix? Preferably in Python, though MATLAB and R work too.
(For reference, a similar question was asked here but the methods refer to covariance matrix).
Are they not the same thing? As far as I understand it, the correlation matrix is just the covariance matrix normalised by the product of each variable's standard deviation. And, if I recall correctly, isn't there a scaling ambiguity in PCA anyway?
Have you tried the irlba package in R? "The IRLBA package is the R language implementation of the method. With it, you can compute partial SVDs and principal component analyses of very large scale data. The package works well with sparse matrices and with other matrix classes like those provided by the Bigmemory package." You can check here for details.
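In Python, a roughly analogous route is sketched below, under the assumption that skipping mean-centering is acceptable (centering would destroy sparsity, so this is only an approximation to true correlation-matrix PCA): scale each column by its standard deviation, then take a truncated SVD. The data here is a small placeholder, not the 500k by 500k matrix.

```python
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD

X = sp.random(10_000, 2_000, density=1e-3, format='csr')  # toy stand-in data

# Divide each column by its standard deviation; with_mean=False avoids
# centering, which keeps the matrix sparse.
X_scaled = StandardScaler(with_mean=False).fit_transform(X)

# Top principal component (up to the missing mean-centering).
svd = TruncatedSVD(n_components=1, random_state=0)
pc1_scores = svd.fit_transform(X_scaled)
top_component = svd.components_[0]
```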