python matrix when both row and column size are large - python

I want to create a 2D matrix in python when number of rows and columns are equal and it is around 231000. Most of the cell entries would be zero.
Some [i][j] entries would be non-zero.
The reason for creating this matrix is to apply SVD and get [U S V] matrices with rank of say 30.
Can anyone provide me with the idea how to implement this by applying proper libraries. I tried pandas Dataframe but it shows Memory error.
I have also seen scipy.sparse matrix but couldn't figure out how it would be applied to find SVD.

I think this is a duplicate question, but I'll answer this anyways.
There are several libraries in python aimed at dealing with partial svds on very sparse matrices.
My personal preference is scipy.sparse.linalg.svds, a ARPACK implementation of iterative partial SVD calculation.
You can also try the function sparsesvd.sparsesvd, which uses the SVDLIBC implementation, or scipy.sparse.linalg.svd, which uses the LAPACK implementation.
To convert your table to a format that these algorithms use, you will need to import scipy.sparse, which allows you to use the csc_matrix class
Use the above links to help you out. There are a lot of resources already here on stack overflow and many more on the internet.

Related

What is the most efficient way to calculate the eigen values of a large covariance matrix?

I have been trying for some days to calculate the nearest positive semi-definite matrix from a very large covariance matrix to be able to sample from it.
I have tried MATLAB for the effect, but the memory usage is insane and it always crashes eventually, no error message or log file as far as I searched. The function used for the calculation can be found here https://www.mathworks.com/matlabcentral/fileexchange/42885-nearestspd. Optimizing the function to remove intermediate matrices seemed to reduce the memory usage, but it eventually crashes much in the same way.
Found this approach for doing the calculation https://stackoverflow.com/a/63131309/18660401 and switched to Python, in hopes of finding some GPU libraries to accelerate the calculations, but it seems I cannot find an up-to-date library that suports calculating eigenvectors using the numpy function. This is the function I am using:
import numpy as np
def get_near_psd(A):
C = (A + A.T)/2
eigval, eigvec = np.linalg.eig(C)
eigval[eigval < 0] = 0
return eigvec.dot(np.diag(eigval)).dot(eigvec.T)
I am currently trying to run the same function with numba in hopes that the translation to LLVM is enough to make the calculations in reasonable time, only modified the above version to include the #jit decorator from numba.
There does not seem to be a very optimized way to do this as far as I can find on my own, so any suggestion is very appreciated to crack this.
Edit: The matrix is a two-dimensional 60416x60416 covariance matrix and it is to be used to generate new samples from the distribution of the mean and covariance matrix calculated from a set of samples using a GAN. For training purposes, samples also need to be generated from randomly sampling the distribution, which I am intending to use the function multivariate_normal from numpy for.
A very up to date library that does have these capabilities including GPU support is pytorch, check out the examples on the torch.linalg.eig-function and the corresponding accelerated function torch.linalg.eigh for symmetric/hermitian matrices. You do have to convert these matrices from numpy to pytorch-tensors first to do the computation (and then convert it back), but you can definitely use it in a very similar way.
Of course also this library can't just magically give you more memory, but it is highly optimized.

Fast subsequent multiplication of many matrices in python

I have to generate a matrix (propagator in physics) by ordered multiplication of many other matrices. Each matrix is about the size of (30,30), all real entries (floats), but not symmetric. The number of matrices to multiply varies between 1e3 to 1e5. Each matrix is only slightly different from previous, however they are not commutative (and at the end I need the product of all these non-commutative multiplication). Each matrix is for certain time slice, so I know how to generate each of them independently, wherever they are in the multiplication sequence. At the end, I have to produce many such matrix propagators, so any performance enhancement is welcomed.
What is the algorithm for fastest implementation of such matrix multiplication in python?
In particular -
How to structure it? Are there fast axes and so on? preferable dimensions for rows / columns of the matrix?
Assuming memory is not a problem, to allocate and build all matrices before multiplication, or to generate each per time step? To store each matrix in dedicated variable before multiplication, or to generate when needed and directly multiply?
Cumulative effects of function call overheads effects when generating matrices?
As I know how to build each, should it be parallelized? For example maybe to create batch sequences from start of the sequence and from the end, multiply them in parallel and at the end multiply the results in proper order?
Is it preferable to use other module than numpy? Numba can be useful? or some other efficient way to compile in place to C, or use of optimized external libraries? (please give reference if so, I don't have experience in that)
Thanks in advance.
I don't think that the matrix multiplication would take much time. So, I would do it in a single loop. The assembling is probably the costly part here.
If you have bigger matrices, a map-reduce approach could be helpful. (split the set of matrices, apply matrix multiplication to each set and do the same for the resulting matrices)
Numpy is perfectly fine for problems like this as it is pretty optimized. (and is partly in C)
Just test how much time the matrix multiplication takes and how much the assembling. The result should indicate where you need to optimize.

Efficient Graph Data structure Python

I have a weighted graph data structure used in a machine learning algorithm, that requires frequent alterations (insertions, deletions of both vertices and edges). I am currently using an adjacency matrix implemented with a numpy 2d array with entries being
G[i, j] = W{i, j} if ij (is an edge) else 0
This works well for edges |V| < 1,500 but gets really slow with the search, insert and delete operations beyond that.
Since I am using a vectorized optimization of the graph embedding based on the weights, I need to use numpy arrays, so using lists is not feasible in this case.
Is there any efficient implementations of graphs that I can use for the storage, and operations on Graphs written in Python that can used ?
As mentioned in the question, it is very hard to beat the performance of an adjacency list when the graph is sparse. Adjacency matrices will always waste a lot of space for sparse graphs so you will probably have to find an alternative from using numpy arrays in all operations.
Some of the possible solutions to your problem may be:
Use an adjacency list structures for the other operations and convert to 2d numpy arrays when necessary (may not be efficient)
Use a sparse matrix: try to use a sparse matrix so you still can do matrix operations without converting back and forth. You may read more about them in this blog post. Note that you will have to replace some of the numpy operations to their scipy.sparse equivalents in your code if you opt for this solution.
Try using NetworkX library which is one of the best out there to handle Graph data structures.

convert a sparse matrix to dense and get the full eigenvalues

recently I'm working on a problem which requires
diagonalizing a huge hermitian matrix to get all the eigenvalues.
Currently I'm using Mathematica to do the job.
However it is not applicable due to the limitation of memory
when the matrix size approaches (2^15,2^15), where the diagonalization costs approximately 32 GBs memory.
I've tried using python by importing the matrix from mathematica,
import numpy as np
from scipy.io import mmread
from scipy.sparse import csc_matrix
#importing sparse matrix to save space
h = mmread("h.mtx")
h = csc_matrix(h)
#diagonlizing the dense one
ev = np.linalg.eigvalsh(h.todense())
It works but unfortunately an order of magnitude slower than Mathematica.
So, is there any other possible solutions, say, C++?
I know nothing about C++ so I guess the simplest way may be importing the
matrix to C++ and diagonalizing.
Thanks!
Running some preliminary test using this matrix:
http://math.nist.gov/MatrixMarket/data/NEP/h2plus/qc2534.html
I determined that the conversion to dense does not take up much of the time. The eigenvalue calculation does.
Numpy uses highly-optimized Lapack routines to calculate. These are the same you'd use in C++. Therefore C++ won't give you much of a speedup. If you want a speedup use the sparseness as a property, go to a better computer or switch to a distributed matrix storage(lot's of labor here).
P.S: if you do this for a university project you might want to look around if your university has a cluster of some sort. A cluster node typically has lots of memory. If not, check amazons AWS EC2 or googles compute engine for instances with lot's of ram.
Edit:
Here Wolfram says what Mathematica does behind the scenes: http://reference.wolfram.com/language/tutorial/LinearAlgebraAppendix.html#83486633
Arpack is a (arnoldi)subspace solver, giving you only the highest or lowest k-eigenvalues, ATLAS is just a Lapack implementation and the rest seems to be for solving linear systems.
All methods giving you the full eigenspectrum will require the matrix decomposition of a NxN matrix. If you only want k vectors there are methods which reduce it to a decomposition of a k x k-matrix.
There are modern alternatives to Arpack(http://slepc.upv.es/ or the one that comes with MKL), but they all give you a subspace.
c++ won't help much.
In python you can delegate easily to C++ and a lot of scipy routines will do just that (for performance). I also expect that if you only time the eigen value line you will get similar performance to Matematica and the difference in performance comes from reading the data.
The best solution is to look for a more appropriate algorithm, maybe something that operates on the sparse matrix directly, or decompose the original into smaller matrices and combine them.
To make the original solution more tractable you could try increasing the amount of swap space. In linux it's a dedicated partition, in windows it's a setting. This should allow Matematica/python to use more memory, but it's going to be much slower due to memory trashing. Get an SSD to speed this setup up, but note that it's going to be destroyed faster due to often writes. Or even better buy more RAM.

Is the format/structure of SciPy's condensed distance matrix stable?

Several SciPy functions are documented as taking a "condensed distance matrix as returned by scipy.spatial.distance.pdist". Now, inspection shows that what pdist returns is the row-major 1D-array form of the upper off-diagonal part of the distance matrix. This is all well and good, and natural and obvious, but is it documented or defined anywhere? I'd rather not assume anything about a data structure that'll suddenly change. (Granted, there isn't a lot of things it could change to, but I guess one possibility would be to wrap the array in an object that allows matrix-like indexing.)
Honestly, this is a better question for the scipy users or dev list, as it's about future plans for scipy.
However, the structure is fairly rigorously documented in the docstrings for both scipy.spatial.pdist and in scipy.spatial.squareform.
E.g. for pdist:
Returns a condensed distance matrix Y. For
each :math:`i` and :math:`j` (where :math:`i<j<n`), the
metric ``dist(u=X[i], v=X[j])`` is computed and stored in the
:math:`ij`th entry.
See ``squareform`` for information on how to calculate the index of
this entry or to convert the condensed distance matrix to a
redundant square matrix.
Becuase of this, and the fact that so many other functions in scipy.spatial expect a distance matrix in this form, I'd seriously doubt it's going to change without a number of depreciation warnings and announcements.
Modules in scipy itself (as opposed to scipy's scikits) are fairly stable, and there's a great deal of consideration put into backwards compatibility when changes are made (and because of this, there's quite a bit of legacy "cruft" in scipy: e.g. the fact that the core scipy module is just numpy with different defaults on a couple of functions.).

Categories