I have a set of large, very sparse matrices for which I'm trying to find the eigenvector corresponding to the eigenvalue 0. From the structure of the underlying problem I know that such a solution must exist and that zero is also the largest eigenvalue.
To solve the problem I use scipy.sparse.linalg.eigs with arguments:
val, rho = eigs(L, k=1, which='LM', v0=init, sigma=-0.001)
where L is the matrix. I then call this multiple times for different matrices. At first I had a severe problem where the memory usage just kept increasing the more times I called the function; it seems eigs doesn't free all the memory it should. I solved that by calling gc.collect() after each use of eigs.
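A minimal sketch of that workaround, assuming a list mats of matrices and a starting vector init (both hypothetical names):

import gc
from scipy.sparse.linalg import eigs

results = []
for L in mats:
    # shift-invert around sigma=-0.001 to target the eigenvalue at 0
    val, rho = eigs(L, k=1, which='LM', v0=init, sigma=-0.001)
    results.append((val, rho))
    gc.collect()  # reclaim whatever eigs leaves behind between calls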
But now I worry that memory isn't being freed internally either. Naively, I would expect something like Arnoldi not to use more memory as the algorithm progresses: it should only need to store the matrix and the current set of Lanczos vectors. Yet I find that memory usage increases while the code is still inside the eigs function.
Any ideas?
I'm trying to do some matrix computations as follows:
import numpy as np

def get_P(X, Z):
    n_sample, n_m, n_t, n_f = X.shape
    res = np.zeros((n_sample, n_m, n_t, n_t))
    for i in range(n_sample):
        # contract the f axis of X[i] with the f axis of Z[i]
        res[i, :, :, :] = np.dot(X[i, :, :, :], Z[i, :, :])
    return res
Because X and Z are large, it takes more than 600 ms to compute one np.dot, and X has 10k rows.
Is there any way to speed it up?
Well, there might be some avoidable overhead from your zero initialization (which gets overwritten right away): just use np.empty (uninitialized allocation) instead.
Other than that: numpy is fairly well-optimized. You can probably speed things up by using dtype=numpy.float32 instead of the default 64-bit floating-point numbers for X, Z and res – but that's about it. Dot products mostly spend their time streaming linearly through RAM while multiplying and summing numbers – things that numpy, compilers and CPUs are remarkably good at these days.
Note that in its default configuration numpy will only use one CPU core at a time, so it might make sense to parallelize: for example, if you have 16 CPU cores, you could make 16 separate res partitions and calculate a subset of your range(n_sample) dot products on each core. Python's standard library brings the multiprocessing facilities to do so; a rough sketch follows.
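This is only an illustration, not a tuned implementation; the worker count and the even chunking are arbitrary choices, and the chunks get pickled to the worker processes, which has its own cost:

import numpy as np
from concurrent.futures import ProcessPoolExecutor

def _dot_chunk(args):
    # same per-sample contraction as in get_P, on one slice of samples
    X_chunk, Z_chunk = args
    return np.stack([np.dot(x, z) for x, z in zip(X_chunk, Z_chunk)])

def get_P_parallel(X, Z, n_workers=16):
    bounds = np.linspace(0, X.shape[0], n_workers + 1, dtype=int)
    chunks = [(X[a:b], Z[a:b]) for a, b in zip(bounds[:-1], bounds[1:])]
    with ProcessPoolExecutor(n_workers) as pool:
        parts = pool.map(_dot_chunk, chunks)
    return np.concatenate(list(parts))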
If you can spend the development time, and you need to process massive amounts of data so that it pays off: you can use e.g. GPUs to multiply matrices; they are really good at that, and cuBLASLt (GEMM) is an excellent implementation. But honestly, you would mostly be abandoning numpy and would need to work things out yourself in C/C++.
You can use numpy's einsum to do this multiplication in one vectorized step.
It will be much faster than the loop-based dot product. For examples, check this link: https://rockt.github.io/2018/04/30/einsum
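A minimal sketch of what that looks like here, assuming Z has shape (n_sample, n_f, n_t) so that the contraction matches the original np.dot:

import numpy as np

def get_P_einsum(X, Z):
    # s: sample; m, t, f as in get_P; u labels the last axis of Z (size n_t)
    return np.einsum('smtf,sfu->smtu', X, Z)

Under the same shape assumption, the batched matmul X @ Z[:, None] computes the identical result and may be even faster, since it dispatches straight to BLAS.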
Here is the code:
# input:
# A : a large csr matrix (365 million rows and 1.3 billion entries), 32 bit float datatype
# get the two largest eigenvalues of A and the corresponding eigenvectors
from scipy.sparse.linalg import eigsh
w, V = eigsh(A, k=2, tol=10e-2, ncv=5)
As far as I can tell, there is not a lot of room to mess up here, but what I am observing is that my machine initially has plenty of memory (90 GB including swap), yet the memory usage of eigsh slowly creeps up during the run until I run out of memory. Is there something obvious I am missing here?
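For scale, here is my own back-of-envelope (an assumption about what the solver stores, not a measurement): the Lanczos basis should be only about ncv vectors of length n in single precision, nowhere near 90 GB:

n = 365_000_000           # rows of A
ncv = 5                   # basis size passed to eigsh
print(n * ncv * 4 / 1e9)  # float32: roughly 7.3 GB for the basis vectors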
What I have tried:
--Looked through the source. It is a lot of code, but as far as I could see, no memory is allocated by the Python code between iterations. I am not as good at Fortran, but it would be unexpected if ARPACK itself or the calling routine allocated memory.
--Tried an equivalent thing in Octave (a MATLAB clone), with similar effects, although less obvious there, since the datatype is necessarily double precision and thus more constrained from the start. So perhaps it is something in ARPACK itself?
--Googled a bunch. It looks like SciPy does (did?) use a circular reference somewhere that has caused others grief when calling eigsh multiple times, but I am calling it only once, so maybe this is not the same issue.
Any help would be greatly appreciated.
I have created a function to rotate a vector by a quaternion:
import numpy as np
# QuatVectorRotate (defined elsewhere) does the actual rotation

def QVrotate_toLocal(Quaternion, Vector):
    # NumSamples x Quaternion[w,x,y,z]
    # NumSamples x Vector[x,y,z]
    # For example shape (20000000,4) with range 0,1
    #             shape (20000000,3) with range -100,100
    # All numbers are float64
    Quaternion[:, 2] *= -1
    x, y, z = QuatVectorRotate(Quaternion, Vector)
    norm = np.linalg.norm(Quaternion, axis=1)
    x *= (1/norm)
    y *= (1/norm)
    z *= (1/norm)
    return np.stack([x, y, z], axis=1)
Everything within QuatVectorRotate is addition and multiplication of (20000000,1) numpy arrays.
For the data I have (20 million samples for both the quaternion and the vector), every time I run the code the solution alternates between a (known) correct solution and a very incorrect one, never deviating from the pattern correct, incorrect, correct, incorrect, ...
This kind of numerical oscillation in static code usually means that an ill-conditioned matrix is being operated on, that Python is running out of floating-point precision, or that there is a silent memory overflow somewhere.
There is little linear algebra in my code, and I have checked that the norm line is stable from run to run. The problem seems to be happening somewhere in the lines a= ... to d= ...
That led me to believe that, given these large arrays, I was running out of memory somewhere along the line. This could still be the issue, but I don't believe it is: I have 16 GB of memory, and while running I never get above 75% usage. But again, I do not know enough about memory allocation to rule this out definitively. I attempted to force garbage collection at the beginning and end of the function, to no avail.
Any ideas would be appreciated.
EDIT:
I just reproduced the issue with the following data, and the same behavior was observed.
Q=np.random.random((20000000,4))
V=np.random.random((20000000,3))
When you do Quaternion[:,2] *= -1 in the first line, you are mutating the Quaternion array. This is not a local copy of that array, but the actual array you passed in from the outside.
So every time you run this code, those elements have a different sign. After the function has run twice, the array is back where it started (since, obviously, -1 * -1 = 1).
One way to get around this is to make a local copy first:
Quaternion_temp = Quaternion.copy()
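Applied to the function above, a minimal sketch of the fix; only the first lines change, and QuatVectorRotate stays as before:

def QVrotate_toLocal(Quaternion, Vector):
    Quaternion = Quaternion.copy()  # work on a local copy; the caller's array is untouched
    Quaternion[:, 2] *= -1
    x, y, z = QuatVectorRotate(Quaternion, Vector)
    norm = np.linalg.norm(Quaternion, axis=1)
    x *= 1/norm
    y *= 1/norm
    z *= 1/norm
    return np.stack([x, y, z], axis=1)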
I wrote a function in Python that returns a large 2D numpy array of shape (2**13, 2**13); call the module it lives in pdd.
import pdd
array=pdd.function(some stuff)
If I call the function once, the memory usage jumps to a few gigabytes. Then if I run the same command again,
array=pdd.function(some stuff)
the memory usage roughly doubles, as if a second array of that size had been allocated rather than the existing one overwritten. The problem with this is that I want to use the function with an MCMC sampler, i.e. many repeated calls, which obviously cannot work as it is.
So is there some way to free the memory, or something in the function to optimize or minimize its usage?
EDIT
I appear to have fixed the problem. After trying several things, it seems to be scipy's fault: inside the function there are several 2D FFTs, and I was using scipy's fftpack fft2 and ifft2. These created some large arrays that left/added over a GB of memory with each function call. When I switched to numpy's fft2 and ifft2, the leak went away. Now, after the function ends, I am left with my one array of a few hundred MB, and no more memory is added by subsequent function calls.
I don't know or understand why this is, and I found it surprising that numpy would do better than scipy in this case, but there it is.
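For concreteness, the swap inside the function amounts to something like this (image is a hypothetical stand-in for whatever 2D array is being transformed):

# before: from scipy.fftpack import fft2, ifft2   # leaked over a GB per call for me
from numpy.fft import fft2, ifft2                  # after: memory stays flat

spectrum = fft2(image)
result = ifft2(spectrum).real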
I am working on a project which basically boils down to solving the matrix equation
A.dot(x) = d
where A is a matrix with dimensions roughly 10 000 000 by 2000 (I would like to increase this in both directions eventually).
A obviously does not fit in memory, so this has to be parallelized. I do that by solving A.T.dot(A).dot(x) = A.T.dot(d) instead. A.T.dot(A) has dimensions 2000 by 2000, and it can be calculated by dividing A and d into chunks A_i and d_i along the rows, calculating A_i.T.dot(A_i) and A_i.T.dot(d_i), and summing these – perfect for parallelizing. I have been able to implement this with the multiprocessing module, but it is 1) hard to scale further (increasing A in both dimensions) due to memory use, and 2) not pretty (and therefore not easy to maintain).
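A minimal serial sketch of that chunked accumulation (each chunk's contribution is independent, which is what makes it parallelizable); load_chunk_A and load_chunk_d are hypothetical loaders for the i-th row block:

import numpy as np

def normal_equations_solve(n_chunks, n_cols=2000):
    ATA = np.zeros((n_cols, n_cols))
    ATd = np.zeros(n_cols)
    for i in range(n_chunks):
        A_i = load_chunk_A(i)  # (rows_per_chunk, n_cols) block of A -- hypothetical
        d_i = load_chunk_d(i)  # matching slice of d -- hypothetical
        ATA += A_i.T.dot(A_i)
        ATd += A_i.T.dot(d_i)
    # solve the reduced 2000-by-2000 system for x
    return np.linalg.solve(ATA, ATd)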
Dask seems to be a very promising library for solving both of these problems, and I have made some attempts. My A matrix is complicated to calculate: it is based on approximately 15 different arrays (each with size equal to the number of rows in A), and some are used in an iterative algorithm to evaluate the associated Legendre functions. When the chunks are small (10000 rows), it takes a very long time to build the task graph, and it takes a lot of memory (the memory increase coincides with the call to the iterative algorithm). When the chunks are larger (50000 rows), the memory consumption prior to the calculations is much smaller, but it is rapidly depleted when calculating A.T.dot(A). I have tried cache.Chest, but it significantly slows down the calculations.
The task graph must be very large and complicated – calling A._visualize() crashes. With simpler A matrices it works to do this directly (see the response by @MRocklin). Is there a way for me to simplify it?
Any advice on how to get around this would be highly appreciated.
As a toy example, I tried
A = da.ones((2e3, 1e7), chunks = (2e3, 1e3))
(A.T.dot(A)).compute()
This also failed, using up all the memory with only one core active. With chunks = (2e3, 1e5), all cores start almost immediately, but a MemoryError appears within a second (I have 15 GB on my current computer). chunks = (2e3, 1e4) was more promising, but it ended up consuming all the memory as well.
Edit:
I struck through the toy-example test, because its dimensions were wrong, and corrected the dimensions in the rest. As @MRocklin says, it does work with the right dimensions. I added a question which I now think is more relevant to my problem.
Edit2:
This is a much simplified example of what I was trying to do. The problem is, I believe, the recursions involved in defining the columns in A.
import dask.array as da

N = 10**6
M = 500
x = da.random.random((N, 1), chunks=5*M)

# my actual A is built column-by-column by a recursion like this
A_dict = {0: x}
for i in range(1, M):
    A_dict[i] = 2*A_dict[i-1]
A = da.hstack(tuple(A_dict.values()))
A = A.rechunk((M*5, M))
ATA = A.T.dot(A)
This seems to lead to a very complicated task graph, which takes up a lot of memory before the calculations even start.
I have now solved this by placing the recursion in a function that works on numpy arrays, and more or less doing A = x.map_blocks(...).
As a second note, once I have the A matrix task graph, calculating A.T.dot(A) directly does seem to cause memory issues (the memory usage is not very stable). I therefore calculate it explicitly in chunks and sum the results. Even with these workarounds, dask makes a big difference in speed and readability.
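A rough sketch of that explicit chunked sum, in the spirit of the row-block scheme above (the block size is an arbitrary choice):

import numpy as np

def ata_in_chunks(A, block_rows=50000):
    # accumulate A_i.T @ A_i over horizontal slabs of the dask array A
    n, m = int(A.shape[0]), int(A.shape[1])
    ATA = np.zeros((m, m))
    for start in range(0, n, block_rows):
        A_i = A[start:start + block_rows].compute()  # materialize one slab
        ATA += A_i.T.dot(A_i)
    return ATA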
Your output is very very large.
>>> A.T.dot(A).shape
(10000000, 10000000)
Perhaps you intended to compute this with the transposes in the other direction?
>>> A.dot(A.T).shape
(2000, 2000)
This still takes a while (it's a large computation) but it does complete.