I am working on a project which basically boils down to solving the matrix equation
A.dot(x) = d
where A is a matrix with dimensions roughly 10 000 000 by 2000 (I would like to increase this in both directions eventually).
A obviously does not fit in memory, so this has to be parallelized. I do that by solving A.T.dot(A).dot(x) = A.T.dot(d) instead. A.T.dot(A) has dimensions 2000 by 2000. It can be calculated by dividing A and d into chunks A_i and d_i along the rows, calculating A_i.T.dot(A_i) and A_i.T.dot(d_i), and summing these. Perfect for parallelizing. I have been able to implement this with the multiprocessing module, but it is 1) hard to scale any further (increasing A in both dimensions) due to memory use, and 2) not pretty (and therefore not easy to maintain).
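For concreteness, here is a minimal numpy sketch of that accumulation; load_chunks and all sizes here are illustrative stand-ins, not my real data pipeline:

import numpy as np

n_cols, chunk_rows, n_chunks = 2000, 10000, 5  # toy sizes

def load_chunks():
    # stand-in for reading successive row-blocks A_i and d_i from disk
    rng = np.random.default_rng(0)
    for _ in range(n_chunks):
        yield rng.standard_normal((chunk_rows, n_cols)), rng.standard_normal(chunk_rows)

ATA = np.zeros((n_cols, n_cols))
ATd = np.zeros(n_cols)
for A_i, d_i in load_chunks():
    ATA += A_i.T.dot(A_i)  # each term is only n_cols x n_cols
    ATd += A_i.T.dot(d_i)
x = np.linalg.solve(ATA, ATd)

Each chunk's contribution is independent of the others, which is what makes the multiprocessing (and, I hope, dask) versions possible.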
Dask seems to be a very promising library for solving both of these problems, and I have made some attempts. My A matrix is complicated to calculate: it is based on approximately 15 different arrays (each with size equal to the number of rows in A), and some are used in an iterative algorithm to evaluate the associated Legendre functions. When the chunks are small (10,000 rows), it takes a very long time to build the task graph, and a lot of memory (the memory increase coincides with the call to the iterative algorithm). When the chunks are larger (50,000 rows), the memory consumption prior to the calculations is much smaller, but it is rapidly depleted when calculating A.T.dot(A). I have tried cache.Chest, but it significantly slows down the calculations.
The task graph must be very large and complicated; calling A._visualize() crashes. With simpler A matrices, visualizing directly works (see the response by @MRocklin). Is there a way for me to simplify it?
Any advice on how to get around this would be highly appreciated.
As a toy example, I tried
A = da.ones((2e3, 1e7), chunks = (2e3, 1e3))
(A.T.dot(A)).compute()
This also failed, using up all the memory with only one core being active. With chunks = (2e3, 1e5), all cores start almost immediately, but MemoryError appears within 1 second (I have 15 GB on my current computer). chunks = (2e3, 1e4) was more promising, but it ended up consuming all memory as well.
Edit:
I struck through the toy example test because the dimensions were wrong, and corrected the dimensions in the rest. As @MRocklin says, it does work with the right dimensions. I added a question which I now think is more relevant to my problem.
Edit2:
This is a much simplified example of what I was trying to do. The problem is, I believe, the recursions involved in defining the columns in A.
import dask.array as da

N = int(1e6)
M = 500
x = da.random.random((N, 1), chunks=5 * M)
# stand-in for my actual columns, which come from a more complicated recursion
A_dict = {0: x}
for i in range(1, M):
    A_dict[i] = 2 * A_dict[i - 1]
A = da.hstack(tuple(A_dict.values()))
A = A.rechunk((M * 5, M))
ATA = A.T.dot(A)
This seems to lead to a very complicated task graph, which takes up a lot of memory before the calculations even start.
I have now solved this by placing the recursion in a function that works on numpy arrays, and more or less doing A = x.map_blocks(...).
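A hedged sketch of that map_blocks approach; the recursion body here is the toy doubling from above, standing in for my real column definitions:

import dask.array as da
import numpy as np

def build_columns(block, M=500):
    # plain-numpy recursion over columns, run once per chunk
    cols = np.empty((block.shape[0], M))
    cols[:, 0] = block[:, 0]
    for i in range(1, M):
        cols[:, i] = 2 * cols[:, i - 1]  # stand-in for the real recursion
    return cols

x = da.random.random((int(1e6), 1), chunks=(2500, 1))
A = x.map_blocks(build_columns, chunks=(2500, 500), dtype=float)

Because the recursion now happens inside one task per chunk, the task graph stays small no matter how large M is.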
As a second note, once I have the A matrix task graph, calculating A.T.dot(A) directly does seem to give some memory issues (memory usage is not very stable). I therefore explicitly calculate it in chunks, and sum the results. Even with these workarounds, dask makes a big difference in speed and readability.
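For the chunked A.T.dot(A), something like the following works (assuming a dask version that has the .blocks accessor; sizes are toy stand-ins):

import dask.array as da
import numpy as np

A = da.random.random((100000, 500), chunks=(10000, 500))
ATA = np.zeros((A.shape[1], A.shape[1]))
for i in range(A.numblocks[0]):
    A_i = A.blocks[i].compute()  # materialize one row-chunk at a time
    ATA += A_i.T.dot(A_i)        # each partial product is only 500 x 500

Only one chunk of A is ever in memory at once, so peak memory usage stays flat.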
Your output is very very large.
>>> A.T.dot(A).shape
(10000000, 10000000)
Perhaps you intended to compute this with the transposes in the other direction?
>>> A.dot(A.T).shape
(2000, 2000)
This still takes a while (it's a large computation) but it does complete.
Related
I'm trying to do some matrix computations as
import numpy as np

def get_P(X, Z):
    n_sample, n_m, n_t, n_f = X.shape
    res = np.zeros((n_sample, n_m, n_t, n_t))
    for i in range(n_sample):
        # X[i] has shape (n_m, n_t, n_f) and Z[i] has shape (n_f, n_t)
        res[i, :, :, :] = np.dot(X[i, :, :, :], Z[i, :, :])
    return res
Because X and Z are large, it takes more than 600 ms to compute one np.dot, and I have 10k rows in X.
Is there any way we can speed it up?
Well, there might be some avoidable overhead posed by your zero initialization (which gets overwritten right away): just use np.empty, which allocates without initializing, instead.
Other than that: numpy is fairly well-optimized. You can probably speed things up by using dtype=numpy.float32 instead of the default 64-bit floating point numbers for your X, Z and res, but that's about it. Dot products mostly spend their time going linearly through RAM and multiplying and summing numbers, things that numpy, compilers and CPUs are radically good at these days.
Note that numpy will only use one CPU core at a time in its default configuration, so it might make sense to parallelize: for example, if you've got 16 CPU cores, you'd make 16 separate res partitions and calculate subsets of your range(n_sample) dot products on each core. Python does bring the multithreading / async facilities to do so; you'll find plenty of examples, and a sketch follows below.
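A sketch of that thread-based split. Numpy's BLAS-backed dot releases the GIL on float arrays, so plain threads actually overlap here; all sizes are toy stand-ins:

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def get_P_parallel(X, Z, workers=16):
    n_sample, n_m, n_t, n_f = X.shape
    res = np.empty((n_sample, n_m, n_t, n_t))
    def work(i):
        res[i] = np.dot(X[i], Z[i])  # each thread fills its own slice of res
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(work, range(n_sample)))
    return res

X = np.random.rand(32, 4, 8, 6)  # (n_sample, n_m, n_t, n_f)
Z = np.random.rand(32, 6, 8)     # (n_sample, n_f, n_t)
P = get_P_parallel(X, Z)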
If you can spend the development time, and need massive amounts of data, so that this pays: you can use e.g. GPUs to multiply matrices; these are really good at that, and cuBLASlt (GEMM) is an excellent implementation, but honestly, you'd mostly be abandoning Numpy and would need to work things out yourself – in C/C++.
You can use numpy einsum to do this multiplication in one vectorized step.
It will be much faster than this loop based dot product. For examples, check this link https://rockt.github.io/2018/04/30/einsum
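Assuming Z has shape (n_sample, n_f, n_t), which is what the np.dot(X[i], Z[i]) calls above imply, the whole loop collapses to a single einsum call:

import numpy as np

X = np.random.rand(32, 4, 8, 6)  # (n_sample, n_m, n_t, n_f)
Z = np.random.rand(32, 6, 8)     # (n_sample, n_f, n_t)
res = np.einsum('smtf,sfu->smtu', X, Z)  # res[s] == np.dot(X[s], Z[s])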
I have two questions, but the first takes precedence.
I was doing some timeit testing of some basic numpy operations that will be relevant to me.
I did the following
import numpy as np
from collections import defaultdict

n = 5000
j = defaultdict()  # no default_factory, so this behaves like a plain dict
for i in xrange(n):
    print i
    j[i] = np.eye(n)
What happened is, Python's memory use almost immediately shot up to 6 GB, which is over 90% of my memory. However, numbers printed off at a steady pace, about 10-20 per second. While numbers printed off, memory use sporadically bounced down to ~4 GB, back up to 5, back down to 4, up to 6, down to 4.5, and so on.
At 1350 iterations I had a segmentation fault.
So my question is, what was actually occurring during this time? Are these matrices actually created one at a time? Why is memory use spiking up and down?
My second question is, I may actually need to do something like this in a program I am working on. I will be doing basic arithmetic and comparisons between many large matrices, in a loop. These matrices will sometimes, but rarely, be dense. They will often be sparse.
If I actually need 5000 5000x5000 matrices, is that feasible with 6 gigs of memory? I don't know what can be done with all the tools and tricks available... Maybe I would just have to store some of them on disk and pull them out in chunks?
Any advice for if I have to loop through many matrices and do basic arithmetic between them?
Thank you.
If I actually need 5000 5000x5000 matrices, is that feasible with 6 gigs of memory?
If they're dense matrices, and you need them all at the same time, not by a long shot. Consider:
5K * 5K = 25M cells
25M * 8B = 200MB (assuming float64)
5K * 200MB = 1TB
The matrices are being created one at a time. As you get near 6GB, what happens depends on your platform. It might start swapping to disk, slowing your system to a crawl. There might be a fixed-size or max-size swap, so eventually it runs out of memory anyway. It may make assumptions about how you're going to use the memory, guessing that there will always be room to fit your actual working set at any given moment into memory, only to segfault when it discovers it can't. But the one thing it isn't going to do is just work efficiently.
You say that most of your matrices are sparse. In that case, use one of the sparse matrix representations. If you know which of the 5000 will be dense, you can mix and match dense and sparse matrices, but if not, just use the same sparse matrix type for everything. If this means your occasional dense matrices take 210MB instead of 200MB, but all the rest of your matrices take 1MB instead of 200MB, that's more than worthwhile as a tradeoff.
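To make the savings concrete, here is what the identity matrix from the example above costs in sparse form (a minimal sketch using scipy):

from scipy import sparse

sp = sparse.eye(5000, format='csr')  # same matrix as np.eye(5000)
# CSR stores only the 5000 nonzeros: roughly 80 KB instead of ~200 MB dense
print(sp.data.nbytes + sp.indices.nbytes + sp.indptr.nbytes)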
Also, do you actually need to work on all 5000 matrices at once? If you only need, say, the current matrix and the previous one at each step, you can generate them on the fly (or read from disk on the fly), and you only need 400MB instead of 1TB.
Worst-case scenario, you can effectively swap things manually, with some kind of caching discipline, like least-recently-used. You can easily keep, say, the last 16 matrices in memory. Keep a dirty flag on each so you know whether you have to save it when flushing it to make room for another matrix. That's about as tricky as it's going to get.
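A minimal sketch of such a cache; the class name and the on-disk .npy layout are made up for illustration:

import os
from collections import OrderedDict
import numpy as np

class MatrixCache:
    # keep the most recently used matrices in RAM, spill the rest to .npy files
    def __init__(self, capacity=16, folder='matcache'):
        self.capacity, self.folder = capacity, folder
        self.mem = OrderedDict()  # key -> (matrix, dirty_flag)
        os.makedirs(folder, exist_ok=True)

    def _path(self, key):
        return os.path.join(self.folder, '%s.npy' % key)

    def get(self, key):
        if key in self.mem:
            self.mem.move_to_end(key)  # mark as most recently used
            return self.mem[key][0]
        mat = np.load(self._path(key))  # cache miss: pull back from disk
        self.put(key, mat, dirty=False)
        return mat

    def put(self, key, mat, dirty=True):
        self.mem[key] = (mat, dirty)
        self.mem.move_to_end(key)
        while len(self.mem) > self.capacity:
            old_key, (old_mat, was_dirty) = self.mem.popitem(last=False)
            if was_dirty:  # only write back matrices that actually changed
                np.save(self._path(old_key), old_mat)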
I am writing a Python script to compare two matrices (A and B) and do a specific range query. Each matrix has three columns and over a million rows. The range query involves finding, for each row of B, the indices of A such that the pair separation distance lies in a given range (e.g. 15 units < pair separation < 20 units). (It doesn't matter which array comes first.) I then have to repeat this for other slices of pair separations, viz. 20 units < pair separation < 25 units, and so on until a maximum pair separation is reached.
I have read that cKDTree of scipy.spatial can do this job very quickly. I tried using cKDTree followed by query_ball_tree. The query_ball_tree returns output in the form of a list of lists, and it works perfectly fine for matrices with ~100,000 rows (small size). But once the number of rows gets close to a million, I start getting memory errors and my code doesn't work. Obviously, the computer memory doesn't like to store huge arrays. I tried using query_ball_point and that also runs into memory errors.
This looks like a typical big data problem. Time of execution is extremely important for the nature of my work, so I am searching for ways to make the code faster (but I also run into memory errors, and my code comes to a grinding halt).
I was wondering if someone can tell me what is the best method to approach in such a scenario.
1. Is the use of scipy.spatial cKDTree and query_ball_point (or query_ball_tree) the best approach for my problem?
2. Can I hope to keep using cKDTree and be successful, with some modification?
3. I tried to see if I can make some headway by modifying the source of scipy.spatial cKDTree, specifically query_ball_tree or query_ball_point, to prevent it from storing huge arrays. But I am not very experienced with that, so I haven't been successful yet.
4. It's very possible that there are solutions that haven't struck my mind. I will very much appreciate any useful suggestion from the people here.
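One direction I am considering, as a hedged sketch (the batch size and the toy catalogs are illustrative, only the 15-20 shell is from my actual problem): query in fixed-size batches of B and post-filter each batch for the shell, so the full list of lists never exists at once.

import numpy as np
from scipy.spatial import cKDTree

A = np.random.rand(100000, 3) * 100  # toy stand-ins for the real catalogs
B = np.random.rand(100000, 3) * 100
tree_A = cKDTree(A)

lo, hi = 15.0, 20.0
batch = 10000
for start in range(0, len(B), batch):
    rows = B[start:start + batch]
    outer = tree_A.query_ball_point(rows, r=hi)  # neighbors within the outer radius
    for j, idx in enumerate(outer):
        idx = np.asarray(idx, dtype=int)
        d = np.linalg.norm(A[idx] - rows[j], axis=1)
        in_shell = idx[(d > lo) & (d < hi)]  # prune the inner radius
        # process in_shell here instead of accumulating everything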
I'm currently working with many small 6x6 matrices: shape A = (N, N, N, 6, 6), where N is about 500. I store these matrices in an HDF5 file with PyTables (http://www.pytables.org).
I want to do some calculations on these matrices: inverting, transposing, multiplication, etc. It's quite easy while N is not very big; for example, numpy.linalg.inv(A) should do the trick without a loop. But in my case it works very slowly, and sometimes I run into memory problems.
Could you suggest an approach to do this more efficiently?
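For reference, recent numpy broadcasts the linalg routines over leading axes, so the 6x6 operations can be batched without a Python loop; and if the whole (N, N, N, 6, 6) array doesn't fit in RAM, the same call can be applied to one PyTables slab at a time. A toy sketch (the h5file names in the comment are illustrative):

import numpy as np

N = 50  # toy size; the real N is ~500
A = np.random.rand(N, N, N, 6, 6) + 6 * np.eye(6)  # diagonally dominant, so invertible

A_inv = np.linalg.inv(A)          # inverts all N**3 matrices in one call
A_T = A.transpose(0, 1, 2, 4, 3)  # batched transpose of the 6x6 blocks
P = np.matmul(A_T, A)             # batched 6x6 products

# out-of-core variant (illustrative): loop over the first axis of the HDF5 array
# for i in range(N):
#     out[i] = np.linalg.inv(h5file.root.A[i])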
A 6x6 matrix has 36 eight-byte double values, or 288 bytes; we'll say 0.5 KB for ease of calculation and overhead. If you accept that, then 500 of them would only represent 250 KB, not much memory, and keeping all their inverses in memory would still only be about 500 KB. Note, however, that an array of shape (N, N, N, 6, 6) with N = 500 holds N^3 = 1.25e8 such matrices, which at 288 bytes each is roughly 36 GB; that alone would explain memory problems.
Can you calculate the amount of RAM your program is consuming to confirm which situation you're in?
What are you doing, finite element analysis? Are these stiffness, mass, and damping matrices for 500 elements?
If yes, you should not be inverting element matrices. You have to assemble those into global matrices, which will consume more memory, and then solve that. The inverse still isn't calculated: an LU decomposition in place is the usual way to do it.
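In code, that means preferring a factor-and-solve over an explicit inverse; a sketch with a made-up stand-in for an assembled global matrix:

import numpy as np
from scipy.linalg import lu_factor, lu_solve

K = np.random.rand(100, 100)
K = K + K.T + 100 * np.eye(100)  # stand-in for an assembled global matrix
f = np.random.rand(100)

x = np.linalg.solve(K, f)  # LAPACK LU under the hood; no inverse of K is formed

# reusing one factorization for many right-hand sides:
lu, piv = lu_factor(K)
x2 = lu_solve((lu, piv), f)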
I wouldn't consider a 500 element mesh to be a large problem. I saw meshes with thousands and tens of thousands of elements when I stopped doing that kind of work in 1995. I'm sure that hardware makes even larger problems possible today.
You're doing something else wrong or there are details that you aren't providing.
I was implementing a weighting system called TF-IDF on a set of 42,000 images, each consisting of 784 pixels. This is basically a 42,000 by 784 matrix.
The first method I attempted made use of explicit loops and took more than 2 hours.
def tfidf(color, img_pix, img_total):
    if img_pix == 0:
        return 0
    else:
        return color * np.log(img_total / img_pix)

...
result = np.array([])
for img_vec in data_matrix:
    double_vec = zip(img_vec, img_pix_vec)
    result_row = np.array([tfidf(x[0], x[1], img_total) for x in double_vec])
    try:
        result = np.vstack((result, result_row))
    # the first row will throw a ValueError since vstack needs rows of the same length
    except ValueError:
        result = result_row
The second method I attempted used numpy matrices and took less than 5 minutes. Note that data_matrix, img_pix_mat are both 42000 by 784 matrices while img_total is a scalar.
result = data_matrix * np.log(np.divide(img_total,img_pix_mat))
I was hoping someone could explain the immense difference in speed.
The authors of the following paper entitled "The NumPy array: a structure for efficient numerical computation" (http://arxiv.org/pdf/1102.1523.pdf), state on the top of page 4 that they observe a 500 times speed increase due to vectorized computation. I'm presuming much of the speed increase I'm seeing is due to this. However, I would like to go a step further and ask why numpy vectorized computations are that much faster than standard python loops?
Also, perhaps you guys might know of other reasons why the first method is slow. Do try/except structures have high overhead? Or perhaps forming a new np.array in each loop iteration takes a long time?
Thanks.
This is due to the internal workings of numpy: as far as I know, numpy is implemented in C internally, so everything you push down into numpy is actually much faster because it runs in a different language.
edit:
Taking out the zip and replacing it with direct indexing should be faster too (zip tends to be slow when the arguments are very large, and indexing keeps the work inside numpy, thus in C, while zip is Python).
And yes, if I understand correctly (not sure about that), you are running a try/except block 42k times; that should definitely be bad for speed.
test:
import numpy
T = numpy.ndarray((5, 10))
for t in T:
    print t.shape

This results in (10,) for each row.
This means that yes, if your matrices are 42k x 784, you are running the try/except block 42k times, and doing a zip each time as well; I am assuming that affects the computation times, but I'm not certain it would be the main cause.
(Each of the 42k iterations takes about 0.17 s. I am quite certain that a try/except block alone doesn't take 0.17 seconds, but maybe the overhead it causes contributes to it?)
try changing the following:
double_vec = zip(img_vec,img_pix_vec)
result_row = np.array([tfidf(x[0],x[1],img_total) for x in double_vec])
to
result_row=np.array([tfidf(img_vec[i],img_pix_vec[i],img_total) for i in xrange(len(img_vec))])
That at least gets rid of the zip statement. But I'm not sure whether the zip statement costs you one minute or nearly two hours (I know zip is slow compared to staying inside numpy, but I have no clue whether removing it alone would give you a two-hour gain).
The difference you're seeing isn't due to anything fancy like SSE vectorization. There are two primary reasons. The first is that NumPy is written in C, and the C implementation doesn't have to go through the tons of runtime method dispatch and exception checking and so on that a Python implementation goes through.
The second reason is that even for Python code, your loop-based implementation is inefficient. You're using vstack in a loop, and every time you call vstack, it has to completely copy all arrays you've passed to it. That adds an extra factor of len(data_matrix) to your asymptotic complexity.
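Even keeping the Python loop, collecting the rows in a list and stacking once removes that factor; a sketch with toy inputs standing in for the real data_matrix and img_pix_mat:

import numpy as np

data_matrix = np.random.rand(1000, 784)
img_pix_mat = np.random.randint(0, 5, (1000, 784))
img_total = 1000.0

rows = []
for img_vec, img_pix_vec in zip(data_matrix, img_pix_mat):
    # zero where img_pix is 0, matching the original tfidf function
    row = np.where(img_pix_vec == 0, 0.0,
                   img_vec * np.log(img_total / np.maximum(img_pix_vec, 1)))
    rows.append(row)
result = np.vstack(rows)  # one copy at the end instead of one per iteration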