In C there is a classic example of the importance of memory utilization: naive matrix multiplication using 3 for loops (i,j,k). One can show that the (i,k,j) order is much faster than (i,j,k) due to cache locality. In Python/NumPy (again naive 3 loops, not np.dot) the order of the indexes does not matter. Why is that?
First of all you need to know why the (i,k,j) loop is faster than (i,j,k) in C. It happens because of how memory is accessed: the matrix is laid out linearly in your computer's memory (row major), so when you iterate in (i,k,j) order you work with that layout, and each inner loop walks through a contiguous block that has already been loaded into the cache. If you use (i,j,k) you work against it: each step loads a block of memory into the cache only to discard it on the next step, because you keep jumping between blocks.
NumPy's implementation handles this for you, so even if you express the computation in the worst order, NumPy arranges the work so it stays fast.
The event of throwing away cached data and having to reload it over and over is called a cache miss.
At this link you can see a much better explanation of how the memory is allocated and why iterating in a specific order is faster.
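A quick way to see the row-major layout from the Python side (a small sketch; the array size is illustrative and the timings are machine dependent):

import numpy as np

a = np.random.rand(5000, 5000)   # C order: each row is one contiguous block of memory

# Copying a row reads one contiguous block; copying a column jumps
# 5000 * 8 bytes between elements, touching a new cache line every time.
row = a[0, :].copy()   # fast: sequential, cache-friendly access
col = a[:, 0].copy()   # slower: strided access, most of each cache line is wasted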
I have a huge numpy array with shape (50000000, 3) and I'm using:
x = array[np.where((array[:,0] == value) | (array[:,1] == value))]
to get the part of the array that I want. But this way seems to be quite slow.
Is there a more efficient way of performing the same task with numpy?
np.where is highly optimized and I doubt someone can write faster code than the one implemented in the latest NumPy version (disclaimer: I was one of those who optimized it). That being said, the main issue here is not so much np.where but the condition, which creates temporary boolean arrays. This is unfortunately the way to do it in NumPy, and there is not much you can do as long as you use only NumPy with the same input layout.
One reason it is not very efficient is that the input data layout is inefficient. Indeed, assuming array is stored contiguously in memory using the default row-major ordering, array[:,0] == value reads only 1 item out of every 3 items of the array in memory. Due to the way CPU caches work (i.e. cache lines, prefetching, etc.), 2/3 of the memory bandwidth is wasted. On top of that, the output boolean array also needs to be written, and filling a newly created array is a bit slow due to page faults. Note that array[:,1] == value will almost certainly reload the data from RAM because of the size of the input (it cannot fit in most CPU caches). RAM is slow, and it is getting slower relative to the computational speed of CPUs and caches. This problem, called the "memory wall", was observed a few decades ago and is not expected to be fixed any time soon. Also note that the logical-or creates yet another array that is read from and written to RAM. A better data layout is a (3, 50000000) transposed array that is contiguous in memory (note that np.transpose does not produce a contiguous array).
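As a rough sketch of the layout argument (the array here is smaller than in the question and the values are illustrative):

import numpy as np

rng = np.random.default_rng(0)
array = rng.integers(0, 100, size=(5_000_000, 3), dtype=np.int32)
value = 42

# Row-major (N, 3) layout: each comparison uses only 1 item out of every 3 read from memory.
mask_rows = (array[:, 0] == value) | (array[:, 1] == value)

# Contiguous transposed (3, N) layout: each comparison streams one contiguous row.
array_t = np.ascontiguousarray(array.T)
mask_cols = (array_t[0] == value) | (array_t[1] == value)

assert np.array_equal(mask_rows, mask_cols)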
Another reason for the performance issue is that NumPy tends not to be optimized for operating on very small axes.
One main solution is to create the input in a transposed way, if possible. Another solution is to write Numba or Cython code. Here is a Numba implementation for the non-transposed input:
import numpy as np
import numba as nb

# Compilation for the most frequent types.
# Please pick the right ones to speed up the compilation time.
@nb.njit(['(uint8[:,::1],uint8)', '(int32[:,::1],int32)', '(int64[:,::1],int64)', '(float64[:,::1],float64)'], parallel=True)
def select(array, value):
    n = array.shape[0]
    mask = np.empty(n, dtype=np.bool_)
    # Parallel loop over the rows; each thread fills its own chunk of the mask.
    for i in nb.prange(n):
        mask[i] = array[i, 0] == value or array[i, 1] == value
    return mask

x = array[select(array, value)]
Note that I used a parallel implementation since the or operator is sub-optimal with Numba (the only solution seems to be native code or Cython) and also because the RAM cannot be fully saturated by one thread on some platforms, like computing servers. Also note that it can be faster to use array[np.where(select(array, value))[0]] depending on the result of select. Indeed, if the matching rows are few or randomly scattered, np.where can be faster since it has special optimizations for these cases that boolean indexing does not perform. Note that np.where is not particularly optimized inside a Numba function, since Numba uses its own implementation of NumPy functions and they are sometimes not as optimized for large arrays. A faster implementation consists in creating x in parallel, but this is not trivial to do with Numba since the number of output items is not known ahead of time and the threads must know where to write data, not to mention that NumPy is already fairly fast at doing that sequentially as long as the output is predictable.
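For reference, the two indexing paths discussed above look like this side by side (just a usage sketch reusing select from the snippet above):

mask = select(array, value)          # Numba-compiled function defined above
x_bool  = array[mask]                # boolean indexing
x_where = array[np.where(mask)[0]]   # integer indexing; can win when few rows match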
I am generating a large array of random numbers, totaling more than half the available memory on a GPU. I am doing this in a loop.
When I call cupy.random the second time (or third time...), assigning to the same variable name, it does not free the memory for the first array. It tries to allocate more memory, which causes an out of memory error.
Explicitly freeing the memory before generating a new random array is very slow, and seems inefficient.
Is there a way to generate a new set of numbers, but in the same memory space?
Edit: cupy.random.shuffle() is letting me work around the problem, but I wonder if there is a better way?
Edit 2: on further review, shuffle() does not address the problem, and appears to need even more memory than allocating a second block before freeing the first... I am back to restricting ndarray size to less than half the remaining memory, so two ndarrays can be allocated alternately (a sketch of this is below).
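A sketch of that alternating-allocation workaround (the 40% factor and the loop count are my own illustrative choices):

import cupy as cp

# Keep each array below roughly half of the free GPU memory, so the new array can be
# allocated while the old one still exists for a moment, before the old buffer is
# released back to CuPy's memory pool and reused.
free_bytes, total_bytes = cp.cuda.Device().mem_info
n = int(0.4 * free_bytes) // 8       # float64 -> 8 bytes per element

x = cp.random.random(n)
for _ in range(10):
    # ... work with x ...
    x = cp.random.random(n)   # old buffer returns to the memory pool after rebinding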
As user2357112 suggests, cupy.random.random() does not appear to support "re-randomizing" an existing ndarray, even though cuRAND does. Writing C code to modify an existing CuPy array somewhat defeats the point of using Python / CuPy in the first place.
Curiously, having an array about 1/3rd the size of available memory, while increasing the number of loops, is faster in total execution time (versus larger arrays/fewer loops). I was not able to determine when cupy (or python or cuda?) does garbage collection on the disused array, but it seems to happen asynchronously.
If GPU garbage collection uses CUDA cores (I presume it does?), it does not appear to materially affect my code execution time. nvidia-smi reports "P2" GPU usage when my code's calculations are running, suggesting there are still cores available for CuPy / CUDA to free memory outside of my code?
I don’t like answering my own question... just sharing what I found in case it helps someone else
I want to test using cupy whether a float is positive, e.g.:
import cupy as cp
u = cp.array(1.3)
u < 2.
>>> array(True)
My problem is that this operation is extremely slow:
%timeit u < 2. gives 26 microseconds on my computer. That is orders of magnitude more than what I get on the CPU. I suspect it is because the result has to be brought back to the CPU...
I'm trying to find a faster way to do this operation.
Thanks !
Edit for clarification
My code is something like:
import cupy as cp

n = 100000
X = cp.random.randn(n)  # can be greater
for _ in range(100):  # There may be more iterations
    result = X.dot(X)
    if result < 1.2:
        break
And it seems like the bottleneck of this code (for this n) is the evaluation of result < 1.2. The whole thing is still much faster than on the CPU, since the dot costs way less.
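One hedged workaround, assuming the algorithm can tolerate checking the stopping condition less often: the if result < 1.2 forces a device synchronization and a device-to-host copy on every iteration, so amortizing it over several iterations avoids most of that overhead:

import cupy as cp

n = 100000
check_every = 10                 # illustrative; iterations between condition checks
X = cp.random.randn(n)

for it in range(100):
    result = X.dot(X)            # stays on the device; the call is asynchronous
    if it % check_every == 0:
        if float(result) < 1.2:  # the only point where we synchronize and copy to host
            break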
Running a single operation on the GPU is always a bad idea. To get performance gains out of your GPU, you need to achieve a good 'compute intensity'; that is, the amount of computation performed relative to the movement of memory, either from global RAM to GPU memory, or from GPU memory into the cores themselves. If you don't have at least a few hundred FLOPs per byte of compute intensity, you can safely forget about realizing any speedup on the GPU. That said, your problem may lend itself to GPU acceleration, but you certainly cannot benchmark statements like this in isolation in any meaningful way.
But even if your algorithm consists of chaining a number of such simple, low-compute-intensity operations on the GPU, you will still be disappointed by the speedup. Your bottleneck will be GPU memory bandwidth, which really isn't as far ahead of CPU memory bandwidth as it may look on paper. Unless you will be writing your own compute-intense kernels, or have plans for running some big FFTs or such using CuPy, don't expect any silver-bullet speedups from just porting your numpy code.
This may be because, when using CUDA, the array must be copied to the GPU before processing. Therefore, if your array has only one element, it can be slower on the GPU than on the CPU. You should try a larger array and see if this keeps happening.
I think the problem here is that you're only leveraging one GPU device. Consider using, say, 100 to do all the for-loop computations in parallel (although in the case of your simple example code it would only need doing once). https://docs-cupy.chainer.org/en/stable/tutorial/basic.html
Also, there is a cupy greater function you could use to do the comparison on the GPU.
Also, the first time dot gets called, the kernel function will need to be compiled for the GPU, which takes significantly longer than subsequent calls.
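For completeness, here is what the elementwise route suggested above might look like (a sketch; the comparison result stays on the device until you explicitly convert it):

import cupy as cp

u = cp.array(1.3)
is_positive = cp.greater(u, 0.)   # 0-d boolean array on the GPU, same as u > 0.
# Only converting it, e.g. bool(is_positive) or is_positive.get(),
# forces a device synchronization and a copy back to the host.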
I have two questions, but the first takes precedence.
I was doing some timeit testing of some basic numpy operations that will be relevant to me.
I did the following
from collections import defaultdict
import numpy as np

n = 5000
j = defaultdict()
for i in xrange(n):
    print i
    j[i] = np.eye(n)
What happened is, Python's memory use almost immediately shot up to 6 GB, which is over 90% of my memory. However, numbers printed at a steady pace, about 10-20 per second. While numbers printed, memory use sporadically bounced down to ~4 GB, back up to 5, back down to 4, up to 6, down to 4.5, and so on.
At 1350 iterations I had a segmentation fault.
So my question is, what was actually occurring during this time? Are these matrices actually created one at a time? Why is memory use spiking up and down?
My second question is, I may actually need to do something like this in a program I am working on. I will be doing basic arithmetic and comparisons between many large matrices, in a loop. These matrices will sometimes, but rarely, be dense. They will often be sparse.
If I actually need 5000 5000x5000 matrices, is that feasible with 6 gigs of memory? I don't know what can be done with all the tools and tricks available... Maybe I would just have to store some of them on disk and pull them out in chunks?
Any advice for if I have to loop through many matrices and do basic arithmetic between them?
Thank you.
If I actually need 5000 5000x5000 matrices, is that feasible with 6 gigs of memory?
If they're dense matrices, and you need them all at the same time, not by a long shot. Consider:
5K * 5K = 25M cells
25M * 8B = 200MB (assuming float64)
5K * 200MB = 1TB
The matrices are being created one at a time. As you get near 6GB, what happens depends on your platform. It might start swapping to disk, slowing your system to a crawl. There might be a fixed-size or max-size swap, so eventually it runs out of memory anyway. It may make assumptions about how you're going to use the memory, guessing that there will always be room to fit your actual working set at any given moment into memory, only to segfault when it discovers it can't. But the one thing it isn't going to do is just work efficiently.
You say that most of your matrices are sparse. In that case, use one of the sparse matrix representations. If you know which of the 5000 will be dense, you can mix and match dense and sparse matrices, but if not, just use the same sparse matrix type for everything. If this means your occasional dense matrices take 210MB instead of 200MB, but all the rest of your matrices take 1MB instead of 200MB, that's more than worthwhile as a tradeoff.
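As a sketch of the memory difference (scipy.sparse is my assumption here, since the question only mentions numpy):

import numpy as np
from scipy import sparse

n = 5000
dense = np.eye(n)                       # 5000*5000*8 bytes = 200 MB of float64
sp = sparse.identity(n, format='csr')   # stores only the 5000 nonzeros plus index arrays

print(dense.nbytes)                                            # 200000000
print(sp.data.nbytes + sp.indices.nbytes + sp.indptr.nbytes)   # on the order of 100 KB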
Also, do you actually need to work on all 5000 matrices at once? If you only need, say, the current matrix and the previous one at each step, you can generate them on the fly (or read from disk on the fly), and you only need 400MB instead of 1TB.
Worst-case scenario, you can effectively swap things manually, with some kind of caching discipline, like least-recently-used. You can easily keep, say, the last 16 matrices in memory. Keep a dirty flag on each so you know whether you have to save it when flushing it to make room for another matrix. That's about as tricky as it's going to get.
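A minimal sketch of that caching discipline (the class name, the .npy on-disk format, and the capacity of 16 are my own illustrative choices):

import os
from collections import OrderedDict
import numpy as np

class MatrixCache:
    """Keep the last `capacity` matrices in RAM, evict the least-recently-used
    one, and only write it to disk if it was modified (dirty)."""

    def __init__(self, directory, capacity=16):
        self.directory = directory
        self.capacity = capacity
        self.cache = OrderedDict()   # index -> (matrix, dirty_flag)

    def _path(self, i):
        return os.path.join(self.directory, f"matrix_{i}.npy")

    def get(self, i):
        if i in self.cache:
            self.cache.move_to_end(i)     # mark as most recently used
            return self.cache[i][0]
        m = np.load(self._path(i))        # fault it in from disk
        self._insert(i, m, dirty=False)
        return m

    def put(self, i, matrix):
        self._insert(i, matrix, dirty=True)

    def _insert(self, i, matrix, dirty):
        self.cache[i] = (matrix, dirty)
        self.cache.move_to_end(i)
        while len(self.cache) > self.capacity:
            old_i, (old_m, old_dirty) = self.cache.popitem(last=False)
            if old_dirty:                 # only flush modified matrices
                np.save(self._path(old_i), old_m)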
I have a numpy script that is currently running quite slowly. It spends the vast majority of its time performing the following operation inside a loop:
terms=zip(Coeff_3,Coeff_2,Curl_x,Curl_y,Curl_z,Ex,Ey,Ez_av)
res=[np.dot(C2,array([C_x,C_y,C_z]))+np.dot(C3,array([ex,ey,ez])) for (C3,C2,C_x,C_y,C_z,ex,ey,ez) in terms]
res=array(res)
Ex[1:Nx-1]=res[1:Nx-1,0]
Ey[1:Nx-1]=res[1:Nx-1,1]
It's the list comprehension that is really slowing this code down.
In this case, Coeff_3 and Coeff_2 are length-1000 lists whose elements are 3x3 numpy matrices, and Ex, Ey, Ez, Curl_x, etc. are all length-1000 numpy arrays.
I realize it might be faster if I did things like setting up a single 3x1000 E vector, but I have to perform a significant amount of averaging of different E vectors between steps, which would make things very unwieldy.
Curiously, however, I perform this operation twice per loop (once for Ex, Ey, once for Ez), and performing the same operation for the Ez's takes almost twice as long:
terms2=zip(Coeff_3,Coeff_2,Curl_x,Curl_y,Curl_z,Ex_av,Ey_av,Ez)
res2=array([np.dot(C2,array([C_x,C_y,C_z]))+np.dot(C3,array([ex,ey,ez])) for (C3,C2,C_x,C_y,C_z,ex,ey,ez) in terms2])
Anyone have any idea what's happening? Forgive me if it's anything obvious, I'm very new to Python.
As pointed out in previous comments, use array operations. np.hstack(), np.vstack(), np.outer() and np.inner() are useful here. Your code could become something like this (not sure about your dimensions):
Cxyz = np.vstack((Curl_x,Curl_y,Curl_z))
C2xyz = np.dot(C2, Cxyz)
...
Check the shape of your resulting dimensions to make sure you translated your problem correctly. Sometimes numexpr can also speed up such tasks significantly with little extra effort.
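For the specific loop in the question, one hedged way to vectorize it (assuming Coeff_2 and Coeff_3 can be stacked into (N, 3, 3) arrays; the shapes are my guess from the description) is a batched matrix-vector product with np.einsum:

import numpy as np

# Assumed shapes: stacks of N 3x3 matrices and length-N component arrays.
N = 1000
Coeff_2 = np.random.rand(N, 3, 3)
Coeff_3 = np.random.rand(N, 3, 3)
Curl_x, Curl_y, Curl_z = (np.random.rand(N) for _ in range(3))
Ex, Ey, Ez_av = (np.random.rand(N) for _ in range(3))

# Stack the per-element vectors into (N, 3) arrays.
Cxyz = np.column_stack((Curl_x, Curl_y, Curl_z))
Exyz = np.column_stack((Ex, Ey, Ez_av))

# Batched matrix-vector products replace the Python-level list comprehension:
# res[i] = Coeff_2[i] @ Cxyz[i] + Coeff_3[i] @ Exyz[i]
res = np.einsum('nij,nj->ni', Coeff_2, Cxyz) + np.einsum('nij,nj->ni', Coeff_3, Exyz)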