I want to test, using cupy, whether a float is positive, e.g.:
import cupy as cp
u = cp.array(1.3)
u < 2.
>>> array(True)
My problem is that this operation is extremely slow:
%timeit u < 2. reports 26 microseconds on my machine, which is orders of magnitude slower than what I get on the CPU. I suspect it is because the result has to be brought back to the CPU...
I'm trying to find a faster way to do this operation.
Thanks!
Edit for clarification
My code is something like:
import cupy as cp

n = 100000
X = cp.random.randn(n)  # can be greater
for _ in range(100):  # There may be more iterations
    result = X.dot(X)
    if result < 1.2:
        break
And it seems like the bottleneck of this code (for this n) is the evaluation of result < 1.2. It is still much faster than on CPU since the dot costs way less.
Running a single operation on the GPU is always a bad idea. To get performance gains out of your GPU, you need a good 'compute intensity': the amount of computation performed relative to the movement of memory, either from global RAM to GPU memory, or from GPU memory into the cores themselves. If you don't have at least a few hundred FLOPs per byte of compute intensity, you can safely forget about realizing any speedup on the GPU. That said, your problem may lend itself to GPU acceleration, but you certainly cannot benchmark statements like this in isolation in any meaningful way.
But even if your algorithm chains a number of such simple, low-compute-intensity operations on the GPU, you will still be disappointed by the speedup. Your bottleneck will be GPU memory bandwidth, which is not as much better than CPU memory bandwidth as it may look on paper. Unless you write your own compute-intensive kernels, or plan to run some big FFTs or the like with cupy, don't expect silver-bullet speedups from just porting your numpy code.
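For the edited example, one way to cut the cost of that comparison is to not synchronize on every iteration: with cupy, evaluating `if result < 1.2:` forces a device-to-host transfer each time through the loop. Below is a minimal sketch of checking the stopping condition only every few iterations; it uses numpy as a stand-in so it runs anywhere, and the helper name `iterate_with_sparse_checks` is made up for illustration — with cupy you would swap `np` for `cp`.

```python
import numpy as np  # stand-in for cupy here; with cupy, replace np with cp

def iterate_with_sparse_checks(X, threshold=1.2, max_iter=100, check_every=10):
    """Run the dot-product loop, but only pull the scalar result off the
    device (which forces a sync under cupy) every `check_every` iterations."""
    result = X.dot(X)
    for it in range(max_iter):
        result = X.dot(X)  # stays on the device under cupy
        if (it + 1) % check_every == 0:
            # float(...) is where cupy would copy the scalar to the host
            if float(result) < threshold:
                break
    return float(result)
```

The trade-off: up to `check_every - 1` extra dot products may run after the condition is first met, in exchange for far fewer synchronizations.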
This may be because, when using CUDA, the array must be copied to the GPU before processing. Therefore, if your array has only one element, it can be slower on the GPU than on the CPU. You should try a larger array and see if this keeps happening.
I think the problem here is that you're only leveraging one GPU device. Consider using, say, 100 of them to do all the loop computations in parallel (although in the case of your simple example code it would only need doing once). https://docs-cupy.chainer.org/en/stable/tutorial/basic.html
Also, there is a cupy greater function you could use to do the comparison on the GPU.
Also, the first time dot gets called, the kernel function needs to be compiled for the GPU, which takes significantly longer than subsequent calls.
My sincere apologies in advance if this question seems quite basic.
Given:
>>> import numpy as np
>>> import time
>>> A = np.random.rand( int(1e5), int(5e4) ) # large numpy array
Goal:
>>> bt=time.time(); B=np.argsort(A,axis=1);et=time.time();print(f"Took {(et-bt):.2f} s")
However, it takes quite a long time to compute the array of indices:
# Took 316.90 s
Question:
Is there any other time efficient ways to do this?
Cheers,
The input array A has a shape of (100_000, 50_000) and contains np.float64 values by default. This means you need 8 * 100_000 * 50_000 / 1024**3 = 37.2 GiB of memory just for this array. You also likely need the same amount of space for the output matrix B (which should contain items of type np.int64). This means you need a machine with at least 74.4 GiB, not to mention the space required for the operating system (OS) and running software (so probably at least 80 GiB). If you do not have that much memory, your OS will use your storage device as swap memory, which is much, much slower.
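For reference, the memory arithmetic above can be reproduced directly (plain Python, matching the figures quoted in the text):

```python
# float64 input A and int64 output B: 8 bytes per item each
bytes_A = 8 * 100_000 * 50_000
bytes_B = 8 * 100_000 * 50_000
gib_A = bytes_A / 1024**3                  # input array alone, ~37.25 GiB
gib_total = (bytes_A + bytes_B) / 1024**3  # input + output, ~74.5 GiB
```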
Assuming you do have that much memory available, such a computation is still very expensive. This is mainly due to the page faults when filling the B array, the fact that B is itself huge, and the fact that Numpy's default implementation does the computation sequentially. You can speed the computation up using parallel Numba code and a smaller output type. Here is an example:
import numpy as np
import numba as nb

@nb.njit('int32[:,::1](float64[:,::1])', parallel=True)
def fastSort(a):
    # int32 output uses half the memory of numpy's default int64 indices
    b = np.empty(a.shape, dtype=np.int32)
    for i in nb.prange(a.shape[0]):
        b[i,:] = np.argsort(a[i,:])
    return b
Note that Numba's implementation of argsort is less efficient than Numpy's, but the parallel version should be much faster if the target machine has many cores and good memory bandwidth.
Here are the results on my 6-core machine on a matrix of size (10_000, 50_000) (10 times smaller because I do not have 80 GiB of RAM):
Original implementation: 28.14 s
Sequential Numba: 38.70 s
Parallel Numba: 6.79 s
The resulting solution is thus 4.1 times faster.
Note that you could even use the uint16 type for the items of B in this specific case, since the size of each line is less than 2**16 = 65536. This will likely not be significantly faster, but it should save some additional memory: the resulting required memory is 46.5 GiB. You can further reduce the amount of memory needed by using the np.float32 type for A (often at the expense of some accuracy).
If you want to improve the execution time further, you need to write a faster argsort for your specific needs in a low-level language like C or C++. But be aware that beating Numpy is far from easy if you are not an experienced programmer in such a language or are not familiar with low-level optimizations. If you are interested in such a solution, a good starting point is probably to implement a radix sort.
I'm trying to do some matrix computations as follows:
import numpy as np

def get_P(X, Z):
    n_sample, n_m, n_t, n_f = X.shape
    res = np.zeros((n_sample, n_m, n_t, n_t))
    for i in range(n_sample):
        res[i,:,:,:] = np.dot(X[i,:,:,:], Z[i,:,:])
    return res
Because the sizes of X and Z are large, it takes more than 600 ms to compute one np.dot, and I have 10k rows in X.
Is there anyway we can speed it up?
Well, there might be some avoidable overhead posed by your zero initialization (which gets overwritten right away): just use np.empty instead, which allocates without initializing.
Other than that: numpy is fairly well optimized. You can probably speed things up by using dtype=numpy.float32 instead of the default 64-bit floating-point numbers for X, Z and res – but that's about it. Dot products mostly spend their time streaming linearly through RAM while multiplying and summing numbers – things that numpy, compilers and CPUs are remarkably good at these days.
Note that numpy will only use one CPU core at a time in its default configuration – so it can make sense to parallelize. For example, if you have 16 CPU cores, you could make 16 separate res partitions and calculate a subset of your range(n_sample) dot products on each core; Python does bring the multithreading / async facilities to do so – you'll find plenty of examples, and explaining how would lead too far here.
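As a sketch of that idea (the helper `get_P_parallel` is hypothetical, not part of the answer): threads are sufficient here because numpy's dot releases the GIL while the underlying BLAS call runs, and each thread writes a disjoint slice of `res`, so no locking is needed.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def get_P_parallel(X, Z, workers=4):
    """Parallel variant of get_P: each worker fills its own rows of res.
    Assumes X has shape (n_sample, n_m, n_t, n_f) and Z (n_sample, n_f, n_t)."""
    n_sample = X.shape[0]
    res = np.empty(X.shape[:3] + (Z.shape[-1],))
    def work(i):
        res[i] = np.dot(X[i], Z[i])  # disjoint slice per index: thread-safe
    with ThreadPoolExecutor(max_workers=workers) as ex:
        list(ex.map(work, range(n_sample)))
    return res
```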
If you can spend the development time, and need massive amounts of data so that it pays off: you can use e.g. GPUs to multiply matrices; they are really good at that, and cuBLASLt (GEMM) is an excellent implementation. But honestly, you would mostly be abandoning numpy and would need to work things out yourself – in C/C++.
You can use numpy's einsum to do this multiplication in one vectorized step.
It will be much faster than the loop-based dot product. For examples, check this link: https://rockt.github.io/2018/04/30/einsum
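Assuming X has shape (n_sample, n_m, n_t, n_f) and Z has shape (n_sample, n_f, n_t) as implied by the question's code, the whole loop collapses to a single einsum call (the subscript labels below are my own naming):

```python
import numpy as np

def get_P_einsum(X, Z):
    # i: sample, m/t: leading dims of X, f: contracted axis, s: last axis of Z
    return np.einsum('imtf,ifs->imts', X, Z)
```

An equivalent formulation is `X @ Z[:, None]`, which broadcasts Z across the n_m axis and lets numpy's batched matmul do the same contraction.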
I am generating a large array of random numbers, totaling more than half the available memory on a GPU. I am doing this in a loop.
When I call cupy.random the second (or third...) time, assigning to the same variable name, it does not free the memory of the first array. It tries to allocate more memory, which causes an out-of-memory error.
Explicitly freeing the memory before generating a new random array is very slow, and seems inefficient.
Is there a way to generate a new set of numbers, but in the same memory space?
Edit: cupy.random.shuffle() is letting me work around the problem, but I wonder if there is a better way?
Edit 2: on further review, shuffle() does not address the problem, and appears to need even more memory than allocating a second block (before freeing the first) of memory... I am back to restricting ndarray size to less than half the remaining memory, so two ndarrays can be allocated alternately
As user2357112 suggests, cupy.random.random() does not appear to support "re-randomizing" an existing ndarray, even though cuRAND does. Writing C to modify an existing cupy array somewhat defeats the point of using Python / cupy in the first place.
Curiously, having an array about 1/3rd the size of available memory, while increasing the number of loops, is faster in total execution time (versus larger arrays/fewer loops). I was not able to determine when cupy (or python or cuda?) does garbage collection on the disused array, but it seems to happen asynchronously.
If GPU garbage collection uses CUDA cores (I presume it does?), it does not appear to materially affect my code's execution time. nvidia-smi reports "P2" GPU usage while my calculations are running, suggesting there are still cores available for cupy / CUDA to free memory outside of my code?
I don’t like answering my own question... just sharing what I found in case it helps someone else
When I was reading the TensorFlow official guide, there is an example showing Explicit Device Placement of operations. In the example, why is the CPU execution time less than the GPU's? More generally, what kinds of operations execute faster on a GPU?
import time
import tensorflow as tf

def time_matmul(x):
    start = time.time()
    for loop in range(10):
        tf.matmul(x, x)
    result = time.time() - start
    print("10 loops: {:0.2f}ms".format(1000 * result))

# Force execution on CPU
print("On CPU:")
with tf.device("CPU:0"):
    x = tf.random.uniform([1000, 1000])
    assert x.device.endswith("CPU:0")
    time_matmul(x)

# Force execution on GPU #0 if available
if tf.test.is_gpu_available():
    print("On GPU:")
    with tf.device("GPU:0"):  # Or GPU:1 for the 2nd GPU, GPU:2 for the 3rd etc.
        x = tf.random.uniform([1000, 1000])
        assert x.device.endswith("GPU:0")
        time_matmul(x)
### Output
# On CPU:
# 10 loops: 107.55ms
# On GPU:
# 10 loops: 336.94ms
A GPU has high memory bandwidth and a large number of parallel computation units. Easily parallelizable or data-heavy operations benefit from GPU execution; matrix multiplication, for example, involves a large number of multiplications and additions that can all be done in parallel.
A CPU has low memory latency (which matters less when you read a lot of data at once) and a rich instruction set. It shines when you have to do sequential calculations (Fibonacci numbers might be an example), make frequent random memory reads, have complicated control flow, etc.
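As a tiny illustration of that sequential-calculation point: each Fibonacci step depends on the two previous results, so the loop cannot be split across parallel workers, which is exactly the kind of code that favors the CPU.

```python
def fib(n):
    # a, b hold F(n) and F(n+1); every step depends on the previous one,
    # so there is no parallelism here for a GPU to exploit
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
```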
The difference in the official guide is due to the fact that PRNG algorithms are typically sequential and cannot exploit parallelism very efficiently. But this is in general: recent CUDA versions already have PRNG kernels and do outperform the CPU on such tasks.
When it comes to the example above, on my system I got 65 ms on CPU and 0.3 ms on GPU. Furthermore, if I set the sampling size to [5000, 5000], it becomes CPU: 7500 ms while the GPU stays at 0.3 ms. On the other hand, for [10, 10] it is CPU: 0.18 ms (up to 0.4 ms though) vs GPU: 0.25 ms. This shows clearly that even single-operation performance depends on the size of the data.
Back to the answer: placing operations on the GPU is beneficial for easily parallelizable operations that can be computed with a small number of memory calls. The CPU, on the other hand, shines for a high number of low-latency (i.e. small amounts of data) memory calls. Additionally, not all operations can easily be performed on a GPU.
In C there is a classic example of the importance of memory utilization: naive matrix multiplication using 3 for loops (i,j,k). One can show that using the order (i,k,j) is much faster than (i,j,k) due to cache-friendly memory access. In Python with numpy, the order of the indexes (again with 3 naive loops, not np.dot) does not matter. Why is that?
First of all, you need to know why the (i,k,j) loop is faster than (i,j,k) in C. It comes down to memory layout: the matrix is stored linearly in memory, so iterating in (i,k,j) order works with that layout – the inner loop walks through a contiguous block that has already been loaded into cache. If you use (i,j,k), you work against it: each step loads a block of memory into cache only to discard it on the next step, because you keep jumping between blocks.
Numpy's implementation handles this for you, so even if you use the worst order, numpy will arrange the computation to work faster.
The event of throwing away the cached block and reloading it all the time is called a cache miss.
At this link you can see a much better explanation of how the memory is allocated and why some specific iteration orders are faster.
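For concreteness, here is a sketch of the two loop orders in Python. Both produce the same result; in pure Python the per-iteration interpreter overhead is so large that the cache effects visible in C are likely drowned out, which is consistent with the question's observation that the order makes no difference there.

```python
import numpy as np

def matmul_ijk(A, B):
    # cache-unfriendly order in C: the inner k-loop strides down B's columns
    n, m = A.shape
    _, p = B.shape
    C = np.zeros((n, p))
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i, j] += A[i, k] * B[k, j]
    return C

def matmul_ikj(A, B):
    # cache-friendly order in C: the inner j-loop walks B's rows contiguously
    n, m = A.shape
    _, p = B.shape
    C = np.zeros((n, p))
    for i in range(n):
        for k in range(m):
            for j in range(p):
                C[i, j] += A[i, k] * B[k, j]
    return C
```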