I am using Numba and CuPy to do GPU coding. Now I have switched my code from a V100 NVIDIA card to an A100, but then I got the following warnings:
NumbaPerformanceWarning: Grid size (27) < 2 * SM count (216) will likely result in GPU under utilization due to low occupancy.
NumbaPerformanceWarning: Host array used in CUDA kernel will incur copy overhead to/from device.
Does anyone know what these two warnings really suggest? How should I improve my code?
NumbaPerformanceWarning: Grid size (27) < 2 * SM count (216) will likely result in GPU under utilization due to low occupancy.
A GPU is subdivided into SMs. Each SM can hold a complement of threadblocks (which is like saying it can hold a complement of threads). In order to "fully utilize" the GPU, you would want each SM to be "full", which roughly means each SM has enough threadblocks to fill its complement of threads. An A100 GPU has 108 SMs. If your kernel launch has fewer than 108 threadblocks (i.e. in the grid), then your kernel will not be able to fully utilize the GPU; some SMs will be empty. A threadblock cannot be resident on two or more SMs at the same time. Even 108 (one per SM) may not be enough: an A100 SM can hold 2048 threads, which is at least two threadblocks of 1024 threads each. Anything less than 2*108 threadblocks in your kernel launch may not fully utilize the GPU, and when you don't fully utilize the GPU, your performance may not be as good as possible.
The solution is to expose enough parallelism (enough threads) in your kernel launch to fully "occupy" or "utilize" the GPU. 216 threadblocks of 1024 threads each is sufficient for an A100. Anything less may not be.
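As a minimal sketch (not the questioner's actual kernel; the array size, kernel body, and names here are invented purely for illustration), the point is to derive the grid size from the problem size rather than launching only a handful of blocks:

import numpy as np
from numba import cuda

@cuda.jit
def scale(x, factor):
    i = cuda.grid(1)               # global 1D thread index
    if i < x.size:                 # guard threads past the end of the array
        x[i] *= factor

n = 10_000_000                     # assumed problem size, chosen to be large
a = np.ones(n, dtype=np.float64)
d_a = cuda.to_device(a)

threads_per_block = 1024
blocks_per_grid = (n + threads_per_block - 1) // threads_per_block   # ~9766 blocks, well above 2*108
scale[blocks_per_grid, threads_per_block](d_a, 2.0)
result = d_a.copy_to_host()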
For additional understanding here, I recommend the first 4 sections of this course.
NumbaPerformanceWarning: Host array used in CUDA kernel will incur copy overhead to/from device.
One of the cool things about a numba kernel launch is that I can pass to it a host data array:
a = numpy.ones(32, dtype=numpy.int64)
my_kernel[blocks, threads](a)
and numba will "do the right thing". In the above example it will:
1. create a device array for the storage of a in device memory; let's call this d_a
2. copy the data from a to d_a (Host->Device)
3. launch your kernel, where the kernel is actually using d_a
4. when the kernel is finished, copy the contents of d_a back to a (Device->Host)
That's all very convenient. But what if I were doing something like this:
a = numpy.ones(32, dtype=numpy.int64)
my_kernel1[blocks, threads](a)
my_kernel2[blocks, threads](a)
Numba will perform steps 1-4 above for the launch of my_kernel1 and then perform steps 1-4 again for the launch of my_kernel2. In most cases this is probably not what you want as a numba CUDA programmer.
The solution in this case is to "take control" of data movement:
a = numpy.ones(32, dtype=numpy.int64)
d_a = numba.cuda.to_device(a)
my_kernel1[blocks, threads](d_a)
my_kernel2[blocks, threads](d_a)
a = d_a.copy_to_host()
This eliminates unnecessary copying and will generally make your program run faster. (For trivial examples involving a single kernel launch, there will probably be no difference.)
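Putting the pieces together, a minimal end-to-end sketch might look like this (the two kernel bodies are invented purely for illustration):

import numpy as np
from numba import cuda

@cuda.jit
def my_kernel1(x):
    i = cuda.grid(1)
    if i < x.size:
        x[i] += 1

@cuda.jit
def my_kernel2(x):
    i = cuda.grid(1)
    if i < x.size:
        x[i] *= 2

a = np.ones(32, dtype=np.int64)
d_a = cuda.to_device(a)             # one explicit Host->Device copy
threads = 32
blocks = (a.size + threads - 1) // threads
my_kernel1[blocks, threads](d_a)    # both kernels work on the device array
my_kernel2[blocks, threads](d_a)
a = d_a.copy_to_host()              # one explicit Device->Host copy at the end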
For additional understanding, probably any online tutorial such as this one, or just the numba cuda docs, will be useful.
Related
When I was reading the TensorFlow official guide, there is an example showing Explicit Device Placement of operations. In the example, why is the CPU execution time less than the GPU time? More generally, what kind of operations will execute faster on a GPU?
import time
import tensorflow as tf

def time_matmul(x):
    start = time.time()
    for loop in range(10):
        tf.matmul(x, x)
    result = time.time() - start
    print("10 loops: {:0.2f}ms".format(1000 * result))

# Force execution on CPU
print("On CPU:")
with tf.device("CPU:0"):
    x = tf.random.uniform([1000, 1000])
    assert x.device.endswith("CPU:0")
    time_matmul(x)

# Force execution on GPU #0 if available
if tf.test.is_gpu_available():
    print("On GPU:")
    with tf.device("GPU:0"):  # Or GPU:1 for the 2nd GPU, GPU:2 for the 3rd etc.
        x = tf.random.uniform([1000, 1000])
        assert x.device.endswith("GPU:0")
        time_matmul(x)
### Output
# On CPU:
# 10 loops: 107.55ms
# On GPU:
# 10 loops: 336.94ms
A GPU has high memory bandwidth and a large number of parallel computation units. Easily parallelizable or data-heavy operations benefit from GPU execution. Matrix multiplication, for example, involves a large number of multiplications and additions that can be done in parallel.
A CPU has low memory latency (which matters less when you read a lot of data at once) and a rich instruction set. It shines when you have to do sequential calculations (Fibonacci numbers might be an example), make frequent random memory reads, have complicated control flow, etc.
The difference in the official guide is due to the fact that PRNG algorithms are typically sequential and cannot exploit parallel operations very efficiently. But this is the general picture: recent CUDA versions already ship PRNG kernels and do outperform the CPU on such tasks.
When it comes to the example above, on my system I got 65ms on CPU and 0.3ms on GPU. Furthermore, if I set the sampling size to [5000, 5000], it becomes CPU: 7500ms, while the GPU time stays the same at 0.3ms. On the other hand, for [10, 10] it is CPU: 0.18ms (up to 0.4ms though) vs GPU: 0.25ms. This clearly shows that even the performance of a single operation depends on the size of the data.
Back to the answer: placing operations on the GPU might be beneficial for easily parallelizable operations that can be computed with a low number of memory calls. The CPU, on the other hand, shines when it comes to a high number of low-latency (i.e. small amounts of data) memory calls. Additionally, not all operations can easily be performed on a GPU.
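To see the size dependence directly, you can reuse the time_matmul helper from the example above over a range of matrix sizes (a rough sketch; the list of sizes is arbitrary and time_matmul is assumed to already be defined):

import tensorflow as tf

for n in [10, 100, 1000, 5000]:                    # arbitrary sizes, small to large
    with tf.device("CPU:0"):
        x = tf.random.uniform([n, n])
        print("On CPU, n =", n)
        time_matmul(x)
    if tf.test.is_gpu_available():
        with tf.device("GPU:0"):
            x = tf.random.uniform([n, n])
            print("On GPU, n =", n)
            time_matmul(x)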
I want to test using cupy whether a float is positive, e.g.:
import cupy as cp
u = cp.array(1.3)
u < 2.
>>> array(True)
My problem is that this operation is extremely slow:
%timeit u < 2. gives 26 microseconds on my computer, which is orders of magnitude more than what I get on the CPU. I suspect it is because u has to be transferred back to the CPU...
I'm trying to find a faster way to do this operation.
Thanks !
Edit for clarification
My code is something like:
import cupy as cp
n = 100000
X = cp.random.randn(n) # can be greater
for _ in range(100):  # There may be more iterations
    result = X.dot(X)
    if result < 1.2:
        break
And it seems like the bottleneck of this code (for this n) is the evaluation of result < 1.2. Overall it is still much faster than on the CPU, since the dot product itself costs far less on the GPU.
Running a single operation on the GPU is always a bad idea. To get performance gains out of your GPU, you need to achieve a good 'compute intensity'; that is, the amount of computation performed relative to the movement of memory, either from global RAM to GPU memory, or from GPU memory into the cores themselves. If you don't have at least a few hundred FLOPs per byte of compute intensity, you can safely forget about realizing any speedup on the GPU. That said, your problem may lend itself to GPU acceleration, but you certainly cannot benchmark statements like this in isolation in any meaningful way.
But even if your algorithm consists of chaining a number of such simple, low-compute-intensity operations on the GPU, you will still be disappointed by the speedup. Your bottleneck will be GPU memory bandwidth, which really isn't as much better than CPU memory bandwidth as it may look on paper. Unless you will be writing your own compute-intensive kernels, or have plans for running some big FFTs or such using CuPy, don't expect any silver-bullet speedups from just porting your numpy code.
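As one hedged illustration of "writing your own kernels", CuPy can fuse several elementwise operations into a single kernel with cupy.fuse, so a chain of low-intensity steps at least makes only one pass over memory (the particular expression below is invented for illustration):

import cupy as cp

@cp.fuse()
def fused_step(x, y):
    # several elementwise operations compiled into one kernel,
    # i.e. one pass over memory instead of one per operation
    return cp.sqrt(x * x + y * y) * 0.5

x = cp.random.randn(1_000_000)
y = cp.random.randn(1_000_000)
z = fused_step(x, y)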
This may be because, when using CUDA, the array must be copied to the GPU before processing. Therefore, if your array has only one element, it can be slower on the GPU than on the CPU. You should try a larger array and see if this keeps happening.
I think the problem here is that you're just leveraging one GPU device. Consider using, say, 100 of them to do all the loop computations in parallel (although in the case of your simple example code it would only need doing once). https://docs-cupy.chainer.org/en/stable/tutorial/basic.html
Also, there is a cupy greater function you could use to do the comparison on the GPU.
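For example (a small sketch; note that turning the result into a Python bool, as an if statement does, still forces a device-to-host synchronization):

import cupy as cp

u = cp.array(1.3)
flag = cp.greater(u, 2.)     # comparison runs on the GPU; result stays on the device
# Only when the value is needed on the host (e.g. to branch in Python)
# do you pay for the transfer/synchronization:
if bool(flag):
    print("u is greater than 2")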
Also, the first time dot gets called, the kernel function will need to be compiled for the GPU, which takes significantly longer than subsequent calls.
I am trying to run a very resource-intensive Python program which processes text with NLP methods for different classification tasks.
The program's runtime is several days, so I am trying to allocate more capacity to it. However, I don't really understand if I did the right thing, because with my new allocation the Python code is not significantly faster.
Here are some information about my notebook:
I have a notebook running Windows 10 with an Intel Core i7 with 4 cores (8 logical processors) @ 2.5 GHz and 32 GB of physical memory.
What I did:
I changed some parameters in the vmoptions file, so that it looks like this now:
-Xms30g
-Xmx30g
-Xmn30g
-Xss128k
-XX:MaxPermSize=30g
-XX:ParallelGCThreads=20
-XX:ReservedCodeCacheSize=500m
-XX:+UseConcMarkSweepGC
-XX:SoftRefLRUPolicyMSPerMB=50
-ea
-Dsun.io.useCanonCaches=false
-Djava.net.preferIPv4Stack=true
-XX:+HeapDumpOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
My problem:
However, as I said, my code is not running significantly faster. On top of that, if I open the Task Manager I can see that PyCharm uses nearly 80% of the memory but 0% of the CPU, while Python uses 20% of the CPU and 0% of the memory.
My question:
What do I need to do so that my Python code runs faster?
Is it possible that I need to allocate more CPU to PyCharm or Python?
What is the connection between the memory allocated to PyCharm and the runtime of the Python interpreter?
Thank you very much =)
You cannot increase CPU usage manually. Try one of these solutions:
- Try to rewrite your algorithm to be multi-threaded; then you can use more of your CPU (see the sketch after this list). Note that not all programs can profit from multiple cores. Calculations done in steps, where the next step depends on the results of the previous step, will not be faster with more cores. Problems that can be vectorized (applying the same calculation to large arrays of data) can relatively easily be made to use multiple cores, because the individual calculations are independent.
- Use numpy. It is an extension written in C that can use optimized linear algebra libraries like ATLAS. It can speed up numerical calculations significantly compared to standard Python.
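As a rough sketch of the multi-core idea (classify_document and the documents list are placeholders for whatever your NLP pipeline actually does):

from multiprocessing import Pool

def classify_document(text):
    # placeholder for your real NLP classification work
    return len(text.split())

if __name__ == "__main__":
    documents = ["some text", "some more text", "yet another document"]  # your corpus
    with Pool(processes=8) as pool:                       # 8 logical processors
        results = pool.map(classify_document, documents)  # documents processed in parallel
    print(results)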
You can adjust the number of CPU cores to be used by the IDE when running the active tasks (for example, indexing header files, updating symbols, and so on) in order to keep the performance properly balanced between AppCode and other applications running on your machine.
Use this link.
I am starting to get into deep learning and I am trying out the example from Chapter 6 on neuralnetworksanddeeplearning.com. Theano is telling me that it is using my GPU (a GTX 780). However, GPU usage hovers at only around 40~50% and the clock speed is only ~800 MHz (the normal boost clock in games is ~1100 MHz).
Is this normal? Or is something wrong here?
It is normal. Actually, 40~50% should be considered high usage. Some operations, like vector concatenation, are performed on the CPU. The GPU has to wait for these operations to be completed before using the results as input. Besides, overhead can be caused by loading data from memory.
So people commonly run 2~3 programs on the same GPU to take full advantage of it.
As a starter in OpenCL, I have a simple understanding question about optimizing GPU computing.
As far as I understand, I can make e.g. a 1000 x 1000 matrix and run one piece of code on each element at the same time using a GPU. What about the following options:
1. I have 100 matrices of 100 x 100 each and need to calculate each of them differently. Do I have to do this serially, or can I start 100 instances, i.e. start 100 Python multiprocesses, each of which shoots a matrix calculation to the GPU (assuming there are enough resources)?
2. The other way round: I have one 1000 x 1000 matrix and 100 different instances to calculate. Can I do this at the same time, or only as serial processing?
Any advice or concept for how to solve this the fastest way is appreciated.
Thanks
Adrian
The OpenCL execution model revolves around kernels, which are just functions that execute for each point in your problem domain. When you launch a kernel for execution on your OpenCL device, you define a 1, 2 or 3-dimensional index space for this domain (aka the NDRange or global work size). It's entirely up to you how you map the NDRange onto your actual problem.
For example, you could launch an NDRange that is 100x100x100, in order to process 100 sets of 100x100 matrices (assuming they are all independent). Your kernel then defines the computation for a single element of one of these matrices. Alternatively, you could launch 100 kernels, each with a 100x100 NDRange to achieve the same thing. The former is probably faster, since it avoids the overhead of launching multiple kernels.
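For instance, with PyOpenCL a single 100x100x100 NDRange covering 100 independent 100x100 matrices could be launched like this (a minimal sketch; the kernel just scales every element, and the buffer layout and scale factor are assumptions):

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

src = """
__kernel void scale(__global float *data) {
    // one work-item per matrix element: (matrix index, row, column)
    size_t m = get_global_id(0);
    size_t r = get_global_id(1);
    size_t c = get_global_id(2);
    size_t idx = m * 100 * 100 + r * 100 + c;
    data[idx] *= 2.0f;
}
"""
prg = cl.Program(ctx, src).build()

host = np.random.rand(100, 100, 100).astype(np.float32)   # 100 matrices of 100x100
mf = cl.mem_flags
buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=host)

# one kernel launch whose NDRange covers all 100 matrices at once
prg.scale(queue, (100, 100, 100), None, buf)
cl.enqueue_copy(queue, host, buf)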
I strongly recommend taking a look at the OpenCL specification for more information about the OpenCL execution model. Specifically, section 3.2 has a great description of the core concepts surrounding kernel execution.