I am looking to do something very basic. I have a piece of code that I did not write which performs some processing that takes approximately 10 minutes to run on a single data set. I have 50,000 data sets, so I would like to utilize many GPUs to run this in parallel. I am familiar with how to do this on CPUs, however I do not know how to do this on GPUs. I see many examples of how to increase the speed of certain function calls with GPUs via numba, although I cannot find how I would run a for loop on a GPU. Is this possible? In essence I have 50,000 image names, and I want a loop which reads through all the images, performs the processing, and then writes the extracted information to a .csv.
I'm participating in a Supercomputer Challenge.
In my experience, accelerating CPU code with a GPU is a complicated job.
But there are some Python projects/libraries that may help you:
CuPy: makes it easy to convert numpy code to CUDA code
Numba: the JIT compiler you mention above
PyCUDA: run C CUDA code from Python
RAPIDS: the cuXX libraries developed by Nvidia
Easy -> hard: CuPy/RAPIDS > Numba > PyCUDA
In summary, you should study CuPy if you are using numpy, or try to find similar processing methods in the RAPIDS libraries (e.g. cuGraph). PyCUDA is the most difficult option for this case.
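For example, a minimal sketch of the multi-GPU loop with CuPy plus multiprocessing (the image loader and the per-image computation are placeholders, and the GPU count is assumed):

import multiprocessing as mp
import numpy as np
import cupy as cp

N_GPUS = 4  # assumed number of GPUs on the machine

def load_image(name):
    # Placeholder for the real image reader; returns random pixels for illustration.
    return np.random.rand(512, 512).astype(np.float32)

def worker(gpu_id, image_names, out_queue):
    with cp.cuda.Device(gpu_id):                 # pin this process to one GPU
        for name in image_names:
            img = cp.asarray(load_image(name))   # host -> device copy
            value = float(img.mean())            # stand-in for the real 10-minute processing
            out_queue.put((name, value))

if __name__ == "__main__":
    names = ["img_%05d.png" % i for i in range(50000)]
    queue = mp.Queue()
    procs = [mp.Process(target=worker, args=(g, names[g::N_GPUS], queue))
             for g in range(N_GPUS)]
    for p in procs:
        p.start()
    with open("results.csv", "w") as f:
        for _ in names:
            name, value = queue.get()
            f.write("%s,%s\n" % (name, value))
    for p in procs:
        p.join()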
Just some suggestions. Good luck speeding things up!
Have people had success speeding up video (post) processing in python/opencv?
I'm using 4.5.2 from brew on a (non-M1) MacBook Pro.
The two "tall poles" I expect to impact performance are:
I'm using ndarray types for Mat (the default), so I don't believe
algorithms can take advantage of OpenCL acceleration.
VideoCapture looks to be read()'ing in real-time, so I'm limited to 30 fps max
processing on a 30 fps source. A few sources suggest with FFMPEG,
i.e. build from source, would allow reading as fast as the processor
can run.
What I've seen from online searching (since I'm post-processing) suggests using parallel processing by splitting the file into chunks and reading each in a separate thread. I can look at that. But I'd also like to understand whether people have had success fixing the two issues I raised. Ideally, faster reading and processing would let me avoid having to chunk and then recombine video snippets...
My initial attempts to convert Mat -> UMat have shown no noticeable improvement, even though I can see that the OpenCV source has OpenCL implementations for the methods I'm using (undistort, cvtColor, calcOpticalFlowFarneback, etc.). FWIW, I'm getting < 5 fps for a 30 fps source.
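Roughly, what I've been trying looks like the sketch below (illustrative only, not my exact code; the file name is a placeholder):

import cv2

cv2.ocl.setUseOpenCL(True)                    # request the OpenCL (T-API) path
print("OpenCL available:", cv2.ocl.haveOpenCL())

cap = cv2.VideoCapture("input.mp4")           # placeholder file name
ok, frame = cap.read()
if ok:
    uframe = cv2.UMat(frame)                  # upload the frame so T-API kernels can run via OpenCL
    gray = cv2.cvtColor(uframe, cv2.COLOR_BGR2GRAY)
    result = gray.get()                       # download back to a plain numpy array when needed
cap.release()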
To summarize:
Have people seen improvements from using UMat instead of Mat for OpenCL support?
Does a custom build with FFMPEG allow faster than real-time reading with VideoCapture?
Any suggestions I haven't thought of?
The intent is to use Python because of the speed of iteration, so I'm hoping for some easy tricks to speed up the computational step of each iteration.
First of all, I've read multiple forums, papers and articles on the subject.
I hadn't needed to use a GPU in my processes before, but they have become heavier. The problem is that I have a somewhat complex function written in Python, vectorized with numpy and decorated with @jit to make it run faster.
However, my GPU (AMD) shows no usage in the task manager (0%). I have looked at PyOpenCL, but I want to know if there is something simpler than translating the code. The function itself is fast; the problem is that I want to iterate that function 18 million times, which currently takes me 5 hours split across multiple processes. I know that I can use multiprocessing on the CPU, but is there some 'easy' way to split the task across my GPU instead?
We had some discussion about whether Numba can compile code for the GPU automatically. I think it could at one point, but that way is now deprecated. The other approach is to use @numba.cuda.jit and write code in terms of CUDA blocks, threads and so on. It works well. With it you enter the big and fascinating (I am not joking) world of CUDA programming. You could perhaps parallelize running your big function with different parameters; maybe you won't even need to rewrite the function itself for that...
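For reference, a minimal sketch of the @numba.cuda.jit style with a toy kernel (note this targets CUDA, so it needs an NVIDIA GPU and toolkit; it won't run on an AMD card):

from numba import cuda
import numpy as np

@cuda.jit
def scale_kernel(x, out, factor):
    i = cuda.grid(1)              # absolute index of this thread across the whole grid
    if i < x.size:                # guard threads that fall past the end of the array
        out[i] = x[i] * factor

x = np.arange(1_000_000, dtype=np.float32)
out = np.empty_like(x)
threads_per_block = 256
blocks = (x.size + threads_per_block - 1) // threads_per_block
scale_kernel[blocks, threads_per_block](x, out, np.float32(2.0))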
I have a python script that loops through a dataset of videos and applies a face and lip detector function to each video. The function returns a 3D numpy array of pixel data centered on the human lips in each frame.
The dataset is quite large (70 GB total, ~500,000 videos, each about 1 second in duration) and executing on a normal CPU would take days. I have an Nvidia 2080 Ti that I would like to use to execute the code. Is it possible to include some code that executes my entire script on the available GPU? Or am I oversimplifying a complex problem?
So far I have been trying to implement this using numba and pycuda and haven't made any progress, as the examples provided don't really fit my problem well.
Your first problem is actually getting your Python code to run on all CPU cores!
Python is not fast, and this is pretty much by design. More accurately, the design of Python emphasizes other qualities. Multi-threading is fairly hard in general, and Python can't make it easy due to those design constraints. A pity, because modern CPUs are highly parallel. In your case, there's a lucky escape: your problem is also highly parallel. You can just divide those 500,000 videos over the CPU cores. Each core runs a copy of the Python script over its own share of the input. Even a quad-core would process 4 x 125,000 files using that strategy.
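A rough sketch of that strategy with the standard library (the detector body is just a stand-in here, and the "videos/*.mp4" layout is assumed):

import multiprocessing as mp
from pathlib import Path

def process_video(path):
    # Stand-in for the real face/lip detector; replace this body with your function
    # and return (or save) the 3D pixel array for the video at `path`.
    return path.name, path.stat().st_size

if __name__ == "__main__":
    videos = sorted(Path("videos").glob("*.mp4"))
    with mp.Pool() as pool:                        # one worker process per CPU core by default
        for name, result in pool.imap_unordered(process_video, videos, chunksize=16):
            print(name, result)                    # or write each result to disk as it arrives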
As for the GPU, that's not going to help much with plain Python code. Python simply doesn't know how to send data to the GPU, send commands to it, or get results back. Some Python extensions can use the GPU, such as TensorFlow, but they use the GPU for their own internal purposes, not to run Python code.
I am trying to run a very resource-intensive Python program which processes text with NLP methods for several classification tasks.
The program takes several days to run, so I am trying to allocate more resources to it. However, I don't really understand whether I did the right thing, because with my new allocation the Python code is not significantly faster.
Here is some information about my notebook:
I have a notebook running Windows 10 with an Intel Core i7 with 4 cores (8 logical processors) @ 2.5 GHz and 32 GB of physical memory.
What I did:
I changed some parameters in the vmoptions file, so that it looks like this now:
-Xms30g
-Xmx30g
-Xmn30g
-Xss128k
-XX:MaxPermSize=30g
-XX:ParallelGCThreads=20
-XX:ReservedCodeCacheSize=500m
-XX:+UseConcMarkSweepGC
-XX:SoftRefLRUPolicyMSPerMB=50
-ea
-Dsun.io.useCanonCaches=false
-Djava.net.preferIPv4Stack=true
-XX:+HeapDumpOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
My problem:
However, as I said, my code is not running significantly faster. On top of that, when I open the Task Manager I can see that PyCharm uses nearly 80% of the memory but 0% of the CPU, while Python uses 20% of the CPU and 0% of the memory.
My question:
What do I need to do so that my Python code runs faster?
Is it possible that I need to allocate more CPU to PyCharm or Python?
What is the connection between the memory allocated to PyCharm and the runtime of the Python interpreter?
Thank you very much =)
You cannot increase CPU usage manually. Try one of these solutions:
Try to rewrite your algorithm to be multi-threaded; then you can use more of your CPU. Note that not all programs can profit from multiple cores. In those cases, a calculation done in steps, where the next step depends on the results of the previous step, will not be faster using more cores. Problems that can be vectorized (applying the same calculation to large arrays of data) can relatively easily be made to use multiple cores, because the individual calculations are independent.
Use numpy. It is an extension written in C that can use optimized linear algebra libraries like ATLAS. It can speed up numerical calculations significantly compared to standard Python (a toy comparison is sketched below).
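For illustration, the same arithmetic written as a pure-Python loop and as a numpy call:

import numpy as np

def slow_sum_of_squares(values):
    # Pure-Python loop: one interpreted multiply-add per element.
    total = 0.0
    for v in values:
        total += v * v
    return total

def fast_sum_of_squares(values):
    # The same calculation pushed down into numpy's compiled C routines.
    arr = np.asarray(values, dtype=np.float64)
    return float(np.dot(arr, arr))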
You can adjust the number of CPU cores to be used by the IDE when running the active tasks (for example, indexing header files, updating symbols, and so on) in order to keep the performance properly balanced between AppCode and other applications running on your machine.
Use this link.
I am trying to run Python code on my NVIDIA GPU, and googling seemed to tell me that numbapro was the module I was looking for. However, according to this, numbapro is no longer continued but has been moved into the numba library. I tried out numba, and its @jit decorator does seem to speed up some of my code very much. However, as I read up on it more, it seems to me that jit simply compiles your code at run time and, in doing so, does some heavy optimization, hence the speed-up.
This is further reinforced by the fact that jit does not seem to speed up already-optimized numpy operations such as numpy.dot etc.
Am I getting confused and way off the track here? What exactly does jit do? And if it does not make my code run on the GPU, how else do I do it?
You have to specifically tell Numba to target the GPU, either via a ufunc:
http://numba.pydata.org/numba-doc/latest/cuda/ufunc.html
or by programming your functions in a way that explicitly takes the GPU into account:
http://numba.pydata.org/numba-doc/latest/cuda/examples.html
http://numba.pydata.org/numba-doc/latest/cuda/index.html
The plain jit function does not target the GPU and will typically not speed up calls to things like np.dot. Typically Numba excels where you can either avoid creating intermediate temporary numpy arrays, or where the code you are writing is hard to express in a vectorized fashion to begin with.
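For example, a minimal sketch of the ufunc route from the first link (assuming an NVIDIA GPU and a working CUDA toolkit; the function itself is just a toy):

import numpy as np
from numba import vectorize

# Compiles an element-wise function into a GPU ufunc; for plain numpy inputs,
# Numba handles the host<->device transfers automatically.
@vectorize(["float32(float32, float32)"], target="cuda")
def scaled_add(a, x):
    return 2.0 * a + x

a = np.arange(1_000_000, dtype=np.float32)
x = np.ones_like(a)
result = scaled_add(a, x)    # executed on the GPU, returned as a numpy array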