Running Python on GPU using numba - python

I am trying to run Python code on my NVIDIA GPU, and googling seemed to tell me that numbapro was the module I was looking for. However, according to this, numbapro is no longer continued but has been moved into the numba library. I tried out numba and its @jit decorator does seem to speed up some of my code very much. However, as I read up on it more, it seems to me that jit simply compiles your code at run-time and, in doing so, does some heavy optimization, hence the speed-up.
This is further reinforced by the fact that jit does not seem to speed up already-optimized numpy operations such as numpy.dot etc.
Am I getting confused and way off the track here? What exactly does jit do? And if it does not make my code run on the GPU, how else do I do it?

You have to specifically tell Numba to target the GPU, either via a ufunc:
http://numba.pydata.org/numba-doc/latest/cuda/ufunc.html
or by programming your functions in a way that explicitly takes the GPU into account:
http://numba.pydata.org/numba-doc/latest/cuda/examples.html
http://numba.pydata.org/numba-doc/latest/cuda/index.html
The plain jit decorator does not target the GPU and will typically not speed up calls to things like np.dot. Typically Numba excels where you can either avoid creating intermediate temporary numpy arrays, or where the code you are writing is hard to write in a vectorized fashion to begin with.
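For illustration, here is a minimal sketch of both approaches (not taken from the linked docs; it assumes a CUDA-capable GPU and the CUDA toolkit are installed, and the function names add_gpu / scale_kernel are made up for the example):

import numpy as np
from numba import vectorize, cuda

# GPU ufunc: Numba compiles this elementwise function for the CUDA target.
@vectorize(['float32(float32, float32)'], target='cuda')
def add_gpu(a, b):
    return a + b

# Explicit CUDA kernel: each thread handles one array element.
@cuda.jit
def scale_kernel(arr, factor):
    i = cuda.grid(1)
    if i < arr.size:
        arr[i] *= factor

x = np.arange(1_000_000, dtype=np.float32)
y = np.ones_like(x)
print(add_gpu(x, y)[:5])                 # computed on the GPU, result copied back

d_x = cuda.to_device(x)                  # keep the data on the GPU between kernel calls
threads = 256
blocks = (x.size + threads - 1) // threads
scale_kernel[blocks, threads](d_x, 2.0)  # launch configuration: (blocks, threads per block)
print(d_x.copy_to_host()[:5])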

Related

Is there a function on numba to do matrix inversion on GPU?

My algorithm uses Numba to run a simulation on a GPU, and I need to do a matrix inversion. On the CPU I know how to do this with numpy, but the cost of moving the data to the CPU just for this calculation isn't worth it.
Searching around the net I saw that this might be possible using other libraries (scikit-cuda, cupy, pytorch, among others). But I would like to know whether there is a way to do this with Numba alone, or if I'll have to choose another library.
Most NumPy stuff doesn't mix with Numba CUDA (see the Numba docs on the limited NumPy support in CUDA). The recent Issue 4726 echoes the same sentiment, and a dev suggests CuPy, where you use CuPy arrays on the GPU and CuPy functions to do the work. They mentioned that CuPy arrays are compatible with (CPU) Numba, but as always, you should verify that for your particular use case. CuPy alone is intended for GPU work, so you might end up just using that.
NumPy's functions generally check their argument types for alternative implementations (dispatching), and CuPy functions mirror NumPy functions to take advantage of this. For example, numpy.linalg.inv(myCUPYarray) will end up calling cupy.linalg.inv(myCUPYarray). This helps you "duck-type" your code between NumPy arrays and CuPy arrays.
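A minimal sketch of that idea (assuming CuPy is installed and a CUDA GPU is available; the 40x40 size and variable names are made up):

import numpy as np
import cupy as cp

a_gpu = cp.random.rand(40, 40)     # the matrix lives on the GPU as a CuPy array

inv_gpu = np.linalg.inv(a_gpu)     # dispatches to cupy.linalg.inv (needs NumPy >= 1.17)
inv_direct = cp.linalg.inv(a_gpu)  # the equivalent direct CuPy call

print(type(inv_gpu))               # cupy.ndarray: the result never left the GPU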

Multithread many FFT operations in Python / NUMBA?

To do a convolution / cross-correlation with different kernels on a 3D NumPy array, I want to calculate many smaller FFTs in parallel. As I found out, the @njit(parallel=True) decorator of Numba does not support the FFT / IFFT functions of SciPy or NumPy.
Is there any chance of calculating several 3D FFTs multi-threaded with Numba without having to implement the FFT algorithm myself? Or does the Numba parallel=True option work without the @njit decorator? I don't care too much about code compilation; the multithreading part is what I am really interested in.
I know that I could always use Python's built-in modules for multithreading / multiprocessing, but I am wondering if there is a more elegant solution using Numba for that purpose?
Thank you in advance for your help and all the best,
Valentin
You cannot parallelize code (using multiple threads, as Numba does) that uses any pure-Python type, because of the GIL (Global Interpreter Lock). Rewriting your own FFT algorithm will likely be pretty inefficient: the FFT libraries typically used by Python packages are often very optimized.
The most famous, and one of the fastest, is FFTW. It generates an algorithm (possibly at runtime or ahead of time) by assembling small portions of code according to the parameters of the transform, and it beats almost all carefully optimized hand-written implementations, often by a large margin. FFTW supports the computation of parallel multidimensional FFTs. Fortunately, there are Python wrappers of the library you can use.
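For example, a minimal sketch using the pyFFTW wrapper might look like this (assuming pyFFTW is installed; the array shape and thread count are made up):

import numpy as np
import pyfftw

data = np.random.rand(64, 128, 128).astype(np.complex64)

pyfftw.interfaces.cache.enable()  # cache FFTW plans so repeated transforms of the same shape are cheap

# 3D FFT computed by FFTW itself with 4 threads (no Python-level threading, so no GIL issue)
result = pyfftw.interfaces.numpy_fft.fftn(data, axes=(-3, -2, -1), threads=4)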
Alternatively, if no Python wrapper suits your needs, you can write a simple C/C++ function that calls FFTW internally and is itself called from Python. Cython can help you do that quite easily. Note that it seems Numba @njit functions can be mixed with Cython code. This can be useful if your FFT is computed in the middle of a complex Numba @njit code.

Why is numba not speeding up the following piece of code?

Why is numba not speeding up the following piece of code?
import numpy as np
from numba import jit

@jit(nopython=True)
def sort(x):
    for i in range(1000):
        np.sort(x)
I thought numba was made for these sorts of tasks, where you have for loops combined with numpy operations. Yet this jitted function is 2-3x slower than the pure Python variant (i.e. the same function but without the jit), and yes I have run it after it was compiled.
Am I doing something wrong?
EDIT:
Size of x and data-type is dtype = int32 AND float64 (I tried both), len = 5000.
The Numba implementation is not meant to be faster with relatively big arrays (e.g. > 1024 items). Indeed, both Numba and NumPy use a compiled sorting algorithm (Numba just compiles it with a JIT). Numba can only be better here for small arrays, because it can mostly remove the overhead of calling a NumPy function from the CPython interpreter (and of performing many input checks). For an array of size 5000, the running time is dominated by the sorting calls and not by the overhead of the loop (see below).
Besides this, the two implementations appear to use slightly different sorting algorithms (at least not the same thresholds). As a result, they perform differently, and this depends on the input array: some sorting algorithms are fast on one kind of distribution where others are slow, and vice versa for other kinds of distributions.
Here are the running times of the two implementations plotted against the array size, tested on random arrays on my machine (with 32-bit integers from 0 to 1,000,000,000):
One can see that Numba is faster for small arrays and slower for big ones. With len=5000, the Numba implementation is about 50% slower.
Note that you can tune the algorithm used with the kind parameter of np.sort. Note also that some optimized NumPy implementations use parallelism so that primitives can run faster. In that case, the comparison with the Numba implementation is not fair, as Numba should use a sequential implementation (especially if parallel=True is not set). Besides this, this appears to be a well-known issue and the developers are working on it.
I wouldn't expect any performance benefit either. Numba isn't a magic wand where you just add it and magically get better performance; it has an overhead that can easily sneak up on you. It helps to understand what exactly Numba does: it parses the AST of a Python function and compiles it to native code using LLVM, and for a lot of non-trivial cases this makes a huge difference, because honestly, Python sucks at complex math and branching (a reasonable drawback of its design choices). Take a look at your code though: it is a NumPy sort call inside a for loop. Think logically about what optimisation Numba could possibly make to speed this up. Remember that NumPy is already damn fast and Numba can't really affect that performance. So you have essentially added overhead to the most critical part of your code, hence the loss in performance.
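As a rough sketch of how one might check this (not from the answer above; timings vary by machine, and the point is only that the jitted loop adds overhead around np.sort without making np.sort itself faster):

import timeit
import numpy as np
from numba import jit

def sort_py(x):
    for i in range(1000):
        np.sort(x)

sort_jit = jit(nopython=True)(sort_py)   # the same function, jitted

x = np.random.randint(0, 1_000_000_000, size=5000).astype(np.int32)
sort_jit(x)                              # warm-up call so compilation time is excluded

print("pure Python:", timeit.timeit(lambda: sort_py(x), number=10))
print("numba @jit :", timeit.timeit(lambda: sort_jit(x), number=10))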

GPU parallelization of code that uses numpy functions

I'm trying to parallelize a Python function I wrote to run on multiple GPU cores simultaneously, but it seems like current methods for doing so, such as vectorize and guvectorize from numba, don't allow for anything more sophisticated than simple arithmetic operations in the function (https://github.com/numba/numba/issues/2736).
My question is, is there a package or technique other than numba capable of handling functions that call numpy functions such as numpy.where or numpy.intersect1d?
Completely new to GPU programming here and not sure what the state-of-the-art capabilities are, so sorry if this question seems dumb.
Thanks a lot!

No speed gain with Cython over CPython+NumPy

For a uni assignment I have written a 2D square domain flow solver in MATLAB. To study Python I have converted the MATLAB code to Python. I have used NumPy to do all matrix-vector multiplications and I have used scipy.sparse.linalg.spsolve() to solve Ax=b, where A is 40x40 and sparse.
In the end I wasn't too happy about the speed of the solver. So I used the profiler included in Spyder to track down the bottleneck. It basically turned out that all linear algebra operations are quite fast except for the system solve (using the aforementioned method). No surprise there because solving a system is always more expensive than just multiplying some vectors and matrices.
I turned to Cython to accelerate my solver. I read http://wiki.cython.org/tutorials/numpy and I went haywire by giving each and every variable a static type (yes I know this is not the smartest or most efficient way, but I am in a hurry to see results and will do a proper job thereafter). The only thing I haven't given a static type is the sparse matrix A, because it is a CSR sparse matrix and I do not yet know how to static type it. And yes I know that it is the most crucial part, because profiling showed the system solve to be the bottleneck.
After finally managing to compile everything with Cython the result was exactly the same as without Cython... I do understand that the Cython performance gain would not be great because I did not tackle the bottleneck, however I do not understand why the Cython version did not run even just 1% faster.
Could someone please help me out with benefiting from Cython? How can I make my code run faster? And how should I give a CSR sparse matrix from scipy a static type?
My code can be downloaded using this google drive link:
https://docs.google.com/file/d/0B-nchNKLtgjeWTE4OXZrVUpfcWs/edit?usp=sharing
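For reference, a minimal self-contained sketch of the kind of system solve being described (the matrix values here are random, just to make the snippet runnable):

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

n = 40
A = sp.random(n, n, density=0.1, format='csr') + sp.eye(n, format='csr')  # 40x40 sparse CSR matrix
b = np.ones(n)

x = spsolve(A, b)              # the sparse direct solve identified as the bottleneck
print(np.allclose(A @ x, b))   # spsolve already runs in compiled (SuperLU) code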
Because you didn't tackle the bottleneck.
It sounds to me like all you have done so far is make the method calls into NumPy a bit faster. That will only help if you make a lot of calls to NumPy, but you say that that's not where the bottleneck is for you.
Cython enables you to speed up Python code. It won't help you speed up NumPy code.
Because most NumPy code is already written in C, and C code won't benefit from Cython, of course.
If it runs too slowly, you should suspect your algorithm instead.
Have a look here for comparisons of different ways to speed up Python.
