Why is numba not speeding up the following piece of code?

Why is numba not speeding up the following piece of code? - python

Why is numba not speeding up the following piece of code?
#jit(nopython=True)
def sort(x):
for i in range(1000):
np.sort(x)
I thought numba was made for these sorts of tasks, where you have for loops combined with numpy operations. Yet this jitted function is 2-3x slower than the pure Python variant (i.e. the same function but without the jit), and yes I have run it after it was compiled.
Am I doing something wrong?
EDIT:
Size of x and data-type is dtype = int32 AND float64 (I tried both), len = 5000.

The performance of the Numba implementation is not mean to be faster with relatively big array (eg. > 1024). Indeed, both Numba and Numpy use a compiled sorting algorithm as Numba does (except Numba use a JIT). Numba an only be better here for small arrays because it can mostly remove the overhead of calling a Numpy function from the CPython interpreter (and performing many input checks). The running time is dominated by the time of the sorting calls and not the overhead of the loop for an array of size=5000 (see below).
Besides this, both implementation appear to use slightly different algorithm implementations (at least not the same thresholds). As a result, the two implementations results in different performance. This is dependent of the input array. Some sorting algorithm are fast on some specific kind of distribution where some other sorting algorithm are slow and vice versa for other kind of distribution.
Here is the runtime execution of the two implementation plotted against the array size tested on random arrays on my machine (with 32-bit integers from 0 to 1,000,000,000):
One can see that Numba is faster for small arrays and faster for big ones. When len=5000, the Numba implementation is 50% slower.
Note that you can tune the algorithm used using the parameter kind. Note also that some Numpy optimized implementations use parallelism so that primitives can run faster. In that case, the comparison with the Numba implementation is not fair as Numba should use a sequential implementation (especially if parallel=True is not set). Besides this, this problem appear to be a well known issue and developers are working on it.

I wouldn't expect any performance benefit either. Numba isn't a magic wand that if you just add it you magically get better performance. It does have an overhead that can easily sneak up on you. It helps to understand what exactly numba does. It parses the ast of a python function and compiles it to native code using llvm and for a lot of non-trivial cases, this makes a huge difference because honestly, python sucks at complex math and branching. That is a reasonable drawback for its design choices. Take a look at your code though. It is a numpy sort function inside a for loop. Think logically what optimisation could numba possibly make that could speed this up. Remember that numpy is already damn fast and numba cant really affect that performance. So you have essentially added overhead to the most critical part of your code and hence the loss in performance.

Related

resources on numpy operation complexity?

I'm trying to optimize the performance of some numpy code. The problem is, I'm just blindly replacing code with equivalent code, and profiling to compare the results.
Does numpy provide any resource describing the complexity/performance of its basic operations (either a theoretical analysis, or some practical rules of the thumb), for example
basic operations (eg: sum, multiplication, exponentiation)
different data types (eg: float vs complex)
vectorization/broadcasting as function of the size of the matrix (eg: vectorizing beyond 2 dimensions seems to make performance worse than equivalent for-loop)

Multithread many FFT operations in Python / NUMBA?

To to a convolution / cross-correlation of different kernels on a 3D NumPy array, I want to calculate many smaller FFTs in parallel. As I found out the #njit(parallel = True) tag of NUMBA does not support the FFT / IFFT functions of SciPy or NumPy.
Is there any chance to calculate several 3D FFTs multi-threaded with NUMBA without having to implement the FFT algorithm myself? Or does the NUMBA parallel = True tag work without the #njit tag? I don't care too much about code compilation, the multithreading part is what I am really interested in.
I know that I could always use Python's build-in modules for multithreading / multiprocessing - but I am wondering if there is a more elegant solution using NUMBA for that purpose?
Tank you in advance for your help and all the best,
Valentin

You cannot parallelize a code (using multiple threads like Numba does) that use any pure-Python type because of the GIL (Global Interpreter Lock). Rewriting your own FFT algorithm will likely be pretty inefficient. Indeed, FFT libraries (typically used by Python libraries) are often very optimized.
The most famous and one of the fastest is the FFTW. It generate an algorithm (possibly at runtime or ahead of time) by assembling small portions of codes regarding the parameters of the algorithm. It beats almost all carefully-optimized human implementations often by a large margin. FFTW support the computation of parallel multidimensional FFTs. Hopefully, there are Python wrappers of the library you can use.
Alternatively, if no Python wrappers are correct, you can write a simple C/C++ function calling the FFTW internally which is itself called from Python. Cython can help to do that quite easily. Note that it seems Numba #njit functions can be mixed with Cython code. This can be useful if your FFT is computed in the middle of a complex Numba #njit code.

Running Python on GPU using numba

I am trying to run python code in my NVIDIA GPU and googling seemed to tell me that numbapro was the module that I am looking for. However, according to this, numbapro is no longer continued but has been moved to the numba library. I tried out numba and it's #jit decorator does seem to speed up some of my code very much. However, as I read up on it more, it seems to me that jit simply compiles your code during run-time and in doing so, it does some heavy optimization and hence the speed-up.
This is further re-enforced by the fact that jit does not seem to speed up the already optimized numpy operations such as numpy.dot etc.
Am I getting confused and way off the track here? What exactly does jit do? And if it does not make my code run on the GPU, how else do I do it?

You have to specifically tell Numba to target the GPU, either via a ufunc:
http://numba.pydata.org/numba-doc/latest/cuda/ufunc.html
or by programming your functions in a way that explicitly takes the GPU into account:
http://numba.pydata.org/numba-doc/latest/cuda/examples.html
http://numba.pydata.org/numba-doc/latest/cuda/index.html
The plain jit function does not target the GPU and will typically not speed-up calls to things like np.dot. Typically Numba excels where you can either avoid creating intermediate temporary numpy arrays or if the code you are writing is hard to write in a vectorized fashion to begin with.

Numpy fft.pack vs FFTW vs Implement DFT on your own

I am currently need to run FFT on 1024 sample points signal. So far I have implementing my own DFT algorithm in python, but it is very slow. If I use the NUMPY fftpack, or even move to C++ and use FFTW, do you guys think it would be better?

If you are implementing the DFFT entirely within Python, your code will run orders of magnitude slower than either package you mentioned. Not just because those libraries are written in much lower-level languages, but also (FFTW in particular) they are written so heavily optimized, taking advantage of cache locality, vector units, and basically every trick in the book, that it would not surprise me if they ran at 10,000x the speed of a naive Python implementation. Even if you are using numpy in your implementation, it will still pale in comparison.
So yes; use numpy's fftpack. If that is not fast enough, you can try the python bindings for FFTW (PyFFTW), but the speedup from fftpack to fftw will not be nearly as dramatic. I really doubt there's a need to drop into C++ just for FFTs - they're sort of the ideal case for Python bindings.

If you need speed, then you want to go for FFTW, check out the pyfftw project.
In order to use processor SIMD instructions, you need to align the data and there is not an easy way of doing so in numpy. Moreover, pyfftw allows you to use true multithreading, so trust me, it will be much faster.

In case you wish to stick to Python (handling and maintaining custom C++ bindings can be time consuming), you have the alternative of using OpenCV's implementation of FFT.
I put together a toy example comparing OpenCV's dft() and numpy's fft2 functions in python (Intel(R) Core(TM) i7-3930K CPU).
samplesFreq_cv2 = [
cv2.dft(samples[iS])
for iS in xrange(nbSamples)]
samplesFreq_np = [
np.fft.fft2(samples[iS])
for iS in xrange(nbSamples)]
Results for sequentially transforming 20000 image patches of varying resolutions from 20x20 to 60x60:
Numpy's fft2: 1.709100 seconds
OpenCV's dft: 0.621239 seconds
This is likely not as fast as binding to a dedicates C++ library like fftw, but it's a rather low-hanging fruit.

Why numpy is 'slow' by itself?

Given the thread here
It seems that numpy is not the most ideal for ultra fast calculation. Does anyone know what overhead we must be aware of when using numpy for numerical calculation?

Well, depends on what you want to do. XOR is, for instance, hardly relevant for someone interested in doing numerical linear algebra (for which numpy is pretty fast, by virtue of using optimized BLAS/LAPACK libraries underneath).
Generally, the big idea behind getting good performance from numpy is to amortize the cost of the interpreter over many elements at a time. In other words, move the loops from python code (slow) into C/Fortran loops somewhere in the numpy/BLAS/LAPACK/etc. internals (fast). If you succeed in that operation (called vectorization) performance will usually be quite good.
Of course, you can obviously get even better performance by dumping the python interpreter and using, say, C++ instead. Whether this approach actually succeeds or not depends on how good you are at high performance programming with C++ vs. numpy, and what operation exactly you're trying to do.

Any time you have an expression like x = a * b + c / d + e, you end up with one temporary array for a * b, one temporary array for c / d, one for one of the sums and finally one allocation for the result. This is a limitation of Python types and operator overloading. You can however do things in-place explicitly using the augmented assignment (*=, +=, etc.) operators and be assured that copies aren't made.
As for the specific reason NumPy performs more slowly in that benchmark, it's hard to tell but it probably has to do with the constant overhead of checking sizes, type-marshaling, etc. that Cython/etc. don't have to worry about. On larger problems you'd probably see it get closer.

I can't really tell, but I'd guess there are two factors:
Perhaps numpy is copying more stuff? weave is often faster when you avoid allocating big temporary arrays, but this shouldn't matter here.
numpy has a bit of overhead used in iterating over (possibly) multidimensional arrays. This overhead would normally be dwarfed by number crunching, but an xor is really really fast, so all that really matters is the overhead.

Your sub-question: a = sin(x), how many roundtrips are there.
The trick is to pass a numpy array to sin(x), then there is only one 'roundtrip' for the whole array, since numpy will return an array of sin-values. There is no python for loop involved in this operation.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.