To to a convolution / cross-correlation of different kernels on a 3D NumPy array, I want to calculate many smaller FFTs in parallel. As I found out the #njit(parallel = True) tag of NUMBA does not support the FFT / IFFT functions of SciPy or NumPy.
Is there any chance to calculate several 3D FFTs multi-threaded with NUMBA without having to implement the FFT algorithm myself? Or does the NUMBA parallel = True tag work without the #njit tag? I don't care too much about code compilation, the multithreading part is what I am really interested in.
I know that I could always use Python's build-in modules for multithreading / multiprocessing - but I am wondering if there is a more elegant solution using NUMBA for that purpose?
Tank you in advance for your help and all the best,
Valentin
You cannot parallelize a code (using multiple threads like Numba does) that use any pure-Python type because of the GIL (Global Interpreter Lock). Rewriting your own FFT algorithm will likely be pretty inefficient. Indeed, FFT libraries (typically used by Python libraries) are often very optimized.
The most famous and one of the fastest is the FFTW. It generate an algorithm (possibly at runtime or ahead of time) by assembling small portions of codes regarding the parameters of the algorithm. It beats almost all carefully-optimized human implementations often by a large margin. FFTW support the computation of parallel multidimensional FFTs. Hopefully, there are Python wrappers of the library you can use.
Alternatively, if no Python wrappers are correct, you can write a simple C/C++ function calling the FFTW internally which is itself called from Python. Cython can help to do that quite easily. Note that it seems Numba #njit functions can be mixed with Cython code. This can be useful if your FFT is computed in the middle of a complex Numba #njit code.
First of all, I've read multiple forums, papers and articles in the subject.
I had not had the need to implement the use of a GPU in my processes, however they have become more robust. The problem is that I have a somewhat complex function created in python, vectorized in numpy and using #jit to make it run faster.
However, my GPU (AMD) does not show use in the task panel (0%). I have seen PyOpenCL, however I want to know if there is something simpler than translating the code. The function is fast, the problem is that I want to use parallel processing to iterate that function 18 million times which currently takes me 5 hours into multiple proceso, I know that I can use multiprocessing on the CPU, but I want to use my GPU, there is some 'easy' way to split the task on the GPU?
We had some discussion, whether Numba can compile code for GPU automatically. I think it was able to, but now this is deprecated way. The other approach is to use #numba.cuda.jit and write code in terms of CUDA blocks, threads and so on. It works well. With it you enter big and fascinating (I am not joking) world of CUDA programming. You can maybe parallelize running of your big function with different parameters. Maybe you will even not need to rewrite it itself for this...
I am trying to run python code in my NVIDIA GPU and googling seemed to tell me that numbapro was the module that I am looking for. However, according to this, numbapro is no longer continued but has been moved to the numba library. I tried out numba and it's #jit decorator does seem to speed up some of my code very much. However, as I read up on it more, it seems to me that jit simply compiles your code during run-time and in doing so, it does some heavy optimization and hence the speed-up.
This is further re-enforced by the fact that jit does not seem to speed up the already optimized numpy operations such as numpy.dot etc.
Am I getting confused and way off the track here? What exactly does jit do? And if it does not make my code run on the GPU, how else do I do it?
You have to specifically tell Numba to target the GPU, either via a ufunc:
http://numba.pydata.org/numba-doc/latest/cuda/ufunc.html
or by programming your functions in a way that explicitly takes the GPU into account:
http://numba.pydata.org/numba-doc/latest/cuda/examples.html
http://numba.pydata.org/numba-doc/latest/cuda/index.html
The plain jit function does not target the GPU and will typically not speed-up calls to things like np.dot. Typically Numba excels where you can either avoid creating intermediate temporary numpy arrays or if the code you are writing is hard to write in a vectorized fashion to begin with.
I am about to write some computationally-intensive Python code that'll almost certainly spend most of its time inside numpy's linear algebra functions.
The problem at hand is embarrassingly parallel. Long story short, the easiest way for me to take advantage of that would be by using multiple threads. The main barrier is almost certainly going to be the Global Interpreter Lock (GIL).
To help design this, it would be useful to have a mental model for which numpy operations can be expected to release the GIL for their duration. To this end, I'd appreciate any rules of thumb, dos and don'ts, pointers etc.
In case it matters, I'm using 64-bit Python 2.7.1 on Linux, with numpy 1.5.1 and scipy 0.9.0rc2, built with Intel MKL 10.3.1.
Quite some numpy routines release GIL, so they can be efficiently parallel in threads (info). Maybe you don't need to do anything special!
You can use this question to find whether the routines you need are among the ones that release GIL. In short, search for ALLOW_THREADS or nogil in the source.
(Also note that MKL has the ability to use multiple threads for a routine, so that's another easy way to get parallelism, although possibly not the fastest kind).
You will probably find answers to all your questions regarding NumPy and parallel programming on the official wiki.
Also, have a look at this recipe page -- it contains example code on how to use NumPy with multiple threads.
Embarrassingly parallel? Numpy? Sounds like a good candidate for PyCUDA or PyOpenCL.
I want to compute magnetic fields of some conductors using the Biot–Savart law and I want to use a 1000x1000x1000 matrix. Before I use MATLAB, but now I want to use Python. Is Python slower than MATLAB ? How can I make Python faster?
EDIT:
Maybe the best way is to compute the big array with C/C++ and then transfering them to Python. I want to visualise then with VPython.
EDIT2: Which is better in my case: C or C++?
You might find some useful results at the bottom of this link
http://wiki.scipy.org/PerformancePython
From the introduction,
A comparison of weave with NumPy, Pyrex, Psyco, Fortran (77 and 90) and C++ for solving Laplace's equation.
It also compares MATLAB and seems to show similar speeds to when using Python and NumPy.
Of course this is only a specific example, your application might be allow better or worse performance. There is no harm in running the same test on both and comparing.
You can also compile NumPy with optimized libraries such as ATLAS which provides some BLAS/LAPACK routines. These should be of comparable speed to MATLAB.
I'm not sure if the NumPy downloads are already built against it, but I think ATLAS will tune libraries to your system if you compile NumPy,
http://www.scipy.org/Installing_SciPy/Windows
The link has more details on what is required under the Windows platform.
EDIT:
If you want to find out what performs better, C or C++, it might be worth asking a new question. Although from the link above C++ has best performance. Other solutions are quite close too i.e. Pyrex, Python/Fortran (using f2py) and inline C++.
The only matrix algebra under C++ I have ever done was using MTL and implementing an Extended Kalman Filter. I guess, though, in essence it depends on the libraries you are using LAPACK/BLAS and how well optimised it is.
This link has a list of object-oriented numerical packages for many languages.
http://www.oonumerics.org/oon/
NumPy and MATLAB both use an underlying BLAS implementation for standard linear algebra operations. For some time both used ATLAS, but nowadays MATLAB apparently also comes with other implementations like Intel's Math Kernel Library (MKL). Which one is faster by how much depends on the system and how the BLAS implementation was compiled. You can also compile NumPy with MKL and Enthought is working on MKL support for their Python distribution (see their roadmap). Here is also a recent interesting blog post about this.
On the other hand, if you need more specialized operations or data structures then both Python and MATLAB offer you various ways for optimization (like Cython, PyCUDA,...).
Edit: I corrected this answer to take into account different BLAS implementations. I hope it is now a fair representation of the current situation.
The only valid test is to benchmark it. It really depends on what your platform is, and how well the Biot-Savart Law maps to Matlab or NumPy/SciPy built-in operations.
As for making Python faster, Google's working on Unladen Swallow, a JIT compiler for Python. There are probably other projects like this as well.
As per your edit 2, I recommend very strongly that you use Fortran because you can leverage the available linear algebra subroutines (Lapack and Blas) and it is way simpler than C/C++ for matrix computations.
If you prefer to go with a C/C++ approach, I would use C, because you presumably need raw performance on a presumably simple interface (matrix computations tend to have simple interfaces and complex algorithms).
If, however, you decide to go with C++, you can use the TNT (the Template Numerical Toolkit, the C++ implementation of Lapack).
Good luck.
If you're just using Python (with NumPy), it may be slower, depending on which pieces you use, whether or not you have optimized linear algebra libraries installed, and how well you know how to take advantage of NumPy.
To make it faster, there are a few things you can do. There is a tool called Cython that allows you to add type declarations to Python code and translate it into a Python extension module in C. How much benefit this gets you depends a bit on how diligent you are with your type declarations - if you don't add any at all, you won't see much of any benefit. Cython also has support for NumPy types, though these are a bit more complicated than other types.
If you have a good graphics card and are willing to learn a bit about GPU computing, PyCUDA can also help. (If you don't have an nvidia graphics card, I hear there is a PyOpenCL in the works as well). I don't know your problem domain, but if it can be mapped into a CUDA problem then it should be able to handle your 10^9 elements nicely.
And here is an updated "comparison" between MATLAB and NumPy/MKL based on some linear algebra functions:
http://dpinte.wordpress.com/2010/03/16/numpymkl-vs-matlab-performance/
The dot product is not that slow ;-)
I couldn't find much hard numbers to answer this same question so I went ahead and did the testing myself. The results, scripts, and data sets used are all available here on my post on MATLAB vs Python speed for vibration analysis.
Long story short, the FFT function in MATLAB is better than Python but you can do some simple manipulation to get comparable results and speed. I also found that importing data was faster in Python compared to MATLAB (even for MAT files using the scipy.io).
I would also like to point out that Python (+NumPy) can easily interface with Fortran via the F2Py module, which basically nets you native Fortran speeds on the pieces of code you offload into it.