My algorithm uses Numba to run a simulation on a GPU, and I need to do a matrix inversion. On the CPU I know how to do this with NumPy, but the cost of moving the data to the CPU just for this calculation isn't worth it.
Searching around the net, I saw that this might be possible using other libraries (scikit-cuda, CuPy, PyTorch, among others). But I would like to know whether there is a way to do this with Numba alone, or whether I'll have to choose another library.
Most NumPy stuff doesn't mix with Numba CUDA (Numba docs on the little NumPy support in CUDA). The recent Issue 4726 echoes the same sentiment and a dev suggests CuPy, where you use CuPy arrays on the GPU and CuPy functions to do the work. They mentioned that CuPy arrays are compatible with (CPU) Numba, but as always, you should verify that for your possible use case. CuPy alone is intended for GPU stuff, so you might end up just using that.
NumPy's functions generally check their argument types for alternative implementations (dispatching), and CuPy functions mirror NumPy functions to enable this. For example, numpy.linalg.inv(myCUPYarray) will end up calling cupy.linalg.inv(myCUPYarray). This'll help you "duck-type" your code between NumPy arrays and CuPy arrays.
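To make the dispatching idea concrete without needing a GPU, here is a minimal sketch using NumPy's `__array_function__` hook. `TracedArray` is a hypothetical stand-in class (not part of NumPy or CuPy) that only wraps a NumPy array and records which NumPy function was routed to it. A GPU library would run its own routine at the marked spot instead.

```python
import numpy as np


class TracedArray:
    """Hypothetical stand-in for a GPU array type; it only wraps a NumPy array."""

    calls = []  # names of the NumPy functions that were dispatched to this class

    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)

    def __array_function__(self, func, types, args, kwargs):
        # NumPy calls this hook instead of running its own implementation.
        TracedArray.calls.append(func.__name__)
        unwrapped = tuple(a.data if isinstance(a, TracedArray) else a for a in args)
        # A real array library would call its own (e.g. GPU) routine here.
        return TracedArray(func(*unwrapped, **kwargs))


m = TracedArray([[2.0, 0.0], [0.0, 4.0]])
m_inv = np.linalg.inv(m)  # routed through TracedArray.__array_function__
```

The call site stays `np.linalg.inv(...)` regardless of the array type, which is what lets the same code be "duck-typed" across array libraries.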
Related
To do a convolution / cross-correlation of different kernels on a 3D NumPy array, I want to calculate many smaller FFTs in parallel. As I found out, Numba's @njit(parallel=True) decorator does not support the FFT / IFFT functions of SciPy or NumPy.
Is there any way to calculate several 3D FFTs multi-threaded with Numba without having to implement the FFT algorithm myself? Or does Numba's parallel=True option work without the @njit decorator? I don't care too much about code compilation; the multithreading part is what I am really interested in.
I know that I could always use Python's built-in modules for multithreading / multiprocessing, but I am wondering if there is a more elegant solution using Numba for that purpose.
Thank you in advance for your help and all the best,
Valentin
You cannot parallelize code with multiple threads (as Numba does) if it uses any pure-Python type, because of the GIL (Global Interpreter Lock). Writing your own FFT algorithm will likely be pretty inefficient: the FFT libraries that Python packages typically rely on are often heavily optimized.
The most famous, and one of the fastest, is FFTW. It generates an algorithm (possibly at runtime or ahead of time) by assembling small pieces of code according to the parameters of the transform. It beats almost all carefully optimized hand-written implementations, often by a large margin. FFTW supports the computation of parallel multidimensional FFTs. Fortunately, there are Python wrappers of the library you can use.
Alternatively, if no Python wrapper is suitable, you can write a simple C/C++ function that calls FFTW internally and is itself called from Python. Cython can help to do that quite easily. Note that it seems Numba @njit functions can be mixed with Cython code. This can be useful if your FFT is computed in the middle of complex Numba @njit code.
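As a point of comparison, here is a minimal sketch of the "plain Python threads" route mentioned in the question, using only the standard library and NumPy rather than FFTW. The block shapes, block count, and worker count are made-up illustration values. Whether this gives a real speedup depends on whether the underlying FFT routine releases the GIL while it runs, which this sketch does not measure.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical workload: 16 independent 3D blocks of shape 32x32x32.
blocks = [rng.standard_normal((32, 32, 32)) for _ in range(16)]

# Each task is a single call into compiled FFT code, one block per task.
with ThreadPoolExecutor(max_workers=4) as pool:
    spectra_threaded = list(pool.map(np.fft.fftn, blocks))

# Sequential reference, used only to check that the results agree.
spectra_serial = [np.fft.fftn(block) for block in blocks]
```

The same structure should carry over if the `np.fft.fftn` call is replaced by a transform from an FFTW wrapper.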
I am calculating Fourier transforms with TensorFlow using tf.signal.fft. I have successfully installed tensorflow-gpu and have the right drivers and versions for my code to actually use my CUDA-enabled GPU. Indeed, I can check that the GPU is being used (although its utilization is always at about 1-2%, while its memory is usually at 80%).
I am solving a partial differential equation with the Fourier split-step method where each time increment looks like psi(t+dt) = InverseFourier [ potential(t) * Fourier( psi(t) ) ].
While InverseFourier and Fourier are TensorFlow methods, the potential is just a NumPy array that also needs calculating at each step. My doubt now is: does this NumPy calculation actually run on the CPU? If so, before the GPU computation can be carried out, the array must be moved from RAM to the GPU memory. Maybe this causes an overhead and hence a time delay?
Am I completely wrong? Is there a way to check for overhead times? Should I just do everything with tensorflow functions?
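One way to see where the time goes is to time the potential calculation and the transform pair separately. Below is a CPU-only NumPy sketch of a single step, assuming a made-up `potential(t)` and initial state (both placeholders, not taken from the question). It says nothing about the host-to-GPU transfer itself, but the same timing pattern could be wrapped around the TensorFlow calls.

```python
import time

import numpy as np

n = 1024
x = np.linspace(-10.0, 10.0, n)
psi = np.exp(-x**2).astype(np.complex128)  # hypothetical initial state


def potential(t):
    # Placeholder factor; the real potential(t) comes from the PDE being solved.
    return np.exp(-1j * 0.01 * t * x**2)


def step(psi, t):
    # psi(t+dt) = InverseFourier[ potential(t) * Fourier(psi(t)) ]
    return np.fft.ifft(potential(t) * np.fft.fft(psi))


# Time the potential calculation on its own ...
start = time.perf_counter()
pot = potential(0.5)
t_potential = time.perf_counter() - start

# ... and the forward/inverse transform pair on its own.
start = time.perf_counter()
psi_next = np.fft.ifft(pot * np.fft.fft(psi))
t_transforms = time.perf_counter() - start
```

Comparing `t_potential` with `t_transforms` over many steps gives a rough idea of whether the potential calculation is worth moving next to the transforms.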
I am trying to run Python code on my NVIDIA GPU, and googling seemed to tell me that numbapro was the module I was looking for. However, according to this, numbapro is no longer continued but has been moved to the numba library. I tried out numba, and its @jit decorator does seem to speed up some of my code very much. However, as I read up on it more, it seems to me that jit simply compiles your code at run time and, in doing so, does some heavy optimization, hence the speed-up.
This is further reinforced by the fact that jit does not seem to speed up already optimized numpy operations such as numpy.dot.
Am I getting confused and way off the track here? What exactly does jit do? And if it does not make my code run on the GPU, how else do I do it?
You have to specifically tell Numba to target the GPU, either via a ufunc:
http://numba.pydata.org/numba-doc/latest/cuda/ufunc.html
or by programming your functions in a way that explicitly takes the GPU into account:
http://numba.pydata.org/numba-doc/latest/cuda/examples.html
http://numba.pydata.org/numba-doc/latest/cuda/index.html
The plain jit function does not target the GPU and will typically not speed up calls to things like np.dot. Typically Numba excels where you can avoid creating intermediate temporary NumPy arrays, or where the code you are writing is hard to express in a vectorized fashion to begin with.
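A small sketch of the "avoid temporaries" case: a fused loop compiled with Numba's `njit`, next to the vectorized NumPy expression it replaces. The function name and array sizes are made up for illustration, and the `ImportError` fallback is only there so the snippet still runs (slowly, as plain Python) where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs without Numba (then it is plain Python).
    def njit(func):
        return func


@njit
def sum_squared_diff(a, b):
    # One fused loop: no temporary arrays for (a - b) or for its square.
    total = 0.0
    for i in range(a.shape[0]):
        d = a[i] - b[i]
        total += d * d
    return total


a = np.linspace(0.0, 1.0, 10_000)
b = np.linspace(1.0, 0.0, 10_000)

fused = sum_squared_diff(a, b)
vectorized = np.sum((a - b) ** 2)  # allocates two temporary arrays
```

Both versions run on the CPU; targeting the GPU needs the CUDA-specific decorators from the links above.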
I am doing some numerical simulations of quantum computation, and I wish to find the eigenvectors of a big Hermitian matrix (~2^14 rows/columns).
I am running on a 24-core/48-thread Xeon machine. The code was originally written with the help of the QuTiP library. I found out that the included eigenstates() function only utilizes a single thread on my machine, so I am trying to find a faster way to do that.
I tried using the scipy.linalg eig() and eigh() functions as well as scipy.sparse.linalg eigs() and eigsh(), but both seem slower than the function built into QuTiP.
I've seen some suggestions that I might get some speedup from using slepc4py; however, the documentation of the package seems very lacking. I can't find out how to convert the NumPy complex array to a SLEPc matrix.
A = PETSc.Mat().create()
A[:,:] = B[:,:]
# where B is a scipy array of complex type
TypeError: Cannot cast array data from dtype('complex128') to dtype('float64') according to the rule 'safe'
The eigensolver in QuTiP uses the SciPy eigensolver. How many threads are used depends on the BLAS library that SciPy is linked to, as well as whether you are using the sparse or dense solver. In the dense case, the eigensolver will use multiple cores if the underlying BLAS takes advantage of them (e.g. Intel MKL). The sparse solver uses mostly sparse matvec operations, which are memory-bandwidth limited and thus are most efficient using a single core. If you want all eigenvalues, then you are basically stuck using dense solvers. However, if you need only a few, such as the lowest few eigenstates, then sparse is the way to go.
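A minimal sketch of the two routes using SciPy and NumPy directly (not QuTiP's own wrapper). The small complex Hermitian matrix here is made up purely for illustration. The sparse call asks for only the three lowest eigenpairs, while the dense call computes all of them.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

n = 64
# Hypothetical complex Hermitian test matrix: real diagonal plus a weak
# nearest-neighbour coupling (the lower diagonal is the conjugate of the upper).
H = sp.diags(
    [np.full(n - 1, -0.1j), np.arange(n, dtype=complex), np.full(n - 1, 0.1j)],
    offsets=[-1, 0, 1],
    format="csr",
)

# Sparse route: only the 3 algebraically smallest eigenpairs.
vals_sparse, vecs_sparse = eigsh(H, k=3, which="SA")
vals_sparse = np.sort(vals_sparse.real)

# Dense route: all n eigenpairs, eigenvalues returned in ascending order.
vals_dense, vecs_dense = np.linalg.eigh(H.toarray())
```

For a matrix of the size in the question, the dense route needs the full matrix in memory, so the choice also depends on how many eigenstates are actually required.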
I ended up finding a simpler way to use all the cores; it seems QuTiP didn't tell MKL to use all of them.
In my Python code, I added:
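import ctypes
# Load the MKL runtime library (this file name is the Linux one).
mkl_rt = ctypes.CDLL('libmkl_rt.so')
mkl_get_max_threads = mkl_rt.mkl_get_max_threads
# Ask MKL to use up to 48 threads; the count is passed by reference.
mkl_rt.mkl_set_num_threads(ctypes.byref(ctypes.c_int(48)))
# mkl_get_max_threads() can be called afterwards to check the setting.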
This forced Intel MKL to use all the cores and gave me a nice speedup.
(answer from question)
I currently need to run an FFT on a signal of 1024 sample points. So far I have implemented my own DFT algorithm in Python, but it is very slow. If I use NumPy's fftpack, or even move to C++ and use FFTW, do you guys think it would be better?
If you are implementing the DFT entirely within Python, your code will run orders of magnitude slower than either package you mentioned. Not just because those libraries are written in much lower-level languages, but also because they (FFTW in particular) are so heavily optimized, taking advantage of cache locality, vector units, and basically every trick in the book, that it would not surprise me if they ran at 10,000x the speed of a naive Python implementation. Even if you are using numpy in your implementation, it will still pale in comparison.
So yes; use numpy's fftpack. If that is not fast enough, you can try the python bindings for FFTW (PyFFTW), but the speedup from fftpack to fftw will not be nearly as dramatic. I really doubt there's a need to drop into C++ just for FFTs - they're sort of the ideal case for Python bindings.
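For reference, a minimal sketch of what "a naive Python implementation" means here, checked against NumPy's FFT. The test signal and `naive_dft` are made up for illustration, and 256 samples are used instead of 1024 only to keep the naive version quick to run.

```python
import cmath

import numpy as np


def naive_dft(signal):
    # Direct O(N^2) evaluation of X[k] = sum_n x[n] * exp(-2j*pi*k*n/N).
    n_samples = len(signal)
    spectrum = []
    for k in range(n_samples):
        acc = 0j
        for n in range(n_samples):
            acc += signal[n] * cmath.exp(-2j * cmath.pi * k * n / n_samples)
        spectrum.append(acc)
    return spectrum


t = np.arange(256)
signal = np.sin(2 * np.pi * 5 * t / 256)  # exactly 5 cycles over the window

slow = naive_dft(signal.tolist())  # pure-Python double loop
fast = np.fft.fft(signal)          # library FFT, same result
```

Wrapping each call in a timer (for example `time.perf_counter()`) shows the gap on your own machine.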
If you need speed, then you want to go for FFTW; check out the pyfftw project.
To use the processor's SIMD instructions, you need to align the data, and there is no easy way of doing so in numpy. Moreover, pyfftw allows you to use true multithreading, so trust me, it will be much faster.
In case you wish to stick to Python (handling and maintaining custom C++ bindings can be time consuming), you have the alternative of using OpenCV's implementation of FFT.
I put together a toy example comparing OpenCV's dft() and numpy's fft2 functions in python (Intel(R) Core(TM) i7-3930K CPU).
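import cv2
import numpy as np

# samples is assumed to be a list of 2D float32 image patches,
# and nbSamples = len(samples).
samplesFreq_cv2 = [
    cv2.dft(samples[iS])
    for iS in range(nbSamples)]
samplesFreq_np = [
    np.fft.fft2(samples[iS])
    for iS in range(nbSamples)]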
Results for sequentially transforming 20000 image patches of varying resolutions from 20x20 to 60x60:
Numpy's fft2: 1.709100 seconds
OpenCV's dft: 0.621239 seconds
This is likely not as fast as binding to a dedicated C++ library like FFTW, but it's rather low-hanging fruit.