I am doing some numerical simulations of quantum computation, and I wish to find the eigenvectors of a large Hermitian matrix (~2^14 rows/columns).
I am running on a 24-core/48-thread Xeon machine. The code was originally written with the help of the QuTiP library. I found out that the included eigenstates() function only uses a single thread on my machine, so I am trying to find a faster way to do that.
I tried the scipy.linalg eig() and eigh() functions as well as the scipy.sparse.linalg eigs() and eigsh() functions, but all of them seem slower than the function built into QuTiP.
I've seen some suggestions that I might get some speedup from using slepc4py; however, the documentation of the package seems very lacking. I can't find out how to convert the NumPy complex array to a SLEPc matrix.
A = PETSc.Mat().create()
A[:,:] = B[:,:]
# where B is a scipy array of complex type
TypeError: Cannot cast array data from dtype('complex128') to dtype('float64') according to the rule 'safe'
The eigensolver in QuTiP uses the SciPy eigensolver. How many threads are used depends on the BLAS library that SciPy is linked to, as well as on whether you are using the sparse or dense solver. In the dense case, the eigensolver will use multiple cores if the underlying BLAS takes advantage of them (e.g. Intel MKL). The sparse solver uses mostly sparse matvec operations, which are memory-bandwidth limited and thus most efficient on a single core. If you want all eigenvalues, then you are basically stuck using dense solvers. However, if you need only a few, such as the lowest few eigenstates, then sparse is the way to go.
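To make the dense/sparse distinction concrete, here is a minimal sketch, assuming SciPy is installed. The helper names and the small random test matrix are made up for illustration and are not part of QuTiP. It asks the sparse solver for only the four lowest eigenpairs and uses the dense solver for the full spectrum.

```python
import numpy as np
import scipy.sparse as sp
from scipy.linalg import eigh
from scipy.sparse.linalg import eigsh


def random_sparse_hermitian(n, density=0.05, seed=0):
    """Build a random sparse complex Hermitian matrix (illustrative test data)."""
    re = sp.random(n, n, density=density, random_state=seed)
    im = sp.random(n, n, density=density, random_state=seed + 1)
    m = re + 1j * im
    # Averaging a matrix with its conjugate transpose makes it Hermitian.
    return ((m + m.conj().T) * 0.5).tocsr()


def lowest_eigenpairs_sparse(h, k=4):
    """Return the k algebraically smallest eigenpairs using the sparse solver."""
    vals, vecs = eigsh(h, k=k, which="SA")
    order = np.argsort(vals)
    return vals[order], vecs[:, order]


def all_eigenpairs_dense(h):
    """Return the full spectrum using the dense solver (ascending eigenvalues)."""
    return eigh(h.toarray())


H = random_sparse_hermitian(200)
sparse_vals, sparse_vecs = lowest_eigenpairs_sparse(H, k=4)
dense_vals, dense_vecs = all_eigenpairs_dense(H)
print(sparse_vals)
print(dense_vals[:4])
```

For a matrix of size 2^14 the dense call needs the whole matrix in memory, so which path wins depends on how many eigenpairs you actually need.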
I ended up finding a simpler way to use all the cores; it seems that QuTiP didn't tell MKL to use all of them.
In my Python code, I added:
import ctypes
mkl_rt = ctypes.CDLL('libmkl_rt.so')
mkl_get_max_threads = mkl_rt.mkl_get_max_threads
mkl_rt.mkl_set_num_threads(ctypes.byref(ctypes.c_int(48)))
This forced Intel MKL to use all the cores and gave me a nice speedup.
(answer from question)
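A library-independent variant of the same idea, as a rough sketch: many BLAS builds also read their thread count from environment variables. The helper name below is hypothetical. MKL_NUM_THREADS, OMP_NUM_THREADS and OPENBLAS_NUM_THREADS are the usual variable names, but whether they take effect depends on which BLAS your NumPy/SciPy is linked against.

```python
import os


def set_blas_thread_env(num_threads):
    """Set the common BLAS/OpenMP thread-count variables for this process.

    Call this before importing NumPy/SciPy: the BLAS library typically reads
    these variables when it is loaded, not on every call.
    """
    for var in ("MKL_NUM_THREADS", "OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        os.environ[var] = str(num_threads)


# Ask for as many threads as the machine reports logical CPUs.
set_blas_thread_env(os.cpu_count() or 1)

import numpy as np  # imported only after the variables are set
```

The same variables can be exported in the shell before starting Python, which avoids the import-order concern entirely.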
Related
My algorithm uses Numba to run a simulation on a GPU, and I need to do a matrix inversion. On the CPU I know how to do this with NumPy, but the cost of moving the data to the CPU just for this calculation isn't worth it.
Searching around the net, I saw that this might be possible using other libraries (scikit-cuda, CuPy, PyTorch, among others). But I would like to know whether there is a way to do this with Numba alone, or whether I'll have to choose another library.
Most NumPy stuff doesn't mix with Numba CUDA (Numba docs on the little NumPy support in CUDA). The recent Issue 4726 echoes the same sentiment and a dev suggests CuPy, where you use CuPy arrays on the GPU and CuPy functions to do the work. They mentioned that CuPy arrays are compatible with (CPU) Numba, but as always, you should verify that for your possible use case. CuPy alone is intended for GPU stuff, so you might end up just using that.
NumPy's functions generally check their argument types for alternative implementations (dispatching), and CuPy functions mirror NumPy functions to enable this. For example, numpy.linalg.inv(myCUPYarray) will end up calling cupy.linalg.inv(myCUPYarray). This'll help you "duck-type" your code between NumPy arrays and CuPy arrays.
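A minimal sketch of what such duck-typed code can look like. The function name is made up, and the example only runs it on NumPy arrays. Passing CuPy arrays instead relies on the dispatching described above and is an assumption not tested here.

```python
import numpy as np


def solve_with_inverse(a, b):
    """Solve a @ x = b by explicit inversion, using only NumPy-namespace calls.

    No array is created inside the function, so the input array type decides
    which implementation of inv() and matmul actually runs.
    """
    return np.linalg.inv(a) @ b


# CPU check with plain NumPy arrays.
a = np.array([[4.0, 1.0], [2.0, 3.0]])
b = np.array([1.0, 2.0])
x = solve_with_inverse(a, b)
print(x)
```

Keeping array construction (np.eye, np.zeros, and so on) out of such functions matters, because those calls would always return CPU arrays.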
To do a convolution / cross-correlation of different kernels on a 3D NumPy array, I want to calculate many smaller FFTs in parallel. As I found out, Numba's @njit(parallel=True) decorator does not support the FFT / IFFT functions of SciPy or NumPy.
Is there any chance to calculate several 3D FFTs multi-threaded with Numba without having to implement the FFT algorithm myself? Or does Numba's parallel=True option work without the @njit decorator? I don't care too much about code compilation; the multithreading part is what I am really interested in.
I know that I could always use Python's built-in modules for multithreading / multiprocessing, but I am wondering if there is a more elegant solution using Numba for that purpose.
Thank you in advance for your help and all the best,
Valentin
You cannot parallelize code with multiple threads (as Numba does) if it uses any pure-Python type, because of the GIL (Global Interpreter Lock). Writing your own FFT algorithm will likely be pretty inefficient; the FFT libraries typically used by Python packages are heavily optimized.
The most famous, and one of the fastest, is FFTW. It generates an algorithm (at runtime or ahead of time) by assembling small pieces of code chosen according to the parameters of the transform. It beats almost all carefully hand-optimized implementations, often by a large margin. FFTW supports parallel multidimensional FFTs. Fortunately, there are Python wrappers of the library you can use.
Alternatively, if no Python wrapper is suitable, you can write a simple C/C++ function that calls FFTW internally and is itself called from Python. Cython makes this quite easy. Note that it seems Numba @njit functions can be mixed with Cython code, which can be useful if your FFT is computed in the middle of complex Numba @njit code.
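If installing an FFTW wrapper is not an option, one alternative not mentioned above is the workers argument of scipy.fft, which spreads a batch of transforms over threads. This is a sketch under that assumption; it swaps in SciPy's own FFT backend rather than FFTW or Numba, and the function name and array sizes are illustrative.

```python
import numpy as np
import scipy.fft


def batched_fft3(volumes, workers=-1):
    """FFT each 3D volume of a (batch, z, y, x) array over its last three axes.

    workers=-1 asks scipy.fft to use as many threads as there are CPUs.
    """
    return scipy.fft.fftn(volumes, axes=(1, 2, 3), workers=workers)


rng = np.random.default_rng(0)
volumes = rng.standard_normal((8, 16, 16, 16))
spectra = batched_fft3(volumes)
print(spectra.shape)
```

Whether this beats a Python-level thread pool depends on the sizes involved; for many tiny transforms the batching itself is usually what helps.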
So I'm trying to help another SO user, and in the process I can't get a Cython program to do something simple without going through NumPy, which forces me to hold the GIL. That makes using OpenMP (multicore) impossible. Then I came across an interesting post showing that you can import the Fortran libraries (BLAS, LAPACK) from SciPy directly into Cython code; these are installed with NumPy, and in my case they are the Intel MKL equivalents.
All I'm trying to do is multiply a 1000x1 vector by another 1000x1 vector that is transposed, resulting in a 1000x1000 matrix. But I can't find the relevant Fortran routine (equivalent to NumPy multiply) that will do the trick. All the routines seem to do matrix multiplication instead.
So the cool feature now in SciPy is that you can add this to your Cython module:
cimport scipy.linalg.cython_blas as blas
cimport scipy.linalg.cython_lapack as lapack
In theory, then, I started with the Fortran routine dgemm by calling blas.dgemm(options), but it does the matrix product rather than just element-wise multiplication. Does anyone know the Fortran routine that will do the simple multiplication of two 1000x1 vectors, one transposed, resulting in a 1000x1000 matrix? And if you can add the input syntax, that would be great. I'm passing C-contiguous memory views to the function, i.e. [::1] Cython NumPy vectors.
What you're describing is a pure NumPy feature called "broadcasting". These broadcasting operations are done using C (or Cython) code. You can always access these in Cython through the Python C API such as PyNumber_Multiply (although they probably won't release the GIL) but normal multiplication in Cython should delegate to that function anyway so you normally don't need to call (or import) it directly.
BLAS/LAPACK is mostly used for linear algebra, and even if you could "use" a function exposed there for this purpose, it won't (normally) be the same one NumPy uses.
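For reference, the operation in the question is an outer product. A small sketch of the broadcasting form next to np.outer (the variable names are illustrative). In BLAS terms the nearest routine is the rank-1 update dger, which computes A := alpha*x*y^T + A, but it is not what NumPy calls for this.

```python
import numpy as np

x = np.arange(1000, dtype=np.float64)
y = np.linspace(0.0, 1.0, 1000)

# Broadcasting: a (1000, 1) column times a (1, 1000) row gives (1000, 1000).
outer_broadcast = x[:, None] * y[None, :]

# The same result through the dedicated helper.
outer_np = np.outer(x, y)

print(outer_broadcast.shape)
```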
I compute the dot product as follows:
import numpy as np
A = np.random.randn(80000, 3000)
B = np.random.randn(3000, 50)
C = np.dot(A, B)
Running this script takes about 9 seconds:
Mac@MacBook-Pro:~/python_dot_product$ time python dot.py
real 0m9.042s
user 0m10.927s
sys 0m0.911s
Could I do any better?
Does numpy already use the ideal balance for the cores?
The last two answers to this SO question should be helpful.
The last one pointed me to SciPy documentation, which includes this quote:
"[np.dot(A,B) is evaluated using BLAS, which] will normally be a
library carefully tuned to run as fast as possible on your hardware by
taking advantage of cache memory and assembler implementation. But
many architectures now have a BLAS that also takes advantage of a
multicore machine. If your numpy/scipy is compiled using one of these,
then dot() will be computed in parallel (if this is faster) without
you doing anything."
So it sounds like it depends on your specific hardware and SciPy compilation. Sometimes np.dot(A,B) will utilize your multiple cores/processors, sometimes it might not.
To find out which case is yours, I suggest running your toy example (with larger matrices) while you have your system monitor open, so you can see whether just one CPU spikes in activity, or if multiple ones do.
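The same check can be done from inside Python, as a rough sketch: time.process_time() adds up CPU time across all threads of the process, so CPU time clearly above wall-clock time suggests that the BLAS behind np.dot ran on several cores. The function name and matrix sizes are illustrative.

```python
import time

import numpy as np


def time_dot(m=2000, k=1000, n=200, seed=0):
    """Time np.dot alone and return (result shape, wall seconds, CPU seconds)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((m, k))
    b = rng.standard_normal((k, n))
    wall_start, cpu_start = time.perf_counter(), time.process_time()
    c = np.dot(a, b)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return c.shape, wall, cpu


shape, wall, cpu = time_dot()
print(f"shape={shape} wall={wall:.4f}s cpu={cpu:.4f}s")
```

Timing only the np.dot call also keeps the random-number generation out of the measurement, which the `time python dot.py` figure includes. np.show_config() prints which BLAS/LAPACK NumPy was built against.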
I currently need to run an FFT on a signal of 1024 sample points. So far I have implemented my own DFT algorithm in Python, but it is very slow. If I use NumPy's fftpack, or even move to C++ and use FFTW, do you think it would be better?
If you are implementing the DFT entirely within Python, your code will run orders of magnitude slower than either package you mentioned. That is not just because those libraries are written in much lower-level languages, but also because they (FFTW in particular) are so heavily optimized, taking advantage of cache locality, vector units, and basically every trick in the book, that it would not surprise me if they ran at 10,000x the speed of a naive Python implementation. Even if you are using NumPy in your implementation, it will still pale in comparison.
So yes; use numpy's fftpack. If that is not fast enough, you can try the python bindings for FFTW (PyFFTW), but the speedup from fftpack to fftw will not be nearly as dramatic. I really doubt there's a need to drop into C++ just for FFTs - they're sort of the ideal case for Python bindings.
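To see the gap on your own machine, here is a small sketch that checks a direct O(N^2) DFT in pure Python against np.fft.fft on a 1024-point signal. The naive_dft function is written for this comparison and is not the asker's code; wrap the two calls in a timer of your choice to measure the difference.

```python
import cmath

import numpy as np


def naive_dft(signal):
    """Direct O(N^2) DFT in pure Python, for comparison only."""
    n = len(signal)
    # Precompute the n-th roots of unity exp(-2*pi*i*m/n).
    roots = [cmath.exp(-2j * cmath.pi * m / n) for m in range(n)]
    return [
        sum(signal[t] * roots[(k * t) % n] for t in range(n))
        for k in range(n)
    ]


rng = np.random.default_rng(0)
signal = rng.standard_normal(1024)

slow = np.array(naive_dft(signal.tolist()))
fast = np.fft.fft(signal)
print(np.allclose(slow, fast))
```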
If you need speed, then you want to go for FFTW, check out the pyfftw project.
In order to use processor SIMD instructions, you need to align the data and there is not an easy way of doing so in numpy. Moreover, pyfftw allows you to use true multithreading, so trust me, it will be much faster.
In case you wish to stick to Python (handling and maintaining custom C++ bindings can be time consuming), you have the alternative of using OpenCV's implementation of FFT.
I put together a toy example comparing OpenCV's dft() and numpy's fft2 functions in python (Intel(R) Core(TM) i7-3930K CPU).
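import cv2
import numpy as np

# samples: sequence of nbSamples 2D image patches (cv2.dft needs floating-point input)
samplesFreq_cv2 = [
    cv2.dft(samples[iS])
    for iS in range(nbSamples)]
samplesFreq_np = [
    np.fft.fft2(samples[iS])
    for iS in range(nbSamples)]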
Results for sequentially transforming 20000 image patches of varying resolutions from 20x20 to 60x60:
Numpy's fft2: 1.709100 seconds
OpenCV's dft: 0.621239 seconds
This is likely not as fast as binding to a dedicated C++ library like FFTW, but it's rather low-hanging fruit.