Fastest way to compute matrix dot product

Fastest way to compute matrix dot product - python

I compute the dot product as follows:
import numpy as np
A = np.random.randn(80000, 3000)
B = np.random.randn(3000, 50)
C = np.dot(A, B)
Running this script takes about 9 seconds:
Mac#MacBook-Pro:~/python_dot_product$ time python dot.py
real 0m9.042s
user 0m10.927s
sys 0m0.911s
Could I do any better?
Does numpy already use the ideal balance for the cores?

The last two answers at this SO answer should be helpful.
The last one pointed me to SciPy documentation, which includes this quote:
"[np.dot(A,B) is evaluated using BLAS, which] will normally be a
library carefully tuned to run as fast as possible on your hardware by
taking advantage of cache memory and assembler implementation. But
many architectures now have a BLAS that also takes advantage of a
multicore machine. If your numpy/scipy is compiled using one of these,
then dot() will be computed in parallel (if this is faster) without
you doing anything."
So it sounds like it depends on your specific hardware and SciPy compilation. Sometimes np.dot(A,B) will utilize your multiple cores/processors, sometimes it might not.
To find out which case is yours, I suggest running your toy example (with larger matrices) while you have your system monitor open, so you can see whether just one CPU spikes in activity, or if multiple ones do.

Related

Speed up N-d array dot

I'm trying to do some matrix computations as
def get_P(X,Z):
n_sample,n_m,n_t,n_f = X.shape
res = np.zeros((n_sample,n_m,n_t,n_t))
for i in range(n_sample):
res[i,:,:,:] = np.dot(X[i,:,:,:],Z[i,:,:])
return res
Because the size of X and Z is large, it takes more than 600ms to compute one np.dot, and I have 10k rows in X.
Is there anyway we can speed it up?

Well, there might be some avoidable overhead posed by your zero initialization (which gets overwritten right away): Just use np.ndarray instead.
Other than that: numpy is fairly well-optimized. Probably you can speed things up if you used dtype=numpy.float32 instead of the default 64-bit floating point numbers for your X, Z and res – but that's about it. Dot products are mostly spending time going linear through RAM and multiplying and summing numbers – things that numpy, compilers and CPUs are radically good at these days.
Note that numpy will only use one CPU core at a time in its default configuration - it might make sense to parallelize; for example, if you've got 16 CPU cores, you'd make 16 separate res partitions and calculate subsets of your range(n_sample) dot products on each core; python does bring the multithreading / async facilities to do so – you'll find plenty of examples, and explaining how would lead too far.
If you can spend the development time, and need massive amounts of data, so that this pays: you can use e.g. GPUs to multiply matrices; these are really good at that, and cuBLASlt (GEMM) is an excellent implementation, but honestly, you'd mostly be abandoning Numpy and would need to work things out yourself – in C/C++.

You can use numpy einsum to do this multiplication in one vectorized step.
It will be much faster than this loop based dot product. For examples, check this link https://rockt.github.io/2018/04/30/einsum

Scipy minimize function seems to be creating multiple threads by itself?

I am using the scipy minimize function. The function that it's calling was compiled with Cython and has an underlying C++ implementation that I wrote, but that shouldn't really matter. For some reason when I run my program, it creates as many threads as it can to fill all my cpus. For example if I run top I see that 800% of a cpu is being used or on htop I can see that 8 individual processors are being used, when I only created the program to be run on one. I didn't think that scipy even had parallel processing functionality and I can't find any documentation related to this. What could possible be going on and is there any way to control it?

If some BLAS-implementation (with threading-support) is available (default on Ubuntu for example), some expressions like np.dot() (only the dense case as far as i know) will automatically be run in parallel (reference). Another possible example is sparse-matrix factorization with SuperLU.
Of course different minimizers will behave different.
Newton-type methods (core: solve a system of sparse linear-equations) are probably based on SuperLU (if the code is not one of the common old Fortran/C ones, where the whole code is self-contained). CG-type methods are heavily based on matrix-vector products (np.dot; so the dense-case will be parallel).
For some control over this, start with this SO question.

Numpy fft.pack vs FFTW vs Implement DFT on your own

I am currently need to run FFT on 1024 sample points signal. So far I have implementing my own DFT algorithm in python, but it is very slow. If I use the NUMPY fftpack, or even move to C++ and use FFTW, do you guys think it would be better?

If you are implementing the DFFT entirely within Python, your code will run orders of magnitude slower than either package you mentioned. Not just because those libraries are written in much lower-level languages, but also (FFTW in particular) they are written so heavily optimized, taking advantage of cache locality, vector units, and basically every trick in the book, that it would not surprise me if they ran at 10,000x the speed of a naive Python implementation. Even if you are using numpy in your implementation, it will still pale in comparison.
So yes; use numpy's fftpack. If that is not fast enough, you can try the python bindings for FFTW (PyFFTW), but the speedup from fftpack to fftw will not be nearly as dramatic. I really doubt there's a need to drop into C++ just for FFTs - they're sort of the ideal case for Python bindings.

If you need speed, then you want to go for FFTW, check out the pyfftw project.
In order to use processor SIMD instructions, you need to align the data and there is not an easy way of doing so in numpy. Moreover, pyfftw allows you to use true multithreading, so trust me, it will be much faster.

In case you wish to stick to Python (handling and maintaining custom C++ bindings can be time consuming), you have the alternative of using OpenCV's implementation of FFT.
I put together a toy example comparing OpenCV's dft() and numpy's fft2 functions in python (Intel(R) Core(TM) i7-3930K CPU).
samplesFreq_cv2 = [
cv2.dft(samples[iS])
for iS in xrange(nbSamples)]
samplesFreq_np = [
np.fft.fft2(samples[iS])
for iS in xrange(nbSamples)]
Results for sequentially transforming 20000 image patches of varying resolutions from 20x20 to 60x60:
Numpy's fft2: 1.709100 seconds
OpenCV's dft: 0.621239 seconds
This is likely not as fast as binding to a dedicates C++ library like fftw, but it's a rather low-hanging fruit.

why isn't numpy.mean multithreaded?

I've been looking for ways to easily multithread some of my simple analysis code since I had noticed numpy it was only using one core, despite the fact that it is supposed to be multithreaded.
I know that numpy is configured for multiple cores, since I can see tests using numpy.dot use all my cores, so I just reimplemented mean as a dot product, and it runs way faster. Is there some reason mean can't run this fast on its own? I find similar behavior for larger arrays, although the ratio is close to 2 than the 3 shown in my example.
I've been reading a bunch of posts on similar numpy speed issues, and apparently its way more complicated than I would have thought. Any insight would be helpful, I'd prefer to just use mean since it's more readable and less code, but I might switch to dot based means.
In [27]: data = numpy.random.rand(10,10)
In [28]: a = numpy.ones(10)
In [29]: %timeit numpy.dot(data,a)/10.0
100000 loops, best of 3: 4.8 us per loop
In [30]: %timeit numpy.mean(data,axis=1)
100000 loops, best of 3: 14.8 us per loop
In [31]: numpy.dot(data,a)/10.0 - numpy.mean(data,axis=1)
Out[31]:
array([ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 1.11022302e-16, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
-1.11022302e-16])

I've been looking for ways to easily multithread some of my simple analysis code since I had noticed numpy it was only using one core, despite the fact that it is supposed to be multithreaded.
Who says it's supposed to be multithreaded?
numpy is primarily designed to be as fast as possible on a single core, and to be as parallelizable as possible if you need to do so. But you still have to parallelize it.
In particular, you can operate on independent sub-objects at the same time, and slow operations release the GIL when possible—although "when possible" may not be nearly enough. Also, numpy objects are designed to be shared or passed between processes as easily as possible, to facilitate using multiprocessing.
There are some specialized methods that are automatically parallelized, but most of the core methods are not. In particular, dot is implemented on top of BLAS when possible, and BLAS is automatically parallelized on most platforms, but mean is implemented in plain C code.
See Parallel Programming with numpy and scipy for details.
So, how do you know which methods are parallelized and which aren't? And, of those which aren't, how do you know which ones can be nicely manually-threaded and which need multiprocessing?
There's no good answer to that. You can make educated guesses (X seems like it's probably implemented on top of ATLAS, and my copy of ATLAS is implicitly threaded), or you can read the source.
But usually, the best thing to do is try it and test. If the code is using 100% of one core and 0% of the others, add manual threading. If it's now using 100% of one core and 10% of the others and barely running faster, change the multithreading to multiprocessing. (Fortunately, Python makes this pretty easy, especially if you use the Executor classes from concurrent.futures or the Pool classes from multiprocessing. But you still often need to put some thought into it, and test the relative costs of sharing vs. passing if you have large arrays.)
Also, as kwatford points out, just because some method doesn't seem to be implicitly parallel doesn't mean it won't be parallel in the next version of numpy, or the next version of BLAS, or on a different platform, or even on a machine with slightly different stuff installed on it. So, be prepared to re-test. And do something like my_mean = numpy.mean and then use my_mean everywhere, so you can just change one line to my_mean = pool_threaded_mean.

Basically, because the BLAS library has an optimized dot product that they can easily call for dot that is inherently parallel. They admit they could extend numpy to parallelize other operations, but opted not to go that route. However, they give several tips on how to parallelize your numpy code (basically to divide work among N cores (e.g., N=4), split your array into N sub-arrays and send jobs for each sub-array to its own thread and then combine your results).
See http://wiki.scipy.org/ParallelProgramming :
Use parallel primitives
One of the great strengths of numpy is that you can express array operations very cleanly. For example to compute the product of the matrix A and the matrix B, you just do:
>>> C = numpy.dot(A,B)
Not only is this simple and clear to read and write, since numpy knows you want to do a matrix dot product it can use an optimized implementation obtained as part of "BLAS" (the Basic Linear Algebra Subroutines). This will normally be a library carefully tuned to run as fast as possible on your hardware by taking advantage of cache memory and assembler implementation. But many architectures now have a BLAS that also takes advantage of a multicore machine. If your numpy/scipy is compiled using one of these, then dot() will be computed in parallel (if this is faster) without you doing anything. Similarly for other matrix operations, like inversion, singular value decomposition, determinant, and so on. For example, the open source library ATLAS allows compile time selection of the level of parallelism (number of threads). The proprietary MKL library from Intel offers the possibility to chose the level of parallelism at runtime. There is also the GOTO library that allow run-time selection of the level of parallelism. This is a commercial product but the source code is distributed free for academic use.
Finally, scipy/numpy does not parallelize operations like
>>> A = B + C
>>> A = numpy.sin(B)
>>> A = scipy.stats.norm.isf(B)
These operations run sequentially, taking no advantage of multicore machines (but see below). In principle, this could be changed without too much work. OpenMP is an extension to the C language which allows compilers to produce parallelizing code for appropriately-annotated loops (and other things). If someone sat down and annotated a few core loops in numpy (and possibly in scipy), and if one then compiled numpy/scipy with OpenMP turned on, all three of the above would automatically be run in parallel. Of course, in reality one would want to have some runtime control - for example, one might want to turn off automatic parallelization if one were planning to run several jobs on the same multiprocessor machine.

Python eigenvalue computations run much slower than those of MATLAB on my computer. Why?

I would like to compute the eigenvalues of large-ish matrices (about 1000x1000) using Python 2.6.5. I have been unable to do so quickly. I have not found any other threads addressing this question.
When I run
a = rand(1000,1000);
tic;
for i =1:10
eig(a);
end
toc;
in MATLAB it takes about 30 seconds. A similar test in Python requires 216 seconds. Running it through R using RPy did not speed up the computation appreciably. A test in Octave took 93 seconds. I am a bit baffled at the difference in speed.
The only instance of a question like this one I can find online is this, which is several years old. The poster in that question has a different Python directory structure (which I attribute to the age of the post, although I could be mistaken), so I have not been confident enough to attempt to follow the instructions posted by the correspondent.
My package manager says that I have LAPACK installed, and I am using NumPy and SciPy for the Python calculations:
from numpy import *
from scipy import *
from numpy.linalg import *
import time
a = randn(1000,1000)
tic = time.clock()
for i in range(0,10):
eig(a)
toc = time.clock()
print "Elapsed time is ", toc-tic
I am pretty new to Python, so I may have done something silly. Please let me know if I need to provide any more information.

I think what you're seeing is the difference between the Intel Math Kernel Library (MKL) that's being used by Matlab and whatever LAPACK implementation you have on your system (ATLAS, maybe?) that scipy is linked against. You can see how much faster the MKL is in these benchmarks.
I imagine that you would get much better performance if you could rebuild Scipy against the Intel MKL libraries. If you're using Windows, pre-built copies can be downloaded from here, or you might consider using something like the Enthought Python Distribution.

I do get a difference in timings, but not as drastic as yours. My MATLAB (R2010b) timing was ~25 seconds and python (2.7) timing was ~60 seconds.
I'm not really surprised by these numbers as MATLAB is solely a numerical and matrix manipulation language, and it has the advantage of its JIT accelerator over python, which is a general purpose language. Generally, the differences between MATLAB and python+numpy are quite small, but become apparent when the matrix size is large, as in your case.
That doesn't mean there aren't ways to improve python's performance. The PerformancePython article on scipy's website gives a good introduction to the different ways in which you can improve the performance of python.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.