why isn't numpy.mean multithreaded?

why isn't numpy.mean multithreaded? - python

I've been looking for ways to easily multithread some of my simple analysis code since I had noticed numpy it was only using one core, despite the fact that it is supposed to be multithreaded.
I know that numpy is configured for multiple cores, since I can see tests using numpy.dot use all my cores, so I just reimplemented mean as a dot product, and it runs way faster. Is there some reason mean can't run this fast on its own? I find similar behavior for larger arrays, although the ratio is close to 2 than the 3 shown in my example.
I've been reading a bunch of posts on similar numpy speed issues, and apparently its way more complicated than I would have thought. Any insight would be helpful, I'd prefer to just use mean since it's more readable and less code, but I might switch to dot based means.
In [27]: data = numpy.random.rand(10,10)
In [28]: a = numpy.ones(10)
In [29]: %timeit numpy.dot(data,a)/10.0
100000 loops, best of 3: 4.8 us per loop
In [30]: %timeit numpy.mean(data,axis=1)
100000 loops, best of 3: 14.8 us per loop
In [31]: numpy.dot(data,a)/10.0 - numpy.mean(data,axis=1)
Out[31]:
array([ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 1.11022302e-16, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
-1.11022302e-16])

I've been looking for ways to easily multithread some of my simple analysis code since I had noticed numpy it was only using one core, despite the fact that it is supposed to be multithreaded.
Who says it's supposed to be multithreaded?
numpy is primarily designed to be as fast as possible on a single core, and to be as parallelizable as possible if you need to do so. But you still have to parallelize it.
In particular, you can operate on independent sub-objects at the same time, and slow operations release the GIL when possible—although "when possible" may not be nearly enough. Also, numpy objects are designed to be shared or passed between processes as easily as possible, to facilitate using multiprocessing.
There are some specialized methods that are automatically parallelized, but most of the core methods are not. In particular, dot is implemented on top of BLAS when possible, and BLAS is automatically parallelized on most platforms, but mean is implemented in plain C code.
See Parallel Programming with numpy and scipy for details.
So, how do you know which methods are parallelized and which aren't? And, of those which aren't, how do you know which ones can be nicely manually-threaded and which need multiprocessing?
There's no good answer to that. You can make educated guesses (X seems like it's probably implemented on top of ATLAS, and my copy of ATLAS is implicitly threaded), or you can read the source.
But usually, the best thing to do is try it and test. If the code is using 100% of one core and 0% of the others, add manual threading. If it's now using 100% of one core and 10% of the others and barely running faster, change the multithreading to multiprocessing. (Fortunately, Python makes this pretty easy, especially if you use the Executor classes from concurrent.futures or the Pool classes from multiprocessing. But you still often need to put some thought into it, and test the relative costs of sharing vs. passing if you have large arrays.)
Also, as kwatford points out, just because some method doesn't seem to be implicitly parallel doesn't mean it won't be parallel in the next version of numpy, or the next version of BLAS, or on a different platform, or even on a machine with slightly different stuff installed on it. So, be prepared to re-test. And do something like my_mean = numpy.mean and then use my_mean everywhere, so you can just change one line to my_mean = pool_threaded_mean.

Basically, because the BLAS library has an optimized dot product that they can easily call for dot that is inherently parallel. They admit they could extend numpy to parallelize other operations, but opted not to go that route. However, they give several tips on how to parallelize your numpy code (basically to divide work among N cores (e.g., N=4), split your array into N sub-arrays and send jobs for each sub-array to its own thread and then combine your results).
See http://wiki.scipy.org/ParallelProgramming :
Use parallel primitives
One of the great strengths of numpy is that you can express array operations very cleanly. For example to compute the product of the matrix A and the matrix B, you just do:
>>> C = numpy.dot(A,B)
Not only is this simple and clear to read and write, since numpy knows you want to do a matrix dot product it can use an optimized implementation obtained as part of "BLAS" (the Basic Linear Algebra Subroutines). This will normally be a library carefully tuned to run as fast as possible on your hardware by taking advantage of cache memory and assembler implementation. But many architectures now have a BLAS that also takes advantage of a multicore machine. If your numpy/scipy is compiled using one of these, then dot() will be computed in parallel (if this is faster) without you doing anything. Similarly for other matrix operations, like inversion, singular value decomposition, determinant, and so on. For example, the open source library ATLAS allows compile time selection of the level of parallelism (number of threads). The proprietary MKL library from Intel offers the possibility to chose the level of parallelism at runtime. There is also the GOTO library that allow run-time selection of the level of parallelism. This is a commercial product but the source code is distributed free for academic use.
Finally, scipy/numpy does not parallelize operations like
>>> A = B + C
>>> A = numpy.sin(B)
>>> A = scipy.stats.norm.isf(B)
These operations run sequentially, taking no advantage of multicore machines (but see below). In principle, this could be changed without too much work. OpenMP is an extension to the C language which allows compilers to produce parallelizing code for appropriately-annotated loops (and other things). If someone sat down and annotated a few core loops in numpy (and possibly in scipy), and if one then compiled numpy/scipy with OpenMP turned on, all three of the above would automatically be run in parallel. Of course, in reality one would want to have some runtime control - for example, one might want to turn off automatic parallelization if one were planning to run several jobs on the same multiprocessor machine.

Related

Fastest way to compute matrix dot product

I compute the dot product as follows:
import numpy as np
A = np.random.randn(80000, 3000)
B = np.random.randn(3000, 50)
C = np.dot(A, B)
Running this script takes about 9 seconds:
Mac#MacBook-Pro:~/python_dot_product$ time python dot.py
real 0m9.042s
user 0m10.927s
sys 0m0.911s
Could I do any better?
Does numpy already use the ideal balance for the cores?

The last two answers at this SO answer should be helpful.
The last one pointed me to SciPy documentation, which includes this quote:
"[np.dot(A,B) is evaluated using BLAS, which] will normally be a
library carefully tuned to run as fast as possible on your hardware by
taking advantage of cache memory and assembler implementation. But
many architectures now have a BLAS that also takes advantage of a
multicore machine. If your numpy/scipy is compiled using one of these,
then dot() will be computed in parallel (if this is faster) without
you doing anything."
So it sounds like it depends on your specific hardware and SciPy compilation. Sometimes np.dot(A,B) will utilize your multiple cores/processors, sometimes it might not.
To find out which case is yours, I suggest running your toy example (with larger matrices) while you have your system monitor open, so you can see whether just one CPU spikes in activity, or if multiple ones do.

Scipy minimize function seems to be creating multiple threads by itself?

I am using the scipy minimize function. The function that it's calling was compiled with Cython and has an underlying C++ implementation that I wrote, but that shouldn't really matter. For some reason when I run my program, it creates as many threads as it can to fill all my cpus. For example if I run top I see that 800% of a cpu is being used or on htop I can see that 8 individual processors are being used, when I only created the program to be run on one. I didn't think that scipy even had parallel processing functionality and I can't find any documentation related to this. What could possible be going on and is there any way to control it?

If some BLAS-implementation (with threading-support) is available (default on Ubuntu for example), some expressions like np.dot() (only the dense case as far as i know) will automatically be run in parallel (reference). Another possible example is sparse-matrix factorization with SuperLU.
Of course different minimizers will behave different.
Newton-type methods (core: solve a system of sparse linear-equations) are probably based on SuperLU (if the code is not one of the common old Fortran/C ones, where the whole code is self-contained). CG-type methods are heavily based on matrix-vector products (np.dot; so the dense-case will be parallel).
For some control over this, start with this SO question.

Using scipy routines outside of the GIL

This is sort of a general question related to a specific implementation I have in mind, about whether it's safe to use python routines designed for use inside the GIL in a shared memory environment. Specifically what I'd like to do is use scipy.optimize.curve_fit on a large array inside a cython function.
The data can be expressed as a 2d numpy array (say, of floats) with the axis to be fit along and the other the serialized axis to be parallelized over. Then I'd just like to release the GIL and start looping through the data with a cython.parallel.prange (the idea being then that I can have all my cores working on fitting at once).
The main issue I can foresee is that curve_fit does not operate "in place"; it returns the fit values of the parameters (and optionally their covariance matrix) and so has to allocate that memory at some point. (Of course I also have no idea about any intermediate memory allocation the routine performs.) I'm worried about how this will operate outside the GIL with many threads working concurrently.
I realize that the answer could just be "it should work fine go try it," but I'm hoping to get some idea of what to look out for. I also realize that this question is similar to others about parallelizing scipy/numpy routines, but I think this one is worded differently in that falls within the cython scope of a C environment for python.
Thanks for any help/suggestions.

Not safe. If CPython could safely run that kind of code without the GIL, we wouldn't have the GIL in the first place.

You may find the following discussion to be of interest on Parallel Programming in SciPy.
[I would have posted this as merely a comment, but I lack the requisite reputation.]

Numpy fft.pack vs FFTW vs Implement DFT on your own

I am currently need to run FFT on 1024 sample points signal. So far I have implementing my own DFT algorithm in python, but it is very slow. If I use the NUMPY fftpack, or even move to C++ and use FFTW, do you guys think it would be better?

If you are implementing the DFFT entirely within Python, your code will run orders of magnitude slower than either package you mentioned. Not just because those libraries are written in much lower-level languages, but also (FFTW in particular) they are written so heavily optimized, taking advantage of cache locality, vector units, and basically every trick in the book, that it would not surprise me if they ran at 10,000x the speed of a naive Python implementation. Even if you are using numpy in your implementation, it will still pale in comparison.
So yes; use numpy's fftpack. If that is not fast enough, you can try the python bindings for FFTW (PyFFTW), but the speedup from fftpack to fftw will not be nearly as dramatic. I really doubt there's a need to drop into C++ just for FFTs - they're sort of the ideal case for Python bindings.

If you need speed, then you want to go for FFTW, check out the pyfftw project.
In order to use processor SIMD instructions, you need to align the data and there is not an easy way of doing so in numpy. Moreover, pyfftw allows you to use true multithreading, so trust me, it will be much faster.

In case you wish to stick to Python (handling and maintaining custom C++ bindings can be time consuming), you have the alternative of using OpenCV's implementation of FFT.
I put together a toy example comparing OpenCV's dft() and numpy's fft2 functions in python (Intel(R) Core(TM) i7-3930K CPU).
samplesFreq_cv2 = [
cv2.dft(samples[iS])
for iS in xrange(nbSamples)]
samplesFreq_np = [
np.fft.fft2(samples[iS])
for iS in xrange(nbSamples)]
Results for sequentially transforming 20000 image patches of varying resolutions from 20x20 to 60x60:
Numpy's fft2: 1.709100 seconds
OpenCV's dft: 0.621239 seconds
This is likely not as fast as binding to a dedicates C++ library like fftw, but it's a rather low-hanging fruit.

Parallel computing

I have a two dimensional table (Matrix)
I need to process each line in this matrix independently from the others.
The process of each line is time consuming.
I'd like to use parallel computing resources in our university (Canadian Grid something)
Can I have some advise on how to start ? I never used parallel computing before.
Thanks :)

Start here: http://docs.python.org/library/multiprocessing.html
Be sure to read this: http://docs.python.org/library/multiprocessing.html#examples
This may be helpful: http://www.slideshare.net/pvergain/multiprocessing-with-python-presentation.
While excellent, it includes threads and multiprocessing, even though multiprocessing is often far, far superior to attempting multi-threading.
For Grid computing, multi-threading is largely useless.
Also, you probably also want to read up on celery.

I am one of the developper of a new library called scoop.
It was built exactly for this purpose (grid or super-computing, scientific computing). I suggest you give it a try.
In your case, all you would have to do is a call like this:
futures.map(YourFunc, matrixLine)
It will then be distributed on your grid or whatever environment you choose.

Like the commentators have said, find someone to talk to in your university. The answer to your question will be specific to what software is installed on the grid. If you have access to a grid, it's highly likely you also have access to a person whose job it is to answer your questions (and they will be pleased to help) - find this person!

From what you describe, I would say: first have a look at numpy.
Numpy provides methods to compute the columns and rows in a vectorized manner with nearly C speed. Depending on your problem, this could be faster than parallel computation with pure CPython.
You can than use parallel computing with numpy-arrays to get a really big speed up.
Possible ways to do this is using multiprocessing or Ipython on a cluster.

It is recommended that you use C++/C for performing this computation. You can use the OpenMP API using the #include<omp.h> header. You can start your parallel region using the #pragma amp parallel directive. Since you are parallelising a for-loop for computing your matrix multiplication, you can use #pragma omp parallel for { } to start your for-loop inside the parallel region. OpenMP will automatically take care of the process synchronisation.
Check this out for a sample code: https://gist.github.com/metallurgix/0dfafc03215ce89fc595
Remember to use a big matrix to see actual improvements in speed. A smaller matrix will perform poorly in fact due to increased task overhead created due to forking and joining the multiple threads.
You can also check out MPI if you want to parallelise your code using multiple processors instead of multiple threads.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.