Parallelize python loop numpy.searchsorted using cython - python

I've coded a function using cython containing the following loop. Each row of array A1 is binary searched for all values in array A2. So each loop iteration returns a 2D array of index values. Arrays A1 and A2 enter as function arguments, properly typed.
The array C is pre-allocated at the highest indentation level, as Cython requires.
I simplified things a little for this question.
...
cdef np.ndarray[DTYPEint_t, ndim=3] C = np.zeros([N,M,M], dtype=DTYPEint)

for j in range(0,N):
    C[j,:,:] = np.searchsorted(A1[j,:], A2, side='left' )
All is fine so far: things compile and run as expected. However, to gain even more speed I want to parallelize the j-loop. My first attempt was simply writing
for j in prange(0,N, nogil=True):
    C[j,:,:] = np.searchsorted(A1[j,:], A2, side='left' )
I tried many coding variations, such as putting things in a separate nogil function, assigning the result to an intermediate array, and writing a nested loop to avoid the assignment to the sliced part of C.
The errors are usually of the form "Accessing Python attribute not allowed without gil".
I can't get it to work. Any suggestions on how I can do this?
EDIT:
This is my setup.py
try:
    from setuptools import setup
    from setuptools import Extension
except ImportError:
    from distutils.core import setup
    from distutils.extension import Extension
from Cython.Build import cythonize
import numpy

extensions = [Extension("matchOnDistanceVectors",
                        sources=["matchOnDistanceVectors.pyx"],
                        extra_compile_args=["/openmp", "/O2"],
                        extra_link_args=[]
                        )]

setup(
    ext_modules = cythonize(extensions),
    include_dirs=[numpy.get_include()]
)
I'm on Windows 7, compiling with MSVC. I did specify the /openmp flag, and my arrays have sizes around 200*200, so everything seems in order...
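One way to double-check that the /openmp flag actually took effect is to compile a tiny Cython module with the same extra_compile_args and ask OpenMP how many threads it sees. This is just a minimal sketch; testomp.pyx is a hypothetical helper name, not part of the project above:

# testomp.pyx -- hypothetical helper, built with the same /openmp flag
cimport openmp

def max_threads():
    # With OpenMP enabled this reports the number of threads available;
    # if the module fails to build/link or reports 1, OpenMP is not active.
    return openmp.omp_get_max_threads()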

I believe that searchsorted releases the GIL itself (see https://github.com/numpy/numpy/blob/e2805398f9a63b825f4a2aab22e9f169ff65aae9/numpy/core/src/multiarray/item_selection.c, line 1664 "NPY_BEGIN_THREADS_DEF").
Therefore, you can do
for j in prange(0,N, nogil=True):
    with gil:
        C[j,:,:] = np.searchsorted(A1[j,:], A2, side='left' )
That temporarily claims back the GIL to do the necessary work on Python objects (which is hopefully quick), and it should then be released again inside searchsorted, allowing the loop to run largely in parallel.
To update: I did a quick test of this (A1.shape==(105,100), A2.shape==(302,302); the numbers are chosen pretty arbitrarily). For 10 repeats the serial version took 4.5 seconds and the parallel version took 1.4 seconds (the test was run on a 4-core CPU). You don't get the full 4x speed-up, but you get close.
This was compiled as described in the documentation. If you aren't seeing a speed-up, I suspect it could be any of: 1) your arrays are small enough that the function-call and numpy type/size-checking overhead dominates; 2) you aren't compiling with OpenMP enabled; or 3) your compiler doesn't support OpenMP.
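Putting the pieces together, a complete function using this pattern might look roughly like the following. This is only a sketch: the argument dtypes, the use of np.intp in place of the question's DTYPEint, and the function name are assumptions based on the question, not the asker's actual code:

import numpy as np
cimport numpy as np
from cython.parallel import prange

def match_on_rows(np.ndarray[np.float64_t, ndim=2] A1,
                  np.ndarray[np.float64_t, ndim=2] A2):
    cdef Py_ssize_t N = A1.shape[0]
    cdef Py_ssize_t M = A2.shape[0]
    cdef np.ndarray[np.intp_t, ndim=3] C = np.zeros([N, M, M], dtype=np.intp)
    cdef Py_ssize_t j
    for j in prange(0, N, nogil=True):
        with gil:
            # searchsorted releases the GIL internally, so most of the work
            # still runs in parallel despite the brief 'with gil' re-acquire.
            C[j, :, :] = np.searchsorted(A1[j, :], A2, side='left')
    return C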

You have a bit of a catch-22. You need the GIL to call numpy.searchsorted, but the GIL prevents any kind of parallel processing. Your best bet is to write your own nogil version of searchsorted:
cdef int mySearchSorted(double[:] array, double target) nogil:
    # binary search implementation

for j in prange(0,N, nogil=True):
    for k in range(A2.shape[0]):
        for L in range(A2.shape[1]):
            C[j, k, L] = mySearchSorted(A1[j, :], A2[k, L])
numpy.searchsorted also has a non-trivial amount of overhead, so if N is large it makes sense to use your own searchsorted just to reduce the overhead.
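For completeness, the body of mySearchSorted could be a standard leftmost binary search. A minimal sketch follows; the signature mirrors the stub above, and A1, A2 and C are assumed to be typed memoryviews so the loop really runs without the GIL:

cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cdef int mySearchSorted(double[:] array, double target) nogil:
    # Leftmost insertion point, like np.searchsorted(array, target, side='left')
    cdef int lo = 0
    cdef int hi = array.shape[0]
    cdef int mid
    while lo < hi:
        mid = (lo + hi) // 2
        if array[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo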

Related

Decomposition of matrices for CPLEX and machine learning application

I am dealing with big matrices and from time to time my code ends with a 'killed: 9' message in my terminal. I'm working on Mac OS X.
A wise programmer tells me the problem in my code is linked to the stored matrix I am dealing with.
import numpy as np

nn = 35000
dd = 35
XX = np.random.rand(nn,dd)
XX = XX.dot(XX.T)  # it should be faster than np.dot(XX,XX.T)
yy = np.random.rand(nn,1)
XX = np.multiply(XX,yy.T)
I have to store this huge matrix XX. My guess: I split the matrix with
upp = np.triu(XX)
Do I actually save space in terms of stored data?
What if later on I store
low = upp.T
am I wasting memory and computational time?
It should take up the same total amount of memory: np.triu returns a full dense array and just fills the lower triangle with zeros (a quick check is sketched after the options below). To avoid the error, you are probably looking at a few options:
Process batch-wise
If you create your model via the CPLEX API, then once you have supplied the data it is handled by CPLEX, I believe. So you could split the data, load it piece by piece, and add it to the model consecutively.
Allocate memory manually
If you use Cython, you can use the function malloc to allocate memory for your array manually; the size will then very likely be no issue.
Option 1 would be the preferred option in my opinion.
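As mentioned above, here is a quick check showing that np.triu does not reduce memory usage. This is a small sketch with a toy size; the 35000 x 35000 float64 matrix from the question would already need roughly 9.8 GB on its own:

import numpy as np

nn = 1000  # toy size instead of the question's 35000
XX = np.random.rand(nn, nn)
upp = np.triu(XX)

# np.triu returns a full dense array with zeros below the diagonal,
# so it occupies exactly as much memory as the original matrix.
print(XX.nbytes)   # 8000000 bytes
print(upp.nbytes)  # 8000000 bytes, no saving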
EDIT:
I constructed a little example. It actually combines the two options. The array is not stored as a Python object, but as a C array and the values are computed piecewise.
I am allocating the memory for the array using Cython and malloc. To run the code you have to install Cython. Then you can open a Python interpreter in the directory where you saved the file and write:
import pyximport;pyximport.install()
import nameofscript
An example for processing your array:
import numpy as np
from libc.stdlib cimport malloc    # Allocate memory manually
from cython.parallel import prange # Parallel processing without GIL

dd = 35

# With cdef we can define C variables in Cython.
cdef double **XXN
cdef double y[35000]
cdef int i, j, nn
nn = 35000

# Allocate memory for the matrix with 1.225 billion double elements
XXN = <double **>malloc(nn * sizeof(double *))
for i in range(nn):
    XXN[i] = <double *>malloc(nn * sizeof(double))

XX = np.random.rand(nn,dd)
for i in range(nn):
    for j in range(nn):
        # Compute the values for the new matrix element by element
        XXN[i][j] = XX[i].dot(XX[j].T)

# Multiply the new matrix with y column wise
for i in prange(nn, nogil=True, num_threads=4):
    for j in range(nn):
        XXN[i][j] = XXN[i][j] * y[i]
Save this file as nameofscript.pyx and run it as described above. I have briefly tested this script and it runs for about half an hour on my machine. You can extend this script and use the result array XXN for your further computations.
A little note on the parallelization: I did not initialize y or assign it any values here. If you declare y as a C array, you can, e.g., fill it with values from Python objects. Then you can perform the last multiplication without the GIL, in parallel, as shown in the code sample.
Regarding computational efficiency: this is probably not the fastest way (which may be writing your code entirely against the CPLEX C interface), but it does not throw the memory error and runs in an acceptable time if you do not have to repeat this computation too often.
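One caveat about the example above: the rows allocated with malloc are never released, so if you extend the script it is worth freeing them once XXN is no longer needed. A minimal sketch, assuming the same variable names as above:

from libc.stdlib cimport free

# Manually allocated memory is not garbage collected by Python,
# so release every row and then the row-pointer array itself.
for i in range(nn):
    free(XXN[i])
free(XXN)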

Optimizing a multithreaded numpy array function

Given 2 large arrays of 3D points (I'll call the first "source" and the second "destination"), I needed a function that would return, for each element of "source", the index of its closest match in "destination", with this limitation: I can only use numpy... So no scipy, pandas, numexpr, cython...
To do this I wrote a function based on the "brute force" answer to this question. I iterate over the elements of source, find the closest element in destination and return its index. Due to performance concerns, and again because I can only use numpy, I tried multithreading to speed it up. Here are both the threaded and unthreaded functions and how they compare in speed on an 8-core machine.
import timeit
import numpy as np
from numpy.core.umath_tests import inner1d
from multiprocessing.pool import ThreadPool

def threaded(sources, destinations):
    # Define worker function
    def worker(point):
        dlt = (destinations-point) # delta between destinations and given point
        d = inner1d(dlt,dlt)       # get distances
        return np.argmin(d)        # return closest index
    # Multithread!
    p = ThreadPool()
    return p.map(worker, sources)

def unthreaded(sources, destinations):
    results = []
    #for p in sources:
    for i in range(len(sources)):
        dlt = (destinations-sources[i]) # difference between destinations and given point
        d = inner1d(dlt,dlt)            # get distances
        results.append(np.argmin(d))    # append closest index
    return results

# Setup the data
n_destinations = 10000 # 10k random destinations
n_sources = 10000      # 10k random sources
destinations = np.random.rand(n_destinations,3) * 100
sources = np.random.rand(n_sources,3) * 100

# Compare!
print 'threaded: %s'%timeit.Timer(lambda: threaded(sources,destinations)).repeat(1,1)[0]
print 'unthreaded: %s'%timeit.Timer(lambda: unthreaded(sources,destinations)).repeat(1,1)[0]
Results:
threaded: 0.894030461056
unthreaded: 1.97295164054
Multithreading seems beneficial, but I was hoping for more than a 2x increase, given that the real-life datasets I deal with are much larger.
All recommendations to improve performance (within the limitations described above) will be greatly appreciated!
OK, I've been reading the Maya documentation on Python and I came to these conclusions/guesses:
They're probably using CPython inside (there are several references to that documentation and not to any other).
They're not fond of threads (lots of non-thread-safe methods).
Given the above, I'd say it's better to avoid threads. Because of the GIL this is a common problem, and there are several ways to work around it:
Try to build a C/C++ extension tool. Once that is done, use threads in C/C++. Personally, I'd only try to get SIP to work, and then move on.
Use multiprocessing. Even if your custom Python distribution doesn't include it, you can get to a working version since it's all pure Python code. multiprocessing is not affected by the GIL since it spawns separate processes (a minimal sketch is at the end of this answer).
The above should've worked out for you. If not, try another parallel tool (after some serious praying).
On a side note, if you're using outside modules, be mindful of trying to match Maya's version. This may have been the reason you couldn't build scipy. Of course, scipy has a huge codebase, and the Windows platform is not the easiest to build things on.
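To illustrate the multiprocessing suggestion above, here is a minimal sketch of how the threaded() function from the question could be adapted. The worker has to live at module level so it can be pickled; np.einsum replaces inner1d but computes the same row-wise dot products, and none of the names below are from the original code:

import numpy as np
from functools import partial
from multiprocessing import Pool

def worker(point, destinations):
    # Same brute-force search as in the question, but defined at module
    # level so multiprocessing can pickle it.
    dlt = destinations - point            # deltas to every destination
    d = np.einsum('ij,ij->i', dlt, dlt)   # squared distances per row
    return np.argmin(d)                   # index of the closest point

def multiprocessed(sources, destinations, processes=None):
    pool = Pool(processes)
    try:
        return pool.map(partial(worker, destinations=destinations), sources)
    finally:
        pool.close()
        pool.join()

Note that each task ships its own pickled copy of destinations, so this only pays off when the per-point work is heavy, and on Windows the call must sit under an if __name__ == '__main__': guard.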

Cython for loop conversion

Using cython -a, I found that a for i in range(0, a, b) statement was compiled as a Python loop (a very yellow line in the cython -a HTML output). i, a and b were cdef-ed as int64_t.
Then I tried the 'old' syntax for i from 0 <= i < a by b. From the output of cython -a it seemed to compile to quite optimal C code, as expected.
Is it expected behaviour that range(0, a, b) is not optimized here, or is this rather bound to the implementation?
Automatic range conversion is only applied when Cython can determine the sign of the step at compile time. As the step in this case is a signed type it cannot, and so it falls back to the Python loop.
Note that currently, even when the type is unsigned, Cython still falls back to the Python loop; this is a (rather old) outstanding optimisation that the compiler could do but doesn't. Have a look at this ticket for more information:
http://trac.cython.org/ticket/546
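To make the difference concrete, here is a small sketch of the two loop forms (the variable names mirror the question; int64_t comes from libc.stdint, and the function itself is just an example):

from libc.stdint cimport int64_t

def loop_demo(int64_t a, int64_t b):
    cdef int64_t i, total = 0

    # Falls back to a Python-level loop: b is a signed step whose sign
    # Cython cannot determine at compile time.
    for i in range(0, a, b):
        total += i

    # The old-style syntax states the direction explicitly, so Cython
    # emits a plain C for-loop.
    for i from 0 <= i < a by b:
        total += i

    return total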

numpy OpenBLAS set maximum number of threads

I am using numpy and my model involves intensive matrix-matrix multiplication.
To speed things up, I use the multi-threaded OpenBLAS library to parallelize the numpy.dot function.
My setting is as follows,
OS : CentOS 6.2 server #CPUs = 12, #MEM = 96GB
python version: Python2.7.6
numpy : numpy 1.8.0
OpenBLAS + IntelMKL
$ OMP_NUM_THREADS=8 python test_mul.py
The code, which I took from https://gist.github.com/osdf/, is:
test_mul.py :
import numpy
import sys
import timeit

try:
    import numpy.core._dotblas
    print 'FAST BLAS'
except ImportError:
    print 'slow blas'

print "version:", numpy.__version__
print "maxint:", sys.maxint
print

x = numpy.random.random((1000,1000))
setup = "import numpy; x = numpy.random.random((1000,1000))"
count = 5

t = timeit.Timer("numpy.dot(x, x.T)", setup=setup)
print "dot:", t.timeit(count)/count, "sec"
when I use OMP_NUM_THREADS=1 python test_mul.py, the result is
dot: 0.200172233582 sec
OMP_NUM_THREADS=2
dot: 0.103047609329 sec
OMP_NUM_THREADS=4
dot: 0.0533880233765 sec
Things go well. However, when I set OMP_NUM_THREADS=8, the code only works occasionally: sometimes it runs, sometimes it does not even start and gives me core dumps. When OMP_NUM_THREADS > 10, the code seems to break all the time.
I am wondering what is happening here. Is there something like a maximum number of threads that each process can use? Can I raise that limit, given that I have 12 CPUs in my machine?
Thanks
Firstly, I don't really understand what you mean by 'OpenBLAS + IntelMKL'. Both of those are BLAS libraries, and numpy should only link to one of them at runtime. You should probably check which of these two numpy is actually using. You can do this by calling:
$ ldd <path-to-site-packages>/numpy/core/_dotblas.so
Update: numpy/core/_dotblas.so was removed in numpy v1.10, but you can check the linkage of numpy/core/multiarray.so instead.
For example, I link against OpenBLAS:
...
libopenblas.so.0 => /opt/OpenBLAS/lib/libopenblas.so.0 (0x00007f788c934000)
...
If you are indeed linking against OpenBLAS, did you build it from source? If you did, you should see that in the Makefile.rule there is a commented option:
...
# You can define maximum number of threads. Basically it should be
# less than actual number of cores. If you don't specify one, it's
# automatically detected by the the script.
# NUM_THREADS = 24
...
By default OpenBLAS will try to set the maximum number of threads to use automatically, but you could try uncommenting and editing this line yourself if it is not detecting this correctly.
Also, bear in mind that you will probably see diminishing returns in terms of performance from using more threads. Unless your arrays are very large it is unlikely that using more than 6 threads will give much of a performance boost because of the increased overhead involved in thread creation and management.
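Independently of the build-time NUM_THREADS setting, OpenBLAS also reads thread-count environment variables at runtime, so you can cap the thread count per run without rebuilding. A minimal sketch (the variables must be set before numpy is imported, and the value 6 is just an example):

import os

# Cap the BLAS thread count before numpy (and thus OpenBLAS) is loaded.
# OPENBLAS_NUM_THREADS is read by OpenBLAS itself; OMP_NUM_THREADS covers
# OpenMP-based builds like the one used in the question.
os.environ["OPENBLAS_NUM_THREADS"] = "6"
os.environ["OMP_NUM_THREADS"] = "6"

import numpy as np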

Portable/fast way to obtain a pointer to Numpy/Numpypy data

I recently tried PyPy and was intrigued by the approach. I have lots of C extensions for Python, which all use PyArray_DATA() to obtain a pointer to the data sections of numpy arrays. Unfortunately, PyPy doesn't appear to export an equivalent for their numpypy arrays in their cpyext module, so I tried following the recommendation on their website to use ctypes. This pushes the task of obtaining the pointer to the Python level.
There appear to be two ways:
import ctypes as C

p_t = C.POINTER(C.c_double)

def get_ptr_ctypes(x):
    return x.ctypes.data_as(p_t)

def get_ptr_array(x):
    return C.cast(x.__array_interface__['data'][0], p_t)
Only the second one works on PyPy, so for compatibility the choice is clear. For CPython, both are slow as hell and a complete bottleneck for my application! Is there a fast and portable way of obtaining this pointer? Or is there an equivalent of PyArray_DATA() for PyPy (possibly undocumented)?
I still haven't found an entirely satisfactory solution, but nevertheless there is something one can do to obtain the pointer with a lot less overhead in CPython. First off, the reason why both ways mentioned above are so slow is that both .ctypes and .__array_interface__ are on-demand attributes, which are set by array_ctypes_get() and array_interface_get() in numpy/numpy/core/src/multiarray/getset.c. The first imports ctypes and creates a numpy.core._internal._ctypes instance, while the second one creates a new dictionary and populates it with lots of unnecessary stuff in addition to the data pointer.
There is nothing one can do on the Python level about this overhead, but one can write a micro-module on the C-level that bypasses most of the overhead:
#include <Python.h>
#include <numpy/arrayobject.h>

PyObject *_get_ptr(PyObject *self, PyObject *obj) {
    return PyLong_FromVoidPtr(PyArray_DATA(obj));
}

static PyMethodDef methods[] = {
    {"_get_ptr", _get_ptr, METH_O, "Wrapper to PyArray_DATA()"},
    {NULL, NULL, 0, NULL}
};

PyMODINIT_FUNC initaccel(void) {
    Py_InitModule("accel", methods);
}
Compile as usual as an Extension in setup.py, and import as
try:
    from accel import _get_ptr
    def get_ptr(x):
        return C.cast(_get_ptr(x), p_t)
except ImportError:
    get_ptr = get_ptr_array
On PyPy, from accel import _get_ptr will fail and get_ptr will fall back to get_ptr_array, which works with Numpypy.
As far as performance goes, for light-weight C function calls, ctypes + accel._get_ptr() is still quite a bit slower than the native CPython extension, which has essentially no overhead. It is of course much faster than get_ptr_ctypes() and get_ptr_array() above, so that the overhead may become insignificant for medium-weight C function calls.
One has gained compatibility with PyPy, although I have to say that after spending quite a bit of time trying to evaluate PyPy for my scientific computation applications, I don't see a future for it as long as they (quite stubbornly) refuse to support the full CPython API.
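For reference, a minimal setup.py for building the accel module could look like this (a sketch; the file and module names are just the ones used above):

from distutils.core import setup
from distutils.extension import Extension
import numpy

setup(
    name="accel",
    ext_modules=[Extension("accel",
                           sources=["accel.c"],
                           include_dirs=[numpy.get_include()])],
)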
Update
I found that ctypes.cast() was now becoming the bottleneck after introducing accel._get_ptr(). One can get rid of the casts by declaring all pointers in the interface as ctypes.c_void_p. This is what I ended up with:
def get_ptr_ctypes2(x):
    return x.ctypes._data

def get_ptr_array(x):
    return x.__array_interface__['data'][0]

try:
    from accel import _get_ptr as get_ptr
except ImportError:
    get_ptr = get_ptr_array
Here, get_ptr_ctypes2() avoids the cast by accessing the hidden ndarray.ctypes._data attribute directly. Here are some timing results for calling heavy-weight and light-weight C functions from Python:
                             heavy C (few calls)   light C (many calls)
ctypes + get_ptr_ctypes():   0.71 s                15.40 s
ctypes + get_ptr_ctypes2():  0.68 s                13.30 s
ctypes + get_ptr_array():    0.65 s                11.50 s
ctypes + accel._get_ptr():   0.63 s                 9.47 s
native CPython:              0.62 s                 8.54 s
Cython (no decorators):      0.64 s                 9.96 s
So, with accel._get_ptr() and no ctypes.cast()s, ctypes' speed is actually competitive with a native CPython extension. So I just have to wait until someone rewrites h5py, matplotlib and scipy with ctypes to be able to try PyPy for anything serious...
This might not be enough of an answer, but hopefully it is a good hint. I am using scipy.weave.inline() in some parts of my code. I do not know much about the speed of the interface itself, because the function I execute is quite heavy and relies on only a few pointers/arrays, but it seems fast to me. Maybe you can get some inspiration from the scipy.weave code, particularly from attempt_function_call:
https://github.com/scipy/scipy/blob/master/scipy/weave/inline_tools.py#L390
If you want to have a look at the C++ code that is generated by scipy.weave:
produce a simple example from here: http://docs.scipy.org/doc/scipy/reference/tutorial/weave.html
run the Python script
get the scipy.weave cache folder:
import scipy.weave.catalog as ctl
ctl.default_dir()
Out[5]: '/home/user/.python27_compiled'
and have a look at the generated C++ code in that folder.
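For anyone who wants to try this, a minimal scipy.weave.inline call looks roughly like the following sketch (weave only exists in the old Python 2 scipy releases referenced above; the exact snippet is mine, not the answerer's):

from scipy import weave

a = 3
# The C snippet sees the Python variable 'a' directly and hands its
# result back through the special return_val variable.
code = "return_val = a * 2;"
print(weave.inline(code, ['a']))  # prints 6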
