Using Cython correctly in sample code with numpy - python

I was wondering if I'm missing something when using Cython with Numpy because I haven't seen much of an improvement. I wrote this code as an example.
Naive version:
import numpy as np
from skimage.util import view_as_windows
it = 16
arr = np.arange(1000*1000, dtype=np.float64).reshape(1000,1000)
windows = view_as_windows(arr, (it, it), it)
container = np.zeros((windows.shape[0], windows.shape[1]))
def test(windows):
for i in range(windows.shape[0]):
for j in range(windows.shape[1]):
container[i,j] = np.mean(windows[i,j])
return container
%%timeit
test(windows)
1 loops, best of 3: 131 ms per loop
Cythonized version:
%%cython --annotate
import numpy as np
cimport numpy as np
from skimage.util import view_as_windows
import cython
cdef int step = 16
arr = np.arange(1000*1000, dtype=np.float64).reshape(1000,1000)
windows = view_as_windows(arr, (step, step), step)
#cython.boundscheck(False)
def cython_test(np.ndarray[np.float64_t, ndim=4] windows):
cdef np.ndarray[np.float64_t, ndim=2] container = np.zeros((windows.shape[0], windows.shape[1]),dtype=np.float64)
cdef int i, j
I = windows.shape[0]
J = windows.shape[1]
for i in range(I):
for j in range(J):
container[i,j] = np.mean(windows[i,j])
return container
%timeit cython_test(windows)
10 loops, best of 3: 126 ms per loop
As you can see, there is a very modest improvement, so maybe I'm doing something wrong. By the way, the annotation that Cython produces the following:
As you can see, the numpy lines have a yellow background even after including the efficient indexing syntax np.ndarray[DTYPE_t, ndim=2]. Why?
By the way, in my view the ideal outcome is being able to use most numpy functions but still get some reasonable improvement after taking advantage of efficient indexing syntax or maybe memory views as in HYRY's answer.
UPDATE
It seems I'm not doing anything wrong in the code I posted above and that the yellow background in some lines is normal, so I was left wondering the following: In which situations I can get a benefit from typing cdef np.ndarray[np.float64_t, ndim=2] in front of numpy arrays? I suppose there are specific instances where this is helpful, otherwise there wouldn't be much purpose in doing it.

You need to implement the mean() function yourself to speedup the code, this is because the overhead of calling a numpy function is very high.
#cython.boundscheck(False)
#cython.wraparound(False)
def cython_test(double[:, :, :, :] windows):
cdef double[:, ::1] container
cdef int i, j, k, l
cdef int n0, n1, n2, n3
cdef double inv_n
cdef double s
n0, n1, n2, n3 = windows.base.shape
container = np.zeros((n0, n1))
inv_n = 1.0 / (n2 * n3)
for i in range(n0):
for j in range(n1):
s = 0
for k in range(n2):
for l in range(n3):
s += windows[i, j, k, l]
container[i,j] = s * inv_n
return container.base
Here is the %timeit results:
python_test(windows): 63.7 ms
cython_test(windows): 1.24 ms
np.mean(windows, axis=(2, 3)): 2.66 ms

Related

numpy function cythonization

I have the following function in pure python:
import numpy as np
def subtractPython(a, b):
xAxisCount = a.shape[0]
yAxisCount = a.shape[1]
shape = (xAxisCount, yAxisCount, xAxisCount)
results = np.zeros(shape)
for index in range(len(b)):
subtracted = (a - b[index])
results[:, :, index] = subtracted
return results
I tried to cythonize it this way:
import numpy as np
cimport numpy as np
DTYPE = np.int
ctypedef np.int_t DTYPE_t
def subtractPython(np.ndarray[DTYPE_t, ndim=2] a, np.ndarray[DTYPE_t, ndim=2] b):
cdef int xAxisCount = a.shape[0]
cdef int yAxisCount = a.shape[1]
cdef np.ndarray[DTYPE_t, ndim=3] results = np.zeros([xAxisCount, yAxisCount, xAxisCount], dtype=DTYPE)
cdef int lenB = len(b)
cdef np.ndarray[DTYPE_t, ndim=2] subtracted
for index in range(lenB):
subtracted = (a - b[index])
results[:, :, index] = subtracted
return results
However, Im not seeing any speedup. Is there something I'm missing or this process can't be sped up?
EDIT -> I've realized that I'm not actually cythonizing the subtraction algorithm in the above code. I've managed to cythonize it, but it has the exact same runtime as a - b[:, None], so I guess this is the maximum speed of this operation.
This is basically a - b[:, None] -> has same runtime
%%cython
import numpy as np
cimport numpy as np
DTYPE = np.int
ctypedef np.int_t DTYPE_t
cimport cython
#cython.boundscheck(False) # turn off bounds-checking for entire function
#cython.wraparound(False) # turn off negative index wrapping for entire function
def subtract(np.ndarray[DTYPE_t, ndim=2] a, np.ndarray[DTYPE_t, ndim=2] b):
cdef np.ndarray[DTYPE_t, ndim=3] result = np.zeros([b.shape[0], a.shape[0], a.shape[1]], dtype=DTYPE)
cdef int lenB = b.shape[0]
cdef int lenA = a.shape[0]
cdef int lenColB = b.shape[1]
cdef int rowA, rowB, column
for rowB in range(lenB):
for rowA in range(lenA):
for column in range(lenColB):
result[rowB, rowA, column] = a[rowA, column] - b[rowB, column]
return result
When trying to optimize a function, one always should know what is the bottle-neck of this function - without you will spend a lot of time running in the wrong direction.
Let's use your python-function as baseline (actually I use result=np.zeros(shape,dtype=a.dtype) otherwise your method returns floats which is probably a bug):
>>> import numpy as np
>>> a=np.random.randint(1,1000,(300,300), dtype=np.int)
>>> b=np.random.randint(1,1000,(300,300), dtype=np.int)
>>> %timeit subtractPython(a,b)
274 ms ± 3.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The first question we should ask ourselves is: Is this task memory or CPU bound? Obviously, this is a memory-bound task - a subtraction is nothing compared to needed memory-read- and write-accesses.
This means, all above we have to optimize the memory layout in order to reduce cache-misses. As a rule of thumb, our memory accesses should access one consecutive memory address after another.
Is this the case? No, the array result is in C-order, i.e. row-major-order and thus the access
results[:, :, index] = subtracted
isn't consecutive. On the other hand,
results[index, :, :] = subtracted
would be a consecutive access. Let's change the way information is stored in result:
def subtract1(a, b):
xAxisCount = a.shape[0]
yAxisCount = a.shape[1]
shape = (xAxisCount, xAxisCount, yAxisCount) #<=== Change order
results = np.zeros(shape, dtype=a.dtype)
for index in range(len(b)):
subtracted = (a - b[index])
results[index, :, :] = subtracted #<===== consecutive access
return results
The timings are now:
>>> %timeit subtract1(a,b)
>>> 35.8 ms ± 285 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
There are also 2 more small improvements: we don't have to initialize result with zeros and we can save some python overhead, but this gives us just about 5%:
def subtract2(a, b):
xAxisCount = a.shape[0]
yAxisCount = a.shape[1]
shape = (xAxisCount, xAxisCount, yAxisCount)
results = np.empty(shape, dtype=a.dtype) #<=== no need for zeros
for index in range(len(b)):
results[index, :, :] = (a-b[index]) #<===== less python overhead
return results
>>> %timeit subtract2(a,b)
34.5 ms ± 203 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now this is about factor 8 faster than the original version.
You could use Cython to try to speed-up this even further - but the task is probably still memory-bound, so don't expect to get it significantly faster - after all cython cannot make the memory work faster. However, without proper profiling it is hard to tell, how much improvement is possible - would not be surprised, if someone would come up with a faster version.

Fastest way to find indices of condition in numpy array

I'm trying to find the fastest way to to get the functionality of numpy's 'where' statement on a 2D numpy array; namely, retrieving the indices where a condition is met. It is simply much slower than other languages I have used (e.g., IDL, Matlab).
I have cythonized a function that marches through the array in nested for-loops. There is almost an order of magnitude increase in speed, but I would like to increase performance even more, if possible.
TEST.py:
from cython_where import *
import time
import numpy as np
data = np.zeros((2600,5200))
data[100:200,100:200] = 10
t0 = time.time()
inds,ct = cython_where(data,'EQ',10)
print time.time() - t0
t1 = time.time()
tmp = np.where(data == 10)
print time.time() - t1
My cython_where.pyx program:
from __future__ import division
import numpy as np
cimport numpy as np
cimport cython
DTYPE1 = np.float
ctypedef np.float_t DTYPE1_t
DTYPE2 = np.int
ctypedef np.int_t DTYPE2_t
#cython.boundscheck(False)
#cython.wraparound(False)
#cython.nonecheck(False)
def cython_where(np.ndarray[DTYPE1_t, ndim=2] data, oper, DTYPE1_t val):
assert data.dtype == DTYPE1
cdef int xmax = data.shape[0]
cdef int ymax = data.shape[1]
cdef unsigned int x, y
cdef int count = 0
cdef np.ndarray[DTYPE2_t, ndim=1] xind = np.zeros(100000,dtype=int)
cdef np.ndarray[DTYPE2_t, ndim=1] yind = np.zeros(100000,dtype=int)
if(oper == 'EQ' or oper == 'eq'): #I didn't want to include GT, GE, LT, LE here
for x in xrange(xmax):
for y in xrange(ymax):
if(data[x,y] == val):
xind[count] = x
yind[count] = y
count += 1
return tuple([xind[0:count],yind[0:count]]),count
Output of TEST.py:
cython_test]$ python TEST.py
0.0139019489288
0.0982608795166
I've also tried numpy's argwhere, which is about as fast as where. I'm pretty new to numpy and cython, so if you have any other ideas to really increase performance, I'm all ears!
Contributions:
Numpy can be speed up on flattened array for a 4x gain:
%timeit np.where(data==10)
1 loops, best of 3: 105 ms per loop
%timeit np.unravel_index(np.where(data.ravel()==10),data.shape)
10 loops, best of 3: 26.0 ms per loop
I think you can optimize your cython code with that, avoiding computing k=i*ncol+j for each cell.
Numba give a simple alternative :
from numba import jit
dtype=data.dtype
#jit(nopython=True)
def numbaeq(flatdata,x,nrow,ncol):
size=ncol*nrow
ix=np.empty(size,dtype=dtype)
jx=np.empty(size,dtype=dtype)
count=0
k=0
while k<size:
if flatdata[k]==x :
ix[count]=k//ncol
jx[count]=k%ncol
count+=1
k+=1
return ix[:count],jx[:count]
def whereequal(data,x): return numbaeq(data.ravel(),x,*data.shape)
which gives :
%timeit whereequal(data,10)
10 loops, best of 3: 20.2 ms per loop
Not great optimisation for numba on such problem, under cython performance.
k//ncol and k%ncol can be computed at same time with a optimized divmod operation.
ultimate steps are assembly language and parallélisation , but it's other sports.

How can I speed up Cython code to compute conditional log likelihood of dirichlet?

I have a function that computes the conditional (on kth alpha) log likelihood of a dirichlet distribution. I have it written in Cython and compiled, but my code calls it about 12M times and it seems to be the bottleneck, so I'm hoping to speed it up.
cimport numpy as np
import numpy as np
import math
DTYPE = np.float64
ctypedef np.float64_t DTYPE_t
def logFullConAlphaK(np.ndarray p,np.ndarray alpha, np.int k):
assert p.dtype == np.float64 and alpha.dtype == np.float64
cdef double t1=sum(np.log(p))
cdef DTYPE_t y=((alpha[k-1]-1)*t1)-np.log(alpha[k-1])+(p.shape[0]*
(math.lgamma(sum(alpha))- math.lgamma(alpha[k-1])))
return y
I compile the Cython into a .pyd file that I use in my code. Any thoughts on how I can speed this up?
Thanks
1) By declaring the data types and dimensions of your input arrays and for p.shape[0]:
def logFullConAlphaK(np.ndarray[DTYPE_t, ndim=1] p,
np.ndarray[DTYPE_t, ndim=1] alpha, int k):
...
cdef int tmp
tmp = p.shape[0]
2) By using C functions instead of Python functions from the module math:
cdef extern from "math.h":
double log(double x) nogil
3) Using NumPy's np.ndarray.sum() method
4) Using Cython directives to avoid some overhead
Altogether:
#cython: wraparound=False
#cython: boundscheck=False
#cython: cdivision=True
#cython: nonecheck=False
import math
cimport numpy as np
import numpy as np
cdef extern from "math.h":
double log(double x) nogil
DTYPE = np.float64
ctypedef np.float64_t DTYPE_t
def logFullConAlphaK(np.ndarray[DTYPE_t, ndim=1] p,
np.ndarray[DTYPE_t, ndim=1] alpha, int k):
assert p.dtype == np.float64 and alpha.dtype == np.float64
cdef double t1
cdef int tmp
t1 = np.log(p).sum()
tmp = p.shape[0]
cdef DTYPE_t y=((alpha[k-1]-1)*t1)-log(alpha[k-1])+(tmp*
(math.lgamma(alpha.sum()) - math.lgamma(alpha[k-1])))
return y
Some performance comparisons among the OP's original solution, #cel's solution and mine:
In [2]: timeit solOP(a, b, 10)
1000 loops, best of 3: 273 µs per loop
In [3]: timeit solcel(a, b, 10)
10000 loops, best of 3: 30.5 µs per loop
In [4]: timeit solS(a, b, 10)
100000 loops, best of 3: 15.8 µs per loop
Take this (probably completely unrealistic) sample data:
n = 1000000
p = np.random.rand(n)
alpha = np.random.rand(n)
k = 12
I get following timings:
%timeit logFullConAlphaK(p, alpha, k) -> 1 loops, best of 3: 174 ms per loop
%timeit logFullConAlphaK_opt(p, alpha, k) -> 100 loops, best of 3: 13.3 ms per loop
This version already gives you an order of magnitude in speed. Note that almost all speedup comes from using np.sum over the built-in sum. All other changes are just for cleaner code, they do not have an impact on the speed.
cimport numpy as np
import numpy as np
import math
def logFullConAlphaK_opt(double[:] p, double[:] alpha, int k):
cdef double t1=np.sum(np.log(p))
cdef double y=((alpha[k-1]-1)*t1)-np.log(alpha[k-1])+(p.shape[0]*
(math.lgamma(np.sum(alpha))- math.lgamma(alpha[k-1])))
return y

Why doesn't Numba improve this iteration ...?

I am trying out Numba in speeding up a function that computes a minimum conditional probability of joint occurrence.
import numpy as np
from numba import double
from numba.decorators import jit, autojit
X = np.random.random((100,2))
def cooccurance_probability(X):
P = X.shape[1]
CS = np.sum(X, axis=0) #Column Sums
D = np.empty((P, P), dtype=np.float) #Return Matrix
for i in range(P):
for j in range(P):
D[i, j] = (X[:,i] * X[:,j]).sum() / max(CS[i], CS[j])
return D
cooccurance_probability_numba = autojit(cooccurance_probability)
However I am finding that the performance of cooccurance_probability and cooccurance_probability_numba to be pretty much the same.
%timeit cooccurance_probability(X)
1 loops, best of 3: 302 ms per loop
%timeit cooccurance_probability_numba(X)
1 loops, best of 3: 307 ms per loop
Why is this? Could it be due to the numpy element by element operation?
I am following as an example:
http://nbviewer.ipython.org/github/ellisonbg/talk-sicm2-2013/blob/master/NumbaCython.ipynb
[Note: I could half the execution time due to the symmetric nature of the problem - but that isn't my main concern]
My guess would be that you're hitting the object layer instead of generating native code due to the calls to sum, which means that Numba isn't going to speed things up significantly. It just doesn't know how to optimize/translate sum (at this point). Additionally it's usually better to unroll vectorized operations into explicit loops with Numba. Notice that the ipynb that you link to only calls out to np.sqrt which I believe does get translated to machine code, and it operates on elements, not slices. I would try to expand out the sum in the inner loop as an explicit additional loop over elements, rather than taking slices and using the sum method.
My experience is that Numba can work wonders sometimes, but it doesn't speed-up arbitrary python code. You need to get a sense of the limitations and what it can optimize effectively. Also note that v0.11 is a bit different in this regard as compared to 0.12 and 0.13 due to the major refactoring that Numba went through between those versions.
Below is a solution using Josh's advice, which is spot on. It appears however the max() works fine in the below implementation. It would be great if there was a list of "safe" python / numpy functions.
Note: I reduced the dimensionality of the original matrix to 100 x 200]
import numpy as np
from numba import double
from numba.decorators import jit, autojit
X = np.random.random((100,200))
def cooccurance_probability_explicit(X):
C = X.shape[0]
P = X.shape[1]
# - Column Sums - #
CS = np.zeros((P,), dtype=np.float)
for p in range(P):
for c in range(C):
CS[p] += X[c,p]
D = np.empty((P, P), dtype=np.float) #Return Matrix
for i in range(P):
for j in range(P):
# - Compute Elemental Pairwise Sums over each Product Vector - #
pws = 0
for c in range(C):
pws += (X[c,i] * X[c,j])
D[i,j] = pws / max(CS[i], CS[j])
return D
cooccurance_probability_explicit_numba = autojit(cooccurance_probability_explicit)
%timeit results:
%timeit cooccurance_probability(X)
10 loops, best of 3: 83 ms per loop
%timeit cooccurance_probability_explicit(X)
1 loops, best of 3: 2.55s per loop
%timeit cooccurance_probability_explicit_numba(X)
100 loops, best of 3: 7.72 ms per loop
The interesting thing about the results is that the explicitly written version executed by python is very slow due to the large type checking overheads. But passing through Numba works it's magic. (Numba is ~11.5 times faster than the python solution using Numpy).
Update: Added a Cython Function for Comparison (thanks to moarningsun: Cython function with variable sized matrix input)
%load_ext cythonmagic
%%cython
import numpy as np
cimport numpy as np
def cooccurance_probability_cy(double[:,:] X):
cdef int C, P, i, j, k
C = X.shape[0]
P = X.shape[1]
cdef double pws
cdef double [:] CS = np.sum(X, axis=0)
cdef double [:,:] D = np.empty((P,P), dtype=np.float)
for i in range(P):
for j in range(P):
pws = 0.0
for c in range(C):
pws += (X[c, i] * X[c, j])
D[i,j] = pws / max(CS[i], CS[j])
return D
%timeit results:
%timeit cooccurance_probability_cy(X)
100 loops, best of 3: 12 ms per loop

Incomplete gamma functions: can this code get any faster in cython, C, or Fortran?

As part of a large piece of code, I need to calculate arrays of incomplete gamma functions. For example, I need a function that returns (the log of) (gamma(z + m, a, inf)/m!) for m in [0, m_max], for various values of m_max (typically around 400), z, and a. I need to do this quickly. Currently, this step is the the slowest in my code by around a factor of ~2. However, the full code takes ~a day to run, so reducing the computation time of this step by 2 would save me a lot of wall time.
I am using the following cython code for the calculation:
import numpy as np
cimport numpy as np
from mpmath import mp
sp_max = 5000
def log_factorial(k):
return np.sum(np.log(np.arange(1., k + 1., dtype=np.float)))
log_factorial_ary = np.vectorize(log_factorial)(np.arange(sp_max))
gamma_mem = mp.memoize(mp.gamma)
gammainc_mem = mp.memoize(mp.gammainc)
def gammainc_up_fct_ary_log(np.int m_max, np.float z, np.float a):
cdef np.ndarray gi_list = np.zeros(m_max + 1, dtype=np.float)
gi_list[0] = np.float(gammainc_mem(z, a))
cdef np.ndarray i_array = np.arange(1., m_max + 1., dtype=np.float)
cdef Py_ssize_t i
for i in np.arange(1, m_max + 1):
gi_list[i] = (i_array[i-1] - 1. + z)*gi_list[i-1]/i + np.exp((i_array[i-1] - 1. + z)*np.log(a) - a - log_factorial_ary[i])
return gi_list
As an example, when I call gammainc_up_fct_ary_log(400,-0.3,10.0) it takes around ~0.015-0.025 seconds. I would like to speed this up by at least a factor of 2 (or, ideally, as fast as possible).
Is there a clear way to speed up this computation using cython? If not, would C or Fortran be significantly faster? If so, what is the fastest way to write this function in that language and then call the code from python (the rest of my code is written in python/cython).
Thanks in advance.
There are several big issues in your cython version:
i_array is useless, you can safely replace i_array[i-1] by just i
You're not getting the most of cython. If you have a look to the output of cython -a on your code, you'll see that cython is just generating calls to the C-API, while you need calls to C code to have it run fast.
Here is an example of what you could achieve (incomplete, but the speedup is already great)
import numpy as np
cimport numpy as np
cimport cython
from mpmath import mp
cdef extern from "math.h":
double log(double x) nogil
double exp(double x) nogil
sp_max = 5000
def log_factorial(k):
return np.sum(np.log(np.arange(1., k + 1., dtype=np.float)))
factorial_ary = np.array([np.float(mp.factorial(m)) for m in np.arange(sp_max)])
log_factorial_ary = np.vectorize(log_factorial)(np.arange(sp_max))
gamma_mem = mp.memoize(mp.gamma)
gammainc_mem = mp.memoize(mp.gammainc)
def gammainc_up_fct_ary_log(m_max, z, a):
return gammainc_up_fct_ary_log_impl(m_max, z, a)
#cython.boundscheck(False)
#cython.wraparound(False)
#cython.cdivision(True)
cdef gammainc_up_fct_ary_log_impl(int m_max, double z, double a):
cdef double[::1] gi_list = np.zeros(m_max + 1, dtype=np.float)
gi_list[0] = gammainc_mem(z, a)
cdef Py_ssize_t i
for i in range(1, m_max + 1):
t0 = (i - 1. + z)
t1 = (i - 1. + z)*log(a) - a
gi_list[i] = t0*gi_list[i-1]/i + exp(t1 - log_factorial_ary[i])
return gi_list
running this code gives me:
python -m timeit -s 'from ff import gammainc_up_fct_ary_log' 'gammainc_up_fct_ary_log(400,-0.3,10.0)'
10000 loops, best of 3: 132 usec per loop
while your version hardly gives:
python -m timeit -s 'from ff import gammainc_up_fct_ary_log' 'gammainc_up_fct_ary_log(400,-0.3,10.0)'
100 loops, best of 3: 2.44 msec per loop

Categories