Defining NumPy arrays in Cython without incurring Python overhead

I have been trying to learn Cython to speed up some of my calculations. Here is a subset of what I am trying to do: this is simply integrating a differential equation using a recursive formula while making use of NumPy arrays. I have already achieved a factor of ~100x speed increase over the pure python version. However it seems like I can gain added speed based on looking at the HTML file generated for my code by the -a cython command. My code is as follows (lines that become yellow in the HTML file that I would like to make white are labeled):
%%cython
import numpy as np
cimport numpy as np
cimport cython
from libc.math cimport exp, sqrt
@cython.boundscheck(False)
cdef double riccati_int(double j, double w, double h, double an, double d):
    cdef:
        double W
        double an1
    W = sqrt(w**2 + d**2)
    #dark_yellow
    an1 = ((d - (W + w) * an) * exp(-2 * W * h / j ) - d - (W - w) * an) / \
          ((d * an - W + w) * exp(-2 * W * h / j) - d * an - W - w)
    return an1
def acalc(double j, double w):
    cdef:
        int xpos, i, n
        np.ndarray[np.int_t, ndim=1] xvals
        np.ndarray[np.double_t, ndim=1] h, a
    xpos = 74
    xvals = np.array([0, 8, 23, 123, 218], dtype=np.int) #dark_yellow
    h = np.array([1, .1, .01, .1], dtype=np.double) #dark_yellow
    a = np.empty(219, dtype=np.double) #dark_yellow
    a[0] = 1 / (w + sqrt(w**2 + 1)) #light_yellow
    for i in range(h.size): #dark_yellow
        for n in range(xvals[i], xvals[i + 1]): #light_yellow
            if n < xpos:
                a[n+1] = riccati_int(j, w, h[i], a[n], 1.) #light_yellow
            else:
                a[n+1] = riccati_int(j, w, h[i], a[n], 0.) #light_yellow
    return a
It seems to me like all 9 lines that I labeled above should be able to be made white with the proper adjustments. One issue is the ability to define NumPy arrays the proper way. But probably even more important is the ability to get the first labeled line to work efficiently, since this is where the bulk of the calculation is done. I tried reading the generated C code that the HTML file displays after clicking on a yellow line, but I honestly have no clue how to read that code. If anybody could please help me out, it would be greatly appreciated.

I don't think you need to worry about yellow lines that are not inside a loop. Adding the following compiler directives will make the three lines inside the loop faster:
@cython.cdivision(True)
cdef double riccati_int(double j, double w, double h, double an, double d):
    pass
@cython.boundscheck(False)
@cython.wraparound(False)
def acalc(double j, double w):
    pass
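For context on the first directive: as far as I know, cdivision(True) makes Cython emit plain C division and modulo for C-typed operands, skipping the zero-division check and the sign adjustment that Python semantics require. For the double division in riccati_int only the check is at stake; the sign difference only shows up with integers. A small plain-Python sketch of the two conventions (python_divmod and c_divmod are made-up helper names, not part of Cython):

```python
def python_divmod(a, b):
    """Python semantics: the quotient is floored, the remainder takes the sign of b."""
    return a // b, a % b


def c_divmod(a, b):
    """C semantics: the quotient truncates toward zero, the remainder takes the sign of a."""
    q = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        q = -q
    return q, a - q * b


print(python_divmod(-7, 2))  # (-4, 1)
print(c_divmod(-7, 2))       # (-3, -1)
```

If the code relies on Python-style results for negative integers, turning cdivision on changes the answer, so it is worth checking before applying it everywhere.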

I'm not sure whether it makes a difference, but you could use memoryviews for the arrays, e.g.
cdef double [:] h = np.array([1, .1, .01, .1], dtype=np.double) #dark_yellow
cdef double [:] a = np.empty(219, dtype=np.double) #dark_yellow
Also, creating a NumPy array for four static values is a bit overdone. It can be replaced by a static C array:
cdef double *h = [1, .1, .01, .1]
However, as mentioned, what happens inside the loop is what matters most. Since line profiler won't work for Cython (afaik), use the time module to benchmark within the function, in addition to cProfile. That may show you that the intensity of the line color in the Cython annotation has to be assessed in context.
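A minimal sketch of what timing within the function could look like, using only the standard library. The function name and the toy workload are placeholders, not the code from the question; the point is only where the perf_counter calls go relative to setup and hot loop:

```python
import time


def timed_sections(n=200_000):
    """Return (setup_seconds, loop_seconds, total) for a toy computation."""
    t0 = time.perf_counter()
    values = [0.0] * n            # setup, comparable to the one-off array creation
    t1 = time.perf_counter()
    total = 0.0
    for i in range(n):            # hot loop, comparable to the recursion over n
        values[i] = i * 0.5
        total += values[i]
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1, total


setup_s, loop_s, _ = timed_sections()
print(f"setup: {setup_s:.6f} s, loop: {loop_s:.6f} s")
```

If the setup share turns out to be negligible, the dark yellow array-creation lines are probably not worth optimising.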
As I learned, it is recommended to use the Python index types for indexing:
size_t i, n
Py_ssize_t i, n
The second one is the signed version.

Related

Very slow assigning process in cython

I have the following simple function written using Cython syntax:
%%cython
import numpy as np
cimport cython
import math
@cython.boundscheck(False)
@cython.wraparound(False)
def calc_cy(float[:, ::1] matrix, int nXX, int nYY, float git, float dgit, float[:, ::1] bus, float[:, ::1] kapa):
    cdef Py_ssize_t x_max = nXX + 1
    cdef Py_ssize_t y_max = nYY + 1
    result = np.zeros((x_max, y_max), dtype=np.float32)
    cdef float[:, ::1] result_view = result
    cdef float tmp = 0.0, tmp1 = 0.0, pref = 0.0, dgit_u = 0.0
    cdef Py_ssize_t x, y
    pref = 5.1008 * 10.0**-5 * (3.92**(0.08 / 5.214 * (10**2) / (git + 78.05)))
    dgit = dgit/30601
    for x in range(x_max):
        for y in range(y_max):
            dgit_u = dgit * (matrix[x, y]**1.692 / pref)
            tmp = kapa[x, y] + dgit_u
            tmp1 = bus[x, y] - (2.7182**(- tmp ** 4.0 / 1.73)) * dgit_u / 7.13
            #result_view[x, y] = tmp
    return result
If I run this function 100 times with random inputs (code below), it only takes around 0.09 sec. But if I uncomment "result_view[x, y] = tmp" (the line just before the return) and run the same loop, it takes 2.7 sec. Does anyone know why the assignment to the result_view array is so slow? Any comment would be highly appreciated.
import time
import numpy as np
nXX, nYY = 999, 999
git, dgit = np.float32(35.0), np.float32(0.01)
matrix = np.random.uniform(0,1,size=(nXX+1,nYY+1)).astype(np.float32)
bus = np.random.uniform(0,1,size=(nXX+1,nYY+1)).astype(np.float32)
kapa = np.random.uniform(0,1,size=(nXX+1,nYY+1)).astype(np.float32)
past = time.time()
for i in range(100):
    calc_cy(matrix, nXX, nYY, git, dgit, bus, kapa)
print(time.time() - past)
Many thanks!
I tried to recast the data type, but it didn't solve the problem. I also checked to make sure the data type generated by the function is the same as the data type needed by the array. I expected that the assigning process should only take 1 sec at maximum, but it is taking 2 sec.
"But if I uncomment result_view[x, y] = tmp in the line before the last line in the function and I run the same loop, it takes 2.7 sec."
This is something seen quite a bit in optimization questions. What you're seeing is that if you don't use the result of the loop then the C compiler eliminates the whole loop body and it seems really quick.
The 2.7 s is the time it actually takes to run.
It looks to me like you're typing most variables correctly, so there aren't any quick, obvious optimisations.

Nested loops with cython for image processing

I'm trying to iterate over a 2D image containing floating-point depth data. It has a fairly normal resolution (640, 480), but Python has been too slow, so I've been trying to optimize the problem by using Cython.
I've tried moving the looping into other functions and shifting the nogil statement around, but that didn't seem to work. After reworking the problem, I was able to get a portion of it working, but this last part still escapes me.
I've attempted to get rid of python objects from the prange() loop by moving them to the with gil section beforehand, hence:
cdef int[:] w_list = array.array(range(0, w_inc, interpolation))
instead of
for r in range(0, w_inc, interpolation):
but the error persists
My code works in two parts:
The split_data() method subsections the image into num quadrants that are stored in a 3D array bits. These are used to make splitting the work across multiple threads/processes easier. This part works okay.
@cython.cdivision(True)
@cython.boundscheck(False)
cpdef split_data(double[:, :] frame, int h, int w, int num):
    cdef double[:, :, :] bits = np.zeros(shape=(num, h // num, w // num), dtype=float)
    cdef int c_count = os.cpu_count()
    cdef int i, j, k
    for i in prange(num, nogil=True, num_threads=c_count):
        for j in prange(h // num):
            for k in prange(w // num):
                bits[i, j, k] = frame[i * (h // num) + j, i * (w // num) + k]
    return bits
The scatter_data() method takes the bits array from the previous function and then creates another 3D array with length num where num is the length of bits, called points which is a series of 3D coordinates representing valid depth points. It then uses prange() to extract the valid depth data from each of these bits and stores them into points
@cython.cdivision(True)
@cython.boundscheck(False)
cpdef scatter_data(double[:, :] depths, object validator=None,
                   int h=-1, int w=-1, int interpolation=1):
    # Handles if h or w is -1 (default)
    if h < 0 or w < 0:
        h = depths.shape[0] if h < 0 else h
        w = depths.shape[1] if w < 0 else w
    cdef int max_num = w * h
    cdef int c_count = os.cpu_count()
    cdef int h_inc = h // c_count, w_inc = w // c_count
    cdef double[:, :, :] points = np.zeros(shape=(c_count, max_num, 3), dtype=float)
    cdef double[:, :, :] bits = split_data(depths, h, w, c_count)
    cdef int count = 0
    cdef int i, r, c
    cdef int[:] w_list = array.array(range(0, w_inc, interpolation))
    cdef int[:] h_list = array.array(range(0, h_inc, interpolation))
    for i in prange(c_count, nogil=True, num_threads=c_count):
        count = 0
        for r in w_list:
            for c in h_list:
                if depths[c, r] != 0:
                    points[i, count, 0] = w - r
                    points[i, count, 1] = c
                    points[i, count, 2] = depths[c, r]
                    count = count + 1
    points = points[:count]
    return points
For completeness, here are my import statements:
import cython
from cython.parallel import prange
from cpython cimport array
import array
cimport numpy as np
import numpy as np
import os
When compiling the code I keep getting error messages something along the lines of:
Error compiling Cython file:
------------------------------------------------------------
...
cdef int[:] w_list = array.array(range(0, w_inc, interpolation))
cdef int[:] h_list = array.array(range(0, h_inc, interpolation))
for i in prange(c_count, nogil=True, num_threads=c_count):
count = 0
for r in w_list:
^
------------------------------------------------------------
data_util/cy_scatter.pyx:70:17: Iterating over Python object not allowed without gil
and
Error compiling Cython file:
------------------------------------------------------------
...
cdef int[:] w_list = array.array(range(0, w_inc, interpolation))
cdef int[:] h_list = array.array(range(0, h_inc, interpolation))
for i in prange(c_count, nogil=True, num_threads=c_count):
count = 0
for r in w_list:
^
------------------------------------------------------------
data_util/cy_scatter.pyx:70:17: Coercion from Python not allowed without the GIL
and
Error compiling Cython file:
------------------------------------------------------------
...
cdef int[:] w_list = array.array(range(0, w_inc, interpolation))
cdef int[:] h_list = array.array(range(0, h_inc, interpolation))
for i in prange(c_count, nogil=True, num_threads=c_count):
count = 0
for r in w_list:
^
------------------------------------------------------------
data_util/cy_scatter.pyx:70:17: Converting to Python object not allowed without gil
Is there a way to do this? And if so, how do I do this?
You just want to iterate by index rather than by iterating over a Python iterator:
for ri in range(w_list.shape[0]):
    r = w_list[ri]
This is somewhere where best practice in Python differs from best practice in Cython - Cython only accelerates iterating over numeric loops. The way you're trying to do it will fall back to being a Python iterator which is both slower, and requires the GIL.
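The same loop shape can be tried out in plain Python first. This is only a sketch with made-up helper names; in the Cython version, ri and r would be declared as C integers so the loop compiles to a plain C for-loop that needs no GIL. One assumption worth flagging: array.array takes a typecode as its first argument, so 'i' (C int) is used here to match the int[:] memoryview in the question.

```python
from array import array


def strided_indices(stop, step):
    """Build a C-int array of 0, step, 2*step, ... below stop."""
    # array.array needs a typecode first; 'i' is a C int
    return array('i', range(0, stop, step))


def collect_by_index(w_list):
    """Walk the array by position instead of iterating over it directly."""
    out = []
    for ri in range(len(w_list)):
        r = w_list[ri]
        out.append(r)
    return out


print(collect_by_index(strided_indices(10, 3)))  # [0, 3, 6, 9]
```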

Parallelising an Exhaustive Search in Cython

I'm fairly new to Cython, and I'm trying to Cythonize some code of mine. I have a 3D array, X, of complex values (which I'm treating as a large 'stack' of square arrays) which has shape on the scale of (small, small, huge), and I need to find the location and the absolute value of the largest above-diagonal item. I currently have an exhaustive search like this:
cdef double complex[:,:,:] Xcp = X.copy()
cdef Py_ssize_t h = Xcp.shape[0]
cdef Py_ssize_t w = Xcp.shape[1]
cdef Py_ssize_t l = Xcp.shape[2]
cdef Py_ssize_t j, k, m
cdef double tmptop = 0.0
cdef Py_ssize_t[:] coords = np.zeros((3), dtype="intp")
cdef double it
for j in range(l):
    for k in range(w):
        for m in range(k+1, h):
            it = cabs(Xcp[m, k, j])
            if it > tmptop:
                tmptop = it
                coords[0] = m
                coords[1] = k
                coords[2] = j
Note that I'm getting cabs from here:
cdef extern from "complex.h":
double cabs(double complex)
This code is already quite a lot faster than what I had previously in NumPy, but I do feel as though it could be sped up, in particular parallelised.
I have tried changing the loop to this:
with nogil:
    for j in prange(l):
        for k in range(w):
            for m in range(k+1, h):
                it = abs(Xcp[m, k, j])
                if it > tmptop:
                    tmptop = it
                    coords[0] = m
                    coords[1] = k
                    coords[2] = j
Though I'm now getting the wrong result. What's going on here?
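A likely culprit, stated as an assumption rather than a confirmed diagnosis: inside prange, a variable that is assigned in the loop body (tmptop, it) becomes thread-private, while the coords memoryview is shared, so each thread compares against its own maximum but all threads overwrite the same coords. A common way around this is to let each worker find a local maximum over its own slice of j and to combine the results afterwards. A plain-Python sketch of that pattern (local_best and parallel_best are hypothetical names, and nested lists stand in for the 3D array):

```python
from concurrent.futures import ThreadPoolExecutor


def local_best(X, j_start, j_stop):
    """Largest |X[m][k][j]| with m > k for j in [j_start, j_stop), plus its location."""
    h, w = len(X), len(X[0])
    best, where = 0.0, (0, 0, 0)
    for j in range(j_start, j_stop):
        for k in range(w):
            for m in range(k + 1, h):
                val = abs(X[m][k][j])
                if val > best:
                    best, where = val, (m, k, j)
    return best, where


def parallel_best(X, n_chunks=4):
    """Split the long axis into chunks, search each independently, then reduce."""
    length = len(X[0][0])
    bounds = [(c * length // n_chunks, (c + 1) * length // n_chunks)
              for c in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        results = list(pool.map(lambda b: local_best(X, *b), bounds))
    # The final reduction runs in one thread, so there is no race on the result
    return max(results, key=lambda r: r[0])
```

In Cython the analogous layout would be one slot per thread for the running maximum and its coordinates, followed by a short serial loop over those slots.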

iterating through specified axis in cython

I am learning cython and I have modified the code in the tutorial to try to do numerical differentiation:
import numpy as np
cimport numpy as np
import cython
np.import_array()
def test3(a, int order=2, int axis=-1):
    cdef int i
    if axis<0:
        axis = len(a.shape) + axis
    out = np.empty(a.shape, np.double)
    cdef np.flatiter ita = np.PyArray_IterAllButAxis(a, &axis)
    cdef np.flatiter ito = np.PyArray_IterAllButAxis(out, &axis)
    cdef int a_axis_stride = a.strides[axis]
    cdef int o_axis_stride = out.strides[axis]
    cdef int axis_length = out.shape[axis]
    cdef double value
    while np.PyArray_ITER_NOTDONE(ita):
        # first element
        pt1 = <double*>((<char*>np.PyArray_ITER_DATA(ita)))
        pt2 = (<double*>((<char*>np.PyArray_ITER_DATA(ita)) + 1*a_axis_stride))
        pt3 = (<double*>((<char*>np.PyArray_ITER_DATA(ita)) + 2*a_axis_stride))
        value = -1.5*pt1[0] + 2*pt2[0] - 0.5*pt3[0]
        (<double*>((<char*>np.PyArray_ITER_DATA(ito))))[0] = value
        for i in range(axis_length-2):
            pt1 = (<double*>((<char*>np.PyArray_ITER_DATA(ita)) + i*a_axis_stride))
            pt2 = (<double*>((<char*>np.PyArray_ITER_DATA(ita)) + (i+2)*a_axis_stride))
            value = -0.5*pt1[0] + 0.5*pt2[0]
            (<double*>((<char*>np.PyArray_ITER_DATA(ito)) + (i+1)*o_axis_stride))[0] = value
        # last element
        pt1 = (<double*>((<char*>np.PyArray_ITER_DATA(ita))+ (axis_length-3)*a_axis_stride))
        pt2 = (<double*>((<char*>np.PyArray_ITER_DATA(ita))+ (axis_length-2)*a_axis_stride))
        pt3 = (<double*>((<char*>np.PyArray_ITER_DATA(ita))+ (axis_length-1)*a_axis_stride))
        value = 1.5*pt3[0] - 2*pt2[0] + 0.5*pt1[0]
        (<double*>((<char*>np.PyArray_ITER_DATA(ito))+(axis_length-1)*o_axis_stride))[0] = value
        np.PyArray_ITER_NEXT(ita)
        np.PyArray_ITER_NEXT(ito)
    return out
The code produces correct results, and it is indeed faster than my own numpy implementation without cython. The problem is the following:
1. I thought about only having one pt1 = (<double*>((<char*>np.PyArray_ITER_DATA(ita)) + i*a_axis_stride)) statement and then use pt1[0], pt1[-1], pt1[1] to access the array elements, but this only works if the specified axis is the last one. If I am differentiating a different axis (not the last one), then (<double*>((<char*>np.PyArray_ITER_DATA(ita)) + i*a_axis_stride)) points to the right one, but pt[-1] and pt[1] point to the elements right before and after pt[0], which is along the last axis. The current version works, but if I want to implement higher-order differentiation which requires more points to evaluate, then I will end up having many such lines, and I'm not sure if there are better/more efficient ways to do it using pt[1] or something like pt[xxx] to access neighbouring points (along the specified axis).
2. Are there other ways to speed up this piece of code? I am looking for some minor details that I may have overlooked or subtle things that can have a big effect.
To my slight surprise I can't actually beat your version using Cython typed memoryviews - the numpy iterators look pretty quick. However I think I can manage a significant increase in readability to let you use the Python slicing syntax. The only restriction is that the input array must be C contiguous to allow it to be reshaped easily (I think Fortran contiguous might also work, but I haven't tested)
The basic trick is to flatten all the axes before and after selected axis so it is a known 3D shape, at which point you can use Cython memoryviews.
@cython.boundscheck(False)
def test4(a, order=2, axis=-1):
    assert a.flags['C_CONTIGUOUS'] # otherwise the reshape doesn't work
    if axis < 0:
        axis += a.ndim # normalise, otherwise shape[(axis+1):] is the whole shape for axis=-1
    before = int(np.prod(a.shape[:axis]))
    after = int(np.prod(a.shape[(axis+1):]))
    cdef double[:,:,::1] a_new = a.reshape((before, a.shape[axis], after)) # this should not involve copying memory - it's just a new view
    cdef double[:] a_slice
    cdef double[:,:,::1] out = np.empty_like(a_new)
    assert a_new.shape[1] > 3
    cdef int m,n,i
    for m in range(a_new.shape[0]):
        for n in range(a_new.shape[2]):
            a_slice = a_new[m,:,n]
            out[m,0,n] = -1.5*a_slice[0] + 2*a_slice[1] - 0.5*a_slice[2]
            for i in range(a_slice.shape[0]-2):
                out[m,i+1,n] = -0.5*a_slice[i] + 0.5*a_slice[i+2]
            # last element
            out[m,-1,n] = 1.5*a_slice[-1] - 2*a_slice[-2] + 0.5*a_slice[-3]
    return np.asarray(out).reshape(a.shape)
The speed is very slightly slower than your version I think.
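The shape bookkeeping behind the flattening trick can be checked in plain Python. This is a sketch only (flatten_around_axis is a hypothetical helper, and math.prod needs Python 3.8 or later); note that a negative axis has to be normalised first, otherwise the slice after the axis is wrong:

```python
import math


def flatten_around_axis(shape, axis):
    """Collapse all axes before and after `axis` into one axis each."""
    ndim = len(shape)
    if axis < 0:
        axis += ndim              # -1 becomes ndim - 1, and so on
    before = math.prod(shape[:axis])       # product of an empty slice is 1
    after = math.prod(shape[axis + 1:])
    return before, shape[axis], after


print(flatten_around_axis((2, 3, 4, 5), 1))   # (2, 3, 20)
print(flatten_around_axis((2, 3, 4), -1))     # (6, 4, 1)
```

The product of the three returned sizes always equals the total number of elements, which is what lets reshape return a view for a C-contiguous input.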
In terms of improving your code, you could work out the stride in doubles instead of bytes (a_axis_stride_dbl = a_axis_stride // sizeof(double)) and then index as pt[i*a_axis_stride_dbl]. It probably won't gain much speed but will be more readable. (This is what you ask about in point 1.)
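To make the element-stride idea concrete, here is a plain-Python sketch on a flat list standing in for the array buffer. It assumes a C-contiguous layout and 8-byte doubles; element_strides and neighbours are illustrative names only:

```python
def element_strides(shape, itemsize=8):
    """Return (byte strides, element strides) for a C-contiguous array."""
    byte_strides = []
    step = itemsize
    for n in reversed(shape):
        byte_strides.append(step)
        step *= n
    byte_strides.reverse()
    return byte_strides, [s // itemsize for s in byte_strides]


def neighbours(flat, shape, index, axis):
    """Previous, current and next element along `axis`, via one base offset."""
    _, estr = element_strides(shape)
    pos = sum(i * s for i, s in zip(index, estr))
    step = estr[axis]             # distance between neighbours along `axis`
    return flat[pos - step], flat[pos], flat[pos + step]


flat = list(range(12))            # a 3 x 4 array stored row by row
print(neighbours(flat, (3, 4), (1, 2), axis=0))   # (2, 6, 10)
print(neighbours(flat, (3, 4), (1, 2), axis=1))   # (5, 6, 7)
```

With the stride expressed in elements, a higher-order stencil only needs more multiples of step rather than more pointer casts.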

Incomplete gamma functions: can this code get any faster in cython, C, or Fortran?

As part of a large piece of code, I need to calculate arrays of incomplete gamma functions. For example, I need a function that returns (the log of) (gamma(z + m, a, inf)/m!) for m in [0, m_max], for various values of m_max (typically around 400), z, and a. I need to do this quickly. Currently, this step is the slowest in my code by around a factor of ~2. However, the full code takes ~a day to run, so reducing the computation time of this step by 2 would save me a lot of wall time.
I am using the following cython code for the calculation:
import numpy as np
cimport numpy as np
from mpmath import mp
sp_max = 5000
def log_factorial(k):
    return np.sum(np.log(np.arange(1., k + 1., dtype=np.float)))
log_factorial_ary = np.vectorize(log_factorial)(np.arange(sp_max))
gamma_mem = mp.memoize(mp.gamma)
gammainc_mem = mp.memoize(mp.gammainc)
def gammainc_up_fct_ary_log(np.int m_max, np.float z, np.float a):
    cdef np.ndarray gi_list = np.zeros(m_max + 1, dtype=np.float)
    gi_list[0] = np.float(gammainc_mem(z, a))
    cdef np.ndarray i_array = np.arange(1., m_max + 1., dtype=np.float)
    cdef Py_ssize_t i
    for i in np.arange(1, m_max + 1):
        gi_list[i] = (i_array[i-1] - 1. + z)*gi_list[i-1]/i + np.exp((i_array[i-1] - 1. + z)*np.log(a) - a - log_factorial_ary[i])
    return gi_list
As an example, when I call gammainc_up_fct_ary_log(400,-0.3,10.0) it takes around ~0.015-0.025 seconds. I would like to speed this up by at least a factor of 2 (or, ideally, as fast as possible).
Is there a clear way to speed up this computation using cython? If not, would C or Fortran be significantly faster? If so, what is the fastest way to write this function in that language and then call the code from python (the rest of my code is written in python/cython).
Thanks in advance.
There are several big issues in your cython version:
i_array is useless, you can safely replace i_array[i-1] by just i
You're not getting the most out of Cython. If you have a look at the output of cython -a on your code, you'll see that Cython is mostly generating calls to the Python C-API, whereas you need calls to plain C code for it to run fast.
Here is an example of what you could achieve (incomplete, but the speedup is already great)
import numpy as np
cimport numpy as np
cimport cython
from mpmath import mp
cdef extern from "math.h":
    double log(double x) nogil
    double exp(double x) nogil
sp_max = 5000
def log_factorial(k):
    return np.sum(np.log(np.arange(1., k + 1., dtype=np.float)))
factorial_ary = np.array([np.float(mp.factorial(m)) for m in np.arange(sp_max)])
log_factorial_ary = np.vectorize(log_factorial)(np.arange(sp_max))
gamma_mem = mp.memoize(mp.gamma)
gammainc_mem = mp.memoize(mp.gammainc)
def gammainc_up_fct_ary_log(m_max, z, a):
    return gammainc_up_fct_ary_log_impl(m_max, z, a)
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cdef gammainc_up_fct_ary_log_impl(int m_max, double z, double a):
    cdef double[::1] gi_list = np.zeros(m_max + 1, dtype=np.float)
    gi_list[0] = gammainc_mem(z, a)
    cdef Py_ssize_t i
    for i in range(1, m_max + 1):
        t0 = (i - 1. + z)
        t1 = (i - 1. + z)*log(a) - a
        gi_list[i] = t0*gi_list[i-1]/i + exp(t1 - log_factorial_ary[i])
    return gi_list
running this code gives me:
python -m timeit -s 'from ff import gammainc_up_fct_ary_log' 'gammainc_up_fct_ary_log(400,-0.3,10.0)'
10000 loops, best of 3: 132 usec per loop
while your version only manages:
python -m timeit -s 'from ff import gammainc_up_fct_ary_log' 'gammainc_up_fct_ary_log(400,-0.3,10.0)'
100 loops, best of 3: 2.44 msec per loop
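For reference, the recurrence itself can be written with nothing but the standard library, which makes it easy to check against a case with a known closed form. This is a sketch under assumptions: the starting value Gamma(z, a) is passed in as g0 (in the code above it comes from mpmath), the function names are made up, and the log-factorial table is built with a running sum rather than one full sum per k.

```python
import math


def log_factorial_table(n):
    """log(k!) for k = 0..n-1, accumulated one term at a time."""
    table = [0.0] * n
    for k in range(1, n):
        table[k] = table[k - 1] + math.log(k)
    return table


def gammainc_up_over_factorial(m_max, z, a, g0):
    """Gamma(z + m, a) / m! for m = 0..m_max, given g0 = Gamma(z, a)."""
    log_fact = log_factorial_table(m_max + 1)
    log_a = math.log(a)           # hoisted out of the loop, it never changes
    out = [0.0] * (m_max + 1)
    out[0] = g0
    for i in range(1, m_max + 1):
        s = i - 1.0 + z
        out[i] = s * out[i - 1] / i + math.exp(s * log_a - a - log_fact[i])
    return out


# For z = 1, Gamma(1 + m, a) / m! = exp(-a) * sum(a**k / k! for k in 0..m)
vals = gammainc_up_over_factorial(3, 1.0, 2.0, math.exp(-2.0))
print(vals[3], math.exp(-2.0) * (1 + 2 + 2 + 4 / 3))
```

Typing the table as a double[::1] memoryview in the Cython version would presumably remove the remaining Python-object lookup inside the loop.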
