Parallelising an Exhaustive Search in Cython

Parallelising an Exhaustive Search in Cython - python

I'm fairly new to Cython, and I'm trying to Cythonize some code of mine. I have a 3D array, X, of complex values (which I'm treating as a large 'stack' of square arrays) which has shape on the scale of (small, small, huge), and I need to find the location and the absolute value of the largest above-diagonal item. I currently have an exhaustive search like this:
cdef double complex[:,:,:] Xcp = X.copy()
cdef Py_ssize_t h = Xcp.shape[0]
cdef Py_ssize_t w = Xcp.shape[1]
cdef Py_ssize_t l = Xcp.shape[2]
cdef Py_ssize_t j, k, m
cdef double tmptop = 0.0
cdef Py_ssize_t[:] coords = np.zeros((3), dtype="intp")
cdef double it
for j in range(l):
for k in range(w):
for m in range(k+1, h):
it = cabs(Xcp[m, k, j])
if it > tmptop:
tmptop = it
coords[0] = m
coords[1] = k
coords[2] = j
Note that I'm getting cabs from here:
cdef extern from "complex.h":
double cabs(double complex)
This code is already quite a lot faster than what I had previously in Numpy, but I do feel as though it could be sped up, in particular paralellised.
I have tried changing the loop to this:
with nogil:
for j in prange(l):
for k in range(w):
for m in range(k+1, h):
it = abs(Xcp[m, k, j])
if it > tmptop:
tmptop = it
coords[0] = m
coords[1] = k
coords[2] = j
Though I'm now getting the wrong result. What's going on here?

Related

Nested loops with cython for image processing

I'm trying to iterate over a 2D image containing floating-point depth data, it has a somewhat normal resolution (640, 480), but python has been too slow, so I've been trying to optimize the problem by using cython.
I've tried to move the looping to other functions, shifting around the nogil statement, didn't seem to work, after reworking the problem, I was able to get a portion of it working. But this last part is escaping me to no avail.
I've attempted to get rid of python objects from the prange() loop by moving them to the with gil section beforehand, hence:
cdef int[:] w_list = array.array(range(0, w_inc, interpolation))
instead of
for r in range(0, w_inc, interpolation):
but the error persists
My code works in two parts:
The split_data() method subsections the image into num quadrants that are stored in a 3D array bits. These are use to make splitting up the work to multiple thread/processes easier. This part works okay.
#cython.cdivision(True)
#cython.boundscheck(False)
cpdef split_data(double[:, :] frame, int h, int w, int num):
cdef double[:, :, :] bits = np.zeros(shape=(num, h // num, w // num), dtype=float)
cdef int c_count = os.cpu_count()
cdef int i, j, k
for i in prange(num, nogil=True, num_threads=c_count):
for j in prange(h // num):
for k in prange(w // num):
bits[i, j, k] = frame[i * (h // num) + j, i * (w // num) + k]
return bits
The scatter_data() method takes the bits array from the previous function and then creates another 3D array with length num where num is the length of bits, called points which is a series of 3D coordinates representing valid depth points. It then uses prange() to extract the valid depth data from each of these bits and stores them into points
#cython.cdivision(True)
#cython.boundscheck(False)
cpdef scatter_data(double[:, :] depths, object validator=None,
int h=-1, int w=-1, int interpolation=1):
# Handles if h or w is -1 (default)
if h < 0 or w < 0:
h = depths.shape[0] if h < 0 else h
w = depths.shape[1] if w < 0 else w
cdef int max_num = w * h
cdef int c_count = os.cpu_count()
cdef int h_inc = h // c_count, w_inc = w // c_count
cdef double[:, :, :] points = np.zeros(shape=(c_count, max_num, 3), dtype=float)
cdef double[:, :, :] bits = split_data(depths, h, w, c_count)
cdef int count = 0
cdef int i, r, c
cdef int[:] w_list = array.array(range(0, w_inc, interpolation))
cdef int[:] h_list = array.array(range(0, h_inc, interpolation))
for i in prange(c_count, nogil=True, num_threads=c_count):
count = 0
for r in w_list:
for c in h_list:
if depths[c, r] != 0:
points[i, count, 0] = w - r
points[i, count, 1] = c
points[i, count, 2] = depths[c, r]
count = count + 1
points = points[:count]
return points
and for completeness
3. Here are my import statements
import cython
from cython.parallel import prange
from cpython cimport array
import array
cimport numpy as np
import numpy as np
import os
When compiling the code I keep getting error messages something along the lines of:
Error compiling Cython file:
------------------------------------------------------------
...
cdef int[:] w_list = array.array(range(0, w_inc, interpolation))
cdef int[:] h_list = array.array(range(0, h_inc, interpolation))
for i in prange(c_count, nogil=True, num_threads=c_count):
count = 0
for r in w_list:
^
------------------------------------------------------------
data_util/cy_scatter.pyx:70:17: Iterating over Python object not allowed without gil
and
Error compiling Cython file:
------------------------------------------------------------
...
cdef int[:] w_list = array.array(range(0, w_inc, interpolation))
cdef int[:] h_list = array.array(range(0, h_inc, interpolation))
for i in prange(c_count, nogil=True, num_threads=c_count):
count = 0
for r in w_list:
^
------------------------------------------------------------
data_util/cy_scatter.pyx:70:17: Coercion from Python not allowed without the GIL
and
Error compiling Cython file:
------------------------------------------------------------
...
cdef int[:] w_list = array.array(range(0, w_inc, interpolation))
cdef int[:] h_list = array.array(range(0, h_inc, interpolation))
for i in prange(c_count, nogil=True, num_threads=c_count):
count = 0
for r in w_list:
^
------------------------------------------------------------
data_util/cy_scatter.pyx:70:17: Converting to Python object not allowed without gil
Is there a way to do this? And if so, how do I do this?

You just want to iterate by index rather than by iterating over a Python iterator:
for ri in range(w_list.shape[0]):
r = w_list[ri]
This is somewhere where best practice in Python differs from best practice in Cython - Cython only accelerates iterating over numeric loops. The way you're trying to do it will fall back to being a Python iterator which is both slower, and requires the GIL.

Cython call optimization

I've got a Python function I try to export to Cython. I have tested two implementations but I don't understand why the second one is slower than the first one. Furthermore, I am looking for ways to improve speed a little more but I have no clue how ?
Base code
import numpy as np
cimport numpy as np
cimport cython
DTYPE = np.int
ctypedef np.int_t DTYPE_t
cdef inline int int_max(int a, int b): return a if a >= b else b
cdef inline int int_min(int a, int b): return a if a <= b else b
cdef extern from "math.h":
double exp(double x)
#cython.boundscheck(False)
#cython.wraparound(False)
def bilateral_filter_C(np.ndarray[np.float_t, ndim=1] samples, int w=20):
# Filter Parameters
cdef Py_ssize_t size = samples.shape[0]
cdef float rang
cdef float sigma = 2*3.0*3.0
cdef int j, L
cdef unsigned int a, b
cdef np.float_t W, num, sub_sample, intensity
# Initialization
cdef np.ndarray[np.float_t, ndim=1] gauss = np.zeros(2*w+1, dtype=np.float)
cdef np.ndarray[np.float_t, ndim=1] sub_samples, intensities = np.empty(size, dtype=np.float)
cdef np.ndarray[np.float_t, ndim=1] samples_filtered = np.empty(size, dtype=np.float)
L = 2*w+1
for j in xrange(L):
rang = -w+1.0/L
rang *= rang
gauss[j] = exp(-rang/sigma)
<CODE TO IMPROVE>
return samples_filtered
I tried to inject those two code samples in the <CODE TO IMPROVE> section:
Most efficient approach
for i in xrange(size):
a = <unsigned int>int_max(i-w, 0)
b = <unsigned int>int_min(i+w, size-1)
L = b-a
sub_samples = samples[a:b]-samples[i]
sub_samples *= sub_samples
for j in xrange(L):
sub_samples[j] = exp(-sub_samples[j]/sigma)
intensities = gauss[w-i+a:w-i+b]*sub_samples
num = 0.0
W = 0.0
for j in xrange(L):
W += intensities[j]
num += intensities[j]*samples[a+j]
samples_filtered[i] = num/W
Result
%timeit -n1 -r10 bilateral_filter_C(x, 20)
1 loop, best of 10: 45 ms per loop
Less efficient
for i in xrange(size):
a = <unsigned int>int_max(i-w, 0)
b = <unsigned int>int_min(i+w, size-1)
num = 0.0
W = 0.0
for j in xrange(b-a):
sub_sample = samples[a+j]-samples[i]
intensity1 = gauss[w-i+a+j]*exp(-sub_sample*sub_sample/sigma)
W += intensity
num += intensity*samples[a+j]
samples_filtered[i] = num/W
Result
%timeit -n1 -r10 bilateral_filter_C(x, 20)
1 loop, best of 10: 125 ms per loop

You have a few typos:
1) You forgot to define i, just add cdef int i, j, L
2) In the second algorithm you wrote intensity1 = gauss[w-i+a+j]*exp(-sub_sample*sub_sample/sigma), it should be intensity, without the 1
3) I would add #cython.cdivision(True) to avoid the check of division by zero
With those changes and with x = np.random.rand(10000)I got the following results
%timeit bilateral_filter_C1(x, 20) # First code
10 loops, best of 3: 74.1 ms per loop
%timeit bilateral_filter_C2(x, 20) # Second code
100 loops, best of 3: 9.5 ms per loop
And, to check the results
np.all(np.equal(bilateral_filter_C1(x, 20), bilateral_filter_C2(x, 20)))
True
To avoid these problems I suggest to use the option cython my_file.pyx -a, it generates an html file that shows you the possible problems you have in your code
EDIT
Reading again the code, it seems to have more errors:
for j in xrange(L):
rang = -w+1.0/L
rang *= rang
gauss[j] = exp(-rang/sigma)
gauss has the same value always, what is the definition of rang?

Defining NumPy arrays in Cython without incurring python overhead

I have been trying to learn Cython to speed up some of my calculations. Here is a subset of what I am trying to do: this is simply integrating a differential equation using a recursive formula while making use of NumPy arrays. I have already achieved a factor of ~100x speed increase over the pure python version. However it seems like I can gain added speed based on looking at the HTML file generated for my code by the -a cython command. My code is as follows (lines that become yellow in the HTML file that I would like to make white are labeled):
%%cython
import numpy as np
cimport numpy as np
cimport cython
from libc.math cimport exp,sqrt
#cython.boundscheck(False)
cdef double riccati_int(double j, double w, double h, double an, double d):
cdef:
double W
double an1
W = sqrt(w**2 + d**2)
#dark_yellow
an1 = ((d - (W + w) * an) * exp(-2 * W * h / j ) - d - (W - w) * an) /
((d * an - W + w) * exp(-2 * W * h / j) - d * an - W - w)
return an1
def acalc(double j, double w):
cdef:
int xpos, i, n
np.ndarray[np.int_t, ndim=1] xvals
np.ndarray[np.double_t, ndim=1] h, a
xpos = 74
xvals = np.array([0, 8, 23, 123, 218], dtype=np.int) #dark_yellow
h = np.array([1, .1, .01, .1], dtype=np.double) #dark_yellow
a = np.empty(219, dtype=np.double) #dark_yellow
a[0] = 1 / (w + sqrt(w**2 + 1)) #light_yellow
for i in range(h.size): #dark_yellow
for n in range(xvals[i], xvals[i + 1]): #light_yellow
if n < xpos:
a[n+1] = riccati_int(j, w, h[i], a[n], 1.) #light_yellow
else:
a[n+1] = riccati_int(j, w, h[i], a[n], 0.) #light_yellow
return a
It seems to me like all 9 lines that I labeled above should be able to be made white with the proper adjustments. One issue is the ability to define NumPy arrays the proper way. But probably even more important is the ability to get the first labeled line to work efficiently, since this is where the bulk of the calculation is done. I tried reading the generated C code that the HTML file displays after clicking on a yellow line, but I honestly have no clue how to read that code. If anybody could please help me out, it would be greatly appreciated.

I think you don't need to care about yellow lines that is not in loop. Add following compiler directives will make the three lines in loop faster:
#cython.cdivision(True)
cdef double riccati_int(double j, double w, double h, double an, double d):
pass
#cython.boundscheck(False)
#cython.wraparound(False)
def acalc(double j, double w):
pass

I'm not sure, whether it makes a difference, but you could do use memory-views for the arrays, e. g.
cdef double [:] h = np.array([1, .1, .01, .1], dtype=np.double) #dark_yellow
cdef double [:] a = np.empty(219, dtype=np.double) #dark_yellow
Also creating an numpy array for four static values is a bit overdone. This can be replaced by a static C array
cdef double *h = [1, .1, .01, .1]
However, as mentioned, what in the loop is, that matters most. Since line profiler won't work for cython (afaik) use time module to benchmark within the function, besides using cProfile. It might give you an idea, that the intensity of the line color in the cython log has to be assessed in context.
It is recommended to use the python types for indexing, as I learned
size_t i, n
Py_ssize_t i, n
The second one is the signed version

numpy template matching using matrix multiplications

I am trying to match a template with a binary image (only black and white) by shifting the template along the image. And return the minimum distance between the template and the image with the corresponding position on which this minimum distance did occur. For example:
img:
0 1 0
0 0 1
0 1 1
template:
0 1
1 1
This template matches the image best at position (1,1) and the distance will then be 0. So far things are not too difficult and I already got some code that does the trick.
def match_template(img, template):
mindist = float('inf')
idx = (-1,-1)
for y in xrange(img.shape[1]-template.shape[1]+1):
for x in xrange(img.shape[0]-template.shape[0]+1):
#calculate Euclidean distance
dist = np.sqrt(np.sum(np.square(template - img[x:x+template.shape[0],y:y+template.shape[1]])))
if dist < mindist:
mindist = dist
idx = (x,y)
return [mindist, idx]
But for images of the size I need (image among 500 x 200 pixels and template among 250 x 100) this already takes approximately 4.5 seconds, which is way too slow. And I know the same thing can be done much quicker using matrix multiplications (in matlab I believe this can be done using im2col and repmat). Can anyone explain me how to do it in python/numpy?
btw. I know there is an opencv matchTemplate function that does exactly what I need, but since I might need to alter the code slightly later on I would prefer a solution which I fully understand and can alter.
Thanks!
edit: If anyone can explain me how opencv does this in less than 0.2 seconds that would also be great. I have had a short look at the source code, but those things somehow always look quite complicated to me.
edit2: Cython code
import numpy as np
cimport numpy as np
DTYPE = np.int
ctypedef np.int_t DTYPE_t
def match_template(np.ndarray img, np.ndarray template):
cdef float mindist = float('inf')
cdef int x_coord = -1
cdef int y_coord = -1
cdef float dist
cdef unsigned int x, y
cdef int img_width = img.shape[0]
cdef int img_height = img.shape[1]
cdef int template_width = template.shape[0]
cdef int template_height = template.shape[1]
cdef int range_x = img_width-template_width+1
cdef int range_y = img_height-template_height+1
for y from 0 <= y < range_y:
for x from 0 <= x < range_x:
dist = np.sqrt(np.sum(np.square(template - img[ x:<unsigned int>(x+template_width), y:<unsigned int>(y+template_height) ]))) #calculate euclidean distance
if dist < mindist:
mindist = dist
x_coord = x
y_coord = y
return [mindist, (x_coord,y_coord)]
img = np.asarray(img, dtype=DTYPE)
template = np.asarray(template, dtype=DTYPE)
match_template(img, template)

One possible way of doing what you want is via convolution (which can be brute force or FFT). Matrix multiplications AFAIK won't work. You need to convolve your data with the template. And find the maximum (you'll also need to do some scaling to make it work properly).
xs=np.array([[0,1,0],[0,0,1],[0,1,1]])*1.
ys=np.array([[0,1],[1,1]])*1.
print scipy.ndimage.convolve(xs,ys,mode='constant',cval=np.inf)
>>> array([[ 1., 1., inf],
[ 0., 2., inf],
[ inf, inf, inf]])
print scipy.signal.fftconvolve(xs,ys,mode='valid')
>>> array([[ 1., 1.],
[ 0., 2.]])

There may be a fancy way to get this done using pure numpy/scipy magic. But it might be easier (and more understandable when you look at the code in the future) to just drop into Cython to get this done. There's a good tutorial for integrating Cython with numpy at http://docs.cython.org/src/tutorial/numpy.html.
EDIT:
I did a quick test with your Cython code and it ran ~15 sec for a 500x400 img with a 100x200 template. After some tweaks (eliminating the numpy method calls and numpy bounds checking), I got it down under 3 seconds. That may not be enough for you, but it shows the possibility.
import numpy as np
cimport numpy as np
cimport cython
from libc.math cimport sqrt
DTYPE = np.int
ctypedef np.int_t DTYPE_t
#cython.boundscheck(False)
def match_template(np.ndarray[DTYPE_t, ndim=2] img, np.ndarray[DTYPE_t, ndim=2] template):
cdef float mindist = float('inf')
cdef int x_coord = -1
cdef int y_coord = -1
cdef float dist
cdef unsigned int x, y
cdef int img_width = img.shape[0]
cdef int img_height = img.shape[1]
cdef int template_width = template.shape[0]
cdef int template_height = template.shape[1]
cdef int range_x = img_width-template_width+1
cdef int range_y = img_height-template_height+1
cdef DTYPE_t total
cdef int delta
cdef unsigned int j, k, j_plus, k_plus
for y from 0 <= y < range_y:
for x from 0 <= x < range_x:
#dist = np.sqrt(np.sum(np.square(template - img[ x:<unsigned int>(x+template_width), y:<unsigned int>(y+template_height) ]))) #calculate euclidean distance
# Do the same operations, but in plain C
total = 0
for j from 0 <= j < template_width:
j_plus = <unsigned int>x + j
for k from 0 <= k < template_height:
k_plus = <unsigned int>y + k
delta = template[j, k] - img[j_plus, k_plus]
total += delta*delta
dist = sqrt(total)
if dist < mindist:
mindist = dist
x_coord = x
y_coord = y
return [mindist, (x_coord,y_coord)]

Using Cython to speed up connected components algorithm

First off, I am using python[2.7.2], numpy[1.6.2rc1], cython[0.16], gcc[MinGW] compiler, on a windows xp machine.
I needed a 3D connected components algorithm to process some 3D binary data (i.e. 1s and 0s) stored in numpy arrays. Unfortunately, I could not find any existing code so I adapted the code found here to work with 3D arrays. Everything works great, however speed is desirable for processing huge data sets. As a result I stumbled upon cython and decided to give it a try.
So far cython has improved the speed:
Cython: 0.339 s
Python: 0.635 s
Using cProfile, my time consuming line in the pure python version is:
new_region = min(filter(lambda i: i > 0, array_region[xMin:xMax,yMin:yMax,zMin:zMax].ravel()))
The Question: What is the correct way to "cythonize" the lines:
new_region = min(filter(lambda i: i > 0, array_region[xMin:xMax,yMin:yMax,zMin:zMax].ravel()))
for x,y,z in zip(ind[0],ind[1],ind[2]):
Any help would be appreciated and hopefully this work will help others.
Pure python version [*.py]:
import numpy as np
def find_regions_3D(Array):
x_dim=np.size(Array,0)
y_dim=np.size(Array,1)
z_dim=np.size(Array,2)
regions = {}
array_region = np.zeros((x_dim,y_dim,z_dim),)
equivalences = {}
n_regions = 0
#first pass. find regions.
ind=np.where(Array==1)
for x,y,z in zip(ind[0],ind[1],ind[2]):
# get the region number from all surrounding cells including diagnols (27) or create new region
xMin=max(x-1,0)
xMax=min(x+1,x_dim-1)
yMin=max(y-1,0)
yMax=min(y+1,y_dim-1)
zMin=max(z-1,0)
zMax=min(z+1,z_dim-1)
max_region=array_region[xMin:xMax+1,yMin:yMax+1,zMin:zMax+1].max()
if max_region > 0:
#a neighbour already has a region, new region is the smallest > 0
new_region = min(filter(lambda i: i > 0, array_region[xMin:xMax+1,yMin:yMax+1,zMin:zMax+1].ravel()))
#update equivalences
if max_region > new_region:
if max_region in equivalences:
equivalences[max_region].add(new_region)
else:
equivalences[max_region] = set((new_region, ))
else:
n_regions += 1
new_region = n_regions
array_region[x,y,z] = new_region
#Scan Array again, assigning all equivalent regions the same region value.
for x,y,z in zip(ind[0],ind[1],ind[2]):
r = array_region[x,y,z]
while r in equivalences:
r= min(equivalences[r])
array_region[x,y,z]=r
#return list(regions.itervalues())
return array_region
Pure python speedups:
#Original line:
new_region = min(filter(lambda i: i > 0, array_region[xMin:xMax+1,yMin:yMax+1,zMin:zMax+1].ravel()))
#ver A:
new_region = array_region[xMin:xMax+1,yMin:yMax+1,zMin:zMax+1]
min(new_region[new_region>0])
#ver B:
new_region = min( i for i in array_region[xMin:xMax,yMin:yMax,zMin:zMax].ravel() if i>0)
#ver C:
sub=array_region[xMin:xMax,yMin:yMax,zMin:zMax]
nlist=np.where(sub>0)
minList=[]
for x,y,z in zip(nlist[0],nlist[1],nlist[2]):
minList.append(sub[x,y,z])
new_region=min(minList)
Time results:
O: 0.0220445
A: 0.0002161
B: 0.0173195
C: 0.0002560
Cython version [*.pyx]:
import numpy as np
cimport numpy as np
DTYPE = np.int
ctypedef np.int_t DTYPE_t
cdef inline int int_max(int a, int b): return a if a >= b else b
cdef inline int int_min(int a, int b): return a if a <= b else b
def find_regions_3D(np.ndarray Array not None):
cdef int x_dim=np.size(Array,0)
cdef int y_dim=np.size(Array,1)
cdef int z_dim=np.size(Array,2)
regions = {}
cdef np.ndarray array_region = np.zeros((x_dim,y_dim,z_dim),dtype=DTYPE)
equivalences = {}
cdef int n_regions = 0
#first pass. find regions.
ind=np.where(Array==1)
cdef int xMin, xMax, yMin, yMax, zMin, zMax, max_region, new_region, x, y, z
for x,y,z in zip(ind[0],ind[1],ind[2]):
# get the region number from all surrounding cells including diagnols (27) or create new region
xMin=int_max(x-1,0)
xMax=int_min(x+1,x_dim-1)+1
yMin=int_max(y-1,0)
yMax=int_min(y+1,y_dim-1)+1
zMin=int_max(z-1,0)
zMax=int_min(z+1,z_dim-1)+1
max_region=array_region[xMin:xMax,yMin:yMax,zMin:zMax].max()
if max_region > 0:
#a neighbour already has a region, new region is the smallest > 0
new_region = min(filter(lambda i: i > 0, array_region[xMin:xMax,yMin:yMax,zMin:zMax].ravel()))
#update equivalences
if max_region > new_region:
if max_region in equivalences:
equivalences[max_region].add(new_region)
else:
equivalences[max_region] = set((new_region, ))
else:
n_regions += 1
new_region = n_regions
array_region[x,y,z] = new_region
#Scan Array again, assigning all equivalent regions the same region value.
cdef int r
for x,y,z in zip(ind[0],ind[1],ind[2]):
r = array_region[x,y,z]
while r in equivalences:
r= min(equivalences[r])
array_region[x,y,z]=r
#return list(regions.itervalues())
return array_region
Cython speedups:
Using:
cdef np.ndarray region = np.zeros((3,3,3),dtype=DTYPE)
...
region=array_region[xMin:xMax,yMin:yMax,zMin:zMax]
new_region=np.min(region[region>0])
Time: 0.170, original: 0.339 s
Results
After considering the many useful comments and answers provided, my current algorithms are running at:
Cython: 0.0219
Python: 0.4309
Cython is providing a 20x increase in speed over the pure python.
Current Cython Code:
import numpy as np
import cython
cimport numpy as np
cimport cython
from libcpp.map cimport map
DTYPE = np.int
ctypedef np.int_t DTYPE_t
cdef inline int int_max(int a, int b): return a if a >= b else b
cdef inline int int_min(int a, int b): return a if a <= b else b
#cython.boundscheck(False)
def find_regions_3D(np.ndarray[DTYPE_t,ndim=3] Array not None):
cdef unsigned int x_dim=np.size(Array,0),y_dim=np.size(Array,1),z_dim=np.size(Array,2)
regions = {}
cdef np.ndarray[DTYPE_t,ndim=3] array_region = np.zeros((x_dim,y_dim,z_dim),dtype=DTYPE)
cdef np.ndarray region = np.zeros((3,3,3),dtype=DTYPE)
cdef map[int,int] equivalences
cdef unsigned int n_regions = 0
#first pass. find regions.
ind=np.where(Array==1)
cdef np.ndarray[DTYPE_t,ndim=1] ind_x = ind[0], ind_y = ind[1], ind_z = ind[2]
cells=range(len(ind_x))
cdef unsigned int xMin, xMax, yMin, yMax, zMin, zMax, max_region, new_region, x, y, z, i, xi, yi, zi, val
for i in cells:
x=ind_x[i]
y=ind_y[i]
z=ind_z[i]
# get the region number from all surrounding cells including diagnols (27) or create new region
xMin=int_max(x-1,0)
xMax=int_min(x+1,x_dim-1)+1
yMin=int_max(y-1,0)
yMax=int_min(y+1,y_dim-1)+1
zMin=int_max(z-1,0)
zMax=int_min(z+1,z_dim-1)+1
max_region = 0
new_region = 2000000000 # huge number
for xi in range(xMin, xMax):
for yi in range(yMin, yMax):
for zi in range(zMin, zMax):
val = array_region[xi,yi,zi]
if val > max_region: # val is the new maximum
max_region = val
if 0 < val < new_region: # val is the new minimum
new_region = val
if max_region > 0:
if max_region > new_region:
if equivalences.count(max_region) == 0 or new_region < equivalences[max_region]:
equivalences[max_region] = new_region
else:
n_regions += 1
new_region = n_regions
array_region[x,y,z] = new_region
#Scan Array again, assigning all equivalent regions the same region value.
cdef int r
for i in cells:
x=ind_x[i]
y=ind_y[i]
z=ind_z[i]
r = array_region[x,y,z]
while equivalences.count(r) > 0:
r= equivalences[r]
array_region[x,y,z]=r
return array_region
Setup file [setup.py]
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
import numpy
setup(
cmdclass = {'build_ext': build_ext},
ext_modules = [Extension("ConnectComp", ["ConnectedComponents.pyx"],
include_dirs =[numpy.get_include()],
language="c++",
)]
)
Build command:
python setup.py build_ext --inplace

As #gotgenes points out, you should definitely be using cython -a <file>, and trying to reduce the amount of yellow you see. Yellow corresponds to worse and worse generated C.
Things I found that reduced the amount of yellow:
This looks like a situation where there will never be any out of bounds array access, as long as the input Array has 3 dimensions, so one can turn off bounds checking:
cimport cython
#cython.boundscheck(False)
def find_regions_3d(...):
Give the compiler more information for efficient indexing, i.e. whenever you cdef an ndarray give as much information as you can:
def find_regions_3D(np.ndarray[DTYPE_t,ndim=3] Array not None):
[...]
cdef np.ndarray[DTYPE_t,ndim=3] array_region = ...
[etc.]
Give the compiler more information about positive/negative-ness. I.e. if you know a certain variable is always going to be positive, cdef it as unsigned int rather than int, as this means that Cython can eliminate any negative-indexing checks.
Unpack the ind tuple immediately, i.e.
ind = np.where(Array==1)
cdef np.ndarray[DTYPE_t,ndim=1] ind_x = ind[0], ind_y = ind[1], ind_z = ind[2]
Avoid using the for x,y,z in zip(..[0],..[1],..[2]) construct. In both cases, replace it with
cdef int i
for i in range(len(ind_x)):
x = ind_x[i]
y = ind_y[i]
z = ind_z[i]
Avoid doing the fancy indexing/slicing. And especially avoid doing it twice! And avoid using filter! I.e. replace
max_region=array_region[xMin:xMax,yMin:yMax,zMin:zMax].max()
if max_region > 0:
new_region = min(filter(lambda i: i > 0, array_region[xMin:xMax,yMin:yMax,zMin:zMax].ravel()))
if max_region > new_region:
if max_region in equivalences:
equivalences[max_region].add(new_region)
else:
equivalences[max_region] = set((new_region, ))
with the more verbose
max_region = 0
new_region = 2000000000 # "infinity"
for xi in range(xMin, xMax):
for yi in range(yMin, yMax):
for zi in range(zMin, zMax):
val = array_region[xi,yi,zi]
if val > max_region: # val is the new maximum
max_region = val
if 0 < val < new_region: # val is the new minimum
new_region = val
if max_region > 0:
if max_region > new_region:
if max_region in equivalences:
equivalences[max_region].add(new_region)
else:
equivalences[max_region] = set((new_region, ))
else:
n_regions += 1
new_region = n_regions
This doesn't look so nice, but the triple loop compiles down to about 10 or so lines of C, while the compiled version of the original is hundreds of lines long and has a lot of Python object manipulation.
(Obviously you must cdef all the variables you use, especially xi, yi, zi and val in this code.)
You don't need to store all the equivalences, since the only thing you do with the set is find the minimum element. So if you instead have equivalences mapping int to int, you can replace
if max_region in equivalences:
equivalences[max_region].add(new_region)
else:
equivalences[max_region] = set((new_region, ))
[...]
while r in equivalences:
r = min(equivalences[r])
with
if max_region not in equivalences or new_region < equivalences[max_region]:
equivalences[max_region] = new_region
[...]
while r in equivalences:
r = equivalences[r]
The last thing to do after all that would be to not use any Python objects at all, specifically, don't use a dictionary for equivalences. This is now easy, since it is mapping int to int, so one could use from libcpp.map cimport map and then cdef map[int,int] equivalences, and replace .. not in equivalences with equivalences.count(..) == 0 and .. in equivalences with equivalences.count(..) > 0. (Note that it will then require a C++ compiler.)

(copied from the above comment for others ease of reading)
I believe scipy's ndimage.label does what you want (I did not test it against your code but it should be quite efficient). Note that you have to import it explicitely:
from scipy import ndimage
ndimage.label(your_data, connectivity_struct)
then later you can apply other built-in functions (like finding the bounding rectangle, centre-of-mass, etc)

When optimizing for cython you want to make sure that in your loops mostly native C data types are used, not Python objects that come with a higher overhead. The best way to find such places is to look at the generated C code and look for lines that were translated into lots of Py* function calls. These places could usually be optimized by using cdef variables instead of python objects.
In your code I would for example suspect that the loop with zip produces lots of python objects and it would be much faster to iterate with an int index that is then used to get the elements in ind[0],.... But look at the generated C code and see what seems to call unnecessarily many python functions.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Parallelising an Exhaustive Search in Cython - python

Related

Nested loops with cython for image processing

Cython call optimization

Defining NumPy arrays in Cython without incurring python overhead

numpy template matching using matrix multiplications

Using Cython to speed up connected components algorithm

Categories

Resources