numpy template matching using matrix multiplications


I am trying to match a template with a binary image (only black and white) by shifting the template along the image, and to return the minimum distance between the template and the image together with the position at which this minimum distance occurs. For example:
img:
0 1 0
0 0 1
0 1 1
template:
0 1
1 1
This template matches the image best at position (1,1), where the distance is 0. So far things are not too difficult, and I already have some code that does the trick:
import numpy as np
def match_template(img, template):
    mindist = float('inf')
    idx = (-1, -1)
    # slide the template over every valid (x, y) offset
    for y in range(img.shape[1] - template.shape[1] + 1):
        for x in range(img.shape[0] - template.shape[0] + 1):
            # calculate Euclidean distance
            dist = np.sqrt(np.sum(np.square(template - img[x:x+template.shape[0], y:y+template.shape[1]])))
            if dist < mindist:
                mindist = dist
                idx = (x, y)
    return [mindist, idx]
But for images of the size I need (an image of around 500 x 200 pixels and a template of around 250 x 100) this already takes approximately 4.5 seconds, which is way too slow. I know the same thing can be done much faster using matrix multiplications (in MATLAB I believe this can be done using im2col and repmat). Can anyone explain to me how to do it in Python/NumPy?
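For reference, here is a minimal NumPy sketch of the matrix-multiplication (im2col-style) idea. It assumes NumPy >= 1.20 for sliding_window_view, and the name match_template_vectorized is made up for this example. It expands the squared Euclidean distance as ||T - W||^2 = sum(T^2) - 2 sum(T*W) + sum(W^2), so the only per-window work is a dot product between each window W and the template T.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20


def match_template_vectorized(img, template):
    """Return [mindist, (x, y)] like match_template, without Python loops.

    Uses ||T - W||^2 = sum(T**2) - 2*sum(T*W) + sum(W**2) for every window W.
    """
    img = np.asarray(img, dtype=np.float64)
    template = np.asarray(template, dtype=np.float64)
    th, tw = template.shape
    # Strided view of all th x tw windows, shape (H-th+1, W-tw+1, th, tw); no copy.
    windows = sliding_window_view(img, (th, tw))
    # Cross term: one dot product per window, i.e. "im2col matrix times the
    # flattened template", evaluated by einsum directly on the strided view.
    cross = np.einsum('ijkl,kl->ij', windows, template)
    # Per-window sum of squares of the image.
    win_sq = np.einsum('ijkl,ijkl->ij', windows, windows)
    dist_sq = np.sum(template ** 2) - 2.0 * cross + win_sq
    dist = np.sqrt(np.maximum(dist_sq, 0.0))  # clip tiny negative rounding errors
    x, y = np.unravel_index(np.argmin(dist), dist.shape)
    return [float(dist[x, y]), (int(x), int(y))]
```

This still does the full (number of windows) x (template size) amount of arithmetic, just in C instead of a Python loop. Building an explicit im2col matrix with reshape would copy the strided view, which for a 500 x 200 image and a 250 x 100 template is hundreds of millions of elements, so einsum is applied to the view instead.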
By the way, I know there is an OpenCV matchTemplate function that does exactly what I need, but since I might need to alter the code slightly later on, I would prefer a solution that I fully understand and can modify.
Thanks!
edit: If anyone can explain to me how OpenCV does this in less than 0.2 seconds, that would also be great. I have had a short look at the source code, but those things always look quite complicated to me.
edit2: Cython code
import numpy as np
cimport numpy as np
DTYPE = np.int_  # np.int was removed in NumPy 1.24
ctypedef np.int_t DTYPE_t
def match_template(np.ndarray img, np.ndarray template):
    cdef float mindist = float('inf')
    cdef int x_coord = -1
    cdef int y_coord = -1
    cdef float dist
    cdef unsigned int x, y
    cdef int img_width = img.shape[0]
    cdef int img_height = img.shape[1]
    cdef int template_width = template.shape[0]
    cdef int template_height = template.shape[1]
    cdef int range_x = img_width-template_width+1
    cdef int range_y = img_height-template_height+1
    for y in range(range_y):
        for x in range(range_x):
            dist = np.sqrt(np.sum(np.square(template - img[ x:<unsigned int>(x+template_width), y:<unsigned int>(y+template_height) ]))) #calculate euclidean distance
            if dist < mindist:
                mindist = dist
                x_coord = x
                y_coord = y
    return [mindist, (x_coord,y_coord)]
# calling code (img and template are the arrays to compare)
img = np.asarray(img, dtype=DTYPE)
template = np.asarray(template, dtype=DTYPE)
match_template(img, template)

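One possible way of doing what you want is via convolution (which can be done by brute force or with an FFT). Matrix multiplications, AFAIK, won't work. You need to convolve your data with the template and find the maximum (you'll also need to do some scaling to make it work properly). Note that convolution flips the kernel, so for template matching you actually want correlation: convolve with the template rotated by 180 degrees (ys[::-1, ::-1]) or use scipy.ndimage.correlate.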
import numpy as np
import scipy.ndimage
import scipy.signal
xs = np.array([[0,1,0],[0,0,1],[0,1,1]])*1.
ys = np.array([[0,1],[1,1]])*1.
print(scipy.ndimage.convolve(xs, ys, mode='constant', cval=np.inf))
# array([[  1.,   1.,  inf],
#        [  0.,   2.,  inf],
#        [ inf,  inf,  inf]])
print(scipy.signal.fftconvolve(xs, ys, mode='valid'))
# array([[ 1.,  1.],
#        [ 0.,  2.]])
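The "scaling" mentioned above can be made concrete. The sketch below (function names are illustrative, not from any library) gets the exact squared distance from an FFT cross-correlation plus a per-window sum of squares computed with an integral image (summed-area table). As far as I know, OpenCV's matchTemplate uses a similar combination (DFT-based correlation for larger templates, integral images for the window sums), which may explain much of its speed, but I have not checked its exact code path. Correlation is computed here as convolution with the template flipped in both axes.

```python
import numpy as np


def window_sums(a, th, tw):
    """Sum of `a` over every th x tw window ('valid' positions) via an integral image."""
    s = np.zeros((a.shape[0] + 1, a.shape[1] + 1))
    s[1:, 1:] = a.cumsum(axis=0).cumsum(axis=1)
    return s[th:, tw:] - s[:-th, tw:] - s[th:, :-tw] + s[:-th, :-tw]


def cross_correlate_valid(img, template):
    """'Valid' cross-correlation via FFT: convolve with the template flipped in both axes."""
    H, W = img.shape
    th, tw = template.shape
    shape = (H + th - 1, W + tw - 1)  # full linear-convolution size, so no circular wrap-around
    spec = np.fft.rfft2(img, shape) * np.fft.rfft2(template[::-1, ::-1], shape)
    full = np.fft.irfft2(spec, shape)
    return full[th - 1:H, tw - 1:W]


def match_template_fft(img, template):
    """Return [mindist, (x, y)] using ||T - W||^2 = sum(T^2) - 2*corr + window_sum(img^2)."""
    img = np.asarray(img, dtype=np.float64)
    template = np.asarray(template, dtype=np.float64)
    th, tw = template.shape
    dist_sq = (np.sum(template ** 2)
               - 2.0 * cross_correlate_valid(img, template)
               + window_sums(img ** 2, th, tw))
    dist = np.sqrt(np.maximum(dist_sq, 0.0))  # FFT rounding can make values slightly negative
    x, y = np.unravel_index(np.argmin(dist), dist.shape)
    return [float(dist[x, y]), (int(x), int(y))]
```

Because of FFT rounding, a perfect match comes out as a tiny positive number rather than exactly 0, so compare distances with a tolerance.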

There may be a fancy way to get this done using pure numpy/scipy magic. But it might be easier (and more understandable when you look at the code in the future) to just drop into Cython to get this done. There's a good tutorial for integrating Cython with numpy at http://docs.cython.org/src/tutorial/numpy.html.
EDIT:
I did a quick test with your Cython code and it ran ~15 sec for a 500x400 img with a 100x200 template. After some tweaks (eliminating the numpy method calls and numpy bounds checking), I got it down under 3 seconds. That may not be enough for you, but it shows the possibility.
import numpy as np
cimport numpy as np
cimport cython
from libc.math cimport sqrt
DTYPE = np.int_  # np.int was removed in NumPy 1.24
ctypedef np.int_t DTYPE_t
@cython.boundscheck(False)
def match_template(np.ndarray[DTYPE_t, ndim=2] img, np.ndarray[DTYPE_t, ndim=2] template):
    cdef float mindist = float('inf')
    cdef int x_coord = -1
    cdef int y_coord = -1
    cdef float dist
    cdef unsigned int x, y
    cdef int img_width = img.shape[0]
    cdef int img_height = img.shape[1]
    cdef int template_width = template.shape[0]
    cdef int template_height = template.shape[1]
    cdef int range_x = img_width-template_width+1
    cdef int range_y = img_height-template_height+1
    cdef DTYPE_t total
    cdef int delta
    cdef unsigned int j, k, j_plus, k_plus
    for y in range(range_y):
        for x in range(range_x):
            #dist = np.sqrt(np.sum(np.square(template - img[ x:<unsigned int>(x+template_width), y:<unsigned int>(y+template_height) ]))) #calculate euclidean distance
            # Do the same operations, but in plain C
            total = 0
            for j in range(template_width):
                j_plus = <unsigned int>x + j
                for k in range(template_height):
                    k_plus = <unsigned int>y + k
                    delta = template[j, k] - img[j_plus, k_plus]
                    total += delta*delta
            dist = sqrt(total)
            if dist < mindist:
                mindist = dist
                x_coord = x
                y_coord = y
    return [mindist, (x_coord,y_coord)]

Related

Cython optimization slow

I am trying to optimize the following Python code with Cython:
import numpy
cimport numpy
from cython cimport boundscheck, wraparound
@boundscheck(False)
@wraparound(False)
def cython_color2gray(numpy.ndarray[numpy.uint8_t, ndim=3] image):
    cdef int x,y,z
    cdef double z_val, grey
    for x in range(len(image)):
        for y in range(len(image[x])):
            grey = 0
            for z in range(len(image[x][y])):
                if z == 0:
                    z_val = image[x][y][0] * 0.21
                    grey += z_val
                elif z == 1:
                    z_val = image[x][y][1] * 0.07
                    grey += z_val
                elif z == 2:
                    z_val = image[x][y][2] * 0.72
                    grey += z_val
            image[x][y][0] = grey
            image[x][y][1] = grey
            image[x][y][2] = grey
    return image
However, when checking whether everything is as optimized as it should be, the cython -a annotation still shows the yellow lines below. Is there anything else I can do to optimize this Cython code and make it run faster?
[Image: annotated Cython output file]

Here are some key points:
len() is a Python function. Since image is an np.ndarray, use the .shape attribute to get the number of elements in each dimension.
Use image[i, j, k] instead of image[i][j][k] for element access.
Use memoryviews, i.e. (assuming image is C-contiguous) unsigned char[:, :, ::1] image instead of numpy.ndarray[numpy.uint8_t, ndim=3] image. The syntax is cleaner and they are faster.
The variable grey is a double, while the image's elements are np.uint8 (equivalent to unsigned char). So when doing image[i,j,k]=grey in Python, grey gets cast to an unsigned char, i.e. the fractional part is cut off. In Cython you have to do the cast manually.
After you know your code works as expected, you can further accelerate it with directives for the Cython compiler, e.g. deactivating bounds checks and negative indexing (wraparound). Note that these are decorators that need to be imported.
And your code snippet becomes:
from cython cimport boundscheck, wraparound
@boundscheck(False)
@wraparound(False)
def cython_color2gray(unsigned char[:, :, ::1] image):
    cdef int x,y,z
    cdef double z_val, grey
    for x in range(image.shape[0]):
        for y in range(image.shape[1]):
            grey = 0
            for z in range(image.shape[2]):
                if z == 0:
                    z_val = image[x, y, 0] * 0.21
                    grey += z_val
                elif z == 1:
                    z_val = image[x, y, 1] * 0.07
                    grey += z_val
                elif z == 2:
                    z_val = image[x, y, 2] * 0.72
                    grey += z_val
            image[x, y, :] = <unsigned char> grey
    return image
Looking closely, you'll see that there's no need for the innermost loop:
from cython cimport boundscheck, wraparound
@boundscheck(False)
@wraparound(False)
def cython_color2gray(unsigned char[:, :, ::1] image):
    cdef int x,y,z
    cdef double z_val
    for x in range(image.shape[0]):
        for y in range(image.shape[1]):
            image[x, y, :] = <unsigned char>(image[x,y,0]*0.21 + image[x,y,1]*0.07 + image[x,y,2] * 0.72)
    return image
Going one step further, you can try to accelerate Cython's generated C code by enabling your C compiler's auto-vectorization (in the sense of SIMD). For gcc/clang you can use the flags -O3 and -march=native. For MSVC it's /O2 and /arch:AVX2 (assuming your machine supports AVX2). If you're working inside a Jupyter notebook, you can pass C compiler flags via the -c=YOURFLAG argument of the cython magic, e.g.
%%cython -a -f -c=-O3 -c=-march=native
# your cython code here..
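When checking that a Cython version is correct, it can help to compare it against a plain NumPy reference. The sketch below (the name color2gray_reference is hypothetical) assumes a uint8 array of shape (H, W, 3), uses the same weights as the code above, and truncates with astype(np.uint8), which mirrors the <unsigned char> cast.

```python
import numpy as np


def color2gray_reference(image):
    """NumPy reference for cython_color2gray (sketch). Modifies `image` in place."""
    # uint8 * float promotes to float64, like the double `grey` in the Cython code
    grey = image[..., 0] * 0.21 + image[..., 1] * 0.07 + image[..., 2] * 0.72
    # astype(np.uint8) drops the fractional part, matching <unsigned char>grey;
    # the trailing new axis broadcasts the grey value into all three channels
    image[...] = grey.astype(np.uint8)[..., np.newaxis]
    return image
```

Running both versions on the same random image and comparing with np.array_equal is a quick regression check after each optimization step.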

Nested loops with cython for image processing

I'm trying to iterate over a 2D image containing floating-point depth data. It has a fairly normal resolution (640 x 480), but Python has been too slow, so I've been trying to optimize the problem using Cython.
I've tried moving the loops into other functions and shifting the nogil statement around, but that didn't seem to work. After reworking the problem, I was able to get part of it working, but this last part still escapes me.
I've attempted to get rid of Python objects inside the prange() loop by creating them beforehand, while the GIL is still held, hence:
cdef int[:] w_list = array.array(range(0, w_inc, interpolation))
instead of
for r in range(0, w_inc, interpolation):
but the error persists
My code works in two parts:
1. The split_data() method subdivides the image into num quadrants that are stored in a 3D array bits. These are used to make splitting the work across multiple threads/processes easier. This part works okay.
@cython.cdivision(True)
@cython.boundscheck(False)
cpdef split_data(double[:, :] frame, int h, int w, int num):
    cdef double[:, :, :] bits = np.zeros(shape=(num, h // num, w // num), dtype=float)
    cdef int c_count = os.cpu_count()
    cdef int i, j, k
    for i in prange(num, nogil=True, num_threads=c_count):
        for j in prange(h // num):
            for k in prange(w // num):
                bits[i, j, k] = frame[i * (h // num) + j, i * (w // num) + k]
    return bits
2. The scatter_data() method takes the bits array from the previous function and creates another 3D array, points, with length num (where num is the length of bits); points holds a series of 3D coordinates representing valid depth points. It then uses prange() to extract the valid depth data from each of these bits and stores it in points.
@cython.cdivision(True)
@cython.boundscheck(False)
cpdef scatter_data(double[:, :] depths, object validator=None,
                   int h=-1, int w=-1, int interpolation=1):
    # Handles if h or w is -1 (default)
    if h < 0 or w < 0:
        h = depths.shape[0] if h < 0 else h
        w = depths.shape[1] if w < 0 else w
    cdef int max_num = w * h
    cdef int c_count = os.cpu_count()
    cdef int h_inc = h // c_count, w_inc = w // c_count
    cdef double[:, :, :] points = np.zeros(shape=(c_count, max_num, 3), dtype=float)
    cdef double[:, :, :] bits = split_data(depths, h, w, c_count)
    cdef int count = 0
    cdef int i, r, c
    cdef int[:] w_list = array.array(range(0, w_inc, interpolation))
    cdef int[:] h_list = array.array(range(0, h_inc, interpolation))
    for i in prange(c_count, nogil=True, num_threads=c_count):
        count = 0
        for r in w_list:
            for c in h_list:
                if depths[c, r] != 0:
                    points[i, count, 0] = w - r
                    points[i, count, 1] = c
                    points[i, count, 2] = depths[c, r]
                    count = count + 1
    points = points[:count]
    return points
3. And for completeness, here are my import statements:
import cython
from cython.parallel import prange
from cpython cimport array
import array
cimport numpy as np
import numpy as np
import os
When compiling the code, I keep getting error messages along the lines of:
Error compiling Cython file:
------------------------------------------------------------
...
    cdef int[:] w_list = array.array(range(0, w_inc, interpolation))
    cdef int[:] h_list = array.array(range(0, h_inc, interpolation))
    for i in prange(c_count, nogil=True, num_threads=c_count):
        count = 0
        for r in w_list:
                 ^
------------------------------------------------------------
data_util/cy_scatter.pyx:70:17: Iterating over Python object not allowed without gil
and
Error compiling Cython file:
------------------------------------------------------------
...
    cdef int[:] w_list = array.array(range(0, w_inc, interpolation))
    cdef int[:] h_list = array.array(range(0, h_inc, interpolation))
    for i in prange(c_count, nogil=True, num_threads=c_count):
        count = 0
        for r in w_list:
                 ^
------------------------------------------------------------
data_util/cy_scatter.pyx:70:17: Coercion from Python not allowed without the GIL
and
Error compiling Cython file:
------------------------------------------------------------
...
    cdef int[:] w_list = array.array(range(0, w_inc, interpolation))
    cdef int[:] h_list = array.array(range(0, h_inc, interpolation))
    for i in prange(c_count, nogil=True, num_threads=c_count):
        count = 0
        for r in w_list:
                 ^
------------------------------------------------------------
data_util/cy_scatter.pyx:70:17: Converting to Python object not allowed without gil
Is there a way to do this? And if so, how do I do this?

You just want to iterate by index (with ri declared as a cdef int) rather than by iterating over a Python iterator:
for ri in range(w_list.shape[0]):
    r = w_list[ri]
This is a case where best practice in Python differs from best practice in Cython: Cython only accelerates loops over numeric ranges. The way you're trying to do it falls back to a Python iterator, which is both slower and requires the GIL.
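If the lists only exist to hold strided offsets, the loop index can also produce r directly, so no array.array is needed at all. The Python sketch below (strided_offsets is a hypothetical name) only demonstrates the arithmetic; in Cython, with ri, r and the bounds declared as cdef int, the same loop should compile to a plain C loop that can run inside prange without the GIL.

```python
def strided_offsets(n, step):
    """Offsets r = 0, step, 2*step, ... < n, computed from a plain integer index."""
    n_steps = (n + step - 1) // step  # ceil(n / step) for n >= 0 and step > 0
    offsets = []
    for ri in range(n_steps):
        r = ri * step  # replaces r = w_list[ri]
        offsets.append(r)
    return offsets
```

For example, strided_offsets(10, 3) visits the same values as range(0, 10, 3).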

Cython, Complex values, and BM3D algorithm

I am working on an image reconstruction algorithm and I found this repo online that would work great with my code, but unfortunately it doesn't seem to support complex-valued calculations. I've read up on Cython over the past couple of days, but I'm pressed for time and wanted to ask for advice before bulldozing through the code.
To be more exact, this is the Cython file:
from libcpp.vector cimport vector
from libcpp cimport bool
cimport numpy as np
import numpy as np
cdef extern from "../bm3d_src/mt19937ar.h":
    double mt_genrand_res53()
cdef extern from "../bm3d_src/bm3d.h":
    int run_bm3d( const float sigma, vector[float] &img_noisy,
                  vector[float] &img_basic,
                  vector[float] &img_denoised,
                  const unsigned width,
                  const unsigned height,
                  const unsigned chnls,
                  const bool useSD_h,
                  const bool useSD_w,
                  const unsigned tau_2D_hard,
                  const unsigned tau_2D_wien,
                  const unsigned color_space)
cdef extern from "../bm3d_src/utilities.h":
    int save_image(char * name, vector[float] & img,
                   const unsigned width,
                   const unsigned height,
                   const unsigned chnls)
def hello():
    return "Hello World"
def random():
    return mt_genrand_res53()
cpdef float[:, :, :] bm3d(float[:, :, :] input_array,
                          float sigma,
                          bool useSD_h = True,
                          bool useSD_w = True,
                          str tau_2D_hard = "DCT",
                          str tau_2D_wien = "DCT"
                          ):
    """
    sigma: value of assumed noise of the noisy image;
    input_array : input image, H x W x channum
    useSD_h (resp. useSD_w): if true, use weight based
        on the standard variation of the 3D group for the
        first (resp. second) step, otherwise use the number
        of non-zero coefficients after Hard Thresholding
        (resp. the norm of Wiener coefficients);
    tau_2D_hard (resp. tau_2D_wien): 2D transform to apply
        on every 3D group for the first (resp. second) part.
        Allowed values are 'DCT' and 'BIOR';
    # FIXME : add color space support; right now just RGB
    """
    cdef vector[float] input_image
    cdef vector[float] basic_image
    cdef vector[float] output_image
    cdef vector[float] denoised_image
    height = input_array.shape[0]
    width = input_array.shape[1]
    chnls = input_array.shape[2]
    # convert the input image
    input_image.resize(input_array.size)
    pos = 0
    for i in range(input_array.shape[0]):
        for j in range(input_array.shape[1]):
            for k in range(input_array.shape[2]):
                input_image[pos] = input_array[i, j, k]
                pos +=1
    if tau_2D_hard == "DCT":
        tau_2D_hard_i = 4
    elif tau_2D_hard == "BIOR" :
        tau_2D_hard_i = 5
    else:
        raise ValueError("Unknown tau_2d_hard, must be DCT or BIOR")
    if tau_2D_wien == "DCT":
        tau_2D_wien_i = 4
    elif tau_2D_wien == "BIOR" :
        tau_2D_wien_i = 5
    else:
        raise ValueError("Unknown tau_2d_wien, must be DCT or BIOR")
    # FIXME someday we'll have color support
    color_space = 0
    ret = run_bm3d(sigma, input_image, basic_image, output_image,
                   width, height, chnls,
                   useSD_h, useSD_w,
                   tau_2D_hard_i, tau_2D_wien_i,
                   color_space)
    if ret != 0:
        raise Exception("run_bmd3d returned an error, retval=%d" % ret)
    cdef np.ndarray output_array = np.zeros([height, width, chnls],
                                            dtype = np.float32)
    pos = 0
    for i in range(input_array.shape[0]):
        for j in range(input_array.shape[1]):
            for k in range(input_array.shape[2]):
                output_array[i, j, k] = output_image[pos]
                pos +=1
    return output_array
How would I go about making the minimal changes needed for it to work with a numpy array with dtype='complex'?
Cheers!
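The two triple loops in bm3d are a row-major (C-order) flatten of the H x W x C array, with the channel index varying fastest, and the matching reshape back; that is the layout any complex-valued adaptation would have to preserve. The NumPy sketch below (all helper names are hypothetical) shows the equivalent operations, plus one way to split a complex array into real and imaginary float32 arrays. Whether denoising those two parts separately is acceptable for BM3D is an untested assumption, not something the repository supports.

```python
import numpy as np


def flatten_like_bm3d(arr):
    """Same order as the input copy loop: i (row), then j (column), then k (channel)."""
    return np.ascontiguousarray(arr, dtype=np.float32).ravel()


def unflatten_like_bm3d(vec, height, width, chnls):
    """Inverse of the loop that fills output_array from output_image."""
    return np.asarray(vec, dtype=np.float32).reshape(height, width, chnls)


def split_complex(arr):
    """Hypothetical helper: real and imaginary parts as two float32 H x W x C arrays.

    ASSUMPTION: treating them as two independent images is only a workaround idea.
    """
    arr = np.asarray(arr)
    return (np.ascontiguousarray(arr.real, dtype=np.float32),
            np.ascontiguousarray(arr.imag, dtype=np.float32))
```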

Cython call optimization

I've got a Python function I am trying to port to Cython. I have tested two implementations, but I don't understand why the second one is slower than the first one. Furthermore, I am looking for ways to improve the speed a little more, but I have no clue how.
Base code
import numpy as np
cimport numpy as np
cimport cython
DTYPE = np.int_  # np.int was removed in NumPy 1.24
ctypedef np.int_t DTYPE_t
cdef inline int int_max(int a, int b): return a if a >= b else b
cdef inline int int_min(int a, int b): return a if a <= b else b
cdef extern from "math.h":
    double exp(double x)
@cython.boundscheck(False)
@cython.wraparound(False)
def bilateral_filter_C(np.ndarray[np.float_t, ndim=1] samples, int w=20):
    # Filter Parameters
    cdef Py_ssize_t size = samples.shape[0]
    cdef float rang
    cdef float sigma = 2*3.0*3.0
    cdef int j, L
    cdef unsigned int a, b
    cdef np.float_t W, num, sub_sample, intensity
    # Initialization
    cdef np.ndarray[np.float_t, ndim=1] gauss = np.zeros(2*w+1, dtype=np.float64)
    cdef np.ndarray[np.float_t, ndim=1] sub_samples, intensities = np.empty(size, dtype=np.float64)
    cdef np.ndarray[np.float_t, ndim=1] samples_filtered = np.empty(size, dtype=np.float64)
    L = 2*w+1
    for j in range(L):
        rang = -w+1.0/L
        rang *= rang
        gauss[j] = exp(-rang/sigma)
    <CODE TO IMPROVE>
    return samples_filtered
I tried inserting each of these two code samples into the <CODE TO IMPROVE> section:
Most efficient approach
for i in range(size):
    a = <unsigned int>int_max(i-w, 0)
    b = <unsigned int>int_min(i+w, size-1)
    L = b-a
    sub_samples = samples[a:b]-samples[i]
    sub_samples *= sub_samples
    for j in range(L):
        sub_samples[j] = exp(-sub_samples[j]/sigma)
    intensities = gauss[w-i+a:w-i+b]*sub_samples
    num = 0.0
    W = 0.0
    for j in range(L):
        W += intensities[j]
        num += intensities[j]*samples[a+j]
    samples_filtered[i] = num/W
Result
%timeit -n1 -r10 bilateral_filter_C(x, 20)
1 loop, best of 10: 45 ms per loop
Less efficient
for i in range(size):
    a = <unsigned int>int_max(i-w, 0)
    b = <unsigned int>int_min(i+w, size-1)
    num = 0.0
    W = 0.0
    for j in range(b-a):
        sub_sample = samples[a+j]-samples[i]
        intensity1 = gauss[w-i+a+j]*exp(-sub_sample*sub_sample/sigma)
        W += intensity
        num += intensity*samples[a+j]
    samples_filtered[i] = num/W
Result
%timeit -n1 -r10 bilateral_filter_C(x, 20)
1 loop, best of 10: 125 ms per loop

You have a few typos:
1) You forgot to define i, just add cdef int i, j, L
2) In the second algorithm you wrote intensity1 = gauss[w-i+a+j]*exp(-sub_sample*sub_sample/sigma), it should be intensity, without the 1
3) I would add @cython.cdivision(True) to avoid the division-by-zero check
With those changes and with x = np.random.rand(10000), I got the following results
%timeit bilateral_filter_C1(x, 20) # First code
10 loops, best of 3: 74.1 ms per loop
%timeit bilateral_filter_C2(x, 20) # Second code
100 loops, best of 3: 9.5 ms per loop
And, to check the results
np.all(np.equal(bilateral_filter_C1(x, 20), bilateral_filter_C2(x, 20)))
True
To avoid these problems, I suggest running cython -a my_file.pyx; it generates an HTML file that highlights possible problems in your code
EDIT
Reading the code again, it seems to have more errors:
for j in range(L):
    rang = -w+1.0/L
    rang *= rang
    gauss[j] = exp(-rang/sigma)
gauss always has the same value; what is rang supposed to be?
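For checking the Cython versions, a slow but simple NumPy reference may help. It is a sketch that assumes rang was meant to be the offset j - w of each kernel tap from the window centre (the loop above computes -w + 1.0/L, which is the same for every j). The window bounds a and b are copied from the question, so the window covers positions a <= p < b and excludes b itself; the function name is hypothetical.

```python
import numpy as np


def bilateral_filter_reference(samples, w=20, sigma=2 * 3.0 * 3.0):
    """Slow NumPy reference for the 1-D bilateral filter in the question (sketch).

    ASSUMPTION: gauss[j] uses the offset (j - w) from the window centre.
    """
    samples = np.asarray(samples, dtype=np.float64)
    size = samples.shape[0]
    offsets = np.arange(-w, w + 1, dtype=np.float64)
    gauss = np.exp(-offsets ** 2 / sigma)      # spatial kernel, length 2*w + 1
    out = np.empty(size)
    for i in range(size):
        a = max(i - w, 0)
        b = min(i + w, size - 1)               # same bounds as the question (b excluded)
        seg = samples[a:b]
        # spatial weight times range (intensity-difference) weight
        weights = gauss[w - i + a:w - i + b] * np.exp(-(seg - samples[i]) ** 2 / sigma)
        out[i] = np.sum(weights * seg) / np.sum(weights)
    return out
```

Since every output is a weighted average of input samples, a constant signal must come back unchanged and all outputs must lie between the input's minimum and maximum, which gives two cheap sanity checks.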

Using Cython to speed up connected components algorithm

First off, I am using Python 2.7.2, NumPy 1.6.2rc1, Cython 0.16 and the gcc (MinGW) compiler on a Windows XP machine.
I needed a 3D connected components algorithm to process some 3D binary data (i.e. 1s and 0s) stored in numpy arrays. Unfortunately, I could not find any existing code so I adapted the code found here to work with 3D arrays. Everything works great, however speed is desirable for processing huge data sets. As a result I stumbled upon cython and decided to give it a try.
So far cython has improved the speed:
Cython: 0.339 s
Python: 0.635 s
Using cProfile, the most time-consuming line in my pure Python version is:
new_region = min(filter(lambda i: i > 0, array_region[xMin:xMax,yMin:yMax,zMin:zMax].ravel()))
The Question: What is the correct way to "cythonize" the lines:
new_region = min(filter(lambda i: i > 0, array_region[xMin:xMax,yMin:yMax,zMin:zMax].ravel()))
for x,y,z in zip(ind[0],ind[1],ind[2]):
Any help would be appreciated and hopefully this work will help others.
Pure python version [*.py]:
import numpy as np
def find_regions_3D(Array):
    x_dim=np.size(Array,0)
    y_dim=np.size(Array,1)
    z_dim=np.size(Array,2)
    regions = {}
    array_region = np.zeros((x_dim,y_dim,z_dim),)
    equivalences = {}
    n_regions = 0
    #first pass. find regions.
    ind=np.where(Array==1)
    for x,y,z in zip(ind[0],ind[1],ind[2]):
        # get the region number from all surrounding cells including diagonals (27) or create new region
        xMin=max(x-1,0)
        xMax=min(x+1,x_dim-1)
        yMin=max(y-1,0)
        yMax=min(y+1,y_dim-1)
        zMin=max(z-1,0)
        zMax=min(z+1,z_dim-1)
        max_region=array_region[xMin:xMax+1,yMin:yMax+1,zMin:zMax+1].max()
        if max_region > 0:
            #a neighbour already has a region, new region is the smallest > 0
            new_region = min(filter(lambda i: i > 0, array_region[xMin:xMax+1,yMin:yMax+1,zMin:zMax+1].ravel()))
            #update equivalences
            if max_region > new_region:
                if max_region in equivalences:
                    equivalences[max_region].add(new_region)
                else:
                    equivalences[max_region] = set((new_region, ))
        else:
            n_regions += 1
            new_region = n_regions
        array_region[x,y,z] = new_region
    #Scan Array again, assigning all equivalent regions the same region value.
    for x,y,z in zip(ind[0],ind[1],ind[2]):
        r = array_region[x,y,z]
        while r in equivalences:
            r= min(equivalences[r])
        array_region[x,y,z]=r
    #return list(regions.itervalues())
    return array_region
Pure python speedups:
#Original line:
new_region = min(filter(lambda i: i > 0, array_region[xMin:xMax+1,yMin:yMax+1,zMin:zMax+1].ravel()))
#ver A:
sub = array_region[xMin:xMax+1,yMin:yMax+1,zMin:zMax+1]
new_region = min(sub[sub>0])
#ver B:
new_region = min( i for i in array_region[xMin:xMax,yMin:yMax,zMin:zMax].ravel() if i>0)
#ver C:
sub=array_region[xMin:xMax,yMin:yMax,zMin:zMax]
nlist=np.where(sub>0)
minList=[]
for x,y,z in zip(nlist[0],nlist[1],nlist[2]):
    minList.append(sub[x,y,z])
new_region=min(minList)
Time results:
O: 0.0220445
A: 0.0002161
B: 0.0173195
C: 0.0002560
Cython version [*.pyx]:
import numpy as np
cimport numpy as np
DTYPE = np.int_  # np.int was removed in NumPy 1.24
ctypedef np.int_t DTYPE_t
cdef inline int int_max(int a, int b): return a if a >= b else b
cdef inline int int_min(int a, int b): return a if a <= b else b
def find_regions_3D(np.ndarray Array not None):
    cdef int x_dim=np.size(Array,0)
    cdef int y_dim=np.size(Array,1)
    cdef int z_dim=np.size(Array,2)
    regions = {}
    cdef np.ndarray array_region = np.zeros((x_dim,y_dim,z_dim),dtype=DTYPE)
    equivalences = {}
    cdef int n_regions = 0
    #first pass. find regions.
    ind=np.where(Array==1)
    cdef int xMin, xMax, yMin, yMax, zMin, zMax, max_region, new_region, x, y, z
    for x,y,z in zip(ind[0],ind[1],ind[2]):
        # get the region number from all surrounding cells including diagonals (27) or create new region
        xMin=int_max(x-1,0)
        xMax=int_min(x+1,x_dim-1)+1
        yMin=int_max(y-1,0)
        yMax=int_min(y+1,y_dim-1)+1
        zMin=int_max(z-1,0)
        zMax=int_min(z+1,z_dim-1)+1
        max_region=array_region[xMin:xMax,yMin:yMax,zMin:zMax].max()
        if max_region > 0:
            #a neighbour already has a region, new region is the smallest > 0
            new_region = min(filter(lambda i: i > 0, array_region[xMin:xMax,yMin:yMax,zMin:zMax].ravel()))
            #update equivalences
            if max_region > new_region:
                if max_region in equivalences:
                    equivalences[max_region].add(new_region)
                else:
                    equivalences[max_region] = set((new_region, ))
        else:
            n_regions += 1
            new_region = n_regions
        array_region[x,y,z] = new_region
    #Scan Array again, assigning all equivalent regions the same region value.
    cdef int r
    for x,y,z in zip(ind[0],ind[1],ind[2]):
        r = array_region[x,y,z]
        while r in equivalences:
            r= min(equivalences[r])
        array_region[x,y,z]=r
    #return list(regions.itervalues())
    return array_region
Cython speedups:
Using:
cdef np.ndarray region = np.zeros((3,3,3),dtype=DTYPE)
...
region=array_region[xMin:xMax,yMin:yMax,zMin:zMax]
new_region=np.min(region[region>0])
Time: 0.170 s, original: 0.339 s
Results
After considering the many useful comments and answers provided, my current algorithms are running at:
Cython: 0.0219 s
Python: 0.4309 s
Cython is providing a 20x speed-up over the pure Python version.
Current Cython Code:
import numpy as np
import cython
cimport numpy as np
cimport cython
from libcpp.map cimport map
DTYPE = np.int_  # np.int was removed in NumPy 1.24
ctypedef np.int_t DTYPE_t
cdef inline int int_max(int a, int b): return a if a >= b else b
cdef inline int int_min(int a, int b): return a if a <= b else b
@cython.boundscheck(False)
def find_regions_3D(np.ndarray[DTYPE_t,ndim=3] Array not None):
    cdef unsigned int x_dim=np.size(Array,0),y_dim=np.size(Array,1),z_dim=np.size(Array,2)
    regions = {}
    cdef np.ndarray[DTYPE_t,ndim=3] array_region = np.zeros((x_dim,y_dim,z_dim),dtype=DTYPE)
    cdef np.ndarray region = np.zeros((3,3,3),dtype=DTYPE)
    cdef map[int,int] equivalences
    cdef unsigned int n_regions = 0
    #first pass. find regions.
    ind=np.where(Array==1)
    cdef np.ndarray[DTYPE_t,ndim=1] ind_x = ind[0], ind_y = ind[1], ind_z = ind[2]
    cells=range(len(ind_x))
    cdef unsigned int xMin, xMax, yMin, yMax, zMin, zMax, max_region, new_region, x, y, z, i, xi, yi, zi, val
    for i in cells:
        x=ind_x[i]
        y=ind_y[i]
        z=ind_z[i]
        # get the region number from all surrounding cells including diagonals (27) or create new region
        xMin=int_max(x-1,0)
        xMax=int_min(x+1,x_dim-1)+1
        yMin=int_max(y-1,0)
        yMax=int_min(y+1,y_dim-1)+1
        zMin=int_max(z-1,0)
        zMax=int_min(z+1,z_dim-1)+1
        max_region = 0
        new_region = 2000000000 # huge number
        for xi in range(xMin, xMax):
            for yi in range(yMin, yMax):
                for zi in range(zMin, zMax):
                    val = array_region[xi,yi,zi]
                    if val > max_region: # val is the new maximum
                        max_region = val
                    if 0 < val < new_region: # val is the new minimum
                        new_region = val
        if max_region > 0:
            if max_region > new_region:
                if equivalences.count(max_region) == 0 or new_region < equivalences[max_region]:
                    equivalences[max_region] = new_region
        else:
            n_regions += 1
            new_region = n_regions
        array_region[x,y,z] = new_region
    #Scan Array again, assigning all equivalent regions the same region value.
    cdef int r
    for i in cells:
        x=ind_x[i]
        y=ind_y[i]
        z=ind_z[i]
        r = array_region[x,y,z]
        while equivalences.count(r) > 0:
            r= equivalences[r]
        array_region[x,y,z]=r
    return array_region
Setup file [setup.py]
# distutils was removed in Python 3.12; setuptools + cythonize is the current equivalent
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy
setup(
    ext_modules=cythonize([
        Extension("ConnectComp", ["ConnectedComponents.pyx"],
                  include_dirs=[numpy.get_include()],
                  language="c++"),
    ])
)
Build command:
python setup.py build_ext --inplace

As @gotgenes points out, you should definitely be using cython -a <file> and trying to reduce the amount of yellow you see. The darker the yellow, the more Python C-API interaction the generated C code for that line contains.
Things I found that reduced the amount of yellow:
This looks like a situation where there will never be any out of bounds array access, as long as the input Array has 3 dimensions, so one can turn off bounds checking:
cimport cython
@cython.boundscheck(False)
def find_regions_3D(...):
Give the compiler more information for efficient indexing, i.e. whenever you cdef an ndarray give as much information as you can:
def find_regions_3D(np.ndarray[DTYPE_t,ndim=3] Array not None):
    [...]
    cdef np.ndarray[DTYPE_t,ndim=3] array_region = ...
    [etc.]
Give the compiler more information about signedness, i.e. if you know a certain variable is never going to be negative, cdef it as unsigned int rather than int, as this means that Cython can eliminate any negative-indexing checks.
Unpack the ind tuple immediately, i.e.
ind = np.where(Array==1)
cdef np.ndarray[DTYPE_t,ndim=1] ind_x = ind[0], ind_y = ind[1], ind_z = ind[2]
Avoid using the for x,y,z in zip(..[0],..[1],..[2]) construct. In both cases, replace it with
cdef int i
for i in range(len(ind_x)):
    x = ind_x[i]
    y = ind_y[i]
    z = ind_z[i]
Avoid doing the fancy indexing/slicing. And especially avoid doing it twice! And avoid using filter! I.e. replace
max_region=array_region[xMin:xMax,yMin:yMax,zMin:zMax].max()
if max_region > 0:
    new_region = min(filter(lambda i: i > 0, array_region[xMin:xMax,yMin:yMax,zMin:zMax].ravel()))
    if max_region > new_region:
        if max_region in equivalences:
            equivalences[max_region].add(new_region)
        else:
            equivalences[max_region] = set((new_region, ))
with the more verbose
max_region = 0
new_region = 2000000000 # "infinity"
for xi in range(xMin, xMax):
    for yi in range(yMin, yMax):
        for zi in range(zMin, zMax):
            val = array_region[xi,yi,zi]
            if val > max_region: # val is the new maximum
                max_region = val
            if 0 < val < new_region: # val is the new minimum
                new_region = val
if max_region > 0:
    if max_region > new_region:
        if max_region in equivalences:
            equivalences[max_region].add(new_region)
        else:
            equivalences[max_region] = set((new_region, ))
else:
    n_regions += 1
    new_region = n_regions
This doesn't look so nice, but the triple loop compiles down to about 10 or so lines of C, while the compiled version of the original is hundreds of lines long and has a lot of Python object manipulation.
(Obviously you must cdef all the variables you use, especially xi, yi, zi and val in this code.)
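To convince yourself that the verbose triple loop computes the same two values as the slicing version, here is a plain-Python sketch (the function name is hypothetical) that can be compared against .max() and the smallest positive value on a test array. In pure Python it is slower than the slicing; the gain only appears once the loop variables are cdef'd in Cython.

```python
import numpy as np


def neighbourhood_min_max(array_region, xMin, xMax, yMin, yMax, zMin, zMax):
    """Scan the half-open box [xMin:xMax, yMin:yMax, zMin:zMax] with plain loops.

    Returns (max_region, new_region) like the triple loop above; new_region keeps
    the "infinity" sentinel if no label in the box is positive.
    """
    array_region = np.asarray(array_region)
    max_region = 0
    new_region = 2000000000  # "infinity" sentinel, as in the answer
    for xi in range(xMin, xMax):
        for yi in range(yMin, yMax):
            for zi in range(zMin, zMax):
                val = int(array_region[xi, yi, zi])
                if val > max_region:  # val is the new maximum
                    max_region = val
                if 0 < val < new_region:  # val is the new smallest positive label
                    new_region = val
    return max_region, new_region
```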
You don't need to store all the equivalences, since the only thing you do with the set is find the minimum element. So if you instead have equivalences mapping int to int, you can replace
if max_region in equivalences:
    equivalences[max_region].add(new_region)
else:
    equivalences[max_region] = set((new_region, ))
[...]
while r in equivalences:
    r = min(equivalences[r])
with
if max_region not in equivalences or new_region < equivalences[max_region]:
    equivalences[max_region] = new_region
[...]
while r in equivalences:
    r = equivalences[r]
The last thing to do after all that would be to not use any Python objects at all, specifically, don't use a dictionary for equivalences. This is now easy, since it is mapping int to int, so one could use from libcpp.map cimport map and then cdef map[int,int] equivalences, and replace .. not in equivalences with equivalences.count(..) == 0 and .. in equivalences with equivalences.count(..) > 0. (Note that it will then require a C++ compiler.)

(copied from the comment above for others' ease of reading)
I believe scipy's ndimage.label does what you want (I did not test it against your code, but it should be quite efficient). Note that you have to import it explicitly:
from scipy import ndimage
ndimage.label(your_data, connectivity_struct)
then you can apply other built-in functions (like finding the bounding box, centre of mass, etc.)
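A minimal sketch of that call for the 27-cell neighbourhood used in the question (label_3d is a hypothetical wrapper; ndimage.generate_binary_structure and ndimage.label are SciPy functions). generate_binary_structure(3, 3) gives a 3x3x3 block of True, so voxels touching only at a corner are still connected; the default structure for 3D input only connects face neighbours.

```python
import numpy as np
from scipy import ndimage


def label_3d(binary):
    """Label connected components of a 3D 0/1 array with full 26-neighbour connectivity."""
    # rank 3, connectivity 3: faces, edges and corners all count as connected
    structure = ndimage.generate_binary_structure(3, 3)
    labels, n_regions = ndimage.label(binary, structure=structure)
    return labels, n_regions
```

ndimage.label returns the label array and the number of components, so n_regions plays the role of the question's region counter, with labels already merged across equivalences.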

When optimizing for cython you want to make sure that in your loops mostly native C data types are used, not Python objects that come with a higher overhead. The best way to find such places is to look at the generated C code and look for lines that were translated into lots of Py* function calls. These places could usually be optimized by using cdef variables instead of python objects.
In your code I would for example suspect that the loop with zip produces lots of python objects and it would be much faster to iterate with an int index that is then used to get the elements in ind[0],.... But look at the generated C code and see what seems to call unnecessarily many python functions.
