Cython optimization slow - python

I am trying to optimize the following python code with cython:
from cython cimport boundscheck, wraparound

@boundscheck(False)
@wraparound(False)
def cython_color2gray(numpy.ndarray[numpy.uint8_t, ndim=3] image):
    cdef int x,y,z
    cdef double z_val, grey
    for x in range(len(image)):
        for y in range(len(image[x])):
            grey = 0
            for z in range(len(image[x][y])):
                if z == 0:
                    z_val = image[x][y][0] * 0.21
                    grey += z_val
                elif z == 1:
                    z_val = image[x][y][1] * 0.07
                    grey += z_val
                elif z == 2:
                    z_val = image[x][y][2] * 0.72
                    grey += z_val
            image[x][y][0] = grey
            image[x][y][1] = grey
            image[x][y][2] = grey
    return image
However, when checking whether everything is as optimized as it should be, I still see yellow lines in the annotated Cython output. Is there anything else I can do to optimize this Cython code and make it run faster?

Here are some key points:
len() is a Python function. Since image is a np.ndarray, use the .shape attribute to get the number of elements in each dimension.
Use image[i, j, k] instead of image[i][j][k] for element access.
Use memoryviews, i.e. (assuming image is C-contiguous) unsigned char[:, :, ::1] image instead of numpy.ndarray[numpy.uint8_t, ndim=3] image. The syntax is cleaner and element access is typically faster.
The variable grey is a double while the image's elements are np.uint8 (equivalent to unsigned char). So when doing image[i,j,k]=grey in Python, grey gets cast to an unsigned char, i.e. the decimal digits are cut off. In Cython you have to do the cast manually.
After you know your code works as expected, you can further accelerate it with directives for the Cython compiler, e.g. deactivating bounds checking (boundscheck) and negative-index handling (wraparound). Note that these are decorators that need to be imported.
Your code snippet then becomes:
from cython cimport boundscheck, wraparound

@boundscheck(False)
@wraparound(False)
def cython_color2gray(unsigned char[:, :, ::1] image):
    cdef int x,y,z
    cdef double z_val, grey
    for x in range(image.shape[0]):
        for y in range(image.shape[1]):
            grey = 0
            for z in range(image.shape[2]):
                if z == 0:
                    z_val = image[x, y, 0] * 0.21
                    grey += z_val
                elif z == 1:
                    z_val = image[x, y, 1] * 0.07
                    grey += z_val
                elif z == 2:
                    z_val = image[x, y, 2] * 0.72
                    grey += z_val
            image[x, y, :] = <unsigned char> grey
    return image
Looking closely, you'll see that there's no need for the innermost loop:
from cython cimport boundscheck, wraparound

@boundscheck(False)
@wraparound(False)
def cython_color2gray(unsigned char[:, :, ::1] image):
    cdef int x, y
    for x in range(image.shape[0]):
        for y in range(image.shape[1]):
            # Read all three channels before overwriting them with the grey value.
            image[x, y, :] = <unsigned char>(image[x,y,0]*0.21 + image[x,y,1]*0.07 + image[x,y,2] * 0.72)
    return image
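To check the Cython result, a plain NumPy version of the same computation can serve as a reference. This is only a sketch and not part of the original answer: the name `numpy_color2gray` is made up for illustration, and it assumes an H x W x 3 `uint8` array and the same weights as above. Unlike the Cython function, it returns a new array instead of modifying the input in place.

```python
import numpy as np

def numpy_color2gray(image):
    """Reference grayscale conversion (hypothetical helper, not from the answer)."""
    image = np.asarray(image, dtype=np.uint8)
    # uint8 * float promotes to float64, so the weighted sum does not overflow.
    grey = image[..., 0] * 0.21 + image[..., 1] * 0.07 + image[..., 2] * 0.72
    # astype(np.uint8) drops the fractional part, like the <unsigned char> cast.
    grey = grey.astype(np.uint8)
    # Copy the grey value into all three channels.
    return np.repeat(grey[..., np.newaxis], 3, axis=2)
```

Comparing `numpy_color2gray(img.copy())` with the output of the compiled function on a few test images is a quick way to confirm that the optimized loops still compute the same thing.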
Going one step further, you can try to accelerate the C code Cython generates by enabling your C compiler's auto-vectorization (in the sense of SIMD). For gcc/clang you can use the flags -O3 and -march=native. For MSVC it's /O2 and /arch:AVX2 (assuming your machine supports AVX2). If you're working inside a Jupyter notebook, you can pass C compiler flags via the -c=YOURFLAG argument of the cython magic, i.e.
%%cython -a -f -c=-O3 -c=-march=native
# your cython code here..
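Outside a notebook, the same flags would typically go into the build script. The helper below is a small sketch that only collects the flags named above per compiler family; the function name and the `"gcc"`/`"clang"`/`"msvc"` labels are assumptions made for illustration.

```python
def simd_compile_args(compiler):
    """Return the optimization flags mentioned above for a compiler family.

    `compiler` is a hypothetical label: "gcc", "clang" or "msvc".
    """
    flags = {
        "gcc": ["-O3", "-march=native"],
        "clang": ["-O3", "-march=native"],
        "msvc": ["/O2", "/arch:AVX2"],  # assumes the machine supports AVX2
    }
    if compiler not in flags:
        raise ValueError("unknown compiler family: %r" % (compiler,))
    return list(flags[compiler])

# In a setup.py the list would be passed along these lines (not executed here):
#   Extension("color2gray", ["color2gray.pyx"],
#             extra_compile_args=simd_compile_args("gcc"))
```

The module and file names in the comment are placeholders. Note that `-march=native` ties the compiled extension to the CPU it was built on.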

Related

Is there a faster method to find the difference of each pixel neighbor in an image?

I'm kinda new to image processing and I'm having trouble with the speed.
What I'm trying to do: calculate the difference between each pixel value and its neighbor, check whether there is a great contrast between them (> 100 in this case), and accumulate the differences that pass that check.
It is working but it is very slow. Is there any optimal way to do this?
%%cython -a
import cython
import numpy as np

@cython.boundscheck(False)
cpdef unsigned char[:, :] test(unsigned char [:, :] image):
    w = image.shape[1]
    h = image.shape[0]
    Hi = [None] * w
    Vj = [None] * h
    # For Hi
    for y in range(0, w):
        value1 = 0
        for x in range(1, h):
            value = abs(image[x, y] - image[(x-1), y])
            if value > 100:
                value1 += value
        Hi[y] = value1
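For comparison, the `Hi` loop shown above can also be written without explicit Python loops. The following is a sketch under assumptions, not an answer from the original thread: the function name is hypothetical, it covers only the neighbor direction computed in the visible part of the code (`image[x, y]` against `image[x-1, y]`), and it widens the `uint8` data first because NumPy's unsigned subtraction would otherwise wrap around.

```python
import numpy as np

def column_contrast_sums(image, threshold=100):
    """Per-column sum of |image[x, y] - image[x-1, y]| where it exceeds threshold."""
    # int16 holds every possible difference of two uint8 values without wraparound.
    img = np.asarray(image).astype(np.int16)
    diffs = np.abs(np.diff(img, axis=0))
    # Keep only the high-contrast differences and accumulate them per column.
    return np.where(diffs > threshold, diffs, 0).sum(axis=0)
```

The result has one entry per column, matching the length of `Hi` in the loop version.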

Nested loops with cython for image processing

I'm trying to iterate over a 2D image containing floating-point depth data. It has a fairly normal resolution (640, 480), but Python has been too slow, so I've been trying to optimize the problem by using Cython.
I've tried moving the looping to other functions and shifting the nogil statement around, but that didn't seem to work. After reworking the problem, I was able to get a portion of it working, but this last part keeps escaping me.
I've attempted to get rid of Python objects in the prange() loop by moving them to the with-GIL section beforehand, hence:
cdef int[:] w_list = array.array(range(0, w_inc, interpolation))
instead of
for r in range(0, w_inc, interpolation):
but the error persists
My code works in two parts:
The split_data() method subsections the image into num quadrants that are stored in a 3D array bits. These are used to make it easier to split the work across multiple threads/processes. This part works okay.
@cython.cdivision(True)
@cython.boundscheck(False)
cpdef split_data(double[:, :] frame, int h, int w, int num):
    cdef double[:, :, :] bits = np.zeros(shape=(num, h // num, w // num), dtype=float)
    cdef int c_count = os.cpu_count()
    cdef int i, j, k
    for i in prange(num, nogil=True, num_threads=c_count):
        for j in prange(h // num):
            for k in prange(w // num):
                bits[i, j, k] = frame[i * (h // num) + j, i * (w // num) + k]
    return bits
The scatter_data() method takes the bits array from the previous function and then creates another 3D array called points, of length num (where num is the length of bits), which holds a series of 3D coordinates representing valid depth points. It then uses prange() to extract the valid depth data from each of these bits and stores it in points.
@cython.cdivision(True)
@cython.boundscheck(False)
cpdef scatter_data(double[:, :] depths, object validator=None,
                   int h=-1, int w=-1, int interpolation=1):
    # Handles if h or w is -1 (default)
    if h < 0 or w < 0:
        h = depths.shape[0] if h < 0 else h
        w = depths.shape[1] if w < 0 else w
    cdef int max_num = w * h
    cdef int c_count = os.cpu_count()
    cdef int h_inc = h // c_count, w_inc = w // c_count
    cdef double[:, :, :] points = np.zeros(shape=(c_count, max_num, 3), dtype=float)
    cdef double[:, :, :] bits = split_data(depths, h, w, c_count)
    cdef int count = 0
    cdef int i, r, c
    cdef int[:] w_list = array.array(range(0, w_inc, interpolation))
    cdef int[:] h_list = array.array(range(0, h_inc, interpolation))
    for i in prange(c_count, nogil=True, num_threads=c_count):
        count = 0
        for r in w_list:
            for c in h_list:
                if depths[c, r] != 0:
                    points[i, count, 0] = w - r
                    points[i, count, 1] = c
                    points[i, count, 2] = depths[c, r]
                    count = count + 1
    points = points[:count]
    return points
And for completeness, here are my import statements:
import cython
from cython.parallel import prange
from cpython cimport array
import array
cimport numpy as np
import numpy as np
import os
When compiling the code I keep getting error messages something along the lines of:
Error compiling Cython file:
------------------------------------------------------------
...
cdef int[:] w_list = array.array(range(0, w_inc, interpolation))
cdef int[:] h_list = array.array(range(0, h_inc, interpolation))
for i in prange(c_count, nogil=True, num_threads=c_count):
count = 0
for r in w_list:
^
------------------------------------------------------------
data_util/cy_scatter.pyx:70:17: Iterating over Python object not allowed without gil
and
Error compiling Cython file:
------------------------------------------------------------
...
cdef int[:] w_list = array.array(range(0, w_inc, interpolation))
cdef int[:] h_list = array.array(range(0, h_inc, interpolation))
for i in prange(c_count, nogil=True, num_threads=c_count):
count = 0
for r in w_list:
^
------------------------------------------------------------
data_util/cy_scatter.pyx:70:17: Coercion from Python not allowed without the GIL
and
Error compiling Cython file:
------------------------------------------------------------
...
cdef int[:] w_list = array.array(range(0, w_inc, interpolation))
cdef int[:] h_list = array.array(range(0, h_inc, interpolation))
for i in prange(c_count, nogil=True, num_threads=c_count):
count = 0
for r in w_list:
^
------------------------------------------------------------
data_util/cy_scatter.pyx:70:17: Converting to Python object not allowed without gil
Is there a way to do this? And if so, how do I do this?
You just want to iterate by index rather than by iterating over a Python iterator:
for ri in range(w_list.shape[0]):
    r = w_list[ri]
This is a place where best practice in Python differs from best practice in Cython: Cython only accelerates iterating over numeric loops. The way you're trying to do it falls back to a Python iterator, which is both slower and requires the GIL.
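When reworking the loops this way, it can help to have a slow but simple reference to compare the parallel output against. The sketch below is an assumption-laden illustration rather than part of the answer: the name `scatter_reference` is hypothetical, it processes the whole frame at once (no per-thread split), and it emits the same `(w - r, c, depth)` triples for non-zero depths that the question's inner loop writes.

```python
import numpy as np

def scatter_reference(depths, interpolation=1):
    """Collect (w - col, row, depth) for every sampled non-zero depth value."""
    depths = np.asarray(depths, dtype=float)
    w = depths.shape[1]
    # Sample every `interpolation`-th row and column, as range(0, n, step) would.
    rows, cols = np.nonzero(depths[::interpolation, ::interpolation])
    rows = rows * interpolation
    cols = cols * interpolation
    return np.column_stack((w - cols, rows, depths[rows, cols]))
```

The points come out in row-major order, which may differ from the order a multi-threaded version produces, so sort both results before comparing them.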

Cython, Complex values, and BM3D algorithm

I am working on an image reconstruction algorithm and I found this repo online that would work great with my code, but unfortunately it doesn't seem to support complex-valued calculations. I've read up on Cython over the past couple of days, but I'm pressed for time and I wanted to ask for advice before bulldozing all over the code.
To be more exact, this is the Cython file:
from libcpp.vector cimport vector
from libcpp cimport bool
cimport numpy as np
import numpy as np

cdef extern from "../bm3d_src/mt19937ar.h":
    double mt_genrand_res53()

cdef extern from "../bm3d_src/bm3d.h":
    int run_bm3d( const float sigma, vector[float] &img_noisy,
                  vector[float] &img_basic,
                  vector[float] &img_denoised,
                  const unsigned width,
                  const unsigned height,
                  const unsigned chnls,
                  const bool useSD_h,
                  const bool useSD_w,
                  const unsigned tau_2D_hard,
                  const unsigned tau_2D_wien,
                  const unsigned color_space)

cdef extern from "../bm3d_src/utilities.h":
    int save_image(char * name, vector[float] & img,
                   const unsigned width,
                   const unsigned height,
                   const unsigned chnls)

def hello():
    return "Hello World"

def random():
    return mt_genrand_res53()

cpdef float[:, :, :] bm3d(float[:, :, :] input_array,
                          float sigma,
                          bool useSD_h = True,
                          bool useSD_w = True,
                          str tau_2D_hard = "DCT",
                          str tau_2D_wien = "DCT"
                          ):
    """
    sigma: value of assumed noise of the noisy image;
    input_array : input image, H x W x channum
    useSD_h (resp. useSD_w): if true, use weight based
        on the standard variation of the 3D group for the
        first (resp. second) step, otherwise use the number
        of non-zero coefficients after Hard Thresholding
        (resp. the norm of Wiener coefficients);
    tau_2D_hard (resp. tau_2D_wien): 2D transform to apply
        on every 3D group for the first (resp. second) part.
        Allowed values are 'DCT' and 'BIOR';
    # FIXME : add color space support; right now just RGB
    """
    cdef vector[float] input_image
    cdef vector[float] basic_image
    cdef vector[float] output_image
    cdef vector[float] denoised_image
    height = input_array.shape[0]
    width = input_array.shape[1]
    chnls = input_array.shape[2]
    # convert the input image
    input_image.resize(input_array.size)
    pos = 0
    for i in range(input_array.shape[0]):
        for j in range(input_array.shape[1]):
            for k in range(input_array.shape[2]):
                input_image[pos] = input_array[i, j, k]
                pos +=1
    if tau_2D_hard == "DCT":
        tau_2D_hard_i = 4
    elif tau_2D_hard == "BIOR" :
        tau_2D_hard_i = 5
    else:
        raise ValueError("Unknown tau_2d_hard, must be DCT or BIOR")
    if tau_2D_wien == "DCT":
        tau_2D_wien_i = 4
    elif tau_2D_wien == "BIOR" :
        tau_2D_wien_i = 5
    else:
        raise ValueError("Unknown tau_2d_wien, must be DCT or BIOR")
    # FIXME someday we'll have color support
    color_space = 0
    ret = run_bm3d(sigma, input_image, basic_image, output_image,
                   width, height, chnls,
                   useSD_h, useSD_w,
                   tau_2D_hard_i, tau_2D_wien_i,
                   color_space)
    if ret != 0:
        raise Exception("run_bmd3d returned an error, retval=%d" % ret)
    cdef np.ndarray output_array = np.zeros([height, width, chnls],
                                            dtype = np.float32)
    pos = 0
    for i in range(input_array.shape[0]):
        for j in range(input_array.shape[1]):
            for k in range(input_array.shape[2]):
                output_array[i, j, k] = output_image[pos]
                pos +=1
    return output_array
How would I go about making the most minimal changes such that it'll work with numpy array with dtype='complex'?
Cheers!
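One low-effort option that leaves the Cython wrapper untouched is to run the real-valued denoiser on the real and imaginary parts separately and recombine them. This is only a sketch under a loud assumption: whether denoising the two parts independently is appropriate depends on the noise model, and nothing in the repo confirms it. The name `denoise_complex` is hypothetical, and `denoise_real` stands for any callable that takes and returns a float32 array, for example a lambda wrapping the `bm3d` function above.

```python
import numpy as np

def denoise_complex(image, denoise_real):
    """Apply a real-valued denoiser to the real and imaginary parts separately."""
    image = np.asarray(image)
    # The wrapper above expects contiguous float32 data, so convert each part.
    real = denoise_real(np.ascontiguousarray(image.real, dtype=np.float32))
    imag = denoise_real(np.ascontiguousarray(image.imag, dtype=np.float32))
    # Recombine the two denoised parts into one complex array.
    return np.asarray(real) + 1j * np.asarray(imag)
```

With the wrapper above, a call could look like `denoise_complex(img, lambda part: bm3d(part, sigma))`, where `img` and `sigma` are placeholders for your own data and noise level.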

Defining NumPy arrays in Cython without incurring python overhead

I have been trying to learn Cython to speed up some of my calculations. Here is a subset of what I am trying to do: this is simply integrating a differential equation using a recursive formula while making use of NumPy arrays. I have already achieved a factor of ~100x speed increase over the pure python version. However it seems like I can gain added speed based on looking at the HTML file generated for my code by the -a cython command. My code is as follows (lines that become yellow in the HTML file that I would like to make white are labeled):
%%cython
import numpy as np
cimport numpy as np
cimport cython
from libc.math cimport exp,sqrt

@cython.boundscheck(False)
cdef double riccati_int(double j, double w, double h, double an, double d):
    cdef:
        double W
        double an1
    W = sqrt(w**2 + d**2)
    #dark_yellow
    an1 = (((d - (W + w) * an) * exp(-2 * W * h / j ) - d - (W - w) * an) /
           ((d * an - W + w) * exp(-2 * W * h / j) - d * an - W - w))
    return an1

def acalc(double j, double w):
    cdef:
        int xpos, i, n
        np.ndarray[np.int_t, ndim=1] xvals
        np.ndarray[np.double_t, ndim=1] h, a
    xpos = 74
    xvals = np.array([0, 8, 23, 123, 218], dtype=np.int) #dark_yellow
    h = np.array([1, .1, .01, .1], dtype=np.double) #dark_yellow
    a = np.empty(219, dtype=np.double) #dark_yellow
    a[0] = 1 / (w + sqrt(w**2 + 1)) #light_yellow
    for i in range(h.size): #dark_yellow
        for n in range(xvals[i], xvals[i + 1]): #light_yellow
            if n < xpos:
                a[n+1] = riccati_int(j, w, h[i], a[n], 1.) #light_yellow
            else:
                a[n+1] = riccati_int(j, w, h[i], a[n], 0.) #light_yellow
    return a
It seems to me like all 9 lines that I labeled above should be able to be made white with the proper adjustments. One issue is the ability to define NumPy arrays the proper way. But probably even more important is the ability to get the first labeled line to work efficiently, since this is where the bulk of the calculation is done. I tried reading the generated C code that the HTML file displays after clicking on a yellow line, but I honestly have no clue how to read that code. If anybody could please help me out, it would be greatly appreciated.
I don't think you need to care about yellow lines that are not inside a loop. Adding the following compiler directives should make the three lines in the loop faster:
@cython.cdivision(True)
cdef double riccati_int(double j, double w, double h, double an, double d):
    pass

@cython.boundscheck(False)
@cython.wraparound(False)
def acalc(double j, double w):
    pass
I'm not sure whether it makes a difference, but you could use memoryviews for the arrays, e.g.
cdef double [:] h = np.array([1, .1, .01, .1], dtype=np.double) #dark_yellow
cdef double [:] a = np.empty(219, dtype=np.double) #dark_yellow
Also, creating a NumPy array for four static values is a bit overdone. It can be replaced by a static C array:
cdef double *h = [1, .1, .01, .1]
However, as mentioned, what happens inside the loop matters most. Since line_profiler won't work for Cython (afaik), use the time module to benchmark within the function, in addition to cProfile. That should show you that the intensity of the line color in the Cython annotation has to be assessed in context.
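A pure-Python reference of the same recursion is handy both as a timing baseline and for checking that an optimized build still returns the same values. The sketch below mirrors the formulas in the question's code; `time_call` is a hypothetical helper added here for illustration, and it reports the best wall-clock time over a few repeats using `time.perf_counter`.

```python
import math
import time

def riccati_int(j, w, h, an, d):
    """One step of the recursion, same formula as in the question."""
    W = math.sqrt(w**2 + d**2)
    e = math.exp(-2 * W * h / j)
    num = (d - (W + w) * an) * e - d - (W - w) * an
    den = (d * an - W + w) * e - d * an - W - w
    return num / den

def acalc(j, w):
    """Pure-Python version of acalc, returning a list of 219 values."""
    xpos = 74
    xvals = [0, 8, 23, 123, 218]
    h = [1, .1, .01, .1]
    a = [0.0] * 219
    a[0] = 1 / (w + math.sqrt(w**2 + 1))
    for i in range(len(h)):
        for n in range(xvals[i], xvals[i + 1]):
            d = 1.0 if n < xpos else 0.0
            a[n + 1] = riccati_int(j, w, h[i], a[n], d)
    return a

def time_call(func, *args, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best
```

Calling `time_call(acalc, 1.0, 1.0)` on this version and on the compiled one gives a direct before/after comparison that does not depend on how yellow a line looks.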
As I learned, it is also recommended to use the dedicated index types for indexing:
size_t i, n
Py_ssize_t i, n
The second one is the signed version.

numpy template matching using matrix multiplications

I am trying to match a template with a binary image (only black and white) by shifting the template along the image. And return the minimum distance between the template and the image with the corresponding position on which this minimum distance did occur. For example:
img:
0 1 0
0 0 1
0 1 1
template:
0 1
1 1
This template matches the image best at position (1,1) and the distance will then be 0. So far things are not too difficult and I already got some code that does the trick.
def match_template(img, template):
    mindist = float('inf')
    idx = (-1,-1)
    for y in xrange(img.shape[1]-template.shape[1]+1):
        for x in xrange(img.shape[0]-template.shape[0]+1):
            #calculate Euclidean distance
            dist = np.sqrt(np.sum(np.square(template - img[x:x+template.shape[0],y:y+template.shape[1]])))
            if dist < mindist:
                mindist = dist
                idx = (x,y)
    return [mindist, idx]
But for images of the size I need (images of around 500 x 200 pixels and templates of around 250 x 100) this already takes approximately 4.5 seconds, which is way too slow. And I know the same thing can be done much quicker using matrix multiplications (in MATLAB I believe this can be done using im2col and repmat). Can anyone explain to me how to do it in Python/NumPy?
Btw, I know there is an OpenCV matchTemplate function that does exactly what I need, but since I might need to alter the code slightly later on, I would prefer a solution which I fully understand and can alter.
Thanks!
edit: If anyone can explain to me how OpenCV does this in less than 0.2 seconds, that would also be great. I have had a short look at the source code, but those things somehow always look quite complicated to me.
edit2: Cython code
import numpy as np
cimport numpy as np

DTYPE = np.int
ctypedef np.int_t DTYPE_t

def match_template(np.ndarray img, np.ndarray template):
    cdef float mindist = float('inf')
    cdef int x_coord = -1
    cdef int y_coord = -1
    cdef float dist
    cdef unsigned int x, y
    cdef int img_width = img.shape[0]
    cdef int img_height = img.shape[1]
    cdef int template_width = template.shape[0]
    cdef int template_height = template.shape[1]
    cdef int range_x = img_width-template_width+1
    cdef int range_y = img_height-template_height+1
    for y from 0 <= y < range_y:
        for x from 0 <= x < range_x:
            dist = np.sqrt(np.sum(np.square(template - img[ x:<unsigned int>(x+template_width), y:<unsigned int>(y+template_height) ]))) #calculate euclidean distance
            if dist < mindist:
                mindist = dist
                x_coord = x
                y_coord = y
    return [mindist, (x_coord,y_coord)]

img = np.asarray(img, dtype=DTYPE)
template = np.asarray(template, dtype=DTYPE)
match_template(img, template)
One possible way of doing what you want is via convolution (which can be brute force or FFT). Matrix multiplications AFAIK won't work. You need to convolve your data with the template. And find the maximum (you'll also need to do some scaling to make it work properly).
xs=np.array([[0,1,0],[0,0,1],[0,1,1]])*1.
ys=np.array([[0,1],[1,1]])*1.
print scipy.ndimage.convolve(xs,ys,mode='constant',cval=np.inf)
>>> array([[ 1., 1., inf],
[ 0., 2., inf],
[ inf, inf, inf]])
print scipy.signal.fftconvolve(xs,ys,mode='valid')
>>> array([[ 1., 1.],
[ 0., 2.]])
There may be a fancy way to get this done using pure numpy/scipy magic. But it might be easier (and more understandable when you look at the code in the future) to just drop into Cython to get this done. There's a good tutorial for integrating Cython with numpy at http://docs.cython.org/src/tutorial/numpy.html.
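One way to read the "scaling" mentioned above is to expand the squared distance as Σ(T − I)² = ΣT² − 2·ΣT·I + ΣI², so that the cross term becomes a correlation of the image with the template. The sketch below is an illustration of that idea in plain NumPy, not the answerer's code: the function name is hypothetical, and it assumes NumPy 1.20 or later for `sliding_window_view`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

def match_template_ssd(img, template):
    """Return (min Euclidean distance, (x, y)) over all template positions."""
    img = np.asarray(img, dtype=np.float64)
    template = np.asarray(template, dtype=np.float64)
    # View of every template-sized window; no pixel data is copied.
    windows = sliding_window_view(img, template.shape)
    # Cross term: sum of T * I for each window position.
    cross = np.einsum('ijkl,kl->ij', windows, template)
    # Sum of I**2 for each window position.
    win_sq = np.einsum('ijkl,ijkl->ij', windows, windows)
    ssd = win_sq - 2.0 * cross + np.sum(template * template)
    # Guard against tiny negative values caused by floating-point rounding.
    ssd = np.maximum(ssd, 0.0)
    idx = np.unravel_index(np.argmin(ssd), ssd.shape)
    return float(np.sqrt(ssd[idx])), (int(idx[0]), int(idx[1]))
```

This moves the loops into compiled code but still does the same amount of arithmetic as the brute-force version; for large templates, computing the cross term with an FFT-based convolution, as suggested above, is the part that changes the cost.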
EDIT:
I did a quick test with your Cython code and it ran ~15 sec for a 500x400 img with a 100x200 template. After some tweaks (eliminating the numpy method calls and numpy bounds checking), I got it down under 3 seconds. That may not be enough for you, but it shows the possibility.
import numpy as np
cimport numpy as np
cimport cython
from libc.math cimport sqrt

# np.int was removed in NumPy 1.24; np.int_ is the matching dtype for np.int_t.
DTYPE = np.int_
ctypedef np.int_t DTYPE_t

@cython.boundscheck(False)
def match_template(np.ndarray[DTYPE_t, ndim=2] img, np.ndarray[DTYPE_t, ndim=2] template):
    cdef float mindist = float('inf')
    cdef int x_coord = -1
    cdef int y_coord = -1
    cdef float dist
    cdef unsigned int x, y
    cdef int img_width = img.shape[0]
    cdef int img_height = img.shape[1]
    cdef int template_width = template.shape[0]
    cdef int template_height = template.shape[1]
    cdef int range_x = img_width-template_width+1
    cdef int range_y = img_height-template_height+1
    cdef DTYPE_t total
    cdef int delta
    cdef unsigned int j, k, j_plus, k_plus
    for y from 0 <= y < range_y:
        for x from 0 <= x < range_x:
            #dist = np.sqrt(np.sum(np.square(template - img[ x:<unsigned int>(x+template_width), y:<unsigned int>(y+template_height) ]))) #calculate euclidean distance
            # Do the same operations, but in plain C
            total = 0
            for j from 0 <= j < template_width:
                j_plus = <unsigned int>x + j
                for k from 0 <= k < template_height:
                    k_plus = <unsigned int>y + k
                    delta = template[j, k] - img[j_plus, k_plus]
                    total += delta*delta
            dist = sqrt(total)
            if dist < mindist:
                mindist = dist
                x_coord = x
                y_coord = y
    return [mindist, (x_coord,y_coord)]
