iterating through specified axis in cython

iterating through specified axis in cython - python

I am learning cython and I have modified the code in the tutorial to try to do numerical differentiation:
import numpy as np
cimport numpy as np
import cython
np.import_array()
def test3(a, int order=2, int axis=-1):
cdef int i
if axis<0:
axis = len(a.shape) + axis
out = np.empty(a.shape, np.double)
cdef np.flatiter ita = np.PyArray_IterAllButAxis(a, &axis)
cdef np.flatiter ito = np.PyArray_IterAllButAxis(out, &axis)
cdef int a_axis_stride = a.strides[axis]
cdef int o_axis_stride = out.strides[axis]
cdef int axis_length = out.shape[axis]
cdef double value
while np.PyArray_ITER_NOTDONE(ita):
# first element
pt1 = <double*>((<char*>np.PyArray_ITER_DATA(ita)))
pt2 = (<double*>((<char*>np.PyArray_ITER_DATA(ita)) + 1*a_axis_stride))
pt3 = (<double*>((<char*>np.PyArray_ITER_DATA(ita)) + 2*a_axis_stride))
value = -1.5*pt1[0] + 2*pt2[0] - 0.5*pt3[0]
(<double*>((<char*>np.PyArray_ITER_DATA(ito))))[0] = value
for i in range(axis_length-2):
pt1 = (<double*>((<char*>np.PyArray_ITER_DATA(ita)) + i*a_axis_stride))
pt2 = (<double*>((<char*>np.PyArray_ITER_DATA(ita)) + (i+2)*a_axis_stride))
value = -0.5*pt1[0] + 0.5*pt2[0]
(<double*>((<char*>np.PyArray_ITER_DATA(ito)) + (i+1)*o_axis_stride))[0] = value
# last element
pt1 = (<double*>((<char*>np.PyArray_ITER_DATA(ita))+ (axis_length-3)*a_axis_stride))
pt2 = (<double*>((<char*>np.PyArray_ITER_DATA(ita))+ (axis_length-2)*a_axis_stride))
pt3 = (<double*>((<char*>np.PyArray_ITER_DATA(ita))+ (axis_length-1)*a_axis_stride))
value = 1.5*pt3[0] - 2*pt2[0] + 0.5*pt1[0]
(<double*>((<char*>np.PyArray_ITER_DATA(ito))+(axis_length-1)*o_axis_stride))[0] = value
np.PyArray_ITER_NEXT(ita)
np.PyArray_ITER_NEXT(ito)
return out
The code produces correct results, and it is indeed faster than my own numpy implementation without cython. The problem is the following:
I thought about only having one pt1 = (<double*>((<char*>np.PyArray_ITER_DATA(ita)) + i*a_axis_stride)) statement and then use pt1[0], pt1[-1], pt1[1] to access the array elements, but this only works if the specified axis is the last one. If I am differentiating a different axis (not the last one), then (<double*>((<char*>np.PyArray_ITER_DATA(ita)) + i*a_axis_stride)) points to the right one, but pt[-1] and pt[1] point to the elements right before and after pt[0], which is along the last axis. The current version works, but if I want to implement higher-order differentiation which requires more points to evaluate, then I will end up having many such lines, and I'm not sure if there are better/more efficient ways to do it using pt[1] or
something like pt[xxx] to access neighbouring points (along the specified axis).
Are there other ways to speed up this piece of code? I am looking for some minor details that I may have overlooked or subtle things that can have a big effect.

To my slight surprise I can't actually beat your version using Cython typed memoryviews - the numpy iterators look pretty quick. However I think I can manage a significant increase in readability to let you use the Python slicing syntax. The only restriction is that the input array must be C contiguous to allow it to be reshaped easily (I think Fortran contiguous might also work, but I haven't tested)
The basic trick is to flatten all the axes before and after selected axis so it is a known 3D shape, at which point you can use Cython memoryviews.
#cython.boundscheck(False)
def test4(a,order=2,axis=-1):
assert a.flags['C_CONTIGUOUS'] # otherwise the reshape doesn't work
before = np.product(a.shape[:axis])
after = np.product(a.shape[(axis+1):])
cdef double[:,:,::1] a_new = a.reshape((before, a.shape[axis], after)) # this should not involve copying memory - it's just a new view
cdef double[:] a_slice
cdef double[:,:,::1] out = np.empty_like(a_new)
assert a_new.shape[1] > 3
cdef int m,n,i
for m in range(a_new.shape[0]):
for n in range(a_new.shape[2]):
a_slice = a_new[m,:,n]
out[m,0,n] = -1.5*a_slice[0] + 2*a_slice[1] - 0.5*a_slice[2]
for i in range(a_slice.shape[0]-2):
out[m,i+1,n] = -0.5*a_slice[i] + 0.5*a_slice[i+2]
# last element
out[m,-1,n] = 1.5*a_slice[-1] - 2*a_slice[-2] + 0.5*a_slice[-3]
return np.asarray(out).reshape(a.shape)
The speed is very slightly slower than your version I think.
In terms of improving your code, you could work out the stride in doubles instead of bytes (a_axis_stride_dbl = a_axis_stride/sizeof(double)) and then index as pt[i*a_axis_stride_dbl]). It probably won't gain much speed but will be more readable. (This is what you ask about in point 1)

Related

Explain pitch, width, height, depth in memory for 3D arrays

I am working with CUDA and 3D textures in python (using pycuda). There is a function called Memcpy3D which has the same members as Memcpy2D plus a few extras. In it it calls you to describe things such as width_in_bytes, src_pitch, src_height, height and copy_depth. This is what I am struggling with (in 3D) and its relevance with C or F style indexing. For instance, if I simply change the ordering from F to C in the working example below, it stops working - and I don't know why.
First of all, I understand pitch to be how many bytes in memory it takes to move one index across in threadIdx.x (or the x direction, or a column). So for a float32 array of C shape (3,2,4), to move one value in x I expect to move 4 values in memory (as the indexing goes down the z axis first?). Therefore my pitch would be 4*32bits.
I understand height to be the number of rows. (In this example, 3)
I understand width to be the number of cols. (In this example, 2)
I understand depth to be the number of z slices. (In this example, 4)
I understand width_in_bytes to be the width of a row in x inclusive of the z elements behind it, i.e. a row slice, (0,:,:). This would be how many addresses in memory it takes to transverse one element in the y-direction.
So when I change the ordering from F to C in the code below, and adapt the code to change the height/width values accordingly it still doesn't work. It just presents a logic failure which makes me think I'm not understanding the concept of pitch, width, height, depth correctly.
Please educate me.
Below is a full working script that copies an array to the GPU as a texture and copies the contents back.
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np
w = 2
h = 3
d = 4
shape = (w, h, d)
a = np.arange(24).reshape(*shape,order='F').astype('float32')
print(a.shape,a.strides)
print(a)
descr = drv.ArrayDescriptor3D()
descr.width = w
descr.height = h
descr.depth = d
descr.format = drv.dtype_to_array_format(a.dtype)
descr.num_channels = 1
descr.flags = 0
ary = drv.Array(descr)
copy = drv.Memcpy3D()
copy.set_src_host(a)
copy.set_dst_array(ary)
copy.width_in_bytes = copy.src_pitch = a.strides[1]
copy.src_height = copy.height = h
copy.depth = d
copy()
mod = SourceModule("""
texture<float, 3, cudaReadModeElementType> mtx_tex;
__global__ void copy_texture(float *dest)
{
int x = threadIdx.x;
int y = threadIdx.y;
int z = threadIdx.z;
int dx = blockDim.x;
int dy = blockDim.y;
int i = (z*dy + y)*dx + x;
dest[i] = tex3D(mtx_tex, x, y, z);
}
""")
copy_texture = mod.get_function("copy_texture")
mtx_tex = mod.get_texref("mtx_tex")
mtx_tex.set_array(ary)
dest = np.zeros(shape, dtype=np.float32, order="F")
copy_texture(drv.Out(dest), block=shape, texrefs=[mtx_tex])
print(dest)

Not sure I fully understand the problem in your code, but I'll attempt to clarify.
In CUDA, width (x) refers to the fastest-changing dimension, height (y) is the middle dimension, and depth (z) is the slowest-changing dimension. The pitch refers to the stride in bytes required to step between values along the y dimension.
In Numpy, an array defined as np.empty(shape=(3,2,4), dtype=np.float32, order="C") has strides=(32, 16, 4), and corresponds to width=4, height=2, depth=3, pitch=16.
Using "F" ordering in Numpy means the order of dimensions is reversed in memory.
Your code appears to work if I make the following changes:
#shape = (w, h, d)
shape = (d, h, w)
#a = np.arange(24).reshape(*shape,order='F').astype('float32')
a = np.arange(24).reshape(*shape,order='C').astype('float32')
...
#dest = np.zeros(shape, dtype=np.float32, order="F")
dest = np.zeros(shape, dtype=np.float32, order="C")
#copy_texture(drv.Out(dest), block=shape, texrefs=[mtx_tex])
copy_texture(drv.Out(dest), block=(w,h,d), texrefs=[mtx_tex])

High performance variable blurring in very big images using Python

I have a large collection of large images (ex. 15000x15000 pixels) that I would like to blur. I need to blur the images using a distance function, so the further away I move from some areas in the image the more heavier the blurring should be. I have a distance map describing how far a given pixel is from the areas.
Due to the large amount of images I have to consider performance. I have looked at NumPY/SciPY, they have some great functions but they seem to use a fixed kernel size and I need to reduce or increase the kernel size depending on the distance to the previous mentioned areas.
How can I solve this problem in python?
UPDATE: My solution so far based on the answer by rth:
# cython: boundscheck=False
# cython: cdivision=True
# cython: wraparound=False
import numpy as np
cimport numpy as np
def variable_average(int [:, ::1] data, int[:,::1] kernel_size):
cdef int width, height, i, j, ii, jj
width = data.shape[1]
height = data.shape[0]
cdef double [:, ::1] data_blurred = np.empty([width, height])
cdef double res
cdef int sigma, weight
for i in range(width):
for j in range(height):
weight = 0
res = 0
sigma = kernel_size[i, j]
for ii in range(i - sigma, i + sigma + 1):
for jj in range(j - sigma, j + sigma + 1):
if ii < 0 or ii >= width or jj < 0 or jj >= height:
continue
res += data[ii, jj]
weight += 1
data_blurred[i, j] = res/weight
return data_blurred
Test:
data = np.random.randint(256, size=(1024,1024))
kernel = np.random.randint(256, size=(1024,1024)) + 1
result = np.asarray(variable_average(data, kernel))
The method using the above settings takes around 186seconds to run. Is that what I can expect to ultimately squeeze out of the method or are there optimizations that I can use to further increase the performance (still using Python)?

As you have noted related scipy functions do not support variable size blurring. You could implement this in pure python with for loops, then use Cython, Numba or PyPy to get a C-like performance.
Here is a low level python implementation, than uses numpy only for data storage,
import numpy as np
def variable_blur(data, kernel_size):
""" Blur with a variable window size
Parameters:
- data: 2D ndarray of floats or integers
- kernel_size: 2D ndarray of integers, same shape as data
Returns:
2D ndarray
"""
data_blurred = np.empty(data.shape)
Ni, Nj = data.shape
for i in range(Ni):
for j in range(Nj):
res = 0.0
weight = 0
sigma = kernel_size[i, j]
for ii in range(i - sigma, i+sigma+1):
for jj in range(j - sigma, j+sigma+1):
if ii<0 or ii>=Ni or jj < 0 or jj >= Nj:
continue
res += data[ii, jj]
weight += 1
data_blurred[i, j] = res/weight
return data_blurred
data = np.random.rand(50, 20)
kernel_size = 3*np.ones((50, 20), dtype=np.int)
variable_blur(data, kernel_size)
that calculates an arithmetic average of pixels with a variable kernel size. It is a bad implementation with respect to numpy, in a sense that is it not vectorized. However, this makes it convenient to port to other high performance solutions:
Cython: simply statically typing variables, and compiling should give you C-like performance,
def variable_blur(double [:, ::1] data, long [:, ::1] kernel_size):
cdef double [:, ::1] data_blurred = np.empty(data.shape)
cdef Py_ssize_t Ni, Nj
Ni = data.shape[0]
Nj = data.shape[1]
for i in range(Ni):
# [...] etc.
see this post for a complete example, as well as the compilation notes.
Numba: Wrapping the above function with the #jit decorator, should be mostly sufficient.
PyPy: installing PyPy + the experimental numpy branch, could be another alternative worth trying. Although, then you would have to use PyPy for all your code, which might not be possible at present.
Once you have a fast implementation, you can then use multiprocessing, etc. to process different images in parallel, if need be. Or even parallelize with OpenMP in Cython the outer for loop.

I came across this while googling and thought I would share my own solution which is mostly vectorized and doesn't include any for loops on pixels. You can approximate a Gaussian blur by running a box blur multiple times in a row. So the approach I decided to use is to iteratively box blur the image, but to vary the number of iterations per pixel using a weighting function.
If you need a large blur radius, the number of iterations grows quadratically, so consider increasing the ksize.
Here is the implementation
import cv2
def variable_blur(im, sigma, ksize=3):
"""Blur an image with a variable Gaussian kernel.
Parameters
----------
im: numpy array, (h, w)
sigma: numpy array, (h, w)
ksize: int
The box blur kernel size. Should be an odd number >= 3.
Returns
-------
im_blurred: numpy array, (h, w)
"""
variance = box_blur_variance(ksize)
# Number of times to blur per-pixel
num_box_blurs = 2 * sigma**2 / variance
# Number of rounds of blurring
max_blurs = int(np.ceil(np.max(num_box_blurs))) * 3
# Approximate blurring a variable number of times
blur_weight = num_box_blurs / max_blurs
current_im = im
for i in range(max_blurs):
next_im = cv2.blur(current_im, (ksize, ksize))
current_im = next_im * blur_weight + current_im * (1 - blur_weight)
return current_im
def box_blur_variance(ksize):
x = np.arange(ksize) - ksize // 2
x, y = np.meshgrid(x, x)
return np.mean(x**2 + y**2)
And here is an example
im = np.random.rand(300, 300)
sigma = 3
# Variable
x = np.linspace(0, 1, im.shape[1])
y = np.linspace(0, 1, im.shape[0])
x, y = np.meshgrid(x, y)
sigma_arr = sigma * (x + y)
im_variable = variable_blur(im, sigma_arr)
# Gaussian
ksize = sigma * 8 + 1
im_gauss = cv2.GaussianBlur(im, (ksize, ksize), sigma)
# Gaussian replica
sigma_arr = np.full_like(im, sigma)
im_approx = variable_blur(im, sigma_arr)
Blurring results
The plot is:
Top left: Source image
Top right: Variable blurring
Bottom left: Gaussian blurring
Bottom right: Approximated Gaussian blurring

Cythonize two small numpy functions, help needed

The problem
I'm trying to Cythonize two small functions that mostly deal with numpy ndarrays for some scientific purpose. These two smalls functions are called millions of times in a genetic algorithm and account for the majority of the time taken by the algo.
I made some progress on my own and both work nicely, but i get only a tiny speed improvement (10%). More importantly, cython --annotate show that the majority of the code is still going through Python.
The code
First function:
The aim of this function is to get back slices of data and it is called millions of times in an inner nested loop. Depending on the bool in data[1][1], we either get the slice in the forward or reverse order.
#Ipython notebook magic for cython
%%cython --annotate
import numpy as np
from scipy import signal as scisignal
cimport cython
cimport numpy as np
def get_signal(data):
#data[0] contains the data structure containing the numpy arrays
#data[1][0] contains the position to slice
#data[1][1] contains the orientation to slice, forward = 0, reverse = 1
cdef int halfwinwidth = 100
cdef int midpoint = data[1][0]
cdef int strand = data[1][1]
cdef int start = midpoint - halfwinwidth
cdef int end = midpoint + halfwinwidth
#the arrays we want to slice
cdef np.ndarray r0 = data[0]['normals_forward']
cdef np.ndarray r1 = data[0]['normals_reverse']
cdef np.ndarray r2 = data[0]['normals_combined']
if strand == 0:
normals_forward = r0[start:end]
normals_reverse = r1[start:end]
normals_combined = r2[start:end]
else:
normals_forward = r1[end - 1:start - 1: -1]
normals_reverse = r0[end - 1:start - 1: -1]
normals_combined = r2[end - 1:start - 1: -1]
#return the result as a tuple
row = (normals_forward,
normals_reverse,
normals_combined)
return row
Second function
This one gets a list of tuples of numpy arrays, and we want to add up the arrays element wise, then normalize them and get the integration of the intersection.
def calculate_signal(list signal):
cdef int halfwinwidth = 100
cdef np.ndarray profile_normals_forward = np.zeros(halfwinwidth * 2, dtype='f')
cdef np.ndarray profile_normals_reverse = np.zeros(halfwinwidth * 2, dtype='f')
cdef np.ndarray profile_normals_combined = np.zeros(halfwinwidth * 2, dtype='f')
#b is a tuple of 3 np.ndarrays containing 200 floats
#here we add them up elementwise
for b in signal:
profile_normals_forward += b[0]
profile_normals_reverse += b[1]
profile_normals_combined += b[2]
#normalize the arrays
cdef int count = len(signal)
#print "Normalizing to number of elements"
profile_normals_forward /= count
profile_normals_reverse /= count
profile_normals_combined /= count
intersection_signal = scisignal.detrend(np.fmin(profile_normals_forward, profile_normals_reverse))
intersection_signal[intersection_signal < 0] = 0
intersection = np.sum(intersection_signal)
results = {"intersection": intersection,
"profile_normals_forward": profile_normals_forward,
"profile_normals_reverse": profile_normals_reverse,
"profile_normals_combined": profile_normals_combined,
}
return results
Any help is appreciated - I tried using memory views but for some reason the code got much, much slower.

After fixing the array cdef (as has been indicated, with the dtype specified), you should probably put the routine in a cdef function (which will only be callable by a def function in the same script).
In the declaration of the function, you'll need to provide the type (and the dimensions if it's an array numpy):
cdef get_signal(numpy.ndarray[DTYPE_t, ndim=3] data):
I'm not sure using a dict is a good idea though. You could make use of numpy's column or row slices like data[:, 0].

Apply 1D function on one axis of nd-array

What I want:
I want to apply a 1D function to an arbitrarily shaped ndarray, such that it modifies a certain axis. Similar to the axis argument in numpy.fft.fft.
Take the following example:
import numpy as np
def transf1d(f, x, y, out):
"""Transform `f(x)` to `g(y)`.
This function is actually a C-function that is far more complicated
and should not be modified. It only takes 1D arrays as parameters.
"""
out[...] = (f[None,:]*np.exp(-1j*x[None,:]*y[:,None])).sum(-1)
def transf_all(F, x, y, axis=-1, out=None):
"""General N-D transform.
Perform `transf1d` along the given `axis`.
Given the following:
F.shape == (2, 3, 100, 4, 5)
x.shape == (100,)
y.shape == (50,)
axis == 2
Then the output shape would be:
out.shape == (2, 3, 50, 4, 5)
This function should wrap `transf1d` such that it works on arbitrarily
shaped (compatible) arrays `F`, and `out`.
"""
if out is None:
shape = list(np.shape(F))
shape[axis] = np.size(y)
for f, o in magic_iterator(F, out):
# Given above shapes:
# f.shape == (100,)
# o.shape == (50,)
transf1d(f, x, y, o)
return out
The function transf1d takes a 1D ndarray f, and two more 1D arrays x, and y. It performs a fourier transform of f(x) from the x-axis to the y-axis. The result is stored in the out argument.
Now I want to wrap this in a more general function transf_all, that can take ndarrays of arbitrary shape along with an axis argument, that specifies along which axis to transform.
Notes:
My code is actually written in Cython. Ideally, the magic_iterator would be fast in Cython.
The function transf1d actually is a C-function that returns its output in the out argument. Hence, I couldn't get it to work with numpy.apply_along_axis.
Because transf1d is actually a pretty complicated C-function I cannot rewrite it to work on arbitrary arrays. I need to wrap it in a Cython function that deals with the additional dimensions.
Note, that the arrays x, and y can differ in their lengths.
My question:
How can I do this? How can I iterate over arbitrary dimensions of an ndarray such that at each iteration I will get a 1D array containing the specified axis?
I had a look at nditer, but I'm not sure if that is actually the right tool for this job.
Cheers!

import numpy as np
def transf1d(f, x, y, out):
"""Transform `f(x)` to `g(y)`.
This function is actually a C-function that is far more complicated
and should not be modified. It only takes 1D arrays as parameters.
"""
out[...] = (f[None,:]*np.exp(-1j*x[None,:]*y[:,None])).sum(-1)
def transf_all(F, x, y, axis=-1, out=None):
"""General N-D transform.
Perform `transf1d` along the given `axis`.
Given the following:
F.shape == (2, 3, 100, 4, 5)
x.shape == (100,)
y.shape == (50,)
axis == 2
Then the output shape would be:
out.shape == (2, 3, 50, 4, 5)
This function should wrap `transf1d` such that it works on arbitrarily
shaped (compatible) arrays `F`, and `out`.
"""
def wrapper(f):
"""
wrap transf1d for apply_along_axis compatibility
that is, having a signature of F.shape[axis] -> out.shape[axis]
"""
out = np.empty_like(y)
transf1d(f, x, y, out)
return out
return np.apply_along_axis(wrapper, axis, F)
I believe this should do what you want, although I havnt tested it. Note that the looping happening inside apply_along_axis has python-level performance though, so this only vectorizes the operation in terms of style, not in terms of performance. However, that is quite probably of no concern, assuming the decision to resort to external C code for the inner loop is justified by it being a nontrivial operation in the first place.

To answer your question:
If you really just want to iterate over all but a given axis, you can use:
for s in itertools.product(map(range, arr.shape[:axis]+arr.shape[axis+1:]):
arr[s[:axis] + (slice(None),) + s[axis:]]
Maybe there's a more elegant way to do it, but this should work.
But, don't iterate:
For your problem, I would just rewrite your function to work on a given axis of an ndarray. I think this should work:
def transfnd(f, x, y, axis, out):
s = list(f.shape)
s.insert(axis, 1)
yx = [y.size, x.size] + [1]*(f.ndim - axis - 1)
out[...] = np.sum(f.reshape(*s)*np.exp(-1j*x[None,:]*y[:,None]).reshape(*yx), axis+1)
It's really just the generalization of your current implementation, but instead of inserting a new axis in F at the beginning, it inserts it at axis (there might be a better way to do this than with the list(shape) method, but that was all I could do. Finally, you have to add trailing new axes to your yx outer product, to match as many trailing indices you have in F.
I didn't really know how to test this, but the shapes all work out, so please test it and let me know whether it works.

I found a way of iterating over all but one axis in Cython using the Numpy C-API (Code down below). However, it's not pretty. Whether it's worth the effort depends on the inner function and the size of data.
If any one knows a more elegant way to do this in Cython, please let me know.
I compared to Eelco's solution and they run at a comparable speed for large arguments. For smaller arguments the C-API solution is faster:
In [5]: y=linspace(-1,1,100);
In [6]: %timeit transf.apply_along(f, x, y, axis=1)
1 loops, best of 3: 5.28 s per loop
In [7]: %timeit transf.transfnd(f, x, y, axis=1)
1 loops, best of 3: 5.16 s per loop
As you can see, for this input both functions are roughly at the same speed.
In [8]: f=np.random.rand(10,20,50);x=linspace(0,1,20);y=linspace(-1,1,10);
In [9]: %timeit transf.apply_along(f, x, y, axis=1)
100 loops, best of 3: 15.1 ms per loop
In [10]: %timeit transf.transfnd(f, x, y, axis=1)
100 loops, best of 3: 8.55 ms per loop
However, for less large input arrays the C-API approach is faster.
The code
#cython: boundscheck=False
#cython: wraparound=False
#cython: cdivision=True
import numpy as np
cimport numpy as np
np.import_array()
cdef extern from "complex.h":
double complex cexp(double complex z) nogil
cdef void transf1d(double complex[:] f,
double[:] x,
double[:] y,
double complex[:] out,
int Nx,
int Ny) nogil:
cdef int i, j
for i in xrange(Ny):
out[i] = 0
for j in xrange(Nx):
out[i] = out[i] + f[j]*cexp(-1j*x[j]*y[i])
def transfnd(F, x, y, axis=-1, out=None):
# Make sure everything is a numpy array.
F = np.asanyarray(F, dtype=complex)
x = np.asanyarray(x, dtype=float)
y = np.asanyarray(y, dtype=float)
# Calculate absolute axis.
cdef int ax = axis
if ax < 0:
ax = np.ndim(F) + ax
# Calculate lengths of the axes `x`, and `y`.
cdef int Nx = np.size(x), Ny = np.size(y)
# Output array.
if out is None:
shape = list(np.shape(F))
shape[axis] = Ny
out = np.empty(shape, dtype=complex)
else:
out = np.asanyarray(out, dtype=complex)
# Error check.
assert np.shape(F)[axis] == Nx, \
'Array length mismatch between `F`, and `x`!'
assert np.shape(out)[axis] == Ny, \
'Array length mismatch between `out`, and `y`!'
f_shape = list(np.shape(F))
o_shape = list(np.shape(out))
f_shape[axis] = 0
o_shape[axis] = 0
assert f_shape == o_shape, 'Array shape mismatch between `F`, and `out`!'
# Construct iterator over all but one axis.
cdef np.flatiter itf = np.PyArray_IterAllButAxis(F, &ax)
cdef np.flatiter ito = np.PyArray_IterAllButAxis(out, &ax)
cdef int f_stride = F.strides[axis]
cdef int o_stride = out.strides[axis]
# Memoryview to access one slice per iteration.
cdef double complex[:] fdat
cdef double complex[:] odat
cdef double[:] xdat = x
cdef double[:] ydat = y
while np.PyArray_ITER_NOTDONE(itf):
# View the current `x`, and `y` axes.
fdat = <double complex[:Nx]> np.PyArray_ITER_DATA(itf)
fdat.strides[0] = f_stride
odat = <double complex[:Ny]> np.PyArray_ITER_DATA(ito)
odat.strides[0] = o_stride
# Perform the 1D-transformation on one slice.
transf1d(fdat, xdat, ydat, odat, Nx, Ny)
# Go to next step.
np.PyArray_ITER_NEXT(itf)
np.PyArray_ITER_NEXT(ito)
return out
# For comparison
def apply_along(F, x, y, axis=-1):
# Make sure everything is a numpy array.
F = np.asanyarray(F, dtype=complex)
x = np.asanyarray(x, dtype=float)
y = np.asanyarray(y, dtype=float)
# Calculate absolute axis.
cdef int ax = axis
if ax < 0:
ax = np.ndim(F) + ax
# Calculate lengths of the axes `x`, and `y`.
cdef int Nx = np.size(x), Ny = np.size(y)
# Error check.
assert np.shape(F)[axis] == Nx, \
'Array length mismatch between `F`, and `x`!'
def wrapper(f):
out = np.empty(Ny, complex)
transf1d(f, x, y, out, Nx, Ny)
return out
return np.apply_along_axis(wrapper, axis, F)
Build with the following setup.py
from distutils.core import setup
from Cython.Build import cythonize
import numpy as np
setup(
name = 'transf',
ext_modules = cythonize('transf.pyx'),
include_dirs = [np.get_include()],
)

Defining NumPy arrays in Cython without incurring python overhead

I have been trying to learn Cython to speed up some of my calculations. Here is a subset of what I am trying to do: this is simply integrating a differential equation using a recursive formula while making use of NumPy arrays. I have already achieved a factor of ~100x speed increase over the pure python version. However it seems like I can gain added speed based on looking at the HTML file generated for my code by the -a cython command. My code is as follows (lines that become yellow in the HTML file that I would like to make white are labeled):
%%cython
import numpy as np
cimport numpy as np
cimport cython
from libc.math cimport exp,sqrt
#cython.boundscheck(False)
cdef double riccati_int(double j, double w, double h, double an, double d):
cdef:
double W
double an1
W = sqrt(w**2 + d**2)
#dark_yellow
an1 = ((d - (W + w) * an) * exp(-2 * W * h / j ) - d - (W - w) * an) /
((d * an - W + w) * exp(-2 * W * h / j) - d * an - W - w)
return an1
def acalc(double j, double w):
cdef:
int xpos, i, n
np.ndarray[np.int_t, ndim=1] xvals
np.ndarray[np.double_t, ndim=1] h, a
xpos = 74
xvals = np.array([0, 8, 23, 123, 218], dtype=np.int) #dark_yellow
h = np.array([1, .1, .01, .1], dtype=np.double) #dark_yellow
a = np.empty(219, dtype=np.double) #dark_yellow
a[0] = 1 / (w + sqrt(w**2 + 1)) #light_yellow
for i in range(h.size): #dark_yellow
for n in range(xvals[i], xvals[i + 1]): #light_yellow
if n < xpos:
a[n+1] = riccati_int(j, w, h[i], a[n], 1.) #light_yellow
else:
a[n+1] = riccati_int(j, w, h[i], a[n], 0.) #light_yellow
return a
It seems to me like all 9 lines that I labeled above should be able to be made white with the proper adjustments. One issue is the ability to define NumPy arrays the proper way. But probably even more important is the ability to get the first labeled line to work efficiently, since this is where the bulk of the calculation is done. I tried reading the generated C code that the HTML file displays after clicking on a yellow line, but I honestly have no clue how to read that code. If anybody could please help me out, it would be greatly appreciated.

I think you don't need to care about yellow lines that is not in loop. Add following compiler directives will make the three lines in loop faster:
#cython.cdivision(True)
cdef double riccati_int(double j, double w, double h, double an, double d):
pass
#cython.boundscheck(False)
#cython.wraparound(False)
def acalc(double j, double w):
pass

I'm not sure, whether it makes a difference, but you could do use memory-views for the arrays, e. g.
cdef double [:] h = np.array([1, .1, .01, .1], dtype=np.double) #dark_yellow
cdef double [:] a = np.empty(219, dtype=np.double) #dark_yellow
Also creating an numpy array for four static values is a bit overdone. This can be replaced by a static C array
cdef double *h = [1, .1, .01, .1]
However, as mentioned, what in the loop is, that matters most. Since line profiler won't work for cython (afaik) use time module to benchmark within the function, besides using cProfile. It might give you an idea, that the intensity of the line color in the cython log has to be assessed in context.
It is recommended to use the python types for indexing, as I learned
size_t i, n
Py_ssize_t i, n
The second one is the signed version

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

iterating through specified axis in cython - python

Related

Explain pitch, width, height, depth in memory for 3D arrays

High performance variable blurring in very big images using Python

Cythonize two small numpy functions, help needed

Apply 1D function on one axis of nd-array

Defining NumPy arrays in Cython without incurring python overhead

Categories

Resources