PyOpenCL indexing 3D arrays inside kernel code

I am using PyOpenCL to process images in Python and to send a 3D numpy array (height x width x 4) to the kernel. I am having trouble indexing the 3D array inside the kernel code. For now I am only able to copy the whole input array to the output. The current code looks like this, where img is the image with img.shape = (320, 512, 4):
__kernel void part1(__global float* img, __global float* results)
{
    unsigned int x = get_global_id(0);
    unsigned int y = get_global_id(1);
    unsigned int z = get_global_id(2);
    int index = x + 320*y + 320*512*z;
    results[index] = img[index];
}
However, I do not quite understand how this works. For example, how do I index the Python equivalent of img[1, 2, 3] inside this kernel? And further, which index into results should be used to store an item so that it ends up at position results[1, 2, 3] in the numpy array when I get the results back in Python?
To run this I am using this Python code:
import pyopencl as cl
import numpy as np
class OpenCL:
    def __init__(self):
        self.ctx = cl.create_some_context()
        self.queue = cl.CommandQueue(self.ctx)

    def loadProgram(self, filename):
        with open(filename, 'r') as f:
            fstr = f.read()
        self.program = cl.Program(self.ctx, fstr).build()

    def opencl_energy(self, img):
        mf = cl.mem_flags
        self.img = img.astype(np.float32)
        self.img_buf = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.img)
        self.dest_buf = cl.Buffer(self.ctx, mf.WRITE_ONLY, self.img.nbytes)
        self.program.part1(self.queue, self.img.shape, None, self.img_buf, self.dest_buf)
        c = np.empty_like(self.img)
        # enqueue_copy replaces the older enqueue_read_buffer call
        cl.enqueue_copy(self.queue, c, self.dest_buf).wait()
        return c

example = OpenCL()
example.loadProgram("get_energy.cl")
image = np.random.rand(320, 512, 4)
image = image.astype(np.float32)
results = example.opencl_energy(image)
print("All items are equal:", (results==image).all())

Update:
The OpenCL docs state (in 3.5) that
"Memory objects are categorized into two types: buffer objects, and image objects. A buffer
object stores a one-dimensional collection of elements whereas an image object is used to store a
two- or three- dimensional texture, frame-buffer or image."
So a buffer is always linear (or linearized), as you can see from my sample below.
import pyopencl as cl
import numpy as np
h_a = np.arange(27).reshape((3,3,3)).astype(np.float32)
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags
d_a = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=h_a)
prg = cl.Program(ctx, """
__kernel void p(__global const float *d_a) {
    printf("Array element is %f ", d_a[10]);
}
""").build()
prg.p(queue, (1,), None, d_a)
Gives me
"Array element is 10"
as output. So the buffer really is the linearized array, and the [x, y, z] indexing known from numpy is not available inside the kernel. Using a 2-D or 3-D image object instead of a buffer should work, though.
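
To connect this with the original question: for a C-contiguous NumPy array of shape (n0, n1, n2), the element at [i, j, k] sits at flat offset i*n1*n2 + j*n2 + k. Below is a minimal pure-Python sketch of that rule; the helper name flat_index is made up for illustration and is not part of PyOpenCL or NumPy.

```python
def flat_index(idx, shape):
    """Row-major (C-order) offset of a multi-index, counted in elements."""
    offset = 0
    for i, n in zip(idx, shape):
        if not 0 <= i < n:
            raise IndexError("index %r out of range for shape %r" % (idx, shape))
        offset = offset * n + i   # Horner form of i*n1*n2 + j*n2 + k
    return offset

# The 3x3x3 arange array above: element [1, 0, 1] holds the value 10,
# which is the element the kernel read as d_a[10].
print(flat_index((1, 0, 1), (3, 3, 3)))      # 10

# img[1, 2, 3] for img.shape == (320, 512, 4)
print(flat_index((1, 2, 3), (320, 512, 4)))  # 2059
```

Inside a kernel launched with the global size set to img.shape, the same expression would read x*512*4 + y*4 + z, because get_global_id(0) then runs over the first (slowest-varying) axis.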

Although this is not the optimal solution, I linearized the array in Python and sent it as 1D. In the kernel code I calculated x, y and z from the linear index. When it was returned to Python I reshaped it back to the original shape.
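
The step of recovering x, y and z from the linear index could look roughly like this. It is a pure-Python sketch that assumes the array was flattened in NumPy's default row-major order; the function name unravel is hypothetical.

```python
def unravel(flat, shape):
    """Inverse of row-major flattening: flat offset -> (x, y, z)."""
    n0, n1, n2 = shape
    if not 0 <= flat < n0 * n1 * n2:
        raise IndexError(flat)
    xy, z = divmod(flat, n2)   # z is the fastest-varying axis
    x, y = divmod(xy, n1)
    return x, y, z

print(unravel(2059, (320, 512, 4)))  # (1, 2, 3)
```

In OpenCL C the same arithmetic would use integer division and the % operator on the value of get_global_id(0).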

I encountered the same problem.
On https://lists.tiker.net/pipermail/pyopencl/2009-October/000134.html
there is a simple example of how to use 3D arrays with PyOpenCL that worked for me. I quote the code here for future reference:
import pyopencl as cl
import numpy
import numpy.linalg as la
sizeX=4
sizeY=2
sizeZ=5
a = numpy.random.rand(sizeX,sizeY,sizeZ).astype(numpy.float32)
ctx = cl.Context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)
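prg = cl.Program(ctx, """
__kernel void sum(__global const float *a, __global float *b)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    int z = get_global_id(2);
    // x varies fastest here (column-major); the row-major offset of
    // a[x, y, z] would be x * sizeY * sizeZ + y * sizeZ + z.
    int idx = z * %d * %d + y * %d + x;
    b[idx] = a[idx] * x + 3 * y + 5 * z;
}
""" % (sizeY, sizeX, sizeX) ).build()
# Current PyOpenCL expects the local size (here None) as the third argument.
prg.sum(queue, a.shape, None, a_buf, dest_buf)
# enqueue_copy replaces the older enqueue_read_buffer call.
cl.enqueue_copy(queue, a, dest_buf).wait()
print(a)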
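
Note that this kernel, like the one in the question, treats x as the fastest-varying axis, which is column-major (Fortran) order rather than NumPy's default row-major order. A copy-style test still passes because either formula visits every buffer element exactly once; only the meaning of (x, y, z) differs. The following pure-Python check illustrates this; the helper names are hypothetical.

```python
from itertools import product

shape = (4, 2, 5)            # (sizeX, sizeY, sizeZ) as in the example above
nx, ny, nz = shape

def idx_kernel(x, y, z):
    # Formula used in the quoted kernel: x varies fastest.
    return z * ny * nx + y * nx + x

def idx_numpy(x, y, z):
    # Row-major offset of a[x, y, z] for a C-contiguous array.
    return x * ny * nz + y * nz + z

coords = list(product(range(nx), range(ny), range(nz)))
total = nx * ny * nz

# Both formulas hit every offset 0..total-1 exactly once ...
assert sorted(idx_kernel(*c) for c in coords) == list(range(total))
assert sorted(idx_numpy(*c) for c in coords) == list(range(total))

# ... but they disagree on which (x, y, z) lands where.
mismatches = [c for c in coords if idx_kernel(*c) != idx_numpy(*c)]
print(len(mismatches) > 0)   # True
```

So if the result has to line up with a[x, y, z] on the NumPy side, the row-major formula is the one to use.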

Related

Cython, Complex values, and BM3D algorithm

I am working on an image reconstruction algorithm and I found this repo online that would work great with my code, but unfortunately it doesn't seem to support complex-valued calculations. I've read up on Cython over the past couple of days, but I'm pressed for time and wanted to ask for advice before bulldozing through the code.
To be more exact, this is the Cython file:
from libcpp.vector cimport vector
from libcpp cimport bool
cimport numpy as np
import numpy as np
cdef extern from "../bm3d_src/mt19937ar.h":
    double mt_genrand_res53()

cdef extern from "../bm3d_src/bm3d.h":
    int run_bm3d( const float sigma, vector[float] &img_noisy,
                  vector[float] &img_basic,
                  vector[float] &img_denoised,
                  const unsigned width,
                  const unsigned height,
                  const unsigned chnls,
                  const bool useSD_h,
                  const bool useSD_w,
                  const unsigned tau_2D_hard,
                  const unsigned tau_2D_wien,
                  const unsigned color_space)

cdef extern from "../bm3d_src/utilities.h":
    int save_image(char * name, vector[float] & img,
                   const unsigned width,
                   const unsigned height,
                   const unsigned chnls)

def hello():
    return "Hello World"

def random():
    return mt_genrand_res53()

cpdef float[:, :, :] bm3d(float[:, :, :] input_array,
                          float sigma,
                          bool useSD_h = True,
                          bool useSD_w = True,
                          str tau_2D_hard = "DCT",
                          str tau_2D_wien = "DCT"
                          ):
    """
    sigma: value of assumed noise of the noisy image;
    input_array : input image, H x W x channum
    useSD_h (resp. useSD_w): if true, use weight based
        on the standard variation of the 3D group for the
        first (resp. second) step, otherwise use the number
        of non-zero coefficients after Hard Thresholding
        (resp. the norm of Wiener coefficients);
    tau_2D_hard (resp. tau_2D_wien): 2D transform to apply
        on every 3D group for the first (resp. second) part.
        Allowed values are 'DCT' and 'BIOR';
    # FIXME : add color space support; right now just RGB
    """
    cdef vector[float] input_image
    cdef vector[float] basic_image
    cdef vector[float] output_image
    cdef vector[float] denoised_image
    height = input_array.shape[0]
    width = input_array.shape[1]
    chnls = input_array.shape[2]
    # convert the input image
    input_image.resize(input_array.size)
    pos = 0
    for i in range(input_array.shape[0]):
        for j in range(input_array.shape[1]):
            for k in range(input_array.shape[2]):
                input_image[pos] = input_array[i, j, k]
                pos +=1
    if tau_2D_hard == "DCT":
        tau_2D_hard_i = 4
    elif tau_2D_hard == "BIOR" :
        tau_2D_hard_i = 5
    else:
        raise ValueError("Unknown tau_2d_hard, must be DCT or BIOR")
    if tau_2D_wien == "DCT":
        tau_2D_wien_i = 4
    elif tau_2D_wien == "BIOR" :
        tau_2D_wien_i = 5
    else:
        raise ValueError("Unknown tau_2d_wien, must be DCT or BIOR")
    # FIXME someday we'll have color support
    color_space = 0
    ret = run_bm3d(sigma, input_image, basic_image, output_image,
                   width, height, chnls,
                   useSD_h, useSD_w,
                   tau_2D_hard_i, tau_2D_wien_i,
                   color_space)
    if ret != 0:
        raise Exception("run_bmd3d returned an error, retval=%d" % ret)
    cdef np.ndarray output_array = np.zeros([height, width, chnls],
                                            dtype = np.float32)
    pos = 0
    for i in range(input_array.shape[0]):
        for j in range(input_array.shape[1]):
            for k in range(input_array.shape[2]):
                output_array[i, j, k] = output_image[pos]
                pos +=1
    return output_array
How would I go about making the minimal changes needed for it to work with a numpy array with dtype='complex'?
Cheers!

Explain pitch, width, height, depth in memory for 3D arrays

I am working with CUDA and 3D textures in Python (using pycuda). There is a function called Memcpy3D which has the same members as Memcpy2D plus a few extras. It asks you to describe things such as width_in_bytes, src_pitch, src_height, height and copy_depth. This is what I am struggling with (in 3D), along with its relevance to C or F style indexing. For instance, if I simply change the ordering from F to C in the working example below, it stops working, and I don't know why.
First of all, I understand pitch to be how many bytes in memory it takes to move one index across in threadIdx.x (or the x direction, or a column). So for a float32 array of C shape (3,2,4), to move one value in x I expect to move 4 values in memory (as the indexing goes down the z axis first?). Therefore my pitch would be 4*32bits.
I understand height to be the number of rows. (In this example, 3)
I understand width to be the number of cols. (In this example, 2)
I understand depth to be the number of z slices. (In this example, 4)
I understand width_in_bytes to be the width of a row in x inclusive of the z elements behind it, i.e. a row slice, (0,:,:). This would be how many addresses in memory it takes to traverse one element in the y-direction.
So when I change the ordering from F to C in the code below, and adapt the code to change the height/width values accordingly it still doesn't work. It just presents a logic failure which makes me think I'm not understanding the concept of pitch, width, height, depth correctly.
Please educate me.
Below is a full working script that copies an array to the GPU as a texture and copies the contents back.
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np
w = 2
h = 3
d = 4
shape = (w, h, d)
a = np.arange(24).reshape(*shape,order='F').astype('float32')
print(a.shape,a.strides)
print(a)
descr = drv.ArrayDescriptor3D()
descr.width = w
descr.height = h
descr.depth = d
descr.format = drv.dtype_to_array_format(a.dtype)
descr.num_channels = 1
descr.flags = 0
ary = drv.Array(descr)
copy = drv.Memcpy3D()
copy.set_src_host(a)
copy.set_dst_array(ary)
copy.width_in_bytes = copy.src_pitch = a.strides[1]
copy.src_height = copy.height = h
copy.depth = d
copy()
mod = SourceModule("""
texture<float, 3, cudaReadModeElementType> mtx_tex;
__global__ void copy_texture(float *dest)
{
    int x = threadIdx.x;
    int y = threadIdx.y;
    int z = threadIdx.z;
    int dx = blockDim.x;
    int dy = blockDim.y;
    int i = (z*dy + y)*dx + x;
    dest[i] = tex3D(mtx_tex, x, y, z);
}
""")
copy_texture = mod.get_function("copy_texture")
mtx_tex = mod.get_texref("mtx_tex")
mtx_tex.set_array(ary)
dest = np.zeros(shape, dtype=np.float32, order="F")
copy_texture(drv.Out(dest), block=shape, texrefs=[mtx_tex])
print(dest)

Not sure I fully understand the problem in your code, but I'll attempt to clarify.
In CUDA, width (x) refers to the fastest-changing dimension, height (y) is the middle dimension, and depth (z) is the slowest-changing dimension. The pitch refers to the stride in bytes required to step between values along the y dimension.
In Numpy, an array defined as np.empty(shape=(3,2,4), dtype=np.float32, order="C") has strides=(32, 16, 4), and corresponds to width=4, height=2, depth=3, pitch=16.
Using "F" ordering in Numpy means the order of dimensions is reversed in memory.
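
The mapping from a contiguous array to these quantities can be sketched in pure Python as follows. The helper names strides and cuda_extent are made up for illustration, and the pitch computed here assumes a tightly packed host array with no row padding.

```python
def strides(shape, itemsize, order="C"):
    """Byte strides of a contiguous array, mirroring ndarray.strides."""
    dims = list(shape) if order == "F" else list(reversed(shape))
    out, step = [], itemsize
    for n in dims:            # walk from the fastest to the slowest axis
        out.append(step)
        step *= n
    return tuple(out) if order == "F" else tuple(reversed(out))

def cuda_extent(shape, itemsize, order="C"):
    """Map a contiguous 3D array to CUDA width/height/depth/pitch."""
    fast_to_slow = tuple(shape) if order == "F" else tuple(shape)[::-1]
    width, height, depth = fast_to_slow
    return {"width": width, "height": height, "depth": depth,
            "pitch": width * itemsize}   # bytes per row, no padding assumed

# C-ordered float32 array of shape (3, 2, 4), as in the answer above
print(strides((3, 2, 4), 4))            # (32, 16, 4)
print(cuda_extent((3, 2, 4), 4))        # width 4, height 2, depth 3, pitch 16

# F-ordered float32 array of shape (2, 3, 4), as in the question's script
print(strides((2, 3, 4), 4, "F"))       # (4, 8, 24)
print(cuda_extent((2, 3, 4), 4, "F"))   # width 2, height 3, depth 4, pitch 8
```

In both cases the pitch equals the byte stride of the middle (y) axis, which is why the question's script can use a.strides[1] for the F-ordered array.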
Your code appears to work if I make the following changes:
#shape = (w, h, d)
shape = (d, h, w)
#a = np.arange(24).reshape(*shape,order='F').astype('float32')
a = np.arange(24).reshape(*shape,order='C').astype('float32')
...
#dest = np.zeros(shape, dtype=np.float32, order="F")
dest = np.zeros(shape, dtype=np.float32, order="C")
#copy_texture(drv.Out(dest), block=shape, texrefs=[mtx_tex])
copy_texture(drv.Out(dest), block=(w,h,d), texrefs=[mtx_tex])

PyOpenCL Kronecker Product Kernel

I have the following code for confirming a hand written method for computing the kronecker product of two square matrices. The first portion indeed validates that my method of repeating and tiling a and b respectively yields the same output.
import pyopencl as cl
import numpy
from time import time
N = 3
num_iter = 1
a = numpy.random.rand(N,N)
b = numpy.random.rand(N,N)
c = numpy.kron(a, b)
abig = numpy.repeat(numpy.repeat(a,N,axis=1),N,axis=0)
bbig = numpy.tile(b,(N,N))
cbig = abig*bbig
print(numpy.allclose(c,cbig))
I then attempt to port this multiplication over to the GPU using PyOpenCL. I first allocate abig and bbig as d_a and d_b respectively in GPU memory. I also allocate h_C as an empty array on the host and d_c with the same size on the device.
context = cl.create_some_context()
queue = cl.CommandQueue(context)
h_C = numpy.empty(cbig.shape)
d_a = cl.Buffer(context, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf=abig)
d_b = cl.Buffer(context, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf=bbig)
d_c = cl.Buffer(context, cl.mem_flags.WRITE_ONLY, h_C.nbytes)
kernelsource = open("../GPUTest.cl").read()
program = cl.Program(context, kernelsource).build()
kronecker = program.kronecker
kronecker.set_scalar_arg_dtypes([numpy.int32, None, None, None])
for i in range(num_iter):
    kronecker(queue, (N**2, N**2), None, N**2, d_a, d_b, d_c)
    queue.finish()
cl.enqueue_copy(queue, h_C, d_c)
print(h_C)
Here is the contents of GPUTest.cl:
__kernel void kronecker(const int N, __global float* A, __global float* B, __global float* C)
{
    int i = get_global_id(0);
    int j = get_global_id(1);
    C[i,j] = A[i,j]*B[i,j];
}
However, my output is nowhere close. I believe my mistakes lie in how I'm handling the thread ids. From reading another example on matrix dot products, I was under the impression that the ids were essentially the location of the element within the block, and since this is elementwise, I would only need to pull the element at the same location from A and B to multiply them together. Do these ids need to be combined into a single index to better address the way that the memory is actually allocated?
And only slightly related, but is there a way to utilize a tiling or memory sharing method? This was only a naive attempt at the simplest way to do the calculation; I'm hoping to get to an algorithm that does not need the repeated/tiled versions of a and b. Something along the lines of taking a single element of a, multiplying the entirety of b by it, and then storing the result in a tile of c.

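The issue is that the kernel cannot index the input and output buffers with two indices. Each buffer is a flat block of memory, and in OpenCL C the expression C[i,j] uses the comma operator, so it is simply C[j]. The ids have to be combined into a single linear index, e.g. C[i+j*N] = A[i+j*N]*B[i+j*N], in order to move through the whole block of memory appropriately. Also note that numpy.random.rand returns float64 arrays while the kernel reads float (32-bit) values, so abig, bbig and h_C need to be converted to numpy.float32 (or the kernel changed to use double) before the results can match.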
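
As a rough illustration of the linear index, here is a pure-Python emulation of the corrected kernel, with one loop iteration standing in for each work-item. The function name elementwise_kernel and the sample data are made up for this sketch.

```python
def elementwise_kernel(A, B, n):
    """Emulate one work-item per (i, j) over flat n x n buffers."""
    C = [0.0] * (n * n)
    for i in range(n):          # get_global_id(0)
        for j in range(n):      # get_global_id(1)
            k = i + j * n       # single linear index into every buffer
            C[k] = A[k] * B[k]
    return C

A = [1.0, 2.0, 3.0, 4.0]
B = [10.0, 20.0, 30.0, 40.0]
print(elementwise_kernel(A, B, 2))  # [10.0, 40.0, 90.0, 160.0]
```

For a purely elementwise product it does not matter whether the index is i + j*n or i*n + j, as long as the same expression is used for all three buffers and every element is covered once.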

I developed a kernel for the Kronecker product as well. I will put it here just for reference. For
A # B = C,
where # is the Kronecker product, A is an m-by-n matrix, B is a p-by-q matrix and C is an x-by-y matrix with x = mp and y = nq, the following kernel will calculate C:
__kernel void kroneckerProdFast(__global float* res,
                                __global float* a,
                                __global float* b,
                                int p,
                                int q){
    int xi = get_global_id(0);
    int x = get_global_size(0);
    int yi = get_global_id(1);
    int y = get_global_size(1);
    int n = y / q;
    int mi = xi / p;
    int ni = yi / q;
    int pi = xi % p;
    int qi = yi % q;
    res[xi * y + yi] = a[mi * n + ni] * b[pi * q + qi];
}
The call from PyOpenCL would be:
program.kroneckerProdFast(queue,res.shape, None, resf_buf, a_buf, b_buf,np.int32(b.shape[0]),np.int32(b.shape[1]))
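
To check the index arithmetic without a GPU, the kernel can be emulated in pure Python by looping over the global range (m*p, n*q) that the call above passes as res.shape. This is only a sketch: kron_kernel is a hypothetical name, and the matrices are flat row-major lists.

```python
def kron_kernel(a, b, m, n, p, q):
    """Emulate kroneckerProdFast over a global size of (m*p, n*q).

    a is m x n and b is p x q, both stored as flat row-major lists.
    """
    x, y = m * p, n * q              # get_global_size(0), get_global_size(1)
    res = [0] * (x * y)
    for xi in range(x):              # get_global_id(0)
        for yi in range(y):          # get_global_id(1)
            mi, pi = divmod(xi, p)   # row in a, row in b
            ni, qi = divmod(yi, q)   # column in a, column in b
            res[xi * y + yi] = a[mi * n + ni] * b[pi * q + qi]
    return res

a = [1, 2,
     3, 4]                # 2 x 2
b = [0, 5,
     6, 7]                # 2 x 2
print(kron_kernel(a, b, 2, 2, 2, 2))
# [0, 5, 0, 10, 6, 7, 12, 14, 0, 15, 0, 20, 18, 21, 24, 28]
```

Each output element is computed independently from one element of a and one of b, so no repeated or tiled copies of the inputs are needed.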

Cythonize two small numpy functions, help needed

The problem
I'm trying to Cythonize two small functions that mostly deal with numpy ndarrays for some scientific purpose. These two small functions are called millions of times in a genetic algorithm and account for the majority of the time taken by the algorithm.
I made some progress on my own and both work nicely, but I get only a tiny speed improvement (10%). More importantly, cython --annotate shows that the majority of the code is still going through Python.
The code
First function:
The aim of this function is to get back slices of data and it is called millions of times in an inner nested loop. Depending on the bool in data[1][1], we either get the slice in the forward or reverse order.
#Ipython notebook magic for cython
%%cython --annotate
import numpy as np
from scipy import signal as scisignal
cimport cython
cimport numpy as np
def get_signal(data):
    #data[0] contains the data structure containing the numpy arrays
    #data[1][0] contains the position to slice
    #data[1][1] contains the orientation to slice, forward = 0, reverse = 1
    cdef int halfwinwidth = 100
    cdef int midpoint = data[1][0]
    cdef int strand = data[1][1]
    cdef int start = midpoint - halfwinwidth
    cdef int end = midpoint + halfwinwidth
    #the arrays we want to slice
    cdef np.ndarray r0 = data[0]['normals_forward']
    cdef np.ndarray r1 = data[0]['normals_reverse']
    cdef np.ndarray r2 = data[0]['normals_combined']
    if strand == 0:
        normals_forward = r0[start:end]
        normals_reverse = r1[start:end]
        normals_combined = r2[start:end]
    else:
        normals_forward = r1[end - 1:start - 1: -1]
        normals_reverse = r0[end - 1:start - 1: -1]
        normals_combined = r2[end - 1:start - 1: -1]
    #return the result as a tuple
    row = (normals_forward,
           normals_reverse,
           normals_combined)
    return row
Second function
This one gets a list of tuples of numpy arrays, and we want to add up the arrays element wise, then normalize them and get the integration of the intersection.
def calculate_signal(list signal):
    cdef int halfwinwidth = 100
    cdef np.ndarray profile_normals_forward = np.zeros(halfwinwidth * 2, dtype='f')
    cdef np.ndarray profile_normals_reverse = np.zeros(halfwinwidth * 2, dtype='f')
    cdef np.ndarray profile_normals_combined = np.zeros(halfwinwidth * 2, dtype='f')
    #b is a tuple of 3 np.ndarrays containing 200 floats
    #here we add them up elementwise
    for b in signal:
        profile_normals_forward += b[0]
        profile_normals_reverse += b[1]
        profile_normals_combined += b[2]
    #normalize the arrays
    cdef int count = len(signal)
    #print "Normalizing to number of elements"
    profile_normals_forward /= count
    profile_normals_reverse /= count
    profile_normals_combined /= count
    intersection_signal = scisignal.detrend(np.fmin(profile_normals_forward, profile_normals_reverse))
    intersection_signal[intersection_signal < 0] = 0
    intersection = np.sum(intersection_signal)
    results = {"intersection": intersection,
               "profile_normals_forward": profile_normals_forward,
               "profile_normals_reverse": profile_normals_reverse,
               "profile_normals_combined": profile_normals_combined,
               }
    return results
Any help is appreciated - I tried using memory views but for some reason the code got much, much slower.

After fixing the array cdef (as has been indicated, with the dtype specified), you should probably put the routine in a cdef function (which will only be callable by a def function in the same script).
In the declaration of the function, you'll need to provide the type (and the number of dimensions if it's a numpy array):
cdef get_signal(numpy.ndarray[DTYPE_t, ndim=3] data):
I'm not sure using a dict is a good idea though. You could make use of numpy's column or row slices like data[:, 0].

Return variable length array in Numpy C-extension?

I have made some Numpy C-extensions before with great help from this site, but as far as I can see the returned parameters are all fixed length.
Is there any way to have a Numpy C-extension return a variable length numpy array instead?

You may find it easier to write numpy extensions in Cython using the Numpy C-API, which simplifies the process as it allows you to mix Python and C objects. In that case there is little difficulty in making a variable-length array; you can simply create an array with an arbitrary shape.
The Cython numpy tutorial is probably the best source on this topic.
For example, here is a function I recently wrote:
import numpy as np
cimport numpy as np
cimport cython
dtype = np.double
ctypedef double dtype_t
np.import_ufunc()
np.import_array()
def ewma(a, d, axis):
    #Calculates the exponentially weighted moving average of array a along axis using the parameter d.
    cdef void *args[1]
    cdef double weight[1]
    weight[0] = <double>np.exp(-d)
    args[0] = &weight[0]
    return apply_along_axis(&ewma_func, np.array(a, dtype = float), np.double, np.double, False, &(args[0]), <int>axis)

cdef void ewma_func(int n, void* aData,int astride, void* oData, int ostride, void** args):
    #Exponentially weighted moving average calculation function
    cdef double avg = 0.0
    cdef double weight = (<double*>(args[0]))[0]
    cdef int i = 0
    for i in range(n):
        avg = (<double*>((<char*>aData) + i * astride))[0]*weight + avg * (1.0 - weight)
        (<double*>((<char*>oData) + i * ostride))[0] = avg

ctypedef void (*func_1d)(int, void*, int, void*, int, void **)

cdef apply_along_axis(func_1d function, a, adtype, odtype, reduce, void** args, int axis):
    #generic function for applying a cython function along a particular dimension
    oshape = list(a.shape)
    if reduce :
        oshape[axis] = 1
    out = np.empty(oshape, odtype)
    cdef np.flatiter ita, ito
    ita = np.PyArray_IterAllButAxis(a, &axis)
    ito = np.PyArray_IterAllButAxis(out, &axis)
    cdef int axis_length = a.shape[axis]
    cdef int a_axis_stride = a.strides[axis]
    cdef int o_axis_stride = out.strides[axis]
    if reduce:
        o_axis_stride = 0
    while np.PyArray_ITER_NOTDONE(ita):
        function(axis_length, np.PyArray_ITER_DATA (ita), a_axis_stride, np.PyArray_ITER_DATA (ito), o_axis_stride, args)
        np.PyArray_ITER_NEXT(ita)
        np.PyArray_ITER_NEXT(ito)
    if reduce:
        oshape.pop(axis)
        out.shape = oshape
    return out
If this doesn't suit you, there is a function for making a new empty array with arbitrary shape (link).
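
For checking the Cython version on small inputs, the recurrence inside ewma_func can be written in plain Python. This is only a reference sketch for a 1D sequence; ewma_1d is a hypothetical name and it ignores the axis handling done by apply_along_axis.

```python
import math

def ewma_1d(values, d):
    """Pure-Python reference for the recurrence used in ewma_func."""
    weight = math.exp(-d)
    avg, out = 0.0, []
    for v in values:
        avg = v * weight + avg * (1.0 - weight)
        out.append(avg)
    return out

# With d = 0 the weight is 1, so each output equals the current input.
print(ewma_1d([1.0, 1.0, 1.0], 0.0))  # [1.0, 1.0, 1.0]
```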

I am interpreting your question to mean "I have a function that takes a NumPy array of length n, but it will return another array of length m different from n." If that is the case, you will need to malloc a new C array in the extension, e.g.
new_array = malloc(m * sizeof(int64)); // or whatever your data type is
then create a new NumPy array with that. This example assumes a 1D array:
npy_intp dims[1];
dims[0] = m;
PyArrayObject *out = (PyArrayObject *)PyArray_SimpleNewFromData(1,          // 1D array
                                                                dims,       // dimensions
                                                                NPY_INT64,  // type
                                                                new_array);
PyArray_ENABLEFLAGS(out, NPY_ARRAY_OWNDATA);
Then return the new array. The important part here is to set the NPY_ARRAY_OWNDATA flag so that the memory you allocated is freed when the Python object is garbage collected.
