I am trying to speed up my Python program by implementing a function in C++ and embedding it in my code using CFFI. The function takes two 3x3 arrays and computes a distance.
The Python code is the following:
import cffi
import numpy as np
ffi = cffi.FFI()
ffi.cdef("""
extern double dist(const double s[3][3], const double t[3][3]);
""")
lib = ffi.dlopen("./dist.so")
S = np.array([[-1.63538, 0.379116, -1.16372],[-1.63538, 0.378137, -1.16366 ],[-1.63193, 0.379116, -1.16366]], dtype=np.float32)
T = np.array([[-1.6467834, 0.3749715, -1.1484985],[-1.6623441, 0.37410975, -1.1647063 ],[-1.6602284, 0.37400728, -1.1496595 ]], dtype=np.float32)
Sp = ffi.cast("double(*) [3]", S.ctypes.data)
Tp = ffi.cast("double(*) [3]", T.ctypes.data)
dd = lib.dist(Sp,Tp);
This solution doesn't work as intended; the argument values printed by the C function are:
Sp=[[0.000002, -0.270760, -0.020458]
[0.000002, 0.000000, 0.000000]
[0.000000, 0.000000, 0.000000]]
Tp=[[0.000002, -0.324688, -0.020588]
[0.000002, 0.000000, 0.000000]
[0.000000, 0.000000, -nan]]
I have also tried the following to initialize the pointers:
Sp = ffi.new("double *[3]")
for i in range(3):
    Sp[i] = ffi.cast("double *", S[i].ctypes.data)
Tp = ffi.new("double *[3]")
for i in range(3):
    Tp[i] = ffi.cast("double *", T[i].ctypes.data)
dd = lib.dist(Sp, Tp)
But this solution raises an error at dist(Sp, Tp):
TypeError: initializer for ctype 'double(*)[3]' must be a pointer to same type, not cdata 'double *[3]'
Do you have any idea how to make it work? Thanks.
The types double[3][3] and double *[3] are not equivalent. The former is a 3x3 2D array of double, stored as 9 contiguous doubles. The latter is an array of 3 pointers to double, which is not how static 2D arrays are represented in C or C++.
Both numpy arrays and C/C++ static arrays are laid out in memory as one contiguous block of elements; it's the double *[3] type in the middle that's throwing a wrench in the works. What you want is either double[3][3] proper or double (*)[3] (a pointer to a row of three doubles). Note that if you use the latter you may need to change the function prototype to take double [][3]. There is a second problem: your arrays are created with dtype=np.float32, so their buffers do not actually hold C doubles; that is why your first attempt, whose cast is otherwise fine, prints garbage.
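Putting both fixes together, here is a minimal sketch of the first approach (untested here, and assuming dist.so really exports the prototype declared in the cdef):

    import cffi
    import numpy as np

    ffi = cffi.FFI()
    ffi.cdef("double dist(const double s[3][3], const double t[3][3]);")
    lib = ffi.dlopen("./dist.so")

    # float64, so the buffers actually contain C doubles
    S = np.array([[-1.63538, 0.379116, -1.16372],
                  [-1.63538, 0.378137, -1.16366],
                  [-1.63193, 0.379116, -1.16366]], dtype=np.float64)
    T = np.array([[-1.6467834, 0.3749715, -1.1484985],
                  [-1.6623441, 0.37410975, -1.1647063],
                  [-1.6602284, 0.37400728, -1.1496595]], dtype=np.float64)

    # cast the contiguous data to a pointer to rows of three doubles
    Sp = ffi.cast("double(*)[3]", S.ctypes.data)
    Tp = ffi.cast("double(*)[3]", T.ctypes.data)
    dd = lib.dist(Sp, Tp)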
Related
I want to pass a 2d numpy array to a function written in C using ctypes. When I try to access the data in the C function I get: Segmentation fault (core dumped). The code is:
code in C
#include <stdio.h>
void np_array_complex_shape(double ** data, int * shape){
    printf("%d\n", shape[0]);
    printf("%d\n", shape[1]); // These print ok.
    printf("%f", data[0][0]); // This one throws: Segmentation fault (core dumped)
}
code in python:
import os
import ctypes
import numpy as np
import numpy.ctypeslib as npct
test_library_file_location = "library.so"
array_2d_double = npct.ndpointer(dtype=np.double, ndim=2, flags='CONTIGUOUS')
array_1d_int = npct.ndpointer(dtype=np.int32, ndim=1, flags='CONTIGUOUS')
LIBC = ctypes.CDLL(test_library_file_location)
LIBC.np_array_complex_shape.restype = None
LIBC.np_array_complex_shape.argtypes = [array_2d_double, array_1d_int]
x = np.array([[1,2,3],[4,5,6],[7,8,9]], dtype=np.float64)
s = np.array(x.shape, dtype=np.int32)
c = LIBC.np_array_complex_shape(x, s)
I've tried different solutions I found online, but none work. Could someone help me?
I am using Linux and the gcc compiler.
While you can often get away with mixing up arrays with pointers to some extent, mixing up 2D arrays with pointers to pointers will never work right. If you know that the second dimension will always be 3, then you can change double ** data to double (*data)[3] in your C code, and then everything will work. If it's not always a constant, but Python will always know what it is and you're okay with using VLAs, then you can redeclare your function as void np_array_complex_shape(int cols, double (*data)[cols], int *shape) and pass the appropriate value for cols. Otherwise, you need to change it to double *data, and then manually calculate indices as if it were flattened into a 1D array (which is just changing data[0][0] to data[0] in your simple example).
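For illustration, a minimal, untested ctypes sketch of the fixed-second-dimension case, assuming the C function is changed to void np_array_complex_shape(double (*data)[3], int *shape); ctypes can express the pointer-to-row type as POINTER(c_double * 3):

    import ctypes
    import numpy as np

    lib = ctypes.CDLL("./library.so")
    row_t = ctypes.c_double * 3                 # one row of three doubles
    lib.np_array_complex_shape.restype = None
    lib.np_array_complex_shape.argtypes = [ctypes.POINTER(row_t),
                                           ctypes.POINTER(ctypes.c_int)]

    x = np.ascontiguousarray([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float64)
    s = np.array(x.shape, dtype=np.int32)

    lib.np_array_complex_shape(x.ctypes.data_as(ctypes.POINTER(row_t)),
                               s.ctypes.data_as(ctypes.POINTER(ctypes.c_int)))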
I tried to Cythonize part of my code as follows, hoping to gain some speed:
# cython: boundscheck=False
import numpy as np
cimport numpy as np
import time

cpdef object my_function(np.ndarray[np.double_t, ndim=1] array_a,
                         np.ndarray[np.double_t, ndim=1] array_b,
                         int n_rows,
                         int n_columns):
    cdef double minimum_of_neighbours, difference, change
    cdef int i
    cdef np.ndarray[np.int_t, ndim=1] locations
    locations = np.argwhere(array_a > 0)
    for i in locations:
        minimum_of_neighbours = min(array_a[i - n_columns], array_a[i+1], array_a[i + n_columns], array_a[i-1])
        if array_a[i] - minimum_of_neighbours < 0:
            difference = minimum_of_neighbours - array_a[i]
            change = min(difference, array_a[i] / 5.)
            array_a[i] += change
            array_b[i] -= change * 5.
    print time.time()
    return array_a, array_b
I can compile it without error, but when I use the function I get this error:
from cythonized_code import my_function
import numpy as np
array_a = np.random.uniform(low=-100, high=100, size = 100).astype(np.double)
array_b = np.random.uniform(low=0, high=20, size = 100).astype(np.double)
a, b = my_function(array_a,array_b,5,20)
# which gives me this error:
# locations = np.argwhere(array_a > 0)
# ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
Do I need to declare the type of locations here? The reason I want to declare it is that it shows up yellow in the annotated HTML file generated when compiling the code.
It's a good idea not to use the Python functionality locations[i], because it is too expensive: Python would create a full-fledged Python integer* from the lowly C integer (which is what is stored in the locations numpy array), register it with the garbage collector, cast it back to int, and then destroy the Python object - quite an overhead.
To get direct access to the lowly C integers one needs to bind locations to a type. The normal course of action would be to look up which properties locations has:
>>> locations.ndim
2
>>> locations.dtype
dtype('int64')
which translates to cdef np.ndarray[np.int64_t, ndim=2] locations.
However, this will (probably - I cannot check it right now) not be enough to get rid of the Python overhead, because of a Cython quirk:
for i in locations:
    ...
will not be interpreted as raw array access but will invoke the Python machinery. See for example here.
So you will have to change it to:
for index in range(len(locations)):
    i = locations[index][0]
then Cython "understands" that you want access to the raw C int64 array.
*Actually, this is not completely true: in this case an nd.array is created first (e.g. locations[0] or locations[1]), and then __Pyx_PyInt_As_int (which is more or less an alias for PyLong_AsLongAndOverflow) is called; this creates a PyLongObject, from which the C int value is obtained before the temporary PyLongObject and nd.array are destroyed.
Here we get lucky, because length-1 numpy arrays can be converted to Python scalars. The code would not work if the second dimension of locations were > 1.
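Putting the suggestions together, a hedged sketch of the function with both changes applied - typing locations as the 2D int64 array that argwhere actually returns, and indexing it explicitly - untested here:

    # cython: boundscheck=False
    import numpy as np
    cimport numpy as np

    cpdef object my_function(np.ndarray[np.double_t, ndim=1] array_a,
                             np.ndarray[np.double_t, ndim=1] array_b,
                             int n_rows,
                             int n_columns):
        cdef double minimum_of_neighbours, difference, change
        cdef int i, index
        cdef np.ndarray[np.int64_t, ndim=2] locations = np.argwhere(array_a > 0)
        for index in range(locations.shape[0]):
            i = locations[index, 0]   # typed 2D access, no temporary Python objects
            minimum_of_neighbours = min(array_a[i - n_columns], array_a[i + 1],
                                        array_a[i + n_columns], array_a[i - 1])
            if array_a[i] - minimum_of_neighbours < 0:
                difference = minimum_of_neighbours - array_a[i]
                change = min(difference, array_a[i] / 5.)
                array_a[i] += change
                array_b[i] -= change * 5.
        return array_a, array_b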
I want to pass a numpy array into some(one else's) C++ code via CFFI. Assume I cannot (in any sense) change the C++ code, whose interface is:
double CompactPD_LH(int Nbins, double * DataArray, void * ParamsArray ) {
...
}
I pass Nbins as a Python integer and ParamsArray as a dict -> structure, but DataArray (shape = 3 x Nbins), which needs to be populated from a numpy array, is giving me a headache. (cast_matrix from "Why is cffi so much quicker than numpy?" isn't helping here.)
Here's one attempt that fails:
from blah import ffi,lib
data=np.loadtxt(histof)
DataArray=cast_matrix(data,ffi) # see https://stackoverflow.com/questions/23056057/why-is-cffi-so-much-quicker-than-numpy/23058665#23058665
result=lib.CompactPD_LH(Nbins,DataArray,ParamsArray)
For reference, cast_matrix was:
def cast_matrix(matrix, ffi):
    ap = ffi.new("double* [%d]" % (matrix.shape[0]))
    ptr = ffi.cast("double *", matrix.ctypes.data)
    for i in range(matrix.shape[0]):
        ap[i] = ptr + i*matrix.shape[1]
    return ap
Also:
How to pass a Numpy array into a cffi function and how to get one back out?
https://gist.github.com/arjones6/5533938
Thanks @morningsun!
dd = np.ascontiguousarray(data.T)                  # the transpose is a view; force a contiguous C-ordered copy
DataArray = ffi.cast("double *", dd.ctypes.data)   # pass the flat buffer to C as a plain double*
result = lib.CompactPD_LH(Nbins, DataArray, ParamsArray)
works!
I'd like to overwrite part of a PyOpenCL array with another array.
Let's say
import numpy as np, pyopencl.array as cla
a = cla.zeros(queue,(3,3),'int8')
b = cla.ones(queue,(2,2),'int8')
Now I want to do something like a[0:2,0:2] = b and hopefully get
1 1 0
1 1 0
0 0 0
How would I do that without copying everything to the host for speed reasons?
PyOpenCL arrays can do that - to a very limited extent at the time of this answer - with the numpy syntax (i.e. exactly as you wrote it); the limitation is that you can only use slices along the first axis.
import numpy as np, pyopencl.array as cla
a = cla.zeros(queue,(3,3),'int8')
b = cla.ones(queue,(2,3),'int8')
# note b is 2x3 here
a[0:2]=b #<-works
a[0:2,0:2]=b[:,0:2] #<-Throws an error about non-contiguity
So a[0:2,0:2] = b won't work, as the sliced destination array has non-contiguous data.
The only solution I am aware of (as nothing in the pyopencl.array class is yet able to work with sliced arrays/non-contiguous data), is to write your own openCL kernel to do the copy "by hand".
Here is a piece of code I use to copy 1D or 2D pyopencl arrays of any dtype:
import numpy as np, pyopencl as cl, pyopencl.array as cla
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
kernel = cl.Program(ctx, """__kernel void copy(
    __global char *dest, const int offsetd, const int stridexd, const int strideyd,
    __global const char *src, const int offsets, const int stridexs, const int strideys,
    const int word_size)
{
    int write_idx = offsetd + get_global_id(0) + get_global_id(1) * stridexd + get_global_id(2) * strideyd;
    int read_idx  = offsets + get_global_id(0) + get_global_id(1) * stridexs + get_global_id(2) * strideys;
    dest[write_idx] = src[read_idx];
}""").build()

def copy(dest, src):
    assert dest.dtype == src.dtype
    assert dest.shape == src.shape
    if len(dest.shape) == 1:
        # treat 1D arrays as Nx1 so the same 3D kernel launch works
        dest.shape = (dest.shape[0], 1)
        src.shape = (src.shape[0], 1)
        dest.strides = (dest.strides[0], 0)
        src.strides = (src.strides[0], 0)
    kernel.copy(queue,
                (src.dtype.itemsize, src.shape[0], src.shape[1]), None,
                dest.base_data, np.uint32(dest.offset),
                np.uint32(dest.strides[0]), np.uint32(dest.strides[1]),
                src.base_data, np.uint32(src.offset),
                np.uint32(src.strides[0]), np.uint32(src.strides[1]),
                np.uint32(src.dtype.itemsize))
a = cla.zeros(queue,(3,3),'int8')
b = cla.ones(queue,(2,2),'int8')
copy(a[0:2,0:2],b)
print(a)
On the pyopencl mailing list, Andreas Klöckner gave me a hint: there's an undocumented function in pyopencl.array called multi_put(). The syntax is like this:
cla.multi_put([arr],indices,out=[out])
'arr' is the source array, 'out' is the destination array and 'indices' is a 1D array of ints (also on the device) which contains the linear element indices counted row-major.
For example, in my first post, the indices for putting 'b' into 'a' would be (0, 1, 3, 4). You just need to put together your indices somehow and can use multi_put() instead of writing a kernel. len(indices) must of course be equal to b.size. There are also take() and multi_take() functions for reading elements from an array.
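For the 3x3 / 2x2 example from the question, a hedged sketch of this route (untested; the exact integer dtype multi_put expects for the index array may vary):

    import numpy as np, pyopencl as cl, pyopencl.array as cla

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    a = cla.zeros(queue, (3, 3), 'int8')
    b = cla.ones(queue, (2, 2), 'int8')

    # linear, row-major positions of the a[0:2, 0:2] block inside a
    indices = cla.to_device(queue, np.array([0, 1, 3, 4], dtype=np.intp))
    cla.multi_put([b], indices, out=[a])
    print(a)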
I'm looking for the fastest way to select the elements of a numpy array that satisfy several criteria. As an example, say I want to select all elements that lie between 0.2 and 0.8 from an array. I normally do something like this:
the_array = np.random.random(100000)
idx = (the_array > 0.2) * (the_array < 0.8)
selected_elements = the_array[idx]
However, this creates two additional arrays of the same size as the_array (one for the_array > 0.2 and one for the_array < 0.8). If the array is large, this can consume a lot of memory. Is there any way to get around this? All of the built-in numpy functions (such as logical_and) seem to do this same thing under the hood.
You could implement a custom C call for the select. The most basic way to do this is through a ctypes implementation.
select.c
int select(float lower, float upper, float* in, float* out, int n)
{
    int ii;
    int outcount = 0;
    float val;
    for (ii = 0; ii < n; ii++)
    {
        val = in[ii];
        if ((val > lower) && (val < upper))
        {
            out[outcount] = val;
            outcount++;
        }
    }
    return outcount;
}
which is compiled as:
gcc -shared -fPIC select.c -o lib.so
And on the python side:
select.py
import ctypes as C
from numpy.ctypeslib import as_ctypes
import numpy as np
# open the library in python
lib = C.CDLL("./lib.so")
# explicitly tell ctypes the argument and return types of the function
pfloat = C.POINTER(C.c_float)
lib.select.argtypes = [C.c_float,C.c_float,pfloat,pfloat,C.c_int]
lib.select.restype = C.c_int
size = 1000000
# create numpy arrays
np_input = np.random.random(size).astype(np.float32)
np_output = np.empty(size).astype(np.float32)
# expose the array contents to ctypes
ctypes_input = as_ctypes(np_input)
ctypes_output = as_ctypes(np_output)
# call the function and get the number of selected points
outcount = lib.select(0.2,0.8,ctypes_input,ctypes_output,size)
# select those points
selected = np_output[:outcount]
Don't expect wild speedups from such a vanilla implementation, but on the C side you have the option of adding OpenMP pragmas to get quick-and-dirty parallelism, which may give you significant boosts.
Also, as mentioned in the comments, numexpr may be a faster and neater way to do all of this in just a few lines.
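For reference, a minimal sketch of the numexpr route (assuming the numexpr package is installed); it evaluates the combined condition in cache-sized chunks rather than materializing the two intermediate boolean arrays:

    import numexpr as ne
    import numpy as np

    the_array = np.random.random(100000)
    # a single boolean mask comes out; the two comparisons are never stored
    # as separate full-size temporaries
    idx = ne.evaluate("(the_array > 0.2) & (the_array < 0.8)")
    selected_elements = the_array[idx]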