Optimize cython functions operating on python lists - python

I am currently migrating to Cython a set of functions that are implemented in C++ through scipy.weave (now deprecated).
These functions operate on time series that are 2D lists (e.g. [[17100, 19.2], [17101, 20.7], [17102, 20.3], ...]), both as input and as output. A sample function is subtract, which accepts two time series and computes a new time series as the difference of the two inputs, going date by date.
The structure and the interfaces have to be maintained for backward compatibility, but my profiling shows that the Cython port is about 30%-40% slower than the original scipy.weave implementation.
I have tried many ways to optimize (inner conversions to NumPy arrays and memoryviews, C pointers, ...), but the conversion time required lengthens the overall execution time. Even defining input and output as C++ vectors and relying on Cython's implicit conversions doesn't seem to be enough to match scipy.weave speed. I have also used the various compiler directives (boundscheck, wraparound, cdivision, ...).
The biggest slowdowns seem to be in functions that use nested loops, and I've seen that a small gain can be had by preallocating the list (cdef list target = [[-1, float('nan')]]*size).
I am aware that Cython can't do much for code that operates on Python structures, especially lists, but are there any other tricks or techniques with which a speedup can be obtained?
=== EDIT - ADD CODE EXAMPLE ===
A good example of this kind of function is the following.
The function takes a 2D list of dates/prices and a 2D list of dates/decimal factors and searches for matching dates between the two lists. For each match it computes the output by multiplying or dividing the price by the corresponding factor (which of the two is selected by a third input parameter).
My best-performing Cython code:
# cython: cdivision=True
# cython: boundscheck=False
# cython: wraparound=False
cpdef apply_conversion(list original_timeserie, list factor_timeserie, int divide_or_multiply=False):
    cdef:
        Py_ssize_t i, j = 0, size = len(original_timeserie), size2 = len(factor_timeserie)
        long original_date, factor_date
        double original_price, factor_price, conv_price
        list result = []
    for i in range(size):
        original_date = original_timeserie[i][0]
        for j in range(j, size2):
            factor_date = factor_timeserie[j][0]
            if original_date == factor_date:
                original_price = original_timeserie[i][1]
                factor_price = factor_timeserie[j][1]
                if divide_or_multiply:
                    if factor_price != 0:
                        conv_price = original_price / factor_price
                    else:
                        conv_price = float('inf')
                else:
                    conv_price = original_price * factor_price
                result.append([original_date, conv_price])
                break
    return result
The original scipy function:
int len = original_timeserie.length();
int len2 = factor_timeserie.length();
PyObject* py_serieconv = PyList_New(len);
PyObject* original_item = NULL;
PyObject* factor_item = NULL;
PyObject* date = NULL;
PyObject* value = NULL;
long original_date = 0;
long factor_date = 0;
double original_price = 0;
double factor_price = 0;
int j = 0;
for(int i=0;i<len;i++) {
    original_item = PyList_GetItem(original_timeserie, i);
    date = PyList_GetItem(original_item, 0);
    original_date = PyInt_AsLong(date);
    original_price = PyFloat_AsDouble( PyList_GetItem(original_item, 1) );
    factor_item = NULL;
    for(;j<len2;) {
        factor_item = PyList_GetItem(factor_timeserie, j++);
        factor_date = PyInt_AsLong(PyList_GetItem(factor_item, 0));
        if (factor_date == original_date) {
            factor_price = PyFloat_AsDouble(PyList_GetItem(factor_item, 1));
            value = PyFloat_FromDouble(original_price * (divide_or_multiply==0 ? factor_price : 1/factor_price));
            PyObject* py_new_item = PyList_New(2);
            Py_XINCREF(date);
            PyList_SetItem(py_new_item, 0, date);
            PyList_SetItem(py_new_item, 1, value);
            PyList_SetItem(py_serieconv, i, py_new_item);
            break;
        }
    }
}
return_val = py_serieconv;
Py_XDECREF(py_serieconv);
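For reference, the date-matching logic described above can be written as a pure-Python two-pointer merge. This is only a sketch: it assumes both series are sorted by ascending date, the parameter names are illustrative, and unlike the two versions above it skips dates that have no matching factor instead of running the inner loop to the end.

```python
import math


def apply_conversion(original, factors, divide=False):
    """Match dates of two date-sorted [[date, value], ...] lists in one pass."""
    result = []
    j = 0
    n_factors = len(factors)
    for date, price in original:
        # Advance past factor dates that are older than the current date.
        while j < n_factors and factors[j][0] < date:
            j += 1
        if j == n_factors:
            break  # no factor dates left to match
        factor_date, factor = factors[j]
        if factor_date != date:
            continue  # assumption of this sketch: unmatched dates are skipped
        if divide:
            converted = price / factor if factor != 0 else math.inf
        else:
            converted = price * factor
        result.append([date, converted])
    return result


prices = [[17100, 10.0], [17101, 20.0], [17102, 30.0]]
factors = [[17100, 2.0], [17102, 0.5]]
multiplied = apply_conversion(prices, factors)
divided = apply_conversion(prices, factors, divide=True)
```

Such a version can serve as a correctness baseline when comparing the Cython port against the weave output.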

Related

How to copy a 2D array (matrix) from python with a C function (and do some computer heavy computation) which return a 2D array (matrix) in python?

I want to copy a 2D numpy array (matrix) in a C function and get it back in Python (and later do some calculation on it in C, to take advantage of C's speed). Therefore I need the C function matrix_copy to return a 2D array (or, I guess, a pointer to it). I tried the following code, but I get the output below, where one can see that the second dimension of the array is lost.
matrix_in.shape:
(300, 200)
matrix_out.shape:
(300,)
How could I change the code (I guess by adding some pointer magic to matrix_copy.c) so that matrix_out is an exact copy of matrix_in?
Here is the main.py script:
from ctypes import c_void_p, c_double, c_int, cdll
from numpy.ctypeslib import ndpointer
import numpy as np
import pdb
n = 300
m = 200
matrix_in = np.random.randn(n, m)
lib = cdll.LoadLibrary("matrix_copy.so")
matrix_copy = lib.matrix_copy
matrix_copy.restype = ndpointer(dtype=c_double,
                                shape=(n,))
matrix_out = matrix_copy(c_void_p(matrix_in.ctypes.data),
                         c_int(n),
                         c_int(m))
print("matrix_in.shape:")
print(matrix_in.shape)
print("matrix_out.shape:")
print(matrix_out.shape)
Here is the matrix_copy.c script:
#include <stdlib.h>
#include <stdio.h>
double * matrix_copy(const double * matrix_in, int n, int m){
    double * matrix_out = (double *)malloc(sizeof(double) * (n*m));
    int index = 0;
    for(int i=0; i< n; i++){
        for(int j=0; j<m; j++){
            matrix_out[i+j] = matrix_in[i+j];
            //matrix_out[i][j] = matrix_in[i][j];
            // some heavy computations not yet implemented
        }
    }
    return matrix_out;
}
which I compile with the command
cc -fPIC -shared -o matrix_copy.so matrix_copy.c
And as a side note, why does the notation matrix_out[i][j] = matrix_in[i][j]; throw an error on compilation?
matrix_copy.c:10:26: error: subscripted value is not an array, pointer, or vector
matrix_out[i][j] = matrix_in[i][j];
~~~~~~~~~~~~~^~
matrix_copy.c:10:44: error: subscripted value is not an array, pointer, or vector
matrix_out[i][j] = matrix_in[i][j];
The second dimension is 'lost' because you explicitly omit it in the named shape argument of ndpointer. Change:
matrix_copy.restype = ndpointer(dtype=c_double, shape=(n,))
to
matrix_copy.restype = ndpointer(dtype=c_double, shape=(n,m), flags='C')
Where flags='C' additionally notes that the returned data is stored contiguously in row major order.
With regards to matrix_out[i][j] = matrix_in[i][j]; throwing an error, consider that matrix_in is of type const double *. matrix_in[i] would yield a value of type const double - how do you further index this value (i.e., with [j])?
If you want to emulate accessing a 2-dimensional array via a single pointer, you must calculate offsets manually. matrix_out[i+j] is not sufficient, as you must account for the span of each sub array:
matrix_out[i * m + j] = matrix_in[i * m + j];
Note that in C, size_t is the generally preferred type to use when dealing with memory sizes or array lengths.
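To make the offset arithmetic concrete, here is a small Python sketch (all names are illustrative) that flattens a nested list in row-major order and checks that i * m + j reaches the same element as [i][j], whereas i + j cannot tell different cells apart.

```python
def flat_index(i, j, m):
    """Row-major offset of element (i, j) in a matrix with m columns."""
    return i * m + j


n, m = 3, 4
matrix = [[10 * i + j for j in range(m)] for i in range(n)]
flat = [value for row in matrix for value in row]  # row-major flattening

# Every (i, j) pair maps to its own slot in the flat buffer.
matches = all(
    flat[flat_index(i, j, m)] == matrix[i][j]
    for i in range(n)
    for j in range(m)
)

# With i + j, cells (0, 1) and (1, 0) land on the same offset.
collides = (0 + 1) == (1 + 0)
```

The same mapping is what NumPy applies internally to a C-contiguous array, which is why the flat buffer returned from C can be viewed as an (n, m) array.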
matrix_copy.c, simplified:
#include <stdlib.h>
double *matrix_copy(const double *matrix_in, size_t n, size_t m)
{
    double *matrix_out = malloc(sizeof *matrix_out * n * m);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            matrix_out[i * m + j] = matrix_in[i * m + j];
    return matrix_out;
}
matrix.py, with more explicit typing:
from ctypes import c_void_p, c_double, c_size_t, cdll, POINTER
from numpy.ctypeslib import ndpointer
import numpy as np
c_double_p = POINTER(c_double)
n = 300
m = 200
matrix_in = np.random.randn(n, m).astype(c_double)
lib = cdll.LoadLibrary("matrix_copy.so")
matrix_copy = lib.matrix_copy
matrix_copy.argtypes = c_double_p, c_size_t, c_size_t
matrix_copy.restype = ndpointer(
    dtype=c_double,
    shape=(n,m),
    flags='C')
matrix_out = matrix_copy(
    matrix_in.ctypes.data_as(c_double_p),
    c_size_t(n),
    c_size_t(m))
print("matrix_in.shape:", matrix_in.shape)
print("matrix_out.shape:", matrix_out.shape)
print("in == out", matrix_in == matrix_out)
The incoming data is probably a single block of memory. You need to create the substructure yourself.
In my C++ code I have to do the following on data (block) coming in via swig:
void divide2DDoubleArray(double * &block, double ** &subblockdividers, int noofsubblocks, int subblocksize){
    /* The starting address of a block of doubles is used to generate
     * pointers to subblocks.
     *
     * block: memory containing the original block of data
     * subblockdividers: array of subblock addresses
     * noofsubblocks: specify the number of subblocks produced
     * subblocksize: specify the size of the subblocks produced
     *
     * Design by contract: application should make sure the memory
     * in block is allocated and initialized properly.
     */
    // Build 2D matrix for cols
    subblockdividers=new double *[noofsubblocks];
    subblockdividers[0]= block;
    for (int i=1; i<noofsubblocks; ++i) {
        subblockdividers[i] = &subblockdividers[i-1][subblocksize];
    }
}
Now the pointer returned in subblockdividers can be indexed as subblockdividers[i][j].
Don't forget to release subblockdividers with delete[] when you're done, since it was allocated with new[]. (Note: adjustments are needed to compile this as C code.)
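The same idea, building row handles on top of one contiguous block without copying it, can be sketched in Python with memoryview slices. The function name divide_block and its parameters are made up for this illustration.

```python
from array import array


def divide_block(block, n_subblocks, subblock_size):
    """Return one zero-copy view per sub-block of a flat buffer."""
    view = memoryview(block)
    return [
        view[i * subblock_size:(i + 1) * subblock_size]
        for i in range(n_subblocks)
    ]


block = array("d", [0.0, 1.0, 2.0, 3.0, 4.0, 5.0])  # one flat block of doubles
rows = divide_block(block, n_subblocks=2, subblock_size=3)

first_of_second_row = rows[1][0]  # same element as block[3]
rows[1][2] = 99.0                 # writes through to the underlying block
```

Because each row is a view and not a copy, a write through rows[i][j] changes the original block, just as a write through subblockdividers[i][j] does.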

Threaded code crashes calling FFI process

I've converted a function to use threads (as per this answer). It behaves as expected in tests (that is, it returns identical values to the non-threaded version). However, calling it from Python using ctypes causes the calling process to crash.
First, the working function:
#[no_mangle]
pub extern fn convert_vec(lon: Array, lat: Array) -> Array {
    // snip
    // orig is a Vec<(f32, f32)>
    // convert is a conversion function
    let result: Vec<(i32, i32)> = orig.iter()
        .map(|elem| convert(elem.0, elem.1))
        .collect();
    // convert back to vector of unsigned integer Tuples
    let nvec = result.iter()
        .map(|ints| Tuple { a: ints.0 as u32, b: ints.1 as u32 })
        .collect();
    Array::from_vec(nvec)
}
And now the threaded version, which passes tests (using cargo test) but crashes when called from Python:
#[no_mangle]
pub extern fn convert_vec_threaded(lon: Array, lat: Array) -> Array {
    // snip
    // orig is a Vec<(f32, f32)>
    // convert is a conversion function
    let mut guards: Vec<JoinHandle<Vec<(i32, i32)>>> = vec!();
    // split into slices
    for chunk in orig.chunks(orig.len() / NUMTHREADS as usize) {
        let chunk = chunk.to_owned();
        let g = thread::spawn(move || chunk
            .into_iter()
            .map(|elem| convert(elem.0, elem.1))
            .collect());
        guards.push(g);
    }
    let mut result: Vec<(i32, i32)> = Vec::with_capacity(orig.len());
    for g in guards {
        result.extend(g.join().unwrap().into_iter());
    }
    // convert back to vector of unsigned integer Tuples
    let nvec = result.iter()
        .map(|ints| Tuple { a: ints.0 as u32, b: ints.1 as u32 })
        .collect();
    Array::from_vec(nvec)
}
The complete testable example is available here
From the error message it looks like you used a chunk size of 0 for some inputs. [T]::chunks(size) will assert that size != 0.
If we want NUMTHREADS chunks, we could split it like this:
// Divide into NUMTHREADS chunks
let mut size = orig.len() / NUMTHREADS;
if orig.len() % NUMTHREADS > 0 { size += 1; }
// If we want to avoid the case where orig.len() == 0, we need another adjustment:
size = std::cmp::max(1, size);
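The effect of this rounding is easy to check outside Rust. The following Python sketch mirrors the three steps above; chunk_size and chunks are hypothetical helper names, and chunks only imitates what slicing into fixed-size pieces does.

```python
def chunk_size(length, num_threads):
    """Ceiling division, clamped so the chunk size is never zero."""
    size = length // num_threads
    if length % num_threads > 0:
        size += 1
    return max(1, size)


def chunks(items, size):
    """Split items into consecutive pieces of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


NUMTHREADS = 4

# Plain floor division yields 0 for inputs shorter than NUMTHREADS,
# which is the case that trips the assertion in [T]::chunks.
naive = 2 // NUMTHREADS

safe = chunk_size(2, NUMTHREADS)
pieces = chunks(list(range(10)), chunk_size(10, NUMTHREADS))
```

Rounding up also keeps the number of chunks at or below NUMTHREADS, whereas rounding down can produce one extra, shorter chunk.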

C++ lib in Python: custom sorting method

I want to write a custom sorting method in C++ and import it in Python. I am not an expert in C++; here is my implementation of "sort_counting":
#include <iostream>
#include <time.h>
using namespace std;
const int MAX = 30;
class cSort
{
public:
    void sort( int* arr, int len )
    {
        int mi, mx, z = 0; findMinMax( arr, len, mi, mx );
        int nlen = ( mx - mi ) + 1; int* temp = new int[nlen];
        memset( temp, 0, nlen * sizeof( int ) );
        for( int i = 0; i < len; i++ ) temp[arr[i] - mi]++;
        for( int i = mi; i <= mx; i++ )
        {
            while( temp[i - mi] )
            {
                arr[z++] = i;
                temp[i - mi]--;
            }
        }
        delete [] temp;
    }
private:
    void findMinMax( int* arr, int len, int& mi, int& mx )
    {
        mi = INT_MAX; mx = 0;
        for( int i = 0; i < len; i++ )
        {
            if( arr[i] > mx ) mx = arr[i];
            if( arr[i] < mi ) mi = arr[i];
        }
    }
};
int main( int* arr )
{
    cSort s;
    s.sort( arr, 100 );
    return *arr;
}
and then I use it in Python:
from ctypes import cdll
lib = cdll.LoadLibrary('sort_counting.so')
result = lib.main([3,4,7,5,10,1])
Compilation goes fine.
How can I rewrite the C++ function so that it receives an array and returns a sorted array?
The error is quite clear: ctypes doesn't know how to convert a Python list into an int * to be passed to your function. In fact, a Python integer is not a simple int and a list is not just an array.
There are limitations on what ctypes can do. Converting a generic python list to an array of ints is not something that can be done automatically.
This is explained here:
None, integers, bytes objects and (unicode) strings are the only
native Python objects that can directly be used as parameters in these
function calls. None is passed as a C NULL pointer, bytes objects and
strings are passed as pointer to the memory block that contains their
data (char * or wchar_t *). Python integers are passed as the
platforms default C int type, their value is masked to fit into the C
type.
If you want to pass an integer array, you should read about ctypes arrays. Instead of creating a list, you have to create an array of ints using the ctypes data types and pass that in.
Note that you must do the conversion on the Python side; it doesn't matter what C++ code you write. The alternative is to use the Python C API instead of ctypes and do the conversion in C code.
A simple example would be:
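from ctypes import *
lib = cdll.LoadLibrary('sort_counting.so')
data = [3,4,7,5,10,1]
arr_type = c_int * len(data)
array = arr_type(*data)
# the C++ side must use the real length instead of the hard-coded 100
result = lib.main(array)  # returns a C int, not the array
data_sorted = list(array)  # the array is sorted in place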
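The conversion can be tried without a compiled library. In the sketch below, counting_sort_in_place is a hypothetical pure-Python stand-in for the C++ routine: it receives the ctypes array together with its length and sorts it in place, which is how a C function taking (int *arr, int len) would see it.

```python
from ctypes import c_int


def counting_sort_in_place(arr, length):
    """Stand-in for a C function with the signature (int *arr, int len)."""
    if length == 0:
        return
    lo = min(arr[i] for i in range(length))
    hi = max(arr[i] for i in range(length))
    counts = [0] * (hi - lo + 1)
    for i in range(length):
        counts[arr[i] - lo] += 1
    z = 0
    for value in range(lo, hi + 1):
        for _ in range(counts[value - lo]):
            arr[z] = value  # write back into the caller's buffer
            z += 1


data = [3, 4, 7, 5, 10, 1]
array = (c_int * len(data))(*data)          # Python list -> C int array
counting_sort_in_place(array, len(array))   # length is passed explicitly
data_sorted = list(array)                   # C int array -> Python list
```

The sorted values are read back from the array that was passed in, because the callee mutates the caller's buffer and returns nothing useful.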

Weave Inline C++ Code in Python 2.7

I'm trying to rewrite this function:
def smoothen_fast(heightProfile, travelTime):
    smoothingInterval = 30 * travelTime
    heightProfile.extend([heightProfile[-1]]*smoothingInterval)
    # Get the mean of first `smoothingInterval` items
    first_mean = sum(heightProfile[:smoothingInterval]) / smoothingInterval
    newHeightProfile = [first_mean]
    for i in xrange(len(heightProfile)-smoothingInterval-1):
        prev = heightProfile[i] # the item to be subtracted from the sum
        new = heightProfile[i+smoothingInterval] # item to be added
        # Calculate the sum of previous items by multiplying
        # last mean with smoothingInterval
        prev_sum = newHeightProfile[-1] * smoothingInterval
        new_sum = prev_sum - prev + new
        mean = new_sum / smoothingInterval
        newHeightProfile.append(mean)
    return newHeightProfile
as embedded C++ Code:
import scipy.weave as weave
heightProfile = [0.14,0.148,1.423,4.5]
heightProfileSize = len(heightProfile)
travelTime = 3
code = r"""
#include <string.h>
int smoothingInterval = 30 * travelTime;
double *heightProfileR = new double[heightProfileSize+smoothingInterval];
for (int i = 0; i < heightProfileSize; i++)
{
    heightProfileR[i] = heightProfile[i];
}
for (int i = 0; i < smoothingInterval; i++)
{
    heightProfileR[heightProfileSize+i] = -1;
}
double mean = 0;
for (int i = 0; i < smoothingInterval; i++)
{
    mean += heightProfileR[i];
}
mean = mean/smoothingInterval;
double *heightProfileNew = new double[heightProfileSize-smoothingInterval];
for (int i = 0; i < heightProfileSize-smoothingInterval-1; i++)
{
    double prev = heightProfileR[i];
    double newp = heightProfile[i+smoothingInterval];
    double prev_sum = heightProfileNew[i] * smoothingInterval;
    double new_sum = prev_sum - prev + newp;
    double meanp = new_sum / smoothingInterval;
    heightProfileNew[i+1] = meanp;
}
return_val = Py::new_reference_to(Py::Double(heightProfileNew));
"""
d = weave.inline(code,['heightProfile','heightProfileSize','travelTime'])
As the return value I need heightProfileNew, and I need to access it like a list in Python later.
I looked at these examples:
http://docs.scipy.org/doc/scipy/reference/tutorial/weave.html
The compiler keeps telling me that it doesn't know Py::, but the examples don't include any Py headers?
I know the question is old, but I think it is still interesting.
Assuming you're using weave to improve computation speed and that you know the length of your output beforehand, I suggest that you create the result before calling inline. That way you can create the result variable in Python (very easy). I would also suggest using an np.ndarray as the result, because it makes sure you use the right data type. You can iterate over ndarrays in Python the same way you iterate over lists.
import numpy as np
heightProfileArray = np.array(heightProfile)
# heightProfileArray = np.array(heightProfile, dtype = np.float32) if you want to make sure you have the right data type. Another choice would be np.float64
resultArray = np.zeros_like(heightProfileArray) # same array size and data type but filled with zeros
[..]
weave.inline(code,['heightProfile','heightProfileSize','travelTime','resultArray'])
for element in resultArray:
    print element
In your C++-code you can then just assign values to elements of that array:
[..]
resultArray[i+1] = 5.5;
[..]
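The pattern of allocating the output first and letting the fast routine fill it in place can be sketched in plain Python 3. Here smoothen_into is a hypothetical stand-in for the inline C++ code, and it keeps a running sum instead of rebuilding the sum from the previous mean; both choices are assumptions of this sketch, not part of the original code.

```python
from array import array


def smoothen_into(height_profile, travel_time, result):
    """Write the sliding-window means of height_profile into result."""
    window = 30 * travel_time
    # Pad with the last value so every position has a full window.
    padded = list(height_profile) + [height_profile[-1]] * window
    running_sum = sum(padded[:window])
    result[0] = running_sum / window
    for i in range(len(height_profile) - 1):
        # Slide the window: add the entering item, drop the leaving one.
        running_sum += padded[i + window] - padded[i]
        result[i + 1] = running_sum / window


profile = [0.0, 30.0]
result = array("d", [0.0] * len(profile))  # preallocated, typed output
smoothen_into(profile, 1, result)
smoothed = list(result)
```

The caller owns the output buffer, so nothing has to be converted or returned from the fast routine.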

Ctree Specializer is using for loop index for computation, not the actual array value

I'm implementing a simple Xor Reducer, but it is unable to return the appropriate value.
Python Code (Input):
class LazySpecializedFunctionSubclass(LazySpecializedFunction):
    subconfig_type = namedtuple('subconfig',['dtype','ndim','shape','size','flags'])
    def __init__(self, py_ast = None):
        py_ast = py_ast or get_ast(self.kernel)
        super(LazySlimmy, self).__init__(py_ast)
    # [... other code ...]
    def points(self, inpt):
        iter = np.nditer(input, flags=['c_index'])
        while not iter.finished:
            yield iter.index
            iter.iternext()
class XorReduction(LazySpecializedFunctionSubclass):
    def kernel(self, inpt):
        '''
        Calculates the cumulative XOR of elements in inpt, equivalent to
        Reduce with XOR
        '''
        result = 0
        for point in self.points(inpt): # self.points is defined in LazySpecializedFunctionSubclass
            result = point ^ result # notice how 'point' here is the actual element in self.points(inpt), not the index
        return result
C Code (Output):
// <file: module.c>
void kernel(long* inpt, long* output) {
    long result = 0;
    for (int point = 0; point < 2; point ++) {
        result = point ^ result; // Notice how it's using the index, point, not inpt[point].
    };
    * output = result;
};
Any ideas how to fix this?
The problem is that you are using point in two different ways: in the XorReduction kernel method you intend to iterate over the values in the array, but in the generated C code you are iterating over the indices of the array. The XOR reduction in your C code is thus done on the indices.
The generated C function should look more like this:
// <file: module.c>
void kernel(long* inpt, long* output) {
    long result = 0;
    for (int point = 0; point < 2; point ++) {
        result = inpt[point] ^ result; // you did not reference your input in the question
    };
    * output = result;
};
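The difference between the two reductions is easy to see in plain Python. The sample values below are made up for illustration.

```python
from functools import reduce
from operator import xor

values = [5, 9]  # hypothetical contents of inpt

# Reduction over the elements, which is what the kernel is meant to do.
by_value = reduce(xor, values, 0)

# Reduction over the loop indices, which is what the generated C code does.
by_index = reduce(xor, range(len(values)), 0)
```

For a two-element input the index-based version always yields 0 ^ 1, regardless of what the array contains.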
