Is there a Cython-ic way to set a cdef array to zeros? I have a function with the following signature:
cdef cget_values(double[:] cpc_x, double[:] cpc_y):
The function is called as follows:
cdef double cpc_x [16]
cdef double cpc_y [16]
cget_values(cpc_x, cpc_y)
Now the first thing I would like to do is set everything in these arrays to zeros. Currently, I am doing that with a for loop as:
for i in range(16):
    cpc_x[i] = 0.0
    cpc_y[i] = 0.0
I was wondering if this is a reasonable approach without much overhead. I call this function a lot and was wondering if there is a more elegant/faster way to do this in cython.
I assume you are already using @cython.boundscheck(False), so there is not much you can do to improve on it performance-wise.
For readability, I would use:
cpc_x[:]=0.0
cpc_y[:]=0.0
Cython translates this into for-loops. An additional advantage: even if @cython.boundscheck(False) isn't used, the resulting C code contains no bounds checks (__Pyx_RaiseBufferIndexError). Here is the resulting code for a[:]=0.0:
{
    double __pyx_temp_scalar = 0.0;
    {
        Py_ssize_t __pyx_temp_extent_0 = __pyx_v_a.shape[0];
        Py_ssize_t __pyx_temp_stride_0 = __pyx_v_a.strides[0];
        char *__pyx_temp_pointer_0;
        Py_ssize_t __pyx_temp_idx_0;
        __pyx_temp_pointer_0 = __pyx_v_a.data;
        for (__pyx_temp_idx_0 = 0; __pyx_temp_idx_0 < __pyx_temp_extent_0; __pyx_temp_idx_0++) {
            *((double *) __pyx_temp_pointer_0) = __pyx_temp_scalar;
            __pyx_temp_pointer_0 += __pyx_temp_stride_0;
        }
    }
}
What could improve the performance is declaring the memory views as contiguous (i.e. double[::1] instead of double[:]). The resulting C code for a[:]=0.0 would then be:
{
    double __pyx_temp_scalar = 0.0;
    {
        Py_ssize_t __pyx_temp_extent = __pyx_v_a.shape[0];
        Py_ssize_t __pyx_temp_idx;
        double *__pyx_temp_pointer = (double *) __pyx_v_a.data;
        for (__pyx_temp_idx = 0; __pyx_temp_idx < __pyx_temp_extent; __pyx_temp_idx++) {
            *((double *) __pyx_temp_pointer) = __pyx_temp_scalar;
            __pyx_temp_pointer += 1;
        }
    }
}
As one can see, strides[0] is no longer used in the contiguous version: the unit stride is known at compile time, so the resulting C code can be better optimized (see for example here).
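The difference between a strided and a contiguous view can also be seen from plain Python. The following is a minimal sketch using only the standard library; array and memoryview stand in for a Cython typed memoryview here, and the variable names are illustrative. A sliced view carries a non-unit stride, which is exactly the case that double[::1] rules out.

```python
from array import array

buf = array('d', [1.0] * 16)   # 16 doubles in one contiguous block
full = memoryview(buf)         # every element, in order
every_other = full[::2]        # strided view over the same memory

# The full view steps by one item; the sliced view steps by two items.
print(full.strides, full.c_contiguous)
print(every_other.strides, every_other.c_contiguous)

# Zero the underlying buffer through the contiguous view.
for i in range(len(full)):
    full[i] = 0.0
print(buf[0], buf[15])
```

A view such as every_other could be passed to a function taking double[:], but a function taking double[::1] would reject it at call time.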
One could be tempted to get smart and use the low-level memset function:
from libc.string cimport memset
memset(&cpc_x[0], 0, 16*sizeof(double))
However, for bigger arrays there will be no difference compared to using a contiguous memory view (i.e. double[::1], see here for example). There might be less overhead for smaller sizes, but I never cared enough to check.
I am getting memory allocation errors when running a compiled version of the following code. In this application a struct of pointers is defined; I would like to assign a value through the pointer and then pass the struct to C code. I have seen other examples and questions on this subject and I believe this is being done correctly, but I am still having issues.
The code compiles fine, but it crashes Python when run. Debugging with Visual Studio shows a memory access violation. I have researched this quite a bit but am unable to come up with a reason why this is happening. I was able to reproduce this on a different computer.
I believe it has something to do with how the struct test_M is being allocated and referenced on the stack. I've tried several different variations of defining test_M.param.gain_val; the one shown does allow the code to compile, and I can get the output to print on the screen. However, Python crashes immediately after this.
Unfortunately I cannot modify the C code because this is the format of auto-generated code from Matlab/Simulink Embedded Coder.
Any help would be appreciated.
Using:
python = 3.6
cython = 0.26
numpy = 1.13.1
Visual Studio 2017 v15
ccodetest.c
#include <stdlib.h>
typedef struct P_T_ P_T;
typedef struct tag_T RT_MODEL_T;
struct P_T_ {
    double gain_val;
};
struct tag_T {
    P_T *param;
};
void compute(double array_in[4], RT_MODEL_T *const test_M)
{
    P_T *test_P = ((P_T *) test_M->param);
    int size;
    size = sizeof(array_in);
    int i;
    for (i=0; i<size; i++)
    {
        array_in[i] = array_in[i] * test_P->gain_val;
    }
}
cython_param_test.pyx
cimport cython
import numpy as np
cimport numpy as np
from cpython.mem cimport PyMem_Malloc, PyMem_Free
np.import_array()
cdef extern from "ccodetest.c":
    ctypedef tag_T RT_MODEL_T
    ctypedef P_T_ P_T
    cdef struct P_T_:
        double gain_val
    cdef struct tag_T:
        P_T *param
    void compute(double array_in[4], RT_MODEL_T *const test_M)
cdef double array_in[4]
def run(
        np.ndarray[np.double_t, ndim=1, mode='c'] x_in,
        np.ndarray[np.double_t, ndim=2, mode='c'] x_out,
        np.ndarray[np.double_t, ndim=1, mode='c'] gain):
    cdef RT_MODEL_T* test_M = <RT_MODEL_T*> PyMem_Malloc(sizeof(RT_MODEL_T))
    global array_in
    test_M.param.gain_val = <double>gain
    cdef int idx
    try:
        for idx in range(len(x_in)):
            array_in[idx] = x_in[idx]
        compute(array_in, test_M)
        for idx in range(len(x_in)):
            x_out[idx] = array_in[idx]
    finally:
        PyMem_Free(test_M)
    return None
setup.py
import numpy
from numpy.distutils.core import setup  # provides the setup() called below
from Cython.Distutils import build_ext
def configuration(parent_package='', top_path=None):
    from numpy.distutils.misc_util import Configuration
    config = Configuration('', parent_package, top_path)
    config.add_extension('cython_param_test',
                         sources=['cython_param_test.pyx'],
                         # libraries=['m'],
                         depends=['ccodetest.c'],
                         include_dirs=[numpy.get_include()])
    return config
if __name__ == '__main__':
    params = configuration(top_path='').todict()
    params['cmdclass'] = dict(build_ext=build_ext)
    setup(**params)
run_cython_param_test.py
import cython_param_test
import numpy as np
n_samples = 4
x_in = np.arange(n_samples, dtype='double') % 4
x_out = np.empty((n_samples, 1))
gain = np.ones(1, dtype='double') * 5
cython_param_test.run(x_in, x_out, gain)
print(x_out)
cdef RT_MODEL_T* test_M = <RT_MODEL_T*> PyMem_Malloc(sizeof(RT_MODEL_T))
You allocate space for a RT_MODEL_T. test_M has one member, a pointer to a P_T. Allocating space for the RT_MODEL_T only allocates space to store the pointer - it doesn't allocate a P_T to be pointed to. Where param points is completely arbitrary at the moment and is most likely a memory address that you aren't allowed to write to.
test_M.param.gain_val = ...
You attempt to write to an element of the P_T pointed to by param. However, param does not point to an allocated P_T.
... = <double>gain
You attempt to cast a numpy array to a double, which does not make sense. You probably want either the first element of the numpy array, or to pass gain as a plain double rather than a numpy array of doubles.
Since test_M and its contents don't need to live beyond the end of the function they're allocated in, I'd be tempted to allocate them on the stack instead, and that way you can completely avoid malloc and free:
cdef RT_MODEL_T test_M # not a pointer
cdef P_T p_t_instance
p_t_instance.gain_val = gain # or gain[0]?
test_M.param = &p_t_instance
# ...
compute(array_in, &test_M) # pass the address of `test_M`
Only do this if you are sure of the required lifetime of test_M and the P_T it holds a pointer to.
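The same rule (a struct that holds a pointer does not contain the thing pointed to) can be illustrated from Python with ctypes. This is only a sketch: the class names reuse the C names for readability, the layout is assumed to match ccodetest.c, and nothing here calls compute. Note that ctypes zero-initializes new structures, so the pointer starts out as NULL, whereas memory from PyMem_Malloc is uninitialized.

```python
import ctypes

class P_T(ctypes.Structure):
    _fields_ = [("gain_val", ctypes.c_double)]

class RT_MODEL_T(ctypes.Structure):
    _fields_ = [("param", ctypes.POINTER(P_T))]

model = RT_MODEL_T()
# The model only has room for the pointer itself, not for a P_T.
only_a_pointer = ctypes.sizeof(RT_MODEL_T) == ctypes.sizeof(ctypes.c_void_p)
was_null = not bool(model.param)   # nothing to write through yet

# Create the P_T separately, then point the model at it.
params = P_T(gain_val=5.0)
model.param = ctypes.pointer(params)
print(only_a_pointer, was_null, model.param.contents.gain_val)
```

Only after the assignment to model.param is it safe to read or write gain_val through the pointer.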
I've written a Python extension module in C to speed up computation times. The first step is a 2D integration of a function f(x,y,k), which is very fast and allows me to integrate over y in [y1(x),y2(x)] and x in [a,b] whilst assigning a float to k. But I really need to integrate k over the range [c,d]. Currently, I'm doing something like this in Python
inner = lambda k: calc.kernel(l,k,ki)
I = quad(inner,c,d)[0]
where calc is my C-extension module and calc.kernel calls gauss2 to perform 2D integration. l and ki are just other variables. But with my data, quad still takes many hours to finish. I would like to do all calculations within the C-extension module, but I'm really stumped on how to implement this outer integral. Here is my C-code
#include <Python.h>
#include <math.h>
double A96[96]={ /* abscissas for 96-point Gauss quadrature */
};
double W96[96]={ /* weights for 96-point Gauss quadrature */
};
double Y1(double x){
    return 0;
}
double Y2(double x){
    return x;
}
double gauss1(double F(double),double a,double b)
{ /* 96-pt Gauss quadrature integrates F(x) from a to b */
    int i;
    double cx,dx,q;
    cx=(a+b)/2;
    dx=(b-a)/2;
    q=0;
    for(i=0;i<48;i++)
        q+=W96[i]*(F(cx-dx*A96[i])+F(cx+dx*A96[i]));
    return(q*dx);
}
double gauss2(double F(double,double,int,double,double),double Y1(double),double Y2(double),double a,double b,int l,double k, double ki)
{/* 96x96-pt 2-D Gauss quadrature integrates
    F(x,y) from y=Y1(x) to Y2(x) and x=a to b */
    int i,j,h;
    double cx,cy,dx,dy,q,w,x,y1,y2;
    cx=(a+b)/2;
    dx=(b-a)/2;
    q=0;
    for(i=0;i<48;i++)
    {
        for(h=-1;h<=1;h+=2)
        {
            x=cx+h*dx*A96[i];
            y1=Y1(x);
            y2=Y2(x);
            cy=(y1+y2)/2;
            dy=(y2-y1)/2;
            w=dy*W96[i];
            for(j=0;j<48;j++)
                q+=w*W96[j]*(F(x,cy-dy*A96[j],l,k,ki)+F(x,cy+dy*A96[j],l,k,ki));
        }
    }
    return(q*dx);
}
double ps_fact(double z){
    double M = 0.3;
    /* 3.0/2 rather than 3/2: integer division would evaluate to 1 */
    return 3.0/2*(M*(1+z)*(1+z)*(1+z) + (1-M))*(M*(1+z)*(1+z)*(1+z) + (1-M))*(M*(1+z)*(1+z)*(1+z) + (1-M))/(1+z)/(1+z);
}
double drdz(double z){
    double M = 0.3;
    return 3000/sqrt(M*(1+z)*(1+z)*(1+z) + (1-M));
}
double rInt(double z){
    double M = 0.3;
    return 3000/sqrt(M*(1+z)*(1+z)*(1+z) + (1-M));
}
double kernel_func ( double y , double x, int l,double k, double ki) {
    return ps_fact(y)*ki*rInt(x)*sqrt(M_PI/2/rInt(x))*jn(l+0.5,ki*rInt(x))*drdz(x)*(rInt(x)-rInt(y))/rInt(y)*sqrt(M_PI/2/rInt(y))*jn(l+0.5,k*rInt(y))*drdz(y);
}
static PyObject* calc(PyObject* self, PyObject* args)
{
    int l;
    double k, ki;
    if (!PyArg_ParseTuple(args, "idd", &l, &k, &ki))
        return NULL;
    double res;
    res = gauss2(kernel_func,Y1, Y2, 0,10,l, k, ki);
    return Py_BuildValue("d", res);
}
static PyMethodDef CalcMethods[] = {
    {"kernel", calc, METH_VARARGS, "Calculates kernel values."},
    {NULL, NULL, 0, NULL}
};
PyMODINIT_FUNC initcalc(void){
    (void) Py_InitModule("calc", CalcMethods);
}
A96 and W96 both contain the points for the Gaussian quadrature, so don't worry that they are empty here. I should add I don't take any credit for the functions gauss1 and gauss2.
EDIT: python code was wrong - edited now.
Maybe the source code for scipy.integrate.quad is a good place to start, if you haven't looked there already: https://github.com/scipy/scipy/blob/v0.17.0/scipy/integrate/quadpack.py#L45-L360
Looks like most of the work is already being done by native Fortran code, which is normally either as fast or faster than C/C++ code. You will be hard pressed to improve on that, unless you create/find a CUDA implementation.
You could make the Fortran code multithreaded, if it isn't already and the source is open. Lastly, you could write a threading dispatcher in C/Fortran (Python doesn't support real threading because of the GIL) and at least run your calls to quad in parallel with one another. Interfacing calc directly with the Fortran quad would probably save you some decent overhead too.
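If the goal is instead to keep everything inside the extension module, one option (not something SciPy prescribes) is to treat the 2-D result as a function of k and apply a fixed-order 1-D Gauss rule to it, in the same way gauss1 does. The sketch below shows the idea in Python with a 3-point rule and a toy integrand. NODES, WEIGHTS, inner_2d and outer are illustrative names, and unlike quad a fixed-order rule gives no error estimate, so the order would have to be checked against the real kernel.

```python
import math

# 3-point Gauss-Legendre rule on [-1, 1] (exact for polynomials up to degree 5).
NODES = (-math.sqrt(3.0 / 5.0), 0.0, math.sqrt(3.0 / 5.0))
WEIGHTS = (5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0)

def gauss1(f, a, b):
    """Integrate f(x) from a to b with the fixed rule above."""
    cx, dx = (a + b) / 2.0, (b - a) / 2.0
    return dx * sum(w * f(cx + dx * t) for t, w in zip(NODES, WEIGHTS))

def inner_2d(f, y1, y2, a, b, k):
    """Integrate f(x, y, k) over y in [y1(x), y2(x)] and x in [a, b]."""
    return gauss1(lambda x: gauss1(lambda y: f(x, y, k), y1(x), y2(x)), a, b)

def outer(f, y1, y2, a, b, c, d):
    """Integrate the 2-D result over k in [c, d] with the same 1-D rule."""
    return gauss1(lambda k: inner_2d(f, y1, y2, a, b, k), c, d)

# Toy integrand k*x*y on 0 <= y <= x <= 1 and 0 <= k <= 1; the exact value is 1/16.
result = outer(lambda x, y, k: k * x * y,
               lambda x: 0.0, lambda x: x,
               0.0, 1.0, 0.0, 1.0)
print(result)
```

In C the same structure would amount to one more loop over the quadrature nodes in k around the existing gauss2 call, with k passed through as it already is.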
I'm writing a program to find the roots of nth-order Legendre polynomials using C++; my code is attached below:
double* legRoots(int n)
{
    double myRoots[n];
    double x, dx, Pi = atan2(1,1)*4;
    int iters = 0;
    double tolerance = 1e-20;
    double error = 10*tolerance;
    int maxIterations = 1000;
    for(int i = 1; i<=n; i++)
    {
        x = cos(Pi*(i-.25)/(n+.5));
        do
        {
            dx -= legDir(n,x)/legDif(n,x);
            x += dx;
            iters += 1;
            error = abs(dx);
        } while (error>tolerance && iters<maxIterations);
        myRoots[i-1] = x;
    }
    return myRoots;
}
This assumes the existence of functioning Legendre polynomial and Legendre polynomial derivative functions, which I do have, but I thought including them would make for unreadable walls of code. The function works in the sense that it returns an array of calculated values, but they're wildly off; it outputs the following:
3.95253e-323
6.94492e-310
6.95268e-310
6.42285e-323
4.94066e-323
2.07355e-317
where an equivalent function I've written in Python gives the following:
[-0.90617985 -0.54064082 0. 0.54064082 0.90617985]
I was hoping another set of eyes could help me see what in my C++ code is causing the values to be wildly off. I'm not doing anything different in my Python code than what I'm doing in C++, so any help anyone could give on this is greatly appreciated, thanks. For reference, I'm mostly trying to emulate the method found on Rosetta Code for Gaussian quadrature: http://rosettacode.org/wiki/Numerical_integration/Gauss-Legendre_Quadrature.
You are returning the address of a local variable that lives on the stack:
{
    double myRoots[n];
    ...
    return myRoots; // Not a safe thing to do
}
I suggest changing your function definition to
void legRoots(int n, double *myRoots)
omitting the return statement, and defining myRoots before calling the function:
double myRoots[10];
legRoots(10, myRoots);
Option 2 would be to allocate myRoots dynamically with new or malloc.
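The answer above addresses the dangling pointer. Separately, the Newton loop in the question applies dx -= ... to an uninitialized dx, never resets iters between roots, and uses a tolerance (1e-20) below what double precision can resolve for roots of magnitude near 1. For comparison, here is a minimal Python sketch of the iteration with those points handled. legendre_and_derivative is a hypothetical stand-in for the question's legDir and legDif helpers, which are not shown in the post.

```python
import math

def legendre_and_derivative(n, x):
    """Return (P_n(x), P_n'(x)) via the three-term recurrence, for n >= 1."""
    p_prev, p = 1.0, x                       # P_0 and P_1
    for k in range(2, n + 1):
        p_prev, p = p, ((2 * k - 1) * x * p - (k - 1) * p_prev) / k
    dp = n * (x * p - p_prev) / (x * x - 1.0)
    return p, dp

def legendre_roots(n, tolerance=1e-14, max_iterations=100):
    """Find the n roots of P_n by Newton iteration, in ascending order."""
    roots = []
    for i in range(1, n + 1):
        x = math.cos(math.pi * (i - 0.25) / (n + 0.5))   # initial guess
        for _ in range(max_iterations):      # iteration count restarts per root
            p, dp = legendre_and_derivative(n, x)
            dx = -p / dp                     # Newton step, computed fresh each time
            x += dx
            if abs(dx) < tolerance:
                break
        roots.append(x)
    return sorted(roots)

roots = legendre_roots(5)
print(roots)
```

For n = 5 the roots come out at approximately ±0.9062, ±0.5385 and 0. The list is built and returned by value, which is the Python counterpart of having the caller own the output buffer.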
I wrote a simple Python extension module to simulate a 3-bit analog-to-digital converter. It is supposed to accept a floating-point array as its input to return the same size array of output. The output actually consists of quantized input numbers. Here is my (simplified) module:
static PyObject *adc3(PyObject *self, PyObject *args) {
    PyArrayObject *inArray = NULL, *outArray = NULL;
    double *pinp = NULL, *pout = NULL;
    npy_intp nelem;
    int dims[1], i, j;
    /* Get arguments: */
    if (!PyArg_ParseTuple(args, "O:adc3", &inArray))
        return NULL;
    nelem = PyArray_DIM(inArray,0); /* size of the input array */
    pout = (double *) malloc(nelem*sizeof(double));
    pinp = (double *) PyArray_DATA(inArray);
    /* ADC action */
    for (i = 0; i < nelem; i++) {
        if (pinp[i] >= -0.5) {
            if (pinp[i] < 0.5) pout[i] = 0;
            else if (pinp[i] < 1.5) pout[i] = 1;
            else if (pinp[i] < 2.5) pout[i] = 2;
            else if (pinp[i] < 3.5) pout[i] = 3;
            else pout[i] = 4;
        }
        else {
            if (pinp[i] >= -1.5) pout[i] = -1;
            else if (pinp[i] >= -2.5) pout[i] = -2;
            else if (pinp[i] >= -3.5) pout[i] = -3;
            else pout[i] = -4;
        }
    }
    dims[0] = nelem;
    outArray = (PyArrayObject *)
        PyArray_SimpleNewFromData(1, dims, NPY_DOUBLE, pout);
    //Py_INCREF(outArray);
    return PyArray_Return(outArray);
}
/* ==== methods table ====================== */
static PyMethodDef mwa_methods[] = {
    {"adc3", adc3, METH_VARARGS, "n-bit Analog-to-Digital Converter (ADC)"},
    {NULL, NULL, 0, NULL}
};
/* ==== Initialize ====================== */
PyMODINIT_FUNC initmwa() {
    Py_InitModule("mwa", mwa_methods);
    import_array(); // for NumPy
}
I expected that if reference counts were processed correctly, the Python garbage collection would (frequently enough) release the memory used by the output array if it has the same name and is used repeatedly. So I tested it on some dummy (but voluminous) data with this code:
for i in xrange(200):
    a = rand(1000000)
    b = mwa.adc3(a)
    print i
Here the array named "b" is reused many times and its memory, borrowed by adc3() from the heap, is expected to be returned to the system. I used the gnome-system-monitor to check. Contrary to my expectations, the memory owned by python grew rapidly and could only be released by quitting the program (I use IPython).
For comparison, I tried the same procedure with the standard NumPy functions, zeros() and copy():
for i in xrange(1000):
    a = np.zeros(10000000)
    b = np.copy(a)
    print i
As you can see, the latter code does not cause any memory build-up.
I read many texts in the standard documentation and on the web, tried to use Py_INCREF(outArray) and not to use it. All in vain: the problem persisted.
However, I found the solution in http://wiki.scipy.org/Cookbook/C_Extensions/NumPy_arrays.
The author provides an extension program matsq() that creates an array and returns it. When I tried to use the calls suggested by the author:
outArray = (PyArrayObject *) PyArray_FromDims(nd,dims,NPY_DOUBLE);
pout = (double *) outArray->data;
instead of my
pout = (double *) malloc(nelem*sizeof(double));
outArray = (PyArrayObject *)
PyArray_SimpleNewFromData(1, dims, NPY_DOUBLE, pout);
/* no matter with or without Py_INCREF(outArray)) */
the memory leak was gone! The program works properly now.
A question: can anybody explain why PyArray_SimpleNewFromData() does not provide the correct reference counting, while PyArray_FromDims() does?
Thank you very much.
ADDITION. I probably exceeded the room/time in comments, so I add to my comment to Alex here.
I tried to set the OWNDATA flag this way:
outArray->flags |= OWNDATA;
but I got "error: ‘OWNDATA’ undeclared".
The rest is in the comment. Thank you in advance.
SOLVED: The correct setting of the flag is
outArray->flags |= NPY_ARRAY_OWNDATA;
Now it works.
Alex, sorry.
The problem is not with PyArray_SimpleNewFromData which produces a properly refcounted PyObject*. Rather, it's with your malloc, assigned to pout then never freed.
As the docs at http://docs.scipy.org/doc/numpy/user/c-info.how-to-extend.html clearly state, documenting PyArray_SimpleNewFromData:
the ndarray will not own its data. When this ndarray is deallocated, the pointer will not be freed.
...
If you want the memory to be freed as soon as the ndarray is deallocated then simply set the OWNDATA flag on the returned ndarray.
(my emphasis on the not). IOW, you're observing exactly the "will not be freed" behavior so clearly documented, and are not taking the step specifically recommended should you want to avoid said behavior.
I'm trying to optimize a piece of Python code using AVX. I'm using ctypes to access the C++ function. Sometimes the function segfaults and sometimes it doesn't. I think it may have something to do with alignment?
Maybe anyone can help me with this, I'm kinda stuck here.
Python-Code:
from ctypes import *
import numpy as np
#path_cnt
path_cnt = 16
c_path_cnt = c_int(path_cnt)
#ndarray1
ndarray1 = np.ones(path_cnt,dtype=np.float32,order='C')
ndarray1.setflags(align=1,write=1)
c_ndarray1 = ndarray1.ctypes.data_as(POINTER(c_float))
#ndarray2
ndarray2 = np.ones(path_cnt,dtype=np.float32,order='C')
ndarray2.setflags(align=1,write=1)
c_ndarray2 = ndarray2.ctypes.data_as(POINTER(c_float))
#call function
finance = cdll.LoadLibrary(".../libfin.so")
finance.foobar.argtypes = [c_void_p, c_void_p,c_int]
finance.foobar(c_ndarray1,c_ndarray2,c_path_cnt)
x=0
while x < path_cnt:
    print c_ndarray1[x]
    x+=1
C++ Code
#include <immintrin.h>
extern "C"{
    int foobar(float * ndarray1,float * ndarray2,int path_cnt)
    {
        for(int i=0;i<path_cnt;i=i+8)
        {
            __m256 arr1 = _mm256_load_ps(&ndarray1[i]);
            __m256 arr2 = _mm256_load_ps(&ndarray2[i]);
            __m256 add = _mm256_add_ps(arr1,arr2);
            _mm256_store_ps(&ndarray1[i],add);
        }
        return 0;
    }
}
And now the odd output behavior: making the same call in the terminal twice gives different results!
tobias@tobias-Lenovo-U310:~/workspace/finance$ python finance.py
Segmentation fault (core dumped)
tobias@tobias-Lenovo-U310:~/workspace/finance$ python finance.py
2.0
2.0
2.0
2.0
2.0
2.0
2.0
2.0
Thanks in advance!
There are aligned and unaligned load instructions. The aligned ones will fault if you violate the alignment rules, but they are faster. The unaligned ones accept any address and do loads/shifts internally to get the data you want. You are using the aligned version, _mm256_load_ps and can just switch to the unaligned version _mm256_loadu_ps without any intermediate allocation.
A good vectorizing compiler will include a lead-in loop to reach an aligned address, then a body to work on aligned data, then a final loop to clean up any stragglers.
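Whether a given buffer satisfies the 32-byte requirement of the aligned loads can be checked from the Python side before calling into the library. The sketch below uses only ctypes; aligned_float_array and is_aligned are hypothetical helpers, not part of ctypes or NumPy. It over-allocates a raw buffer and views it from the first 32-byte boundary. NumPy does not promise 32-byte alignment for its own allocations, which would explain a crash that appears only on some runs.

```python
import ctypes

def aligned_float_array(n, alignment=32):
    """Return a ctypes float array of length n starting on an `alignment`-byte boundary."""
    raw = (ctypes.c_char * (n * ctypes.sizeof(ctypes.c_float) + alignment))()
    offset = -ctypes.addressof(raw) % alignment   # bytes to skip to reach the boundary
    # from_buffer shares memory with `raw` and keeps it alive.
    return (ctypes.c_float * n).from_buffer(raw, offset)

def is_aligned(obj, alignment=32):
    """True if the ctypes object's address is a multiple of `alignment`."""
    return ctypes.addressof(obj) % alignment == 0

a = aligned_float_array(16)
for i in range(len(a)):
    a[i] = 1.0
print(is_aligned(a), len(a), a[0])
```

An array built this way can be passed where the library expects a float pointer, so the aligned intrinsics stay valid without copying inside the C++ function. Switching to _mm256_loadu_ps remains the simpler fix.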
All right, I think I found a solution. It's not very elegant, but at least it works!
There should be a better way. Any suggestions?
extern "C"{
    int foobar(float * ndarray1,float * ndarray2,int path_cnt)
    {
        float * test1 = (float*)_mm_malloc(path_cnt*sizeof(float),32);
        float * test2 = (float*)_mm_malloc(path_cnt*sizeof(float),32);
        //copy to aligned memory(this part is kinda stupid)
        for(int i=0;i<path_cnt;i++)
        {
            test1[i] = ndarray1[i];
            test2[i] = ndarray2[i];
        }
        for(int i=0;i<path_cnt;i=i+8)
        {
            __m256 arr1 = _mm256_load_ps(&test1[i]);
            __m256 arr2 = _mm256_load_ps(&test2[i]);
            __m256 add = _mm256_add_ps(arr1,arr2);
            _mm256_store_ps(&test1[i],add);
        }
        //and copy everything back!
        for(int i=0;i<path_cnt;i++)
        {
            ndarray1[i] = test1[i];
        }
        //release the aligned scratch buffers
        _mm_free(test1);
        _mm_free(test2);
        return 0;
    }
}