I have a problem when I try to debug my C++ extension for Python.
The error is
Fatal Python error: PyThreadState_Get: no current thread
I followed this guide, and it works when I run the release version.
Python code:
from itertools import islice
from random import random
from time import perf_counter

COUNT = 500000  # Change this value depending on the speed of your computer
DATA = list(islice(iter(lambda: (random() - 0.5) * 3.0, None), COUNT))

e = 2.7182818284590452353602874713527

def sinh(x):
    return (1 - (e ** (-2 * x))) / (2 * (e ** -x))

def cosh(x):
    return (1 + (e ** (-2 * x))) / (2 * (e ** -x))

def tanh(x):
    tanh_x = sinh(x) / cosh(x)
    return tanh_x

def sequence_tanh(data):
    '''Applies the hyperbolic tangent function to map all values in
    the sequence to a value between -1.0 and 1.0.
    '''
    result = []
    for x in data:
        result.append(tanh(x))
    return result

def test(fn, name):
    start = perf_counter()
    result = fn(DATA)
    duration = perf_counter() - start
    print('{} took {:.3f} seconds\n\n'.format(name, duration))
    for d in result:
        assert -1 <= d <= 1, "incorrect values"

from superfastcode import fast_tanh

if __name__ == "__main__":
    test(lambda d: [fast_tanh(x) for x in d], '[fast_tanh(x) for x in d]')
C++ code:
#include <Python.h>
#include <cmath>

const double e = 2.7182818284590452353602874713527;

double sinh_impl(double x) {
    return (1 - pow(e, (-2 * x))) / (2 * pow(e, -x));
}

double cosh_impl(double x) {
    return (1 + pow(e, (-2 * x))) / (2 * pow(e, -x));
}

PyObject* tanh_impl(PyObject *, PyObject* o) {
    double x = PyFloat_AsDouble(o);
    double tanh_x = sinh_impl(x) / cosh_impl(x);
    return PyFloat_FromDouble(tanh_x);
}

static PyMethodDef superfastcode_methods[] = {
    // The first property is the name exposed to Python, fast_tanh; the second is the C++
    // function that contains the implementation.
    { "fast_tanh", (PyCFunction)tanh_impl, METH_O, nullptr },

    // Terminate the array with an object containing nulls.
    { nullptr, nullptr, 0, nullptr }
};

static PyModuleDef superfastcode_module = {
    PyModuleDef_HEAD_INIT,
    "superfastcode",                        // Module name to use with Python import statements
    "Provides some functions, but faster",  // Module description
    0,
    superfastcode_methods                   // Structure that defines the methods of the module
};

PyMODINIT_FUNC PyInit_superfastcode() {
    return PyModule_Create(&superfastcode_module);
}
I am using the 64-bit version of Python 3.6 and am building the C++ code in x64 mode, with Visual Studio 2017 (15.6.4).
I am linking with C:\Python\Python36.x64\libs\python36_d.lib and including header files from C:\Python\Python36.x64\include
My Python interpreter is in C:\Python\Python36.x64\
I get this result when I run the release build:
[fast_tanh(x) for x in d] took 0.067 seconds
Update: I got it running with 32-bit (x86) Python, but not with x64.
When I hit the breakpoint and step over (F10), it throws an exception.
I got this solution from Steve Dower at Microsoft:
This looks more like a mismatch between debug binaries and release headers.
The guide you've referenced is designed to always use the release binaries of Python, even if you are building a debug extension. So either you should be linking against python36.lib/python36.dll, or ignoring most of the setting changes listed in the guide and linking against python36_d.lib/python36_d.dll (the linking should be automatic once you set the paths - the choice of C runtime library will determine whether debug/release Python is used).
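A common workaround (not part of the guide itself, but widely used) is to keep building the extension in debug mode while linking against the release Python binaries, by hiding the _DEBUG define from Python.h; on Windows, pyconfig.h otherwise requests python36_d.lib via a #pragma comment(lib, ...):

// Sketch of the workaround: include Python.h with _DEBUG temporarily
// undefined, so a debug build of the extension still links against the
// release python36.lib / python36.dll.
#ifdef _DEBUG
#undef _DEBUG
#include <Python.h>
#define _DEBUG
#else
#include <Python.h>
#endif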
Reference: PTVS issues
Related
I want to copy a 2D numpy array (matrix) into a C function and get it back in Python (and then do some calculations on it in C, taking advantage of C's speed). Therefore I need the C function matrix_copy to return a 2D array (or, I guess, a pointer to it). I tried the following code, but I get the output below, where one can see that the second dimension of the array is lost.
matrix_in.shape:
(300, 200)
matrix_out.shape:
(300,)
How could I change the code (I guess by adding some pointer arithmetic to matrix_copy.c) so that I obtain an exact copy of matrix_in in matrix_out?
Here is the main.py script:
from ctypes import c_void_p, c_double, c_int, cdll
from numpy.ctypeslib import ndpointer
import numpy as np
import pdb

n = 300
m = 200

matrix_in = np.random.randn(n, m)

lib = cdll.LoadLibrary("matrix_copy.so")
matrix_copy = lib.matrix_copy
matrix_copy.restype = ndpointer(dtype=c_double,
                                shape=(n,))

matrix_out = matrix_copy(c_void_p(matrix_in.ctypes.data),
                         c_int(n),
                         c_int(m))

print("matrix_in.shape:")
print(matrix_in.shape)
print("matrix_out.shape:")
print(matrix_out.shape)
Here is the matrix_copy.c script:
#include <stdlib.h>
#include <stdio.h>

double * matrix_copy(const double * matrix_in, int n, int m){
    double * matrix_out = (double *)malloc(sizeof(double) * (n*m));
    int index = 0;
    for(int i=0; i<n; i++){
        for(int j=0; j<m; j++){
            matrix_out[i+j] = matrix_in[i+j];
            //matrix_out[i][j] = matrix_in[i][j];
            // some heavy computations not yet implemented
        }
    }
    return matrix_out;
}
which I compile with the command
cc -fPIC -shared -o matrix_copy.so matrix_copy.c
And as a side note: why does the notation matrix_out[i][j] = matrix_in[i][j]; throw an error at compilation?
matrix_copy.c:10:26: error: subscripted value is not an array, pointer, or vector
matrix_out[i][j] = matrix_in[i][j];
~~~~~~~~~~~~~^~
matrix_copy.c:10:44: error: subscripted value is not an array, pointer, or vector
matrix_out[i][j] = matrix_in[i][j];
The second dimension is 'lost' because you explicitly omit it in the named shape argument of ndpointer. Change:
matrix_copy.restype = ndpointer(dtype=c_double, shape=(n,))
to
matrix_copy.restype = ndpointer(dtype=c_double, shape=(n,m), flags='C')
Where flags='C' additionally indicates that the returned data is stored contiguously in row-major order.
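On the input side, if the array you pass in might not be C-contiguous (for example, after slicing or transposing), you can enforce that before handing its pointer to C. A small sketch, my addition rather than part of the original answer:

import numpy as np

# Make sure matrix_in really is one contiguous row-major block before
# passing matrix_in.ctypes.data to the C function.
matrix_in = np.ascontiguousarray(matrix_in, dtype=np.float64)
assert matrix_in.flags['C_CONTIGUOUS']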
With regards to matrix_out[i][j] = matrix_in[i][j]; throwing an error, consider that matrix_in is of type const double *. matrix_in[i] would yield a value of type const double - how do you further index this value (i.e., with [j])?
If you want to emulate accessing a 2-dimensional array via a single pointer, you must calculate the offsets manually. matrix_out[i+j] is not sufficient, as you must account for the span of each subarray:
matrix_out[i * m + j] = matrix_in[i * m + j];
Note that in C, size_t is the generally preferred type to use when dealing with memory sizes or array lengths.
matrix_copy.c, simplified:
#include <stdlib.h>

double *matrix_copy(const double *matrix_in, size_t n, size_t m)
{
    double *matrix_out = malloc(sizeof *matrix_out * n * m);

    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            matrix_out[i * m + j] = matrix_in[i * m + j];

    return matrix_out;
}
matrix.py, with more explicit typing:
from ctypes import c_double, c_size_t, cdll, POINTER
from numpy.ctypeslib import ndpointer
import numpy as np

c_double_p = POINTER(c_double)

n = 300
m = 200

matrix_in = np.random.randn(n, m).astype(c_double)

lib = cdll.LoadLibrary("matrix_copy.so")
matrix_copy = lib.matrix_copy
matrix_copy.argtypes = c_double_p, c_size_t, c_size_t
matrix_copy.restype = ndpointer(
        dtype=c_double,
        shape=(n, m),
        flags='C')

matrix_out = matrix_copy(
        matrix_in.ctypes.data_as(c_double_p),
        c_size_t(n),
        c_size_t(m))

print("matrix_in.shape:", matrix_in.shape)
print("matrix_out.shape:", matrix_out.shape)
print("in == out:", (matrix_in == matrix_out).all())
The incoming data is probably a single block of memory; you need to create the substructure yourself.
In my C++ code I have to do the following with data (block) coming in via SWIG:
void divide2DDoubleArray(double * &block, double ** &subblockdividers, int noofsubblocks, int subblocksize){
    /* The starting address of a block of doubles is used to generate
     * pointers to subblocks.
     *
     * block: memory containing the original block of data
     * subblockdividers: array of subblock addresses
     * noofsubblocks: specify the number of subblocks produced
     * subblocksize: specify the size of the subblocks produced
     *
     * Design by contract: application should make sure the memory
     * in block is allocated and initialized properly.
     */
    // Build 2D matrix for cols
    subblockdividers = new double *[noofsubblocks];
    subblockdividers[0] = block;
    for (int i = 1; i < noofsubblocks; ++i) {
        subblockdividers[i] = &subblockdividers[i-1][subblocksize];
    }
}
Now the pointer array returned in subblockdividers can be used the way you want.
Don't forget to free subblockdividers when you're done. (Note: adjustments might be needed to compile this as C code.)
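For illustration, a hypothetical caller could look like this (the sizes and variable names are mine, not from the original answer):

// Hypothetical usage: view a flat block of 3*200 doubles as 3 rows of 200.
double *block = new double[3 * 200]();
double **rows = nullptr;
divide2DDoubleArray(block, rows, 3, 200);
double v = rows[2][199];  // last element of the last subblock
delete [] rows;           // frees only the row-pointer array...
delete [] block;          // ...the underlying data block is freed separately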
I recently did some tests on performance optimization in Python. One part was benchmarking a Monte Carlo Pi calculation, using SWIG to compile a library to import into Python; the other solution used Numba. I now wonder why the native C solution is worse than Numba, even though the LLVM compiler is used for both. So I'm wondering if I'm doing something wrong.
Runtimes on my laptop:
native C module: 7.09 s
Python+Numba: 2.75 s
Native C code
#include "swigtest.h"
#include <time.h>
#include <stdlib.h>
#include <stdio.h>
float monte_carlo_pi(long nsamples)
{
int accGlob=0;
int accLoc=0;
int i,ns;
float x,y;
float res;
float iRMX=1.0/(float) RAND_MAX;
srand(time(NULL));
for(i=0;i<nsamples;i++)
{
x = (float)rand()*iRMX;
y = (float)rand()*iRMX;
if((x*x + y*y) < 1.0) { acc += 1;}
}
res = 4.0 * (float) acc / (float) nsamples;
printf("cres = %.5f\n",res);
return res;
}
swigtest.i
%module swigtest
%{
#define SWIG_FILE_WITH_INIT
#include "swigtest.h"
%}
float monte_carlo_pi(long nsamples);
Compiler call
clang.exe swigtest.c swigtest_wrap.c -Ofast -o _swigtest.pyd -I C:\python37\include -shared -L c:\python37\libs -g0 -mtune=intel -msse4.2 -mmmx
testswig.py
from swigtest import monte_carlo_pi
import time
import os
start = time.time()
pi = monte_carlo_pi(250000000)
print("pi: %.5f" % pi)
print("tm:",time.time()-start)
Python version with Numba
from numba import jit
import random
import time

start = time.time()

@jit(nopython=True, cache=True, fastmath=True)
def monte_carlo_pi(nsamples: int) -> float:
    acc: int = 0
    for i in range(nsamples):
        x: float = random.random()
        y: float = random.random()
        if (x * x + y * y) < 1.0: acc += 1
    return 4.0 * acc / nsamples

pi = monte_carlo_pi(250000000)
print("pi:", pi)
print("tm:", time.time() - start)
Summary up to now:
The rand() function seems to consume most of the time. Using a deterministic approach like this
...
ns = (long)sqrt((double)nsamples) + 1;
dx = 1. / sqrt((double)nsamples);
dy = dx;
...
for (i = 0; i < ns; i++)
    for (k = 0; k < ns; k++)
    {
        x = i * dx;
        y = k * dy;
        if ((x*x + y*y) < 1.0) { accLoc += 1; }
    }
...
instead of rand() results in an execution time of only 0.04 s! Obviously Numba uses a different, much more efficient random-number generator.
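For what it's worth, here is a minimal sketch of a cheaper random-number source; the xorshift64 generator and its seeding are my assumptions, not something from the original benchmark:

#include <stdint.h>

/* Marsaglia xorshift64: a few cheap bit operations per draw,
 * much faster than calling rand() in a tight loop. */
static inline uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

float monte_carlo_pi_fast(long nsamples)
{
    uint64_t state = 88172645463325252ULL;  /* any nonzero seed */
    const float inv = 1.0f / (float)UINT32_MAX;
    int acc = 0;
    long i;

    for (i = 0; i < nsamples; i++)
    {
        /* take the low 32 bits of each draw and scale into [0, 1] */
        float x = (float)(uint32_t)xorshift64(&state) * inv;
        float y = (float)(uint32_t)xorshift64(&state) * inv;
        if (x * x + y * y < 1.0f) { acc += 1; }
    }
    return 4.0f * (float)acc / (float)nsamples;
}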
I am trying to parallelize bitonic sort with PyCUDA. For this I use SourceModule and the C code of the parallel bitonic sort. For memory management I use InOut from pycuda.driver, which simplifies some of the memory transfers.
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule
from pycuda import gpuarray
import numpy as np
from time import time

ker = SourceModule(
    """
    __device__ void swap(int & a, int & b){
        int tmp = a;
        a = b;
        b = tmp;
    }

    __global__ void bitonicSort(int * values, int N){
        extern __shared__ int shared[];
        int tid = threadIdx.x + blockDim.x * blockIdx.x;

        // Copy input to shared mem.
        shared[tid] = values[tid];
        __syncthreads();

        // Parallel bitonic sort.
        for (int k = 2; k <= N; k *= 2){
            // Bitonic merge:
            for (int j = k / 2; j > 0; j /= 2){
                int ixj = tid ^ j;
                if (ixj > tid){
                    if ((tid & k) == 0){
                        // Sort ascending
                        if (shared[tid] > shared[ixj]){
                            swap(shared[tid], shared[ixj]);
                        }
                    }
                    else{
                        // Sort descending
                        if (shared[tid] < shared[ixj]){
                            swap(shared[tid], shared[ixj]);
                        }
                    }
                }
                __syncthreads();
            }
        }
        values[tid] = shared[tid];
    }
    """
)

N = 8  # length of A
A = np.int32(np.random.randint(1, 20, N))  # random numbers in A
BLOCK_SIZE = 256
NUM_BLOCKS = (N + BLOCK_SIZE - 1) // BLOCK_SIZE

bitonicSort = ker.get_function("bitonicSort")

t1 = time()
bitonicSort(drv.InOut(A), np.int32(N), block=(BLOCK_SIZE, 1, 1), grid=(NUM_BLOCKS, 1), shared=4 * N)
t2 = time()

print("Execution Time {0}".format(t2 - t1))
print(A)
As the kernel uses extern __shared__, I pass the shared parameter in PyCUDA with the corresponding size 4*N. I also tried using __shared__ int shared[N] in the kernel, but that doesn't work either (see: Getting started with shared memory on PyCUDA).
Running in Google Colab I get the following error:
/usr/local/lib/python3.6/dist-packages/pycuda/compiler.py in __init__(self, source, nvcc, options, keep, no_extern_c, arch, code, cache_dir, include_dirs)
292
293 from pycuda.driver import module_from_buffer
--> 294 self.module = module_from_buffer(cubin)
295
296 self._bind_module()
LogicError: cuModuleLoadDataEx failed: an illegal memory access was encountered
Does anyone know what could be generating this error?
Your device code isn't accounting for the sizes of your arrays correctly.
You are launching 256 threads in a single block. That means that you will have 256 threads, with tid numbered 0..255, trying to execute each line of code. For example, in this case:
shared[tid] = values[tid];
You will have, for example, one thread trying to do shared[255] = values[255];
Neither your shared nor your values array is that large. That is the reason for the illegal memory access error.
The simplest solution for this kind of trivial problem is to make your array sizes match your block size.
BLOCK_SIZE = N
According to my testing, that change clears up any errors and results in a properly sorted array.
It won't work for N greater than 1024 or for multi-block usage, but your code would have to be modified for a multi-block sort anyway.
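For reference, a minimal sketch of the adjusted launch for the single-block case (assuming N <= 1024, the per-block thread limit):

BLOCK_SIZE = N   # one thread per element, in a single block
bitonicSort(drv.InOut(A), np.int32(N),
            block=(BLOCK_SIZE, 1, 1), grid=(1, 1), shared=4 * N)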
If you still have trouble after making that change, I suggest restarting your python session or your colab session.
I have a project mostly written in Python, which runs on my Raspberry Pi (Model B). Using the Pi Camera I record to a stream; every second I pause the recording, take the last frame from the stream, and compare it with an older frame. The comparison is done in C code (mainly because it is faster than Python).
The C code is called from Python using ctypes. See the code below.
# Load picturecomparer.so and set argument and return types
cmethod = ctypes.CDLL(Paths.CMODULE_LOCATION)
cmethod.compare_pictures.restype = ctypes.c_double
cmethod.compare_pictures.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
The two images to compare are stored on disk. Python passes the paths of both images as arguments to the C code, and the C code returns a value (double): the difference between the two images as a percentage.
# Call the C method to compare the images
difflevel = cmethod.compare_pictures(path1, path2)
The C code looks like this:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#ifndef STB_IMAGE_IMPLEMENTATION
#define STB_IMAGE_IMPLEMENTATION
#include "stb_image.h"
#ifndef STBI_ASSERT
#define STBI_ASSERT(x)
#endif
#endif

#define COLOR_R 0
#define COLOR_G 1
#define COLOR_B 2
#define OFFSET 10

double compare_pictures(const char* path1, const char* path2);

double compare_pictures(const char* path1, const char* path2)
{
    double totalDiff = 0.0, value;
    unsigned int x, y;

    int width1, height1, comps1;
    unsigned char * image1 = stbi_load(path1, &width1, &height1, &comps1, 0);

    int width2, height2, comps2;
    unsigned char * image2 = stbi_load(path2, &width2, &height2, &comps2, 0);

    // Perform some checks to be sure the images are valid
    if (image1 == NULL || image2 == NULL) { return 0; }
    if (width1 != width2 || height1 != height2) { return 0; }

    for (y = 0; y < height1; y++)
    {
        for (x = 0; x < width1; x++)
        {
            // Calculate difference in RED
            value = (int)image1[(x + y * width1) * comps1 + COLOR_R] - (int)image2[(x + y * width2) * comps2 + COLOR_R];
            if (value < OFFSET && value > (OFFSET * -1)) { value = 0; }
            totalDiff += fabs(value) / 255.0;

            // Calculate difference in GREEN
            value = (int)image1[(x + y * width1) * comps1 + COLOR_G] - (int)image2[(x + y * width2) * comps2 + COLOR_G];
            if (value < OFFSET && value > (OFFSET * -1)) { value = 0; }
            totalDiff += fabs(value) / 255.0;

            // Calculate difference in BLUE
            value = (int)image1[(x + y * width1) * comps1 + COLOR_B] - (int)image2[(x + y * width2) * comps2 + COLOR_B];
            if (value < OFFSET && value > (OFFSET * -1)) { value = 0; }
            totalDiff += fabs(value) / 255.0;
        }
    }

    totalDiff = 100.0 * totalDiff / (double)(width1 * height1 * 3);
    return totalDiff;
}
The C code is executed every ~2 seconds. I just noticed that there is a memory leak: after around 10 to 15 minutes my Raspberry Pi has only about 10 MB of RAM left. A few minutes later it crashes and doesn't respond anymore.
I have done some checks to find out what causes this in my project. With the C code disabled, my entire project uses around 30-40 MB of RAM. This project is all my Raspberry Pi has to execute.
Model B: 512 MB RAM, shared between the CPU and GPU.
GPU: 128 MB (/boot/config.txt).
My Linux distro uses: ~60 MB.
So I have ~300 MB left for my project.
Hope someone can point out where it goes wrong, or tell me whether I have to invoke the GC myself, etc.
Thanks in advance.
P.S. I know the image comparison is not the best approach, but it works for me for now.
Since the images are being returned as pointers to buffers, stbi_load must be allocating space for them, and you are not releasing this space before returning, so the memory leak is not surprising.
Check the documentation to see if there is a specific stbi free function, or try adding free(image1); free(image2); before the final return.
Having checked, I can categorically say that you should be calling stbi_image_free(image1); stbi_image_free(image2); before returning.
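A sketch of the fix applied to the function above; note that the early-return checks should also release whatever stbi_load already allocated:

    // Perform some checks to be sure the images are valid,
    // freeing any buffer that was already loaded before bailing out.
    if (image1 == NULL || image2 == NULL || width1 != width2 || height1 != height2)
    {
        if (image1) { stbi_image_free(image1); }
        if (image2) { stbi_image_free(image2); }
        return 0;
    }

    /* ... pixel comparison loop unchanged ... */

    stbi_image_free(image1);
    stbi_image_free(image2);
    return totalDiff;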
I am using the following code, which I found online:
def c_int_binary_search(seq, t):
    # do a little type checking in Python
    assert(type(t) == type(1))
    assert(type(seq) == type([]))

    # now the C code
    code = """
    #line 29 "binary_search.py"
    int val, m, min = 0;
    int max = seq.length() - 1;
    PyObject *py_val;
    for(;;)
    {
        if (max < min)
        {
            return_val = Py::new_reference_to(Py::Int(-1));
            break;
        }
        m = (min + max) / 2;
        val = py_to_int(PyList_GetItem(seq.ptr(), m), "val");
        if (val < t)
            min = m + 1;
        else if (val > t)
            max = m - 1;
        else
        {
            return_val = Py::new_reference_to(Py::Int(m));
            break;
        }
    }
    """
    return inline(code, ['seq', 't'])
The code comes from the SciPy documentation.
When I try to execute this script, I get the following errors:
binary_search.py: In function ‘PyObject* compiled_func(PyObject*, PyObject*)’:
binary_search.py:36:38: error: ‘Py’ has not been declared
I am wondering if someone can guide me on this. I have already installed PyCXX. I am using Ubuntu.
Thanks a lot.
That example is out of date; the Py namespace doesn't exist in recent versions. (scipy.weave itself was later deprecated and removed from SciPy entirely; it only ever supported Python 2.)
Some distributions ship the examples (that should be kept up to date) with scipy. On my machine, there's this:
/usr/lib64/python2.7/site-packages/scipy/weave/examples/binary_search.py
If you don't have something like that, you can download it from the SciPy repository.