MEX equivalent for Python (C wrapper functions) - python

Coming from MATLAB, I am looking for a way to create functions in Python that wrap C functions. I have come across Cython, ctypes and SWIG. My intent is not primarily to improve speed (though that would certainly help).
Could someone recommend a decent solution for this purpose?
Edit: What is the most popular/widely adopted way of doing this?
Thanks.

I've found that weave works pretty well for shorter functions and has a very simple interface.
To give you an idea of just how easy the interface is, here's an example (taken from the PerformancePython website). Notice how multi-dimensional array conversion is handled for you by the converter (in this case Blitz).
from scipy import weave
from scipy.weave import converters

def inlineTimeStep(self, dt=0.0):
    """Takes a time step using inlined C code -- this version uses
    blitz arrays."""
    g = self.grid
    nx, ny = g.u.shape
    dx2, dy2 = g.dx**2, g.dy**2
    dnr_inv = 0.5/(dx2 + dy2)
    u = g.u
    code = """
           #line 120 "laplace.py"  // this directive is only useful for debugging
           double tmp, err, diff;
           err = 0.0;
           for (int i=1; i<nx-1; ++i) {
               for (int j=1; j<ny-1; ++j) {
                   tmp = u(i,j);
                   u(i,j) = ((u(i-1,j) + u(i+1,j))*dy2 +
                             (u(i,j-1) + u(i,j+1))*dx2)*dnr_inv;
                   diff = u(i,j) - tmp;
                   err += diff*diff;
               }
           }
           return_val = sqrt(err);
           """
    # compiler keyword only needed on Windows with MSVC installed
    err = weave.inline(code,
                       ['u', 'dx2', 'dy2', 'dnr_inv', 'nx', 'ny'],
                       type_converters=converters.blitz,
                       compiler='gcc')
    return err
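
If you want something that ships with the standard library, a bare-bones ctypes wrapper might look like the sketch below. The library name and the exported function are hypothetical here, just to show the shape of the wrapper.

import ctypes

# Hypothetical shared library exposing:  double my_add(double a, double b);
lib = ctypes.CDLL("./libmylib.so")
lib.my_add.argtypes = (ctypes.c_double, ctypes.c_double)
lib.my_add.restype = ctypes.c_double

print(lib.my_add(1.5, 2.5))  # -> 4.0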

Related

Pybind11 is slower than Pure Python

I created Python bindings using pybind11. Everything worked, but when I did a speed check the result was disappointing.
Basically, I have a function in C++ that adds two numbers and I want to use that function from a Python script. I also wrapped the call in a for loop that runs 100 times, to better see the difference in processing time.
For the function "imported" from C++ using pybind11, I get: 0.002310514450073242 ~ 0.0034799575805664062
For the simple Python script, I get: 0.0012788772583007812 ~ 0.0015883445739746094
main.cpp file:
#include <pybind11/pybind11.h>

namespace py = pybind11;

double sum(double a, double b) {
    return a + b;
}

PYBIND11_MODULE(SumFunction, var) {
    var.doc() = "pybind11 example module";
    var.def("sum", &sum, "This function adds two input numbers");
}
main.py file:
from build.SumFunction import *
import time

start = time.time()
for i in range(100):
    print(sum(2.3, 5.2))
end = time.time()
print(end - start)
CMakeLists.txt file:
cmake_minimum_required(VERSION 3.0.0)
project(Projectpybind11 VERSION 0.1.0)
include(CTest)
enable_testing()
add_subdirectory(pybind11)
pybind11_add_module(SumFunction main.cpp)
set(CPACK_PROJECT_NAME ${PROJECT_NAME})
set(CPACK_PROJECT_VERSION ${PROJECT_VERSION})
include(CPack)
Simple Python script:
import time

def summ(a, b):
    return a + b

start = time.time()
for i in range(100):
    print(summ(2.3, 5.2))
end = time.time()
print(end - start)
Benchmarking is complicated; you could almost call it systems engineering, because many other processes interfere with the measurement: NIC interrupt handling, keyboard or mouse input, OS scheduling and so on. I have seen one of my producer processes blocked by the OS for up to 15 seconds! So, as others have pointed out, the print() calls add further unnecessary interference.
Your test computation is too simple. You need to be clear about what you are comparing. Passing arguments between Python and C++ is obviously slower than staying on the Python side, so I assume you want to compare the computation speed of the two, not the argument-passing speed. If so, the computation in your code is so trivial that the measured time is mostly argument-passing overhead, and the actual computation is only a minor part of the total. So I put my own sample below; I would be glad to see anyone polish it.
Your loop count is too small. The fewer the loops, the more randomness. As with my first point, a test that runs for only 0.000x seconds can easily be disturbed by the OS. I think a test should run for at least a few seconds.
C++ is not always faster than Python. Nowadays many Python modules and libraries can use the GPU for heavy computation, or parallelize matrix operations even on the CPU alone. I guess you are evaluating whether to use pybind11 in your project. A comparison like this proves little by itself, because the best tool depends on the real requirements, but it is a good way to learn things. I recently ran into a case where Python was faster than C++ in a deep-learning workload.
In the end, running my sample on my PC, I found the C++ computation to be up to about 100 times faster than the Python version. I hope that is helpful for you.
If anyone would like to revise or correct my points, please do. Please forgive my English; I hope I have expressed things correctly.
ComplexCpp.cpp:
#include <cmath>
#include <pybind11/numpy.h>
#include <pybind11/pybind11.h>

namespace py = pybind11;

double Compute( double x, py::array_t<double> ys ) {
    // std::cout << "x:" << std::setprecision( 16 ) << x << std::endl;
    auto r = ys.unchecked<1>();
    for( py::ssize_t i = 0; i < r.shape( 0 ); ++i ) {
        double y = r( i );
        // std::cout << "y:" << std::setprecision( 16 ) << y << std::endl;
        x += y;
        x *= y;
        y = std::max( y, 1.001 );
        x /= y;
        x *= std::log( y );
    }
    return x;
}

PYBIND11_MODULE( ComplexCpp, m ) {
    m.def( "Compute", &Compute, "a more complicated computing" );
}
tryComplexCpp.py:
import ComplexCpp
import math
import numpy as np
import random
import time


def PyCompute(x: float, ys: np.ndarray) -> float:
    # print(f'x:{x}')
    for y in ys:
        # print(f'y:{y}')
        x += y
        x *= y
        y = max(y, 1.001)
        x /= y
        x *= math.log(y)
    return x


LOOPS: int = 100000000

if __name__ == "__main__":
    # initialize random
    x0 = random.random()
    # We store all args in an array, then pass it to both the C++ function and
    # the Python function, to ensure both sides get the same arguments.
    args = np.ndarray(LOOPS, dtype=np.float64)
    for i in range(LOOPS):
        args[i] = random.random()
    print('Args are ready, now start...')
    # try it with C++
    start_time = time.time()
    x = ComplexCpp.Compute(x0, args)
    print(f'Computing with C++ in { time.time() - start_time }.\n')
    # use the result so the whole computation cannot be optimized away
    print(f'The result is {x}\n')
    # try it with Python
    start_time = time.time()
    x = PyCompute(x0, args)
    print(f'Computing with Python in { time.time() - start_time }.\n')
    # use the result so the whole computation cannot be optimized away
    print(f'The result is {x}\n')
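
As a smaller illustration of the first two points (keep print() out of the timed region, and repeat the measurement enough times), here is a sketch using timeit against the SumFunction module from the question; the import path is copied from the question and assumed to be importable.

import timeit

# Assumes the pybind11 module from the question has been built and is importable.
from build.SumFunction import sum as cpp_sum

def py_sum(a, b):
    return a + b

# Repeat each measurement several times and keep the best run;
# printing happens only after timing, outside the measured region.
cpp_best = min(timeit.repeat(lambda: cpp_sum(2.3, 5.2), number=1_000_000, repeat=5))
py_best = min(timeit.repeat(lambda: py_sum(2.3, 5.2), number=1_000_000, repeat=5))
print(f"C++ via pybind11: {cpp_best:.3f} s, pure Python: {py_best:.3f} s")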

ctypes: pass numpy to c++ dll

I am using C++ to implement a DLL, and Python to use it. When I pass a NumPy array to the DLL, the values it prints are wrong.
My C++ code is:
__declspec(dllexport) void image(float * img, int m, int n)
{
    for (int i = 0; i < m*n; i++)
    {
        printf("%d ", img[i]);
    }
}
In the above code, I just receive the array and print it.
Then, my python code to use this dll is:
import ctypes
import numpy as np
lib = ctypes.cdll.LoadLibrary("./bin/Release/dllAPI.dll")
img = np.random.random(size=[3, 3])*10
img = img.astype(np.float)
img_p = img.ctypes.data_as(ctypes.POINTER(ctypes.c_float))
lib.image.argtypes = (ctypes.POINTER(ctypes.c_float), ctypes.c_int, ctypes.c_int)
lib.image.restype = None
print('c++ result is: ')
lib.image(img_p, 3, 3)
print('\n original data is:')
print(img)
The printed information is:
c++ result is:
-536870912 -1073741824 0 1610612736 -1073741824 1073741824 1610612736 536870912 -2147483648
original data is:
[[7.76128455 3.16101652 7.44757958]
[2.32058998 9.96955139 3.26344099]
[9.42976627 1.34360611 8.4006054 ]]
My C++ code prints what look like random numbers, as if it is reading the wrong memory.
My environment is:
win 10
vs 2015 x64
python 3.7
ctypes 1.1.0
How can I pass a NumPy array to C++ via ctypes? Any suggestion is appreciated.
----------------- update ----------------------
The C++ code had a bug in the printf format specifier, so I updated it to:
__declspec(dllexport) void image(float * img, int m, int n)
{
    for (int i = 0; i < m*n; i++)
    {
        printf("%f ", img[i]);
    }
}
However, the printed information is still incorrect:
c++ result is:
12425670907412733350375273403699429376.000000 2.468873 0.000000 2.502769 4435260031901892608.000000 2.458449 -0.000000 2.416936 -4312230798506823897841664.000000
original data is:
[[7.50196262 8.08859399 7.33518741]
[6.67098506 0.04736352 9.5017838 ]
[3.47544102 9.09726041 0.48091646]]
So after a bit of digging I found that np.float is actually an alias for Python's built-in float, which is a 64-bit double (i.e. equivalent to ctypes.c_double), so the array you pass does not match the float* the DLL expects. You should do img.astype(np.float32) (or img.astype(ctypes.c_float)) so that the pointer points to valid 32-bit data. If you want to see this for yourself, try printing img_p[0] in Python.
Note: using the specific ctypes type for the cast is arguably best practice here, because the width behind a given dtype name can differ between platforms, leading to exactly this kind of error (even though both are called "float").
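
For completeness, a minimal sketch of the corrected Python side (the DLL path and function name are taken from the question):

import ctypes
import numpy as np

lib = ctypes.cdll.LoadLibrary("./bin/Release/dllAPI.dll")
lib.image.argtypes = (ctypes.POINTER(ctypes.c_float), ctypes.c_int, ctypes.c_int)
lib.image.restype = None

# float32 matches the C float* the DLL expects; np.float would give 64-bit doubles
img = (np.random.random(size=[3, 3]) * 10).astype(np.float32)
img_p = img.ctypes.data_as(ctypes.POINTER(ctypes.c_float))
lib.image(img_p, 3, 3)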

python - using ctypes and SSE/AVX SOMETIMES segfaults

I'm trying to optimize a piece of Python code using AVX, calling a C++ function through ctypes. Sometimes the function segfaults and sometimes it doesn't. I think it may have something to do with alignment?
Maybe someone can help me with this; I'm kind of stuck here.
Python-Code:
from ctypes import *
import numpy as np

# path_cnt
path_cnt = 16
c_path_cnt = c_int(path_cnt)

# ndarray1
ndarray1 = np.ones(path_cnt, dtype=np.float32, order='C')
ndarray1.setflags(align=1, write=1)
c_ndarray1 = ndarray1.ctypes.data_as(POINTER(c_float))

# ndarray2
ndarray2 = np.ones(path_cnt, dtype=np.float32, order='C')
ndarray2.setflags(align=1, write=1)
c_ndarray2 = ndarray2.ctypes.data_as(POINTER(c_float))

# call function
finance = cdll.LoadLibrary(".../libfin.so")
finance.foobar.argtypes = [c_void_p, c_void_p, c_int]
finance.foobar(c_ndarray1, c_ndarray2, c_path_cnt)

x = 0
while x < path_cnt:
    print(c_ndarray1[x])
    x += 1
C++ Code
extern "C"{
int foobar(float * ndarray1,float * ndarray2,int path_cnt)
{
for(int i=0;i<path_cnt;i=i+8)
{
__m256 arr1 = _mm256_load_ps(&ndarray1[i]);
__m256 arr2 = _mm256_load_ps(&ndarray2[i]);
__m256 add = _mm256_add_ps(arr1,arr2);
_mm256_store_ps(&ndarray1[i],add);
}
return 0;
}
}
And now the odd output behavior: making the same call in the terminal twice gives different results!
tobias@tobias-Lenovo-U310:~/workspace/finance$ python finance.py
Segmentation fault (core dumped)
tobias@tobias-Lenovo-U310:~/workspace/finance$ python finance.py
2.0
2.0
2.0
2.0
2.0
2.0
2.0
2.0
Thanks in advance!
There are aligned and unaligned load instructions. The aligned ones will fault if you violate the alignment rules, but they are faster. The unaligned ones accept any address and do loads/shifts internally to get the data you want. You are using the aligned version, _mm256_load_ps and can just switch to the unaligned version _mm256_loadu_ps without any intermediate allocation.
A good vectorizing compiler will include a lead-in loop to reach an aligned address, then a body to work on aligned data, then a final loop to clean up any stragglers.
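
For what it's worth, you can also check the alignment from the Python side before calling into the library; NumPy does not guarantee 32-byte alignment for its allocations, which would explain why the aligned loads only sometimes fault. A minimal check (array size taken from the question):

import numpy as np

ndarray1 = np.ones(16, dtype=np.float32, order='C')
# _mm256_load_ps requires the address to be a multiple of 32;
# 0 means this particular allocation happens to be safe, anything else will fault.
print(ndarray1.ctypes.data % 32)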
Alright, I think I found a solution. It's not very elegant but it works at least!
There should be a better way; does anyone have any suggestions?
extern "C"{
int foobar(float * ndarray1,float * ndarray2,int path_cnt)
{
float * test = (float*)_mm_malloc(path_cnt*sizeof(float),32);
float * test2 = (float*)_mm_malloc(path_cnt*sizeof(float),32);
//copy to aligned memory(this part is kinda stupid)
for(int i=0;i<path_cnt;i++)
{
test[i] = stock[i];
test2[i] = max_vola[i];
}
for(int i=0;i<path_cnt;i=i+8)
{
__m256 arr1 = _mm256_load_ps(&test1[i]);
__m256 arr2 = _mm256_load_ps(&test2[i]);
__m256 add = _mm256_add_ps(arr1,arr2);
_mm256_store_ps(&test1[i],add);
}
//and copy everything back!
for(int i=0;i<path_cnt;i++)
{
stock[i] = test[i];
}
return 0;
}
}
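
One possibly nicer alternative, sketched below: allocate the buffers with 32-byte alignment on the Python side, so the original aligned loads work without the extra copies. The helper name aligned_array is made up here; the trick is to over-allocate a byte buffer and slice it at an aligned offset.

import numpy as np

def aligned_array(n, dtype=np.float32, alignment=32):
    """Return a zeroed 1-D array of n elements whose data pointer is 32-byte aligned."""
    itemsize = np.dtype(dtype).itemsize
    buf = np.zeros(n * itemsize + alignment, dtype=np.uint8)
    offset = (-buf.ctypes.data) % alignment
    return buf[offset:offset + n * itemsize].view(dtype)

ndarray1 = aligned_array(16)
ndarray1[:] = 1.0
print(ndarray1.ctypes.data % 32)  # 0 -> safe for _mm256_load_ps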

How to use float ** from C in Python?

after having no success with my question on How to use float ** in Python with Swig?, I started thinking that SWIG might not be the weapon of choice. I need bindings for some C functions. One of these functions takes a float**. What would you recommend? ctypes?
Interface file:
extern int read_data(const char *file,int *n_,int *m_,float **data_,int **classes_);
I've used ctypes for several projects now and have been quite happy with the results. I don't think I've personally needed a pointer-to-pointer wrapper yet but, in theory, you should be able to do the following:
from ctypes import *
your_dll = cdll.LoadLibrary("your_dll.dll")
PFloat = POINTER(c_float)
PInt = POINTER(c_int)
p_data = PFloat()
p_classes = PInt()
buff = create_string_buffer(1024)
n1 = c_int( 0 )
n2 = c_int( 0 )
ret = your_dll.read_data( buff, byref(n1), byref(n2), byref(p_data), byref(p_classes) )
print('Data: ', p_data.contents)
print('Classes: ', p_classes.contents)
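
Continuing the snippet above, and assuming read_data allocates the buffers and reports their dimensions through the two int* outputs, you could then wrap the returned pointers in NumPy arrays for convenience (a sketch; adjust the shapes to whatever read_data actually fills in):

import numpy as np

rows, cols = n1.value, n2.value
data = np.ctypeslib.as_array(p_data, shape=(rows * cols,))   # 1-D view over the C buffer
classes = np.ctypeslib.as_array(p_classes, shape=(rows,))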

How to use float ** in Python with Swig?

I am writing SWIG bindings for some C functions. One of these functions takes a float**. I am already using cpointer.i for the normal pointers and looked into carrays.i, but I did not find a way to declare a float**. What do you recommend?
interface file:
extern int read_data(const char *file, int *n_, int *m_, float **data_, int **classes_);
This answer is a repost of one to a related question Framester posted about using ctypes instead of swig. I've included it here in case any web-searches turn up a link to his original question.
I've used ctypes for several projects now and have been quite happy with the results. I don't think I've personally needed a pointer-to-pointer wrapper yet but, in theory, you should be able to do the following:
from ctypes import *
your_dll = cdll.LoadLibrary("your_dll.dll")
PFloat = POINTER(c_float)
PInt = POINTER(c_int)
p_data = PFloat()
p_classes = PInt()
buff = create_string_buffer(1024)
n1 = c_int( 0 )
n2 = c_int( 0 )
ret = your_dll.read_data( buff, byref(n1), byref(n2), byref(p_data), byref(p_classes) )
print('Data: ', p_data.contents)
print('Classes: ', p_classes.contents)
