Pybind11 is slower than Pure Python

I created Python bindings using pybind11. Everything worked perfectly, but when I ran a speed test the result was disappointing.
Basically, I have a function in C++ that adds two numbers, and I want to use that function from a Python script. I also wrapped the call in a for loop that runs 100 times to make the difference in processing time easier to see.
For the function "imported" from C++ using pybind11, I obtain (in seconds): 0.002310514450073242 ~ 0.0034799575805664062
For the simple Python script, I obtain (in seconds): 0.0012788772583007812 ~ 0.0015883445739746094
main.cpp file:
#include <pybind11/pybind11.h>
namespace py = pybind11;
double sum(double a, double b) {
    return a + b;
}
PYBIND11_MODULE(SumFunction, var) {
    var.doc() = "pybind11 example module";
    var.def("sum", &sum, "This function adds two input numbers");
}
main.py file:
from build.SumFunction import *
import time
start = time.time()
for i in range(100):
    print(sum(2.3,5.2))
end = time.time()
print(end - start)
CMakeLists.txt file:
cmake_minimum_required(VERSION 3.0.0)
project(Projectpybind11 VERSION 0.1.0)
include(CTest)
enable_testing()
add_subdirectory(pybind11)
pybind11_add_module(SumFunction main.cpp)
set(CPACK_PROJECT_NAME ${PROJECT_NAME})
set(CPACK_PROJECT_VERSION ${PROJECT_VERSION})
include(CPack)
Simple Python script:
import time
def summ(a,b):
    return a+b
start = time.time()
for i in range(100):
    print(summ(2.3,5.2))
end = time.time()
print(end - start)

Benchmarking is complicated; it could almost be called a branch of systems engineering, because many other activities can interfere with a measurement: NIC interrupt handling, keyboard or mouse input, OS scheduling, and so on.
I have seen one of my production processes blocked by the OS for up to 15 seconds!
So, as the other commenters have pointed out, the print() call adds unnecessary interference.
1. Your test computation is too simple.
You must think clearly about what you are comparing.
Passing arguments between Python and C++ is obviously slower than passing them within Python. So I assume that you want to compare the computing speed of the two, not the argument-passing speed.
If so, your computation is too simple: the time you measure is mostly the time spent passing arguments, and the computation itself is only a small part of the total.
So I have posted my own sample below, and I would be glad to see anyone polish it.
2. Your loop count is too low.
The fewer the iterations, the more random the result. As with point 1, a test that lasts only 0.000x seconds can easily be disturbed by the OS.
I think the test should run for at least a few seconds.
3. C++ is not always faster than Python. Nowadays many Python modules and libraries can run heavy computations on the GPU, or perform matrix operations in parallel on the CPU alone.
I guess that you are evaluating whether to use pybind11 in your project. A comparison like this tells you very little, because the best tool depends on the real requirements, but it is a good learning exercise.
I recently encountered a deep learning case where Python was faster than C++.
Funny, isn't it?
Finally, I ran my sample on my PC and found that the C++ computation was up to 100 times faster than the Python one. I hope this helps.
If anyone would like to revise or correct my opinions, I would welcome it!
Please forgive my poor English; I hope I have expressed things correctly.
ComplexCpp.cpp:
#include <algorithm>  // std::max
#include <cmath>
#include <pybind11/numpy.h>
#include <pybind11/pybind11.h>
namespace py = pybind11;
double Compute( double x, py::array_t<double> ys ) {
    // std::cout << "x:" << std::setprecision( 16 ) << x << std::endl;
    auto r = ys.unchecked<1>();
    for( py::ssize_t i = 0; i < r.shape( 0 ); ++i ) {
        double y = r( i );
        // std::cout << "y:" << std::setprecision( 16 ) << y << std::endl;
        x += y;
        x *= y;
        y = std::max( y, 1.001 );
        x /= y;
        x *= std::log( y );
    }
    return x;
}
PYBIND11_MODULE( ComplexCpp, m ) {
    m.def( "Compute", &Compute, "a more complicated computing" );
}
tryComplexCpp.py:
import ComplexCpp
import math
import numpy as np
import random
import time
def PyCompute(x: float, ys: np.ndarray) -> float:
    #print(f'x:{x}')
    for y in ys:
        #print(f'y:{y}')
        x += y
        x *= y
        y = max(y, 1.001)
        x /= y
        x *= math.log(y)
    return x
LOOPS: int = 100000000
if __name__ == "__main__":
    # initialize random
    x0 = random.random()
    # We store all args in an array, then pass them to both the C++ function
    # and the Python function, to ensure that both sides get the same args.
    args = np.ndarray(LOOPS, dtype=np.float64)
    for i in range(LOOPS):
        args[i] = random.random()
    print('Args are ready, now start...')
    # try it with C++
    start_time = time.time()
    x = ComplexCpp.Compute(x0, args)
    print(f'Computing with C++ in { time.time() - start_time }.\n')
    # use the result so that the computation cannot be optimized away
    print(f'The result is {x}\n')
    # try it with python
    start_time = time.time()
    x = PyCompute(x0, args)
    print(f'Computing with Python in { time.time() - start_time }.\n')
    # use the result so that the computation cannot be optimized away
    print(f'The result is {x}\n')
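The measurement advice above (keep print() out of the timed region, run many iterations, repeat the run) can be sketched with the standard timeit module. This is only an illustration, not part of the original answer: best_time_per_call is a hypothetical helper name, and the iteration counts are arbitrary choices.

```python
import timeit

def summ(a, b):
    return a + b

def best_time_per_call(func, number=100_000, repeat=5):
    """Return the best observed time per call of func(2.3, 5.2), in seconds."""
    # Each of the `repeat` runs calls func `number` times with no printing.
    # Taking the minimum reduces the influence of OS scheduling noise.
    # The lambda adds a small constant overhead to every call, so compare
    # candidates that are measured the same way.
    timings = timeit.repeat(lambda: func(2.3, 5.2), number=number, repeat=repeat)
    return min(timings) / number

per_call = best_time_per_call(summ)
print(f"summ: {per_call * 1e9:.1f} ns per call")
```

Passing the pybind11-bound sum from the question to the same helper should give a per-call figure that can be compared directly; for a function this small, that figure mostly reflects call overhead rather than computation.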

Related

Wrong output of a c function returning a double called from python

I want to speed up some Python code by calling a C function:
I have the function in vanilla Python, sum_and_multiply.py:
def sam_py(lim_sup):
    total = 0
    for i in range(0,lim_sup): # xrange is slower according
        for j in range(1, lim_sup): #to my test but more memory-friendly.
            total += (i / j)
    return total
then I have the equivalent function in C sum_and_multiply_c.c:
#include <stdio.h>
double sam_c(int lim_sup){
    int i;
    int j;
    double total;
    total = 0;
    double div;
    for (i=0; i<lim_sup; i++){
        for (j=1; j<lim_sup; j++){
            div = (double) i / j;
            // printf("div: %.2f\n", div);
            total += div;
            // printf("total: %.2f\n", total);
        }
    }
    printf("total: %.2f\n", total);
    return total;
}
a file script.py which calls the 2 functions
from sum_and_multiply import sam_py
import time
lim_sup = 6000
start = time.time()
print(sam_py(lim_sup))
end = time.time()
time_elapsed01 = end - start
print("time elapsed: %.4fs" % time_elapsed01)
from ctypes import *
my_c_fun = CDLL("sum_and_multiply_c.so")
start = time.time()
print(my_c_fun.sam_c(lim_sup))
end = time.time()
time_elapsed02 = end - start
print("time elapsed: %.4fs" % time_elapsed02)
print("Speedup coefficient: %.2fx" % (time_elapsed01/time_elapsed02))
and finally a shell script, bashscript.zsh, which compiles the C code and then calls script.py
cc -fPIC -shared -o sum_and_multiply_c.so sum_and_multiply_c.c
python script.py
Here is the output:
166951817.45311993
time elapsed: 2.3095s
total: 166951817.45
20
time elapsed: 0.3016s
Speedup coefficient: 7.66x
Here is my question: although the C function calculates the result correctly (it prints 166951817.45 via printf), the value returned to Python is 20, which is wrong. How can I get 166951817.45 instead?
Edit: the problem persists after changing the last part of script.py as follows:
from ctypes import *
my_c_fun = CDLL("sum_and_multiply_c.so")
my_c_fun.restype = c_double
my_c_fun.argtypes = [ c_int ]
start = time.time()
print(my_c_fun.sam_c(lim_sup))
end = time.time()
time_elapsed02 = end - start
print("time elapsed: %.4fs" % time_elapsed02)
print("Speedup coefficient: %.2fx" % (time_elapsed01/time_elapsed02))

You're assuming Python can "see" your function returns a double. But it can't. C doesn't "encode" the return type in anything, so whoever calls a function from a library needs to know its return type, or risk misinterpreting it.
You should have read the documentation of CDLL before using it! If you say this is for the sake of exercise, then that exercise needs to include reading the documentation (that's what good programmers do, no excuses).
class ctypes.CDLL(name, mode=DEFAULT_MODE, handle=None, use_errno=False, use_last_error=False)
Instances of this class represent loaded shared libraries. Functions in these libraries use the standard C calling convention, and are assumed to return int.
(emphasis mine.)
https://docs.python.org/2.7/library/ctypes.html#return-types is your friend (and the top of the page will tell you that Python2 is dead and you shouldn't use it, even if you insist on it. I'm sure you have a better reason than the Python developers themselves!).
my_c_fun = CDLL("sum_and_multiply_c.so")
sam_c = my_c_fun.sam_c
sam_c.restype = c_double
sam_c.argtypes = [ c_int ]
value = sam_c(6000)
print(value)
is the way to go.
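To try the same pattern without compiling anything, here is a small sketch that uses sqrt from the system C math library in place of sam_c. It assumes a POSIX system where that library can be located, and the name c_sqrt is only illustrative. It also shows why the edit in the question had no effect: assigning restype on the library object just stores a plain Python attribute, while the declaration has to go on the function object.

```python
import ctypes
import ctypes.util

# Assumption: POSIX system. Fall back to the symbols already loaded into the
# running process if the math library cannot be located by name.
name = ctypes.util.find_library("m")
libm = ctypes.CDLL(name) if name else ctypes.CDLL(None)

# This only sets an attribute on the library object; it does not change how
# any function in the library is called.
libm.restype = ctypes.c_double

c_sqrt = libm.sqrt                    # the function object itself
c_sqrt.restype = ctypes.c_double      # return type declared on the function
c_sqrt.argtypes = [ctypes.c_double]   # argument types declared on the function

print(c_sqrt(2.25))
```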

How to use syncthreads in CUDA for a scan algorithm (Hillis-Steele)

I'm trying to implement a scan algorithm (Hillis-Steele) and I'm having some trouble understanding how to do it properly on CUDA. This is a minimal example using pyCUDA:
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
from pycuda.compiler import SourceModule
#compile cuda code
mod = SourceModule('''
__global__ void scan(int * addresses){
    for(int idx=1; idx <= threadIdx.x; idx <<= 1){
        int new_value = addresses[threadIdx.x] + addresses[threadIdx.x - idx];
        __syncthreads();
        addresses[threadIdx.x] = new_value;
    }
}
''')
func = mod.get_function("scan")
#Initialize an array with 1's
addresses_h = np.full((896,), 1, dtype='i4')
addresses_d = cuda.to_device(addresses_h)
#Launch kernel and copy back the result
threads_x = 896
func(addresses_d, block=(threads_x, 1, 1), grid=(1, 1))
addresses_h = cuda.from_device(addresses_d, addresses_h.shape, addresses_h.dtype)
# Check the result is correct
for i, n in enumerate(addresses_h):
    assert i+1 == n
My question is about __syncthreads(). As you can see, I'm calling __syncthreads() inside the for loop and not every thread will execute that code the same number of times:
ThreadID - Times it will execute for loop
      0 :  0 times
      1 :  1 time
  2-  3 :  2 times
  4-  7 :  3 times
  8- 15 :  4 times
 16- 31 :  5 times
 32- 63 :  6 times
 64-127 :  7 times
128-255 :  8 times
256-511 :  9 times
512-895 : 10 times
There can be threads in the same warp with a different number of calls to __syncthreads(). What will happen in that case? How can threads that are not executing the same code be synchronized?
In the sample code, we start with an array full of 1's and the output holds index+1 at each position. It is computing the correct answer. Is that "by chance", or is the code correct?
If this is not a proper use of __syncthreads(), how could I implement such an algorithm using CUDA?

If this is not a proper use of syncthreads, how could I implement such algorithm using cuda?
One typical approach is to separate the conditional code from the __syncthreads() calls. Use the conditional code to determine what threads will participate.
Here's a simple transformation of your posted code, that should give the same result, without any violations (i.e. all threads will participate in every __syncthreads() operation):
mod = SourceModule('''
__global__ void scan(int * addresses){
    for(int i=1; i < blockDim.x; i <<= 1){
        int new_value;
        if (threadIdx.x >= i) new_value = addresses[threadIdx.x] + addresses[threadIdx.x - i];
        __syncthreads();
        if (threadIdx.x >= i) addresses[threadIdx.x] = new_value;
    }
}
''')
I'm not suggesting this is a complete and proper scan, or that it is optimal, or anything of the sort. I'm simply showing how your code can be transformed to avoid the violation inherent in what you have.
If you want to learn more about scan methods, this is one source. But if you actually need a scan operation, I would suggest using thrust or cub.
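To see what the barrier separates, the same step structure can be simulated on the CPU in plain Python. This is a sketch under my own assumptions (one list element per "thread", the hypothetical name hillis_steele_scan), not code from the answer: every step runs the same number of times for all elements, and only the update is conditional.

```python
def hillis_steele_scan(values):
    """Inclusive prefix sum, simulating one thread per element."""
    data = list(values)
    n = len(data)
    offset = 1
    while offset < n:
        # Read phase: every element with index >= offset computes its new
        # value from the data as it stood at the start of this step.
        new_values = [
            data[i] + data[i - offset] if i >= offset else data[i]
            for i in range(n)
        ]
        # On a GPU the barrier sits here: no thread may write until all
        # reads of this step are finished.
        data = new_values  # write phase
        offset <<= 1
    return data

print(hillis_steele_scan([3, 1, 4, 1, 5]))
```

Because the simulation builds a fresh list at each step, reads and writes can never overlap. An in-place kernel would, as far as I can tell, also need a barrier after the write so that the next step's reads see the updated values.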

Roots of Legendre Polynomials c++

I'm writing a program to find the roots of nth order Legendre Polynomials using c++; my code is attached below:
double* legRoots(int n)
{
    double myRoots[n];
    double x, dx, Pi = atan2(1,1)*4;
    int iters = 0;
    double tolerance = 1e-20;
    double error = 10*tolerance;
    int maxIterations = 1000;
    for(int i = 1; i<=n; i++)
    {
        x = cos(Pi*(i-.25)/(n+.5));
        do
        {
            dx -= legDir(n,x)/legDif(n,x);
            x += dx;
            iters += 1;
            error = abs(dx);
        } while (error>tolerance && iters<maxIterations);
        myRoots[i-1] = x;
    }
    return myRoots;
}
This assumes the existence of working functions that evaluate the Legendre polynomial and its derivative. I do have them, but I thought including them would make for unreadable walls of code. The function works in the sense that it returns an array of calculated values, but they're wildly off; it outputs the following:
3.95253e-323
6.94492e-310
6.95268e-310
6.42285e-323
4.94066e-323
2.07355e-317
where an equivalent function I've written in Python gives the following:
[-0.90617985 -0.54064082 0. 0.54064082 0.90617985]
I was hoping another set of eyes could help me see what in my C++ code is causing the values to be wildly off. I'm not doing anything different in my Python code than I'm doing in C++, so any help anyone could give on this is greatly appreciated, thanks. For reference, I'm mostly trying to emulate the method found on Rosetta Code for Gaussian quadrature: http://rosettacode.org/wiki/Numerical_integration/Gauss-Legendre_Quadrature.

You are returning the address of a local variable that lives on the stack; it no longer exists once the function returns.
{
    double myRoots[n];
    ...
    return myRoots; // Not a safe thing to do
}
I suggest changing your function definition to
void legRoots(int n, double *myRoots)
omitting the return statement, and defining myRoots before calling the function
double myRoots[10];
legRoots(10, myRoots);
Option 2 would be to allocate myRoots dynamically with new or malloc.
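Separately from the array-lifetime fix, the Newton iteration itself can be checked against a short Python sketch. This is not the asker's Python code; the function names are hypothetical, and the tolerance of 1e-14 is my own choice, since 1e-20 is below what a double can resolve for values of this size. In the sketch the step dx is assigned fresh on every iteration and the iteration budget restarts for every root, two points that look worth comparing against the C++ loop in the question.

```python
import math

def legendre_and_derivative(n, x):
    """Return (P_n(x), P_n'(x)) using the three-term recurrence."""
    if n == 0:
        return 1.0, 0.0
    p_prev, p = 1.0, x  # P_0 and P_1
    for k in range(1, n):
        # (k + 1) P_{k+1} = (2k + 1) x P_k - k P_{k-1}
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    # P_n'(x) = n (x P_n - P_{n-1}) / (x^2 - 1), valid for |x| != 1
    dp = n * (x * p - p_prev) / (x * x - 1.0)
    return p, dp

def legendre_roots(n, tolerance=1e-14, max_iterations=100):
    """Return the n roots of P_n, found by Newton's method."""
    roots = []
    for i in range(1, n + 1):
        # Same initial guess as in the question.
        x = math.cos(math.pi * (i - 0.25) / (n + 0.5))
        for _ in range(max_iterations):  # fresh iteration budget per root
            p, dp = legendre_and_derivative(n, x)
            dx = -p / dp                 # Newton step, assigned each time
            x += dx
            if abs(dx) < tolerance:
                break
        roots.append(x)
    return roots

print(legendre_roots(5))
```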

MEX equivalent for Python ( C wrapper functions)

Coming from MATLAB, I am looking for some way to create functions in Python which are derived from wrapping C functions. I came across Cython, ctypes, SWIG. My intent is not to improve speed by any factor (it would certainly help though).
Could someone recommend a decent solution for this purpose?
Edit: What's the most popular/adopted way of doing this job?
Thanks.

I've found that weave works pretty well for shorter functions and has a very simple interface.
To give you an idea of just how easy the interface is, here's an example (taken from the PerformancePython website). Notice how multi-dimensional array conversion is handled for you by the converter (in this case Blitz).
from scipy import weave
from scipy.weave import converters
def inlineTimeStep(self, dt=0.0):
    """Takes a time step using inlined C code -- this version uses
    blitz arrays."""
    g = self.grid
    nx, ny = g.u.shape
    dx2, dy2 = g.dx**2, g.dy**2
    dnr_inv = 0.5/(dx2 + dy2)
    u = g.u
    code = """
           #line 120 "laplace.py" (This is only useful for debugging)
           double tmp, err, diff;
           err = 0.0;
           for (int i=1; i<nx-1; ++i) {
               for (int j=1; j<ny-1; ++j) {
                   tmp = u(i,j);
                   u(i,j) = ((u(i-1,j) + u(i+1,j))*dy2 +
                             (u(i,j-1) + u(i,j+1))*dx2)*dnr_inv;
                   diff = u(i,j) - tmp;
                   err += diff*diff;
               }
           }
           return_val = sqrt(err);
           """
    # compiler keyword only needed on windows with MSVC installed
    err = weave.inline(code,
                       ['u', 'dx2', 'dy2', 'dnr_inv', 'nx', 'ny'],
                       type_converters=converters.blitz,
                       compiler = 'gcc')
    return err
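When wrapping C code like this, it helps to keep a plain-Python version of the same loop around as a reference to test the wrapped function against. The sketch below is my own transcription of the inlined loop, using nested lists instead of a Blitz array; time_step is a hypothetical name and is not part of weave.

```python
import math

def time_step(u, dx2, dy2):
    """One in-place sweep over the interior of grid u; returns the error norm."""
    nx, ny = len(u), len(u[0])
    dnr_inv = 0.5 / (dx2 + dy2)
    err = 0.0
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            tmp = u[i][j]
            # Same update as the C code: weighted average of the neighbours.
            u[i][j] = ((u[i - 1][j] + u[i + 1][j]) * dy2 +
                       (u[i][j - 1] + u[i][j + 1]) * dx2) * dnr_inv
            diff = u[i][j] - tmp
            err += diff * diff
    return math.sqrt(err)

grid = [[1.0, 1.0, 1.0],
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0]]
print(time_step(grid, 1.0, 1.0), grid[1][1])
```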

Python Fast Input Output Using Buffer Competitive Programming

I have seen people using a buffer in different languages for fast input/output in online judges. For example, this problem, http://www.spoj.pl/problems/INTEST/, is solved in C like this:
#include <stdio.h>
#define size 50000
int main (void){
    unsigned int n=0,k,t;
    char buff[size];
    unsigned int divisible=0;
    int block_read=0;
    int j;
    t=0;
    /* t and k are unsigned int, so the matching conversion is %u */
    scanf("%u %u\n",&t,&k);
    while(t){
        block_read =fread(buff,1,size,stdin);
        for(j=0;j<block_read;j++){
            if(buff[j]=='\n'){
                t--;
                if(n%k==0){
                    divisible++;
                }
                n=0;
            }
            else{
                n = n*10 + (buff[j] - '0');
            }
        }
    }
    printf("%u",divisible);
    return 0;
}
How can this be done with python?

import sys
file = sys.stdin
size = 50000
t = 0
while(t != 0):
    block_read = file.read(size)
    ...
    ...
Most probably this will not increase performance, though: Python is an interpreted language, so you basically want to spend as much time as possible in native code (the standard library's input/parsing routines in this case).
TL;DR: either use built-in routines to parse integers, or get some sort of third-party library that is optimized for speed.
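One way to follow that advice is to read the whole input in a single call and let the built-in split() and int() do the parsing. The sketch below is an illustration, not a verified accepted solution for the judge; count_divisible is a hypothetical name, and the input layout (a line with t and k, followed by t numbers) is taken from the C program in the question.

```python
def count_divisible(data: bytes) -> int:
    """Count how many of the t numbers after the header are divisible by k."""
    tokens = data.split()            # one native call splits on all whitespace
    t, k = int(tokens[0]), int(tokens[1])
    return sum(1 for tok in tokens[2:2 + t] if int(tok) % k == 0)

sample = b"7 3\n1\n51\n966369\n7\n9\n999996\n11\n"
print(count_divisible(sample))
```

In an actual submission the argument would be sys.stdin.buffer.read(), which returns the entire input as bytes in one read.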

I tried solving this one in Python 3 and couldn't get it to work no matter how I tried reading the input. I then switched to running it under Python 2.5 so I could use
import psyco
psyco.full()
After making that change I was able to get it to work by simply reading input from sys.stdin one line at a time in a for loop. I read the first line using raw_input() and parsed the values of n and k, then used the following loop to read the remainder of the input.
for line in sys.stdin:
    count += not int(line) % k
