I read the answer to the question "How to profile cython functions line-by-line", but I can't seem to get it to work with my setup.
I have a cumsum.pyx file:
# cython: profile=True
# cython: linetrace=True
# cython: binding=True
DEF CYTHON_TRACE = 1
def cumulative_sum(int n):
    cdef int s=0, i
    for i in range(n):
        s += i
    return s
I compiled it with:
cython cumsum.pyx
gcc cumsum.c $(pkg-config --cflags --libs python3) -o cumsum.so -shared -fPIC
Then I tried to profile it in ipython:
%load_ext line_profiler
from cumsum import cumulative_sum
%lprun -f cumulative_sum cumulative_sum(100)
I don't get an error message, only an empty profile:
Timer unit: 1e-06 s
Total time: 0 s
File: cumsum.pyx
Function: cumulative_sum at line 6
Line # Hits Time Per Hit % Time Line Contents
==============================================================
6 def cumulative_sum(int n):
7 cdef int s=0, i
8 for i in range(n):
9 s += i
10
11 return s
How can I get this to work?
PS: I use CMake, not setup.py, so I would appreciate a build system agnostic solution
The Cython documentation on "Profiling" already includes an example of how to set the CYTHON_TRACE macro:
# distutils: define_macros=CYTHON_TRACE_NOGIL=1
instead of your DEF CYTHON_TRACE = 1.
It worked when I compiled it using %%cython:
%load_ext cython
%%cython
# cython: profile=True
# cython: linetrace=True
# cython: binding=True
# distutils: define_macros=CYTHON_TRACE_NOGIL=1
def cumulative_sum(int n):
    cdef int s=0, i
    for i in range(n):
        s += i
    return s
And showed the profiling:
%load_ext line_profiler
%lprun -f cumulative_sum cumulative_sum(100)
[...]
Line # Hits Time Per Hit % Time Line Contents
==============================================================
7 def cumulative_sum(int n):
8 1 8 8.0 3.5 cdef int s=0, i
9 1 3 3.0 1.3 for i in range(n):
10 100 218 2.2 94.4 s += i
11 1 2 2.0 0.9 return s
Turns out the issue was that DEF CYTHON_TRACE = 1 doesn't actually set the right constant.
Workarounds include:
1. MSeifert's answer, using distutils
2. Changing the gcc line to
gcc cumsum.c $(pkg-config --cflags --libs python3) -o cumsum.so -shared -fPIC -DCYTHON_TRACE=1
3. Making an extra header trace.h and setting the constant there
#define CYTHON_TRACE 1
along with adding the following to cumsum.pyx
cdef extern from "trace.h":
    pass
4. With CMake, adding
add_definitions(-DCYTHON_TRACE)
Related
I am trying to find out why the same code runs 25 times slower from Python than from C, even though I use CDLL, when I write to I2C. Below I describe in detail what I am doing, step by step.
The version of Raspberry PI: Raspberry PI 3 Model B
OS: Raspbian Buster Lite Version:July 2019
GCC version: gcc (Raspbian 8.3.0-6+rpi1) 8.3.0
Python version: Python 3.7.3
The device I am working with through I2C is an MCP23017. All I do is write 0 and 1 to pin B0. Here is my code written in C:
// test1.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/i2c-dev.h>
#include <time.h>
int init() {
    int fd = open("/dev/i2c-1", O_RDWR);
    ioctl(fd, I2C_SLAVE, 0x20);
    return fd;
}

void deinit(int fd) {
    close(fd);
}

void makewrite(int fd, int v) {
    char buffer[2] = { 0x13, 0x00 };
    buffer[1] = v;
    write(fd, buffer, 2);
}

void mytest() {
    clock_t tb, te;
    int n = 1000;
    int fd = init();

    tb = clock();
    int v = 1;
    for (int i = 0; i < n; i++) {
        makewrite(fd, v);
        v = 1 - v;
    }
    te = clock();

    printf("Time: %.3lf ms\n", (double)(te - tb) / n / CLOCKS_PER_SEC * 1e3);
    deinit(fd);
}

int main() {
    mytest();
    return 0;
}
I compile and run it with the command:
gcc test1.c -o test1 && ./test1
It gives me the result:
pi@raspberrypi:~/dev/i2c_example $ gcc test1.c -o test1 && ./test1
Time: 0.020 ms
I may conclude that writing to the pin takes 0.02 milliseconds.
After that I create an .so file so that I can access these functions from my Python script:
gcc -c -fPIC test1.c -o test1.o && gcc test1.o -shared -o test1.so
And my Python script to test:
# test1.py
import ctypes
from time import time
test1so = ctypes.CDLL("/home/pi/dev/i2c_example/test1.so")
test1so.mytest()
n = 1000
fd = test1so.init()
tb = time()
v = 1
for _ in range(n):
    test1so.makewrite(fd, v)
    v = 1 - v
te = time()
print("Time: {:.3f} ms".format((te - tb) / n * 1e3))
test1so.deinit(fd)
This gives me the result:
pi@raspberrypi:~/dev/i2c_example $ python test1.py
Time: 0.021 ms
Time: 0.516 ms
I cannot understand why the call to makewrite is 25 times slower from Python, even though I am calling the same C code. I also found that if I comment out write(fd, buffer, 2); in test1.c, or change fd to 1, the times reported by the C and Python programs are comparable; there is no such huge difference.
// in test1.c
write(fd, buffer, 2); -> write(1, buffer, 2);
Running the C program:
pi@raspberrypi:~/dev/i2c_example $ gcc test1.c -o test1 && ./test1
...Time: 0.012 ms
Running the Python program:
pi@raspberrypi:~/dev/i2c_example $ python3 test1.py
...Time: 0.009 ms
...Time: 0.021 ms
It confused me a lot. Can anybody tell me why does it happen and how can I improve my performance in Python regarding the access via I2C using C-DLL?
Summary:
Descriptor: 1 (stdout)
Execution time of makewrite in pure C: 0.009 ms
Execution time of makewrite called as a C-DLL function from Python: 0.021 ms
This result is expected; the difference is not large and can be explained by the Python loop and its statements being less efficient than their C counterparts, which increases the execution time.
Descriptor: I2C
Execution time of makewrite in pure C: 0.021 ms
Execution time of makewrite called as a C-DLL function from Python: 0.516 ms
After switching the file descriptor to I2C, the execution time in pure C increased by around 0.012 ms, so I would expect the execution time when calling from Python to be 0.021 ms + 0.012 ms = 0.033 ms, since all the changes are inside makewrite and Python should know nothing about them (they are packed in the .so file). But I get 0.516 ms instead of 0.033 ms, which confuses me.
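One methodological point worth checking before attributing the whole gap to ctypes overhead: the C benchmark uses clock(), which counts CPU time, while the Python benchmark uses time.time(), which counts wall-clock time. A write() that blocks on the I2C bus costs wall time but almost no CPU time, so the two numbers are not directly comparable. Below is a minimal sketch (run with python3; it reuses the test1.so path and function names from above) that times the same ctypes loop with both a wall-clock and a CPU-time timer on the Python side:
# compare_clocks.py -- hedged sketch: time the ctypes loop with a wall-clock
# timer and a CPU-time timer, to see how much of the per-call cost is spent
# blocking in the kernel's I2C write rather than executing instructions.
import ctypes
import time

test1so = ctypes.CDLL("/home/pi/dev/i2c_example/test1.so")

n = 1000
fd = test1so.init()

wall0 = time.perf_counter()   # wall-clock time
cpu0 = time.process_time()    # CPU time of this process (closer to C's clock())
v = 1
for _ in range(n):
    test1so.makewrite(fd, v)
    v = 1 - v
wall1 = time.perf_counter()
cpu1 = time.process_time()

print("wall-clock per call: {:.3f} ms".format((wall1 - wall0) / n * 1e3))
print("CPU time per call:   {:.3f} ms".format((cpu1 - cpu0) / n * 1e3))

test1so.deinit(fd)
If the wall-clock figure is much larger than the CPU-time figure, most of the 0.516 ms is spent waiting on the bus rather than in the ctypes call machinery, and the C and Python measurements above were simply measuring different things.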
I want to create a .so file from Python and call it from C.
To do this I used Cython to convert .pyx to .so.
## print_me.pyx
cimport numpy as cnp
import numpy as np
cimport cython
cpdef public char* print_me(f):
    # I know this numpy line does nothing
    cdef cnp.ndarray[cnp.complex128_t, ndim=3] a = np.zeros((3,3,3), dtype=np.complex128)
    return f
Then I used setup.py to actually convert .pyx to .so
## setup.py
from distutils.core import setup
from Cython.Build import cythonize
import numpy as np
setup(
    ext_modules=cythonize("print_me.pyx"),
    include_dirs=[np.get_include()]
)
By running the following command, I was able to create the .so file:
python setup.py build_ext --inplace
When I tried to load the .so file using the following C code, I got a segmentation fault.
/* toloadso.c */
#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>
#include <time.h>
#include <python2.7/Python.h>
int main(void)
{
    // define function
    void *handle;
    char* (*print_me)(PyObject*);
    char *error;
    PyObject* filename = PyString_FromString("hello");

    // load so file
    handle = dlopen("./print_me.so", RTLD_LAZY);
    if (!handle) {
        fprintf(stderr, "%s\n", dlerror());
        exit(EXIT_FAILURE);
    }
    dlerror();

    // get function handler from so file
    print_me = (char* (*)(PyObject*))dlsym(handle, "print_me");

    // check if handler got error
    error = dlerror();
    if (error != NULL) {
        fprintf(stderr, "%s\n", error);
        exit(EXIT_FAILURE);
    }

    // execute loaded function
    printf("%s\n", (char*)(*print_me)(filename));

    dlclose(handle);
    exit(EXIT_SUCCESS);
}
I compiled this .c file with following command:
gcc -fPIC -I/usr/include/ -o toloadso toloadso.c -lpython2.7 -ldl
(It compiled without error or warning)
When I tried to run this code, I got a segmentation fault:
[root@localhost ~]# ./toloadso
Segmentation fault
Segmentation fault
If I comment out the following line in print_me.pyx
cdef cnp.ndarray[cnp.complex128_t, ndim=3] a = np.zeros((3,3,3), dtype=np.complex128)
My C code runs without error, but once I uncomment this line, it does not work.
I think that trying to use numpy in cython generates an error somehow.
How can I fix it??
I thank you so much for your reply
You must initialize the numpy C API by calling import_array().
Add this line to your cython file:
cnp.import_array()
And as pointed out by @user4815162342 and @DavidW in the comments, you must call Py_Initialize() and Py_Finalize() in main().
Thank you for your help first. I got some useful information, even though it did not directly solve my problem.
Following the advice of others, rather than loading print_me from the .so file with dlopen, I decided to link against it and call it directly from C. This is what I did.
# print_me.pyx
import numpy as np
cimport numpy as np
np.import_array()
cdef public char* print_me(f):
    cdef int[2][4] ll = [[1, 2, 3, 4], [5,6,7,8]]
    cdef np.ndarray[np.int_t, ndim=2] nll = np.zeros((4, 6), dtype=np.int)
    print nll
    nll += 1
    print nll
    return f + str(ll[1][0])
This is my .c file
// main.c
#include <python2.7/Python.h>
#include "print_me.h"
int main()
{
    // initialize python
    Py_Initialize();
    PyObject* filename = PyString_FromString("hello");
    initsquare_number();
    //initprint_me();

    // call python-oriented function
    printf("%s\n", print_me(filename));

    // finalize python
    Py_Finalize();
    return 0;
}
I compiled them as follows:
# to generate print_me.c and print_me.h
cython print_me.pyx
# to build main.c and print_me.c into main.o and print_me.o
cc -c main.c print_me.c -I/usr/include/python2.7 -I/usr/lib64/python2.7/site-packages/numpy/core/include
# to link .o files
cc -lpython2.7 -ldl main.o print_me.o -o main
# execute main
./main
This gives the following result:
[[0 0 0 0 0 0]
[0 0 0 0 0 0]
[0 0 0 0 0 0]
[0 0 0 0 0 0]]
[[1 1 1 1 1 1]
[1 1 1 1 1 1]
[1 1 1 1 1 1]
[1 1 1 1 1 1]]
hello5
Thank you for all of your help again!! :)
I was trying to measure the performance of python dictionaries, cythonized python dictionaries and cythonized cpp std::unordered_map, doing only an init procedure. Since the cythonized cpp code is compiled, I thought it should be faster than the pure python version. I did a test using 4 different scenario/notation options:
Cython CPP code using std::unordered_map and Cython book notation (defining a pair and using insert method)
Cython CPP code using std::unordered_map and python notation (map[key] = value)
Cython code (typed code) using python dictionaries (map[key] = value)
Pure python code
I was expecting to see the cython code outperform the pure python code, but in this case there is no improvement. What could be the reason? I'm using Cython 0.22, python 3.4 and g++ 4.8.
I got these execution times (seconds) using timeit:
Cython CPP book notation -> 15.696417249999968
Cython CPP python notation -> 16.481350984999835
Cython python notation -> 18.585355018999962
Pure python -> 18.162724677999904
The code is below; you can build and run it with:
cython -a map_example.pyx
python3 setup_map.py build_ext --inplace
python3 use_map_example.py
map_example.pyx
from libcpp.unordered_map cimport unordered_map
from libcpp.pair cimport pair
cpdef int example_cpp_book_notation(int limit):
    cdef unordered_map[int, int] mapa
    cdef pair[int, int] entry
    cdef int i
    for i in range(limit):
        entry.first = i
        entry.second = i
        mapa.insert(entry)
    return 0

cpdef int example_cpp_python_notation(int limit):
    cdef unordered_map[int, int] mapa
    cdef pair[int, int] entry
    cdef int i
    for i in range(limit):
        mapa[i] = i
    return 0

cpdef int example_ctyped_notation(int limit):
    mapa = {}
    cdef int i
    for i in range(limit):
        mapa[i] = i
    return 0
setup_map.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
from Cython.Distutils import build_ext
import os
os.environ["CC"] = "g++"
os.environ["CXX"] = "g++"
modules = [Extension("map_example",
                     ["map_example.pyx"],
                     language = "c++",
                     extra_compile_args=["-std=c++11"],
                     extra_link_args=["-std=c++11"])]

setup(name="map_example",
      cmdclass={"build_ext": build_ext},
      ext_modules=modules)
use_map_example.py
import map_example
C_MAXV = 100000000
C_NUMBER = 10
def cython_cpp_book_notation():
    x = 1
    while(x<C_MAXV):
        map_example.example_cpp_book_notation(x)
        x *= 10

def cython_cpp_python_notation():
    x = 1
    while(x<C_MAXV):
        map_example.example_cpp_python_notation(x)
        x *= 10

def cython_ctyped_notation():
    x = 1
    while(x<C_MAXV):
        map_example.example_ctyped_notation(x)
        x *= 10

def pure_python():
    x = 1
    while(x<C_MAXV):
        map_a = {}
        for i in range(x):
            map_a[i] = i
        x *= 10
    return 0
if __name__ == '__main__':
    import timeit
    print("Cython CPP book notation")
    print(timeit.timeit("cython_cpp_book_notation()", setup="from __main__ import cython_cpp_book_notation", number=C_NUMBER))
    print("Cython CPP python notation")
    print(timeit.timeit("cython_cpp_python_notation()", setup="from __main__ import cython_cpp_python_notation", number=C_NUMBER))
    print("Cython python notation")
    print(timeit.timeit("cython_ctyped_notation()", setup="from __main__ import cython_ctyped_notation", number=C_NUMBER))
    print("Pure python")
    print(timeit.timeit("pure_python()", setup="from __main__ import pure_python", number=C_NUMBER))
My timings from your code (after correcting that python *10 indent :) ) are
Cython CPP book notation
21.617647969018435
Cython CPP python notation
21.229907534987433
Cython python notation
24.44413448998239
Pure python
23.609809526009485
Basically everyone is in the same ballpark, with a modest edge for the CPP versions.
Nothing special about my machine: the usual Ubuntu 14.10, Cython 0.20.2, Python 3.4.2.
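If you want to check whether the unordered_map side is paying mostly for rehashing as the table grows, one variation worth trying is pre-sizing the table. This is a hedged sketch, not a measured result: it assumes your Cython's libcpp.unordered_map declaration exposes reserve() (recent Cython versions declare it; if yours does not, you can declare the method yourself in a cdef extern block), and whether it actually moves the numbers is something to measure on your machine.
# hypothetical extra benchmark for map_example.pyx: same loop as
# example_cpp_python_notation, but the bucket array is pre-sized so the
# map never has to rehash while inserting.
from libcpp.unordered_map cimport unordered_map

cpdef int example_cpp_reserved(int limit):
    cdef unordered_map[int, int] mapa
    cdef int i
    mapa.reserve(limit)      # allocate buckets for 'limit' elements up front
    for i in range(limit):
        mapa[i] = i
    return 0
Python's dict also resizes as it grows, so similar timings even with this change would not be surprising; in all four variants the benchmark is dominated by hashing and memory traffic rather than interpreter overhead.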
First, I know that there are many similarly themed questions on SO, but I can't find a solution after a day of searching, reading, and testing.
I have a python function which calculates the pairwise correlations of a numpy ndarray (m x n). I was originally doing this purely in numpy, but the function also computed the reciprocal pairs (i.e. as well as calculating the correlation between rows A and B of the matrix, it calculated the correlation between rows B and A too). So I took a slightly different approach that is about twice as fast for matrices of large m (realistic sizes for my problem are m ~ 8000).
This was great but still a tad slow, as there will be many such matrices, and to do them all will take a long time. So I started investigating cython as a way to speed things up. I understand from what I've read that cython won't really speed up numpy all that much. Is this true, or is there something I am missing?
I think the bottlenecks below are np.sqrt, np.dot, the call to the ndarray's .T method and np.absolute. I've seen people use sqrt from libc.math to replace np.sqrt, so I suppose my first question is, are there similar functions for the other methods in libc.math that I can use? I am afraid that I am completely and utterly unfamiliar with C/C++/C# or any of the C family languages, so this typing and cython business are very new territory to me; apologies if the reason/solution is obvious.
Failing that, any ideas about what I could do to get some performance gains?
Below are my pyx code, the setup code, and the call to the pyx function. I don't know if it's important, but when I call python setup.py build_ext --inplace it works, but there are a lot of warnings which I don't really understand. Could these also be a reason why I am not seeing a speed improvement?
Any help is very much appreciated, and sorry for the super long post.
setup.py
from distutils.core import setup
from distutils.extension import Extension
import numpy
from Cython.Distutils import build_ext
setup(
    cmdclass = {'build_ext': build_ext},
    ext_modules = [Extension("calcBrownCombinedP",
                             ["calcBrownCombinedP.pyx"],
                             include_dirs=[numpy.get_include()])]
)
and the output of setup:
>python setup.py build_ext --inplace
running build_ext
cythoning calcBrownCombinedP.pyx to calcBrownCombinedP.c
building 'calcBrownCombinedP' extension
C:\Anaconda\Scripts\gcc.bat -DMS_WIN64 -mdll -O -Wall -IC:\Anaconda\lib\site-packages\numpy\core\include -IC:\Anaconda\include -IC:\Anaconda\PC -c calcBrownCombinedP.c -o build\temp.win-amd64-2.7\Release\calcbrowncombinedp.o
In file included from C:\Anaconda\lib\site-packages\numpy\core\include/numpy/ndarraytypes.h:1728:0,
from C:\Anaconda\lib\site-packages\numpy\core\include/numpy/ndarrayobject.h:17,
from C:\Anaconda\lib\site-packages\numpy\core\include/numpy/arrayobject.h:15,
from calcBrownCombinedP.c:340:
C:\Anaconda\lib\site-packages\numpy\core\include/numpy/npy_deprecated_api.h:8:9: note: #pragma message: C:\Anaconda\lib\site-packages\numpy\core\include/numpy/npy_deprecated_api.h(8) : Warning Msg: Using deprecated NumPy API, disable it by #defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
calcBrownCombinedP.c: In function '__Pyx_RaiseTooManyValuesError':
calcBrownCombinedP.c:4473:18: warning: unknown conversion type character 'z' in format [-Wformat]
calcBrownCombinedP.c:4473:18: warning: too many arguments for format [-Wformat-extra-args]
calcBrownCombinedP.c: In function '__Pyx_RaiseNeedMoreValuesError':
calcBrownCombinedP.c:4479:18: warning: unknown conversion type character 'z' in format [-Wformat]
calcBrownCombinedP.c:4479:18: warning: format '%s' expects argument of type 'char *', but argument 3 has type 'Py_ssize_t' [-Wformat]
calcBrownCombinedP.c:4479:18: warning: too many arguments for format [-Wformat-extra-args]
In file included from C:\Anaconda\lib\site-packages\numpy\core\include/numpy/ndarrayobject.h:26:0,
from C:\Anaconda\lib\site-packages\numpy\core\include/numpy/arrayobject.h:15,
from calcBrownCombinedP.c:340:
calcBrownCombinedP.c: At top level:
C:\Anaconda\lib\site-packages\numpy\core\include/numpy/__multiarray_api.h:1594:1: warning: '_import_array' defined but not used [-Wunused-function]
In file included from C:\Anaconda\lib\site-packages\numpy\core\include/numpy/ufuncobject.h:311:0,
from calcBrownCombinedP.c:341:
C:\Anaconda\lib\site-packages\numpy\core\include/numpy/__ufunc_api.h:236:1: warning: '_import_umath' defined but not used [-Wunused-function]
writing build\temp.win-amd64-2.7\Release\calcBrownCombinedP.def
C:\Anaconda\Scripts\gcc.bat -DMS_WIN64 -shared -s build\temp.win-amd64-2.7\Release\calcbrowncombinedp.o build\temp.win-amd64-2.7\Release\calcBrownCombinedP.def -LC:\Anaconda\libs -LC:\Anaconda\PCbuild\amd64 -lpython27 -lmsvcr90 -o C:\cygwin64\home\Davy\SNPsets\src\calcBrownCombinedP.pyd
the pyx code - 'calcBrownCombinedP.pyx'
import numpy as np
cimport numpy as np
from scipy import stats
DTYPE = np.int
ctypedef np.int_t DTYPE_t
def calcBrownCombinedP(np.ndarray genotypeArray):
    cdef int nSNPs, i
    cdef np.ndarray ms, datam, datass, d, rs, temp
    cdef float runningSum, sigmaSq, E, df
    nSNPs = genotypeArray.shape[0]
    ms = genotypeArray.mean(axis=1)[(slice(None,None,None),None)]
    datam = genotypeArray - ms
    datass = np.sqrt(stats.ss(datam,axis=1))
    runningSum = 0
    for i in xrange(nSNPs):
        temp = np.dot(datam[i:],datam[i].T)
        d = (datass[i:]*datass[i])
        rs = temp / d
        rs = np.absolute(rs)[1:]
        runningSum += sum(rs*(3.25+(0.75*rs)))
    sigmaSq = 4*nSNPs+2*runningSum
    E = 2*nSNPs
    df = (2*(E*E))/sigmaSq
    runningSum = sigmaSq/(2*E)
    return runningSum
The code that tests the above against some pure python - 'test.py'
import numpy as np
from scipy import stats
import random
import time
from calcBrownCombinedP import calcBrownCombinedP
from PycalcBrownCombinedP import PycalcBrownCombinedP
ms = [10,50,100,500,1000,5000]
for m in ms:
    print '---testing implentation with m = {0}---'.format(m)
    genotypeArray = np.empty((m,20),dtype=int)
    for i in xrange(m):
        genotypeArray[i] = [random.randint(0,2) for j in xrange(20)]
    print genotypeArray.shape
    start = time.time()
    print calcBrownCombinedP(genotypeArray)
    print 'cython implementation took {0}'.format(time.time() - start)
    start = time.time()
    print PycalcBrownCombinedP(genotypeArray)
    print 'python implementation took {0}'.format(time.time() - start)
and the output of that code is:
---testing implentation with m = 10---
(10L, 20L)
2.13660168648
cython implementation took 0.000999927520752
2.13660167749
python implementation took 0.000999927520752
---testing implentation with m = 50---
(50L, 20L)
8.82721138
cython implementation took 0.00399994850159
8.82721130234
python implementation took 0.00500011444092
---testing implentation with m = 100---
(100L, 20L)
16.7438983917
cython implementation took 0.0139999389648
16.7438965333
python implementation took 0.0120000839233
---testing implentation with m = 500---
(500L, 20L)
80.5343856812
cython implementation took 0.183000087738
80.5343694046
python implementation took 0.161000013351
---testing implentation with m = 1000---
(1000L, 20L)
160.122573853
cython implementation took 0.615000009537
160.122491308
python implementation took 0.598000049591
---testing implentation with m = 5000---
(5000L, 20L)
799.813842773
cython implementation took 10.7159998417
799.813880445
python implementation took 11.2510001659
Lastly, the pure python implementation 'PycalcBrownCombinedP.py'
import numpy as np
from scipy import stats
def PycalcBrownCombinedP(genotypeArray):
    nSNPs = genotypeArray.shape[0]
    ms = genotypeArray.mean(axis=1)[(slice(None,None,None),None)]
    datam = genotypeArray - ms
    datass = np.sqrt(stats.ss(datam,axis=1))
    runningSum = 0
    for i in xrange(nSNPs):
        temp = np.dot(datam[i:],datam[i].T)
        d = (datass[i:]*datass[i])
        rs = temp / d
        rs = np.absolute(rs)[1:]
        runningSum += sum(rs*(3.25+(0.75*rs)))
    sigmaSq = 4*nSNPs+2*runningSum
    E = 2*nSNPs
    df = (2*(E*E))/sigmaSq
    runningSum = sigmaSq/(2*E)
    return runningSum
Profiling with kernprof shows the bottleneck is the last line of the loop:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
<snip>
16 5000 6145280 1229.1 86.6 runningSum += sum(rs*(3.25+(0.75*rs)))
This is no surprise as you're using the Python built-in function sum in both the Python and Cython versions. Switching to np.sum speeds the code up by a factor of 4.5 when the input array has shape (5000, 20).
If a small loss in accuracy is alright, then you can leverage linear algebra to speed up the final line further:
np.sum(rs * (3.25 + 0.75 * rs))
is really a vector dot product, i.e.
np.dot(rs, 3.25 + 0.75 * rs)
This is still suboptimal as it loops over rs three times and constructs two rs-sized temporary arrays. Using elementary algebra, this expression can be rewritten as
3.25 * np.sum(rs) + .75 * np.dot(rs, rs)
which not only gives the original result without the round-off error in the previous version, but only loops over rs twice and uses constant memory.(*)
The bottleneck is now np.dot, so installing a better BLAS library is going to buy you more than rewriting the whole thing in Cython.
(*) Or logarithmic memory in the very latest NumPy, which has a recursive reimplementation of np.sum that is faster than the old iterative one.
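If you want to convince yourself that the rearranged expression is the same quantity (up to round-off) before dropping it into the loop, a small self-contained check like the following does it; the random vector here is just a stand-in for rs:
import numpy as np

rs = np.random.rand(5000)                      # stand-in for the rs from the loop

a = np.sum(rs * (3.25 + 0.75 * rs))            # np.sum version
b = np.dot(rs, 3.25 + 0.75 * rs)               # dot-product form
c = 3.25 * np.sum(rs) + 0.75 * np.dot(rs, rs)  # rearranged form, fewest temporaries

print(np.allclose(a, b))   # True (dot-product form)
print(np.allclose(a, c))   # True (rearranged form)
Timing the three variants with timeit on an array the size of your real rs will tell you how much the rearrangement buys on your BLAS.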
I have a number of C functions, and I would like to call them from python. cython seems to be the way to go, but I can't really find an example of how exactly this is done. My C function looks like this:
void calculate_daily ( char *db_name, int grid_id, int year,
double *dtmp, double *dtmn, double *dtmx,
double *dprec, double *ddtr, double *dayl,
double *dpet, double *dpar ) ;
All I want to do is specify the first three parameters (a string and two integers) and recover 8 numpy arrays (or python lists; all the double arrays have N elements). My code assumes that the pointers point to an already allocated chunk of memory. Also, the produced C code ought to link to some external libraries.
Here's a tiny but complete example of passing numpy arrays
to an external C function, logically
fc( int N, double* a, double* b, double* z ) # z = a + b
using Cython.
(This is surely well-known to those who know it well.
Comments are welcome.
Last change: 23 Feb 2011, for Cython 0.14.)
First read or skim
Cython build
and Cython with NumPy .
2 steps:
python f-setup.py build_ext --inplace
turns f.pyx and fc.cpp -> f.so, a dynamic library
python test-f.py
import f loads f.so; f.fpy( ... ) calls the C fc( ... ).
python f-setup.py uses distutils to run cython, compile and link:
cython f.pyx -> f.cpp
compile f.cpp and fc.cpp
link f.o fc.o -> f.so,
a dynamic lib that python import f will load.
For students, I'd suggest: make a diagram of these steps,
look through the files below, then download and run them.
(distutils is a huge, convoluted package used to
make Python packages for distribution, and install them.
Here we're using just a small part of it to compile and link to f.so.
This step has nothing to do with Cython, but it can be confusing;
simple mistakes in a .pyx can cause pages of obscure error messages from g++ compile and link.
See also
distutils doc
and/or
SO questions on distutils .)
Like make, setup.py will rerun
cython f.pyx and g++ -c ... f.cpp
if f.pyx is newer than f.cpp.
To cleanup, rm -r build/ .
An alternative to setup.py would be to run the steps separately, in a script or Makefile:
cython --cplus f.pyx -> f.cpp # see cython -h
g++ -c ... f.cpp -> f.o
g++ -c ... fc.cpp -> fc.o
cc-lib f.o fc.o -> dynamic library f.so.
Modify the cc-lib-mac wrapper
below for your platform and installation: it's not pretty, but small.
For real examples of Cython wrapping C,
look at .pyx files in just about any
SciKit .
See also:
Cython for NumPy users
and SO questions/tagged/cython .
To unpack the following files,
cut-paste the lot to one big file, say cython-numpy-c-demo,
then in Unix (in a clean new directory) run sh cython-numpy-c-demo.
#--------------------------------------------------------------------------------
cat >f.pyx <<\!
# f.pyx: numpy arrays -> extern from "fc.h"
# 3 steps:
# cython f.pyx -> f.c
# link: python f-setup.py build_ext --inplace -> f.so, a dynamic library
# py test-f.py: import f gets f.so, f.fpy below calls fc()
import numpy as np
cimport numpy as np
cdef extern from "fc.h":
    int fc( int N, double* a, double* b, double* z )  # z = a + b

def fpy( N,
         np.ndarray[np.double_t,ndim=1] A,
         np.ndarray[np.double_t,ndim=1] B,
         np.ndarray[np.double_t,ndim=1] Z ):
    """ wrap np arrays to fc( a.data ... ) """
    assert N <= len(A) == len(B) == len(Z)
    fcret = fc( N, <double*> A.data, <double*> B.data, <double*> Z.data )
    # fcret = fc( N, A.data, B.data, Z.data )  grr char*
    return fcret
!
#--------------------------------------------------------------------------------
cat >fc.h <<\!
// fc.h: numpy arrays from cython , double*
int fc( int N, const double a[], const double b[], double z[] );
!
#--------------------------------------------------------------------------------
cat >fc.cpp <<\!
// fc.cpp: z = a + b, numpy arrays from cython
#include "fc.h"
#include <stdio.h>
int fc( int N, const double a[], const double b[], double z[] )
{
    printf( "fc: N=%d a[0]=%f b[0]=%f \n", N, a[0], b[0] );
    for( int j = 0; j < N; j ++ ){
        z[j] = a[j] + b[j];
    }
    return N;
}
!
#--------------------------------------------------------------------------------
cat >f-setup.py <<\!
# python f-setup.py build_ext --inplace
# cython f.pyx -> f.cpp
# g++ -c f.cpp -> f.o
# g++ -c fc.cpp -> fc.o
# link f.o fc.o -> f.so
# distutils uses the Makefile distutils.sysconfig.get_makefile_filename()
# for compiling and linking: a sea of options.
# http://docs.python.org/distutils/introduction.html
# http://docs.python.org/distutils/apiref.html 20 pages ...
# https://stackoverflow.com/questions/tagged/distutils+python
import numpy
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
# from Cython.Build import cythonize
ext_modules = [Extension(
    name="f",
    sources=["f.pyx", "fc.cpp"],
    # extra_objects=["fc.o"],  # if you compile fc.cpp separately
    include_dirs = [numpy.get_include()],  # .../site-packages/numpy/core/include
    language="c++",
    # libraries=
    # extra_compile_args = "...".split(),
    # extra_link_args = "...".split()
    )]

setup(
    name = 'f',
    cmdclass = {'build_ext': build_ext},
    ext_modules = ext_modules,
    # ext_modules = cythonize(ext_modules) ? not in 0.14.1
    # version=
    # description=
    # author=
    # author_email=
    )
# test: import f
!
#--------------------------------------------------------------------------------
cat >test-f.py <<\!
#!/usr/bin/env python
# test-f.py
import numpy as np
import f # loads f.so from cc-lib: f.pyx -> f.c + fc.o -> f.so
N = 3
a = np.arange( N, dtype=np.float64 )
b = np.arange( N, dtype=np.float64 )
z = np.ones( N, dtype=np.float64 ) * np.NaN
fret = f.fpy( N, a, b, z )
print "fpy -> fc z:", z
!
#--------------------------------------------------------------------------------
cat >cc-lib-mac <<\!
#!/bin/sh
me=${0##*/}
case $1 in
"" )
    set -- f.cpp fc.cpp ;;  # default: g++ these
-h* | --h* )
    echo "
$me [g++ flags] xx.c yy.cpp zz.o ...
    compiles .c .cpp .o files to a dynamic lib xx.so
"
    exit 1
esac
# Logically this is simple, compile and link,
# but platform-dependent, layers upon layers, gloom, doom
base=${1%.c*}
base=${base%.o}
set -x
g++ -dynamic -arch ppc \
    -bundle -undefined dynamic_lookup \
    -fno-strict-aliasing -fPIC -fno-common -DNDEBUG `# -g` -fwrapv \
    -isysroot /Developer/SDKs/MacOSX10.4u.sdk \
    -I/Library/Frameworks/Python.framework/Versions/2.6/include/python2.6 \
    -I${Pysite?}/numpy/core/include \
    -O2 -Wall \
    "$@" \
    -o $base.so
# undefs: nm -gpv $base.so | egrep '^ *U _+[^P]'
!
# 23 Feb 2011 13:38
The following Cython code from
http://article.gmane.org/gmane.comp.python.cython.user/5625 doesn't require explicit casts and also handles non-contiguous arrays:
def fpy(A):
    cdef np.ndarray[np.double_t, ndim=2, mode="c"] A_c
    A_c = np.ascontiguousarray(A, dtype=np.double)
    fc(&A_c[0,0])
Basically you can write your Cython function such that it allocates the arrays (make sure you cimport numpy as np):
cdef np.ndarray[np.double_t, ndim=1] rr = np.zeros((N,), dtype=np.double)
then pass in the .data pointer of each to your C function. That should work. If you don't need to start with zeros you could use np.empty for a small speed boost.
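Applied to the calculate_daily signature from the question, a wrapper along those lines could look like the sketch below. It is a hedged, untested sketch: the header name calculate_daily.h, the wrapper name calculate_daily_py, and the assumption that the caller knows N are all inventions on my part.
# wrapper.pyx -- hypothetical sketch: allocate the eight output arrays in
# Cython and hand their .data pointers to the C function.
import numpy as np
cimport numpy as np

cdef extern from "calculate_daily.h":
    void calculate_daily( char *db_name, int grid_id, int year,
                          double *dtmp, double *dtmn, double *dtmx,
                          double *dprec, double *ddtr, double *dayl,
                          double *dpet, double *dpar )

def calculate_daily_py( db_name, int grid_id, int year, int N ):
    # db_name must be a byte string so it can convert to char*
    cdef np.ndarray[np.double_t, ndim=1] dtmp  = np.zeros((N,), dtype=np.double)
    cdef np.ndarray[np.double_t, ndim=1] dtmn  = np.zeros((N,), dtype=np.double)
    cdef np.ndarray[np.double_t, ndim=1] dtmx  = np.zeros((N,), dtype=np.double)
    cdef np.ndarray[np.double_t, ndim=1] dprec = np.zeros((N,), dtype=np.double)
    cdef np.ndarray[np.double_t, ndim=1] ddtr  = np.zeros((N,), dtype=np.double)
    cdef np.ndarray[np.double_t, ndim=1] dayl  = np.zeros((N,), dtype=np.double)
    cdef np.ndarray[np.double_t, ndim=1] dpet  = np.zeros((N,), dtype=np.double)
    cdef np.ndarray[np.double_t, ndim=1] dpar  = np.zeros((N,), dtype=np.double)
    calculate_daily( db_name, grid_id, year,
                     <double*> dtmp.data, <double*> dtmn.data, <double*> dtmx.data,
                     <double*> dprec.data, <double*> ddtr.data, <double*> dayl.data,
                     <double*> dpet.data, <double*> dpar.data )
    return dtmp, dtmn, dtmx, dprec, ddtr, dayl, dpet, dpar
The external libraries the C code needs would then go into the libraries / library_dirs (or extra_link_args) arguments of the Extension, as in the f-setup.py example above.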
See the Cython for NumPy Users tutorial in the docs.
You should check out ctypes; it's probably the easiest thing to use if all you want is one function.
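For completeness, a hedged sketch of what the ctypes route might look like for the calculate_daily function above; the library name libdaily.so, the database file name, and the numbers are placeholders, and the string must be passed as bytes:
# ctypes sketch -- "libdaily.so", the file name and N are made-up placeholders
import ctypes
import numpy as np
from numpy.ctypeslib import ndpointer

lib = ctypes.CDLL("./libdaily.so")

double_arr = ndpointer(dtype=np.double, ndim=1, flags="C_CONTIGUOUS")
lib.calculate_daily.restype = None
lib.calculate_daily.argtypes = [ctypes.c_char_p, ctypes.c_int, ctypes.c_int] + [double_arr] * 8

N = 365                       # whatever length the C code expects to fill
outs = [np.zeros(N, dtype=np.double) for _ in range(8)]
lib.calculate_daily(b"my_database.db", 1, 2010, *outs)
dtmp, dtmn, dtmx, dprec, ddtr, dayl, dpet, dpar = outs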