Portable/fast way to obtain a pointer to Numpy/Numpypy data - python

I recently tried PyPy and was intrigued by the approach. I have lots of C extensions for Python, which all use PyArray_DATA() to obtain a pointer to the data sections of numpy arrays. Unfortunately, PyPy doesn't appear to export an equivalent for their numpypy arrays in their cpyext module, so I tried following the recommendation on their website to use ctypes. This pushes the task of obtaining the pointer to the Python level.
There appear to be two ways:
import ctypes as C
p_t = C.POINTER(C.c_double)
def get_ptr_ctypes(x):
return x.ctypes.data_as(p_t)
def get_ptr_array(x):
return C.cast(x.__array_interface__['data'][0], p_t)
Only the second one works on PyPy, so for compatibility the choice is clear. For CPython, both are slow as hell and a complete bottleneck for my application! Is there a fast and portable way of obtaining this pointer? Or is there an equivalent of PyArray_DATA() for PyPy (possibly undocumented)?

I still haven't found an entirely satisfactory solution, but nevertheless there is something one can do to obtain the pointer with a lot less overhead in CPython. First off, the reason why both ways mentioned above are so slow is that both .ctypes and .__array_interface__ are on-demand attributes, which are set by array_ctypes_get() and array_interface_get() in numpy/numpy/core/src/multiarray/getset.c. The first imports ctypes and creates a numpy.core._internal._ctypes instance, while the second one creates a new dictionary and populates it with lots of unnecessary stuff in addition to the data pointer.
There is nothing one can do on the Python level about this overhead, but one can write a micro-module on the C-level that bypasses most of the overhead:
#include <Python.h>
#include <numpy/arrayobject.h>
PyObject *_get_ptr(PyObject *self, PyObject *obj) {
return PyLong_FromVoidPtr(PyArray_DATA(obj));
}
static PyMethodDef methods[] = {
{"_get_ptr", _get_ptr, METH_O, "Wrapper to PyArray_DATA()"},
{NULL, NULL, 0, NULL}
};
PyMODINIT_FUNC initaccel(void) {
Py_InitModule("accel", methods);
}
Compile as usual as an Extension in setup.py, and import as
try:
from accel import _get_ptr
def get_ptr(x):
return C.cast(_get_ptr(x), p_t)
except ImportError:
get_ptr = get_ptr_array
On PyPy, from accel import _get_ptr will fail and get_ptr will fall back to get_ptr_array, which works with Numpypy.
As far as performance goes, for light-weight C function calls, ctypes + accel._get_ptr() is still quite a bit slower than the native CPython extension, which has essentially no overhead. It is of course much faster than get_ptr_ctypes() and get_ptr_array() above, so that the overhead may become insignificant for medium-weight C function calls.
One has gained compatibility with PyPy, although I have to say that after spending quite a bit of time trying to evaluate PyPy for my scientific computation applications, I don't see a future for it as long as they (quite stubbornly) refuse to support the full CPython API.
Update
I found that ctypes.cast() was now becoming the bottleneck after introducing accel._get_ptr(). One can get rid of the casts by declaring all pointers in the interface as ctypes.c_void_p. This is what I ended up with:
def get_ptr_ctypes2(x):
return x.ctypes._data
def get_ptr_array(x):
return x.__array_interface__['data'][0]
try:
from accel import _get_ptr as get_ptr
except ImportError:
get_ptr = get_ptr_array
Here, get_ptr_ctypes2() avoids the cast by accessing the hidden ndarray.ctypes._data attribute directly. Here are some timing results for calling heavy-weight and light-weight C functions from Python:
heavy C (few calls) light C (many calls)
ctypes + get_ptr_ctypes(): 0.71 s 15.40 s
ctypes + get_ptr_ctypes2(): 0.68 s 13.30 s
ctypes + get_ptr_array(): 0.65 s 11.50 s
ctypes + accel._get_ptr(): 0.63 s 9.47 s
native CPython: 0.62 s 8.54 s
Cython (no decorators): 0.64 s 9.96 s
So, with accel._get_ptr() and no ctypes.cast()s, ctypes' speed is actually competitive with a native CPython extension. So I just have to wait until someone rewrites h5py, matplotlib and scipy with ctypes to be able to try PyPy for anything serious...

That might not be answer enough, but hopefully a good hint. I am using scipy.weave.inline() in some parts of my code. I do not know much about the speed of the interface itself, because the function I execute is quite heavy and relies on a few pointers/arrays only, but it seems fast to me. Maybe you can get some inspiration from the scipy.weave code, particularly from attempt_function_call
https://github.com/scipy/scipy/blob/master/scipy/weave/inline_tools.py#L390
If you want to have a look at the C++ code that is generated by scipy.weave,
produce a simple example from here: http://docs.scipy.org/doc/scipy/reference/tutorial/weave.html ,
run the python script
get the scipy.weave cache folder:
import scipy.weave.catalog as ctl
ctl.default_dir()
Out[5]: '/home/user/.python27_compiled'
have a look at the generated C++ code in the folder

Related

Is there a built-in way to use inline C code in Python?

Even if numba, cython (and especially cython.inline) exist, in some cases, it would be interesting to have inline C code in Python.
Is there a built-in way (in Python standard library) to have inline C code?
PS: scipy.weave used to provide this, but it's Python 2 only.
Directly in the Python standard library, probably not. But it's possible to have something very close to inline C in Python with the cffi module (pip install cffi).
Here is an example, inspired by this article and this question, showing how to implement a factorial function in Python + "inline" C:
from cffi import FFI
ffi = FFI()
ffi.set_source("_test", """
long factorial(int n) {
long r = n;
while(n > 1) {
n -= 1;
r *= n;
}
return r;
}
""")
ffi.cdef("""long factorial(int);""")
ffi.compile()
from _test import lib # import the compiled library
print(lib.factorial(10)) # 3628800
Notes:
ffi.set_source(...) defines the actual C source code
ffi.cdef(...) is the equivalent of the .h header file
you can of course add some cleaning code after, if you don't need the compiled library at the end (however, cython.inline does the same and the compiled .pyd files are not cleaned by default, see here)
this quick inline use is particularly useful during a prototyping / development phase. Once everything is ready, you can separate the build (that you do only once), and the rest of the code which imports the pre-compiled library
It seems too good to be true, but it seems to work!

Dll causes Python to crash when using memset

I've been working on a project where I'm trying to use an old CodeBase library, written in C++, in Python. What I want is to use CodeBase to reindex a .dbf-file that has a .cdx-index. But currently, Python is crashing during runtime. A more detailed explanation will follow further down.
As for Python, I'm using ctypes to load the dll and then execute a function I added myself which should cause no problems, since it doesn't use a single line of code that CodeBase itself isn't using.
Python Code:
import ctypes
cb_interface = ctypes.CDLL("C4DLL.DLL")
cb_interface.reindex_file("C:\\temp\\list.dbf")
Here's the CodeBase function I added, but it requires some amount of knowledge that I can't provide right now, without blowing this question up quite a bit. If neccessary, I will provide further insight, as much as I can:
S4EXPORT int reindex_file(const char* file){
CODE4 S4PTR *code;
DATA4 S4PTR *data;
code4initLow(&code, 0, S4VERSION, sizeof(CODE4));
data = d4open(code, file);
d4reindex(data);
return 1;
}
According to my own debugging, my problem happens in code4initLow. Python crashes with a window saying "python.exe has stopped working", when the dll reaches the following line of code:
memset( (void *)c4, 0, sizeof( CODE4 ) ) ;
c4 here is the same object as code in the previous code-block.
Is there some problem with a dll trying to alter memory during runtime? Could it be a python problem that would go away if I were to create a .exe-file from my python script?
If someone could answer me these questions and/or provide a solution for my python-crashing-problem, I would greatly appreciate it.
And last but not least, this is my first question here. If I have accidently managed to violate a written or unwritten rule here, I apologize and promise to fix that as soon as possible.
First of all the pointer code doesn't point anywhere as it is uninitialized. Secondly you don't actually try to fill the structure, since you pass memset a pointer to the pointer.
What you should probably do is declare code as a normal structure instance (and not a pointer), and then use &code when passing it to d4open.
Like Joachim Pileborg saying the problem is, to pass a Nullpointer to code4initLow. ("Alternative") Solution is to allocate Memory for the struct CODE4 S4PTR *code = (CODE4*)malloc(sizeof(CODE4));, then pass the Pointer like code4initLow(code, 0, S4VERSION, sizeof(CODE4));.

Why is my python/numpy example faster than pure C implementation?

I have pretty much the same code in python and C. Python example:
import numpy
nbr_values = 8192
n_iter = 100000
a = numpy.ones(nbr_values).astype(numpy.float32)
for i in range(n_iter):
a = numpy.sin(a)
C example:
#include <stdio.h>
#include <math.h>
int main(void)
{
int i, j;
int nbr_values = 8192;
int n_iter = 100000;
double x;
for (j = 0; j < nbr_values; j++){
x = 1;
for (i=0; i<n_iter; i++)
x = sin(x);
}
return 0;
}
Something strange happen when I ran both examples:
$ time python numpy_test.py
real 0m5.967s
user 0m5.932s
sys 0m0.012s
$ g++ sin.c
$ time ./a.out
real 0m13.371s
user 0m13.301s
sys 0m0.008s
It looks like python/numpy is twice faster than C. Is there any mistake in the experiment above? How you can explain it?
P.S. I have Ubuntu 12.04, 8G ram, core i5 btw
First, turn on optimization. Secondly, subtleties matter. Your C code is definitely not 'basically the same'.
Here is equivalent C code:
sinary2.c:
#include <math.h>
#include <stdlib.h>
float *sin_array(const float *input, size_t elements)
{
int i = 0;
float *output = malloc(sizeof(float) * elements);
for (i = 0; i < elements; ++i) {
output[i] = sin(input[i]);
}
return output;
}
sinary.c:
#include <math.h>
#include <stdlib.h>
extern float *sin_array(const float *input, size_t elements)
int main(void)
{
int i;
int nbr_values = 8192;
int n_iter = 100000;
float *x = malloc(sizeof(float) * nbr_values);
for (i = 0; i < nbr_values; ++i) {
x[i] = 1;
}
for (i=0; i<n_iter; i++) {
float *newary = sin_array(x, nbr_values);
free(x);
x = newary;
}
return 0;
}
Results:
$ time python foo.py
real 0m5.986s
user 0m5.783s
sys 0m0.050s
$ gcc -O3 -ffast-math sinary.c sinary2.c -lm
$ time ./a.out
real 0m5.204s
user 0m4.995s
sys 0m0.208s
The reason the program has to be split in two is to fool the optimizer a bit. Otherwise it will realize that the whole loop has no effect at all and optimize it out. Putting things in two files doesn't give the compiler visibility into the possible side-effects of sin_array when it's compiling main and so it has to assume that it actually has some and repeatedly call it.
Your original program is not at all equivalent for several reasons. One is that you have nested loops in the C version and you don't in Python. Another is that you are working with arrays of values in the Python version and not in the C version. Another is that you are creating and discarding arrays in the Python version and not in the C version. And lastly you are using float in the Python version and double in the C version.
Simply calling the sin function the appropriate number of times does not make for an equivalent test.
Also, the optimizer is a really big deal for C. Comparing C code on which the optimizer hasn't been used to anything else when you're wondering about a speed comparison is the wrong thing to do. Of course, you also need to be mindful. The C optimizer is very sophisticated and if you're testing something that really doesn't do anything, the C optimizer might well notice this fact and simply not do anything at all, resulting in a program that's ridiculously fast.
Because "numpy" is a dedicated math library implemented for speed. C has standard functions for sin/cos, that are generally derived for accuracy.
You are also not comparing apples with apples, as you are using double in C, and float32 (float) in python. If we change the python code to calculate float64 instead, the time increases by about 2.5 seconds on my machine, making it roughly match with the correctly optimized C version.
If the whole test was made to do something more complicated that requires more control structres (if/else, do/while, etc), then you would probably see even less difference between C and Python - because the C compiler can't really do "sin" any faster - unless you implement a better "sin" function.
Newer mind the fact that your code isn't quite the same on both sides... ;)
You seem to be doing the the same operation in C 8192 x 10000 times but only 10000 in python (I haven't used numpy before so I may misunderstand the code). Why are you using an array in the python case (again I'm not use to numpy so perhaps the dereferencing is implicit). If you wish to use an array be careful doubles have a performance hit in terms of caching and optimised vectorisation - you're using different types between both implementations (float vs double) but given the algorithm I don't think it matters.
The main reason for a lot of anomalous performance benchmark issues surrounding C vs Pythis, Pythat... Is that simply the C implementation is often poor.
https://www.ibm.com/developerworks/community/blogs/jfp/entry/A_Comparison_Of_C_Julia_Python_Numba_Cython_Scipy_and_BLAS_on_LU_Factorization?lang=en
If you notice the guy writes C to process an array of doubles (without using restrict or const keywords where he could've), he builds with optimisation then forces the compiler to use SIMD rather than AVE. In short the compiler is using an inefficient instruction set for doubles and the wrong type of registers too if he wanted performance - you can be sure the numba and numpy will be using as many bells and whistles as possible and will be shipped with very efficient C and C++ libraries to begin with. In short if you want speed with C you have to think about it, you may even have to disassemble the code and perhaps disable optimisation and use compiler instrinsics instead. It gives you the tools to do it so don't expect the compiler to do all the work for you. If you want that degree of freedom use Cython, Numba, Numpy, Scipy etc. They're very fast but you won't be able to eek out every bit of performance out of the machine - to do that use C, C++ or new versions of FORTRAN.
Here is a very good article on these very points (I'd use SciPy):
https://www.scipy.org/scipylib/faq.html

DLR & Performance

I'm intending to create a web service which performs a large number of manually-specified calculations as fast as possible, and have been exploring the use of DLR.
Sorry if this is long but feel free to skim over and get the general gist.
I've been using the IronPython library as it makes the calculations very easy to specify. My works laptop gives a performance of about 400,000 calculations per second doing the following:
ScriptEngine py = Python.CreateEngine();
ScriptScope pys = py.CreateScope();
ScriptSource src = py.CreateScriptSourceFromString(#"
def result():
res = [None]*1000000
for i in range(0, 1000000):
res[i] = b.GetValue() + 1
return res
result()
");
CompiledCode compiled = src.Compile();
pys.SetVariable("b", new DynamicValue());
long start = DateTime.Now.Ticks;
var res = compiled.Execute(pys);
long end = DateTime.Now.Ticks;
Console.WriteLine("...Finished. Sample data:");
for (int i = 0; i < 10; i++)
{
Console.WriteLine(res[i]);
}
Console.WriteLine("Took " + (end - start) / 10000 + "ms to run 1000000 times.");
Where DynamicValue is a class that returns random numbers from a pre-built array (seeded and built at run time).
When I create a DLR class to do the same thing, I get much higher performance (~10,000,000 calculations per second). The class is as follows:
class DynamicCalc : IDynamicMetaObjectProvider
{
DynamicMetaObject IDynamicMetaObjectProvider.GetMetaObject(Expression parameter)
{
return new DynamicCalcMetaObject(parameter, this);
}
private class DynamicCalcMetaObject : DynamicMetaObject
{
internal DynamicCalcMetaObject(Expression parameter, DynamicCalc value) : base(parameter, BindingRestrictions.Empty, value) { }
public override DynamicMetaObject BindInvokeMember(InvokeMemberBinder binder, DynamicMetaObject[] args)
{
Expression Add = Expression.Convert(Expression.Add(args[0].Expression, args[1].Expression), typeof(System.Object));
DynamicMetaObject methodInfo = new DynamicMetaObject(Expression.Block(Add), BindingRestrictions.GetTypeRestriction(Expression, LimitType));
return methodInfo;
}
}
}
and is called/tested in the same way by doing the following:
dynamic obj = new DynamicCalc();
long t1 = DateTime.Now.Ticks;
for (int i = 0; i < 10000000; i++)
{
results[i] = obj.Add(ar1[i], ar2[i]);
}
long t2 = DateTime.Now.Ticks;
Where ar1 and ar2 are pre-built, runtime seeded arrays of random numbers.
The speed is great this way, but it's not easy to specify the calculation. I'd basically be looking at creating my own lexer & parser, whereas IronPython has everything I need already there.
I'd have thought I could get much better performance from IronPython since it is implemented on top of the DLR, and I could do with better than what I'm getting.
Is my example making best use of the IronPython engine? Is it possible to get significantly better performance out of it?
(Edit) Same as first example but with the loop in C#, setting variables and calling the python function:
ScriptSource src = py.CreateScriptSourceFromString(#"b + 1");
CompiledCode compiled = src.Compile();
double[] res = new double[1000000];
for(int i=0; i<1000000; i++)
{
pys.SetVariable("b", args1[i]);
res[i] = compiled.Execute(pys);
}
where pys is a ScriptScope from py, and args1 is a pre-built array of random doubles. This example executes slower than running the loop in the Python code and passing in the entire arrays.
delnan's comment leads you to some of the problems here. But I'll just get specific about what the differences are here. In the C# version you've cut out a significant amount of the dynamic calls that you have in the Python version. For starters your loop is typed to int and it sounds like ar1 and ar2 are strongly typed arrays. So in the C# version the only dynamic operations you have are the call to obj.Add (which is 1 operation in C#) and potentially the assignment to results if it's not typed to object which seems unlikely. Also note all of this code is lock free.
In the Python version you first have the allocation of the list - this also appears to be during your timer where as in C# it doesn't look like it is. Then you have the dynamic call to range, luckily that only happens once. But that again creates a gigantic list in memory - delnan's suggestion of xrange is an improvement here. Then you have the loop counter i which is getting boxed to an object for every iteration through the loop. Then you have the call to b.GetValue() which is actually 2 dynamic invocatiosn - first a get member to get the "GetValue" method and then an invoke on that bound method object. This is again creating one new object for every iteration of the loop. Then you have the result of b.GetValue() which may be yet another value that's boxed on every iteration. Then you add 1 to that result and you have another boxing operation on every iteration. Finally you store this into your list which is yet another dynamic operation - I think this final operation needs to lock to ensure the list remains consistent (again, delnan's suggestion of using a list comprehension improves this).
So in summary during the loop we have:
C# IronPython
Dynamic Operations 1 4
Allocations 1 4
Locks Acquired 0 1
So basically Python's dynamic behavior does come at a cost vs C#. If you want the best of both worlds you can try and balance what you do in C# vs what you do in Python. For example you could write the loop in C# and have it call a delegate which is a Python function (you can do scope.GetVariable> to get a function out of the scope as a delegate). You could also consider allocating a .NET array for the results if you really need to get every last bit of performance as it may reduce working set and GC copying by not keeping around a bunch of boxed values.
To do the delegate you could have the user write:
def computeValue(value):
return value + 1
Then in the C# code you'd do:
CompiledCode compiled = src.Compile();
compiled.Execute(pys);
var computer = pys.GetVariable<Func<object,object>>("computeValue");
Now you can do:
for (int i = 0; i < 10000000; i++)
{
results[i] = computer(i);
}
If you concerned about computation speed, is it better to look at lowlevel computation specification? Python and C# are high-level languages, and its implementation runtime can spend a lot of time for undercover work.
Look on this LLVM wrapper library: http://www.llvmpy.org
Install it using: pip install llvmpy ply
or on Debian Linux: apt install python-llvmpy python-ply
You still need to write some tiny compiler (you can use PLY library), and bind it with LLVM JIT calls (see LLVM Execution Engine), but this approach can be more effective (generated code much closer to real CPU code), and multiplatform comparing to .NET jail.
LLVM has ready to use optimizing compiler infrastructure, including a lot of optimizer stage modules, and big user and developer community.
Also look here: http://gmarkall.github.io/tutorials/llvm-cauldron-2016
PS: If you interested in it, I can help you with a compiler, contributing to my project's manual in parallel. But it will not be jumpstart, this theme is new to me too.

How to make Python Extensions for Windows for absolute beginners

I've been looking around the internet trying to find a good step by step guide to extend Python in Windows, and I haven't been able to find something for my skill level.
let's say you have some c code that looks like this:
#include <stdio.h>
#include <math.h>
double valuex(float value, double rate, double timex)
{
float value;
double rate, timex;
return value / (double) pow ((1 + rate), (timex));
}
and you want to turn that into a Python 3 module for use on a windows (64bit if that makes a difference) system. How would you go about doing that? I've looked up SWIG and Pyrex and in both circumstances they seem geared towards the unix user. With Pyrex I am not sure if it works with Python 3.
I'm just trying to learn the basics of programing, using some practical examples.
Lastly, if there is a good book that someone can recommend for learning to extend, I would greatly appreciate it.
Thank you.
Cython (Pyrex with a few kinks worked out and decisions made for practicality) can use one code base to make Python 2 and Python 3 modules. It's a really great choice for making libraries for 2 and 3. The user guide explains how to use it, but it doesn't demystify Windows programming or C or Python or programming in general, thought it can simplify some things for you.
SWIG can be hard to work with when you run into a problem and will not be especially conducive to creating a very native-feeling, idiomatic binding of the C you are relying on. For that, you would need to re-wrap the wrapper in Python, at which point it might have been nicer just to use Cython. It can be nice for bindings that you cannot dedicate enough work to make truly nice, and is convenient in that you can expose your API to many languages at once in it.
Depending on what you're trying to do, building your "extension" as a simple DLL and accessing it with ctypes could be, by far, the simplest approach.
I used your code, slightly adjusted and saved as mydll.c:
#include <stdio.h>
#include <math.h>
#define DLL_EXPORT __declspec(dllexport)
DLL_EXPORT double valuex(float value, double rate, double timex)
{
float value;
double rate, timex;
return value / (double) pow ((1 + rate), (timex));
}
I downloaded the Tiny C Compiler and invoked with this command.
tcc -shared mydll.c
(I believe adding -rdynamic would avoid the need to sprinkle DLL_EXPORT all over your function defs.)
This generated mydll.dll. I then ran Python:
Python 3.1.2 (r312:79149, Mar 21 2010, 00:41:52) ... on win32
>>> from ctypes import *
>>> mydll = cdll.mydll
>>> valuex = mydll.valuex
>>> valuex.argtypes = [c_float, c_double, c_double]
>>> valuex.restype = c_double
>>> valuex(1.2, 2.3, 3.4)
2.0470634033800796e-21
A start would be the documentation Building C and C++ Extensions on Windows.
Well, the easiest way to create Python plugins is to use C++ and Boost.Python. In your example, the extension module would look as simple as this:
#include <boost/python.hpp>
using namespace boost::python;
// ... your valuex function goes here ...
BOOST_PYTHON_MODULE(yourModuleName)
{
def("valuex", valuex, "an optional documentation string");
}
Boost.Python is available on the popular operating systems and should work with Python 3, too (not tested it, though, support was added in 2009).
Regarding SWIG: It is not for Unix only. You can download precompiled Windows binaries or compile it yourself with MinGW/MSYS.
You could as well try out Cython which is said to be Python 3 compatible.
At PyCon 2009, I gave a talk on how to write Python C extensions: A Whirlwind Excursion through Python C Extensions. There's nothing specific to Windows in it, but it covers the basic structure of an extension.

Categories