I've been looking around the internet trying to find a good step by step guide to extend Python in Windows, and I haven't been able to find something for my skill level.
let's say you have some c code that looks like this:
#include <stdio.h>
#include <math.h>
double valuex(float value, double rate, double timex)
{
float value;
double rate, timex;
return value / (double) pow ((1 + rate), (timex));
}
and you want to turn that into a Python 3 module for use on a windows (64bit if that makes a difference) system. How would you go about doing that? I've looked up SWIG and Pyrex and in both circumstances they seem geared towards the unix user. With Pyrex I am not sure if it works with Python 3.
I'm just trying to learn the basics of programing, using some practical examples.
Lastly, if there is a good book that someone can recommend for learning to extend, I would greatly appreciate it.
Thank you.
Cython (Pyrex with a few kinks worked out and decisions made for practicality) can use one code base to make Python 2 and Python 3 modules. It's a really great choice for making libraries for 2 and 3. The user guide explains how to use it, but it doesn't demystify Windows programming or C or Python or programming in general, thought it can simplify some things for you.
SWIG can be hard to work with when you run into a problem and will not be especially conducive to creating a very native-feeling, idiomatic binding of the C you are relying on. For that, you would need to re-wrap the wrapper in Python, at which point it might have been nicer just to use Cython. It can be nice for bindings that you cannot dedicate enough work to make truly nice, and is convenient in that you can expose your API to many languages at once in it.
Depending on what you're trying to do, building your "extension" as a simple DLL and accessing it with ctypes could be, by far, the simplest approach.
I used your code, slightly adjusted and saved as mydll.c:
#include <stdio.h>
#include <math.h>
#define DLL_EXPORT __declspec(dllexport)
DLL_EXPORT double valuex(float value, double rate, double timex)
{
float value;
double rate, timex;
return value / (double) pow ((1 + rate), (timex));
}
I downloaded the Tiny C Compiler and invoked with this command.
tcc -shared mydll.c
(I believe adding -rdynamic would avoid the need to sprinkle DLL_EXPORT all over your function defs.)
This generated mydll.dll. I then ran Python:
Python 3.1.2 (r312:79149, Mar 21 2010, 00:41:52) ... on win32
>>> from ctypes import *
>>> mydll = cdll.mydll
>>> valuex = mydll.valuex
>>> valuex.argtypes = [c_float, c_double, c_double]
>>> valuex.restype = c_double
>>> valuex(1.2, 2.3, 3.4)
2.0470634033800796e-21
A start would be the documentation Building C and C++ Extensions on Windows.
Well, the easiest way to create Python plugins is to use C++ and Boost.Python. In your example, the extension module would look as simple as this:
#include <boost/python.hpp>
using namespace boost::python;
// ... your valuex function goes here ...
BOOST_PYTHON_MODULE(yourModuleName)
{
def("valuex", valuex, "an optional documentation string");
}
Boost.Python is available on the popular operating systems and should work with Python 3, too (not tested it, though, support was added in 2009).
Regarding SWIG: It is not for Unix only. You can download precompiled Windows binaries or compile it yourself with MinGW/MSYS.
You could as well try out Cython which is said to be Python 3 compatible.
At PyCon 2009, I gave a talk on how to write Python C extensions: A Whirlwind Excursion through Python C Extensions. There's nothing specific to Windows in it, but it covers the basic structure of an extension.
Related
Even if numba, cython (and especially cython.inline) exist, in some cases, it would be interesting to have inline C code in Python.
Is there a built-in way (in Python standard library) to have inline C code?
PS: scipy.weave used to provide this, but it's Python 2 only.
Directly in the Python standard library, probably not. But it's possible to have something very close to inline C in Python with the cffi module (pip install cffi).
Here is an example, inspired by this article and this question, showing how to implement a factorial function in Python + "inline" C:
from cffi import FFI
ffi = FFI()
ffi.set_source("_test", """
long factorial(int n) {
long r = n;
while(n > 1) {
n -= 1;
r *= n;
}
return r;
}
""")
ffi.cdef("""long factorial(int);""")
ffi.compile()
from _test import lib # import the compiled library
print(lib.factorial(10)) # 3628800
Notes:
ffi.set_source(...) defines the actual C source code
ffi.cdef(...) is the equivalent of the .h header file
you can of course add some cleaning code after, if you don't need the compiled library at the end (however, cython.inline does the same and the compiled .pyd files are not cleaned by default, see here)
this quick inline use is particularly useful during a prototyping / development phase. Once everything is ready, you can separate the build (that you do only once), and the rest of the code which imports the pre-compiled library
It seems too good to be true, but it seems to work!
I am trying to find the actual implementation of math.pow in Python3 (3.7.3) on Ubuntu 18.04 LTS (Bionic Beaver).
The Python doc says the math module
provides access to the mathematical functions defined by the C standard.
This post says the math module
is usually included in OS distributions. ... many microprocessors have specialised instructions for some of these operations, and your compiler may well make use of those rather than jumping to the implementation in the C library.
So, there are 3 possible implementations Python, Ubuntu, microprocessors.
I searched python math module source code, and then google got me the mathmodule.c.
I didn't find the definition, declaration, wrapper, or implementation of math.pow in this mathmodule.c file.
I noticed that the mathmodule.c includes some headers.
#include "Python.h"
#include "_math.h"
#include "clinic/mathmodule.c.h"
So, I searched them respectively, and got
"Python.h" is a "meta-include" file
Include nearly all Python header files
"_math.h" seems to be for Hyperbolic functions
the clue in "mathmodule.c.h" ends at math_pow_impl
static PyObject *
math_pow_impl(PyObject *module, double x, double y);
math_pow_impl is implemented at mathmodule.c
static PyObject *
math_pow_impl(PyObject *module, double x, double y)
/*[clinic end generated code: output=fff93e65abccd6b0 input=c26f1f6075088bfd]*/
{
double r;
int odd_y;
...
errno = 0;
PyFPE_START_PROTECT("in math_pow", return 0);
r = pow(x, y);
PyFPE_END_PROTECT(r);
r = pow(x, y); seems to be the key, although mathmodule.c only uses the function rather than implementing this function, what is the next step I could try?
PS:
I also searched on my ubuntu (/usr/include/math.h) and got no result contains "pow".
EDIT: Ok, all the edits made the layout of the question a bit confusing so I will try to rewrite the question (not changing the content, but improving its structure).
The issue in short
I have an openCL program that works fine, if I compile it as an executable. Now I try to make it callable from Python using boost.python. However, as soon as I exit Python (after importing my module), python crashes.
The reason seems to have something to do with
statically storing only GPU CommandQueues and their release mechanism when the program terminates
MWE and setup
Setup
IDE used: Visual Studio 2015
OS used: Windows 7 64bit
Python version: 3.5
AMD OpenCL APP 3.0 headers
cl2.hpp directly from Khronos as suggested here: empty openCL program throws deprecation warning
Also I have an Intel CPU with integrated graphics hardware and no other dedicated graphics card
I use version 1.60 of the boost library compiled as 64-bit versions
The boost dll I use is called: boost_python-vc140-mt-1_60.dll
The openCL program without python works fine
The python module without openCL works fine
MWE
#include <vector>
#define CL_HPP_ENABLE_EXCEPTIONS
#define CL_HPP_TARGET_OPENCL_VERSION 200
#define CL_HPP_MINIMUM_OPENCL_VERSION 200 // I have the same issue for 100 and 110
#include "cl2.hpp"
#include <boost/python.hpp>
using namespace std;
class TestClass
{
private:
std::vector<cl::CommandQueue> queues;
TestClass();
public:
static const TestClass& getInstance()
{
static TestClass instance;
return instance;
}
};
TestClass::TestClass()
{
std::vector<cl::Device> devices;
vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
//remove non 2.0 platforms (as suggested by doqtor)
platforms.erase(
std::remove_if(platforms.begin(), platforms.end(),
[](const cl::Platform& platform)
{
int v = cl::detail::getPlatformVersion(platform());
short version_major = v >> 16;
return !(version_major >= 2);
}),
platforms.end());
//Get all available GPUs
for (const cl::Platform& pl : platforms)
{
vector<cl::Device> plDevices;
try {
pl.getDevices(CL_DEVICE_TYPE_GPU, &plDevices);
}
catch (cl::Error&)
{
// Doesn't matter. No GPU is available on the current machine for
// this platform. Just check afterwards, that you have at least one
// device
continue;
}
devices.insert(end(devices), begin(plDevices), end(plDevices));
}
cl::Context context(devices[0]);
cl::CommandQueue queue(context, devices[0]);
queues.push_back(queue);
}
int main()
{
TestClass::getInstance();
return 0;
}
BOOST_PYTHON_MODULE(FrameWork)
{
TestClass::getInstance();
}
Calling program
So after compiling the program as a dll I start python and run the following program
import FrameWork
exit()
While the import works without issues, python crashes on exit(). So I click on debug and Visual Studio tells me there was an exception in the following code section (in cl2.hpp):
template <>
struct ReferenceHandler<cl_command_queue>
{
static cl_int retain(cl_command_queue queue)
{ return ::clRetainCommandQueue(queue); }
static cl_int release(cl_command_queue queue) // -- HERE --
{ return ::clReleaseCommandQueue(queue); }
};
If you compile the above code instead as a simple executable, it works without issues. Also the code works if one of the following is true:
CL_DEVICE_TYPE_GPU is replaced by CL_DEVICE_TYPE_ALL
the line queues.push_back(queue) is removed
Question
So what could be the reason for this and what are possible solutions? I suspect it has something to do with the fact that my testclass is static, but since it works with the executable I am at a loss what is causing it.
I came across similar problem in the past.
clRetain* functions are supported from OpenCL1.2.
When getting devices for the first GPU platform (platforms[0].getDevices(...) for CL_DEVICE_TYPE_GPU) in your case it must happen to be a platform pre OpenCL1.2 hence you get a crash. When getting devices of any type (GPU/CPU/...) your first platform changes to be a OpenCL1.2+ and everything is fine.
To fix the problem set:
#define CL_HPP_MINIMUM_OPENCL_VERSION 110
This will ensure calls to clRetain* aren't made for unsupported platforms (pre OpenCL 1.2)
Update: I think there is a bug in cl2.hpp which despite setting minimum OpenCL version to 1.1 it still tries to use clRetain* on pre OpenCL1.2 devices when creating a command queue.
Setting minimum OpenCL version to 110 and version filtering works fine for me.
Complete working example:
#include "stdafx.h"
#include <vector>
#define CL_HPP_ENABLE_EXCEPTIONS
#define CL_HPP_TARGET_OPENCL_VERSION 200
#define CL_HPP_MINIMUM_OPENCL_VERSION 110
#include <CL/cl2.hpp>
using namespace std;
class TestClass
{
private:
std::vector<cl::CommandQueue> queues;
TestClass();
public:
static const TestClass& getInstance()
{
static TestClass instance;
return instance;
}
};
TestClass::TestClass()
{
std::vector<cl::Device> devices;
vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
size_t x = 0;
for (; x < platforms.size(); ++x)
{
cl::Platform &p = platforms[x];
int v = cl::detail::getPlatformVersion(p());
short version_major = v >> 16;
if (version_major >= 2) // OpenCL 2.x
break;
}
if (x == platforms.size())
return; // no OpenCL 2.0 platform available
platforms[x].getDevices(CL_DEVICE_TYPE_GPU, &devices);
cl::Context context(devices);
cl::CommandQueue queue(context, devices[0]);
queues.push_back(queue);
}
int main()
{
TestClass::getInstance();
return 0;
}
Update2:
So what could be the reason for this and what are possible solutions?
I suspect it has something to do with the fact that my testclass is
static, but since it works with the executable I am at a loss what is
causing it.
TestClass static seems to be a reason. Looks like releasing memory is happening in wrong order when run from python. To fix that you may want to add a method which will have to be explicitly called to release opencl objects before python starts releasing memory.
static TestClass& getInstance() // <- const removed
{
static TestClass instance;
return instance;
}
void release()
{
queues.clear();
}
BOOST_PYTHON_MODULE(FrameWork)
{
TestClass::getInstance();
TestClass::getInstance().release();
}
"I would appreciate an answer that explains to me what the problem actually is and if there are ways to fix it."
First, let me say that doqtor already answered how to fix the issue -- by ensuring a well-defined destruction time of all used OpenCL resources. IMO, this is not a "hack", but the right thing to do. Trying to rely on static init/cleanup magic to do the right thing -- and watching it fail to do so -- is the real hack!
Second, some thoughts about the issue: the actual problem is even more complex than the common static initialization order fiasco stories. It involves DLL loading/unloading order, both in connection with python loading your custom dll at runtime and (more importantly) with OpenCL's installable client driver (ICD) model.
What DLLs are involved when running an application/dll that uses OpenCL? To the application, the only relevant DLL is the opencl.dll you link against. It is loaded into process memory during application startup time (or when your custom DLL which needs opencl is dynamically loaded in python).
Then, at the time when you first call clGetPlatformInfo() or similar in your code, the ICD logic kicks in: opencl.dll will look for installed drivers (in windows, those are mentioned somewhere in the registry) and dynamically load their respective dlls (using sth like the LoadLibrary() system call). That may be e.g. nvopencl.dll for nvidia, or some other dll for the intel driver you have installed. Now, in contrast to the relatively simple opencl.dll, this ICD dll can and will have a multitude of dependencies on its own -- probably using Intel IPP, or TBB, or whatever. So by now, things have become real messy already.
Now, during shutdown, the windows loader must decide which dlls to unload in which order. When you compile your example in a single executable, the number and order of dlls being loaded/unloaded will certainly be different than in the "python loads your custom dll at runtime" scenario. And that could well be the reason why you experience the problem only in the latter case, and only if you still have an opencl-context+commandqueue alive during shutdown of your custom dll. The destruction of your queue (triggered via clRelease... during static destruction of your testclass instance) is delegated to the intel-icd-dll, so this dll must still be fully functional at that time. If, for some reason, that is not the case (perhaps because the loader chose to unload it or one of the dlls it needs), you crash.
That line of thought reminded me of this article:
https://blogs.msdn.microsoft.com/larryosterman/2004/06/10/dll_process_detach-is-the-last-thing-my-dlls-going-to-see-right/
There's a paragraph, talking about "COM objects", which might be equally applicable to "OpenCL resources":
"So consider the case where you have a DLL that instantiates a COM object at some point during its lifetime. If that DLL keeps a reference to the COM object in a global variable, and doesn’t release the COM object until the DLL_PROCESS_DETACH, then the DLL that implements the COM object will be kept in memory during the lifetime of the COM object. Effectively the DLL implementing the COM object has become dependant on the DLL that holds the reference to the COM object. But the loader has no way of knowing about this dependency. All it knows is that the DLL’s are loaded into memory."
Now, I wrote a lot of words without coming to a definitive proof of what's actually going wrong. The main lesson I learned from bugs like these is: don't enter that snake pit, and do your resource-cleanup in a well-defined place like doqtor suggested. Good night.
I have some C++ code that currently relies on hard-coded constants, which are imported into multiple other cpp files, and I would like my python (pyx) file to set the constants once at runtime.
So, cython.pyx imports files a.cpp, b.cpp, and c.cpp, and constants.hpp
Files a.cpp, b.cpp, and c.cpp all import constants.hpp.
I would like instead to have one universal constants file, eg new_constants.yml, which python imports and sends through to the cpp files. This also means (I think) that I won't have to re-compile the c code every time I want to tweak the constants.
I'm used to scripting languages (python, js), so working with old C++ code is throwing me off a bit, and I'm sure parts of this question sound like I'm retarded, so, thanks for being patient with me.
These are just some weird dependencies, and I can't wrap my mind around unspooling it.
C++ literally inserts #include'd files into the code at compile time (technically before compile time - during the preprocessor run), so there is no way to change those values at runtime.
if you have the following
foo.h
const int value = 42;
and foo.cpp
#include "foo.h"
int foo(){ return value; }
When you compile foo.cpp, the preprocessor will substitute the exact contents of foo.h to replace #include "foo.h" in the cpp file and then the compiler will see
const int value = 42;
int foo(){ return value; }
and nothing else
The original source code for a c++ program is completely discarded once compilation is complete and is never used again.
You can see what the compiler sees using the -E flag to gcc which will make it output the pre-processed source.
I recently tried PyPy and was intrigued by the approach. I have lots of C extensions for Python, which all use PyArray_DATA() to obtain a pointer to the data sections of numpy arrays. Unfortunately, PyPy doesn't appear to export an equivalent for their numpypy arrays in their cpyext module, so I tried following the recommendation on their website to use ctypes. This pushes the task of obtaining the pointer to the Python level.
There appear to be two ways:
import ctypes as C
p_t = C.POINTER(C.c_double)
def get_ptr_ctypes(x):
return x.ctypes.data_as(p_t)
def get_ptr_array(x):
return C.cast(x.__array_interface__['data'][0], p_t)
Only the second one works on PyPy, so for compatibility the choice is clear. For CPython, both are slow as hell and a complete bottleneck for my application! Is there a fast and portable way of obtaining this pointer? Or is there an equivalent of PyArray_DATA() for PyPy (possibly undocumented)?
I still haven't found an entirely satisfactory solution, but nevertheless there is something one can do to obtain the pointer with a lot less overhead in CPython. First off, the reason why both ways mentioned above are so slow is that both .ctypes and .__array_interface__ are on-demand attributes, which are set by array_ctypes_get() and array_interface_get() in numpy/numpy/core/src/multiarray/getset.c. The first imports ctypes and creates a numpy.core._internal._ctypes instance, while the second one creates a new dictionary and populates it with lots of unnecessary stuff in addition to the data pointer.
There is nothing one can do on the Python level about this overhead, but one can write a micro-module on the C-level that bypasses most of the overhead:
#include <Python.h>
#include <numpy/arrayobject.h>
PyObject *_get_ptr(PyObject *self, PyObject *obj) {
return PyLong_FromVoidPtr(PyArray_DATA(obj));
}
static PyMethodDef methods[] = {
{"_get_ptr", _get_ptr, METH_O, "Wrapper to PyArray_DATA()"},
{NULL, NULL, 0, NULL}
};
PyMODINIT_FUNC initaccel(void) {
Py_InitModule("accel", methods);
}
Compile as usual as an Extension in setup.py, and import as
try:
from accel import _get_ptr
def get_ptr(x):
return C.cast(_get_ptr(x), p_t)
except ImportError:
get_ptr = get_ptr_array
On PyPy, from accel import _get_ptr will fail and get_ptr will fall back to get_ptr_array, which works with Numpypy.
As far as performance goes, for light-weight C function calls, ctypes + accel._get_ptr() is still quite a bit slower than the native CPython extension, which has essentially no overhead. It is of course much faster than get_ptr_ctypes() and get_ptr_array() above, so that the overhead may become insignificant for medium-weight C function calls.
One has gained compatibility with PyPy, although I have to say that after spending quite a bit of time trying to evaluate PyPy for my scientific computation applications, I don't see a future for it as long as they (quite stubbornly) refuse to support the full CPython API.
Update
I found that ctypes.cast() was now becoming the bottleneck after introducing accel._get_ptr(). One can get rid of the casts by declaring all pointers in the interface as ctypes.c_void_p. This is what I ended up with:
def get_ptr_ctypes2(x):
return x.ctypes._data
def get_ptr_array(x):
return x.__array_interface__['data'][0]
try:
from accel import _get_ptr as get_ptr
except ImportError:
get_ptr = get_ptr_array
Here, get_ptr_ctypes2() avoids the cast by accessing the hidden ndarray.ctypes._data attribute directly. Here are some timing results for calling heavy-weight and light-weight C functions from Python:
heavy C (few calls) light C (many calls)
ctypes + get_ptr_ctypes(): 0.71 s 15.40 s
ctypes + get_ptr_ctypes2(): 0.68 s 13.30 s
ctypes + get_ptr_array(): 0.65 s 11.50 s
ctypes + accel._get_ptr(): 0.63 s 9.47 s
native CPython: 0.62 s 8.54 s
Cython (no decorators): 0.64 s 9.96 s
So, with accel._get_ptr() and no ctypes.cast()s, ctypes' speed is actually competitive with a native CPython extension. So I just have to wait until someone rewrites h5py, matplotlib and scipy with ctypes to be able to try PyPy for anything serious...
That might not be answer enough, but hopefully a good hint. I am using scipy.weave.inline() in some parts of my code. I do not know much about the speed of the interface itself, because the function I execute is quite heavy and relies on a few pointers/arrays only, but it seems fast to me. Maybe you can get some inspiration from the scipy.weave code, particularly from attempt_function_call
https://github.com/scipy/scipy/blob/master/scipy/weave/inline_tools.py#L390
If you want to have a look at the C++ code that is generated by scipy.weave,
produce a simple example from here: http://docs.scipy.org/doc/scipy/reference/tutorial/weave.html ,
run the python script
get the scipy.weave cache folder:
import scipy.weave.catalog as ctl
ctl.default_dir()
Out[5]: '/home/user/.python27_compiled'
have a look at the generated C++ code in the folder