pybind11 memory leak and crashes - python

I am facing memory-leak and crash issues with pybind11.
I am calling a Python function "myfunc" from a Python file "mydl.py" (which uses TensorFlow Keras deep-learning functions, NumPy, and Redis modules) through pybind11, repeatedly, from C++ code. The code structure is as follows:
#include <pybind11/embed.h>
#include <string>
namespace py = pybind11;

class myclass {
public:
    myclass() {
        py::initialize_interpreter();
        {
            py::module sys = py::module::import("sys");
            py::module os = py::module::import("os");
            py::str cwd = os.attr("getcwd")();
            py::print("os.cwd: ", cwd);
            py::str bin = cwd + py::str("/../bin");
            // Add bin to sys.path
            py::module site = py::module::import("site");
            site.attr("addsitedir")(bin);
        }
    }
    ~myclass() {
        py::finalize_interpreter();
    }
    int callpyfunc(std::string a1, std::string a2) {
        int retval;
        {
            py::module mydl = py::module::import("mydl");
            py::object result = mydl.attr("myfunc")(a1, a2);
            retval = result.cast<int>();
        }
        return retval;
    }
};

myclass *mcobj1;

int main() {
    mcobj1 = new myclass();
    int retval;
    while (true /* some deep learning condition is not met */) {
        retval = mcobj1->callpyfunc(a1, a2);
    }
    delete mcobj1;
}
The memory footprint of this program grows steadily until it consumes the entire 62 GB of RAM and the program crashes. It looks as though the Python interpreter is not releasing the memory allocated for objects inside each call to "myfunc" of "mydl.py", even after the call completes.
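One way to narrow down where the growth comes from (an editor's sketch, not part of the original post) is to drive Python's tracemalloc module from the embedded interpreter and compare snapshots around a single call:

// Hedged diagnostic sketch: snapshot Python allocations around one call.
// Assumes the interpreter is already initialized, as in the constructor above.
py::module tracemalloc = py::module::import("tracemalloc");
tracemalloc.attr("start")();
py::object before = tracemalloc.attr("take_snapshot")();

mcobj1->callpyfunc(a1, a2);  // one iteration of the leaking call

py::object after = tracemalloc.attr("take_snapshot")();
py::list stats = after.attr("compare_to")(before, "lineno");  // deltas by source line
for (py::handle stat : stats)
    py::print(stat);

If tracemalloc shows little growth while the process keeps growing, the leak is in native allocations (for example TensorFlow's own buffers) rather than in Python objects.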
Here is everything I have tried, none of which fixed the issue:
Using py::scoped_interpreter inside callpyfunc instead of calling initialize_interpreter and finalize_interpreter. In that case the code crashes quietly on the second call to callpyfunc; the first call goes fine. This is exactly the behavior mentioned here.
Moving initialize_interpreter (along with the imports of modules like "sys" and "os") and finalize_interpreter inside callpyfunc. In that case the code crashes on the second call to callpyfunc at the line py::module mydl = py::module::import("mydl"); and never reaches the interpreter finalization.
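A further pattern that may be worth trying (a sketch under assumptions, not a confirmed fix): keep the interpreter initialized for the whole process lifetime, import "mydl" exactly once and hold on to the handle, and run Python's garbage collector after every call so that temporaries and reference cycles created inside "myfunc" are reclaimed promptly:

class myclass {
public:
    myclass() {
        py::initialize_interpreter();
        mydl = py::module::import("mydl");  // import once, reuse across calls
        gc   = py::module::import("gc");
    }
    ~myclass() {
        // Drop the Python references while the interpreter is still alive;
        // letting the members destruct after finalize_interpreter() is undefined behavior.
        mydl = py::object();
        gc   = py::object();
        py::finalize_interpreter();
    }
    int callpyfunc(const std::string &a1, const std::string &a2) {
        int retval;
        {
            py::object result = mydl.attr("myfunc")(a1, a2);
            retval = result.cast<int>();
        }                      // result's reference is released here
        gc.attr("collect")();  // collect any cycles myfunc left behind
        return retval;
    }
private:
    py::object mydl, gc;  // plain py::object so they can be cleared in the destructor
};

Note that if the growth is in native memory held by TensorFlow itself, gc.collect() will not return it; the tracemalloc comparison above helps distinguish the two cases.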

Related

Pybind11 - Capture output in realtime from C++ to Python

Using pybind11, I am able to call a native C++ function from my Python code.
My C++ program has a long-running function that keeps going until explicitly stopped, and this function generates output using std::cout. As this long-running function never returns by design, I am trying to get the output of this C++ code into my Python code for further processing.
I am aware of https://pybind11.readthedocs.io/en/stable/advanced/pycpp/utilities.html#capturing-standard-output-from-ostream, but I really do not see how to get the C++-generated output into my Python code.
Here is the code:
#include <pybind11/pybind11.h>
#include <pybind11/iostream.h>  // py::scoped_ostream_redirect
#include <iostream>

int myFunc()
{
    ...
    for (;;) { // Can only be stopped if requested by user
        std::cout << capturedEvent;
    }
    ...
    return 0;
}

namespace py = pybind11;

PYBIND11_MODULE(Handler, m) {
    // Add a scoped redirect for your noisy code
    m.def("myFunc", []() {
        py::scoped_ostream_redirect stream(
            std::cout,                                 // std::ostream&
            py::module_::import("sys").attr("stdout")  // Python output
        );
        myFunc();
    });
    ...
}
And Python:
from Handler import myFunc
from random import random

DATA = [(random() - 0.5) * 3 for _ in range(999999)]

def test(fn, name):
    result = fn()
    print('capturedEvent is {} {} \n\n'.format(result, name))

if __name__ == "__main__":
    test(myFunc, '(PyBind11 C++ extension)')
I would like to retrieve, in real time, the content of the C++ capturedEvent variable.
If there is another approach than capturing stdout (maybe sharing a variable in real time?), please let me know; maybe my strategy is wrong.
Thank you.
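One alternative worth sketching (an editor's addition; myFuncWithCallback and on_event are illustrative names, not from the original code): instead of redirecting stdout, pass a Python callable into the binding and invoke it for every event. pybind11's functional.h converts the callable into a std::function whose wrapper re-acquires the GIL on each call:

#include <pybind11/pybind11.h>
#include <pybind11/functional.h>  // std::function <-> Python callable conversion
#include <functional>
#include <string>

namespace py = pybind11;

// Long-running loop that reports each event through a callback instead of cout.
void myFuncWithCallback(const std::function<void(const std::string &)> &on_event) {
    for (;;) {  // can only be stopped if requested by user, as before
        std::string capturedEvent = "event";  // placeholder for the real event source
        on_event(capturedEvent);  // delivered to Python immediately
    }
}

PYBIND11_MODULE(Handler, m) {
    m.def("myFuncWithCallback", &myFuncWithCallback,
          py::call_guard<py::gil_scoped_release>());  // release the GIL inside the loop
}

From Python this would be called as Handler.myFuncWithCallback(lambda event: process(event)), so each capturedEvent reaches Python the moment it is produced rather than when the function returns.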

PyDict segfaults when key > size 1

I am trying to use Python bindings to make a simple Python wrapper around my C++ code. I currently want to return a map of values. When I try to create a dictionary entry, my application segfaults whenever the key is longer than one character. Even ignoring the returning of the object, I still get the error; adding just "ke" segfaults as well. I have successfully returned a dict with {"k": 10}, but that is it.
C++:
extern "C" void Test() {
signal(SIGSEGV, handler); // install our handler
PyObject* results = PyDict_New();
printf("Adding k\n");
PyDict_SetItemString(results, "k", PyLong_FromLong(3000));
printf("Adding ke\n");
PyDict_SetItemString(results, "ke", PyLong_FromLong(3000));
printf("Adding key\n");
PyDict_SetItemString(results, "key", PyLong_FromLong(3000));
}
Python:
import ctypes

_test_bench = ctypes.CDLL('<path_to_so>')
_test_bench.Test.argtypes = None
_test_bench.Test.restype = None

def test() -> None:
    global _test_bench
    _test_bench.Test()

test()
Output:
Adding k
Adding ke
Error: signal 11:
You can't use the Python API from a library loaded with CDLL. You need to use PyDLL.
(Also, don't forget to do your refcount management. That's not the cause of the crash, but it is still a problem.)
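To make the refcount point concrete (an editor's sketch, assuming the library is loaded with ctypes.PyDLL and the function's restype is set to ctypes.py_object): PyLong_FromLong returns a new reference that PyDict_SetItemString does not steal, so each value needs a Py_DECREF, and the dict itself must be returned or released:

#include <Python.h>

extern "C" PyObject* Test() {
    PyObject* results = PyDict_New();
    PyObject* value = PyLong_FromLong(3000);
    // PyDict_SetItemString adds its own reference; it does not steal ours.
    PyDict_SetItemString(results, "key", value);
    Py_DECREF(value);  // drop our reference; the dict keeps the item alive
    return results;    // the caller now owns the dict
}

PyDLL behaves like CDLL except that it does not release the GIL around calls, which is what makes Python C API calls safe here.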

How can embedded python call a C++ class's function?

So, StackOverflow, I'm stumped.
The code as I have it is a C++ function with embedded Python. I generate a message on the C++ side, send it to the Python, and get a different message back. I got it to work, I got it tested; so far, so good.
The next step is that I need Python to generate messages on its own and send them into C++. This is where I'm starting to get stuck. After spending a few hours puzzling over the documentation, it seemed like the best way would be to define a module to hold my functions. So I wrote up the following stub:
static PyMethodDef mailbox_methods[] = {
    { "send_external_message",
      [](PyObject *caller, PyObject *args) -> PyObject *
      {
          classname *interface = (classname *)
              PyCapsule_GetPointer(PyTuple_GetItem(args, 0), "turkey");
          class_field_type toReturn;
          toReturn = class_field_type::python2cpp(PyTuple_GetItem(args, 1));
          interface->send_message(toReturn);
          Py_INCREF(Py_None);
          return Py_None;
      },
      METH_VARARGS,
      "documentation" },
    { NULL, NULL, 0, NULL }
};

static struct PyModuleDef moduledef = {
    PyModuleDef_HEAD_INIT,
    "turkey",
    "documentation",
    -1,
    mailbox_methods
};

// defined in header file, just noting its existence here
PyObject *Module;

PyMODINIT_FUNC PyInit_turkey()
{
    Module = PyModule_Create(&moduledef);
    return Module;
}
static struct PyModuleDef moduledef = {
PyModuleDef_HEAD_INIT,
"turkey",
"documentation",
-1,
mailbox_methods
};
//defined in header file, just noting its existence here
PyObject *Module;
PyMODINIT_FUNC PyInit_turkey()
{
Module = PyModule_Create(&moduledef);
return Module;
}
And on the Python side, I had the following receiver code:
import turkey
I get the following response:
ImportError: No module named 'turkey'
Now, here's the part where I get really confused. Here's the code in my initialization function:
PyInit_turkey();
PyObject *interface = PyCapsule_New(this, "instance", NULL);
const char *repr = PyUnicode_AsUTF8(PyObject_Repr(Module));
cout << "REPR: " << repr << "\n";
if (PyErr_Occurred())
    PyErr_Print();
It prints out
REPR: <module 'turkey'>
Traceback (most recent call last):
<snipped>
import turkey
ImportError: No module named 'turkey'
So the module exists, but it's never being passed to Python anywhere. I can't find documentation on how to pass it in and get it initialized on the Python side. I realize that I'm probably just missing a trivial step, but I can't for the life of me figure out what it is. Can anyone help?
The answer was, in the end, a single function call that I was missing, placed at the start of my initialization function, before I call Py_Initialize:
PyImport_AppendInittab("turkey", &PyInit_turkey);
Py_Initialize();
PyEval_InitThreads();
The Python documentation mentions PyImport_AppendInittab only in passing, which is why I was having such a difficult time making the jump.
To anyone else who finds this in the future: you do not need to create a DLL to extend Python or use a pre-built library to bring a module into Python. It can be far easier than that.
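Pulling it together (an editor's sketch of the minimal pattern; it assumes the PyInit_turkey stub shown above): register the module with the interpreter before Py_Initialize, after which embedded Python can import it like any other module:

#include <Python.h>

PyMODINIT_FUNC PyInit_turkey();  // from the module stub above

int main() {
    // Registration must happen before Py_Initialize, or the import will fail.
    PyImport_AppendInittab("turkey", &PyInit_turkey);
    Py_Initialize();
    PyRun_SimpleString("import turkey\nprint(turkey.send_external_message)");
    Py_Finalize();
    return 0;
}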

Memory management of SWIG generated objects passed to C

I am trying to wrap a C++ library for Python using SWIG. The library uses function calls that accept byte buffers as parameters. In Python I create these byte buffers using %array_class from SWIG. I made a proof-of-concept program to test this out, and I noticed a significant memory leak associated with passing these buffers to C++. Specifically, running the code below steadily raises the memory usage of the Python process (as observed in Task Manager) up to about 250 MB, at which point the program halts. The printouts from C++ indicate that the program runs as expected; it just eats up more and more memory. The del buff statement runs but does nothing to release the memory. I tried creating and deleting the buffer in each loop iteration, with the same result.
Running delete x; in C++ crashes the program entirely.
My Swig Interface file:
%module example
%include "carrays.i"
%array_class(uint8_t, buffer);
%{
#include "PythonConnector.h"
%}
%include "PythonConnector.h"
The C++ header file:
class PythonConnector {
public:
    void print_array(uint8_t *x);
};
The minimal C++-defined function:
void PythonConnector::print_array(uint8_t *x)
{
    //int i;
    //for (i = 0; i < 100; i++) {
    //    printf("[%d] = %d\n", i, x[i]);
    //}
    //delete x; // <-- This crashed the program
    return;
}
The tester Python script:
import time
import example

sizeBytes = 10000
buff = example.buffer(sizeBytes)
for j in range(1000):
    # Initialize data buffer
    for i in range(sizeBytes):
        buff[i] = i % 256
    buff[0] = 0
    example.PythonConnector().print_array(buff.cast())
    print(j)
del buff
time.sleep(10)
Am I missing something? I suspect that SWIG creates some proxy object each time the buffer is passed to C++, and that this proxy is not garbage collected.
Edit:
SWIG version 3.0.7
CPython version 3.5 x64
Windows 10 OS
Thanks for your help.
OK, thanks to @Flexo, I found the answer.
The problem was the example.PythonConnector() instance being created in each loop iteration. Instantiating it only once, outside the loop, fixes the memory problem:
import time
import example

sizeBytes = 10000
buff = example.buffer(sizeBytes)
conn = example.PythonConnector()
for j in range(1000):
    # Initialize data buffer
    for i in range(sizeBytes):
        buff[i] = i % 256
    buff[0] = 0
    conn.print_array(buff.cast())
    print(j)
del buff
time.sleep(10)
There still remains the question of why the many connectors do not get garbage collected in the original code.
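As an aside on the delete x; crash (an editor's note): with %array_class, the Python proxy object (buff here) owns the underlying memory and frees it when the proxy is destroyed, so freeing the same pointer on the C++ side is a double free. A sketch of the borrowing convention:

#include <cstdio>
#include <cstdint>

// print_array only borrows the pointer; the SWIG %array_class proxy (buff)
// owns the memory and releases it when the Python object is destroyed.
void PythonConnector::print_array(uint8_t *x)
{
    for (int i = 0; i < 100; i++) {
        printf("[%d] = %d\n", i, x[i]);
    }
    // No delete here: ownership stays with the Python-side proxy.
}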

static openCL class not properly released in python module using boost.python

EDIT: OK, all the edits made the layout of the question a bit confusing, so I have rewritten the question (not changing the content, but improving its structure).
The issue in short
I have an openCL program that works fine if I compile it as an executable. Now I am trying to make it callable from Python using boost.python. However, as soon as I exit Python (after importing my module), Python crashes.
The reason seems to have something to do with statically storing the GPU CommandQueues and the mechanism by which they are released when the program terminates.
MWE and setup
Setup
IDE used: Visual Studio 2015
OS used: Windows 7, 64-bit
Python version: 3.5
AMD OpenCL APP 3.0 headers
cl2.hpp taken directly from Khronos, as suggested here: empty openCL program throws deprecation warning
I have an Intel CPU with integrated graphics hardware and no other dedicated graphics card
I use version 1.60 of the boost library, compiled as 64-bit
The boost DLL I use is called boost_python-vc140-mt-1_60.dll
The openCL program without Python works fine
The Python module without openCL works fine
MWE
#include <vector>
#include <algorithm>  // std::remove_if
#define CL_HPP_ENABLE_EXCEPTIONS
#define CL_HPP_TARGET_OPENCL_VERSION 200
#define CL_HPP_MINIMUM_OPENCL_VERSION 200 // I have the same issue for 100 and 110
#include "cl2.hpp"
#include <boost/python.hpp>

using namespace std;

class TestClass
{
private:
    std::vector<cl::CommandQueue> queues;
    TestClass();
public:
    static const TestClass& getInstance()
    {
        static TestClass instance;
        return instance;
    }
};

TestClass::TestClass()
{
    std::vector<cl::Device> devices;
    vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);

    // remove non 2.0 platforms (as suggested by doqtor)
    platforms.erase(
        std::remove_if(platforms.begin(), platforms.end(),
            [](const cl::Platform& platform)
            {
                int v = cl::detail::getPlatformVersion(platform());
                short version_major = v >> 16;
                return !(version_major >= 2);
            }),
        platforms.end());

    // Get all available GPUs
    for (const cl::Platform& pl : platforms)
    {
        vector<cl::Device> plDevices;
        try {
            pl.getDevices(CL_DEVICE_TYPE_GPU, &plDevices);
        }
        catch (cl::Error&)
        {
            // Doesn't matter. No GPU is available on the current machine for
            // this platform. Just check afterwards that you have at least one
            // device.
            continue;
        }
        devices.insert(end(devices), begin(plDevices), end(plDevices));
    }

    cl::Context context(devices[0]);
    cl::CommandQueue queue(context, devices[0]);
    queues.push_back(queue);
}

int main()
{
    TestClass::getInstance();
    return 0;
}

BOOST_PYTHON_MODULE(FrameWork)
{
    TestClass::getInstance();
}
Calling program
So after compiling the program as a dll I start python and run the following program
import FrameWork
exit()
While the import works without issues, Python crashes on exit(). So I click on Debug, and Visual Studio tells me there was an exception in the following code section (in cl2.hpp):
template <>
struct ReferenceHandler<cl_command_queue>
{
    static cl_int retain(cl_command_queue queue)
    { return ::clRetainCommandQueue(queue); }
    static cl_int release(cl_command_queue queue) // -- HERE --
    { return ::clReleaseCommandQueue(queue); }
};
If the above code is instead compiled as a simple executable, it works without issues. The code also works if either of the following is true:
CL_DEVICE_TYPE_GPU is replaced by CL_DEVICE_TYPE_ALL
the line queues.push_back(queue) is removed
Question
So what could be the reason for this, and what are possible solutions? I suspect it has something to do with the fact that my TestClass is static, but since it works in the executable, I am at a loss as to what is causing it.
I came across a similar problem in the past.
The clRetain* functions are supported from OpenCL 1.2 onwards.
When getting devices for the first GPU platform (platforms[0].getDevices(...) with CL_DEVICE_TYPE_GPU) in your case, that platform must happen to be pre-OpenCL 1.2, hence you get a crash. When getting devices of any type (GPU/CPU/...), your first platform changes to an OpenCL 1.2+ one and everything is fine.
To fix the problem, set:
#define CL_HPP_MINIMUM_OPENCL_VERSION 110
This ensures that calls to clRetain* are not made for unsupported platforms (pre-OpenCL 1.2).
Update: I think there is a bug in cl2.hpp: despite setting the minimum OpenCL version to 1.1, it still tries to use clRetain* on pre-OpenCL 1.2 devices when creating a command queue.
Setting the minimum OpenCL version to 110 together with version filtering works fine for me.
Complete working example:
#include "stdafx.h"
#include <vector>
#define CL_HPP_ENABLE_EXCEPTIONS
#define CL_HPP_TARGET_OPENCL_VERSION 200
#define CL_HPP_MINIMUM_OPENCL_VERSION 110
#include <CL/cl2.hpp>
using namespace std;
class TestClass
{
private:
std::vector<cl::CommandQueue> queues;
TestClass();
public:
static const TestClass& getInstance()
{
static TestClass instance;
return instance;
}
};
TestClass::TestClass()
{
std::vector<cl::Device> devices;
vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
size_t x = 0;
for (; x < platforms.size(); ++x)
{
cl::Platform &p = platforms[x];
int v = cl::detail::getPlatformVersion(p());
short version_major = v >> 16;
if (version_major >= 2) // OpenCL 2.x
break;
}
if (x == platforms.size())
return; // no OpenCL 2.0 platform available
platforms[x].getDevices(CL_DEVICE_TYPE_GPU, &devices);
cl::Context context(devices);
cl::CommandQueue queue(context, devices[0]);
queues.push_back(queue);
}
int main()
{
TestClass::getInstance();
return 0;
}
Update 2:
"So what could be the reason for this and what are possible solutions? I suspect it has something to do with the fact that my testclass is static, but since it works with the executable I am at a loss what is causing it."
The static TestClass does seem to be the reason. It looks like memory is being released in the wrong order when run from Python. To fix that, you may want to add a method that is explicitly called to release the OpenCL objects before Python starts releasing memory:
static TestClass& getInstance() // <- const removed
{
    static TestClass instance;
    return instance;
}

void release()
{
    queues.clear();
}

BOOST_PYTHON_MODULE(FrameWork)
{
    TestClass::getInstance();
    TestClass::getInstance().release();
}
"I would appreciate an answer that explains to me what the problem actually is and if there are ways to fix it."
First, let me say that doqtor already answered how to fix the issue -- by ensuring a well-defined destruction time for all used OpenCL resources. IMO, this is not a "hack", but the right thing to do. Trying to rely on static init/cleanup magic to do the right thing -- and watching it fail to do so -- is the real hack!
Second, some thoughts about the issue: the actual problem is even more complex than the usual static initialization order fiasco stories. It involves DLL loading/unloading order, both in connection with Python loading your custom DLL at runtime and (more importantly) with OpenCL's installable client driver (ICD) model.
What DLLs are involved when running an application/DLL that uses OpenCL? To the application, the only relevant DLL is the opencl.dll you link against. It is loaded into process memory during application startup (or when your custom DLL that needs OpenCL is dynamically loaded by Python).
Then, the first time you call clGetPlatformInfo() or similar in your code, the ICD logic kicks in: opencl.dll looks for installed drivers (on Windows, these are listed somewhere in the registry) and dynamically loads their respective DLLs (using something like the LoadLibrary() system call). That may be, e.g., nvopencl.dll for NVIDIA, or some other DLL for the Intel driver you have installed. Now, in contrast to the relatively simple opencl.dll, this ICD DLL can and will have a multitude of dependencies of its own -- probably using Intel IPP, or TBB, or whatever. So by now, things have become really messy.
Now, during shutdown, the Windows loader must decide which DLLs to unload in which order. When you compile your example into a single executable, the number and order of DLLs being loaded/unloaded will certainly differ from the "python loads your custom dll at runtime" scenario. And that could well be the reason why you experience the problem only in the latter case, and only if you still have an OpenCL context + command queue alive during shutdown of your custom DLL. The destruction of your queue (triggered via clRelease... during static destruction of your TestClass instance) is delegated to the Intel ICD DLL, so that DLL must still be fully functional at that time. If, for some reason, that is not the case (perhaps because the loader chose to unload it or one of the DLLs it needs), you crash.
That line of thought reminded me of this article:
https://blogs.msdn.microsoft.com/larryosterman/2004/06/10/dll_process_detach-is-the-last-thing-my-dlls-going-to-see-right/
There's a paragraph, talking about "COM objects", which might be equally applicable to "OpenCL resources":
"So consider the case where you have a DLL that instantiates a COM object at some point during its lifetime. If that DLL keeps a reference to the COM object in a global variable, and doesn’t release the COM object until the DLL_PROCESS_DETACH, then the DLL that implements the COM object will be kept in memory during the lifetime of the COM object. Effectively the DLL implementing the COM object has become dependant on the DLL that holds the reference to the COM object. But the loader has no way of knowing about this dependency. All it knows is that the DLL’s are loaded into memory."
Now, I have written a lot of words without arriving at a definitive proof of what is actually going wrong. The main lesson I have learned from bugs like these is: don't enter that snake pit, and do your resource cleanup in a well-defined place, as doqtor suggested. Good night.
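Building on both answers (an editor's sketch; releaseFrameWork is an illustrative helper, not from the original code): the explicit release() call can be registered with Python's atexit module from the module initializer, so the cleanup runs at interpreter shutdown, while the ICD DLLs are still loaded:

#include <boost/python.hpp>

// Illustrative free function wrapping the explicit cleanup suggested above.
void releaseFrameWork()
{
    TestClass::getInstance().release();  // drop the CommandQueues now
}

BOOST_PYTHON_MODULE(FrameWork)
{
    TestClass::getInstance();
    // Run releaseFrameWork() at interpreter exit, before DLLs are unloaded.
    boost::python::object atexit = boost::python::import("atexit");
    atexit.attr("register")(boost::python::make_function(&releaseFrameWork));
}

This keeps the singleton's queues alive for the whole Python session but guarantees they are released before DLL unloading begins, which is exactly the well-defined cleanup point both answers argue for.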
