Freeze/Fail when using functional with OpenMP [Pybind11/OpenMP] - python

I have a problem with the functional feature of Pybind11 when I use it with a for-loop with OpenMP. I've done some research and my problem sounds pretty similar to the one in this Pull Request from 2 years ago, but although this PR is closed and the issue seems to be fixed I still have this issue. A code example I created will hopefully explain my problem better:
#include <pybind11/pybind11.h>
#include <pybind11/functional.h>
#include <omp.h>
namespace py = pybind11;
class B {
public:
    B(int n, const int& initial_value);
    void map(const std::function<int(int)> &f);
private:
    int n;
    int* elements;
};
#include <pybind11/pybind11.h>
#include <pybind11/functional.h>
#include "b.h"
namespace py = pybind11;
B::B(int n, const int& v)
    : n(n) {
    elements = new int[n];
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        elements[i] = v;
    }
}

void B::map(const std::function<int(int)> &f) {
    // f wraps a Python callable, so every OpenMP thread calls into Python here
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        elements[i] = f(elements[i]);
    }
}

PYBIND11_MODULE(m, handle) {
    handle.doc() = "Example Module";
    py::class_<B>(handle, "B")
        .def(py::init<int, int>())
        .def("map", &B::map);
}
cmake_minimum_required(VERSION 3.4...3.18)
pybind11_add_module(m b.cpp)
find_package(OpenMP)
if(OpenMP_CXX_FOUND)
    target_link_libraries(m PUBLIC OpenMP::OpenMP_CXX)
else()
    message( FATAL_ERROR "Your compiler does not support OpenMP" )
endif()
from build.m import *

def test(i):
    return i * 20

b = B(2, 2)
b.map(test)  # the call that freezes
I basically have an array where I want to apply a Python function to every element using a for-loop. I know that it is an issue with functional and OpenMP specifically because in other parts of my project I am using OpenMP successfully and functional is also working if I am not using OpenMP.
Edit: It freezes at the map function and has to be terminated. I am using Ubuntu 21.10, Python 3.9, GCC 11.2.0, OpenMP 4.5, and the newest version of the pybind11 repo.

You're likely experiencing a deadlock between OpenMP's scheduler and Python's GIL (Global Interpreter Lock).
I suggest attaching gdb to your process and looking at where the threads are to verify that's really the problem.
IMHO mixing Python functions and OpenMP like that is asking for trouble. If you want multi-threading of Python functions you can use multiprocessing.pool.ThreadPool. But unless your functions release the GIL most of the time you won't benefit from multi-threading.
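As a rough illustration of the ThreadPool suggestion, here is a minimal sketch that applies a Python function to every element from the Python side instead of from inside an OpenMP loop. The helper name `threaded_map` and the worker count are assumptions made for this example, not part of the original code.

```python
from multiprocessing.pool import ThreadPool


def scale(i):
    # Same shape as the `test` function in the question
    return i * 20


def threaded_map(func, values, workers=4):
    """Apply func to each value using a pool of Python threads.

    Hypothetical helper: the threads still take turns holding the GIL,
    so this only pays off when func releases the GIL for most of its work.
    """
    with ThreadPool(processes=workers) as pool:
        return pool.map(func, values)


if __name__ == "__main__":
    print(threaded_map(scale, [2, 2]))
```

Because the interpreter schedules these threads itself, the GIL is handed over cleanly between them, which avoids the situation where OpenMP workers wait on a lock that the calling thread never releases.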


How to force/test malloc failure in shared library when called via Python ctypes

I have a Python program that calls a shared library (libpq in this case) that itself calls malloc under the hood.
I want to be able to test (i.e. in unit tests) what happens when those calls to malloc fail (e.g. when there isn't enough memory).
How can I force that?
Note: I don't think setting a resource limit on the process using ulimit -d would work. It would need to be precise and robust enough to, say, make a single malloc call inside libpq (for example, one inside PQconnectdbParams) fail while all others work fine, across different versions of Python, and even across different resource usages in the same version of Python.
It's possible, but it's tricky. In summary:
You can override malloc in a shared library and then (on Linux at least) use the LD_PRELOAD environment variable to load it.
But... Python calls malloc all over the place, and you need those calls to succeed. To isolate the "right" calls to malloc to fail, you can use the glibc functions "backtrace" and "backtrace_symbols" to inspect the stack and see whether the current call is the one that should fail.
This shared library exposes a small API to control which calls to malloc will fail (so it doesn't need to be hard coded in the library).
To allow some calls to malloc to succeed, you need a pointer to the original malloc function. However, to find this you need to call dlsym, which itself can call malloc. So you need to build a simple allocator into the new malloc so that these recursive calls to malloc succeed. Thanks to for this tip.
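The countdown behaviour described in the summary can be modelled in a few lines of pure Python. This is only a sketch of the decision logic; it intercepts nothing, and the class name `FailInjector` and the example symbol strings are made up for illustration.

```python
class FailInjector:
    """Pure-Python model of the 'fail the N-th matching malloc' rule."""

    def __init__(self):
        self.fail_in = -1          # -1 means never fail
        self.search_string = ""

    def set_fail_in(self, fail_in, search_string):
        self.fail_in = fail_in
        self.search_string = search_string

    def should_fail(self, stack_symbols):
        """Return True if a malloc made with this backtrace should fail."""
        if self.fail_in == -1:
            return False
        # Only calls whose backtrace mentions the search string are counted
        if not any(self.search_string in s for s in stack_symbols):
            return False
        fail = self.fail_in == 0
        self.fail_in -= 1          # after failing once this drops back to -1
        return fail


if __name__ == "__main__":
    injector = FailInjector()
    injector.set_fail_in(0, "PQconnectdbParams")
    print(injector.should_fail(["libpq.so(PQconnectdbParams+0x1f)"]))
```

Modelling it this way makes one property easy to see: after the chosen call fails, the counter returns to -1, so every later allocation succeeds again.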
In more detail:
The shared library code
// In test_override_malloc.c
// Some of this code is inspired by
#define _GNU_SOURCE
#include <dlfcn.h>
#include <execinfo.h>
#include <stddef.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

// Fails malloc at the fail_in-th call when search_string is in the backtrace
// -1 means never fail
static int fail_in = -1;
static char search_string[1024];

// To find the original address of malloc during malloc, dlsym is called,
// and dlsym itself might allocate memory via malloc. Those early requests
// are served from this static buffer.
static char initialising_buffer[10240];
static int initialising_buffer_pos = 0;

// The pointers to original memory management functions to call
// when we don't want to fail
static void *(*original_malloc)(size_t) = NULL;
static void (*original_free)(void *ptr) = NULL;

void set_fail_in(int _fail_in, char *_search_string) {
    fail_in = _fail_in;
    // strncpy does not terminate the string when the source is too long
    strncpy(search_string, _search_string, sizeof(search_string) - 1);
    search_string[sizeof(search_string) - 1] = '\0';
}

void *
malloc(size_t size) {
    void *memory = NULL;
    int trace_size = 100;
    void *stack[trace_size];
    static int initialising = 0;
    static int level = 0;

    // Save original
    if (!original_malloc) {
        if (initialising) {
            // Called recursively from dlsym: hand out memory from the buffer
            if (size + initialising_buffer_pos >= sizeof(initialising_buffer)) {
                exit(1);
            }
            void *ptr = initialising_buffer + initialising_buffer_pos;
            initialising_buffer_pos += size;
            return ptr;
        }
        initialising = 1;
        original_malloc = dlsym(RTLD_NEXT, "malloc");
        original_free = dlsym(RTLD_NEXT, "free");
        initialising = 0;
    }

    // If we're in a nested malloc call (the backtrace functions below can call malloc)
    // then call the original malloc
    if (level) {
        return original_malloc(size);
    }
    ++level;

    if (fail_in == -1) {
        memory = original_malloc(size);
    } else {
        // Find if we're in the stack; only the frames actually captured are valid
        int frames = backtrace(stack, trace_size);
        char **symbols = backtrace_symbols(stack, frames);
        int found = 0;
        for (int i = 0; symbols != NULL && i < frames; ++i) {
            if (strstr(symbols[i], search_string) != NULL) {
                found = 1;
                break;
            }
        }
        free(symbols);

        if (!found) {
            memory = original_malloc(size);
        } else {
            if (fail_in > 0) {
                memory = original_malloc(size);
            }
            // Reaching 0 makes this call fail; the counter then drops to -1
            --fail_in;
        }
    }

    --level;
    return memory;
}

void free(void *ptr) {
    // Memory handed out from the static buffer must not go to the real free
    if (ptr < (void*) initialising_buffer || ptr > (void*)(initialising_buffer + sizeof(initialising_buffer))) {
        original_free(ptr);
    }
}
Compiled with
gcc -shared -fPIC test_override_malloc.c -o -ldl
Example Python code
This could go inside the unit tests
# Inside
from ctypes import cdll
cdll.LoadLibrary('./').set_fail_in(0, b'')
# ... then call a function in the shared library
# The `0` above means the very next call it makes to malloc will fail
Run with
LD_PRELOAD=$PWD/ python3
(Admittedly, this might not be worth the effort: since Python itself calls malloc so often, it seems unlikely in most real situations that Python would carry on fine while just the one call in the library fails.)

dll export function to ctypes

I have some functions written in C++ that require high real-time performance. I want to quickly export these functions as a dynamic link library and expose them to Python so that I can do the high-level programming there.
To simplify usage, these functions use PyList_New from <Python.h> to collect some intermediate data, but I ran into errors.
Code Example
I found that the core problem is that I can't even export a Python object. After compiling the source to a DLL and loading it with ctypes, the result is:
OSError: exception: access violation reading 0x0000000000000008
C++ code:
#include <Python.h>

#ifdef _MSC_VER
#define DLL_EXPORT __declspec( dllexport )
#else
#define DLL_EXPORT
#endif

#ifdef __cplusplus
extern "C"{
#endif

DLL_EXPORT PyObject *test3() {
    PyObject* ptr = PyList_New(10);
    return ptr;
}

#ifdef __cplusplus
}
#endif
Python test code:
if __name__ == "__main__":
    import ctypes
    lib = ctypes.cdll.LoadLibrary(LIB_DLL)
    test3 = lib.test3
    test3.argtypes = None
    test3.restype = ctypes.py_object
Environment Config
Clion with Microsoft Visual Studio 2019 Community, and the arch is amd64.
I know that the right way is the recommended one, wrapping the C++ source into a module with the Python/C API, but it seems that I would have to write a lot of code. Can anyone help?
ctypes is normally for calling "regular" C functions, not Python C API functions, but it can be done. You must use PyDLL to load a function that uses Python, as it won't release the GIL (global interpreter lock), which must be held when using Python functions. Your code as shown is invalid, however, because it doesn't populate the list it creates (using OP code as test.c):
>>> from ctypes import *
>>> lib = PyDLL('./test')
>>> lib.test3.restype=py_object
>>> lib.test3()
[<NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>]
Instead, write a C or C++ function normally:
#ifdef _MSC_VER
#define DLL_EXPORT __declspec( dllexport )
#else
#define DLL_EXPORT
#endif

#ifdef __cplusplus
extern "C"{
#endif

// Allocate an array of n ints filled with 0..n-1; the caller must call destroy()
DLL_EXPORT int* create(int n) {
    auto p = new int[n];
    for(int i = 0; i < n; ++i)
        p[i] = i;
    return p;
}

DLL_EXPORT void destroy(int* p) {
    delete [] p;
}

#ifdef __cplusplus
}
#endif
from ctypes import *
lib = CDLL('./test')
lib.create.argtypes = c_int,
lib.create.restype = POINTER(c_int)
lib.destroy.argtypes = POINTER(c_int),
lib.destroy.restype = None
p = lib.create(5)
print(p) # pointer to int
print(p[:5]) # convert to list...pointer doesn't have length so slice.
lib.destroy(p) # free memory
<__main__.LP_c_long object at 0x000001E094CD9DC0>
[0, 1, 2, 3, 4]
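On the PyDLL point above: the effect can be tried without compiling anything, because ctypes.pythonapi is itself a PyDLL handle to the running interpreter. The sketch below assumes CPython; the helper name `make_list` is made up for this example. It starts from an empty list and appends to it, so no NULL slots are ever exposed.

```python
import ctypes

# ctypes.pythonapi is a PyDLL, so the GIL stays held during these calls.
PyList_New = ctypes.pythonapi.PyList_New
PyList_New.argtypes = [ctypes.c_ssize_t]
PyList_New.restype = ctypes.py_object

PyList_Append = ctypes.pythonapi.PyList_Append
PyList_Append.argtypes = [ctypes.py_object, ctypes.py_object]
PyList_Append.restype = ctypes.c_int


def make_list(n):
    """Build [0, 1, ..., n-1] through the C API (illustrative helper)."""
    result = PyList_New(0)  # empty list, fully initialised
    for i in range(n):
        if PyList_Append(result, i) != 0:
            raise RuntimeError("PyList_Append failed")
    return result


if __name__ == "__main__":
    print(make_list(5))
```

A DLL of your own that returns a PyObject* would be loaded with PyDLL in the same way, with restype set to py_object.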
I solved it myself: switching the build configuration to Release fixed all the problems.

Calling parallel C++ code in Python using Pybind11

I have a C++ code that runs in parallel with OpenMP, performing some long calculations. This part works great.
Now, I'm using Python to make a GUI around this code. So, I'd like to call my C++ code inside my python program. For that, I use Pybind11 (but I guess I could use something else if needed).
The problem is that when called from Python, my C++ code runs in serial with only one thread/CPU.
I tried (in two ways) to understand what is done in the documentation of pybind11 here but it does not seem to work at all.
My binding looks like that :
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
#include "../cpp/include/myHeader.hpp"
namespace py = pybind11;
PYBIND11_MODULE(my_module, m) {
    m.def("testFunction", &testFunction, py::call_guard<py::gil_scoped_release>());
    m.def("testFunction2", [](inputType input) -> outputType {
        /* Release GIL before calling into (potentially long-running) C++ code */
        py::gil_scoped_release release;
        outputType output = testFunction(input);
        py::gil_scoped_acquire acquire;
        return output;
    });
}
Problem: This still does not work and uses only one thread (I verify that with a print of omp_get_num_threads() in an omp parallel region).
Question: What am I doing wrong? What do I need to do to be able to use parallel C++ code inside Python?
Disclaimer: I must admit I don't really understand the GIL thing, particularly in my case where I do not use Python inside my C++ code, which is really "independent" in theory. I just want to be able to use it in another (Python) code.
Have a great day.
I have solved my problem thanks to pptaszni's answer. Indeed, the GIL handling is not needed at all; I had misunderstood the documentation. pptaszni's code worked, and the real problem was in my CMake file.
Thank you.
This is not really a good answer (it is too long for a comment, though), because I did not reproduce your problem, but maybe you can isolate the issue in your code by trying this example, which works for me:
C++ code:
#include "OpenMpExample.hpp"
#include <algorithm>
#include <iostream>
#include <numeric>  // std::accumulate
#include <random>
#include <vector>
#include <omp.h>

constexpr int DATA_SIZE = 10000000;

std::vector<int> testFunction()
{
  int nthreads = 0, tid = 0;
  std::vector<std::vector<int> > data;
  std::vector<int> results;
  std::random_device rnd_device;
  std::mt19937 mersenne_engine {rnd_device()};
  std::uniform_int_distribution<int> dist {-10, 10};
  // Note: all threads share one engine; acceptable for a smoke test only
  auto gen = [&dist, &mersenne_engine](){ return dist(mersenne_engine); };

  // Ask OpenMP how many threads it starts and size the containers to match
#pragma omp parallel private(tid)
  {
    tid = omp_get_thread_num();
    if (tid == 0)
    {
      nthreads = omp_get_num_threads();
      std::cout << "Num threads: " << nthreads << std::endl;
      data.resize(nthreads);
      results.resize(nthreads);
    }
  }

  // Each thread fills its own vector with random integers
#pragma omp parallel private(tid) shared(data, gen)
  {
    tid = omp_get_thread_num();
    data[tid].resize(DATA_SIZE);
    std::generate(data[tid].begin(), data[tid].end(), gen);
  }

  // Each thread sums its own vector
#pragma omp parallel private(tid) shared(data, results)
  {
    tid = omp_get_thread_num();
    results[tid] = std::accumulate(data[tid].begin(), data[tid].end(), 0);
  }

  for (auto r : results)
    std::cout << r << ", ";
  std::cout << std::endl;
  return results;
}
I tried to keep the code short, but force the machine to actually do some computations at the same time. Each thread generates 10^7 random integers and then sums them up. Then the python binding does not even require gil_scoped_release:
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
#include "OpenMpExample.hpp"
namespace py = pybind11;
// both versions work for me
// PYBIND11_MODULE(mylib, m) {
// m.def("testFunction", &testFunction, py::call_guard<py::gil_scoped_release>());
// }
PYBIND11_MODULE(mylib, m) {
    m.def("testFunction", &testFunction);
}
Example output from python:
Python 3.6.8 (default, Jun 29 2020, 16:38:14)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mylib
>>> x = mylib.testFunction()
Num threads: 12
-10975, -22101, -11333, -28603, -471, -15505, -18141, 2887, -6813, -5328, -13975, -4321,
My environment: Ubuntu 18.04.3 LTS, gcc 8.4.0, openMP 201511, python 3.6.8;
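If the thread count printed from Python is still 1, one cheap check before touching the bindings is to look at what the Python process itself is allowed to use. The helper below is only a sketch; its name `threading_environment` is invented here, and it cannot detect a module that was built without OpenMP enabled.

```python
import os


def threading_environment():
    """Collect settings that can limit OpenMP to a single thread."""
    info = {
        # OMP_NUM_THREADS is the standard OpenMP override; None means unset
        "OMP_NUM_THREADS": os.environ.get("OMP_NUM_THREADS"),
        "cpu_count": os.cpu_count(),
    }
    # sched_getaffinity is not available on every platform (it is on Linux);
    # it reports the CPUs this process may actually run on.
    if hasattr(os, "sched_getaffinity"):
        info["usable_cpus"] = len(os.sched_getaffinity(0))
    return info


if __name__ == "__main__":
    for key, value in threading_environment().items():
        print(key, "=", value)
```

If these values look normal, the build configuration is the next place to look, which is where the problem turned out to be in this thread.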

Embed / Include Python.h into C++ [Full Guide] (Python 3.9) (Windows) (Qt 5.15) [duplicate]

When I was trying to embed a Python script into my Qt C++ program, I ran into multiple problems when trying to include Python.h.
The following features, I would like to provide:
Include python.h
Execute Python Strings
Execute Python Scripts
Execute Python Scripts with Arguments
It should also work when Python is not installed on the deployed machine
I searched around the Internet for a solution and found a lot of questions and blog posts, but none of them covered all my problems, and it still took me multiple hours and a lot of frustration.
That's why I have to write down a StackOverflow entry with my full solution so it might help and might accelerate all your work :)
(This answer and all its code examples work also in a non-Qt environment. Only 2. and 4. are Qt specific)
1. Download and install Python.
2. Alter the .pro file of your project and add the following lines (edit for your correct Python path):
INCLUDEPATH = "C:\Users\Public\AppData\Local\Programs\Python\Python39\include"
LIBS += -L"C:\Users\Public\AppData\Local\Programs\Python\Python39\libs" -l"python39"
3. Example main.cpp code:
#include <QCoreApplication>

#pragma push_macro("slots")
#undef slots
#include <Python.h>
#pragma pop_macro("slots")

/*!
 * \brief runPy can execute a Python string
 * \param string (Python code)
 */
static void runPy(const char* string){
    Py_Initialize();
    PyRun_SimpleString(string);
    Py_Finalize();
}

/*!
 * \brief runPyScript executes a Python script
 * \param file (the path of the script)
 */
static void runPyScript(const char* file){
    FILE* fp;
    Py_Initialize();
    fp = _Py_fopen(file, "r");
    PyRun_SimpleFile(fp, file);
    Py_Finalize();
}

int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);
    runPy("from time import time,ctime\n"
          "print('Today is', ctime(time()))\n");
    //uncomment the following line to run a script
    //runPyScript(<path to your script>);
    return a.exec();
}
4. Whenever you #include <Python.h>, use the following code instead. (The slots from Python will otherwise conflict with the Qt slots.)
#pragma push_macro("slots")
#undef slots
#include <Python.h>
#pragma pop_macro("slots")
5. After compiling, add python3.dll and python39.dll, as well as the DLLs and Lib folders of Python, to your compilation folder. You can find them in the root directory of your Python installation. This will allow you to run the embedded C++ code even when Python is not installed.
With these steps, I was able to get Python running in Qt with the 64-bit MinGW and MSVC compilers. Only MSVC in debug mode still had a problem.
If you want to pass arguments to the Python script, you need the following function (it can easily be copy-pasted into your code):
/*!
 * \brief runPyScriptArgs executes a Python script and passes arguments
 * \param file (the path of the script)
 * \param argc amount of arguments
 * \param argv array of arguments with size of argc
 */
static void runPyScriptArgs(const char* file, int argc, char *argv[]){
    FILE* fp;
    wchar_t** wargv = new wchar_t*[argc];
    for(int i = 0; i < argc; i++)
    {
        // Convert each argument to the wide string the interpreter expects
        wargv[i] = Py_DecodeLocale(argv[i], nullptr);
        if(wargv[i] == nullptr)
        {
            return;
        }
    }
    Py_Initialize();
    PySys_SetArgv(argc, wargv);
    fp = _Py_fopen(file, "r");
    PyRun_SimpleFile(fp, file);
    Py_Finalize();

    for(int i = 0; i < argc; i++)
    {
        // Strings returned by Py_DecodeLocale are released with PyMem_RawFree
        PyMem_RawFree(wargv[i]);
        wargv[i] = nullptr;
    }
    delete[] wargv;
    wargv = nullptr;
}
To use this function, call it like this (For example in your main):
int py_argc = 2;
char* py_argv[py_argc];
py_argv[0] = "Progamm";
py_argv[1] = "Hello";
runPyScriptArgs("test/", py_argc, py_argv);
Together with the script in the test folder:
import sys

if len(sys.argv) != 2:
    sys.exit("Not enough args")
ca_one = str(sys.argv[0])
ca_two = str(sys.argv[1])
print ("My command line args are " + ca_one + " and " + ca_two)
you get the following output:
My command line args are Progamm and Hello

Openmesh: updating face normals faster with Python than with C++?

I have created the following simple C++ script with OpenMesh:
#include <string>
#include <OpenMesh/Core/IO/MeshIO.hh>
#include <OpenMesh/Core/Mesh/TriMesh_ArrayKernelT.hh>
struct MyTraits : OpenMesh::DefaultTraits{
    typedef OpenMesh::Vec3d Point;
    typedef OpenMesh::Vec3d Normal;
};
typedef OpenMesh::TriMesh_ArrayKernelT<MyTraits> MyMesh;

int main(int argc, char *argv[]){
    std::string filename = "filename.stl";
    MyMesh OM_mesh;
    OpenMesh::IO::Options ropt;
    ropt += OpenMesh::IO::Options::Binary;
    ropt += OpenMesh::IO::Options::FaceNormal;
    OpenMesh::IO::read_mesh(OM_mesh, filename);
    for(int k=0; k<1000; k++){
        OM_mesh.update_face_normals();
    }
    return 0;
}
Also, I have developed the following simple Python script using the OpenMesh bindings:
import openmesh as OM
filename = "filename.stl"
OM_mesh = OM.TriMesh()
options = OM.Options()
options += OM.Options.Binary
options += OM.Options.FaceNormal
OM.read_mesh(OM_mesh, filename, options)
for k in range(1000):
    OM_mesh.update_face_normals()
Both scripts update the face normals of the loaded mesh 1000 times. I expected the C++ program to be considerably faster than the Python script, but it is just the opposite: the C++ program takes around 8 seconds, while the Python script takes only around 0.3 seconds.
How can this be possible? Are the Python bindings doing something different than just "wrap" the C++ update_face_normals method? Thanks.
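When comparing the two versions, it helps to time only the update loop so that file loading is excluded from the measurement. A minimal sketch follows; the helper name `time_repeated` is an assumption for this example and is not part of OpenMesh.

```python
import time


def time_repeated(func, repeats=1000):
    """Return the wall-clock seconds taken to call func `repeats` times."""
    start = time.perf_counter()
    for _ in range(repeats):
        func()
    return time.perf_counter() - start


if __name__ == "__main__":
    # With the mesh from the script above one would pass the bound method:
    #     elapsed = time_repeated(OM_mesh.update_face_normals)
    # A stand-in workload keeps this example free of the openmesh dependency.
    elapsed = time_repeated(lambda: sum(range(100)))
    print(f"{elapsed:.4f} s")
```

Timing the read step separately in the same way would show whether the difference comes from loading or from the normal updates.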
I've found that I should use the reading options when I read the file in C++, like this:
OpenMesh::IO::read_mesh(OM_mesh, filename, ropt);
By doing so, the speed in C++ is higher than in Python. However, in .off files, this update is not correct, but this is another issue.
