How to use pybind11 in multithreaded application

How to use pybind11 in multithreaded application - python

I want to run Python in a worker thread. However I get strange segfaults and deadlocks of the threads in the worker pool. How do I correctly use pybind11/Python C API to allow the threads to run the jobs?
I know that it does not make much sense to MT python because of the GIL, but thats an intermediate solution to fit the current architecture until there is a better approach.

this works. Wrap the long running c++ code with gil_scoped_release and gil_scoped_acquire
pybind11::gil_scoped_release release;
while (true)
{
// do something and break
}
pybind11::gil_scoped_acquire acquire;

If you want to do following -
From Python you want to run C++ threads to execute different tasks on each thread.
Create a theradpool like https://github.com/progschj/ThreadPool
Write a wrapper class to wrap and bind to Python using PyBind11.
Create the instance of ThreadPool and add tasks from Python which in turn creates different threads to execute your tasks.
Disclaimer - I have not tried but this will work :)

Related

Running Python code in parallel from Rust with rust-cpython

I'm trying to speed up a data pipeline using Rust. The pipeline contains bits of Python code that I don't want to modify, so I'm trying to run them as-is from Rust using rust-cpython and multiple threads.
However, the performance is not what I expected, it's actually the same as running the python code bits sequentially in a single thread.
Reading the documentation, I understand when invoking the following, you actually get a pointer to a single Python interpreter that can only be created once, even if you run it from multiple threads separately.
let gil = Python::acquire_gil();
let py = gil.python();
If that's the case, it means the Python GIL is actually preventing all parallel execution in Rust as well. Is there a way to solve this problem?
Here's the code of my test:
use cpython::Python;
use std::thread;
use std::sync::mpsc;
use std::time::Instant;
#[test]
fn python_test_parallel() {
let start = Instant::now();
let (tx_output, rx_output) = mpsc::channel();
let tx_output_1 = mpsc::Sender::clone(&tx_output);
thread::spawn(move || {
let gil = Python::acquire_gil();
let py = gil.python();
let start_thread = Instant::now();
py.run("j=0\nfor i in range(10000000): j=j+i;", None, None).unwrap();
println!("{:27} : {:6.1} ms", "Run time thread 1, parallel", (Instant::now() - start_thread).as_secs_f64() * 1000f64);
tx_output_1.send(()).unwrap();
});
let tx_output_2 = mpsc::Sender::clone(&tx_output);
thread::spawn(move || {
let gil = Python::acquire_gil();
let py = gil.python();
let start_thread = Instant::now();
py.run("j=0\nfor i in range(10000000): j=j+i;", None, None).unwrap();
println!("{:27} : {:6.1} ms", "Run time thread 2, parallel", (Instant::now() - start_thread).as_secs_f64() * 1000f64);
tx_output_2.send(()).unwrap();
});
// Receivers to ensure all threads run
let _output_1 = rx_output.recv().unwrap();
let _output_2 = rx_output.recv().unwrap();
println!("{:37} : {:6.1} ms", "Total time, parallel", (Instant::now() - start).as_secs_f64() * 1000f64);
}

The CPython implementation of Python does not allow executing Python bytecode in multiple threads at the same time. As you note yourself, the global interpreter lock (GIL) prevents this.
We don't have any information on what exactly your Python code is doing, so I'll give a few general hints how you could improve the performance of your code.
If your code is I/O-bound, e.g. reading from the network, you will generally get nice performance improvements from using multiple threads. Blocking I/O calls will release the GIL before blocking, so other threads can execute during that time.
Some libraries, e.g. NumPy, internally release the GIL during long-running library calls that don't need access to Python data structures. With these libraries, you can get performance improvements for multi-threaded, CPU-bound code even if you only write pure Python code using the library.
If your code is CPU-bound and spends most of its time executing Python bytecode, you can often use multipe processes rather than threads to achieve parallel execution. The multiprocessing in the Python standard library helps with this.
If your code is CPU-bound, spends most of its time executing Python bytecode and can't be run in parallel processes because it accesses shared data, you can't run it in multiple threads in parallel – the GIL prevents this. However, even without the GIL, you can't just run sequential code in parallel without changes in any language. Since you have concurrent access to some data, you need to add locking and possibly make algorithmic changes to prevent data races; the details of how to do this depend on your use case. (And if you don't have concurrent data access, you should use processes instead of threads – see above.)
Beyond parallelism, a good way to speed up Python code with Rust is to profile your Python code, find the hot spots where most of the time is spent, and rewrite these bits as Rust functions that you call from your Python code. If this doesn't give you enough of a speedup, you can combine this approach with parallelism – preventing data races is generally easier to achieve in Rust than in most other languages.

If you use py03 bindings you can use the allow_threads method and callbacks to free the GIL for faster parralelism: https://pyo3.rs/v0.13.2/parallelism.html

Multithreading with Python and C api

I have a C++ program that uses the C api to use a Python library of mine.
Both the Python library AND the C++ code are multithreaded.
In particular, one thread of the C++ program instantiates a Python object that inherits from threading.Thread. I need all my C++ threads to be able to call methods on that object.
From my very first tries (I naively just instantiate the object from the main thread, then wait some time, then call the method) I noticed that the execution of the Python thread associated with the object just created stops as soon as the execution comes back to the C++ program.
If the execution stays with Python (for example, if I call PyRun_SimpleString("time.sleep(5)");) the execution of the Python thread continues in background and everything works fine until the wait ends and the execution goes back to C++.
I am evidently doing something wrong. What should I do to make both my C++ and Python multithreaded and capable of working with each other nicely? I have no previous experience in the field so please don't assume anything!

A correct order of steps to perform what you are trying to do is:
In the main thread:
Initialize Python using Py_Initialize*.
Initialize Python threading support using PyEval_InitThreads().
Start the C++ thread.
At this point, the main thread still holds the GIL.
In a C++ thread:
Acquire the GIL using PyGILState_Ensure().
Create a new Python thread object and start it.
Release the GIL using PyGILState_Release().
Sleep, do something useful or exit the thread.
Because the main thread holds the GIL, this thread will be waiting to acquire the GIL. If the main thread calls the Python API it may release the GIL from time to time allowing the Python thread to execute for a little while.
Back in the main thread:
Release the GIL, enabling threads to run using PyEval_SaveThread()
Before attempting to use other Python calls, reacquire the GIL using PyEval_RestoreThread()
I suspect that you are missing the last step - releasing the GIL in the main thread, allowing the Python thread to execute.
I have a small but complete example that does exactly that at this link.

You probably do not unlock the Global Interpreter Lock when you callback from python's threading.Thread.
Well, if you are using bare python's C API you have some documentation here, about how to release/acquire the GIL. But while using C++, I must warn you that it might broke down upon any exceptions throwing in your C++ code. See here.
In general any of your C++ function that runs for too long should unlock GIL and lock, whenever it use the C Python API again.

scheduling embedded python processes

I've been trying to create a C++ program that embeds multiple python threads. Due to the nature of the program the advantage of multitasking comes from asynchronous I/O; but due to some variables that need to be altered between context switching I need to control the scheduling. I thought that because of python's GIL lock this would be simple enough, but it's turning out not to be: python wants to use POSIX threads rather than software threads, I can't figure out from the documentation what happens if I store the result of PyEval_SaveThread() and don't call PyEval_RestoreThread() in the same function--so presumably I'm not supposed to be doing that, etc.
Is it possible to create a custom scheduler for embedded python threads, or was python basically designed so that it can't be done?

It turns out that using PyEval_SaveThread() and PyEval_RestoreThread() is unnecessary, basically I used coroutines to run the scripts and control the scheduling. In this case from libPCL. However this isn't really much of a solution because if python encounters a syntax error it will segfault if it is in a coroutine, oddly enough even if there is only one python script running in one coroutine this will still happen. But at the very least they don't seem to conflict with each other.

working in python console while executing a boost::python module

How can i keep using the console while executing a process from a boost::python module? I figured i have to use threading but I think I'm missing something.
import pk #my boost::python module from c++
import threading
t = threading.Thread(target=pk.showExample, args=())
t.start()
This executes showExample, which runs a Window rendering 3D content. Now i would like to keep on coding in the python console while this window is running. The example above works to show the Window but fails to keep the console interactive. Any Ideas how to do it? Thanks for any suggestions.
Greetings
Chris
Edit: I also tried to make Threads in the showExample() C++ code, didn't work as well. I probably have to make the console a thread, but I have not a clue how and can't find any helpful examples.
Edit2: to make the example more simple I implemented these c++ methods:
void Example::simpleWindow()
{
int running = GL_TRUE;
glfwInit();
glfwOpenWindow(800,600, 8,8,8,8,24,8, GLFW_WINDOW);
glewExperimental = GL_TRUE;
glewInit();
glEnable(GL_DEPTH_TEST);
glEnable(GL_CULL_FACE);
glCullFace(GL_BACK);
while(running)
{
glClearColor(0.0f, 0.0f, 0.0f, 0.0f);
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT);
glfwSwapBuffers();
running = !glfwGetKey(GLFW_KEY_ESC) && gkfwGetWindowParam(GLFW_OPENED);
}
}
void Example::makeWindowThread()
{
boost::thread t(simpleWindow);
t.join();
}
There may be some useless lines of code (it was just copy paste of a part from the real method i want to use.) Both methods are static. If I start interactive console in a thread and start the pk.makeWindowThread() in python, i can't give input anymore. Doesn't work if I put the call of pk.makeWindowThread() in a python thread as well. (I am trying to print something in console while showing the window.

When trying to execute a process while keeping the console interactive, then consider using the subprocess or multiprocessing modules. When doing this within Boost.Python, it is probably more appropriate to execute a process in C++ using execv() family of functions.
When trying to spawn a thread while keeping the console interactive, then one must consider the Global Interpreter Lock (GIL). In short, the GIL is a mutex around the interpreter, preventing parallel operations to be performed on Python objects. Thus, at any point in time, a max of one thread, the one that has acquired the GIL, is allowed to perform operations on Python objects.
For multithreaded Python programs with no C or C++ threads, the CPython interpreter functions as a cooperative scheduler, enabling concurrency. Threads will yield control when Python knows the thread is about to perform a blocking call. For example, a thread will release the GIL within time.sleep(). Additionally, the interpreter will force a thread to yield control after certain criteria have been met. For example, after a thread has executed a certain amount of bytecode operations, the interpreter will force it to yield control, allowing other threads to execute.
C or C++ threads are sometimes referred to as alien threads in the Python documentation. The Python interpreter has no ability to force an alien thread to yield control by releasing the GIL. Therefore, alien threads are responsible for managing the GIL to permit concurrent or parallel execution with Python threads. With this in mind, lets examine some of the C++ code:
void Example::makeWindowThread()
{
boost::thread t(simpleWindow);
t.join();
}
This will spawn a thread, and thread::join() will block, waiting for the t thread to complete execution. If this function is exposed to Python via Boost.Python, then the calling thread will block. As only one Python thread is allowed to be executed at any point in time, the calling thread will own the GIL. Once the calling thread blocks on t.join(), all other Python threads will remain blocked, as the interpreter cannot force the thread to yield control. To enable other Python threads to run, the GIL should be released pre-join, and acquired post-join.
void Example::makeWindowThread()
{
boost::thread t(simpleWindow);
release GIL // allow other python threads to run.
t.join();
acquire GIL // execution is going to occur within the interpreter.
}
However, this will still cause the console to block waiting for the thread to complete execution. Instead, consider spawning the thread and detaching from it via thread::detach(). As the calling thread will no longer block, managing the GIL within Example::makeWindowThread is no longer necessary.
void Example::makeWindowThread()
{
boost::thread(simpleWindow).detach();
}
For more details/examples of managing the GIL, please consider reading this answer for a basic implementation overview, and this answer for a much deeper dive into considerations one must take.

You have two options:
start python with the -i flag, that will cause to drop it to the interactive interperter instead of exiting from the main thread
start an interactive session manually:
import code
code.interact()
The second option is particularily useful if you want to run the interactive session in it's own thread, as some libraries (like PyQt/PySide) don't like it when they arn't started from the main thread:
from code import interact
from threading import Thread
Thread(target=interact, kwargs={'local': globals()}).start()
... # start some mainloop which will block the main thread
Passing local=globals() to interact is necessary so that you have access to the scope of the module, otherwise the interpreter session would only have access to the content of the thread's scope.

Embedding python in multithreaded C application

I'm embedding the python interpreter in a multithreaded C application and I'm a little confused as to what APIs I should use to ensure thread safety.
From what I gathered, when embedding python it is up to the embedder to take care of the GIL lock before calling any other Python C API call. This is done with these functions:
gstate = PyGILState_Ensure();
// do some python api calls, run python scripts
PyGILState_Release(gstate);
But this alone doesn't seem to be enough. I still got random crashes since it doesn't seem to provide mutual exclusion for the Python APIs.
After reading some more docs I also added:
PyEval_InitThreads();
right after the call to Py_IsInitialized() but that's where the confusing part comes. The docs state that this function:
Initialize and acquire the global interpreter lock
This suggests that when this function returns, the GIL is supposed to be locked and should be unlocked somehow. but in practice this doesn't seem to be required. With this line in place my multithreaded worked perfectly and mutual exclusion was maintained by the PyGILState_Ensure/Release functions.
When I tried adding PyEval_ReleaseLock() after PyEval_ReleaseLock() the app dead-locked pretty quickly in a subsequent call to PyImport_ExecCodeModule().
So what am I missing here?

I had exactly the same problem and it is now solved by using PyEval_SaveThread() immediately after PyEval_InitThreads(), as you suggest above. However, my actual problem was that I used PyEval_InitThreads() after PyInitialise() which then caused PyGILState_Ensure() to block when called from different, subsequent native threads. In summary, this is what I do now:
There is global variable:
static int gil_init = 0;
From a main thread load the native C extension and start the Python interpreter:
Py_Initialize()
From multiple other threads my app concurrently makes a lot of calls into the Python/C API:
if (!gil_init) {
gil_init = 1;
PyEval_InitThreads();
PyEval_SaveThread();
}
state = PyGILState_Ensure();
// Call Python/C API functions...
PyGILState_Release(state);
From the main thread stop the Python interpreter
Py_Finalize()
All other solutions I've tried either caused random Python sigfaults or deadlock/blocking using PyGILState_Ensure().
The Python documentation really should be more clear on this and at least provide an example for both the embedding and extension use cases.

Eventually I figured it out.
After
PyEval_InitThreads();
You need to call
PyEval_SaveThread();
While properly release the GIL for the main thread.

Note that the if (!gil_init) { code in #forman's answer runs only once, so it can be just as well done in the main thread, which allows us to drop the flag (gil_init would properly have to be atomic or otherwise synchronized).
PyEval_InitThreads() is meaningful only in CPython 3.6 and older, and has been deprecated in CPython 3.9, so it has to be guarded with a macro.
Given all this, what I am currently using is the following:
In the main thread, run all of
Py_Initialize();
PyEval_InitThreads(); // only on Python 3.6 or older!
/* tstate = */ PyEval_SaveThread(); // maybe save the return value if you need it later
Now, whenever you need to call into Python, do
state = PyGILState_Ensure();
// Call Python/C API functions...
PyGILState_Release(state);
Finally, from the main thread, stop the Python interpreter
PyGILState_Ensure(); // PyEval_RestoreThread(tstate); seems to work just as well
Py_Finalize()

Having a multi-threaded C app trying to communicate from multiple threads to multiple Python threads of a single CPython instance looks risky to me.
As long as only one C thread communicates with Python you should not have to worry about locking even if the Python application is multi-threading.
If you need multiple python threads you can set the application up this way and have multiple C threads communicate via a queue with that single C thread that farms them out to multiple Python threads.
An alternative that might work for you is to have multiple CPython instances one for each C thread that needs it (of course communication between Python programs should be via the C program).
Another alternative might the Stackless Python interpreter. That does away with the GIL, but I am not sure you run into other problems binding it to multiple threads. stackless was a drop-in replacement for my (single-threaded) C application.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.