Are Python extensions produced by Cython/Pyrex threadsafe?

If not, is there a way I can guarantee thread safety by programming a certain way?
To clarify, when talking about "threadsafe," I mean Python threads, not OS-level threads.

It all depends on the interaction between your Cython code and Python's GIL, as documented in detail here. If you don't do anything special, Cython-generated code will respect the GIL (as will a C-coded extension that doesn't use the GIL-releasing macros); that makes such code "as threadsafe as Python code" -- which isn't much, but is easier to handle than completely free-threading code (you still need to architect multi-threaded cooperation and synchronization, ideally with Queue instances but possibly with locking &c).
Code that has relinquished the GIL and not yet acquired it back MUST NOT in any way interact with the Python runtime and the objects that the Python runtime uses -- this goes for Cython just as well as for C-coded extensions. The upside of it is of course that such code can run on a separate core (until it needs to sync up or in any way communicate with the Python runtime again, of course).
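To make the "ideally with Queue instances" advice concrete, here is a minimal sketch in pure Python. The function name `square_all` and the sentinel-based shutdown are illustrative assumptions, not something taken from Cython or from the answer above; the point is only that the worker threads share no state except the two queues.

```python
import queue
import threading


def square_all(numbers, num_workers=4):
    """Square each number on worker threads that communicate only via queues."""
    tasks = queue.Queue()
    results = queue.Queue()
    stop = object()  # sentinel telling a worker to exit

    def worker():
        while True:
            item = tasks.get()
            if item is stop:
                break
            results.put((item, item * item))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for n in numbers:
        tasks.put(n)
    for _ in threads:
        tasks.put(stop)  # one sentinel per worker
    for t in threads:
        t.join()

    # Every worker has exited, so no more results can arrive.
    collected = {}
    while not results.empty():
        key, value = results.get()
        collected[key] = value
    return collected
```

Because `queue.Queue` does its own locking, the calling code needs no explicit locks.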

Python's global interpreter lock means that only one thread can be active in the interpreter at any one time. However, once control is passed out to a C extension, another thread can be active within the interpreter. Multiple threads can be created, and nothing prevents a thread from being interrupted in the middle of a critical section. Non-thread-safe code can be implemented within the interpreter, so nothing about code running within the interpreter is inherently thread safe. Code in C or Pyrex modules can still modify data structures that are visible to Python code. Native code can, of course, also have threading issues with native data structures.
You can't guarantee thread safety beyond using appropriate design and synchronisation - the GIL on the python interpreter doesn't materially change this.
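As a rough illustration of "appropriate design and synchronisation", the sketch below guards a critical section with a `threading.Lock`. The `Counter` class and `count_with_threads` helper are hypothetical names invented for this example; the same pattern applies whether the shared data is touched from Python or from an extension module.

```python
import threading


class Counter:
    """A shared counter whose read-modify-write step is protected by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one thread at a time may execute this block.
        with self._lock:
            self.value += 1


def count_with_threads(num_threads=8, increments_per_thread=10_000):
    counter = Counter()

    def work():
        for _ in range(increments_per_thread):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

With the lock in place the final value should always equal `num_threads * increments_per_thread`; without it, updates may be lost depending on how the interpreter schedules the threads.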

Related

How to manipulate the GIL when many threads need to execute?

My understanding is that the typical GIL manipulations involve, e.g., blocking I/O operations. Hence one would want to release the lock before the I/O operation and reacquire it once it has completed.
I'm currently facing a different scenario with a C extension: I am creating X windows that are exposed to Python via the Canvas class. When the method show() is called on an instance, a new UI thread is started using PyThreads (with a call to PyThread_start_new_thread). This new thread is responsible for drawing on the X window, using the Python code specified in the on_draw method of a subclass of Canvas. A pure C event loop is started in the main thread that simply checks for events on the X window and, for the time being, only captures the WM_DELETE_EVENT.
So I have potentially many threads (one for each X window) that want to execute Python code and the main thread that does not execute any Python code at all.
How do I release/acquire the GIL so that the UI threads can enter the interpreter in an orderly fashion?

The rule is easy: you need to hold the GIL to access Python machinery (any API starting with Py<...> and any PyObject).
So, you can release it whenever you don't need any of that.
Anything further than this is the fundamental problem of locking granularity: potential benefits vs locking overhead. There was an experiment for Py 1.4 to replace the GIL with more granular locks that failed exactly because the overhead proved prohibitive.
That's why it's typically released for code chunks involving call(s) to external facilities that can take arbitrary time (especially if they involve waiting for external events) -- if you don't release the lock, Python will just be idling during this time.
Heeding this rule, you will get to your goal automatically: whenever a thread can't proceed further (whether it's I/O, signal from another thread, or even so much as a time.sleep() to avoid a busy loop), it will release the lock and allow other threads to proceed in its stead. The GIL assigning mechanism strives to be fair (see issue8299 for exploration on how fair it is), releasing the programmer from bothering about any bias stemming solely from the engine.

I think the problem stems from the fact that, in my opinion, the official documentation is a bit ambiguous on the meaning of Non-Python created threads. Quoting from it:
When threads are created using the dedicated Python APIs (such as the threading module), a thread state is automatically associated to them and the code showed above is therefore correct. However, when threads are created from C (for example by a third-party library with its own thread management), they don’t hold the GIL, nor is there a thread state structure for them.
I have highlighted in bold the parts that I find off-putting. As I have stated in the OP, I am calling PyThread_start_new_thread. Whilst this creates a new thread from C, this function is not part of a third-party library, but of the dedicated Python (C) APIs. Based on this assumption, I ruled out that I actually needed to use the PyGILState_Ensure/PyGILState_Release paradigm.
As far as I can tell from what I've seen with my experiments, a thread created from C with (just) PyThread_start_new_thread should be considered as a non-Python created thread.

Safe to call multiprocessing from a thread in Python?

According to
https://github.com/joblib/joblib/issues/180, and Is there a safe way to create a subprocess from a thread in python?
the Python multiprocessing module does not allow use from within threads. Is this true?
My understanding is that it's fine to fork from threads, as long as you aren't holding a threading.Lock when you do so (in the current thread? anywhere in the process?). However, Python's documentation is silent on whether threading.Lock objects are safely shared after a fork.
There's also this: locks shared from the logging module causes issues with fork. https://bugs.python.org/issue6721
I'm not sure how this issue arises. It sounds like the state of any locks in the process are copied into the child process when the current thread forks (which seems like a design error and certain to deadlock). If so, does using multiprocessing really provide any protection against this (since I'm free to create my multiprocessing.Pool after threading.Lock is created and entered by other threads, and after threads have started that are using the not-fork-safe logging module) -- the multiprocessing module docs are also silent about whether multiprocessing.Pools should be allocated before Locks.
Does replacing threading.Lock with multiprocessing.Lock everywhere avoid this issue and allow us to safely combine threads and forks?

It sounds like the state of any locks in the process are copied into the child process when the current thread forks (which seems like a design error and certain to deadlock).
It is not a design error; rather, fork() predates single-process multithreading. The state of all locks is copied into the child process because they're just objects in memory; the entire address space of the process is copied as is in fork. There are only bad alternatives: either copy all threads over fork, or deny forking in a multithreaded application.
Therefore, fork()ing in a multithreaded program was never a safe thing to do, unless it is then followed by execve() or exit() in the child process.
Does replacing threading.Lock with multiprocessing.Lock everywhere avoid this issue and allow us to safely combine threads and forks?
No. Nothing makes it safe to combine threads and forks, it cannot be done.
The problem is that when you have multiple threads in a process, after fork() system call you cannot continue safely running the program in POSIX systems.
For example, the Linux manual page fork(2) says:
After a fork(2) in a multithreaded program, the child can safely call
only async-signal-safe functions (see signal(7)) until such time as it
calls execve(2).
I.e. it is OK to fork() in a multithreaded program and then only call async-signal-safe C functions (which is a rather limited subset of C functions), until the child process has been replaced with another executable!
Unsafe C function calls in child processes are then for example
malloc for dynamic memory allocation
any <stdio.h> functions for formatted input
most of the pthread_* functions required for thread state handling, including creation of new threads...
Thus there is very little that the child process can actually safely do. Unfortunately CPython core developers have been downplaying the problems caused by this. Even now the documentation says:
Note that safely forking a multithreaded process is
problematic.
Quite a euphemism for "impossible".
It is safe to use multiprocessing from a Python process that has multiple threads of control provided that you're not using the fork start method; in Python 3.4+ it is now possible to change the start method. In previous Python versions including all of Python 2, the POSIX systems always behaved as if fork was specified as the start method; this would result in undefined behaviour.
The problems are not limited to just threading.Lock objects but extend to all locks held by the C standard library, the C extensions, etc. What is worse is that most of the time people would say "it works for me"... until it stops working.
There were even cases where a seemingly single-threaded Python program was actually multithreaded on Mac OS X, causing failures and deadlocks upon using multiprocessing.
Another problem is that opened file handles and shared sockets might behave oddly in programs that fork, but that would be the case even in single-threaded programs.
TL;DR: using multiprocessing in multithreaded programs, with C extensions, with opened sockets etc:
fine in 3.4+ & POSIX if you explicitly specify a starting method that is not fork,
fine in Windows because it doesn't support forking;
in Python 2 - 3.3 on POSIX: you'll mostly shoot yourself in the foot.
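For the "explicitly specify a starting method that is not fork" case, a minimal sketch using `multiprocessing.get_context` (available in Python 3.4+) might look like this. The helper names `spawn_context` and `factorials_in_pool` are made up for illustration, and `math.factorial` is used as the worker only because it is importable in the child process without any extra setup.

```python
import math
import multiprocessing as mp


def spawn_context():
    """Return a context whose pools and processes never use fork()."""
    return mp.get_context("spawn")


def factorials_in_pool(numbers, processes=2):
    """Compute factorials in child processes started with the spawn method."""
    ctx = spawn_context()
    with ctx.Pool(processes=processes) as pool:
        return pool.map(math.factorial, numbers)
```

A script would typically call `factorials_in_pool([5, 10, 20])` from under an `if __name__ == "__main__":` guard, because with the spawn method the child processes re-import the main module. Using a context object rather than `multiprocessing.set_start_method` keeps the choice local to this code instead of changing it for the whole program.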

Python threads all executing on a single core

I have a Python program that spawns many threads, runs 4 at a time, and each performs an expensive operation. Pseudocode:
for obj in objects:
    t = Thread(target=process, args=(obj,))
    # if fewer than 4 threads are currently running, t.start(). Otherwise, add t to queue
But when the program is run, Activity Monitor in OS X shows that 1 of the 4 logical cores is at 100% and the others are at nearly 0. Obviously I can't force the OS to do anything but I've never had to pay attention to performance in multi-threaded code like this before so I was wondering if I'm just missing or misunderstanding something.
Thanks.

Note that in many cases (and virtually all cases where your "expensive operation" is a calculation implemented in Python), multiple threads will not actually run concurrently due to Python's Global Interpreter Lock (GIL).
The GIL is an interpreter-level lock.
This lock prevents execution of
multiple threads at once in the Python
interpreter. Each thread that wants to
run must wait for the GIL to be
released by the other thread, which
means your multi-threaded Python
application is essentially single
threaded, right? Yes. Not exactly.
Sort of.
CPython uses what’s called “operating
system” threads under the covers,
which is to say each time a request to
make a new thread is made, the
interpreter actually calls into the
operating system’s libraries and
kernel to generate a new thread. This
is the same as Java, for example. So
in memory you really do have multiple
threads and normally the operating
system controls which thread is
scheduled to run. On a multiple
processor machine, this means you
could have many threads spread across
multiple processors, all happily
chugging away doing work.
However, while CPython does use
operating system threads (in theory
allowing multiple threads to execute
within the interpreter
simultaneously), the interpreter also
forces the GIL to be acquired by a
thread before it can access the
interpreter and stack and can modify
Python objects in memory all
willy-nilly. The latter point is why
the GIL exists: The GIL prevents
simultaneous access to Python objects
by multiple threads. But this does not
save you (as illustrated by the Bank
example) from being a lock-sensitive
creature; you don’t get a free ride.
The GIL is there to protect the
interpreters memory, not your sanity.
See the Global Interpreter Lock section of Jesse Noller's post for more details.
To get around this problem, check out Python's multiprocessing module.
multiple processes (with judicious use
of IPC) are[...] a much better
approach to writing apps for multi-CPU
boxes than threads.
-- Guido van Rossum (creator of Python)
Edit based on a comment from @spinkus:
If Python can't run multiple threads simultaneously, then why have threading at all?
Threads can still be very useful in Python when doing simultaneous operations that do not need to modify the interpreter's state. This includes many (most?) long-running function calls that are not in-Python calculations, such as I/O (file access or network requests) and calculations on NumPy arrays. These operations release the GIL while waiting for a result, allowing the program to continue executing. Then, once the result is received, the thread must re-acquire the GIL in order to use that result in "Python-land".
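A small sketch of that effect, using `time.sleep` as a stand-in for a blocking I/O call (an assumption made so the example needs no network or files; `fake_io` and `run_concurrently` are hypothetical names):

```python
import threading
import time


def fake_io(delay, results, index):
    # time.sleep releases the GIL while waiting, much like a blocking read.
    time.sleep(delay)
    results[index] = delay


def run_concurrently(delays):
    """Run one waiting thread per delay and report the total wall-clock time."""
    results = [None] * len(delays)
    threads = [
        threading.Thread(target=fake_io, args=(delay, results, i))
        for i, delay in enumerate(delays)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    return results, elapsed
```

With four delays of 0.3 seconds, the elapsed time should be close to 0.3 seconds rather than 1.2, because the waits overlap. Replacing the sleep with a pure-Python calculation would be expected to remove that benefit.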

Python has a Global Interpreter Lock, which can prevent threads of interpreted code from being processed concurrently.
http://en.wikipedia.org/wiki/Global_Interpreter_Lock
http://wiki.python.org/moin/GlobalInterpreterLock
For ways to get around this, try the multiprocessing module, as advised here:
Does running separate python processes avoid the GIL?

AFAIK, in CPython the Global Interpreter Lock means that there can't be more than one block of Python code being run at any one time. Although this does not really affect anything on a single-processor/single-core machine, on a multicore machine it means you effectively have only one thread running at any one time, leaving all the other cores idle.

What is the global interpreter lock (GIL) in CPython?

What is a global interpreter lock and why is it an issue?
A lot of noise has been made around removing the GIL from Python, and I'd like to understand why that is so important. I have never written a compiler nor an interpreter myself, so don't be frugal with details, I'll probably need them to understand.

Python's GIL is intended to serialize access to interpreter internals from different threads. On multi-core systems, it means that multiple threads can't effectively make use of multiple cores. (If the GIL didn't lead to this problem, most people wouldn't care about the GIL - it's only being raised as an issue because of the increasing prevalence of multi-core systems.) If you want to understand it in detail, you can view this video or look at this set of slides. It might be too much information, but then you did ask for details :-)
Note that Python's GIL is only really an issue for CPython, the reference implementation. Jython and IronPython don't have a GIL. As a Python developer, you don't generally come across the GIL unless you're writing a C extension. C extension writers need to release the GIL when their extensions do blocking I/O, so that other threads in the Python process get a chance to run.

Suppose you have multiple threads which don't really touch each other's data. Those should execute as independently as possible. If you have a "global lock" which you need to acquire in order to (say) call a function, that can end up as a bottleneck. You can wind up not getting much benefit from having multiple threads in the first place.
To put it into a real world analogy: imagine 100 developers working at a company with only a single coffee mug. Most of the developers would spend their time waiting for coffee instead of coding.
None of this is Python-specific - I don't know the details of what Python needed a GIL for in the first place. However, hopefully it's given you a better idea of the general concept.

Let's first understand what the python GIL provides:
Any operation/instruction is executed in the interpreter. GIL ensures that interpreter is held by a single thread at a particular instant of time. And your python program with multiple threads works in a single interpreter. At any particular instant of time, this interpreter is held by a single thread. It means that only the thread which is holding the interpreter is running at any instant of time.
Now why is that an issue:
Your machine could have multiple cores/processors. Multiple cores allow multiple threads to execute simultaneously, i.e. multiple threads could execute at any particular instant of time.
But since the interpreter is held by a single thread, other threads are not doing anything even though they have access to a core. So, you are not getting any advantage provided by multiple cores because at any instant only a single core, which is the core being used by the thread currently holding the interpreter, is being used. So, your program will take as long to execute as if it were a single threaded program.
However, potentially blocking or long-running operations, such as I/O, image processing, and NumPy number crunching, happen outside the GIL. Taken from here. So for such operations, a multithreaded operation will still be faster than a single threaded operation despite the presence of GIL. So, GIL is not always a bottleneck.
Edit: The GIL is an implementation detail of CPython. IronPython and Jython don't have a GIL, so a truly multithreaded program should be possible in them, though I have never used PyPy and Jython and am not sure of this.

Python 3.7 documentation
I would also like to highlight the following quote from the Python threading documentation:
CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing or concurrent.futures.ProcessPoolExecutor. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
This links to the Glossary entry for global interpreter lock which explains that the GIL implies that threaded parallelism in Python is unsuitable for CPU bound tasks:
The mechanism used by the CPython interpreter to assure that only one thread executes Python bytecode at a time. This simplifies the CPython implementation by making the object model (including critical built-in types such as dict) implicitly safe against concurrent access. Locking the entire interpreter makes it easier for the interpreter to be multi-threaded, at the expense of much of the parallelism afforded by multi-processor machines.
However, some extension modules, either standard or third-party, are designed so as to release the GIL when doing computationally-intensive tasks such as compression or hashing. Also, the GIL is always released when doing I/O.
Past efforts to create a “free-threaded” interpreter (one which locks shared data at a much finer granularity) have not been successful because performance suffered in the common single-processor case. It is believed that overcoming this performance issue would make the implementation much more complicated and therefore costlier to maintain.
This quote also implies that dicts and thus variable assignment are also thread safe as a CPython implementation detail:
Is Python variable assignment atomic?
Thread Safety in Python's dictionary
Next, the docs for the multiprocessing package explain how it overcomes the GIL by spawning processes while exposing an interface similar to that of threading:
multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. It runs on both Unix and Windows.
And the docs for concurrent.futures.ProcessPoolExecutor explain that it uses multiprocessing as a backend:
The ProcessPoolExecutor class is an Executor subclass that uses a pool of processes to execute calls asynchronously. ProcessPoolExecutor uses the multiprocessing module, which allows it to side-step the Global Interpreter Lock but also means that only picklable objects can be executed and returned.
which should be contrasted to the other base class ThreadPoolExecutor that uses threads instead of processes
ThreadPoolExecutor is an Executor subclass that uses a pool of threads to execute calls asynchronously.
from which we conclude that ThreadPoolExecutor is only suitable for I/O bound tasks, while ProcessPoolExecutor can also handle CPU bound tasks.
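Because both classes implement the same Executor interface, switching between them can be a one-argument change. The sketch below assumes a toy CPU-bound task; `count_primes` and `run_all` are illustrative names, not part of `concurrent.futures`.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def count_primes(limit):
    """CPU-bound toy task: count primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def run_all(executor_class, limits, max_workers=4):
    """Run count_primes over limits with the given Executor subclass."""
    with executor_class(max_workers=max_workers) as executor:
        return list(executor.map(count_primes, limits))
```

Calling `run_all(ThreadPoolExecutor, limits)` and `run_all(ProcessPoolExecutor, limits)` should return the same list; on a multi-core machine the process-based call would be expected to finish sooner for large limits. The process-based call belongs under an `if __name__ == "__main__":` guard in a script, and `count_primes` has to stay a module-level function so it can be pickled.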
Process vs thread experiments
At Multiprocessing vs Threading Python I've done an experimental analysis of process vs threads in Python.
In other languages
The concept seems to exist outside of Python as well, applying just as well to Ruby for example: https://en.wikipedia.org/wiki/Global_interpreter_lock
It mentions the advantages:
increased speed of single-threaded programs (no necessity to acquire or release locks on all data structures separately),
easy integration of C libraries that usually are not thread-safe,
ease of implementation (having a single GIL is much simpler to implement than a lock-free interpreter or one using fine-grained locks).
but the JVM seems to do just fine without the GIL, so I wonder if it is worth it. The following question asks why the GIL exists in the first place: Why the Global Interpreter Lock?

Python doesn't allow multi-threading in the truest sense of the word. It has a multi-threading package but if you want to multi-thread to speed your code up, then it's usually not a good idea to use it. Python has a construct called the Global Interpreter Lock (GIL).
https://www.youtube.com/watch?v=ph374fJqFPE
The GIL makes sure that only one of your 'threads' can execute at any one time. A thread acquires the GIL, does a little work, then passes the GIL onto the next thread. This happens very quickly so to the human eye it may seem like your threads are executing in parallel, but they are really just taking turns using the same CPU core. All this GIL passing adds overhead to execution. This means that if you want to make your code run faster then using the threading package often isn't a good idea.
There are reasons to use Python's threading package. If you want to run some things simultaneously, and efficiency is not a concern, then it's totally fine and convenient. Or if you are running code that needs to wait for something (like some IO) then it could make a lot of sense. But the threading library won't let you use extra CPU cores.
Multi-threading can be outsourced to the operating system (by doing multi-processing), some external application that calls your Python code (eg, Spark or Hadoop), or some code that your Python code calls (eg: you could have your Python code call a C function that does the expensive multi-threaded stuff).

Whenever two threads have access to the same variable you have a problem.
In C++ for instance, the way to avoid the problem is to define some mutex lock to prevent two threads from, let's say, entering the setter of an object at the same time.
Multithreading is possible in Python, but two threads cannot be executed at the same time at a granularity finer than one Python instruction.
The running thread holds a global lock called the GIL.
This means that if you write some multithreaded code in order to take advantage of your multicore processor, your performance won't improve.
The usual workaround consists of going multiprocess.
Note that it is possible to release the GIL if you're inside a method you wrote in C, for instance.
The use of a GIL is not inherent to Python but to some of its interpreters, including the most common one, CPython.
(#edited, see comment)
The GIL issue is still valid in Python 3000.
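To see why "one Python instruction" is a finer unit than one line of source, the standard `dis` module can list the bytecode behind a statement. This is only a sketch: the exact opcode names vary between CPython versions, and `increment` and `opnames` are names chosen for this example.

```python
import dis

counter = 0


def increment():
    global counter
    counter += 1


def opnames(func):
    """Return the names of the bytecode instructions that make up func."""
    return [instruction.opname for instruction in dis.get_instructions(func)]


if __name__ == "__main__":
    for name in opnames(increment):
        print(name)
```

The listing should show a load of the global, an add, and a store as separate instructions. A thread switch between the load and the store is what allows two threads to overwrite each other's update, even with the GIL.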

Why Python (CPython and others) uses the GIL
From http://wiki.python.org/moin/GlobalInterpreterLock
In CPython, the global interpreter lock, or GIL, is a mutex that prevents multiple native threads from executing Python bytecodes at once. This lock is necessary mainly because CPython's memory management is not thread-safe.
How to remove it from Python?
Like Lua, Python could perhaps start multiple VMs, but it doesn't do that; I guess there are other reasons for this.
In Numpy or some other python extended library, sometimes, releasing the GIL to other threads could boost the efficiency of the whole programme.

I want to share an example from the book Multithreading for Visual Effects. Here is a classic deadlock situation:
static void MyCallback(const Context &context){
    Auto<Lock> lock(GetMyMutexFromContext(context));
    ...
    EvalMyPythonString(str); //A function that takes the GIL
    ...
}
Now consider the sequence of events that results in a deadlock.
╔═══╦════════════════════════════════════════╦══════════════════════════════════════╗
║ ║ Main Thread ║ Other Thread ║
╠═══╬════════════════════════════════════════╬══════════════════════════════════════╣
║ 1 ║ Python Command acquires GIL ║ Work started ║
║ 2 ║ Computation requested ║ MyCallback runs and acquires MyMutex ║
║ 3 ║ ║ MyCallback now waits for GIL ║
║ 4 ║ MyCallback runs and waits for MyMutex ║ waiting for GIL ║
╚═══╩════════════════════════════════════════╩══════════════════════════════════════╝
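The same lock-ordering cycle can be sketched in pure Python. Since Python code cannot take or drop the real GIL, an ordinary `threading.Lock` named `gil_stand_in` plays its role here; that substitution, the `run_scenario` name, and the timeout used to escape the cycle are all assumptions made for illustration and are not from the book.

```python
import threading


def run_scenario(timeout=0.1):
    gil_stand_in = threading.Lock()  # plays the role of the GIL
    my_mutex = threading.Lock()      # plays the role of MyMutex
    other_holds_mutex = threading.Event()
    log = []

    def other_thread():
        with my_mutex:                    # step 2: callback takes MyMutex
            other_holds_mutex.set()
            with gil_stand_in:            # step 3: blocks while main holds it
                log.append("other: ran callback body")

    with gil_stand_in:                    # step 1: main thread holds the "GIL"
        worker = threading.Thread(target=other_thread)
        worker.start()
        other_holds_mutex.wait()
        # Step 4: a plain blocking acquire here would wait forever.
        got_mutex = my_mutex.acquire(timeout=timeout)
        if got_mutex:
            my_mutex.release()
        log.append(f"main: got MyMutex = {got_mutex}")
    # Leaving the with block releases gil_stand_in, so the other thread proceeds.
    worker.join()
    return got_mutex, log
```

The main thread never obtains `my_mutex` while it holds `gil_stand_in`, which is the cycle shown in the table. The timeout only makes the sketch terminate; the usual remedies are to release one lock before waiting for the other, or to always acquire the two locks in the same order.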

Why the Global Interpreter Lock?

What is exactly the function of Python's Global Interpreter Lock?
Do other languages that are compiled to bytecode employ a similar mechanism?

In general, for any thread safety problem you will need to protect your internal data structures with locks.
This can be done with various levels of granularity.
You can use fine-grained locking, where every separate structure has its own lock.
You can use coarse-grained locking where one lock protects everything (the GIL approach).
There are various pros and cons of each method. Fine-grained locking allows greater parallelism - two threads can execute in parallel when they don't share any resources. However there is a much larger administrative overhead. For every line of code, you may need to acquire and release several locks.
The coarse-grained approach is the opposite. Two threads can't run at the same time, but an individual thread will run faster because it's not doing so much bookkeeping. Ultimately it comes down to a tradeoff between single-threaded speed and parallelism.
There have been a few attempts to remove the GIL in Python, but the extra overhead for single-threaded machines was generally too large. Some cases can actually be slower even on multi-processor machines due to lock contention.
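The two granularities can be contrasted with a toy example. The `CoarseBank`, `FineBank`, and `hammer` names are hypothetical, and since CPython's own GIL still serializes the bytecode, this sketch shows the structure and bookkeeping cost of each approach rather than any real speedup.

```python
import threading


class CoarseBank:
    """One lock guards every account (the GIL-style approach)."""

    def __init__(self, balances):
        self.balances = dict(balances)
        self._lock = threading.Lock()

    def transfer(self, src, dst, amount):
        with self._lock:
            self.balances[src] -= amount
            self.balances[dst] += amount


class FineBank:
    """Each account has its own lock; disjoint transfers need not contend."""

    def __init__(self, balances):
        self.balances = dict(balances)
        self._locks = {name: threading.Lock() for name in balances}

    def transfer(self, src, dst, amount):
        if src == dst:
            return  # nothing to move, and avoids locking one lock twice
        # Always lock in sorted order so two transfers cannot deadlock.
        first, second = sorted((src, dst))
        with self._locks[first], self._locks[second]:
            self.balances[src] -= amount
            self.balances[dst] += amount


def hammer(bank, transfers, repeats=1000):
    """Run each (src, dst) transfer of 1 unit `repeats` times on its own thread."""

    def work(src, dst):
        for _ in range(repeats):
            bank.transfer(src, dst, 1)

    threads = [threading.Thread(target=work, args=pair) for pair in transfers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bank.balances
```

The fine-grained version acquires two locks per transfer and needs an ordering rule to stay deadlock-free, which is the "administrative overhead" described above; the coarse version needs neither.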
Do other languages that are compiled to bytecode employ a similar mechanism?
It varies, and it probably shouldn't be considered a language property so much as an implementation property.
For instance, there are Python implementations such as Jython and IronPython which use the threading approach of their underlying VM, rather than a GIL approach. Additionally, the next version of Ruby looks to be moving towards introducing a GIL.

The following is from the official Python/C API Reference Manual:
The Python interpreter is not fully
thread safe. In order to support
multi-threaded Python programs,
there's a global lock that must be
held by the current thread before it
can safely access Python objects.
Without the lock, even the simplest
operations could cause problems in a
multi-threaded program: for example,
when two threads simultaneously
increment the reference count of the
same object, the reference count could
end up being incremented only once
instead of twice.
Therefore, the rule exists that only
the thread that has acquired the
global interpreter lock may operate on
Python objects or call Python/C API
functions. In order to support
multi-threaded Python programs, the
interpreter regularly releases and
reacquires the lock -- by default,
every 100 bytecode instructions (this
can be changed with
sys.setcheckinterval()). The lock is
also released and reacquired around
potentially blocking I/O operations
like reading or writing a file, so
that other threads can run while the
thread that requests the I/O is
waiting for the I/O operation to
complete.
I think it sums up the issue pretty well.
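The quoted passage describes older CPython releases. Since Python 3.2 the interpreter asks the running thread to give up the GIL after a time interval rather than an instruction count, and the setting is exposed through `sys.getswitchinterval()` and `sys.setswitchinterval()`. A minimal sketch (the helper names are made up for illustration):

```python
import sys


def describe_switch_interval():
    """Report how often CPython asks the running thread to release the GIL."""
    interval = sys.getswitchinterval()
    return f"thread switch requested every {interval * 1000:.1f} ms"


def with_switch_interval(seconds, func):
    """Call func() with a temporary switch interval, then restore the old one."""
    previous = sys.getswitchinterval()
    sys.setswitchinterval(seconds)
    try:
        return func()
    finally:
        sys.setswitchinterval(previous)


if __name__ == "__main__":
    print(describe_switch_interval())
```

Changing the interval affects the whole interpreter, which is why the helper restores the previous value in a `finally` block.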

The global interpreter lock is a big mutex-type lock that protects reference counters from getting hosed. If you are writing pure Python code, this all happens behind the scenes, but if you are embedding Python into C, then you might have to explicitly take/release the lock.
This mechanism is not related to Python being compiled to bytecode. It's not needed for Java. In fact, it's not even needed for Jython (python compiled to jvm).
see also this question

Python, like perl 5, was not designed from the ground up to be thread safe. Threads were grafted on after the fact, so the global interpreter lock is used to maintain mutual exclusion to where only one thread is executing code at a given time in the bowels of the interpreter.
Individual Python threads are cooperatively multitasked by the interpreter itself by cycling the lock every so often.
Grabbing the lock yourself is needed when you are talking to Python from C when other Python threads are active to 'opt in' to this protocol and make sure that nothing unsafe happens behind your back.
Other systems that have a single-threaded heritage and later evolved into multithreaded systems often have some mechanism of this sort. For instance, the Linux kernel has the "Big Kernel Lock" from its early SMP days. Gradually over time, as multi-threading performance becomes an issue, there is a tendency to try to break these sorts of locks up into smaller pieces or replace them with lock-free algorithms and data structures where possible to maximize throughput.

Regarding your second question, not all scripting languages use this, but it only makes them less powerful. For instance, the threads in Ruby are green and not native.
In Python, the threads are native and the GIL only prevents them from running on different cores.
In Perl, the threads are even worse. They just copy the whole interpreter, and are far from being as usable as in Python.

Maybe this article by the BDFL will help.
