Python GIL and multithreading

I would like to split my single-threaded application into a number of worker threads. Just one question: what about the performance of doing this? If the GIL prevents Python from executing more than one thread at a time, will I gain anything?
Another point (from a C/C++ point of view): as far as I know, each thread can only be executed exclusively anyway, so I have the same limitation at a level below the Python interpreter.
Summary: will Python threads be less efficient than 'native' threads when it comes to task switching?

Don't worry about the GIL. Depending on the kinds of things your program does (calculation vs. I/O) you will have different performance characteristics. If your program is I/O bound then you probably won't notice the GIL at all.
Another approach is to use the multiprocessing module where each process runs in its own OS process with its own Python runtime. You can take full advantage of multiple cores with this approach, and it's usually safer because you don't have to worry about synchronising access to shared memory.
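As a minimal sketch of the multiprocessing approach described above, the following spreads a CPU-bound task over a pool of worker processes. The function count_primes and the list of limits are made up for illustration; any picklable function and arguments would do.

```python
from multiprocessing import Pool

def count_primes(limit):
    """CPU-bound toy task (hypothetical): count primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

if __name__ == "__main__":
    limits = [20_000, 30_000, 40_000, 50_000]
    # Each worker is a separate OS process with its own interpreter and its own GIL.
    with Pool(processes=4) as pool:
        results = pool.map(count_primes, limits)
    print(dict(zip(limits, results)))
```

The `if __name__ == "__main__":` guard matters on platforms that start workers by re-importing the main module (Windows, for example).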

Related

Multiprocessing multithreading GIL?

For several days I have been doing a lot of research about multiprocessing and multithreading in Python, and I'm very confused about many things. I often see people describe the GIL as something that doesn't allow Python code to execute on several CPU cores, but when I write a program that creates many threads, I can see that several CPU cores are active.
First question: what really is the GIL, and how does it work? I imagine that when a process creates too many threads, the OS distributes the tasks across multiple CPUs. Am I right?
Another thing: I want to take advantage of my CPUs. I am thinking of creating as many processes as there are CPU cores and, in each process, as many threads as there are CPU cores. Am I on the right track?

To start with, the GIL only ensures that one thread at a time executes CPython bytecode. It does not care which CPU core runs the instruction. That is the job of the OS kernel.
So going over your questions:
The GIL is just a piece of code. The CPython virtual machine first compiles source code to CPython bytecode, but its normal job is to interpret that bytecode. The GIL is a lock inside the interpreter that ensures only one thread executes bytecode at a time, no matter how many threads are running. In other words, only one thread holds the GIL at any given point in time (and the interpreter keeps releasing the GIL so that other threads are not starved).
Now coming to your actual confusion. You mention that when you run a program with many threads, you can see multiple (maybe all) CPU cores firing up. So I did some experimentation and found that your findings are right (which is obvious), but the behaviour is similar in a non-threaded version too.
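import time
from multiprocessing.pool import ThreadPool

def do_nothing(i):
    time.sleep(0.0001)  # sleeping releases the GIL
    return i * 2

# Multithreaded version: 20 worker threads
ThreadPool(20).map(do_nothing, range(10000))

# Single-threaded version, using the same do_nothing
[do_nothing(i) for i in range(10000)]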
The first one is multithreaded and the second one is not. When you compare the CPU usage of both programs, you will find that in both cases multiple CPU cores fire up. So what you noticed, although right, does not have much to do with the GIL or threading. CPU usage going up on multiple cores is simply because the OS kernel distributes the execution of code to different cores based on availability.
Your last question is more of an experimental thing as different programs have different CPU/io usage. You just have to be aware of the cost of creation of a thread and a process and the working of GIL & PVM and optimize the number of threads and processes to get the maximum perf out.
You can go through this talk by David Beazley to understand how multithreading can make your code perform worse (or better).

There are answers about what the Global Interpreter Lock (GIL) is here. Buried among the answers is mention of Python "bytecode", which is central to the issue. When your program is compiled, the output is bytecode, i.e. low-level computer instructions for a fictitious "Python" computer, that gets interpreted by the Python interpreter. When the interpreter is executing a bytecode, it serializes execution by acquiring the Global Interpreter Lock. This means that two threads cannot be executing bytecode concurrently on two different cores. This also means that true multi-threading is not implemented. But does this mean that there is no reason to use threading? No! Here are a couple of situations where threading is still useful:
For certain operations the interpreter will release the GIL, e.g. when doing I/O. So consider as an example the case where you want to fetch a lot of URLs from different websites. Most of the time is spent waiting for a response to be returned once the request is made, and this waiting can be overlapped even if formulating the requests has to be done serially.
Many Python functions and modules are implemented in the C language and are free to release the GIL after being called if their processing requirements allow it. The numpy module is one such highly optimized package.
Consequently, threading is best used when the tasks are not CPU-intensive, i.e. they do a lot of waiting for I/O to complete, or they do a lot of sleeping, etc.
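To make the I/O case concrete, here is a small sketch in which time.sleep stands in for waiting on a network response (like blocking I/O, it releases the GIL). The function fake_fetch and the example.com URLs are placeholders, not a real fetching API.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url):
    """Stand-in for a network request: time.sleep releases the GIL, like blocking I/O."""
    time.sleep(0.2)
    return f"response from {url}"

def fetch_all(urls, workers=5):
    """Issue all 'requests' from a thread pool so their waits overlap."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(fake_fetch, urls))

urls = [f"https://example.com/page{i}" for i in range(5)]  # placeholder URLs
start = time.perf_counter()
responses = fetch_all(urls)
elapsed = time.perf_counter() - start
print(f"{len(responses)} responses in {elapsed:.2f}s")
```

Run serially, the five simulated requests would need about one second in total; with five threads the waits overlap, so the elapsed time should be close to a single 0.2 s wait.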

Python threads all executing on a single core

I have a Python program that spawns many threads, runs 4 at a time, and each performs an expensive operation. Pseudocode:
for obj in items:
    t = Thread(target=process, args=(obj,))  # args must be a tuple, hence the trailing comma
    # if fewer than 4 threads are currently running, t.start(). Otherwise, add t to queue
But when the program is run, Activity Monitor in OS X shows that 1 of the 4 logical cores is at 100% and the others are at nearly 0. Obviously I can't force the OS to do anything but I've never had to pay attention to performance in multi-threaded code like this before so I was wondering if I'm just missing or misunderstanding something.
Thanks.

Note that in many cases (and virtually all cases where your "expensive operation" is a calculation implemented in Python), multiple threads will not actually run concurrently due to Python's Global Interpreter Lock (GIL).
The GIL is an interpreter-level lock. This lock prevents execution of multiple threads at once in the Python interpreter. Each thread that wants to run must wait for the GIL to be released by the other thread, which means your multi-threaded Python application is essentially single threaded, right? Yes. Not exactly. Sort of.
CPython uses what’s called “operating system” threads under the covers, which is to say each time a request to make a new thread is made, the interpreter actually calls into the operating system’s libraries and kernel to generate a new thread. This is the same as Java, for example. So in memory you really do have multiple threads and normally the operating system controls which thread is scheduled to run. On a multiple processor machine, this means you could have many threads spread across multiple processors, all happily chugging away doing work.
However, while CPython does use operating system threads (in theory allowing multiple threads to execute within the interpreter simultaneously), the interpreter also forces the GIL to be acquired by a thread before it can access the interpreter and stack and can modify Python objects in memory all willy-nilly. The latter point is why the GIL exists: The GIL prevents simultaneous access to Python objects by multiple threads. But this does not save you (as illustrated by the Bank example) from being a lock-sensitive creature; you don’t get a free ride. The GIL is there to protect the interpreters memory, not your sanity.
See the Global Interpreter Lock section of Jesse Noller's post for more details.
To get around this problem, check out Python's multiprocessing module.
multiple processes (with judicious use of IPC) are[...] a much better approach to writing apps for multi-CPU boxes than threads.
-- Guido van Rossum (creator of Python)
Edit based on a comment from #spinkus:
If Python can't run multiple threads simultaneously, then why have threading at all?
Threads can still be very useful in Python when doing simultaneous operations that do not need to modify the interpreter's state. This includes many (most?) long-running function calls that are not in-Python calculations, such as I/O (file access or network requests) and calculations on NumPy arrays. These operations release the GIL while waiting for a result, allowing the program to continue executing. Then, once the result is received, the thread must re-acquire the GIL in order to use that result in "Python-land".
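One standard-library case of a C-implemented call that can drop the GIL is hashing: the hashlib documentation states that the GIL is released while hashing data larger than 2047 bytes. The sketch below hashes a few large buffers from a thread pool; the payloads and the helper sha256_hex are invented for the example, and how much speedup you actually see depends on your machine and Python build.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def sha256_hex(data):
    # For large inputs, CPython's hashlib releases the GIL while the C code hashes.
    return hashlib.sha256(data).hexdigest()

# Hypothetical payloads: four 8 MiB buffers with different fill bytes.
payloads = [bytes([i]) * (8 * 1024 * 1024) for i in range(4)]

with ThreadPoolExecutor(max_workers=4) as executor:
    threaded = list(executor.map(sha256_hex, payloads))

# Same work done serially, to confirm that the threaded results match.
serial = [sha256_hex(p) for p in payloads]
print(threaded == serial)
```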

Python has a Global Interpreter Lock, which can prevent threads of interpreted code from being processed concurrently.
http://en.wikipedia.org/wiki/Global_Interpreter_Lock
http://wiki.python.org/moin/GlobalInterpreterLock
For ways to get around this, try the multiprocessing module, as advised here:
Does running separate python processes avoid the GIL?

AFAIK, in CPython the Global Interpreter Lock means that no more than one block of Python code can be running at any one time. Although this does not really affect anything on a single-processor/single-core machine, on a multicore machine it means that effectively only one thread is running at any one time, leaving the other cores idle as far as your Python code is concerned.
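A rough way to observe this is to time the same pure-Python loop run twice in a row and then in two threads. This is only a sketch: countdown and the workload size N are arbitrary, and no particular timings are claimed. On a standard CPython build with the GIL, the threaded run would be expected to take about as long as the serial one, or longer.

```python
import time
from threading import Thread

def countdown(n):
    """Pure-Python CPU-bound loop; the running thread holds the GIL while it executes."""
    while n > 0:
        n -= 1

def timed(fn):
    """Return the wall-clock seconds taken by fn()."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

N = 2_000_000  # arbitrary workload size

def run_serial():
    countdown(N)
    countdown(N)

def run_threaded():
    threads = [Thread(target=countdown, args=(N,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

serial_time = timed(run_serial)
threaded_time = timed(run_threaded)
print(f"serial: {serial_time:.3f}s  threaded: {threaded_time:.3f}s")
```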

Should I use fork or threads?

In my script, I have a function foo which uses pynotify to notify the user about something repeatedly after a time interval, say 15 minutes.
def foo():
    while True:
        """Does something"""
        time.sleep(900)
My main script has to interact with the user and do everything else, so I can't just call foo() directly.
What's the better way of doing it, and why?
Using fork or threads?

I won't tell you which one to use, but here are some of the advantages of each:
Threads can start more quickly than processes, and threads use fewer operating system resources than processes, including memory, file handles, etc. Threads also give you the option of communicating through shared variables (although many would say this is more of a disadvantage than an advantage - See below).
Processes each have their own separate memory and variables, which means that processes generally communicate by sending messages to each other. This is much easier to do correctly than having threads communicate via shared memory. Processes can also run truly concurrently, so that if you have multiple CPU cores, you can keep all of them busy using processes. In Python*, the global interpreter lock prevents threads from making much use of more than a single core.
* - That is, CPython, which is the implementation of Python that you get if you go to http://python.org and download Python. Other Python implementations (such as Jython) do not necessarily prohibit Python from running threads on multiple CPUs simultaneously. Thanks to #EOL for the clarification.

For these kinds of problems, neither threads nor forked processes seem the right approach. If all you want to do is notify the user of something once every 15 minutes, why not use an event loop like GLib's or Twisted's reactor? This allows you to schedule operations that should run once in a while, and get on with the rest of your program.
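The event-loop idea can be sketched with the standard library's asyncio, used here purely as a stand-in for GLib or Twisted (it is not what the answer names). The coroutine names, the short interval and the log list are all invented for the example; a real notifier would use 900 seconds and call the notification library instead of appending to a list.

```python
import asyncio

async def remind(interval, count, log):
    """Append a reminder to `log` every `interval` seconds, `count` times."""
    for _ in range(count):
        await asyncio.sleep(interval)  # yields to the event loop while waiting
        log.append("reminder")

async def main():
    log = []
    # Schedule the periodic job; the loop interleaves it with the rest of the program.
    task = asyncio.create_task(remind(0.05, 3, log))
    await asyncio.sleep(0.01)  # placeholder for the program's other work
    log.append("main work")
    await task
    return log

log = asyncio.run(main())
print(log)
```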

Using multiple processes lets you exploit multiple CPU cores at the same time, while, in CPython, using threads doesn't (threads take turns using a single CPU core) -- so, if you have CPU intensive work and absolutely want to use threads, you should consider Jython or IronPython; with CPython, this consideration is often enough to sway the choice towards the multiprocessing module and away from the threading one (they offer pretty similar interfaces, because multiprocessing was designed to be easily put in place in lieu of threading).
Net of this crucial consideration, threads might often be a better choice (performance-wise) on Windows (where making a new process is a heavy task), but less often on Unix variants (Linux, BSD versions, OpenSolaris, MacOSX, ...), since making a new process is faster there (but if you're using IronPython or Jython, you should check, on the platforms you care about, that this still applies in the virtual machines in question -- CLR with either .NET or Mono for IronPython, your JVM of choice for Jython).

Processes are much simpler. Just turn them loose and let the OS handle it.
Also, processes are often much more efficient. Processes do not share a common pool of I/O resources; they are completely independent.
Python's subprocess.Popen handles everything.

If by fork you mean os.fork then I would avoid using that. It is not cross platform and too low level - you would need to implement communication between the processes yourself.
If you want to use a separate process then use either the subprocess module or if you are on Python 2.6 or later the new multiprocessing module. This has a very similar API to the threading module, so you could start off using threads and then easily switch to processes, or vice-versa.
For what you want to do I think I would use threads, unless """does something""" is CPU intensive and you want to take advantage of multiple cores, which I doubt in this particular case.
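A thread-based version of the periodic notifier might look like the sketch below. It assumes a notify callable in place of the pynotify call, and uses a threading.Event so that the loop can be stopped cleanly; the 0.05 second interval is only there to keep the demo short (900 would give 15 minutes).

```python
import threading
import time

def notify_every(interval, stop_event, notify):
    """Call notify() every `interval` seconds until stop_event is set."""
    # Event.wait returns False on timeout and True once the event has been set.
    while not stop_event.wait(interval):
        notify()

stop = threading.Event()
messages = []
worker = threading.Thread(
    target=notify_every,
    args=(0.05, stop, lambda: messages.append("reminder")),
    daemon=True,  # do not keep the program alive just for the notifier
)
worker.start()

# The main thread stays free to interact with the user; here it just waits briefly.
time.sleep(0.3)
stop.set()
worker.join()
print(len(messages), "notifications sent")
```

Waiting on the event rather than calling time.sleep means the thread reacts to a stop request immediately instead of finishing a 15 minute sleep first.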

What is the global interpreter lock (GIL) in CPython?

What is a global interpreter lock and why is it an issue?
A lot of noise has been made around removing the GIL from Python, and I'd like to understand why that is so important. I have never written a compiler nor an interpreter myself, so don't be frugal with details, I'll probably need them to understand.

Python's GIL is intended to serialize access to interpreter internals from different threads. On multi-core systems, it means that multiple threads can't effectively make use of multiple cores. (If the GIL didn't lead to this problem, most people wouldn't care about the GIL - it's only being raised as an issue because of the increasing prevalence of multi-core systems.) If you want to understand it in detail, you can view this video or look at this set of slides. It might be too much information, but then you did ask for details :-)
Note that Python's GIL is only really an issue for CPython, the reference implementation. Jython and IronPython don't have a GIL. As a Python developer, you don't generally come across the GIL unless you're writing a C extension. C extension writers need to release the GIL when their extensions do blocking I/O, so that other threads in the Python process get a chance to run.

Suppose you have multiple threads which don't really touch each other's data. Those should execute as independently as possible. If you have a "global lock" which you need to acquire in order to (say) call a function, that can end up as a bottleneck. You can wind up not getting much benefit from having multiple threads in the first place.
To put it into a real world analogy: imagine 100 developers working at a company with only a single coffee mug. Most of the developers would spend their time waiting for coffee instead of coding.
None of this is Python-specific - I don't know the details of what Python needed a GIL for in the first place. However, hopefully it's given you a better idea of the general concept.

Let's first understand what the python GIL provides:
Any operation/instruction is executed in the interpreter. GIL ensures that interpreter is held by a single thread at a particular instant of time. And your python program with multiple threads works in a single interpreter. At any particular instant of time, this interpreter is held by a single thread. It means that only the thread which is holding the interpreter is running at any instant of time.
Now why is that an issue:
Your machine could have multiple cores/processors, and multiple cores allow multiple threads to execute simultaneously, i.e. multiple threads could execute at any particular instant of time.
But since the interpreter is held by a single thread, other threads are not doing anything even though they have access to a core. So, you are not getting any advantage provided by multiple cores because at any instant only a single core, which is the core being used by the thread currently holding the interpreter, is being used. So, your program will take as long to execute as if it were a single threaded program.
However, potentially blocking or long-running operations, such as I/O, image processing, and NumPy number crunching, happen outside the GIL. Taken from here. So for such operations, a multithreaded operation will still be faster than a single threaded operation despite the presence of GIL. So, GIL is not always a bottleneck.
Edit: The GIL is an implementation detail of CPython. IronPython and Jython don't have a GIL, so a truly multithreaded program should be possible in them, though I have never used PyPy or Jython and am not sure of this.

Python 3.7 documentation
I would also like to highlight the following quote from the Python threading documentation:
CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing or concurrent.futures.ProcessPoolExecutor. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
This links to the Glossary entry for global interpreter lock which explains that the GIL implies that threaded parallelism in Python is unsuitable for CPU bound tasks:
The mechanism used by the CPython interpreter to assure that only one thread executes Python bytecode at a time. This simplifies the CPython implementation by making the object model (including critical built-in types such as dict) implicitly safe against concurrent access. Locking the entire interpreter makes it easier for the interpreter to be multi-threaded, at the expense of much of the parallelism afforded by multi-processor machines.
However, some extension modules, either standard or third-party, are designed so as to release the GIL when doing computationally-intensive tasks such as compression or hashing. Also, the GIL is always released when doing I/O.
Past efforts to create a “free-threaded” interpreter (one which locks shared data at a much finer granularity) have not been successful because performance suffered in the common single-processor case. It is believed that overcoming this performance issue would make the implementation much more complicated and therefore costlier to maintain.
This quote also implies that dicts and thus variable assignment are also thread safe as a CPython implementation detail:
Is Python variable assignment atomic?
Thread Safety in Python's dictionary
Next, the docs for the multiprocessing package explain how it overcomes the GIL by spawning processes while exposing an interface similar to that of threading:
multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. It runs on both Unix and Windows.
And the docs for concurrent.futures.ProcessPoolExecutor explain that it uses multiprocessing as a backend:
The ProcessPoolExecutor class is an Executor subclass that uses a pool of processes to execute calls asynchronously. ProcessPoolExecutor uses the multiprocessing module, which allows it to side-step the Global Interpreter Lock but also means that only picklable objects can be executed and returned.
which should be contrasted with the other Executor subclass, ThreadPoolExecutor, which uses threads instead of processes:
ThreadPoolExecutor is an Executor subclass that uses a pool of threads to execute calls asynchronously.
from which we conclude that ThreadPoolExecutor is only suitable for I/O bound tasks, while ProcessPoolExecutor can also handle CPU bound tasks.
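Because both classes implement the same Executor interface, switching between them can be a one-argument change. The sketch below assumes a trivial square function and a helper named run_with, both invented for illustration.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def square(n):
    return n * n

def run_with(executor_cls, values, workers=2):
    """Run square() over `values` using the given Executor subclass."""
    with executor_cls(max_workers=workers) as executor:
        return list(executor.map(square, values))

if __name__ == "__main__":
    values = range(6)
    print(run_with(ThreadPoolExecutor, values))
    # Same call, different backend: the function, arguments and results must be picklable.
    print(run_with(ProcessPoolExecutor, values))
```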
Process vs thread experiments
At Multiprocessing vs Threading Python I've done an experimental analysis of process vs threads in Python.
In other languages
The concept seems to exist outside of Python as well, applying just as well to Ruby for example: https://en.wikipedia.org/wiki/Global_interpreter_lock
It mentions the advantages:
increased speed of single-threaded programs (no necessity to acquire or release locks on all data structures separately),
easy integration of C libraries that usually are not thread-safe,
ease of implementation (having a single GIL is much simpler to implement than a lock-free interpreter or one using fine-grained locks).
but the JVM seems to do just fine without the GIL, so I wonder if it is worth it. The following question asks why the GIL exists in the first place: Why the Global Interpreter Lock?

Python doesn't allow multi-threading in the truest sense of the word. It has a multi-threading package but if you want to multi-thread to speed your code up, then it's usually not a good idea to use it. Python has a construct called the Global Interpreter Lock (GIL).
https://www.youtube.com/watch?v=ph374fJqFPE
The GIL makes sure that only one of your 'threads' can execute at any one time. A thread acquires the GIL, does a little work, then passes the GIL onto the next thread. This happens very quickly so to the human eye it may seem like your threads are executing in parallel, but they are really just taking turns using the same CPU core. All this GIL passing adds overhead to execution. This means that if you want to make your code run faster then using the threading package often isn't a good idea.
There are reasons to use Python's threading package. If you want to run some things simultaneously, and efficiency is not a concern, then it's totally fine and convenient. Or if you are running code that needs to wait for something (like some I/O), then it could make a lot of sense. But the threading library won't let you use extra CPU cores.
Multi-threading can be outsourced to the operating system (by doing multi-processing), some external application that calls your Python code (eg, Spark or Hadoop), or some code that your Python code calls (eg: you could have your Python code call a C function that does the expensive multi-threaded stuff).

Whenever two threads have access to the same variable you have a problem.
In C++, for instance, the way to avoid the problem is to define a mutex lock that prevents two threads from, say, entering the setter of an object at the same time.
Multithreading is possible in Python, but two threads cannot execute at the same time at a granularity finer than one Python instruction.
The running thread holds a global lock called the GIL.
This means that if you write multithreaded code in order to take advantage of your multicore processor, your performance won't improve.
The usual workaround is to go multiprocess.
Note that it is possible to release the GIL if you're inside a method you wrote in C, for instance.
The use of a GIL is not inherent to Python but to some of its interpreters, including the most common one, CPython.
(#edited, see comment)
The GIL issue is still valid in Python 3000.
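The shared-variable problem mentioned at the start of this answer still applies to Python code: the GIL protects the interpreter's internals, not your own read-modify-write sequences. A minimal sketch with an explicit mutex follows; the counter and add_many are invented for the example.

```python
import threading

counter = 0
counter_lock = threading.Lock()

def add_many(times):
    global counter
    for _ in range(times):
        # `counter += 1` is a read-modify-write; the GIL alone does not make it atomic.
        with counter_lock:
            counter += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)
```

With the lock in place, every increment is applied, so the final value is the number of threads times the number of iterations.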

Why Python (CPython and others) uses the GIL
From http://wiki.python.org/moin/GlobalInterpreterLock
In CPython, the global interpreter lock, or GIL, is a mutex that prevents multiple native threads from executing Python bytecodes at once. This lock is necessary mainly because CPython's memory management is not thread-safe.
How to remove it from Python?
Like Lua, Python could perhaps start multiple VMs, but it doesn't do that; I guess there are some other reasons.
In NumPy and some other Python extension libraries, releasing the GIL to other threads can sometimes boost the efficiency of the whole program.

I want to share an example from the book Multithreading for Visual Effects. Here is a classic deadlock situation:
static void MyCallback(const Context &context){
    Auto<Lock> lock(GetMyMutexFromContext(context));
    ...
    EvalMyPythonString(str); // A function that takes the GIL
    ...
}
Now consider the following sequence of events, which results in a deadlock.
╔═══╦════════════════════════════════════════╦══════════════════════════════════════╗
║   ║ Main Thread                            ║ Other Thread                         ║
╠═══╬════════════════════════════════════════╬══════════════════════════════════════╣
║ 1 ║ Python Command acquires GIL            ║ Work started                         ║
║ 2 ║ Computation requested                  ║ MyCallback runs and acquires MyMutex ║
║ 3 ║                                        ║ MyCallback now waits for GIL         ║
║ 4 ║ MyCallback runs and waits for MyMutex  ║ waiting for GIL                      ║
╚═══╩════════════════════════════════════════╩══════════════════════════════════════╝
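The same lock-ordering pattern can be reproduced in plain Python. In this sketch two ordinary threading.Lock objects play the roles of the GIL and MyMutex (gil_like is only a stand-in, not the real GIL), and the second acquire uses a timeout so that the demo terminates instead of hanging.

```python
import threading

gil_like = threading.Lock()   # stand-in for the GIL (not the real one)
my_mutex = threading.Lock()   # stand-in for MyMutex
sync = threading.Barrier(2)   # lines the two threads up at each step
got_second = {}

def worker(name, first, second):
    with first:
        sync.wait()  # both threads now hold their first lock
        # Without the timeout, this acquire would block forever: a deadlock.
        acquired = second.acquire(timeout=0.2)
        got_second[name] = acquired
        sync.wait()  # keep holding `first` until both attempts have finished
        if acquired:
            second.release()

main_like = threading.Thread(target=worker, args=("main", gil_like, my_mutex))
other = threading.Thread(target=worker, args=("other", my_mutex, gil_like))
main_like.start()
other.start()
main_like.join()
other.join()
print(got_second)
```

Neither thread can obtain its second lock while the other still holds it, which is exactly the situation in rows 3 and 4 of the table. Acquiring the locks in the same order in both threads would avoid it.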

Why the Global Interpreter Lock?

What is exactly the function of Python's Global Interpreter Lock?
Do other languages that are compiled to bytecode employ a similar mechanism?

In general, for any thread safety problem you will need to protect your internal data structures with locks.
This can be done with various levels of granularity.
You can use fine-grained locking, where every separate structure has its own lock.
You can use coarse-grained locking where one lock protects everything (the GIL approach).
There are various pros and cons of each method. Fine-grained locking allows greater parallelism: two threads can execute in parallel when they don't share any resources. However, there is a much larger administrative overhead. For every line of code, you may need to acquire and release several locks.
The coarse-grained approach is the opposite. Two threads can't run at the same time, but an individual thread will run faster because it's not doing so much bookkeeping. Ultimately it comes down to a tradeoff between single-threaded speed and parallelism.
There have been a few attempts to remove the GIL in Python, but the extra overhead for single-threaded machines was generally too large. Some cases can actually be slower even on multi-processor machines due to lock contention.
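The two granularities can be sketched with a toy set of named counters. The class and function names below are invented, and the example only shows the structure of each approach: in CPython the GIL still serializes the Python code itself, so this is not a benchmark.

```python
import threading

class CoarseCounters:
    """One lock protects every counter (the GIL-style approach)."""

    def __init__(self, names):
        self._lock = threading.Lock()
        self.counts = {name: 0 for name in names}

    def increment(self, name):
        with self._lock:  # every caller contends for the same lock
            self.counts[name] += 1

class FineCounters:
    """Each counter has its own lock, so unrelated updates never contend."""

    def __init__(self, names):
        self._locks = {name: threading.Lock() for name in names}
        self.counts = {name: 0 for name in names}

    def increment(self, name):
        with self._locks[name]:  # more locks to create, store and manage
            self.counts[name] += 1

def hammer(counters, names, times=5_000):
    """Update each named counter from two threads, `times` increments per thread."""
    def work(name):
        for _ in range(times):
            counters.increment(name)
    threads = [threading.Thread(target=work, args=(n,)) for n in names for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counters.counts

names = ["a", "b", "c"]
print(hammer(CoarseCounters(names), names))
print(hammer(FineCounters(names), names))
```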
Do other languages that are compiled to bytecode employ a similar mechanism?
It varies, and it probably shouldn't be considered a language property so much as an implementation property.
For instance, there are Python implementations such as Jython and IronPython which use the threading approach of their underlying VM, rather than a GIL approach. Additionally, the next version of Ruby looks to be moving towards introducing a GIL.

The following is from the official Python/C API Reference Manual:
The Python interpreter is not fully thread safe. In order to support multi-threaded Python programs, there's a global lock that must be held by the current thread before it can safely access Python objects. Without the lock, even the simplest operations could cause problems in a multi-threaded program: for example, when two threads simultaneously increment the reference count of the same object, the reference count could end up being incremented only once instead of twice.
Therefore, the rule exists that only the thread that has acquired the global interpreter lock may operate on Python objects or call Python/C API functions. In order to support multi-threaded Python programs, the interpreter regularly releases and reacquires the lock -- by default, every 100 bytecode instructions (this can be changed with sys.setcheckinterval()). The lock is also released and reacquired around potentially blocking I/O operations like reading or writing a file, so that other threads can run while the thread that requests the I/O is waiting for the I/O operation to complete.
I think it sums up the issue pretty well.
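The quoted passage describes older interpreters. Since Python 3.2 the switch is driven by a time interval rather than a bytecode count, and sys.setcheckinterval() was deprecated and later removed in 3.9. A short sketch of the current knobs, assuming a CPython 3.2+ interpreter:

```python
import sys

# Python 3.2+ switches threads on a time interval rather than a bytecode count.
interval = sys.getswitchinterval()
print(f"switch interval: {interval} s")

# Ask the running thread to offer up the GIL more often, then restore the old value.
sys.setswitchinterval(0.001)
changed = sys.getswitchinterval()
sys.setswitchinterval(interval)
print(f"temporarily changed to: {changed} s")
```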

The global interpreter lock is a big mutex-type lock that protects reference counters from getting hosed. If you are writing pure Python code, this all happens behind the scenes, but if you are embedding Python into C, then you might have to explicitly take/release the lock.
This mechanism is not related to Python being compiled to bytecode. It's not needed for Java. In fact, it's not even needed for Jython (python compiled to jvm).
see also this question

Python, like perl 5, was not designed from the ground up to be thread safe. Threads were grafted on after the fact, so the global interpreter lock is used to maintain mutual exclusion to where only one thread is executing code at a given time in the bowels of the interpreter.
Individual Python threads are cooperatively multitasked by the interpreter itself by cycling the lock every so often.
Grabbing the lock yourself is needed when you are talking to Python from C when other Python threads are active to 'opt in' to this protocol and make sure that nothing unsafe happens behind your back.
Other systems that have a single-threaded heritage and later evolved into multithreaded systems often have some mechanism of this sort. For instance, the Linux kernel has the "Big Kernel Lock" from its early SMP days. Gradually, as multi-threading performance becomes an issue, there is a tendency to break these sorts of locks up into smaller pieces or replace them with lock-free algorithms and data structures where possible to maximize throughput.

Regarding your second question, not all scripting languages use this, but it only makes them less powerful. For instance, the threads in Ruby are green and not native.
In Python, the threads are native and the GIL only prevents them from running on different cores.
In Perl, the threads are even worse. They just copy the whole interpreter, and are far from being as usable as in Python.

Maybe this article by the BDFL will help.
