Does the presence of python GIL imply that in python multi threading the same operation is not so different from repeating it in a single thread?.
For example, If I need to upload two files, what is the advantage of doing them in two threads instead of uploading them one after another?.
I tried a big math operation in both ways. But they seem to take almost equal time to complete.
This seems to be unclear to me. Can someone help me on this?.
Thanks.
Python's threads get a slightly worse rap than they deserve. There are three (well, 2.5) cases where they actually get you benefits:
If non-Python code (e.g. a C library, the kernel, etc.) is running, other Python threads can continue executing. It's only pure Python code that can't run in two threads at once. So if you're doing disk or network I/O, threads can indeed buy you something, as most of the time is spent outside of Python itself.
The GIL is not actually part of Python, it's an implementation detail of CPython (the "reference" implementation that the core Python devs work on, and that you usually get if you just run "python" on your Linux box or something.
Jython, IronPython, and any other reimplementations of Python generally do not have a GIL, and multiple pure-Python threads can execute simultaneously.
The 0.5 case: Even if you're entirely pure-Python and see little or no performance benefit from threading, some problems are really convenient in terms of developer time and difficulty to solve with threads. This depends in part on the developer, too, of course.
It really depends on the library you're using. The GIL is meant to prevent Python objects and its internal data structures to be changed at the same time. If you're doing an upload, the library you use to do the actual upload might release the GIL while it's waiting for the actual HTTP request to complete (I would assume that is the case with the HTTP modules in the standard library, but I didn't check).
As a side note, if you really want to have things running in parallel, just use multiple processes. It will save you a lot of trouble and you'll end up with better code (more robust, more scalable, and most probably better structured).
It depends on the native code module that's executing. Native modules can release the GIL and then go off and do their own thing allowing another thread to lock the GIL. The GIL is normally held while code, both python and native, are operating on python objects. If you want more detail you'll probably need to go and read quite a bit about it. :)
See:
What is a global interpreter lock (GIL)? and Thread State and the Global Interpreter Lock
Multithreading is a concept where two are more tasks need be completed simultaneously, for example, I have word processor in this application there are N numbers of a parallel task have to work. Like listening to keyboard, formatting input text, sending a formatted text to display unit. In this context with sequential processing, it is time-consuming and one task has to wait till the next task completion. So we put these tasks in threads and simultaneously complete the task. Three threads are always up and waiting for the inputs to arrive, then take that input and produce the output simultaneously.
So multi-threading works faster if we have multi-core and processors. But in reality with single processors, threads will work one after the other, but we feel it's executing with greater speed, Actually, one instruction executes at a time and a processor can execute billions of instructions at a time. So the computer creates illusion that multi-task or thread working parallel. It just an illusion.
Related
So, since several days I do a lot of research about multiprocessing and multithreading on python and i'm very confused about many thing. So many times I see someone talking about GIL something that doesn't allow Python code to execute on several cpu cores, but when I code a program who create many threads I can see several cpu cores are active.
1st question: What's is really GIL? does it work? I think about something like when a process create too many thread the OS distributed task on multi cpu. Am I right?
Other thing, I want take advantage of my cpus. I think about something like create as much process as cpu core and on this each process create as much thread as cpu core. Am I on the right lane?
To start with, GIL only ensures that only one cpython bytecode instruction will run at any given time. It does not care about which CPU core runs the instruction. That is the job of the OS kernel.
So going over your questions:
GIL is just a piece of code. The CPython Virtual machine is the process which first compiles the code to Cpython bytecode but it's normal job is to interpret the CPython bytecode. GIL is a piece of code that ensures a single line of bytecode runs at a time no matter how many threads are running. Cpython Bytecode instructions is what constitutes the virtual machine stack. So in a way, GIL will ensure that only one thread holds the GIL at any given point of time. (also that it keeps releasing the GIL for other threads and not starve them.)
Now coming to your actual confusion. You mention that when you run a program with many threads, you can see multiple (may be all) CPU cores firing up. So I did some experimentation and found that your findings are right (which is obvious) but the behaviour is similar in a non threaded version too.
def do_nothing(i):
time.sleep(0.0001)
return i*2
ThreadPool(20).map(do_nothing, range(10000))
def do_nothing(i):
time.sleep(0.0001)
return i*2
[do_nothing(i) for i in range(10000)]
The first one in multithreaded and the second one is not. When you compare the CPU usage by by both the programs, you will find that in both the cases multiple CPU cores will fire up. So what you noticed, although right, has not much to do with GIL or threading. CPU usage going high in multiple cores is simply because OS kernel will distribute the execution of code to different cores based on availability.
Your last question is more of an experimental thing as different programs have different CPU/io usage. You just have to be aware of the cost of creation of a thread and a process and the working of GIL & PVM and optimize the number of threads and processes to get the maximum perf out.
You can go through this talk by David Beazley to understand how multithreading can make your code perform worse (or better).
There are answers about what the Global Interpreter Lock (GIL) is here. Buried among the answers is mention of Python "bytecode", which is central to the issue. When your program is compiled, the output is bytecode, i.e. low-level computer instructions for a fictitious "Python" computer, that gets interpreted by the Python interpreter. When the interpreter is executing a bytecode, it serializes execution by acquiring the Global Interpreter Lock. This means that two threads cannot be executing bytecode concurrently on two different cores. This also means that true multi-threading is not implemented. But does this mean that there is no reason to use threading? No! Here are a couple of situations where threading is still useful:
For certain operations the interpreter will release the GIL, i.e. when doing I/O. So consider as an example the case where you want to fetch a lot of URLs from different websites. Most of the time is spent waiting for a response to be returned once the request is made and this waiting can be overlapped even if formulating the requests has to be done serially.
Many Python functions and modules are implemented in the C language and are free to release the GIL after being called if their processing requirements allow it. The numpy module is one such highly optimized package.
Consequently, threading is best used when the tasks are not CPU-intensive, i.e. they do a lot of waiting for I/O to complete, or they do a lot of sleeping, etc.
I'm slightly confused about whether multithreading works in Python or not.
I know there has been a lot of questions about this and I've read many of them, but I'm still confused. I know from my own experience and have seen others post their own answers and examples here on StackOverflow that multithreading is indeed possible in Python. So why is it that everyone keep saying that Python is locked by the GIL and that only one thread can run at a time? It clearly does work. Or is there some distinction I'm not getting here?
Many posters/respondents also keep mentioning that threading is limited because it does not make use of multiple cores. But I would say they are still useful because they do work simultaneously and thus get the combined workload done faster. I mean why would there even be a Python thread module otherwise?
Update:
Thanks for all the answers so far. The way I understand it is that multithreading will only run in parallel for some IO tasks, but can only run one at a time for CPU-bound multiple core tasks.
I'm not entirely sure what this means for me in practical terms, so I'll just give an example of the kind of task I'd like to multithread. For instance, let's say I want to loop through a very long list of strings and I want to do some basic string operations on each list item. If I split up the list, send each sublist to be processed by my loop/string code in a new thread, and send the results back in a queue, will these workloads run roughly at the same time? Most importantly will this theoretically speed up the time it takes to run the script?
Another example might be if I can render and save four different pictures using PIL in four different threads, and have this be faster than processing the pictures one by one after each other? I guess this speed-component is what I'm really wondering about rather than what the correct terminology is.
I also know about the multiprocessing module but my main interest right now is for small-to-medium task loads (10-30 secs) and so I think multithreading will be more appropriate because subprocesses can be slow to initiate.
The GIL does not prevent threading. All the GIL does is make sure only one thread is executing Python code at a time; control still switches between threads.
What the GIL prevents then, is making use of more than one CPU core or separate CPUs to run threads in parallel.
This only applies to Python code. C extensions can and do release the GIL to allow multiple threads of C code and one Python thread to run across multiple cores. This extends to I/O controlled by the kernel, such as select() calls for socket reads and writes, making Python handle network events reasonably efficiently in a multi-threaded multi-core setup.
What many server deployments then do, is run more than one Python process, to let the OS handle the scheduling between processes to utilize your CPU cores to the max. You can also use the multiprocessing library to handle parallel processing across multiple processes from one codebase and parent process, if that suits your use cases.
Note that the GIL is only applicable to the CPython implementation; Jython and IronPython use a different threading implementation (the native Java VM and .NET common runtime threads respectively).
To address your update directly: Any task that tries to get a speed boost from parallel execution, using pure Python code, will not see a speed-up as threaded Python code is locked to one thread executing at a time. If you mix in C extensions and I/O, however (such as PIL or numpy operations) and any C code can run in parallel with one active Python thread.
Python threading is great for creating a responsive GUI, or for handling multiple short web requests where I/O is the bottleneck more than the Python code. It is not suitable for parallelizing computationally intensive Python code, stick to the multiprocessing module for such tasks or delegate to a dedicated external library.
Yes. :)
You have the low level thread module and the higher level threading module. But it you simply want to use multicore machines, the multiprocessing module is the way to go.
Quote from the docs:
In CPython, due to the Global Interpreter Lock, only one thread can
execute Python code at once (even though certain performance-oriented
libraries might overcome this limitation). If you want your
application to make better use of the computational resources of
multi-core machines, you are advised to use multiprocessing. However,
threading is still an appropriate model if you want to run multiple
I/O-bound tasks simultaneously.
Threading is Allowed in Python, the only problem is that the GIL will make sure that just one thread is executed at a time (no parallelism).
So basically if you want to multi-thread the code to speed up calculation it won't speed it up as just one thread is executed at a time, but if you use it to interact with a database for example it will.
I feel for the poster because the answer is invariably "it depends what you want to do". However parallel speed up in python has always been terrible in my experience even for multiprocessing.
For example check this tutorial out (second to top result in google): https://www.machinelearningplus.com/python/parallel-processing-python/
I put timings around this code and increased the number of processes (2,4,8,16) for the pool map function and got the following bad timings:
serial 70.8921644706279
parallel 93.49704207479954 tasks 2
parallel 56.02441442012787 tasks 4
parallel 51.026168536394835 tasks 8
parallel 39.18044807203114 tasks 16
code:
# increase array size at the start
# my compute node has 40 CPUs so I've got plenty to spare here
arr = np.random.randint(0, 10, size=[2000000, 600])
.... more code ....
tasks = [2,4,8,16]
for task in tasks:
tic = time.perf_counter()
pool = mp.Pool(task)
results = pool.map(howmany_within_range_rowonly, [row for row in data])
pool.close()
toc = time.perf_counter()
time1 = toc - tic
print(f"parallel {time1} tasks {task}")
I've been trying to wrap my head around how threads work in Python, and it's hard to find good information on how they operate. I may just be missing a link or something, but it seems like the official documentation isn't very thorough on the subject, and I haven't been able to find a good write-up.
From what I can tell, only one thread can be running at once, and the active thread switches every 10 instructions or so?
Where is there a good explanation, or can you provide one? It would also be very nice to be aware of common problems that you run into while using threads with Python.
Yes, because of the Global Interpreter Lock (GIL) there can only run one thread at a time. Here are some links with some insights about this:
http://www.artima.com/weblogs/viewpost.jsp?thread=214235
http://smoothspan.wordpress.com/2007/09/14/guido-is-right-to-leave-the-gil-in-python-not-for-multicore-but-for-utility-computing/
From the last link an interesting quote:
Let me explain what all that means.
Threads run inside the same virtual
machine, and hence run on the same
physical machine. Processes can run
on the same physical machine or in
another physical machine. If you
architect your application around
threads, you’ve done nothing to access
multiple machines. So, you can scale
to as many cores are on the single
machine (which will be quite a few
over time), but to really reach web
scales, you’ll need to solve the
multiple machine problem anyway.
If you want to use multi core, pyprocessing defines an process based API to do real parallelization. The PEP also includes some interesting benchmarks.
Python's a fairly easy language to thread in, but there are caveats. The biggest thing you need to know about is the Global Interpreter Lock. This allows only one thread to access the interpreter. This means two things: 1) you rarely ever find yourself using a lock statement in python and 2) if you want to take advantage of multi-processor systems, you have to use separate processes. EDIT: I should also point out that you can put some of the code in C/C++ if you want to get around the GIL as well.
Thus, you need to re-consider why you want to use threads. If you want to parallelize your app to take advantage of dual-core architecture, you need to consider breaking your app up into multiple processes.
If you want to improve responsiveness, you should CONSIDER using threads. There are other alternatives though, namely microthreading. There are also some frameworks that you should look into:
stackless python
greenlets
gevent
monocle
Below is a basic threading sample. It will spawn 20 threads; each thread will output its thread number. Run it and observe the order in which they print.
import threading
class Foo (threading.Thread):
def __init__(self,x):
self.__x = x
threading.Thread.__init__(self)
def run (self):
print str(self.__x)
for x in xrange(20):
Foo(x).start()
As you have hinted at Python threads are implemented through time-slicing. This is how they get the "parallel" effect.
In my example my Foo class extends thread, I then implement the run method, which is where the code that you would like to run in a thread goes. To start the thread you call start() on the thread object, which will automatically invoke the run method...
Of course, this is just the very basics. You will eventually want to learn about semaphores, mutexes, and locks for thread synchronization and message passing.
Note: wherever I mention thread i mean specifically threads in python until explicitly stated.
Threads work a little differently in python if you are coming from C/C++ background. In python, Only one thread can be in running state at a given time.This means Threads in python cannot truly leverage the power of multiple processing cores since by design it's not possible for threads to run parallelly on multiple cores.
As the memory management in python is not thread-safe each thread require an exclusive access to data structures in python interpreter.This exclusive access is acquired by a mechanism called GIL ( global interpretr lock ).
Why does python use GIL?
In order to prevent multiple threads from accessing interpreter state simultaneously and corrupting the interpreter state.
The idea is whenever a thread is being executed (even if it's the main thread), a GIL is acquired and after some predefined interval of time the
GIL is released by the current thread and reacquired by some other thread( if any).
Why not simply remove GIL?
It is not that its impossible to remove GIL, its just that in prcoess of doing so we end up putting mutiple locks inside interpreter in order to serialize access, which makes even a single threaded application less performant.
so the cost of removing GIL is paid off by reduced performance of a single threaded application, which is never desired.
So when does thread switching occurs in python?
Thread switch occurs when GIL is released.So when is GIL Released?
There are two scenarios to take into consideration.
If a Thread is doing CPU Bound operations(Ex image processing).
In Older versions of python , Thread switching used to occur after a fixed no of python instructions.It was by default set to 100.It turned out that its not a very good policy to decide when switching should occur since the time spent executing a single instruction can
very wildly from millisecond to even a second.Therefore releasing GIL after every 100 instructions regardless of the time they take to execute is a poor policy.
In new versions instead of using instruction count as a metric to switch thread , a configurable time interval is used.
The default switch interval is 5 milliseconds.you can get the current switch interval using sys.getswitchinterval().
This can be altered using sys.setswitchinterval()
If a Thread is doing some IO Bound Operations(Ex filesystem access or
network IO)
GIL is release whenever the thread is waiting for some for IO operation to get completed.
Which thread to switch to next?
The interpreter doesn’t have its own scheduler.which thread becomes scheduled at the end of the interval is the operating system’s decision. .
Use threads in python if the individual workers are doing I/O bound operations. If you are trying to scale across multiple cores on a machine either find a good IPC framework for python or pick a different language.
One easy solution to the GIL is the multiprocessing module. It can be used as a drop in replacement to the threading module but uses multiple Interpreter processes instead of threads. Because of this there is a little more overhead than plain threading for simple things but it gives you the advantage of real parallelization if you need it.
It also easily scales to multiple physical machines.
If you need truly large scale parallelization than I would look further but if you just want to scale to all the cores of one computer or a few different ones without all the work that would go into implementing a more comprehensive framework, than this is for you.
Try to remember that the GIL is set to poll around every so often in order to do show the appearance of multiple tasks. This setting can be fine tuned, but I offer the suggestion that there should be work that the threads are doing or lots of context switches are going to cause problems.
I would go so far as to suggest multiple parents on processors and try to keep like jobs on the same core(s).
I'm working on simulating a mesh network with a large number of nodes. The nodes pass data between different master nodes throughout the network.
Each master comes live once a second to receive the information, but the slave nodes don't know when the master is up or not, so when they have information to send, they try and do so every 5 ms for 1 second to make sure they can find the master.
Running this on a regular computer with 1600 nodes results in 1600 threads and the performance is extremely bad.
What is a good approach to handling the threading so each node acts as if it is running on its own thread?
In case it matters, I'm building the simulation in python 2.7, but I'm open to changing to something else if that makes sense.
For one, are you really using regular, default Python threads available in the default Python 2.7 interpreter (CPython), and is all of your code in Python? If so, you are probably not actually using multiple CPU cores because of the global interpreter lock CPython has (see https://wiki.python.org/moin/GlobalInterpreterLock). You could maybe try running your code under Jython, just to check if performance will be better.
You should probably rethink your application architecture and switch to manually scheduling events instead of using threads, or maybe try using something like greenlets (https://stackoverflow.com/a/15596277/1488821), but that would probably mean less precise timings because of lack of parallelism.
To me, 1600 threads sounds like a lot but not excessive given that it's a simulation. If this were a production application it would probably not be production-worthy.
A standard machine should have no trouble handling 1600 threads. As to the OS this article could provide you with some insights.
When it comes to your code a Python script is not a native application but an interpreted script and as such will require more CPU resources to execute.
I suggest you try implementing the simulation in C or C++ instead which will produce a native application which should execute more efficiently.
Do not use threading for that. If sticking to Python, let the nodes perform their actions one by one. If the performance you get doing so is OK, you will not have to use C/C++. If the actions each node perform are simple, that may work. Anyway, there is no reason to use threads in Python at all. Python threads are usable mostly for making blocking I/O not to block your program, not for multiple CPU kernels utilization.
If you want to really use parallel processing and to write your nodes as if they were really separated and exchanging only using messages, you may use Erlang (http://www.erlang.org/). It is a functional language very well suited for executing parallel processes and making them exchange messages. Erlang processes do not map to OS threads, and you may create thousands of them. However, Erlang is a purely functional language and may seem extremely strange if you have never used such languages. And it also is not very fast, so, like Python, it is unlikely to handle 1600 actions every 5ms unless the actions are rather simple.
Finally, if you do not get desired performance using Python or Erlang, you may move to C or C++. However, still do not use 1600 threads. In fact, using threads to gain performance is reasonable only if the number of threads does not dramatically exceed number of CPU kernels. A reactor pattern (with several reactor threads) is what you may need in that case (http://en.wikipedia.org/wiki/Reactor_pattern). There is an excellent implementation of the reactor pattern in boost.asio library. It is explained here: http://www.gamedev.net/blog/950/entry-2249317-a-guide-to-getting-started-with-boostasio/
Some random thoughts here:
I did rather well with several hundred threads working like this in Java; it can be done with the right language. (But I haven't tried this in Python.)
In any language, you could run the master node code in one thread; just have it loop continuously, running the code for each master in each cycle. You'll lose the benefits of multiple cores that way, though. On the other hand, you'll lose the problems of multithreading, too. (You could have, say, 4 such threads, utilizing the cores but getting the multithreading headaches back. It'll keep the thread-overhead down, too, but then there's blocking...)
One big problem I had was threads blocking each other. Enabling 100 threads to call the same method on the same object at the same time without waiting for each other requires a bit of thought and even research. I found my multithreading program at first often used only 25% of a 4-core CPU even when running flat out. This might be one reason you're running slow.
Don't have your slave nodes repeat sending data. The master nodes should come alive in response to data coming in, or have some way of storing it until they do come alive, or some combination.
It does pay to have more threads than cores. Once you have two threads, they can block each other (and will if they share any data). If you have code to run that won't block, you want to run it in its own thread so it won't be waiting for code that does block to unblock and finish. I found once I had a few threads, they started to multiply like crazy--hence my hundreds-of-threads program. Even when 100 threads block at one spot despite all my brilliance, there's plenty of other threads to keep the cores busy!
I've been trying to wrap my head around how threads work in Python, and it's hard to find good information on how they operate. I may just be missing a link or something, but it seems like the official documentation isn't very thorough on the subject, and I haven't been able to find a good write-up.
From what I can tell, only one thread can be running at once, and the active thread switches every 10 instructions or so?
Where is there a good explanation, or can you provide one? It would also be very nice to be aware of common problems that you run into while using threads with Python.
Yes, because of the Global Interpreter Lock (GIL) there can only run one thread at a time. Here are some links with some insights about this:
http://www.artima.com/weblogs/viewpost.jsp?thread=214235
http://smoothspan.wordpress.com/2007/09/14/guido-is-right-to-leave-the-gil-in-python-not-for-multicore-but-for-utility-computing/
From the last link an interesting quote:
Let me explain what all that means.
Threads run inside the same virtual
machine, and hence run on the same
physical machine. Processes can run
on the same physical machine or in
another physical machine. If you
architect your application around
threads, you’ve done nothing to access
multiple machines. So, you can scale
to as many cores are on the single
machine (which will be quite a few
over time), but to really reach web
scales, you’ll need to solve the
multiple machine problem anyway.
If you want to use multi core, pyprocessing defines an process based API to do real parallelization. The PEP also includes some interesting benchmarks.
Python's a fairly easy language to thread in, but there are caveats. The biggest thing you need to know about is the Global Interpreter Lock. This allows only one thread to access the interpreter. This means two things: 1) you rarely ever find yourself using a lock statement in python and 2) if you want to take advantage of multi-processor systems, you have to use separate processes. EDIT: I should also point out that you can put some of the code in C/C++ if you want to get around the GIL as well.
Thus, you need to re-consider why you want to use threads. If you want to parallelize your app to take advantage of dual-core architecture, you need to consider breaking your app up into multiple processes.
If you want to improve responsiveness, you should CONSIDER using threads. There are other alternatives though, namely microthreading. There are also some frameworks that you should look into:
stackless python
greenlets
gevent
monocle
Below is a basic threading sample. It will spawn 20 threads; each thread will output its thread number. Run it and observe the order in which they print.
import threading
class Foo (threading.Thread):
def __init__(self,x):
self.__x = x
threading.Thread.__init__(self)
def run (self):
print str(self.__x)
for x in xrange(20):
Foo(x).start()
As you have hinted at Python threads are implemented through time-slicing. This is how they get the "parallel" effect.
In my example my Foo class extends thread, I then implement the run method, which is where the code that you would like to run in a thread goes. To start the thread you call start() on the thread object, which will automatically invoke the run method...
Of course, this is just the very basics. You will eventually want to learn about semaphores, mutexes, and locks for thread synchronization and message passing.
Note: wherever I mention thread i mean specifically threads in python until explicitly stated.
Threads work a little differently in python if you are coming from C/C++ background. In python, Only one thread can be in running state at a given time.This means Threads in python cannot truly leverage the power of multiple processing cores since by design it's not possible for threads to run parallelly on multiple cores.
As the memory management in python is not thread-safe each thread require an exclusive access to data structures in python interpreter.This exclusive access is acquired by a mechanism called GIL ( global interpretr lock ).
Why does python use GIL?
In order to prevent multiple threads from accessing interpreter state simultaneously and corrupting the interpreter state.
The idea is whenever a thread is being executed (even if it's the main thread), a GIL is acquired and after some predefined interval of time the
GIL is released by the current thread and reacquired by some other thread( if any).
Why not simply remove GIL?
It is not that its impossible to remove GIL, its just that in prcoess of doing so we end up putting mutiple locks inside interpreter in order to serialize access, which makes even a single threaded application less performant.
so the cost of removing GIL is paid off by reduced performance of a single threaded application, which is never desired.
So when does thread switching occurs in python?
Thread switch occurs when GIL is released.So when is GIL Released?
There are two scenarios to take into consideration.
If a Thread is doing CPU Bound operations(Ex image processing).
In Older versions of python , Thread switching used to occur after a fixed no of python instructions.It was by default set to 100.It turned out that its not a very good policy to decide when switching should occur since the time spent executing a single instruction can
very wildly from millisecond to even a second.Therefore releasing GIL after every 100 instructions regardless of the time they take to execute is a poor policy.
In new versions instead of using instruction count as a metric to switch thread , a configurable time interval is used.
The default switch interval is 5 milliseconds.you can get the current switch interval using sys.getswitchinterval().
This can be altered using sys.setswitchinterval()
If a Thread is doing some IO Bound Operations(Ex filesystem access or
network IO)
GIL is release whenever the thread is waiting for some for IO operation to get completed.
Which thread to switch to next?
The interpreter doesn’t have its own scheduler.which thread becomes scheduled at the end of the interval is the operating system’s decision. .
Use threads in python if the individual workers are doing I/O bound operations. If you are trying to scale across multiple cores on a machine either find a good IPC framework for python or pick a different language.
One easy solution to the GIL is the multiprocessing module. It can be used as a drop in replacement to the threading module but uses multiple Interpreter processes instead of threads. Because of this there is a little more overhead than plain threading for simple things but it gives you the advantage of real parallelization if you need it.
It also easily scales to multiple physical machines.
If you need truly large scale parallelization than I would look further but if you just want to scale to all the cores of one computer or a few different ones without all the work that would go into implementing a more comprehensive framework, than this is for you.
Try to remember that the GIL is set to poll around every so often in order to do show the appearance of multiple tasks. This setting can be fine tuned, but I offer the suggestion that there should be work that the threads are doing or lots of context switches are going to cause problems.
I would go so far as to suggest multiple parents on processors and try to keep like jobs on the same core(s).