I've written a working program in Python that parses a batch of binary files, extracting data into a data structure. Each file takes around a second to parse, which translates to hours for thousands of files. I've successfully implemented a threaded version of the batch parsing method with an adjustable number of threads. I tested the method on 100 files with a varying number of threads, timing each run. Here are the results (0 threads refers to my original, pre-threading code; 1 thread refers to the new version run with a single thread spawned).
0 threads: 83.842 seconds
1 threads: 78.777 seconds
2 threads: 105.032 seconds
3 threads: 109.965 seconds
4 threads: 108.956 seconds
5 threads: 109.646 seconds
6 threads: 109.520 seconds
7 threads: 110.457 seconds
8 threads: 111.658 seconds
Though spawning a thread confers a small performance increase over having the main thread do all the work, increasing the number of threads actually decreases performance. I would have expected to see performance increases, at least up to four threads (one for each of my machine's cores). I know threads have associated overhead, but I didn't think this would matter so much with single-digit numbers of threads.
I've heard of the "global interpreter lock", but as I move up to four threads I do see the corresponding number of cores at work--with two threads two cores show activity during parsing, and so on.
I also tested some different versions of the parsing code to see if my program is I/O bound. It doesn't seem to be; just reading in the file takes a relatively small proportion of the time, and processing the file takes almost all of it. If I skip the I/O and process an already-read copy of a file, adding a second thread damages performance and a third improves it slightly. I'm just wondering why I can't take advantage of my computer's multiple cores to speed things up. Please post any questions or ways I could clarify.
This is sadly how things are in CPython, mainly due to the Global Interpreter Lock (GIL). Python code that's CPU-bound simply doesn't scale across threads (I/O-bound code, on the other hand, might scale to some extent).
There is a highly informative presentation by David Beazley where he discusses some of the issues surrounding the GIL. The video can be found here (thanks #Ikke!)
My recommendation would be to use the multiprocessing module instead of multiple threads.
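As a minimal sketch of what that could look like for the batch-parsing case described above (parse_file here is a hypothetical stand-in for the real parser, and the file names are made up):

import multiprocessing

def parse_file(path):
    # Hypothetical stand-in for the per-file parser: read the binary file
    # and do the CPU-heavy processing on its contents.
    with open(path, "rb") as f:
        data = f.read()
    return len(data)  # placeholder for the real parsed data structure

def parse_batch(paths, workers=4):
    # Each worker is a separate process with its own interpreter and GIL,
    # so the CPU-bound parsing can genuinely run on several cores at once.
    pool = multiprocessing.Pool(processes=workers)
    try:
        return pool.map(parse_file, paths)
    finally:
        pool.close()
        pool.join()

if __name__ == "__main__":
    results = parse_batch(["a.bin", "b.bin", "c.bin"])

One caveat is that arguments and results are pickled between the parent and the worker processes, so the parsed data structure must be picklable, and very large results add some transfer overhead.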
The threading library does not actually run computations on multiple cores simultaneously. For CPU-bound work you should use the multiprocessing library instead.
Related
So I have a batch of 1000 tasks that I assign using parmap/Python's multiprocessing module to 8 cores (dual Xeon machine, 16 physical cores). Currently this runs synchronously.
The issue is that usually one of the cores lags well behind the others and still has several jobs/tasks to complete after all the other cores have finished their work. This may be related to core speed (it's an older computer), but more likely it's because some of the tasks are more difficult than others, so the one core that gets the slightly more difficult jobs lags behind.
I'm a little confused here, but is this what async parallelization does? I've tried using it before, but because this step is part of a very large processing step, it wasn't clear how to create a barrier to force the program to wait until all async processes are done.
Any advice/links to similar questions/answers are appreciated.
[EDIT] To clarify, the processes are ok to run independently, they all save data to disk and do not share variables.
parmap author here
By default, both in multiprocessing and in parmap, tasks are divided in chunks and chunks are sent to each multiprocessing process (see the multiprocessing documentation). The reason behind this is that sending tasks individually to a process would introduce significant computational overhead in many situations. The overhead is reduced if several tasks are sent at once, in chunks.
The number of tasks in each chunk is controlled with chunksize in multiprocessing (and pm_chunksize in parmap). By default, chunksize is computed as "number of tasks"/(4*"pool size"), rounded up (see the multiprocessing source code). So for your case, 1000/(4*8) = 31.25 -> 32 tasks per chunk.
If, as in your case, many computationally expensive tasks fall into the same chunk, that chunk will take a long time to finish.
One "cheap and easy" way to workaround this is to pass a smaller chunksize value. Note that using the extreme chunksize=1 may introduce undesired larger cpu overhead.
A proper queuing system, as suggested in other answers, is a better solution in the long term, but may be overkill for a one-time problem.
You really need to look at creating microservices and using a queue. For instance, you could put a list of jobs in Celery or Redis, and then have the microservices pull from the queue one at a time and process each job. Once done, they pull the next item, and so forth. That way your load is distributed based on readiness, not based on a preset list.
http://www.celeryproject.org/
https://www.fullstackpython.com/task-queues.html
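As a rough, process-local stand-in for that idea (a real deployment would use Celery or Redis as the broker, per the links above), here is a sketch where each worker pulls one job at a time from a shared queue:

import multiprocessing

def worker(jobs):
    # Pull one job at a time: fast workers simply take more jobs instead of
    # idling while a slower worker finishes a pre-assigned batch.
    while True:
        job = jobs.get()
        if job is None:       # sentinel value: no more work
            break
        _ = job * job         # placeholder for the real per-job processing

if __name__ == "__main__":
    jobs = multiprocessing.Queue()
    workers = [multiprocessing.Process(target=worker, args=(jobs,))
               for _ in range(8)]
    for w in workers:
        w.start()
    for job in range(1000):
        jobs.put(job)
    for _ in workers:
        jobs.put(None)        # one sentinel per worker
    for w in workers:
        w.join()              # acts as a barrier: wait for every worker to finish

The final join() also addresses the barrier concern: the main process simply waits until every worker has drained the queue.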
I am running a Python program on a server that has Python 2.7.6. I have used Python's threading module to create multiple threads. I have created 23 threads, so I am confused about whether all my processor cores are actually being used; is there any way I can check this on my server? Any suggestion as to the ideal number of threads to spawn, given the number of processors, in order to improve the efficiency of my program?
David Beazley has a great talk on threading in Python. He also has a presentation on the subject here.
Unfortunately, Python has something called the GIL which limits Python to executing a single thread at a time. To use all your cores you'd have to use multiple processes: See Multiprocessing.
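A minimal sketch of that idea (the work function is an invented stand-in for your real workload):

import multiprocessing

def work(n):
    # Invented CPU-bound placeholder; replace with the real computation.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # One worker process per core; each process has its own GIL, so all
    # cores can be busy at once (unlike 23 threads inside a single process).
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    results = pool.map(work, [10 ** 6] * 23)
    pool.close()
    pool.join()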
Some in the Python community don't necessarily see the GIL as a setback, since you can utilize multiple cores through means other than shared-memory threading.
Look here for a great blog post on utilizing multiple cores in Python.
Edit:
The above is true for CPython (the most common and also the reference implementation of Python). There are a few "yes, but" answers out there (mostly referring to multithreading on other Python implementations), but I'd generally point people to answers like this one that describe the GIL in CPython and the alternatives for utilizing multiple cores.
There is no single answer to this question, and everyone may have a different view. The number of threads that improves your application's performance should be decided after testing several scenarios with different thread counts. Performance depends on how the OS scheduler assigns your threads to the available CPU cores, given the CPU load and the number of processes running. Having 4 cores doesn't mean it's best to run only 4 threads; you can run 23 threads too. This is the illusion of parallelism the scheduler gives us by scheduling process after process, making it look as though everything is running simultaneously.
Here is the thing
If you run only 1 thread, you may not get enough performance. But as you keep increasing the number of threads, the scheduler takes more and more time to switch between them, which eventually hampers your overall application performance.
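As a rough sketch of that kind of measurement (busy_work is an invented CPU-bound placeholder), you can time a fixed amount of work under different thread counts and pick the best one empirically:

import threading
import time

def busy_work(iterations):
    # Invented CPU-bound placeholder for the real per-thread work.
    total = 0
    for i in range(iterations):
        total += i * i
    return total

def run_with_threads(num_threads, total_iterations=8 * 10 ** 6):
    # Split a fixed amount of work across the threads and time the whole run.
    per_thread = total_iterations // num_threads
    threads = [threading.Thread(target=busy_work, args=(per_thread,))
               for _ in range(num_threads)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start

if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 23):
        print("{0:2d} threads: {1:.2f} s".format(n, run_with_threads(n)))

On CPython, don't be surprised if the timings barely improve (or even get worse) as the thread count grows; for CPU-bound work that is exactly the GIL effect discussed elsewhere on this page.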
I am trying to get my head around threading vs. CPU usage. There are plenty of discussions about threading vs. multiprocessing (a good overview being this answer) so I decided to test this out by launching a maximum number of threads on my 8 CPU laptop running Windows 10, Python 3.4.
My assumption was that all the threads would be bound to a single CPU.
EDIT: it turns out that it was not a good assumption. I now understand that for multithreaded code, only one piece of python code can run at once (no matter where/on which core). This is different for multiprocessing code (where processes are independent and run indeed independently).
While I had read about these differences, it was one answer in particular that actually clarified this point for me.
I think it also explains the CPU view below: that it is an average view of many threads spread out on many CPUs, but only one of them running at one given time (which "averages" to all of them running all the time).
It is not a duplicate of the linked question (which addresses the opposite problem, i.e. all threads on one core) and I will leave it hanging in case someone has a similar question one day and is hopefully helped by my enlightenment.
The code
import threading
import time

def calc():
    # Wait a bit, then keep the CPU busy forever.
    time.sleep(5)
    while True:
        a = 2356 ^ 36  # note: ^ is bitwise XOR, not exponentiation, but it still keeps the CPU busy

n = 0
while True:
    try:
        n += 1
        t = threading.Thread(target=calc)
        t.start()
    except RuntimeError:
        # Could not create another thread: report how many were started.
        print("max threads: {n}".format(n=n))
        break
    else:
        print('.')

time.sleep(100000)
This led to 889 threads being started.
The load on the CPUs was, however, distributed across them (and surprisingly low for a pure CPU calculation; the laptop is otherwise idle when not running my script):
Why is it so? Are the threads constantly moved as a pack between CPUs and what I see is just an average (the reality being that at a given moment all threads are on one CPU)? Or are they indeed distributed?
As of today it is still the case that 'one thread holds the GIL'. So one thread is running at a time.
The threads are managed at the operating system level, but inside the interpreter only the thread holding the GIL can execute Python bytecode. The running thread periodically releases the GIL so another thread can acquire it: in older CPython versions this happened every 100 'ticks' (interpreter instructions), while since Python 3.2 it is governed by a time-based switch interval (5 ms by default).
Because the threads in this example do continuous calculations, that switch point is reached very quickly, leading to an almost immediate release of the GIL and a 'battle' between threads to re-acquire it.
So my assumption is that your operating system has a higher-than-expected load because of the (too) fast thread switching plus the almost continuous releasing and acquiring of the GIL. The OS spends more time on switching than on doing any useful calculation.
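The switch interval can be inspected and adjusted from Python; this small snippet is only meant to illustrate the mechanism:

import sys

# CPython 3.2+ switches threads based on a time interval rather than a tick
# count; the default is 5 ms. Raising it makes switches (and their overhead)
# less frequent, at the cost of responsiveness between threads.
print(sys.getswitchinterval())   # 0.005 by default
sys.setswitchinterval(0.01)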
As you mention yourself, for using more than one core at a time, it's better to look at multiprocessing modules (joblib/Parallel).
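For instance, with joblib (assuming it is installed; square is just a placeholder function), each call is dispatched to a worker process when n_jobs > 1:

from joblib import Parallel, delayed

def square(x):
    # Placeholder for a real CPU-bound computation.
    return x * x

if __name__ == "__main__":
    # Four worker processes, each with its own GIL, so the calls
    # genuinely run on different cores.
    results = Parallel(n_jobs=4)(delayed(square)(i) for i in range(100))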
Interesting read:
http://www.dabeaz.com/python/UnderstandingGIL.pdf
Um. The point of multithreading is to make sure the work gets spread out. A really easy cheat is to use as many threads as you have CPU cores. The point is that they are all independent, so they can actually run at the same time. If they were all on the same core, only one thread at a time could really run; they'd pass that core back and forth at the OS level.
Your assumption is wrong and bizarre. What would ever lead you to think they should run on the same CPU and consequently go at 1/8th speed? As the only reason to thread them is typically to get the whole batch to go faster than a single core alone.
In fact, what the hell do you think writing parallel code is for if not to run independently on several cores at the same time? Like this would be pointless and hard to do, let's make complex fetching, branching, and forking routines to accomplish things slower than one core just plugging away at the data?
I have some Python code that leverages ctypes.CDLL; according to the docs this does not involve the GIL. With that said, I am experiencing some bottlenecks that I can't explain from profiling. If I run some trivial code using time.sleep or even ctypes.windll.kernel32.Sleep, I can see that the total time stays the same when the number of threads matches the number of tasks; in other words, if the task is to sleep 1 second and I submit 1 task on 1 thread or 20 tasks on 20 threads, both take ~1 second to complete.
Switching back to my code, it does not scale out as expected; instead, runtime grows linearly with the number of tasks. Profiling indicates waits from acquire() in _thread.lock.
What are some techniques to further dig into this to see where the issue is manifesting? Is ThreadPoolExecutor not the optimal choice here? I understood it implemented a basic thread pool and was no different than ThreadPool from multiprocessing.pool?
I've heard something about "If you want to get maximum performance from parallel application, you should create as many processes as your computer has CPUs, and in each process -- create some (how many?) threads".
Is it true?
I wrote a piece of code implementing this idiom:
import multiprocessing, threading

number_of_processes = multiprocessing.cpu_count()
number_of_threads_in_process = 25  # some constant

def one_thread():
    # very heavyweight function with lots of CPU/IO/network usage
    do_main_work()

def one_process():
    for _ in range(number_of_threads_in_process):
        t = threading.Thread(target=one_thread, args=())
        t.start()

for _ in range(number_of_processes):
    p = multiprocessing.Process(target=one_process, args=())
    p.start()
Is it correct? Will my do_main_work function really run in parallel, not facing any GIL-issues?
Thank you.
It really depends very much on what you're doing.
Keep in mind that in CPython, only one thread at a time can be executing Python bytecode (because of the GIL). So for a computation-intensive problem in CPython threads won't help you that much.
One way to spread out work that can be done in parallel is to use a multiprocessing.Pool. By default this does not use more processes than your CPU has cores. Using many more processes will mainly have them fighting over resources (CPU, memory) rather than getting useful work done.
But taking advantage of multiple processors requires that you have work for them to do! In other words, if the problem cannot be divided into smaller pieces that can be calculated separately and in parallel, many CPU cores will not be of much use.
Additionally, not all problems are bound by the amount of calculation that has to be done.
The RAM of a computer is much slower than the CPU. If the data-set that you're working on is much bigger than the CPU's caches, reading data from and returning the results to RAM might become the speed limit. This is called memory bound.
And if you are working on much more data than can fit in the machine's memory, your program will be doing a lot of reading and writing from disk. A disk is slow compared to RAM and very slow compared to a CPU, so your program becomes I/O-bound.
# very heavyweight function with lots of CPU/IO/network usage
Heavy CPU usage will suffer because of the GIL, so you'll only get a benefit from multiple processes.
IO and network (network is in fact also a kind of IO) won't be affected much by the GIL, because the lock is released explicitly before a blocking operation and re-acquired after the operation completes. There are macro definitions in CPython for this:
Py_BEGIN_ALLOW_THREADS
... Do some blocking I/O operation ...
Py_END_ALLOW_THREADS
There is still a performance hit because the GIL is used in the wrapping code, but you still get better performance with multiple threads.
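You can see this from pure Python with a toy sketch that uses time.sleep as the blocking call that releases the GIL (real file or socket I/O behaves the same way):

import threading
import time

def blocking_io():
    # time.sleep releases the GIL while it blocks, just like real file or
    # network I/O, so many threads can wait at the same time.
    time.sleep(1)

if __name__ == "__main__":
    threads = [threading.Thread(target=blocking_io) for _ in range(20)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Prints roughly 1 second rather than 20, because the waits overlap.
    print("20 blocking calls took {0:.2f} s".format(time.time() - start))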
Finally - and this is a general rule, not only for Python - the optimal number of threads/processes depends on what the program is actually doing. Generally, if it uses the CPU intensively, there is almost no performance boost from running more processes than there are CPU cores. For example, the Gentoo documentation says that the optimal number of threads for the compiler is CPU cores + 1.
I think the number of threads you are using per process is too high. Usually an Intel processor runs 2 hardware threads per core (Hyper-Threading), and the number of cores varies from 2 (Intel Core i3) to 6 (Intel Core i7), so even with all your processes running, at most 6*2 = 12 hardware threads can actually execute at the same time.