Does it make sense to multi-thread within multiprocessing?

With Python's multiprocessing, would it make sense to have a Pool with a bunch of ThreadPools within them? Say I have something like:
def task(path):
    # i/o bound
    image = load(path)
    # cpu bound but only takes up 1/10 of the time of the i/o bound stuff
    image = preprocess(image)
    # i/o bound
    save(image, path)
Then I'd want to process a list of paths path_list. If I use ThreadPool I still end up hitting a ceiling because of the cpu bound bit. If I use a Pool I spend too much dead time waiting for i/o. So wouldn't it be best to split path_list over multiple processes that each in turn use multiple threads?
Another shorter way of restating my example is what if I have a method that should be multithreaded because it's i/o bound but I also want to make use of many cpu cores? If I use a Pool I'm using each core up for a single task which is i/o bound. If I use a ThreadPool I only get to use one core.

Does it make sense
Yes. Let's say you start with one process and one thread. Because some parts of the code block on IO, the process will utilize less than 100% of a CPU core, so we start adding threads. As long as task throughput keeps increasing along with CPU utilization, CPU time is what limits us. At some point, we might hit 100% CPU utilization in our process. Because of the GIL, a pure-Python process can utilize at most 100% of a single core. But, as far as we know, the CPU might still be our bottleneck, and the only way to gain more CPU time is to create another process (or use subinterpreters, but let's ignore that for now).
In summary, this is a valid approach for increasing throughput of pure-python tasks that both utilize CPU and block on IO. But, it does not mean that it is a good approach in your case. First, your bottleneck might be the disk and not the CPU, in which case you don't need more CPU time, which means you don't need more processes. Second, even if the CPU is the bottleneck, multithreading within multiprocessing is not necessarily the simplest solution, the most performant solution, or the winning solution in other resource utilization metrics such as memory usage.
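The approach discussed above, threads inside each worker process, can be sketched as follows. This is a minimal sketch, not a tuned implementation: `task` is a hypothetical stand-in for the load/preprocess/save function from the question, and the process and thread counts are arbitrary assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Pool


def task(path):
    # Hypothetical stand-in for load -> preprocess -> save from the question.
    return path.upper()


def run_chunk(paths, threads_per_process=4):
    # Runs inside one worker process: the threads overlap each other's I/O waits.
    with ThreadPoolExecutor(max_workers=threads_per_process) as executor:
        return list(executor.map(task, paths))


def split(items, parts):
    # Round-robin split so every process gets a similar share of the work.
    return [items[i::parts] for i in range(parts)]


if __name__ == "__main__":
    path_list = [f"img_{i}.png" for i in range(20)]
    process_count = 2  # assumption: pick this from measurements, not from this sketch
    with Pool(processes=process_count) as pool:
        chunks = pool.map(run_chunk, split(path_list, process_count))
    results = [result for chunk in chunks for result in chunk]
    print(len(results))
```

Note that the round-robin split does not preserve the original order of `path_list` in `results`; if order matters, split into contiguous slices instead.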
For example, if simplicity is your top priority, you could get all the CPU time you need just by using processes. This solution is easier to implement, but is heavy in terms of memory usage. Or, for example, if your goal is to achieve maximal performance and minimal memory utilization, then you probably want to replace the threads with an IO loop and use a process pool executor for your CPU-bound tasks. Squeezing maximal performance from your hardware is not an easy task. Below is a methodology that I feel has served me well.
Aiming towards maximal performance
From now on, I'm assuming your goal is to make maximal use of your hardware in order to achieve a maximal throughput of "tasks". In that case, the final solution depends on your hardware, so you'll need to get to know it a little bit better. To try and reach your performance goals, I recommend to:
Understand your hardware utilization
Identify the bottleneck and estimate the maximal throughput
Design a solution to achieve that throughput
Implement the design, and optimize until you meet your requirements
In detail:
1. Understand your hardware utilization
In this case, there are a few pieces of hardware involved:
The RAM
The disk
The CPU
Let's look at one "task" and note how it uses the hardware:
Disk (read)
RAM (write)
CPU time
RAM (read)
Disk (write)
2. Identify the bottleneck and estimate the maximal throughput
To identify the bottleneck, let us calculate the maximum throughput of tasks that each hardware component can provide, assuming usage of them can be completely parallelized. I like to do that using python:
(note that I'm using random constants, you'll have to fill in the real data for your setup in order to use it).
# ----------- General consts
input_image_size = 20 * 2 ** 20 # 20MB
output_image_size = 15 * 2 ** 20 # 15MB
# ----------- Disk
# If you have multiple disks and disk access is the bottleneck, you could split the images between them
amount_of_disks = 2
disk_read_rate = 3.5 * 2 ** 30 # 3.5GBps, maximum read rate for a good SSD
disk_write_rate = 2.5 * 2 ** 30 # 2.5GBps, maximum write rate for a good SSD
disk_read_throughput = amount_of_disks * disk_read_rate / input_image_size
disk_write_throughput = amount_of_disks * disk_write_rate / output_image_size
# ----------- RAM
ram_bandwidth = 30 * 2 ** 30 # Assuming here similar write and read rates of 30GBps
# assuming you are working in userspace and not using a userspace filesystem,
# data is first read into kernel space, then copied to userspace. So in total,
# two writes and one read.
userspace_ram_bandwidth = ram_bandwidth / 3
ram_read_throughput = userspace_ram_bandwidth / input_image_size
ram_write_throughput = userspace_ram_bandwidth / output_image_size
# ----------- CPU
# We decrease one core, as at least some scheduling code and kernel code is going to run
core_amount = 8 - 1
# The measured amount of times a single core can run the preprocess function in a second.
# Assuming that you are not planning to optimize the preprocess function as well.
preprocess_function_rate = 1000
cpu_throughput = core_amount * preprocess_function_rate
# ----------- Conclusions
min_throughput, bottleneck_name = min([(disk_read_throughput, 'Disk read'),
                                       (disk_write_throughput, 'Disk write'),
                                       (ram_read_throughput, 'RAM read'),
                                       (ram_write_throughput, 'RAM write'),
                                       (cpu_throughput, 'CPU')])
cpu_cores_needed = min_throughput / preprocess_function_rate
print(f'Throughput: {min_throughput:.1f} tasks per second\n'
      f'Bottleneck: {bottleneck_name}\n'
      f'Worker amount: {cpu_cores_needed:.1f}')
This code outputs:
Throughput: 341.3 tasks per second
Bottleneck: Disk write
Worker amount: 0.3
That means:
The maximum rate we can achieve is around 341.3 tasks per second
The disk is the bottleneck. You might be able to increase your performance by, for example:
Buying more disks
Using ramfs or a similar solution that avoids using the disk altogether
In a system where all the steps in task are executed in parallel, you won't need to dedicate more than one core for running preprocess. (In python that means you'll probably need only one process, and threads or asyncio would suffice to achieve concurrency with other steps)
Note: the numbers are lying
This kind of estimation is very hard to get right. It's hard not to forget things in the calculation itself, and hard to achieve good measurements for the constants. For example, there is a big issue with the current calculation - reads and writes are not orthogonal. We assume in our calculation that everything is happening in parallel, so constants like disk_read_rate have to account for writes occurring simultaneously to the reads. The RAM rates should probably be decreased by at least 50%.
3. Design a solution to achieve that throughput
Similar to what you proposed in your question, my initial design would be something like:
Have a pool of workers load the images and send them on a queue to the next step (we'll need to be reading using multiple cores to use all available memory bandwidth)
Have a pool of workers process the images and send the results on a queue (the amount of workers should be chosen according to the output of the script above. For the current result, the number is 1)
Have a pool of workers save the processed images to the disk.
The actual implementation details will vary according to different technical constraints and overheads you will run into while implementing the solution. Without further details and measurements it is hard to guess what they will be exactly.
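The three-stage design above can be sketched with bounded queues between the stages. This is only an illustration under stated assumptions: every stage runs in threads (which fits the estimate of a single preprocessing worker; a CPU-bound stage needing more than one core would need processes instead), and the names `start_stage`, `finish_stage`, `run_pipeline` and `demo` are hypothetical.

```python
import queue
import threading

STOP = object()  # sentinel: tells one worker that no more items will arrive


def start_stage(func, inbox, outbox, n_workers):
    def worker():
        while True:
            item = inbox.get()
            if item is STOP:
                break
            result = func(item)
            if outbox is not None:
                outbox.put(result)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    return threads


def finish_stage(threads, inbox):
    # Queue one sentinel per worker behind all real items, then wait for the workers.
    for _ in threads:
        inbox.put(STOP)
    for thread in threads:
        thread.join()


def run_pipeline(paths, load, preprocess, save, io_workers=4, cpu_workers=1):
    to_load = queue.Queue()
    loaded = queue.Queue(maxsize=16)      # bounded queues cap memory use
    processed = queue.Queue(maxsize=16)
    loaders = start_stage(load, to_load, loaded, io_workers)
    processors = start_stage(preprocess, loaded, processed, cpu_workers)
    savers = start_stage(save, processed, None, io_workers)
    for path in paths:
        to_load.put(path)
    finish_stage(loaders, to_load)
    finish_stage(processors, loaded)
    finish_stage(savers, processed)


def demo():
    saved = []
    run_pipeline(
        paths=[f"img_{i}" for i in range(10)],
        load=lambda path: (path, len(path)),             # fake "image": the path length
        preprocess=lambda item: (item[0], item[1] * 2),  # fake processing
        save=saved.append,
    )
    return sorted(saved)


if __name__ == "__main__":
    print(demo())
```

The bounded queues apply backpressure: if saving is the slow stage, the loaders eventually block instead of filling the RAM with images waiting to be written.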
4. Implement the design, and optimize until you meet your requirements
Good luck, and be warned that even if you did a good job at estimating the maximal throughput, it might be very hard to get there. Comparing the maximum rate to your speed requirements might give you a good idea of the amount of effort needed. For example, if the rate you need is 10x slower than the maximum rate, you might be done pretty quickly. But if it is only 2x slower, you might want to consider doubling your hardware and start preparing for some hard work :)

kmarok's answer is a good technical one. But I would also consider the maxim "Premature optimization is the root of all evil".
In short, yes, it makes sense. But do you really need to?
Optimization is a trade off. You compromise code simplicity for better performance. Code simplicity is important; you'll need to further develop, debug, and test your software in the future. This will cost you in time. Simplicity buys you time. You need to be aware of the trade-off when you optimize.
I would first write a multithreaded version and measure it using your hardware.
Then I would try the multiprocessing version, and measure it too.
Is either of the versions good enough? It might be. If so, you have just kept your software simpler, more readable, and more maintainable.
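The measure-first advice above could be followed with a small timing harness like the one below. It is a sketch under assumptions: `fake_task` is a hypothetical stand-in for the real task (a short sleep for the I/O part plus a little CPU work), and the worker count of 4 is arbitrary.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def fake_task(n):
    # Hypothetical stand-in: a short "I/O" wait followed by a little CPU work.
    time.sleep(0.01)
    return sum(i * i for i in range(10_000)) + n


def measure(executor_cls, func, items, workers):
    # Returns the wall-clock seconds needed to run func over all items.
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as executor:
        list(executor.map(func, items))
    return time.perf_counter() - start


if __name__ == "__main__":
    items = list(range(40))
    for executor_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        seconds = measure(executor_cls, fake_task, items, workers=4)
        print(f"{executor_cls.__name__}: {seconds:.3f}s")
```

Swapping `fake_task` for the real function and running this on the target hardware shows whether either simple version already meets the requirement.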

Chen's and Kamaork's answers summarize most of what you need to know, but two ideas are missing:
Your code will be A process and not THE process. This means you need to account for how many resources you have left, not how many you could have (this can even happen within your process; threads are not unlimited). This deadly problem happened to me, leaving me with less than half of a Celeron for a GUI. Not good.
The biggest optimization you can make with threads is "prediction" (this refers more specifically to when things happen): you can chain the threads better when you know how long each one takes to complete and the wait is consistent. Reading about the TCP window may give you a better idea of how a delay can be optimized by expecting it rather than by forcing it.

Related

Basic Mapreduce with threads is slower than sequential version

I am trying to do a word counter with mapreduce using threads, but this version is much slower than the sequential version. With a 300MB text file the mapreduce version takes about 80s, with the sequential version it takes significantly less. My question is due to not understanding why, as I have done all the stages of map reduce (split, mapping, shuffle, reduce) but I can't figure out why it is slower, as I have used about 6 threads to do the test. I was thinking that it could be that the creation of threads was expensive compared to the execution time, but since it takes about 80s I think it is clear that this is not the problem. Could you take a look at the code to see what it is? I'm pretty sure that the code works fine, the problem is that I don't know what is causing the slowness.
One last thing, when using a text file of more than 300MB, the program fills all the ram memory of my computer, is there any way to optimize it?

First of all several disclaimers:
to know the exact reason why the application is slow you need to profile it. In this answer I'm giving some common sense reasoning.
I'm assuming you are using CPython
When you parallelize an algorithm, several factors influence performance. Some of them work in favour of speed (I'll mark them with +) and some against (-). Let's look at them:
1. you need to split the work first (-)
2. work in the parallel workers is done simultaneously (+)
3. parallel workers may need to synchronize their work (-)
4. reduce requires time (-)
For your parallel algorithm to give you some gain compared to the sequential one, the factors that speed things up must outweigh the factors that drag you down.
Also, the gain from #2 should be big enough to compensate for the additional work you need to do compared to sequential processing (this means that for small data you will not get any boost, as the coordination overhead will be bigger).
The main problems in your implementation now are items #2 and #3.
First of all, the workers are not working in parallel. The portion of the task you parallelize is CPU bound. In CPython, because of the GIL, the threads of a single process cannot execute Python code on more than one CPU at a time. So in this program they never execute in parallel. They share the same CPU.
Moreover, every modification they make to the dicts involves locking/unlocking, and this is much slower than the sequential version, which does not require such synchronization.
To speed up your algorithm you need to:
use multiprocessing instead of multithreading (this way you can use multiple CPU for processing)
structure the algorithm in a way that does not require synchronization between workers when they do their job (each worker should use its own dict to store intermediate results)
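The two recommendations above can be sketched as follows. This is a minimal illustration, not the asker's code: the function names, the chunk size and the sample input are assumptions. Each worker process fills its own private `Counter`, and the partial results are merged once in the parent.

```python
from collections import Counter
from multiprocessing import Pool


def count_words(lines):
    # Map step: each worker fills its own private Counter, so no locks are needed.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def merge(partials):
    # Reduce step: runs once in the parent, after the workers have finished.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def chunked(lines, size):
    # Split step: contiguous slices of at most `size` lines.
    return [lines[i:i + size] for i in range(0, len(lines), size)]


if __name__ == "__main__":
    lines = ["a b a", "b c", "a c c"] * 1000
    with Pool(processes=4) as pool:
        partials = pool.map(count_words, chunked(lines, 500))
    print(merge(partials).most_common(3))
```

The chunks are pickled and copied to the workers, so the chunk size trades coordination overhead against memory use; it has to be tuned on real data.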

If my 8 core CPU supports 16 threads, would 16 be a better number than 8 for number of processes in a Pool?

I am using multi-processing in python 3.7
Some articles say that a good number for number of processes to be used in Pool is the number of CPU cores.
My AMD Ryzen CPU has 8 cores and can run 16 threads.
So, should the number of processes be 8 or 16?
import multiprocessing as mp
pool = mp.Pool( processes = 16 ) # since 16 threads are supported?

Q : "So, should the number of processes be 8 or 16?"
If the workloads distributed across the herd of sub-processes are cache re-use intensive (not memory-I/O bound), the SpaceDOMAIN constraints rule: the size of the cacheable data will play the cardinal role in deciding between 8 and 16.
Why ?
Because the costs of memory-I/O are about a thousand times more expensive in the TimeDOMAIN, paying about 3xx - 4xx [ns] per memory-I/O, compared to 0.1 ~ 0.4 [ns] for in-cache data.
How to Make The Decision ?
Make a small scale test, before deciding on production scale configuration.
If the to-be-distributed workloads depend on network-I/O, or on another remarkable (locally non-singular) source of latency, the TimeDOMAIN may benefit from a latency-masking trick: running 16, 160 or even 1600 threads ( not processes in this case ).
Why ?
Because over-the-network-I/O provides so much waiting time: a few [ms] of network-I/O RTT latency are time enough to do about 1E7 ( ~ 10.000.000 ) uop-s per CPU-core, which is quite a lot of work. So smart interleaving, even of whole processes, may fit here, as may plain latency-masked thread-based concurrent processing ( threads waiting for the remote "answer" from over-the-network-I/O ought not fight for the GIL-lock, as they have nothing to compute until they receive their expected I/O-bytes back, have they? )
How to Make The Decision ?
Review the code to determine how many over-the-network-I/O fetches and how many cache-footprint-sized reads are in the game (in 2020/Q2+ L1-caches grew to about a few [MB]-s). Where these operations repeat many times, do not hesitate to spin up one thread per "slow" network-I/O target: the processing will benefit from the coincidental masking of the "long" waiting times, at the cost of cheap ("fast") and, because the waits are many and long, rather sparse thread-switching, or even of the O/S-driven process-scheduler mapping whole sub-processes onto a free CPU-core.
If the to-be-distributed workloads are some mix of the above cases, there is no other way than to experiment on the actual hardware with its local / non-local resources.
Why ?
Because there is no rule of thumb to fine-tune the mapping of the workload processing onto the actual CPU-core resources.
Still, one may easily find to have paid way more than one ever gets back. The known trap: achieving a SlowDown instead of a ( just wished to get ) SpeedUp.
In all cases, the overhead-strict, resources-aware and atomicity of workload respecting revised Amdahl's Law identifies a point-of-diminishing returns, after which any more workers ( CPU-core-s ) will not improve the wished to get Speedup. Many surprises of getting S << 1 are expressed in Stack Overflow posts, so one may read as many of what not to do (learning by anti-patterns) as one may wish.
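The small-scale test recommended above could look like the sketch below. `work` is a hypothetical CPU-bound stand-in, and the job sizes and candidate pool sizes are assumptions; the point is only to time the same batch of jobs with 8 and with 16 processes on the actual machine.

```python
import time
from multiprocessing import Pool


def work(n):
    # Hypothetical CPU-bound stand-in; replace with a slice of the real workload.
    return sum(i * i for i in range(n))


def time_pool(processes, jobs):
    # Wall-clock seconds to run all jobs on a pool of the given size.
    start = time.perf_counter()
    with Pool(processes=processes) as pool:
        pool.map(work, jobs)
    return time.perf_counter() - start


if __name__ == "__main__":
    jobs = [50_000] * 32
    for processes in (4, 8, 16):
        print(processes, round(time_pool(processes, jobs), 3))
```

The timing includes pool start-up and shutdown, which is deliberate: those overheads are part of what the comparison is meant to expose.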

Is it better to resume a process with different inputs in python or re-launch it?

I want to implement an online recursive parallel algorithm in python.
So every time I get a new observation I want to calculate a new coefficient matrix. Each row of this matrix must be calculated in parallel.
Is it too expensive to create for each time-step a new process that takes as an input the row of the previous time-step and calculates the row for the next and after the calculation kill it and create it again?
Or is it better to have the process running for the whole time? If the second is the best of the two, how can I resume the same process but with different inputs?
Is there any way?

Is it too expensive to create for each time-step a new process...?
Yes, this is always expensive and often very expensive. Persistent processes, which spare you the constant overhead costs for each time-slice of processing, are a more promising option here, but many additional factors have to be taken into account first.
All process-instantiation/termination overhead costs weigh the more, the less mathematically dense/complex the task to calculate is. So if your processing is cheap in the [TIME]-domain, the overhead costs will look all the more expensive ( as your processing will have to pay them many times in a row ... )
All process-instantiations will also pay remarkable overhead costs for memory (re-)allocations for data in the [SPACE]-domain ( whereas with feasible semi-persistent data structures that persistent processes can work with, in-place matrix operators may save you a lot of memory-allocation overhead ... a very important topic for large-scale matrices, as in numerical processing for FEM, ANN, video/image-kernel applications etc. )
Do not rely just on one's gut feelings.
Review all details of this logic in the re-formulated Amdahl's Law to have all the quantitative figures before deciding on this design dilemma. Benchmark each of the processing stages, including the process-instantiations, including memory-transfer costs ( in parameters ), including processing costs of the computing phase of the processing "inside" the one step forward computations, including the costs of re-distribution of results among all involved counterparties.
Only then will you be able to quantitatively determine the break-even point, after which more processes will not improve the processing ( will stop lowering the overall duration and will start adding more overhead costs than the parallel-process accelerated computing can cover ).
Is it better to have the process running for the whole time?
This may help a lot, as it avoids paying the repetitive costs of process instantiation and termination.
Yet, there are costs of signalling and data re-propagation among all such computing-service processes. Also, all processes have to fit inside real RAM, so as not to lose time on swap-out/swap-in [SPACE]-motivated tidal waves, which flow very slowly indeed and would kill the whole idea of [TIME]-motivated performance increases.
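One way to keep processes alive and feed them different inputs at every time-step is a pair of queues, sketched below. The update rule `next_row`, the message layout and the worker count are hypothetical assumptions; only the pattern (long-lived workers, one message per row per step, `None` as the shutdown signal) is the point.

```python
import multiprocessing as mp


def next_row(row, observation):
    # Hypothetical update rule; replace with the real recursion.
    return [value + observation for value in row]


def worker(inbox, outbox):
    # Lives for the whole run; each message is one time-step for one row.
    for index, row, observation in iter(inbox.get, None):
        outbox.put((index, next_row(row, observation)))


if __name__ == "__main__":
    inbox, outbox = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(inbox, outbox)) for _ in range(2)]
    for w in workers:
        w.start()

    rows = [[0.0, 0.0], [1.0, 1.0]]
    for observation in (1.0, 2.0, 3.0):      # one iteration per new observation
        for index, row in enumerate(rows):
            inbox.put((index, row, observation))
        for _ in rows:                        # collect every row before the next step
            index, row = outbox.get()
            rows[index] = row

    for _ in workers:
        inbox.put(None)                       # None tells a worker to exit
    for w in workers:
        w.join()
    print(rows)
```

Every message is pickled and copied between processes, which is exactly the signalling and data re-propagation cost mentioned above; for large rows, shared memory may be worth benchmarking against this.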
Do benchmark + vectorise + at best, JIT/LLVM-compile the code. A must !
This is your strongest lever for performance increases, given python is your choice. If you are serious about performance, there is no need to say more here: numpy + numba are just great for this. If you are shaving the last few [ns] off already performant code, narrow specialisations of calling interfaces and better vectorisation alignments ( cache friendliness ) are your tools.

Python multiprocessing ratio of processors and iteration

This is probably a stupid question. But, if I have a simple function and I want to run it say 100 times and I have 12 processors available, is it better to use 10 processors to run the multiprocessing code or 12?
Basically by using 12 cores will I be saving one iteration time? or it will run 10 iterations in 1st time and then 2 and then 10 and so on?

It's almost always better to use the number of processors available. However, some algorithms need processes to communicate partial results to reach an end result (many image processing algorithms have this constraint). Those algorithms have a limit on the number of processes that should run in parallel, because beyond this limit the cost of communication impairs performance.
However, it depends on a lot of things. Many algorithms are easily parallelizable, yet the cost of parallelism eats into their speedup. Basically, for parallelism to be worth anything, the actual work to be done must be an order of magnitude greater than the cost of parallelism.
In typical multi-threaded languages, you can easily reduce the cost of parallelism by re-using the same threads (thread pooling). However, Python being Python, you must use multiprocessing to achieve true parallelism, which has a huge cost. There is, however, a process pool if you wish to re-use processes.
You need to check how much time it takes to run your algorithm sequentially, how much time it takes to run one iteration, and how many iterations you will have. Only then will you know if parallelization is worth it. If it is worth it, then test with the number of processes going from 1 to 100. This will allow you to find the sweet spot for your algorithm.
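For the specific numbers in the question, a back-of-the-envelope calculation helps. The sketch below assumes that every run of the function takes about the same time and that overheads are negligible, so the work proceeds in "waves"; `rounds` is a hypothetical helper name.

```python
import math


def rounds(tasks, workers):
    # Number of "waves" needed when every task takes about the same time.
    return math.ceil(tasks / workers)


for workers in (10, 12):
    print(workers, rounds(100, workers))
```

Under these assumptions 10 workers need 10 waves and 12 workers need 9, so the two extra cores save roughly one iteration time. A pool does not wait for a whole wave to finish before handing out more work, so real timings should still be measured.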

Python multiprocessing design

I have written an algorithm that takes geospatial data and performs a number of steps. The input data are a shapefile of polygons and covariate rasters for a large raster study area (~150 million pixels). The steps are as follows:
Sample points from within polygons of the shapefile
For each sampling point, extract values from the covariate rasters
Build a predictive model on the sampling points
Extract covariates for target grid points
Apply predictive model to target grid
Write predictions to a set of output grids
The whole process needs to be iterated a number of times (say 100) but each iteration currently takes more than an hour when processed in series. For each iteration, the most time-consuming parts are step 4 and 5. Because the target grid is so large, I've been processing it a block (say 1000 rows) at a time.
I have a 6-core CPU with 32 Gb RAM, so within each iteration, I had a go at using Python's multiprocessing module with a Pool object to process a number of blocks simultaneously (steps 4 and 5) and then write the output (the predictions) to the common set of output grids using a callback function that calls a global output-writing function. This seems to work, but is no faster (actually, it's probably slower) than processing each block in series.
So my question is, is there a more efficient way to do it? I'm interested in the multiprocessing module's Queue class, but I'm not really sure how it works. For example, I'm wondering if it's more efficient to have a queue that carries out steps 4 and 5 then passes the results to another queue that carries out step 6. Or is this even what Queue is for?
Any pointers would be appreciated.

The current state of Python's multi-processing capabilities is not great for CPU bound processing. I fear to tell you that there is no way to make it run faster using the multiprocessing module, nor is it your use of multiprocessing that is the problem.
The real problem is that Python is still bound by the rules of the Global Interpreter Lock (GIL) (I highly suggest the slides). There have been some exciting theoretical and experimental advances on working around the GIL. Python 3.2 even contains a new GIL which solves some of the issues, but introduces others.
For now, it is faster to execute many Python processes, each with a single serial thread, than to attempt to run many threads within one process. This will allow you to avoid issues of acquiring the GIL between threads (by effectively having more GILs). This however is only beneficial if the IPC overhead between your Python processes doesn't eclipse the benefits of the processing.
Eli Bendersky wrote a decent overview article on his experiences with attempting to make a CPU bound process run faster with multiprocessing.
It is worth noting that PEP 371 had the desire to 'side-step' the GIL with the introduction of the multiprocessing module (previously a non-standard package named pyProcessing). However the GIL still seems to play too large of a role in the Python interpreter to make it work well with CPU bound algorithms. Many different people have worked on removing/rewriting the GIL, but nothing has gained enough traction to make it into a Python release.

Some of the multiprocessing examples at python.org are not very clear IMO, and it's easy to start off with a flawed design. Here's a simplistic example I made to get me started on a project:
import multiprocessing, random, time
def busyfunc(runseconds):
    # Keep one core busy for roughly `runseconds` seconds of wall-clock time.
    starttime = time.perf_counter()
    while True:
        for randcount in range(100):
            testnum = random.randint(1, 10000000)
            newnum = testnum / 3.256
        # time.clock() was removed in Python 3.8; perf_counter() replaces it here.
        if time.perf_counter() - starttime > runseconds:
            return
def main(arg):
    print('arg from init:', arg)
    print("I am " + multiprocessing.current_process().name)
    busyfunc(15)
if __name__ == '__main__':
    p = multiprocessing.Process(name="One", target=main, args=('passed_arg1',))
    p.start()
    p = multiprocessing.Process(name="Two", target=main, args=('passed_arg2',))
    p.start()
    p = multiprocessing.Process(name="Three", target=main, args=('passed_arg3',))
    p.start()
    time.sleep(5)
This should exercise 3 processors for 15 seconds. It should be easy to modify it for more. Maybe this will help to debug your current code and ensure you are really generating multiple independent processes.
If you must share data due to RAM limitations, then I suggest this:
http://docs.python.org/library/multiprocessing.html#sharing-state-between-processes
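A minimal sketch of that kind of shared state, using `multiprocessing.Array` from the standard library. The function name `negate_in_place` and the sample data are assumptions made for illustration.

```python
from multiprocessing import Array, Process


def negate_in_place(shared):
    # Writes go straight to shared memory, so the parent sees them without a copy back.
    with shared.get_lock():
        for i in range(len(shared)):
            shared[i] = -shared[i]


if __name__ == "__main__":
    data = Array("i", range(10))      # "i" = C int, allocated in shared memory
    p = Process(target=negate_in_place, args=(data,))
    p.start()
    p.join()
    print(data[:])
```

The array is created once in the parent and handed to the child as an argument; the lock guards against concurrent writers when several processes touch the same elements.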

As Python is not really meant to do intensive number-crunching, I typically convert the time-critical parts of a Python program to C/C++, which speeds things up a lot.
Also, Python multithreading is not very good. Python keeps using a global semaphore (the GIL) for all kinds of things. So even when you use the threads that Python offers, CPU-bound work won't get faster. Threads are useful for applications where they typically wait for things like IO.
When making a C module, you can manually release the global semaphore when processing your data (then, of course, do not access the python values anymore).
It takes some practice using the C API, but it's clearly structured and much easier to use than, for example, the Java native API.
See 'extending and embedding' in the python documentation.
This way you can make the time critical parts in C/C++, and the slower parts with faster programming work in python...

I recommend you first check which parts of your code take the most time, so you're going to have to profile it. I've used http://packages.python.org/line_profiler/#line-profiler with much success, though it does require cython.
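If installing line_profiler is not an option, the standard library's `cProfile` gives a coarser, per-function view. Note that this swaps in a different tool from the one recommended above, and the functions being profiled are hypothetical stand-ins.

```python
import cProfile
import io
import pstats


def slow_step():
    # Hypothetical expensive part of the program.
    return sum(i * i for i in range(200_000))


def fast_step():
    # Hypothetical cheap part of the program.
    return sum(range(1_000))


def pipeline():
    slow_step()
    fast_step()


def profile(func):
    # Runs func under cProfile and returns the report as text.
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    stats.sort_stats("cumulative").print_stats(10)
    return report.getvalue()


if __name__ == "__main__":
    print(profile(pipeline))
```

Sorting by cumulative time puts the functions that dominate the run at the top of the report, which is usually enough to decide what to parallelize first.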
As for Queues, they're mostly used for sharing data and synchronizing threads, though I've rarely used them. I do use multiprocessing all the time.
I mostly follow the map reduce philosophy, which is simple and clean but it has some major overhead, since values have to be packed into dictionaries and copied across each process, when applying the map function ...
You can try segmenting your file and applying your algorithm to different sets.
