Python multiprocessing design

Python multiprocessing design - python

I have written an algorithm that takes geospatial data and performs a number of steps. The input data are a shapefile of polygons and covariate rasters for a large raster study area (~150 million pixels). The steps are as follows:
Sample points from within polygons of the shapefile
For each sampling point, extract values from the covariate rasters
Build a predictive model on the sampling points
Extract covariates for target grid points
Apply predictive model to target grid
Write predictions to a set of output grids
The whole process needs to be iterated a number of times (say 100) but each iteration currently takes more than an hour when processed in series. For each iteration, the most time-consuming parts are step 4 and 5. Because the target grid is so large, I've been processing it a block (say 1000 rows) at a time.
I have a 6-core CPU with 32 Gb RAM, so within each iteration, I had a go at using Python's multiprocessing module with a Pool object to process a number of blocks simultaneously (steps 4 and 5) and then write the output (the predictions) to the common set of output grids using a callback function that calls a global output-writing function. This seems to work, but is no faster (actually, it's probably slower) than processing each block in series.
So my question is, is there a more efficient way to do it? I'm interested in the multiprocessing module's Queue class, but I'm not really sure how it works. For example, I'm wondering if it's more efficient to have a queue that carries out steps 4 and 5 then passes the results to another queue that carries out step 6. Or is this even what Queue is for?
Any pointers would be appreciated.

The current state of Python's multi-processing capabilities are not great for CPU bound processing. I fear to tell you that there is no way to make it run faster using the multiprocessing module nor is it your use of multiprocessing that is the problem.
The real problem is that Python is still bound by the rules of the GlobalInterpreterLock(GIL) (I highly suggest the slides). There have been some exciting theoretical and experimental advances on working around the GIL. Python 3.2 event contains a new GIL which solves some of the issues, but introduces others.
For now, it is faster to execute many Python process with a single serial thread than to attempt to run many threads within one process. This will allow you avoid issues of acquiring the GIL between threads (by effectively having more GILs). This however is only beneficial if the IPC overhead between your Python processes doesn't eclipse the benefits of the processing.
Eli Bendersky wrote a decent overview article on his experiences with attempting to make a CPU bound process run faster with multiprocessing.
It is worth noting that PEP 371 had the desire to 'side-step' the GIL with the introduction of the multiprocessing module (previously a non-standard packaged named pyProcessing). However the GIL still seems to play too large of a role in the Python interpreter to make it work well with CPU bound algorithms. Many different people have worked on removing/rewriting the GIL, but nothing has made enough traction to make it into a Python release.

Some of the multiprocessing examples at python.org are not very clear IMO, and it's easy to start off with a flawed design. Here's a simplistic example I made to get me started on a project:
import os, time, random, multiprocessing
def busyfunc(runseconds):
starttime = int(time.clock())
while 1:
for randcount in range(0,100):
testnum = random.randint(1, 10000000)
newnum = testnum / 3.256
newtime = int(time.clock())
if newtime - starttime > runseconds:
return
def main(arg):
print 'arg from init:', arg
print "I am " + multiprocessing.current_process().name
busyfunc(15)
if __name__ == '__main__':
p = multiprocessing.Process(name = "One", target=main, args=('passed_arg1',))
p.start()
p = multiprocessing.Process(name = "Two", target=main, args=('passed_arg2',))
p.start()
p = multiprocessing.Process(name = "Three", target=main, args=('passed_arg3',))
p.start()
time.sleep(5)
This should exercise 3 processors for 15 seconds. It should be easy to modify it for more. Maybe this will help to debug your current code and ensure you are really generating multiple independent processes.
If you must share data due to RAM limitations, then I suggest this:
http://docs.python.org/library/multiprocessing.html#sharing-state-between-processes

As python is not really meant to do intensive number-cunching, I typically start converting time-critical parts of a python program to C/C++ and speed things up a lot.
Also, the python multithreading is not very good. Python keeps using a global semaphore for all kinds of things. So even when you use the Threads that python offers, things won't get faster. The threads are useful for applications, where threads will typically wait for things like IO.
When making a C module, you can manually release the global semaphore when processing your data (then, of course, do not access the python values anymore).
It takes some practise using the C API, but's its clearly structured and much easier to use than, for example, the Java native API.
See 'extending and embedding' in the python documentation.
This way you can make the time critical parts in C/C++, and the slower parts with faster programming work in python...

I recommend you first check which aspects of your code is taking the most time, so your gonna have to profile it, I've used http://packages.python.org/line_profiler/#line-profiler with much success, though it does require cython.
As for Queues, their mostly used for sharing data/synchronizing threads, though I've rarely used it. I do use multiprocessing all the time.
I mostly follow the map reduce philosophy, which is simple and clean but it has some major overhead, since values have to be packed into dictionaries and copied across each process, when applying the map function ...
You can try segmenting your file and applying your algorithm to different sets.

Related

Basic Mapreduce with threads is slower than sequential version

I am trying to do a word counter with mapreduce using threads, but this version is much slower than the sequential version. With a 300MB text file the mapreduce version takes about 80s, with the sequential version it takes significantly less. My question is due to not understanding why, as I have done all the stages of map reduce (split, mapping, shuffle, reduce) but I can't figure out why it is slower, as I have used about 6 threads to do the test. I was thinking that it could be that the creation of threads was expensive compared to the execution time, but since it takes about 80s I think it is clear that this is not the problem. Could you take a look at the code to see what it is? I'm pretty sure that the code works fine, the problem is that I don't know what is causing the slowness.
One last thing, when using a text file of more than 300MB, the program fills all the ram memory of my computer, is there any way to optimize it?

First of all several disclaimers:
to know the exact reason why the application is slow you need to profile it. In this answer I'm giving some common sense reasoning.
I'm assuming you are using cPython
When you parallelize some algorithm there are several factors that that influence performance. Some of them work in favour of speed (I'l mark them with +) and some against (-). Let's look at them:
you need to split the work first (-)
work is parallel workers is done simultaneously (+)
parallel workers may need to synchronize their work (-)
reduce requires time (-)
In order for you parallel algorithm give you some gain as compared to sequential you need that all factors that speeds things up overweight all factors that drags you down.
Also the gain from #2 should be big enough to compensate for the additional work you need to do as compared to sequential processing (this means that for some small data you will not get any boost as overhead for coordination will be bigger).
The main problems in your implementation now are items #2 and #3.
First of all the workers are not working in parallel. The portion of the task you parallelize is CPU bound. In python threads of a single process cannot use more than one CPU. So in this program they never execute in parallel. They share the same CPU.
Moreover every modification operation they do on the dicts uses locking/unlocking and this is much slower then sequential version that does not require such synchronization.
To speed up your algorithm you need:
use multiprocessing instead of multithreading (this way you can use multiple CPU for processing)
structure the algorithm in a way that does not require synchronization between workers when they do their job (each worker should use its own dict to store intermediate results)

Does it make sense to multi-thread within multiprocessing?

With Python's multiprocessing, would it make sense to have a Pool with a bunch of ThreadPools within them? Say I have something like:
def task(path):
# i/o bound
image = load(path)
# cpu bound but only takes up 1/10 of the time of the i/o bound stuff
image = preprocess(img)
# i/o bound
save(image, path)
Then I'd want to process a list of paths path_list. If I use ThreadPool I still end up hitting a ceiling because of the cpu bound bit. If I use a Pool I spend too much dead time waiting for i/o. So wouldn't it be best to split path_list over multiple processes that each in turn use multiple threads?
Another shorter way of restating my example is what if I have a method that should be multithreaded because it's i/o bound but I also want to make use of many cpu cores? If I use a Pool I'm using each core up for a single task which is i/o bound. If I use a ThreadPool I only get to use one core.

Does it make sense
Yes. Let's say you start with one process and one thread. Because some parts of the code block on IO, the process will utilize less than a 100% CPU - so we start adding threads. As long as we see an increase in task throughput, it means the CPU is our bottleneck. At some point, we might hit 100% CPU utilization in our process. Because of the GIL, a pure python process can utilize up to 100% CPU. But, as far as we know, the CPU might still be our bottleneck, and the only way to gain more CPU time is to create another process (or use subinterpreters, but let's ignore that for now).
In summary, this is a valid approach for increasing throughput of pure-python tasks that both utilize CPU and block on IO. But, it does not mean that it is a good approach in your case. First, your bottleneck might be the disk and not the CPU, in which case you don't need more CPU time, which means you don't need more processes. Second, even if the CPU is the bottleneck, multithreading within multiprocessing is not necessarily the simplest solution, the most performant solution, or the winning solution in other resource utilization metrics such as memory usage.
For example, if simplicity is your top priority, you could get all the CPU time you need just by using processes. This solution is easier to implement, but is heavy in terms of memory usage. Or, for example, if your goal is to achieve maximal performance and minimal memory utilization, then you you probably want to replace the threads with an IO loop and use a process pool executor for your CPU-bound tasks. Squeezing maximal performance from your hardware is not an easy task. Below is a methodology that I feel had served me well.
Aiming towards maximal performance
From now on, I'm assuming your goal is to make maximal use of your hardware in order to achieve a maximal throughput of "tasks". In that case, the final solution depends on your hardware, so you'll need to get to know it a little bit better. To try and reach your performance goals, I recommend to:
Understand your hardware utilization
Identify the bottleneck and estimate the maximal throughput
Design a solution to achieve that throughput
Implement the design, and optimize until you meet your requirements
In detail:
1. Understand your hardware utilization
In this case, there are a few pieces of hardware involved:
The RAM
The disk
The CPU
Let's look at one "task" and note how it uses the hardware:
Disk (read)
RAM (write)
CPU time
RAM (read)
Disk (write)
2. Identify the bottleneck and estimate the maximal throughput
To identify the bottleneck, let us calculate the maximum throughput of tasks that each hardware component can provide, assuming usage of them can be completely parallelized. I like to do that using python:
(note that I'm using random constants, you'll have to fill in the real data for your setup in order to use it).
# ----------- General consts
input_image_size = 20 * 2 ** 20 # 20MB
output_image_size = 15 * 2 ** 20 # 15MB
# ----------- Disk
# If you have multiple disks and disk access is the bottleneck, you could split the images between them
amount_of_disks = 2
disk_read_rate = 3.5 * 2 ** 30 # 3.5GBps, maximum read rate for a good SSD
disk_write_rate = 2.5 * 2 ** 30 # 2.5GBps, maximum write rate for a good SSD
disk_read_throughput = amount_of_disks * disk_read_rate / input_image_size
disk_write_throughput = amount_of_disks * disk_write_rate / output_image_size
# ----------- RAM
ram_bandwidth = 30 * 2 ** 30 # Assuming here similar write and read rates of 30GBps
# assuming you are working in userspace and not using a userspace filesystem,
# data is first read into kernel space, then copied to userspace. So in total,
# two writes and one read.
userspace_ram_bandwidth = ram_bandwidth / 3
ram_read_throughput = userspace_ram_bandwidth / input_image_size
ram_write_throughput = userspace_ram_bandwidth / output_image_size
# ----------- CPU
# We decrease one core, as at least some scheduling code and kernel code is going to run
core_amount = 8 - 1
# The measured amount of times a single core can run the preprocess function in a second.
# Assuming that you are not planning to optimize the preprocess function as well.
preprocess_function_rate = 1000
cpu_throughput = core_amount * preprocess_function_rate
# ----------- Conclusions
min_throughput, bottleneck_name = min([(disk_read_throughput, 'Disk read'),
(disk_write_throughput, 'Disk write'),
(ram_read_throughput, 'RAM read'),
(ram_write_throughput, 'RAM write'),
(cpu_throughput, 'CPU')])
cpu_cores_needed = min_throughput / preprocess_function_rate
print(f'Throughput: {min_throughput:.1f} tasks per second\n'
f'Bottleneck: {bottleneck_name}\n'
f'Worker amount: {cpu_cores_needed:.1f}')
This code outputs:
Throughput: 341.3 tasks per second
Bottleneck: Disk write
Worker amount: 0.3
That means:
The maximum rate we can achieve is around 341.3 tasks per second
The disk is the bottleneck. You might be able to increase your performance by, for example:
Buying more disks
Using ramfs or a similar solution that avoids using the disk altogether
In a system where all the steps in task are executed in parallel, you won't need to dedicate more than one core for running preprocess. (In python that means you'll probably need only one process, and threads or asyncio would suffice to achieve concurrency with other steps)
Note: the numbers are lying
This kind of estimation is very hard to get right. It's hard not to forget things in the calculation itself, and hard to achieve good measurements for the constants. For example, there is a big issue with the current calculation - reads and writes are not orthogonal. We assume in our calculation that everything is happening in parallel, so constants like disk_read_rate have to account for writes occurring simultaneously to the reads. The RAM rates should probably be decreased by at least 50%.
3. Design a solution to achieve that throughput
Similarly to what you'd offered in your question, my initial design would be something like:
Have a pool of workers load the images and send them on a queue to the next step (we'll need to be reading using multiple cores to use all available memory bandwidth)
Have a pool of workers process the images and send the results on a queue (the amount of workers should be chosen according to the output of the script above. For the current result, the number is 1)
Have a pool of workers save the processed images to the disk.
The actual implementation details will vary according to different technical constraints and overheads you will run into while implementing the solution. Without further details and measurements it is hard to guess what they will be exactly.
4. Implement the design, and optimize until you meet your requirements
Good luck, and be warned that even if you did a good job at estimating the maximal throughput, it might be very hard to get there. Comparing the maximum rate to your speed requirements might give you a good idea of the amount of effort needed. For example, if the rate you need is 10x slower than the maximum rate, you might be done pretty quickly. But if it is only 2x slower, you might want to consider doubling your hardware and start preparing for some hard work :)

kmarok's answer is good technical one. But, I would also consider the quote "Premature optimization is the root of all evil" concept.
In short, yes, it make sense. But, do you really need to?
Optimization is a trade off. You compromise code simplicity for better performance. Code simplicity is important; you'll need to further develop, debug, and test your software in the future. This will cost you in time. Simplicity buys you time. You need to be aware of the trade-off when you optimize.
I would first write a multithreaded version and measure it using your hardware.
Then I would try the multiprocessing version, and measure it too.
Does any of the versions, is good enough? It might be. If so, you just made your software simpler, more readable and better maintainable.

Chen's and Kamaork's answers resume most of what is needed to know, but there are 2 missing ideas:
Your code will be A process and not THE process, this means that you need to account of how much resources you have left and not how many you can have (it can even happen within your process, threads are not ilimited); this deadly problem happend to me leaving me with less than half of a celeron for a gui, not good.
The biggest optimization with threads you can do is "prediction" (this refers more specifically to when stuff happens), you can chain the threads in a better way when you know how much it takes to compite and its a consisten wait, reading about the tcp window may give you a better idea of how a delay can be optimized by expecting it and not by forcing it.

Multithreading regressions in Python

I have a project in Python that requires regressing many variables against many others. I am using a Jupyter Notebook for clarity but am also willing to use another IDE if it's easier. My code looks something like:
for a in dependent_variables:
for b in independent_variables:
regress a on b
My current dataset isn't huge, so this whole thing takes maybe 30 seconds, but I will soon have a much larger dataset that will significantly increase time required. I'm curious if this is a situation suitable for parallelization. Specifically, if I have a dual-threaded eight-core processor (meaning 16 CPUs total), is it possible to run simultaneous processes where each process regresses one of the first variables against one of the second variables, allowing me to complete, say, eight of these regressions at a time (if I allocate half of the CPUs to this process)? I am not super familiar with parallelization and most other answers I've found have discussed the parallelization of a single function call, not the simultaneous execution of multiple similar functions. I appreciate the help!

Nominally, this is
import itertools
import multiprocessing as mp
def regress_me(vars):
ind_var, dep_var = vars
# your regression may be better than mine...
result = "{} {}".format(ind_var, dep_var)
return result
if __name__ == "__main__":
with mp.Pool(8) as pool:
analyse_this = list(itertools.product(independent_variables,
dependent_variables))
result = mp.map(regress_me, analyse_this)
A lot depends on what is being passed between parent and child and whether you are using a forking system like linux or a spawning system like windows. If these datasets are being fetched from disk, it may be better to do the read in the worker regress_me instead of passing it from the parent. You can read up on that with the standard python multiprocessing library.

Why is it better to use synchronous programming for in-memory operations?

I have a complex nested data structure. I iterate through it and perform some calculations on each possible uniqe pair of elements. It's all in-memory mathematical functions. I don't read from files or do networking.
It takes a few hours to run, with do_work() being called 25,000 times. I am looking for ways to speed it up.
Although Pool.map() seems useful for my lists, it's proving to be difficult because I need to pass extra arguments into the function being mapped.
I thought using the Python multitasking library would help, but when I use Pool.apply_async() to call do_work(), it actually takes longer.
I did some googling and a blogger says "Use sync for in-memory operations — async is a complete waste when you aren’t making blocking calls." Is this true? Can someone explain why? Do the RAM read & write operations interfere with each other? Why does my code take longer with async calls? do_work() writes calculation results to a database, but it doesn't modify my data structure.
Surely there is a way to utilize my processor cores instead of just linearly iterating through my lists.
My starting point, doing it synchronously:
main_list = [ [ [a,b,c,[x,y,z], ... ], ... ], ... ] # list of identical structures
helper_list = [1,2,3]
z = 2
for i_1 in range(0, len(main_list)):
for i_2 in range(0, len(main_list)):
if i_1 < i_2: # only unique combinations
for m in range(0, len(main_list[i_1])):
for h, helper in enumerate(helper_list):
do_work(
main_list[i_1][m][0], main_list[i_2][m][0], # unique combo
main_list[i_1][m][1], main_list[i_1][m][2],
main_list[i_1][m][3][z], main_list[i_2][m][3][h],
helper_list[h]
)
Variable names have been changed to make it more readable.

This is just a general answer, but too long for a comment...
First of all, I think your biggest bottleneck at this very moment is Python itself. I don't know what do_work() does, but if it's CPU intensive, you have the GIL which completely prevents effective parallelisation inside one process. No matter what you do, threads will fight for the GIL and it will eventually make your code even slower. Remember: Python has real threading, but the CPU is shared inside a single process.
I recommend checking out the page of David M Beazley: http://dabeaz.com/GIL/gilvis who did a lot of effort to visualise the GIL behaviour in Python.
On the other hand, the module multiprocessing allows you to run multiple processes and "circumvent" the GIL downsides, but it will be tricky to get access to the same memory locations without bigger penalties or trade-offs.
Second: if you utilise heavy nested loops, you should think about using numba and trying to fit your data structures inside numpy (structured) arrays. This can give you order of magnitude of speed quite easily. Python is slow as hell for such things but luckily there are ways to squeeze out a lot when using appropriate libraries.
To sum up, I think the code you are running could be orders of magnitudes faster with numba and numpy structures.
Alternatively, you can try to rewrite the code in a language like Julia (very similar syntax to Python and the community is extremely helpful) and quickly check how fast it is in order to explore the limits of the performance. It's always a good idea to get a feeling how fast something (or parts of a code) can be in a language which has not such complex performance critical aspects like Python.

Your task is more CPU bound than relying on I/O operations. Asynchronous execution make sense when you have long I/O operations i.e. sending/receiving something from network etc.
What you can do is split task to the chunks and utilize threads and multiprocessing (run on different CPU cores).

Parallel computing

I have a two dimensional table (Matrix)
I need to process each line in this matrix independently from the others.
The process of each line is time consuming.
I'd like to use parallel computing resources in our university (Canadian Grid something)
Can I have some advise on how to start ? I never used parallel computing before.
Thanks :)

Start here: http://docs.python.org/library/multiprocessing.html
Be sure to read this: http://docs.python.org/library/multiprocessing.html#examples
This may be helpful: http://www.slideshare.net/pvergain/multiprocessing-with-python-presentation.
While excellent, it includes threads and multiprocessing, even though multiprocessing is often far, far superior to attempting multi-threading.
For Grid computing, multi-threading is largely useless.
Also, you probably also want to read up on celery.

I am one of the developper of a new library called scoop.
It was built exactly for this purpose (grid or super-computing, scientific computing). I suggest you give it a try.
In your case, all you would have to do is a call like this:
futures.map(YourFunc, matrixLine)
It will then be distributed on your grid or whatever environment you choose.

Like the commentators have said, find someone to talk to in your university. The answer to your question will be specific to what software is installed on the grid. If you have access to a grid, it's highly likely you also have access to a person whose job it is to answer your questions (and they will be pleased to help) - find this person!

From what you describe, I would say: first have a look at numpy.
Numpy provides methods to compute the columns and rows in a vectorized manner with nearly C speed. Depending on your problem, this could be faster than parallel computation with pure CPython.
You can than use parallel computing with numpy-arrays to get a really big speed up.
Possible ways to do this is using multiprocessing or Ipython on a cluster.

It is recommended that you use C++/C for performing this computation. You can use the OpenMP API using the #include<omp.h> header. You can start your parallel region using the #pragma amp parallel directive. Since you are parallelising a for-loop for computing your matrix multiplication, you can use #pragma omp parallel for { } to start your for-loop inside the parallel region. OpenMP will automatically take care of the process synchronisation.
Check this out for a sample code: https://gist.github.com/metallurgix/0dfafc03215ce89fc595
Remember to use a big matrix to see actual improvements in speed. A smaller matrix will perform poorly in fact due to increased task overhead created due to forking and joining the multiple threads.
You can also check out MPI if you want to parallelise your code using multiple processors instead of multiple threads.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.