Easiest way to parallelise a call to map?

Easiest way to parallelise a call to map? - python

Hey I have some code in Python which is basically a World Object with Player objects. At one point the Players all get the state of the world and need to return an action. The calculations the players do are independent and only use the instance variables of the respective player instance.
while True:
#do stuff, calculate state with the actions array of last iteration
for i, player in enumerate(players):
actions[i] = player.get_action(state)
What is the easiest way to run the inner for loop parallel? Or is this a bigger task than I am assuming?

The most straightforward way is to use multiprocessing.Pool.map (which works just like map):
import multiprocessing
pool = multiprocessing.Pool()
def do_stuff(player):
... # whatever you do here is executed in another process
while True:
pool.map(do_stuff, players)
Note however that this uses multiple processes. There is no way of doing multithreading in Python due to the GIL.
Usually parallelization is done with threads, which can access the same data inside your program (because they run in the same process). To share data between processes one needs to use IPC (inter-process communication) mechanisms like pipes, sockets, files etc. Which costs more resources. Also, spawning processes is much slower than spawning threads.
Other solutions include:
vectorization: rewrite your algorithm as computations on vectors and matrices and use hardware accelerated libraries to execute it
using another Python distribution that doesn't have a GIL
implementing your piece of parallel code in another language and calling it from Python
A big issue comes when your have to share data between the processes/threads. For example in your code, each task will access actions. If you have to share state, welcome to concurrent programming, a much bigger task, and one of the hardest thing to do right in software.

Related

Why do we blame GIL if CPU can execute one process (light weight) at a time? [duplicate]

I'm slightly confused about whether multithreading works in Python or not.
I know there has been a lot of questions about this and I've read many of them, but I'm still confused. I know from my own experience and have seen others post their own answers and examples here on StackOverflow that multithreading is indeed possible in Python. So why is it that everyone keep saying that Python is locked by the GIL and that only one thread can run at a time? It clearly does work. Or is there some distinction I'm not getting here?
Many posters/respondents also keep mentioning that threading is limited because it does not make use of multiple cores. But I would say they are still useful because they do work simultaneously and thus get the combined workload done faster. I mean why would there even be a Python thread module otherwise?
Update:
Thanks for all the answers so far. The way I understand it is that multithreading will only run in parallel for some IO tasks, but can only run one at a time for CPU-bound multiple core tasks.
I'm not entirely sure what this means for me in practical terms, so I'll just give an example of the kind of task I'd like to multithread. For instance, let's say I want to loop through a very long list of strings and I want to do some basic string operations on each list item. If I split up the list, send each sublist to be processed by my loop/string code in a new thread, and send the results back in a queue, will these workloads run roughly at the same time? Most importantly will this theoretically speed up the time it takes to run the script?
Another example might be if I can render and save four different pictures using PIL in four different threads, and have this be faster than processing the pictures one by one after each other? I guess this speed-component is what I'm really wondering about rather than what the correct terminology is.
I also know about the multiprocessing module but my main interest right now is for small-to-medium task loads (10-30 secs) and so I think multithreading will be more appropriate because subprocesses can be slow to initiate.

The GIL does not prevent threading. All the GIL does is make sure only one thread is executing Python code at a time; control still switches between threads.
What the GIL prevents then, is making use of more than one CPU core or separate CPUs to run threads in parallel.
This only applies to Python code. C extensions can and do release the GIL to allow multiple threads of C code and one Python thread to run across multiple cores. This extends to I/O controlled by the kernel, such as select() calls for socket reads and writes, making Python handle network events reasonably efficiently in a multi-threaded multi-core setup.
What many server deployments then do, is run more than one Python process, to let the OS handle the scheduling between processes to utilize your CPU cores to the max. You can also use the multiprocessing library to handle parallel processing across multiple processes from one codebase and parent process, if that suits your use cases.
Note that the GIL is only applicable to the CPython implementation; Jython and IronPython use a different threading implementation (the native Java VM and .NET common runtime threads respectively).
To address your update directly: Any task that tries to get a speed boost from parallel execution, using pure Python code, will not see a speed-up as threaded Python code is locked to one thread executing at a time. If you mix in C extensions and I/O, however (such as PIL or numpy operations) and any C code can run in parallel with one active Python thread.
Python threading is great for creating a responsive GUI, or for handling multiple short web requests where I/O is the bottleneck more than the Python code. It is not suitable for parallelizing computationally intensive Python code, stick to the multiprocessing module for such tasks or delegate to a dedicated external library.

Yes. :)
You have the low level thread module and the higher level threading module. But it you simply want to use multicore machines, the multiprocessing module is the way to go.
Quote from the docs:
In CPython, due to the Global Interpreter Lock, only one thread can
execute Python code at once (even though certain performance-oriented
libraries might overcome this limitation). If you want your
application to make better use of the computational resources of
multi-core machines, you are advised to use multiprocessing. However,
threading is still an appropriate model if you want to run multiple
I/O-bound tasks simultaneously.

Threading is Allowed in Python, the only problem is that the GIL will make sure that just one thread is executed at a time (no parallelism).
So basically if you want to multi-thread the code to speed up calculation it won't speed it up as just one thread is executed at a time, but if you use it to interact with a database for example it will.

I feel for the poster because the answer is invariably "it depends what you want to do". However parallel speed up in python has always been terrible in my experience even for multiprocessing.
For example check this tutorial out (second to top result in google): https://www.machinelearningplus.com/python/parallel-processing-python/
I put timings around this code and increased the number of processes (2,4,8,16) for the pool map function and got the following bad timings:
serial 70.8921644706279
parallel 93.49704207479954 tasks 2
parallel 56.02441442012787 tasks 4
parallel 51.026168536394835 tasks 8
parallel 39.18044807203114 tasks 16
code:
# increase array size at the start
# my compute node has 40 CPUs so I've got plenty to spare here
arr = np.random.randint(0, 10, size=[2000000, 600])
.... more code ....
tasks = [2,4,8,16]
for task in tasks:
tic = time.perf_counter()
pool = mp.Pool(task)
results = pool.map(howmany_within_range_rowonly, [row for row in data])
pool.close()
toc = time.perf_counter()
time1 = toc - tic
print(f"parallel {time1} tasks {task}")

Can standard C Python has more than one thread running at the same time? [duplicate]

Yes. :)
You have the low level thread module and the higher level threading module. But it you simply want to use multicore machines, the multiprocessing module is the way to go.
Quote from the docs:
In CPython, due to the Global Interpreter Lock, only one thread can
execute Python code at once (even though certain performance-oriented
libraries might overcome this limitation). If you want your
application to make better use of the computational resources of
multi-core machines, you are advised to use multiprocessing. However,
threading is still an appropriate model if you want to run multiple
I/O-bound tasks simultaneously.

Threading is Allowed in Python, the only problem is that the GIL will make sure that just one thread is executed at a time (no parallelism).
So basically if you want to multi-thread the code to speed up calculation it won't speed it up as just one thread is executed at a time, but if you use it to interact with a database for example it will.

I feel for the poster because the answer is invariably "it depends what you want to do". However parallel speed up in python has always been terrible in my experience even for multiprocessing.
For example check this tutorial out (second to top result in google): https://www.machinelearningplus.com/python/parallel-processing-python/
I put timings around this code and increased the number of processes (2,4,8,16) for the pool map function and got the following bad timings:
serial 70.8921644706279
parallel 93.49704207479954 tasks 2
parallel 56.02441442012787 tasks 4
parallel 51.026168536394835 tasks 8
parallel 39.18044807203114 tasks 16
code:
# increase array size at the start
# my compute node has 40 CPUs so I've got plenty to spare here
arr = np.random.randint(0, 10, size=[2000000, 600])
.... more code ....
tasks = [2,4,8,16]
for task in tasks:
tic = time.perf_counter()
pool = mp.Pool(task)
results = pool.map(howmany_within_range_rowonly, [row for row in data])
pool.close()
toc = time.perf_counter()
time1 = toc - tic
print(f"parallel {time1} tasks {task}")

parallel python, or MPI?

I have a code with heavy symbolic calculations (many multiple symbolic integrals). Also I have access to both an 8-core cpu computer (with 18 GB RAM) and a small 32 cpu cluster. I prefer to remain on my professor's 8-core pc rather than to go to another professor's lab using his cluster in a more limited time, however, I'm not sure it will work on the SMP system, so I am looking for a parallel tool in Python that can be used on both SMP and Clusters and of course prefer the codes on one system to be easily and with least effort modifiable for use on the other system.
So far, I have found Parallel Python (PP) promising for my need, but I have recently told that MPI also does the same (pyMPI or MPI4py). I couldn't approve this as seemingly very little is discussed about this on the web, only here it is stated that MPI (both pyMPI or MPI4py) is usable for clusters only, if I am right about that "only"!
Is "Parallel Python" my only choice, or I can also happily use MPI based solutions? Which one is more promising for my needs?
PS. It seems none of them have very comprehensive documentations so if you know some links to other than their official websites that can help a newbie in parallel computation I will be so grateful if you would also mention them in your answer :)
Edit.
My code has two loops one inside the other, the outer loop cannot be parallelized as it is an iteration method (a recursive solution) each step depending on the values calculated within its previous step. The outer loop contains the inner loop alongside 3 extra equations whose calculations depend on the whole results of the inner loop. However, the inner loop (which contains 9 out of 12 equations computable at each step) can be safely parallelized, all 3*3 equations are independent w.r.t each other, only depending on the previous step. All my equations are so computationally heavy as each contains many multiple symbolic integrals. Seemingly I can parallelize both the inner loop's 9 equations and the integration calculations in each of these 9 equation separately, and also parallelize all the integrations in other 3 equations alongside the inner loop. You can find my code here if it can help you better understand my need, it is written inside SageMath.

I would look in to multiprocessing (doc) which provides a bunch of nice tools for spawning and working with sub-processes.
To quote the documentation:
multiprocessing is a package that supports spawning processes using an
API similar to the threading module. The multiprocessing package
offers both local and remote concurrency, effectively side-stepping
the Global Interpreter Lock by using subprocesses instead of threads.
From the comments I think the Pool and it's map would serve your purposes (doc).
def work_done_in_inner_loop(arg):
# put your work code here
pass
p = Pool(9)
for o in outer_loop:
# what ever else you do
list_of_args = [...] # what your inner loop currently loops over
res = p.map(work_done_in_inner_loop,list_of_args])
# rest of code

It seems like there are a few reasonable ways to design this.
Let me refer to your jobs as the main job, the 9 intermediate jobs, and the many inner jobs the intermediate jobs can spin off. I'm assuming the intermediate jobs have a "merge" step after the inner jobs all finish, and the same for the outer job.
The simplest design is that the main job fires off the intermediate jobs and then waits for them all to finish before doings its merge step. Then intermediate jobs then fire off the inner jobs and wait for them all to finish before doing their merge steps.
This can work with a single shared queue, but you need a queue that doesn't block the worker pool while waiting, and I don't think multiprocessing's Pool and Queue can do that out of the box. As soon as you've got all of your processes waiting to join their children, nothing gets done.
One way around that is to change to a continuation-passing style. If you know which one of the intermediate jobs will finish last, you can pass it the handles to the other intermediate jobs and have it join on them and do the merge, instead of the outer job. And the intermediate similarly pass off the merge to their last inner job.
The problem is that you usually have no way of knowing what's going to finish last, even without scheduling issues. So that means you need some form of either sharing (e.g., a semaphore) or message passing between the jobs to negotiate that among themselves. You can do that on top of multiprocessing. The only problem is that it destroys the independence of your jobs, and you're suddenly dealing with all the annoying problems of shared concurrency.
A different alternative is to have separate pools and queues for each intermediate job, and some kind of load balancing between the pools that can ensure that each core is running one active process.
Or, of course, a single pool with a more complicated implementation than multiprocessing's, which does either load balancing or cooperative scheduling, so a joiner doesn't block a core.
Or a super-simple solution: Overschedule, and pay a little cost in context switching for simplicity. For example, you can run 32 workers even though you've only got 8 cores, so you've got 22 active workers and 10 waiting. Each core has 2 or 3 active workers, which will slow things down a bit, but maybe not too badly—and at least nobody's idle, and you didn't have to write any code beyond passing a different parameter to the multiprocessing.Pool constructor.
At any rate, multiprocessing is very simple, and it has almost no extra concepts that won't apply to other solutions. So it may take less time to play with it until you run into a brick wall or don't, than to try to figure out in advance whether it'll work for you.

I recently ran into a similar problem. However, the following solution is only valid if (1) you wish to run the python script individually on a group of files, AND (2) each invocation of the script is independent of the others.
If the above applies to you, the simplest solution is to write a wrapper in bash along the lines of:
for a_file in $list_of_files
do
python python_script.py a_file &
done
The '&' will run the preceding command as a sub-process. The advantage is that bash will not wait for the python script to finish before continuing with the for loop.
You may want to place a cap on the number of processes running simultaneously, since this code will use all available resources.

using multiple threads in Python

I'm trying to solve a problem, where I have many (on the order of ten thousand) URLs, and need to download the content from all of them. I've been doing this in a "for link in links:" loop up till now, but the amount of time it's taking is now too long. I think it's time to implement a multithreaded or multiprocessing approach. My question is, what is the best approach to take?
I know about the Global Interpreter Lock, but since my problem is network-bound, not CPU-bound, I don't think that will be an issue. I need to pass data back from each thread/process to the main thread/process. I don't need help implementing whatever approach (Terminate multiple threads when any thread completes a task covers that), I need advice on which approach to take. My current approach:
data_list = get_data(...)
output = []
for datum in data:
output.append(get_URL_data(datum))
return output
There's no other shared state.
I think the best approach would be to have a queue with all the data in it, and have several worker threads pop from the input queue, get the URL data, then push onto an output queue.
Am I right? Is there anything I'm missing? This is my first time implementing multithreaded code in any language, and I know it's generally a Hard Problem.

For your specific task I would recommend a multiprocessing worker pool. You simply define a pool and tell it how many processes you want to use (one per processor core by default) as well as a function you want to run on each unit of work. Then you ready every unit of work (in your case this would be a list of URLs) in a list and give it to the worker pool.
Your output will be a list of the return values of your worker function for every item of work in your original array. All the cool multi-processing goodness will happen in the background. There is of course other ways of working with the worker pool as well, but this is my favourite one.
Happy multi-processing!

The best approach I can think of in your use case will be to use a thread pool and maintain a work queue. The threads in the thread pool get work from the work queue, do the work and then go get some more work. This way you can finely control the number of threads working on your URLs.
So, create a WorkQueue, which in your case is basically a list containing the URLs that need to be downloaded.
Create a thread pool, which create the number of threads you specify, fetches work from the WorkQueue and assigns it to a thread. Each time a thread finishes and returns you check if the work queues has more work and accordingly assign work to that thread again. You may also want to put a hook so that every time work is added to the work queue, your threads assigns it to a free thread if available.

The fastest and most efficient method of doing IO bound tasks like this is an asynchronous event loop. The libcurl can do this, and there is a Python wrapper for that called pycurl. Using it's "multi" interface you can do high-performance client activities. I have done over 1000 simultaneous fetchs as fast as one.
However, the API is quite low-level and difficult to use. There is a simplifying wrapper here, which you can use as an example.

Parallelism in Python

What are the options for achieving parallelism in Python? I want to perform a bunch of CPU bound calculations over some very large rasters, and would like to parallelise them. Coming from a C background, I am familiar with three approaches to parallelism:
Message passing processes, possibly distributed across a cluster, e.g. MPI.
Explicit shared memory parallelism, either using pthreads or fork(), pipe(), et. al
Implicit shared memory parallelism, using OpenMP.
Deciding on an approach to use is an exercise in trade-offs.
In Python, what approaches are available and what are their characteristics? Is there a clusterable MPI clone? What are the preferred ways of achieving shared memory parallelism? I have heard reference to problems with the GIL, as well as references to tasklets.
In short, what do I need to know about the different parallelization strategies in Python before choosing between them?

Generally, you describe a CPU bound calculation. This is not Python's forte. Neither, historically, is multiprocessing.
Threading in the mainstream Python interpreter has been ruled by a dreaded global lock. The new multiprocessing API works around that and gives a worker pool abstraction with pipes and queues and such.
You can write your performance critical code in C or Cython, and use Python for the glue.

The new (2.6) multiprocessing module is the way to go. It uses subprocesses, which gets around the GIL problem. It also abstracts away some of the local/remote issues, so the choice of running your code locally or spread out over a cluster can be made later. The documentation I've linked above is a fair bit to chew on, but should provide a good basis to get started.

Ray is an elegant (and fast) library for doing this.
The most basic strategy for parallelizing Python functions is to declare a function with the #ray.remote decorator. Then it can be invoked asynchronously.
import ray
import time
# Start the Ray processes (e.g., a scheduler and shared-memory object store).
ray.init(num_cpus=8)
#ray.remote
def f():
time.sleep(1)
# This should take one second assuming you have at least 4 cores.
ray.get([f.remote() for _ in range(4)])
You can also parallelize stateful computation using actors, again by using the #ray.remote decorator.
# This assumes you already ran 'import ray' and 'ray.init()'.
import time
#ray.remote
class Counter(object):
def __init__(self):
self.x = 0
def inc(self):
self.x += 1
def get_counter(self):
return self.x
# Create two actors which will operate in parallel.
counter1 = Counter.remote()
counter2 = Counter.remote()
#ray.remote
def update_counters(counter1, counter2):
for _ in range(1000):
time.sleep(0.25)
counter1.inc.remote()
counter2.inc.remote()
# Start three tasks that update the counters in the background also in parallel.
update_counters.remote(counter1, counter2)
update_counters.remote(counter1, counter2)
update_counters.remote(counter1, counter2)
# Check the counter values.
for _ in range(5):
counter1_val = ray.get(counter1.get_counter.remote())
counter2_val = ray.get(counter2.get_counter.remote())
print("Counter1: {}, Counter2: {}".format(counter1_val, counter2_val))
time.sleep(1)
It has a number of advantages over the multiprocessing module:
The same code runs on a single multi-core machine as well as a large cluster.
Data is shared efficiently between processes on the same machine using shared memory and efficient serialization.
You can parallelize Python functions (using tasks) and Python classes (using actors).
Error messages are propagated nicely.
Ray is a framework I've been helping develop.

Depending on how much data you need to process and how many CPUs/machines you intend to use, it is in some cases better to write a part of it in C (or Java/C# if you want to use jython/IronPython)
The speedup you can get from that might do more for your performance than running things in parallel on 8 CPUs.

There are many packages to do that, the most appropriate as other said is multiprocessing, expecially with the class "Pool".
A similar result can be obtained by parallel python, that in addition is designed to work with clusters.
Anyway, I would say go with multiprocessing.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.