What are the options for achieving parallelism in Python? I want to perform a bunch of CPU bound calculations over some very large rasters, and would like to parallelise them. Coming from a C background, I am familiar with three approaches to parallelism:
Message passing processes, possibly distributed across a cluster, e.g. MPI.
Explicit shared memory parallelism, using pthreads or fork(), pipe(), et al.
Implicit shared memory parallelism, using OpenMP.
Deciding on an approach to use is an exercise in trade-offs.
In Python, what approaches are available and what are their characteristics? Is there a clusterable MPI clone? What are the preferred ways of achieving shared memory parallelism? I have heard reference to problems with the GIL, as well as references to tasklets.
In short, what do I need to know about the different parallelization strategies in Python before choosing between them?
Generally, you describe a CPU-bound calculation. This is not Python's forte. Neither, historically, is multiprocessing.
Threading in the mainstream Python interpreter has been ruled by a dreaded global lock. The new multiprocessing API works around that and gives a worker pool abstraction with pipes and queues and such.
You can write your performance critical code in C or Cython, and use Python for the glue.
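For instance, the glue side of that approach can be as simple as calling into an already compiled C library through the standard library's ctypes module. The sketch below is mine, not the answerer's: it uses the system math library as a stand-in for your own compiled code and assumes a typical Unix system where find_library can locate libm.

import ctypes
import ctypes.util

# Load the C math library as a stand-in for your own compiled routines.
libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

print(libm.sqrt(2.0))  # the actual computation runs in compiled C code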
The new (2.6) multiprocessing module is the way to go. It uses subprocesses, which gets around the GIL problem. It also abstracts away some of the local/remote issues, so the choice of running your code locally or spread out over a cluster can be made later. The documentation I've linked above is a fair bit to chew on, but should provide a good basis to get started.
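As a minimal sketch of that worker-pool abstraction (written for a current Python 3; process_tile and its inputs are placeholders I made up, not part of the answer):

from multiprocessing import Pool

def process_tile(tile_id):
    # CPU-bound work on one chunk of the raster would go here.
    return tile_id * tile_id

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = pool.map(process_tile, range(16))
    print(results)

Each call to process_tile runs in a separate worker process, so the GIL of the parent interpreter is not a bottleneck.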
Ray is an elegant (and fast) library for doing this.
The most basic strategy for parallelizing Python functions is to declare a function with the @ray.remote decorator. Then it can be invoked asynchronously.
import ray
import time

# Start the Ray processes (e.g., a scheduler and shared-memory object store).
ray.init(num_cpus=8)

@ray.remote
def f():
    time.sleep(1)

# This should take one second assuming you have at least 4 cores.
ray.get([f.remote() for _ in range(4)])
You can also parallelize stateful computation using actors, again by using the @ray.remote decorator.
# This assumes you already ran 'import ray' and 'ray.init()'.

import time

@ray.remote
class Counter(object):
    def __init__(self):
        self.x = 0

    def inc(self):
        self.x += 1

    def get_counter(self):
        return self.x

# Create two actors which will operate in parallel.
counter1 = Counter.remote()
counter2 = Counter.remote()

@ray.remote
def update_counters(counter1, counter2):
    for _ in range(1000):
        time.sleep(0.25)
        counter1.inc.remote()
        counter2.inc.remote()

# Start three tasks that update the counters in the background, also in parallel.
update_counters.remote(counter1, counter2)
update_counters.remote(counter1, counter2)
update_counters.remote(counter1, counter2)

# Check the counter values.
for _ in range(5):
    counter1_val = ray.get(counter1.get_counter.remote())
    counter2_val = ray.get(counter2.get_counter.remote())
    print("Counter1: {}, Counter2: {}".format(counter1_val, counter2_val))
    time.sleep(1)
It has a number of advantages over the multiprocessing module:
The same code runs on a single multi-core machine as well as a large cluster.
Data is shared efficiently between processes on the same machine using shared memory and efficient serialization.
You can parallelize Python functions (using tasks) and Python classes (using actors).
Error messages are propagated nicely.
Ray is a framework I've been helping develop.
Depending on how much data you need to process and how many CPUs/machines you intend to use, it is in some cases better to write part of it in C (or Java/C# if you want to use Jython/IronPython).
The speedup you can get from that might do more for your performance than running things in parallel on 8 CPUs.
There are many packages for this; the most appropriate, as others have said, is multiprocessing, especially its Pool class.
A similar result can be obtained with Parallel Python, which in addition is designed to work with clusters.
Anyway, I would say go with multiprocessing.
I am trying to use multithreading and/or multiprocessing to speed up my script somewhat. Essentially I have a list of 10,000 subnets read in from CSV that I want to convert into IPv4 network objects and then store in an array.
My base code is as follows and executes in roughly 300ms:
aclsConverted = []

def convertToIP(ip):
    aclsConverted.append(ipaddress.ip_network(ip))

for y in acls:
    convertToIP(y['srcSubnet'])
If I try with concurrent.futures Threads it works but is 3-4x as slow, as follows:
aclsConverted = []

def convertToIP(ip):
    aclsConverted.append(ipaddress.ip_network(ip))

with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    for y in acls:
        executor.submit(convertToIP, y['srcSubnet'])
Then if I try with concurrent.futures Process, it is 10-15x as slow and the array is empty. Code is as follows:
aclsConverted = []

def convertToIP(ip):
    aclsConverted.append(ipaddress.ip_network(ip))

with concurrent.futures.ProcessPoolExecutor(max_workers=20) as executor:
    for y in acls:
        executor.submit(convertToIP, y['srcSubnet'])
The server I am running this on has 28 physical cores.
Any suggestions as to what I might be doing wrong will be gratefully received!
If tasks are too small, then the overhead of managing multiprocessing / multithreading is often more expensive than the benefit of running tasks in parallel.
You might try the following:
Just create two processes (not threads!), one handling the first 5,000 subnets, the other handling the remaining 5,000 (a sketch follows below).
There you might see some performance improvement, but the tasks you perform are not that CPU- or IO-intensive, so I am not sure it will help.
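Here is a rough sketch of that idea, assuming acls has already been read from the CSV as in your question; the list is split into two halves and each half is converted in its own worker process:

import ipaddress
from multiprocessing import Pool

def convert_chunk(subnets):
    # Converting a whole chunk per task keeps the per-task overhead small.
    return [ipaddress.ip_network(s) for s in subnets]

if __name__ == "__main__":
    subnets = [y['srcSubnet'] for y in acls]  # acls as in your question
    half = len(subnets) // 2
    with Pool(processes=2) as pool:
        parts = pool.map(convert_chunk, [subnets[:half], subnets[half:]])
    aclsConverted = parts[0] + parts[1]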
Multithreading in Python, on the other hand, will give no performance improvement at all for tasks that have no IO and are pure Python code.
The reason is the infamous GIL (global interpreter lock). In Python you can never execute two Python bytecodes in parallel within the same process.
Multithreading in Python still makes sense for tasks that have IO (network accesses), that sleep, or that call modules implemented in C which release the GIL. numpy, for example, releases the GIL and is thus a good candidate for multithreading.
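To illustrate that last point, here is a small sketch of my own (not part of the original answer) where threads do help, because the heavy work happens inside numpy's compiled routines, which release the GIL:

from concurrent.futures import ThreadPoolExecutor
import numpy as np

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)

def multiply(_):
    # The matrix product runs in compiled BLAS code, outside the GIL.
    return a @ b

with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(multiply, range(4)))

Whether this helps in practice also depends on whether your BLAS library is already multithreaded internally.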
I have some code in Python which is basically a World object with Player objects. At one point the Players all get the state of the world and need to return an action. The calculations the players do are independent and only use the instance variables of the respective player instance.
while True:
    # do stuff, calculate state with the actions array of last iteration
    for i, player in enumerate(players):
        actions[i] = player.get_action(state)
What is the easiest way to run the inner for loop parallel? Or is this a bigger task than I am assuming?
The most straightforward way is to use multiprocessing.Pool.map (which works just like map):
import multiprocessing

def do_stuff(player):
    ...  # whatever you do here is executed in another process

if __name__ == "__main__":
    pool = multiprocessing.Pool()
    while True:
        pool.map(do_stuff, players)
Note however that this uses multiple processes. Due to the GIL, threads in CPython cannot run Python bytecode in parallel, so multithreading would not help for this CPU-bound loop.
Usually parallelization is done with threads, which can access the same data inside your program (because they run in the same process). To share data between processes one needs to use IPC (inter-process communication) mechanisms like pipes, sockets, files, etc., which cost more resources. Also, spawning processes is much slower than spawning threads.
Other solutions include:
vectorization: rewrite your algorithm as computations on vectors and matrices and use hardware-accelerated libraries to execute it (see the sketch after this list)
using another Python distribution that doesn't have a GIL
implementing your piece of parallel code in another language and calling it from Python
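As promised above, a toy illustration of the vectorization idea; numpy is just one example of a hardware-accelerated library you could use:

import numpy as np

values = np.random.rand(1_000_000)

# Python-level loop: interpreted, one element at a time.
total = 0.0
for v in values:
    total += v * v

# Vectorized equivalent: a single call into optimized C code.
total_vec = float(np.dot(values, values))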
A big issue comes when you have to share data between processes/threads. For example, in your code each task will access actions. If you have to share state, welcome to concurrent programming, a much bigger task and one of the hardest things to do right in software.
I have decided to learn how multithreading is done in Python, and I did a comparison to see what kind of performance gain I would get on a dual-core CPU. I found that my simple multithreaded code actually runs slower than the sequential equivalent, and I can't figure out why.
The test I contrived was to generate a large list of random numbers and then print the maximum:
from random import random
import threading

def ox():
    print max([random() for x in xrange(20000000)])
ox() takes about 6 seconds to complete on my Intel Core 2 Duo, while ox();ox() takes about 12 seconds.
I then tried calling ox() from two threads to see how fast that would complete.
def go():
    r = threading.Thread(target=ox)
    r.start()
    ox()
go() takes about 18 seconds to complete, with the two results printing within 1 second of each other. Why should this be slower?
I suspect ox() is being parallelized automatically, because if I look at the Windows Task Manager performance tab and call ox() in my Python console, both processors jump to about 75% utilization until it completes. Does Python automatically parallelize things like max() when it can?
Python has the GIL. Python bytecode will only be executed by a single processor at a time. Only certain C modules (which don't manage Python state) will be able to run concurrently.
The Python GIL has a huge overhead in locking the state between threads. There are fixes for this in newer versions or in development branches - which at the very least should make multi-threaded CPU bound code as fast as single threaded code.
You need to use a multi-process framework to parallelize with Python. Luckily, the multiprocessing module which ships with Python makes that fairly easy.
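For instance, here is a minimal sketch of the same experiment using multiprocessing instead of threading (written in Python 3 syntax, unlike the Python 2 code in the question):

from multiprocessing import Process
from random import random

def ox():
    print(max([random() for x in range(20000000)]))

if __name__ == "__main__":
    p = Process(target=ox)
    p.start()
    ox()    # runs in the parent while the child process runs in parallel
    p.join()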
Very few languages can auto-parallelize expressions. If that is the functionality you want, I suggest Haskell (Data Parallel Haskell)
The problem is in the random() function.
If you remove random() from your code, the effect disappears.
Both cores try to access the shared state of the random function.
The cores work sequentially and spend a lot of time on cache synchronization.
Such behavior is known as false sharing.
Read this article: False Sharing
As Yann correctly pointed out, the Python GIL prevents parallelization from happening in this example. You can either use the python multiprocessing module to fix that or if you are willing to use other open source libraries, Ray is also a great option to get around the GIL problem and is easier to use and has more features than the Python multiprocessing library.
This is how you can parallelize your code example with Ray:
from random import random
import ray

ray.init()

@ray.remote
def ox():
    print(max([random() for x in range(20000000)]))

%time x = ox.remote(); y = ox.remote(); ray.get([x, y])
On my machine, the single threaded ox() code you posted takes 1.84s and the two invocations with ray take 1.87s combined, so we get almost perfect parallelization here.
Ray also makes it very efficient to share data between tasks; on a single machine it uses shared memory under the hood, see https://ray-project.github.io/2017/10/15/fast-python-serialization-with-ray-and-arrow.html.
You can also run the same program across different machines on your cluster or the cloud without having to modify the program, see the documentation (https://ray.readthedocs.io/en/latest/using-ray-on-a-cluster.html and https://ray.readthedocs.io/en/latest/autoscaling.html).
Disclaimer: I'm one of the Ray developers.
I am attempting to solve multiple linear systems using Python and SciPy with threads. I am an absolute beginner when it comes to Python threads. I have attached code which distils what I'm trying to accomplish. This code works, but the execution time actually increases as one increases totalThreads. My guess is that spsolve is being treated as a critical section and is not actually being run concurrently.
My questions are as follows:
Is spsolve thread-safe?
If spsolve is blocking, is there a way around it?
Is there another linear solver package that I can be using which parallelizes better?
Is there a better way to write this code segment that will increase performance?
I have been searching the web for answers but with no luck. Perhaps, I am just using the wrong keywords. Thanks for everyone's help.
def Worker(threadnum, totalThreads):
    for i in range(threadnum, N, totalThreads):
        x[:,i] = sparse.linalg.spsolve(A, b[:,i])

threads = []
for threadnum in range(totalThreads):
    t = threading.Thread(target=Worker, args=(threadnum, totalThreads))
    threads.append(t)
    t.start()

for threadnum in range(totalThreads):
    threads[threadnum].join()
The first thing you should understand is that, counterintuitively, Python's threading module will not let you take advantage of multiple cores. This is due to something called the Global Interpreter Lock (GIL), which is a critical part of the standard CPython implementation. See here for more info: What is a global interpreter lock (GIL)?
You should consider using the multiprocessing module instead, which gets around the GIL by spinning up multiple independent Python processes. This is a bit more difficult to work with, because different processes have different memory spaces, so you can't just share a thread-safe object between all processes and expect the object to stay synchronized between all processes. Here is a great introduction to multiprocessing: http://www.doughellmann.com/PyMOTW/multiprocessing/
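As a sketch of what that could look like for the loop in your question (the toy A and b below stand in for your real data; the worker initializer just avoids re-pickling the matrix for every column):

import numpy as np
from multiprocessing import Pool
from scipy import sparse
from scipy.sparse.linalg import spsolve

def init_worker(A_shared):
    # Give every worker process its own reference to the matrix once.
    global A
    A = A_shared

def solve_column(b_col):
    return spsolve(A, b_col)

if __name__ == "__main__":
    N = 1000                              # toy problem size
    A = sparse.identity(N, format="csc")  # stands in for your real matrix
    b = np.random.rand(N, 16)             # 16 right-hand sides

    with Pool(initializer=init_worker, initargs=(A,)) as pool:
        columns = pool.map(solve_column, [b[:, i] for i in range(b.shape[1])])
    x = np.column_stack(columns)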
What are the differences between the threading module, Thread, and multiprocessing? (Maybe I have badly understood the conceptual difference between multithreading (shared memory and global variables?) and multiprocessing (truly independent processes?).)
Could you please illustrate this with the simple example below (the calculation itself doesn't matter)?
I have a loop that performs independent calculations that I wish to accelerate through parallelism:
def myfunct(d):
    facto = 1
    for x in range(d):
        facto *= x
    return facto

cases = [1, 2, 3, 4]  # and so on

for d in cases:  # loop to parallelize
    print myfunct(d)  # or store in a common list once calculated
Thanks for your incoming pedagogical answers.
No need for answers beyond the documentation:
"""multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. It runs on both Unix and Windows."""
threading is about threads
multiprocessing is about processes
What else is needed?
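That said, here is the loop from the question parallelized with multiprocessing.Pool, as a sketch in Python 3 syntax:

from multiprocessing import Pool

def myfunct(d):
    facto = 1
    for x in range(d):
        facto *= x
    return facto

if __name__ == "__main__":
    cases = [1, 2, 3, 4]  # and so on
    with Pool() as pool:
        # map runs the independent calls in worker processes and
        # returns the results in the original order.
        for result in pool.map(myfunct, cases):
            print(result)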