I am trying to split the execution of a python program to two different machines. I am wondering if there's a way to call the python interpreter on one machine from another. Not running a script on another machine, but rather split the task of execution to two machines.
Over the course of the next couple of months, I will be teaching myself distributed programming, and I thought this would be a good way to start.
I think the first step is to use one machine to call another machine and send it a piece of the program. Then the next step would be that both machines execute the same program together and communicate to avoid problems. The third step would be three machines, etc.
Advice, tips, and thoughts are all welcome...
Disclaimer: I am a developer of SCOOP.
Data-based technologies you may want to get acquainted with for distributed processing are the MPI standard (for multi-computers, using mpi4py [preferred] or pympi) and the standard multiprocessing module, which allows remote computation (but is awkward, from my point of view).
You should begin with task-based frameworks, though. They provide simple and user-friendly usage. Both of these points were an utmost focus while creating SCOOP. You can try it with pip install -U scoop. On Windows, you may wish to install PyZMQ first using their executable installers. You can check the provided examples and play with the various parameters to easily understand what causes performance degradation or improvement. I encourage you to compare it to alternatives such as Celery for similar work.
Both of these frameworks allow remote launching of Python programs. More importantly, they do the parallel processing for you; you only need to feed them your tasks.
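For a rough idea of what task-based code looks like, here is a minimal SCOOP sketch of my own (the square function is just a placeholder workload); it is launched with python -m scoop script.py and distributes the map over whatever workers are available:

from scoop import futures

def square(x):
    return x * x

if __name__ == "__main__":
    # futures.map dispatches the calls to the available SCOOP workers,
    # locally or on remote hosts, and gathers the results in order.
    results = list(futures.map(square, range(100)))
    print(sum(results))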
You may want to check Fabric for an easy way to setup your remote environments or even control or launch scripts remotely.
Check out Ray, which is a library for writing parallel and distributed Python.
Ray uses the same syntax to parallelize code on a single multicore machine and in the distributed setting.
If you add the @ray.remote decorator to a function, it can be executed asynchronously in parallel (on any machine in the cluster). Remote function invocations return futures, whose values can be retrieved with ray.get.
The same thing can be done with Python classes (instead of functions), see the documentation for actors.
import ray
import time

ray.init()

@ray.remote
def function(x):
    time.sleep(1)
    return x

args = [1, 2, 3, 4]

# Submit 4 tasks in parallel.
result_ids = [function.remote(x) for x in args]

# Retrieve the results. Assuming at least 4 cores,
# this will take 1 second.
results = ray.get(result_ids)
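For completeness, here is a minimal actor sketch along the same lines (just an illustration; it assumes the same ray.init() call as above):

@ray.remote
class Counter(object):
    def __init__(self):
        self.value = 0
    def increment(self):
        self.value += 1
        return self.value

counter = Counter.remote()
# Method calls on the actor also return futures.
ids = [counter.increment.remote() for _ in range(4)]
print(ray.get(ids))  # [1, 2, 3, 4]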
See the Ray documentation for more. Note, I'm one of the Ray developers.
There are MPI implementations for Python [1] [2].
MPI (Message Passing Interface) is a standardized interface, and it is nice because you also find it in C, Java, Fortran, etc.
It enables you to communicate between your processes that run remotely. You use these messages for synchronization and for passing information.
You also have collective operations, like broadcast, gather, and reduce.
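As an illustration, a minimal mpi4py sketch of broadcast and reduce (started with something like mpirun -n 4 python script.py):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Rank 0 broadcasts a parameter to every process.
params = comm.bcast({"n": 1000} if rank == 0 else None, root=0)

# Each process computes a partial result...
partial = sum(i for i in range(rank, params["n"], comm.Get_size()))

# ...and reduce collects the total on rank 0.
total = comm.reduce(partial, op=MPI.SUM, root=0)
if rank == 0:
    print(total)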
Have a look at RPyC, you might find it useful.
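For a taste of it, a small sketch using RPyC's classic mode (assuming the bundled classic server, rpyc_classic.py, is running on the remote host; the hostname is a placeholder):

import rpyc

# Connect to a machine running RPyC's classic server.
conn = rpyc.classic.connect("remote-host")

# Remote modules behave almost like local ones.
print(conn.modules.os.getpid())   # pid of the remote Python process

# Evaluate an expression on the remote interpreter.
print(conn.eval("2 ** 20"))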
Related
So I have a quite time-consuming Python program and I was wondering, since my CPU is multi-core, whether I can run the program on multiple threads at once. I always check Task Manager and Python uses only one thread, but pushes it to the max.
I tried searching, but I only found ways to run a function with different datasets on different threads, so I didn't try anything yet. I hope you can help!
Multi-threading won't help you.
But Python's "multiprocessing" might. However, parallelization is not automatic, and you have to adapt your program, knowing what you are doing, in order to get any gains from it.
Python's multi-threading is capped so that only a single thread runs actual Python code at once. You get some gains if parts of your workload are spent on I/O, but not for a CPU-intensive task.
Multiprocessing is a module in Python's standard library which provides the same interface as "threading" and will actually run your code in parallel, in multiple processes, each with its own Python runtime. Its major drawback is that any data exchanged between processes has to be serialized and deserialized, which adds some overhead.
In either case, you have to write your program so that certain functions (which can be entry points for big workloads) run in new threads or sub-processes. Since you posted no example code, there is no example we could create for you showing how your code could look, but look for tutorials on "python multiprocessing" - those should help you out.
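As a minimal sketch of that pattern (the crunch function is just a placeholder workload, not your code):

from multiprocessing import Process, Queue

def crunch(n, out):
    # Stand-in for a CPU-intensive entry point.
    out.put(sum(i * i for i in range(n)))

if __name__ == "__main__":
    out = Queue()
    workers = [Process(target=crunch, args=(5_000_000, out)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print([out.get() for _ in workers])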
I have decided to learn how multi-threading is done in Python, and I did a comparison to see what kind of performance gain I would get on a dual-core CPU. I found that my simple multi-threaded code actually runs slower than the sequential equivalent, and I can't figure out why.
The test I contrived was to generate a large list of random numbers and then print the maximum:
from random import random
import threading

def ox():
    print max([random() for x in xrange(20000000)])
ox() takes about 6 seconds to complete on my Intel Core 2 Duo, while ox();ox() takes about 12 seconds.
I then tried calling ox() from two threads to see how fast that would complete.
def go():
    r = threading.Thread(target=ox)
    r.start()
    ox()
go() takes about 18 seconds to complete, with the two results printing within 1 second of each other. Why should this be slower?
I suspect ox() is being parallelized automatically, because if I look at the Windows Task Manager performance tab and call ox() in my Python console, both processors jump to about 75% utilization until it completes. Does Python automatically parallelize things like max() when it can?
Python has the GIL. Python bytecode will only be executed by a single processor at a time. Only certain C modules (which don't manage Python state) will be able to run concurrently.
The Python GIL has a huge overhead in locking the state between threads. There are fixes for this in newer versions and in development branches, which should at the very least make multi-threaded CPU-bound code as fast as single-threaded code.
You need to use a multi-process framework to parallelize with Python. Luckily, the multiprocessing module which ships with Python makes that fairly easy.
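For the code in the question, a sketch with multiprocessing.Pool (Python 3 syntax) could look roughly like this:

from multiprocessing import Pool
from random import random

def ox(_):
    return max(random() for x in range(20000000))

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        # Both workers run ox() at the same time, each on its own core.
        print(pool.map(ox, range(2)))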
Very few languages can auto-parallelize expressions. If that is the functionality you want, I suggest Haskell (Data Parallel Haskell)
The problem is in the random() function.
If you remove random() from your code, the threads can run in parallel.
Both cores try to access the shared state of the random function.
The cores work sequentially and spend a lot of time on cache synchronization.
Such behavior is known as false sharing.
Read this article on False Sharing.
As Yann correctly pointed out, the Python GIL prevents parallelization from happening in this example. You can either use the Python multiprocessing module to fix that, or, if you are willing to use other open source libraries, Ray is also a great option to get around the GIL problem; it is easier to use and has more features than the Python multiprocessing library.
This is how you can parallelize your code example with Ray:
from random import random
import ray

ray.init()

@ray.remote
def ox():
    print(max([random() for x in range(20000000)]))

%time x = ox.remote(); y = ox.remote(); ray.get([x, y])
On my machine, the single threaded ox() code you posted takes 1.84s and the two invocations with ray take 1.87s combined, so we get almost perfect parallelization here.
Ray also makes it very efficient to share data between tasks, on a single machine it will use shared memory under the hood, see https://ray-project.github.io/2017/10/15/fast-python-serialization-with-ray-and-arrow.html.
You can also run the same program across different machines on your cluster or the cloud without having to modify the program, see the documentation (https://ray.readthedocs.io/en/latest/using-ray-on-a-cluster.html and https://ray.readthedocs.io/en/latest/autoscaling.html).
Disclaimer: I'm one of the Ray developers.
I am currently working on some simulation code written in C, which runs on different remote machines. While the C part is finished, I want to simplify my work by extending it with a Python simulation API and some kind of job-queue system, which should do the following:
1. specify a set of parameters for which simulations should be performed and put them into a queue on a host computer
2. perform the simulations on remote machines via workers
3. return the results to the host computer
I had a look at different frameworks for accomplishing this task and my first choice goes down to IPython.parallel. I had a look at the documentation and from what I tested out it seems pretty easy to use. My approach would be to use a load balanced view like explained at
http://ipython.org/ipython-doc/dev/parallel/parallel_task.html#creating-a-loadbalancedview-instance
But what I don't see is:
what happens, e.g., if the ipcontroller crashes? Is my job queue gone?
what happens if a remote machine crashes? Is there some kind of error handling?
Since I run relatively long simulations (1-2 weeks) I don't want my simulations to fail if some part of the system crashes. So is there maybe some way to handle this in IPython.parallel?
My Second approach would be to use pyzmq and implement the jobsystem from scratch.
In this case what would be the best zmq-pattern for this situation?
And last but not least, is there maybe a better framework for this scenario?
What lies behind the curtain is a bit more complex view of how to arrange the work-package flow alongside the ( parallelised ) number-crunching pipeline(s).
Whether the work-package amounts to many CPU-core-weeks, or the lump-sum volume of the job is above a few hundreds of thousands of CPU-core-hours, the principles are similar and follow common sense.
Key Features
scalability of the computing performance of all resources involved ( ideally a linear one )
ease of task submission role
fault-resilience of submitted task(s) ( ideally with an automated self-healing )
feasible TCO cost of access to / use of a sufficient pool of resources ( upfront co$ts, recurring co$ts, adaptation$ co$ts, co$ts of $peed )
Approaches to Solution
home-brew architecture for a distributed massively parallel scheduler based self-healing computation engine
re-use of available grid-based computing resources
Based on my own experience of solving a need for repetitive runs of a numerically intensive optimisation problem over a vast parameterSetVectorSPACE ( which could not be de-composed into any trivialised GPU parallelisation scheme ), selection of the second approach has been validated to be more productive, rather than an attempt to burn dozens of man*years in just-another-trial to re-invent a wheel.
Being in an academic environment, one may get way easier acceptable access to resources-pool(s) for processing the work-packages, while commercial entities may acquire the same, based on their acceptable budgeting thresholds.
My gut instinct is to suggest rolling your own solution for this, because like you said otherwise you're depending on IPython not crashing.
I would run a simple Python service on each node which listens for run commands. When it receives one, it launches your C program. However, I suggest you ensure the C program is a true Unix daemon, so that when it runs it completely disconnects itself from Python. That way, if your node's Python instance crashes, you can still get data if the C program executes successfully. Have the C program write the output data to a file or database, and when the task is finished write "finished" to a "status" file or something similar. The Python service should monitor that file, and when finished is indicated it should retrieve the data and send it back to the server.
The central idea of this design is to have as few points of failure as possible. As long as the C program doesn't crash, you can still get the data one way or another. As far as handling system crashes, network disconnects, etc., that's up to you.
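A very rough sketch of such a node service (the port, file paths and simulation binary name are all placeholders I made up, not part of your setup):

import pathlib
import socketserver
import subprocess
import time

STATUS = pathlib.Path("/tmp/sim_status")       # the C daemon writes "finished" here
RESULT = pathlib.Path("/tmp/sim_result.dat")   # the C daemon writes its output here

class RunHandler(socketserver.StreamRequestHandler):
    def handle(self):
        params = self.rfile.readline().decode().strip()
        # Launch the C simulation; as a true daemon it detaches itself from us.
        subprocess.Popen(["/usr/local/bin/simulation", params])
        # Poll the status file until the run reports completion.
        while not (STATUS.exists() and STATUS.read_text().strip() == "finished"):
            time.sleep(10)
        # Ship the result data back to the host that submitted the job.
        self.wfile.write(RESULT.read_bytes())

if __name__ == "__main__":
    socketserver.TCPServer(("0.0.0.0", 5000), RunHandler).serve_forever()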
I've run into a minor HPC problem after running some tests on an 80-core (160 HT) Nehalem architecture with 2 TB of DRAM:
A server with more than 2 sockets starts to stall a lot (delay), as each thread starts to request information about objects on the "wrong" socket, i.e. requests go from a thread that is working on some objects on one socket to pull information that is actually in the DRAM of the other socket.
The cores appear 100% utilized, even though I know that they are waiting for the remote socket to return the request.
As most of the code runs asynchronously, it is a lot easier to rewrite the code so I can just pass messages from the threads on one socket to threads on the other (no locked waiting).
In addition, I want to lock each thread to a memory pool, so I can update objects instead of wasting time (~30%) on the garbage collector.
Hence the question:
How to pin threads to cores with predetermined memory pool objects in Python?
A little more context:
Python has no problem running multicore when you put ZeroMQ in the middle and make an art out of passing messages between the memory pools managed by each ZMQworker. At ZMQ's 8M msg/second, the internal update of the objects takes longer than the pipeline can be filled. This is all described here: http://zguide.zeromq.org/page:all#Chapter-Sockets-and-Patterns
So, with a little over-simplification, I spawn 80 ZMQworker processes and 1 ZMQrouter and load the context with a large swarm of objects (584 million objects, actually).
From this "start-point" the objects need to interact to complete the computation.
This is the idea:
If "object X" needs to interact with "Object Y" and is available in
the local memory pool of the python-thread, then the interaction
should be done directly.
If "Object Y" is NOT available in the same pool, then I want it to
send a message through the ZMQrouter and let the router return a
response at some later point in time. My architecture is non-blocking so what goes on in the particular python thread just continues without waiting for the zmqRouters response. Even for objects on the same socket but on a different core, I would prefer NOT to interact, as I prefer having clean message exchanges instead of having 2 threads manipulating the same memory object.
To do this I need to know:
how to figure out which socket a given Python process (thread) runs on.
how to assign a memory pool on that particular socket to the Python process (some malloc limit or similar, so that the sum of the memory pools does not push the memory pool from one socket to another).
Things I haven't thought of.
But I cannot find references in the Python docs on how to do this, and on Google I must be searching for the wrong thing.
Update:
Regarding the question "why use ZeroMQ on a MPI architecture?", please read the thread: Spread vs MPI vs zeromq? as the application I am working on is being designed for a distributed deployment even though it is tested on a an architecture where MPI is more suitable.
Update 2:
Regarding the question:
"How to pin threads to cores with predetermined memory pools in Python(3)" the answer is in psutils:
>>> import psutil
>>> psutil.cpu_count()
4
>>> p = psutil.Process()
>>> p.cpu_affinity() # get
[0, 1, 2, 3]
>>> p.cpu_affinity([0]) # set; from now on, this process will run on CPU #0 only
>>> p.cpu_affinity()
[0]
>>>
>>> # reset affinity against all CPUs
>>> all_cpus = list(range(psutil.cpu_count()))
>>> p.cpu_affinity(all_cpus)
>>>
The worker can be pegged to a core, whereby NUMA may be exploited effectively (look up your CPU type to verify that it is a NUMA architecture!).
The second element is to determine the memory pool. That can be done with psutil as well, or with the resource library:
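For example, a minimal sketch with the standard resource module, capping the address space of the current worker process (the 4 GB figure is an arbitrary assumption):

import resource

# Cap this worker's virtual address space to ~4 GB so its allocations
# cannot silently spill over into the other socket's DRAM.
limit = 4 * 1024 ** 3
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
resource.setrlimit(resource.RLIMIT_AS, (limit, hard))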
You might be underestimating the issue; there is no super-easy way to accomplish what you want. As a general guideline, you need to work at the operating system level to get things set up the way you want. You want to work with so-called "CPU affinity" and "memory affinity", and you need to think hard about your system architecture as well as your software architecture to get things right. In real HPC, the named "affinities" are normally handled by an MPI library, such as Open MPI. You might want to consider using one and letting your different processes be handled by that MPI library. The interface between operating system, MPI library and Python can be provided by the mpi4py package.
You also need to get your concept of threads and processes and the OS settings straight. While for the CPU time scheduler a thread is a task to be scheduled and therefore could theoretically have an individual affinity, I am only aware of affinity masks for entire processes, i.e. for all threads within one process. For controlling memory access, NUMA (non-uniform memory access) is the right keyword and you might want to look into http://linuxmanpages.com/man8/numactl.8.php
In any case, you need to read articles about the affinity topic and might want to start reading in the Open MPI FAQs about CPU/memory affinity:
http://www.open-mpi.de/faq/?category=tuning#paffinity-defs
In case you want to achieve your goal without using an MPI library, look into the packages util-linux or schedutils and numactl of your Linux distribution in order to get useful commandline tools such as taskset, which you could e.g. call from within Python in order to set affinity masks for certain process IDs.
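A sketch of driving taskset from Python, pinning the current process to the cores of one socket (the core range 0-3 is just an example):

import os
import subprocess

# Restrict the current process (and its future threads) to cores 0-3.
subprocess.check_call(["taskset", "-pc", "0-3", str(os.getpid())])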
This article seems to vividly describe how an MPI library can be helpful with your issue:
http://blogs.cisco.com/performance/open-mpi-v1-5-processor-affinity-options/
This SO answer describes how you bisect your hardware architecture: https://stackoverflow.com/a/11761943/145400
Generally, I am wondering whether the machine you are using is the right one for the task, or whether you are maybe optimizing at the wrong end. If you are messaging within one machine and hitting memory bandwidth limits, I am not sure that ZMQ (through TCP/IP, right?) is the right tool at all to perform the messaging. Coming back to MPI, the message passing interface for HPC applications...
Just wondering if this might not be amenable to the use of Python Remote Objects (Pyro) - this might be worth investigating, but unfortunately I do not have access to such hardware.
As explained in the documentation, while Pyro is often used to distribute work across multiple machines on a network, it can also be used to share processing between cores on a single machine.
On a lower level Pyro is just a form of inter-process communication. So everywhere you would otherwise have used a more primitive form of IPC (such as plain TCP/IP sockets) between Python components, you could consider using Pyro instead.
While Pyro may add some overhead, it may well speed things up and should make things more maintainable.
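A minimal Pyro4 sketch of the idea, with one worker object exposed per process (the class name, host and workload are assumptions of mine):

# worker.py - run one of these per core / machine
import Pyro4

@Pyro4.expose
class Worker(object):
    def compute(self, n):
        return sum(i * i for i in range(n))

daemon = Pyro4.Daemon(host="0.0.0.0")
uri = daemon.register(Worker())
print("worker uri:", uri)   # hand this URI to the caller
daemon.requestLoop()

# caller.py - dispatch work to a worker via its URI
import Pyro4

worker = Pyro4.Proxy("PYRO:objectid@worker-host:9999")  # the URI printed by worker.py
print(worker.compute(10000000))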
I have 4 systems, where one is used as a master and the remaining 3 as slaves. I want to execute a set of functions on the client systems and fetch back the results. To achieve this I had previously used Parallel Python, but unfortunately, as soon as the job_server is created, it internally executes the given function using all systems. I want to individually assign a particular function to be executed on an individual client machine, and I have no idea how to proceed with the coding. Is there any framework in Python which allows users to do that?
There is the multiprocessing module: http://docs.python.org/library/multiprocessing.html
You can find examples there that do exactly what you want (see Remote Managers).
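A condensed sketch of that Remote Managers pattern (the address, port and authkey are placeholders): the master serves shared queues, and each slave connects to pull its tasks and push results back.

# master.py - holds the shared queues
from multiprocessing.managers import BaseManager
from queue import Queue

task_queue, result_queue = Queue(), Queue()

class QueueManager(BaseManager):
    pass

QueueManager.register("get_tasks", callable=lambda: task_queue)
QueueManager.register("get_results", callable=lambda: result_queue)

manager = QueueManager(address=("", 50000), authkey=b"secret")
manager.get_server().serve_forever()

# slave.py - runs on each client machine
from multiprocessing.managers import BaseManager

class QueueManager(BaseManager):
    pass

QueueManager.register("get_tasks")
QueueManager.register("get_results")

manager = QueueManager(address=("master-host", 50000), authkey=b"secret")
manager.connect()
tasks, results = manager.get_tasks(), manager.get_results()
results.put(("done", tasks.get()))   # fetch a task and send back a result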
As a side note: I soon stumbled upon limitations in the multiprocessing module and now use PyRo as middleware. It's very simple and powerful, but requires a bit more work to set up a basic framework.
There was a talk at a Pycon in the past few years about a tool that managed remote processes via ssh, which sounded really cool and I meant to check out. I've forgotten what the tool was, but you might find it by looking at the Pycon talks for the past few years.
The Python Parallel Processing wiki page has a big list of tools that might help you.
I don't quite understand what you have now and what you're trying to do. Pyro is easy to use, pretty fast and well documented for letting multiple python processes (on multiple systems) interact with each other. Take a look at that. You'll need to ensure that something is running on all of them so they're listening for commands, but that's not too hard.
In Pythomnic you can have multiple processes on multiple servers exchanging synchronous calls like this:
result = pmnc("other_process").module.method(...) # invokes remote method
So you can either have 3 distinctly named slaves and pick the call target manually:
pmnc("slave_1").foo.bar() # calls bar() in foo.py on slave_1
or have the slaves identically named and have the framework pick any available one (if they are interchangeable, of course):
pmnc("slave").foo.bar() # calls bar() in foo.py on any slave
Check out RPyC.