IPython parallel computing vs pyzmq for cluster computing - python

I am currently working on some simulation code written in C, which runs on different remote machines. While the C part is finished, I want to simplify my work by extending it with a Python simulation API and some kind of job-queue system, which should do the following:
1. Specify a set of parameters for the simulations and put them into a queue on a host computer
2. Perform the simulations on the remote machines via workers
3. Return the results to the host computer
I had a look at different frameworks for accomplishing this task, and my first choice comes down to IPython.parallel. I had a look at the documentation, and from what I tested it seems pretty easy to use. My approach would be to use a load-balanced view, as explained at
http://ipython.org/ipython-doc/dev/parallel/parallel_task.html#creating-a-loadbalancedview-instance
But what I don't see is:
What happens, e.g., if the ipcontroller crashes? Is my job queue gone?
What happens if a remote machine crashes? Is there some kind of error handling?
Since I run relatively long simulations (1-2 weeks), I don't want my simulations to fail if some part of the system crashes. So is there maybe some way to handle this in IPython.parallel?
My second approach would be to use pyzmq and implement the job system from scratch.
In this case what would be the best zmq-pattern for this situation?
And last but not least, is there maybe a better framework for this scenario?

What lies behind the curtain is a somewhat more complex view of how to arrange the work-package flow alongside the (parallelised) number-crunching pipeline(s).
Whether the work-package amounts to many CPU-core-weeks, or the lump-sum volume of the job is above a few hundreds of thousands of CPU-core-hours, the principles are similar and follow common sense.
Key Features
scalability of the computing performance of all resources involved (ideally a linear one)
ease of the task-submission role
fault-resilience of submitted task(s) (ideally with automated self-healing)
feasible TCO of access to / use of a sufficient pool of resources (upfront costs, recurring costs, adaptation costs, costs of speed)
Approaches to Solution
a home-brewed architecture for a distributed, massively parallel, scheduler-based, self-healing computation engine
re-use of available grid-based computing resources
Based on my own experience solving a need for repetitive runs of a numerically intensive optimisation problem over a vast parameter-set vector space (one that could not be decomposed into any trivialised GPU parallelisation scheme), the second approach has been validated as more productive than an attempt to burn dozens of man-years in just another trial to re-invent the wheel.
In an academic environment one may get acceptable access to resource pool(s) for processing the work-packages far more easily, while commercial entities may acquire the same within their acceptable budgeting thresholds.

My gut instinct is to suggest rolling your own solution for this because, like you said, otherwise you're depending on IPython not crashing.
I would run a simple Python service on each node which listens for run commands. When it receives one, it launches your C program. However, I suggest you ensure the C program is a true Unix daemon, so that when it runs it completely disconnects itself from Python. That way, if your node's Python instance crashes, you can still get data if the C program executes successfully. Have the C program write the output data to a file or database, and when the task is finished write "finished" to a "status" file or something similar. The Python service should monitor that file, and when "finished" is indicated it should retrieve the data and send it back to the server.
The central idea of this design is to have as few points of failure as possible. As long as the C program doesn't crash, you can still get the data one way or another. As far as handling system crashes, network disconnects, etc., that's up to you.
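A rough sketch of that node-side service (hedged: the file names, polling interval, and transport back to the server are all placeholders; the real C daemon is assumed, not shown):

```python
import os
import subprocess
import time

STATUS_FILE = "status"       # hypothetical: the C daemon writes "finished" here
OUTPUT_FILE = "output.dat"   # hypothetical: the C daemon writes results here

def run_task(command, poll_seconds=5):
    """Launch the (self-daemonizing) C program, then poll its status file."""
    subprocess.Popen(command)  # the C binary detaches itself from this process
    while True:
        if os.path.exists(STATUS_FILE):
            with open(STATUS_FILE) as f:
                if f.read().strip() == "finished":
                    break
        time.sleep(poll_seconds)
    # task is done: retrieve the data to send back to the host
    with open(OUTPUT_FILE, "rb") as f:
        return f.read()
```

Because the service only watches files the C program writes, the service itself can crash and restart without losing a finished result.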


Possible to outsource computations to AWS and utilize results locally?

I'm working on a robot that uses a CNN that needs much more memory than my embedded computer (a Jetson TX1) can handle. I was wondering if it would be possible (with an extremely low-latency connection) to outsource the heavy computations to EC2 and send the results back to be used in a Python script. If this is possible, how would I go about it, and what would the latency look like (not the computations, just the sending to and from)?
I think it's certainly possible. You would need some scripts or a web server to transfer data to and from. Here is how I think you might achieve it:
Send all your training data to an EC2 instance
Train your CNN
Save the weights and/or any other generated parameters you may need
Construct the CNN on your embedded system and input the weights from the EC2 instance. Since you won't be needing to do any training here and won't need to load in the training set, the memory usage will be minimal.
Use your embedded device to predict whatever you may need
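A rough sketch of steps 3-5 (hedged: real CNN frameworks have their own save/load formats; plain JSON and the placeholder weight values below are just for illustration):

```python
import json

# On the EC2 instance, after training: export the learned parameters.
weights = {"conv1": [[0.1, -0.2], [0.3, 0.4]], "bias1": [0.0, 0.1]}  # placeholders
with open("weights.json", "w") as f:
    json.dump(weights, f)

# Transfer weights.json to the embedded device (scp, S3, ...), then reload:
with open("weights.json") as f:
    restored = json.load(f)

assert restored == weights  # parameters survive the round trip intact
```

The point is that only the (small) parameter file crosses the network, never the training set.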
It's hard to give you an exact answer on latency because you haven't given enough information. The exact latency is highly dependent on your hardware, internet connection, amount of data you'd be transferring, software, etc. If you're only training once on an initial training set, you only need to transfer your weights once and thus latency will be negligible. If you're constantly sending data and training, or doing predictions on the remote server, latency will be higher.
Possible: of course it is.
You can use any kind of RPC to implement this. HTTPS requests, xml-rpc, raw UDP packets, and many more. If you're more interested in latency and small amounts of data, then something UDP based could be better than TCP, but you'd need to build extra logic for ordering the messages and retrying the lost ones. Alternatively something like Zeromq could help.
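For instance, a minimal XML-RPC round trip with nothing but the standard library (the predict method and its doubling payload are stand-ins for the real model):

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

# Server side (would run on the EC2 instance); port 0 picks a free port.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False, allow_none=True)
server.register_function(lambda xs: [x * 2 for x in xs], "predict")  # model stand-in
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address

# Client side (would run on the robot).
proxy = ServerProxy("http://%s:%d" % (host, port))
print(proxy.predict([1, 2, 3]))  # -> [2, 4, 6]
```

HTTP-based RPC like this is the simplest to set up; latency-sensitive designs would swap in UDP or ZeroMQ as discussed above.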
As for the latency: only you can answer that, because it depends on where you're connecting from. Start up an instance in the region closest to you and run ping, or mtr against it to find out what's the roundtrip time. That's the absolute minimum you can achieve. Your processing time goes on top of that.
I am a former employee of CENAPAD-UFC (National Centre of HPC, Federal University of Ceará), so I have something to say about outsourcing computer power.
CENAPAD has a big cluster, and it provides computational power for academic research. There, professors and students send their computation and their data, define the output, and go drink a coffee or two while the cluster gets on with the hard work. After lots of FLOPs, the job ends and they retrieve the results via ssh and go back to their laptops.
For big chunks of computation, you want to minimize any work that is not useful computation. One such thing is communication between detached computers. If you need to know when the computation has ended, let the HPC machine tell you that.
To compute things effectively, you may want to go deeper into the machine and perform some kind of distribution. I use OpenMP to distribute computation inside the same machine (thread-level distribution). To distribute between computers that are physically separate but close (latency-wise), I use MPI. I have also installed another cluster at UFC for another department. There, the researchers used only MPI.
Maybe some read about distributed/grid/clusterized computing helps you:
https://en.m.wikipedia.org/wiki/SETI#home ; the first example of massive distributed computing that I ever heard
https://en.m.wikipedia.org/wiki/Distributed_computing
https://en.m.wikipedia.org/wiki/Grid_computing
https://en.m.wikipedia.org/wiki/Supercomputer ; this is CENAPAD like stuff
In my opinion, you want a grid-like computation, with your personal PC working as a master node that calls the EC2 slaves. In this scenario, use master-to-slave communication only to send the program (if really needed) and the data, so that the master can get on with other work unrelated to the sent data; also, let the slave tell your master when the computation has ended.

How to run a program and also execute code when data is received from the network?

I've written a little program in Python that basically does the following:
Gets a hotword as input from the user. If it matches the set keyword it continues.
After entering the correct hotword the user is asked to enter a command.
After the command is read, the program checks a command file to see if there is a command that matches that input
If a matching command is found, execute whatever that command says.
I'd like to add the ability to execute commands over a network as follows (and learn to use Twisted on the way):
Client #1 enters a command targeted at client #2.
The command gets sent to a server which routes it to client #2.
Client #2 receives the command and executes it if it's valid.
Note: Entering commands locally (using the code below) and remotely should be possible.
After some thinking I couldn't come up with any other way to implement this other than:
Have the above program run as process #1 (the program that runs locally as I've written at the beginning).
A Twisted client will be run as process #2 and receive the commands from remote clients. Whenever a command is received, the Twisted client will initialize a thread that'll parse the command, check for its validity and execute it if it's valid.
Since I don't have that much experience with threads and none with network programming, I couldn't think of any other scheme that makes sense to me.
Is the scheme stated above overly complicated?
I would appreciate some insight before trying to implement it this way.
The code for the python program (without the client) is:
The main (which is the start() method):
class Controller:
    def __init__(self, listener, executor):
        self.listener = listener
        self.executor = executor

    def start(self):
        while True:
            text = self.listener.listen_for_hotword()
            if self.executor.is_hotword(text):
                text = self.listener.listen_for_command()
                if self.executor.has_matching_command(text):
                    self.executor.execute_command(text)
                else:
                    tts.say("No command found. Please try again")
The Listener (gets input from the user):
class TextListener(Listener):
    def listen_for_hotword(self):
        text = raw_input("Hotword: ")
        text = ' '.join(text.split()).lower()
        return text

    def listen_for_command(self):
        text = raw_input("What would you like me to do: ")
        text = ' '.join(text.split()).lower()
        return text
The executor (the class that executes the given command):
class Executor:
    #TODO: Define default path
    def __init__(self, parser, audio_path='../Misc/audio.wav'):
        self.command_parser = parser
        self.audio_path = audio_path

    def is_hotword(self, hotword):
        return self.command_parser.is_hotword(hotword)

    def has_matching_command(self, command):
        return self.command_parser.has_matching_command(command)

    def execute_command(self, command):
        val = self.command_parser.getCommand(command)
        print val
        val = os.system(val)  # So we don't see the return value of the command
The command file parser:
class KeyNotFoundException(Exception):
    pass

class YAMLParser:
    THRESHOLD = 0.6

    def __init__(self, path='Configurations/commands.yaml'):
        with open(path, 'r') as f:
            self.parsed_yaml = yaml.load(f)

    def getCommand(self, key):
        try:
            matching_command = self.find_matching_command(key)
            return self.parsed_yaml["Commands"][matching_command]
        except KeyError:
            raise KeyNotFoundException("No key matching {}".format(key))

    def has_matching_command(self, key):
        try:
            for command in self.parsed_yaml["Commands"]:
                if jellyfish.jaro_distance(command, key) >= self.THRESHOLD:
                    return True
        except KeyError:
            return False

    def find_matching_command(self, key):
        for command in self.parsed_yaml["Commands"]:
            if jellyfish.jaro_distance(command, key) >= 0.5:
                return command

    def is_hotword(self, hotword):
        return jellyfish.jaro_distance(self.parsed_yaml["Hotword"], hotword) >= self.THRESHOLD
Example configuration file:
Commands:
    echo : echo hello
Hotword: start
I'm finding it extremely difficult to follow the background in your questions, but I'll take a stab at the questions themselves.
How to run a program and also execute code when data is received from the network?
As you noted in your question, the typical way to write a "walk and chew-gum" style app is to design your code in a threaded or an event-loop style.
Given you're talking about threading and Twisted (which is event-loop style), I'm worried that you may be thinking about mixing the two.
I view them as fundamentally different styles of programming (each with places where they excel), and mixing them is generally a path to hell.
Let me give you some background to explain
Background: Threading vs Event programming
Threading
How to think of the concept:
I have multiple things I need to do at the same time, and I want my operating system to figure how and when to run those separate tasks.
Pluses:
'The' way to let one program use multiple processor cores at the same time
In the POSIX world the only way to let one process run on multiple CPU cores at the same time is via threads (with the typical ideal number of threads being no more than the number of cores in a given machine)
Easier to start with
The same code that you were running inline can usually be tossed off into a thread without needing a redesign (without the GIL some locking would be required, but more on that later)
Much easier to use with tasks that will eat all the CPU you can throw at them
I.e., in most cases math is way easier to deal with using threading solutions than using event/async frameworks
Minuses:
Python has a special problem with threads
In CPython the global interpreter lock (GIL) can negate threads' ability to multitask (making threads nearly useless). Avoiding the GIL is messy and can undo all the ease of use of working with threads
As you add threads (and locks) things get complicated fast, see this SO: Threading Best Practices
Rarely optimal for IO/user-interacting tasks
While threads are very good at dealing with small numbers of tasks that want to use lots of CPU (Ideally one thread per core), they are far less optimal at dealing with large counts of tasks that spend most of their time waiting.
Best use:
Computationally expensive things.
If you have big chunks of math that you want to run concurrently, it's very unlikely that you're going to be able to schedule the CPU utilization more intelligently than the operating system.
(Given CPython's GIL problem, threading shouldn't be used manually for math; instead, a library that threads internally (like NumPy) should be used)
Event-loop/Asynchronous programming
How to think of the concept:
I have multiple things I need to do at the same time, but I (the programmer) want direct control over, and direct implementation of, how and when my sub-tasks are run
How you should be thinking about your code:
Think of all your sub-tasks as one big intertwined whole; your mind should always hold the thought "will this code run fast enough that it doesn't goof up the other sub-tasks I'm managing?"
Pluses:
Extraordinarily efficient with network/IO/UI connections, including large counts of connections
Event-loop style programs are one of the key technologies that solved the c10k problem. Frameworks like Twisted can literally handle tens-of-thousands of connections in one python process running on a small machine.
Predictable (small) increase in complexity as other connections/tasks are added
While there is a fairly steep learning curve (particularly in Twisted), once the basics are understood, new connection types can be added to projects with a minimal increase in overall complexity. Moving from a program that offers a keyboard interface to one that offers keyboard/telnet/web/ssl/ssh connections may just be a few lines of glue code per interface (... this is highly variable by framework, but regardless of the framework, an event loop's complexity is more predictable than threads' as connection counts increase)
Minuses:
Harder to get started.
Event/async programming requires a different style of design from the first line of code (though you'll find that style of design is somewhat portable from language to language)
One event-loop, one core
While event-loops can let you handle a spectacular number of IO connections at the same time, they in-and-of-themselves can't run on multiple cores at the same time. The conventional way to deal with this is to write programs in such a way that multiple instances of the program can be run at the same time, one for each core (this technique bypasses the GIL problem)
Multiplexing high CPU tasks can be difficult
Event programming requires cutting high-CPU tasks into pieces such that each piece takes an (ideally predictably) small amount of CPU; otherwise the event system ceases to multitask whenever the high-CPU task is run.
Best use:
IO based things
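To make the "cut high-CPU tasks into pieces" point concrete, here is a sketch using the stdlib asyncio event loop rather than Twisted (same principle: yield control between chunks so other tasks stay responsive; the chunk size and the heartbeat task are illustrative):

```python
import asyncio

async def big_computation(n, chunk=1000):
    """Sum 0..n-1, yielding to the event loop between chunks."""
    total = 0
    for start in range(0, n, chunk):
        for i in range(start, min(start + chunk, n)):
            total += i
        await asyncio.sleep(0)  # hand control back to the loop
    return total

async def heartbeat(ticks):
    """A stand-in IO task that must keep running while the math grinds on."""
    for _ in range(ticks):
        await asyncio.sleep(0)

async def main():
    # both tasks interleave on one event loop, one core
    result, _ = await asyncio.gather(big_computation(10000), heartbeat(5))
    return result

print(asyncio.run(main()))  # -> 49995000
```

Without the `await asyncio.sleep(0)` inside the loop, the heartbeat would stall until the whole computation finished, which is exactly the failure mode described above.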
TL; DR - Given your question, look at twisted and -not- threads
While your application doesn't seem to be exclusively IO-based, none of it seems to be CPU-based (it looks like you're currently playing audio via a system call; system spins off an independent process every time it's called, so its work doesn't burn your process's CPU. system blocks, though, so it's a no-no in Twisted; you have to use different calls there).
Your question also doesn't suggest you're concerned about maxing out multiple cores.
Therefore, given that you specifically talked about Twisted, and that an event-loop solution seems to be the best match for your application, I would recommend looking at Twisted and -not- threads.
Given the 'Best use' listed above, you might be tempted to think that mixing Twisted and threads is the way to go, but when doing that, if you do anything even slightly wrong you will disable the advantages of both the event loop (you'll block) and threading (the GIL won't let the threads multitask) and have something super complex that provides you no advantage.
Is the scheme stated above overly complicated? I would appreciate some insight before trying to implement it this way.
The 'scheme' you gave is:
After some thinking I couldn't come up with any other way to implement this other than:
Have the above program run as process #1 (the program that runs locally as I've written at the beginning).
A Twisted client will be run as process #2 and receive the commands from remote clients. Whenever a command is received, the Twisted client will initialize a thread that'll parse the command, check for its validity and execute it if it's valid.
Since I don't have that much experience with threads and none with network programming, I couldn't think of any other scheme that makes sense to me.
In answer to "Is the scheme ... overly complicated", I would say almost certainly yes, because you're talking about Twisted and threads (see the TL;DR above).
Given my certainly incomplete (and confused) understanding of what you want to build, I would imagine a twisted solution for you would look like:
Carefully study the krondo Twisted Introduction (it really requires tracing the example code line-for-line, but if you do the work it's an AWESOME learning tool for Twisted, and for event programming in general)
From the ground up, rewrite your 'hotword' thingy in Twisted using what you learned in the krondo guide, starting out by providing just whichever interface you currently have (keyboard?)
Add other communication interfaces to that code (telnet, web, etc) which would let you access the processing code you wrote for the keyboard(?) interface.
If, as you state in your question, you really need a server, you could write a second Twisted program to provide that (you'll see examples of all that in the krondo guide). Though I'm guessing that once you understand Twisted's library support, you'll realize you don't have to build any extra servers; you can just include whichever protocols you need in your base code.

Multithreading With Very Large Number of Threads

I'm working on simulating a mesh network with a large number of nodes. The nodes pass data between different master nodes throughout the network.
Each master comes live once a second to receive the information, but the slave nodes don't know when the master is up or not, so when they have information to send, they try and do so every 5 ms for 1 second to make sure they can find the master.
Running this on a regular computer with 1600 nodes results in 1600 threads and the performance is extremely bad.
What is a good approach to handling the threading so each node acts as if it is running on its own thread?
In case it matters, I'm building the simulation in python 2.7, but I'm open to changing to something else if that makes sense.
For one, are you really using regular, default Python threads available in the default Python 2.7 interpreter (CPython), and is all of your code in Python? If so, you are probably not actually using multiple CPU cores because of the global interpreter lock CPython has (see https://wiki.python.org/moin/GlobalInterpreterLock). You could maybe try running your code under Jython, just to check if performance will be better.
You should probably rethink your application architecture and switch to manually scheduling events instead of using threads, or maybe try using something like greenlets (https://stackoverflow.com/a/15596277/1488821), but that would probably mean less precise timings because of lack of parallelism.
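A sketch of that manual event scheduling, using a heap of (time, node) events in integer milliseconds (the per-node behaviour here is a trivial placeholder for your real send/retry logic):

```python
import heapq

def simulate(n_nodes, until_ms, retry_ms=5):
    """Single-threaded discrete-event loop instead of one thread per node."""
    events = [(0, node) for node in range(n_nodes)]  # every node wakes at t=0
    heapq.heapify(events)
    fired = 0
    while events:
        t, node = heapq.heappop(events)
        if t >= until_ms:
            break
        fired += 1  # placeholder: here the node would try to reach its master
        heapq.heappush(events, (t + retry_ms, node))  # retry every 5 ms

    return fired

# 1600 nodes x 200 wakeups in one simulated second, no threads involved
print(simulate(1600, until_ms=1000))  # -> 320000
```

One loop processes all 1600 nodes in timestamp order, so there is no thread-scheduling overhead at all, at the cost of the simulation running in virtual rather than wall-clock time.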
To me, 1600 threads sounds like a lot, but not excessive given that it's a simulation. If this were a production application, though, that design would probably not be production-worthy.
A standard machine should have no trouble handling 1600 threads. As to the OS this article could provide you with some insights.
When it comes to your code a Python script is not a native application but an interpreted script and as such will require more CPU resources to execute.
I suggest you try implementing the simulation in C or C++ instead which will produce a native application which should execute more efficiently.
Do not use threading for that. If you stick with Python, let the nodes perform their actions one by one. If the performance you get doing so is OK, you will not have to use C/C++. If the actions each node performs are simple, that may work. In any case, there is little reason to use threads in Python here: Python threads are useful mostly for keeping blocking I/O from blocking your program, not for utilizing multiple CPU cores.
If you want to really use parallel processing and to write your nodes as if they were really separated and exchanging only using messages, you may use Erlang (http://www.erlang.org/). It is a functional language very well suited for executing parallel processes and making them exchange messages. Erlang processes do not map to OS threads, and you may create thousands of them. However, Erlang is a purely functional language and may seem extremely strange if you have never used such languages. And it also is not very fast, so, like Python, it is unlikely to handle 1600 actions every 5ms unless the actions are rather simple.
Finally, if you do not get the desired performance using Python or Erlang, you may move to C or C++. However, still do not use 1600 threads. In fact, using threads to gain performance is reasonable only if the number of threads does not dramatically exceed the number of CPU cores. A reactor pattern (with several reactor threads) is what you may need in that case (http://en.wikipedia.org/wiki/Reactor_pattern). There is an excellent implementation of the reactor pattern in the boost.asio library. It is explained here: http://www.gamedev.net/blog/950/entry-2249317-a-guide-to-getting-started-with-boostasio/
Some random thoughts here:
I did rather well with several hundred threads working like this in Java; it can be done with the right language. (But I haven't tried this in Python.)
In any language, you could run the master node code in one thread; just have it loop continuously, running the code for each master in each cycle. You'll lose the benefits of multiple cores that way, though. On the other hand, you'll lose the problems of multithreading, too. (You could have, say, 4 such threads, utilizing the cores but getting the multithreading headaches back. It'll keep the thread-overhead down, too, but then there's blocking...)
One big problem I had was threads blocking each other. Enabling 100 threads to call the same method on the same object at the same time without waiting for each other requires a bit of thought and even research. I found my multithreading program at first often used only 25% of a 4-core CPU even when running flat out. This might be one reason you're running slow.
Don't have your slave nodes repeat sending data. The master nodes should come alive in response to data coming in, or have some way of storing it until they do come alive, or some combination.
It does pay to have more threads than cores. Once you have two threads, they can block each other (and will, if they share any data). If you have code to run that won't block, you want to run it in its own thread so it won't be waiting for blocking code to unblock and finish. I found that once I had a few threads, they started to multiply like crazy, hence my hundreds-of-threads program. Even when 100 threads block at one spot despite all my brilliance, there are plenty of other threads to keep the cores busy!

Advice: Python Framework Server/Worker Queue management (not Website)

I am looking for some advice/opinions of which Python Framework to use in an implementation of multiple 'Worker' PCs co-ordinated from a central Queue Manager.
For completeness, the 'Worker' PCs will be running Audio Conversion routines (which I do not need advice on, and have standalone code that works).
The Audio conversion takes a long time, and I need to co-ordinate an arbitrary number of the 'Workers' from a central location, handing them conversion tasks (such as where to get the source files, or where to ask for the job configuration) with them reporting back some additional info, such as the runtime of the converted audio etc.
At present, I have a script that makes a webservice call to get the 'configuration' for a conversion task, based on source files located on the worker already (we manually copy the source files to the worker, and that triggers a conversion routine). I want to change this, so that we can distribute conversion tasks ("Oy you, process this: xxx") based on availability, and in an ideal world, based on pending tasks too.
There is a chance that Workers can go offline mid-conversion (but this is not likely).
All the workers are Windows-based; the co-ordinator can be Windows or Linux.
I have (in my initial searches) come across the following - and I know that some are cross-dependent:
Celery (with RabbitMQ)
Twisted
Django
Using a framework, rather than home-brewing, seems to make more sense to me right now. I have a limited timeframe in which to develop this functional extension.
An additional consideration would be using a Framework that is compatible with PyQT/PySide so that I can write a simple UI to display Queue status etc.
I appreciate that the specifics above are a little vague, and I hope that someone can offer me a pointer or two.
Again: I am looking for general advice on which Python framework to investigate further, for developing a Server/Worker 'Queue management' solution, for non-web activities (this is why Django didn't seem the right fit).
How about using pyro? It gives you remote object capability and you just need a client script to coordinate the work.
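If pyro doesn't fit, a hedged alternative sketch with only the standard library: multiprocessing.managers can expose plain queues over TCP (the address, authkey, and job payload below are made up, and the worker half would normally run on a separate Windows PC):

```python
import queue
import threading
from multiprocessing.managers import BaseManager

job_q, result_q = queue.Queue(), queue.Queue()

class QueueManager(BaseManager):
    pass

# Co-ordinator side: expose the two queues over TCP (port 0 = pick a free port).
QueueManager.register("get_job_q", callable=lambda: job_q)
QueueManager.register("get_result_q", callable=lambda: result_q)
server = QueueManager(address=("127.0.0.1", 0), authkey=b"secret").get_server()
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.address

# Worker side: connect and pull conversion tasks.
worker = QueueManager(address=(host, port), authkey=b"secret")
worker.connect()
jobs, results = worker.get_job_q(), worker.get_result_q()

job_q.put({"source": r"\\share\audio\track01.wav"})        # co-ordinator enqueues
task = jobs.get()                                          # worker picks it up
results.put({"source": task["source"], "runtime": 183.2})  # worker reports back
print(result_q.get()["runtime"])  # -> 183.2
```

A PyQT/PySide status UI could then just poll the same queues from the co-ordinator process; Celery with RabbitMQ gives you the same shape plus persistence and retries for workers that go offline mid-conversion.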

Writing a parallel programming framework, what have I missed?

Clarification: As per some of the comments, I should clarify that this is intended as a simple framework to allow execution of programs that are naturally parallel (so-called embarrassingly parallel programs). It isn't, and never will be, a solution for tasks which require communication or synchronisation between processes.
I've been looking for a simple process-based parallel programming environment in Python that can execute a function on multiple CPUs on a cluster, with the major criterion being that it needs to be able to execute unmodified Python code. The closest I found was Parallel Python, but pp does some pretty funky things, which can cause the code to not be executed in the correct context (with the appropriate modules imported etc).
I finally got tired of searching, so I decided to write my own. What I came up with is actually quite simple. The problem is, I'm not sure if what I've come up with is simple because I've failed to think of a lot of things. Here's what my program does:
I have a job server which hands out jobs to nodes in the cluster.
The jobs are handed out to servers listening on nodes by passing a dictionary that looks like this:
{
    'moduleName': 'some_module',
    'funcName': 'someFunction',
    'localVars': {'someVar': someVal, ...},
    'globalVars': {'someOtherVar': someOtherVal, ...},
    'modulePath': '/a/path/to/a/directory',
    'customPathHasPriority': aBoolean,
    'args': (arg1, arg2, ...),
    'kwargs': {'kw1': val1, 'kw2': val2, ...}
}
moduleName and funcName are mandatory, and the others are optional.
A node server takes this dictionary and does:
sys.path.append(modulePath)
globals()[moduleName]=__import__(moduleName, localVars, globalVars)
returnVal = globals()[moduleName].__dict__[funcName](*args, **kwargs)
On getting the return value, the server then sends it back to the job server which puts it into a thread-safe queue.
When the last job returns, the job server writes the output to a file and quits.
I'm sure there are niggles that need to be worked out, but is there anything obviously wrong with this approach? On first glance, it seems robust, requiring only that the nodes have access to the filesystem(s) containing the .py file and its dependencies. Using __import__ has the advantage that the code in the module is automatically run, so the function should execute in the correct context.
Any suggestions or criticism would be greatly appreciated.
EDIT: I should mention that I've got the code-execution bit working, but the server and job server have yet to be written.
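For what it's worth, a minimal runnable sketch of the node-side dispatch (using importlib rather than raw __import__; the globals/locals injection and error handling are omitted, and the example exercises a stdlib module instead of a user module):

```python
import importlib
import sys

def run_job(job):
    """Execute one job dictionary of the shape described above."""
    path = job.get('modulePath')
    if path and path not in sys.path:
        sys.path.append(path)
    mod = importlib.import_module(job['moduleName'])
    func = getattr(mod, job['funcName'])
    return func(*job.get('args', ()), **job.get('kwargs', {}))

print(run_job({'moduleName': 'math', 'funcName': 'sqrt', 'args': (16.0,)}))  # -> 4.0
```

The return value would then be pickled and shipped back to the job server's thread-safe queue.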
I have actually written something that probably satisfies your needs: jug. If it does not solve your problems, I promise you I'll fix any bugs you find.
The architecture is slightly different: workers all run the same code, but they effectively generate a similar dictionary and ask the central backend "has this been run?". If not, they run it (there is a locking mechanism too). The backend can simply be the filesystem if you are on an NFS system.
I myself have been tinkering with batch image manipulation across my computers, and my biggest problem was the fact that some things don't easily or natively pickle and transmit across the network.
For example: pygame's surfaces don't pickle. These I have to convert to strings by saving them in StringIO objects and then dumping them across the network.
If the data you are transmitting (eg your arguments) can be transmitted without fear, you should not have that many problems with network data.
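The generic shape of that workaround, for objects that do pickle cleanly (the payload values are illustrative):

```python
import pickle

# Sender: serialize the call and its arguments into bytes for the socket.
payload = {"funcName": "blur", "args": (3,), "image_bytes": b"\x89PNG..."}
wire = pickle.dumps(payload, protocol=2)  # pick a protocol both ends support

# Receiver: reconstruct the original objects from the wire bytes.
received = pickle.loads(wire)
assert received == payload
```

Anything that fails pickle.dumps (surfaces, sockets, open files) needs an explicit to-bytes/from-bytes conversion step like the StringIO trick above.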
Another thing comes to mind: what do you plan to do if a computer suddenly "disappears" while doing a task, or while returning the data? Do you have a plan for re-sending tasks?
