After hearing that the scientific computing project I'm currently running for an investigator (it happens to be the stochastic tractography method described here) would take 4 months on our 50-node cluster, the investigator asked me to examine other options. The project currently uses Parallel Python to farm out chunks of a 4D array to different cluster nodes and to reassemble the processed chunks.
The jobs I'm currently working with are probably much too coarsely grained (5 seconds to 10 minutes; I had to increase the default timeout in Parallel Python), and I estimate I could speed the process up 2-4 times by rewriting it to make better use of resources (splitting up and reassembling the data is taking too long, and that should be parallelized as well). Most of the work is done on NumPy arrays.
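For reference, a stripped-down sketch of the current approach (process_chunk, the array shape, and the chunk count are placeholders; the real per-chunk work is the tractography step):

```python
import numpy as np
import pp

def process_chunk(chunk):
    # Placeholder for the real per-chunk tractography computation.
    return chunk * 2.0

data = np.random.rand(64, 64, 40, 128)      # stand-in for the 4D volume
chunks = np.array_split(data, 50, axis=-1)  # roughly one chunk per node

job_server = pp.Server(ppservers=("*",))    # auto-discover Parallel Python workers
jobs = [job_server.submit(process_chunk, (c,), modules=("numpy",)) for c in chunks]
result = np.concatenate([job() for job in jobs], axis=-1)
```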
Let's assume that 2-4 times isn't enough, and I decide to get the code off of our local hardware. For high throughput computing like this, what are my commercial options and how will I need to modify the code?
You might be interested in PiCloud. I have never used it, but their offer apparently includes the Enthought Python Distribution, which covers the standard scientific libraries.
It's tough to say if this will work for your specific case, but the Parallel Python interface is pretty generic. So hopefully not too many changes would be needed. Maybe you can even write a custom scheduler class (implementing the same interface as PP). Actually that might be useful for many people, so maybe you can drum up some support in the PP forum.
The most obvious commercial options which come to mind are Amazon EC2 and the Rackspace Cloud. I have played with both and found the Rackspace API a little easier to use.
The good news is that you can prototype and play with their compute instances (short- or long-lived virtual machines of the OS of your choice) for very little investment, typically $US 0.10 / hr or so. You create them on demand and then release them back to the cloud when you are done, and only pay for what you use. For example, I saw a demo on Django deployment using 6 Rackspace instances which took perhaps an hour and cost the speakers less than a dollar.
For your use case (not clear exactly what you meant by 'high throughput'), you will have to look at your budget and your computing needs, as well as your total network throughput (you pay for that, too). A few small-scale tests and a simple spreadsheet calculation should tell you if it's really practical or not.
There are Python APIs for both Rackspace Cloud and Amazon EC2. Whichever you use, I recommend python-based Fabric for automated deployment and configuration of your instances.
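For example, here is a rough sketch of starting and stopping an EC2 instance with the boto library (the AMI ID, key pair name, and region are placeholders, and AWS credentials are assumed to be configured already, e.g. in ~/.boto):

```python
import time
import boto.ec2

# Assumes AWS credentials are already configured (e.g. in ~/.boto or env vars).
conn = boto.ec2.connect_to_region("us-east-1")

reservation = conn.run_instances("ami-xxxxxxxx",        # placeholder AMI ID
                                 instance_type="m1.small",
                                 key_name="my-keypair")  # placeholder key pair
instance = reservation.instances[0]

while instance.state != "running":   # poll until the VM is up
    time.sleep(5)
    instance.update()

print(instance.public_dns_name)      # address to point Fabric/ssh at
# ... run your jobs, then release the instance so you stop paying for it:
instance.terminate()
```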
I'm working on a robot that uses a CNN that needs much more memory than my embedded computer (Jetson TX1) can handle. I was wondering if it would be possible (with an extremely low-latency connection) to outsource the heavy computations to EC2 and send the results back to be used in a Python script. If this is possible, how would I go about it, and what would the latency look like (not the computations, just the sending to and from)?
I think it's certainly possible. You would need some scripts or a web server to transfer data to and from. Here is how I think you might achieve it:
1. Send all your training data to an EC2 instance.
2. Train your CNN.
3. Save the weights and/or any other generated parameters you may need.
4. Construct the CNN on your embedded system and load in the weights from the EC2 instance (a rough sketch of this step follows the list). Since you won't need to do any training here and won't need to load the training set, the memory usage will be minimal.
5. Use your embedded device to predict whatever you may need.
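A rough sketch of steps 3 and 4, assuming the weights can be dumped into a NumPy archive and copied over with scp (the layer names, shapes, and host are made up for illustration):

```python
import numpy as np

# --- On the EC2 instance, after training ---
# Hypothetical trained parameters; in practice these come out of your CNN framework.
trained_weights = {"conv1": np.random.rand(32, 3, 5, 5),
                   "fc1": np.random.rand(128, 512)}
np.savez("weights.npz", **trained_weights)

# Copy the archive to the robot, e.g.:  scp ubuntu@ec2-host.example.com:weights.npz .

# --- On the Jetson, when constructing the same network ---
weights = np.load("weights.npz")
for name in weights.files:
    print(name, weights[name].shape)  # feed each array into the matching layer
```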
It's hard to give you an exact answer on latency because you haven't given enough information. The exact latency is highly dependent on your hardware, internet connection, amount of data you'd be transferring, software, etc. If you're only training once on an initial training set, you only need to transfer your weights once and thus latency will be negligible. If you're constantly sending data and training, or doing predictions on the remote server, latency will be higher.
Possible: of course it is.
You can use any kind of RPC to implement this. HTTPS requests, xml-rpc, raw UDP packets, and many more. If you're more interested in latency and small amounts of data, then something UDP based could be better than TCP, but you'd need to build extra logic for ordering the messages and retrying the lost ones. Alternatively something like Zeromq could help.
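For example, a minimal pyzmq request/reply sketch: the robot sends a payload and blocks for the prediction, while the EC2 side (shown commented out, since it runs in its own process) answers requests in a loop. The host, port, and run_cnn function are placeholders:

```python
import zmq

context = zmq.Context()

# --- Robot side: send an observation and block for the prediction ---
sock = context.socket(zmq.REQ)
sock.connect("tcp://ec2-host.example.com:5555")  # placeholder address
sock.send_pyobj({"image": [0.0] * 1024})         # any picklable payload
prediction = sock.recv_pyobj()

# --- EC2 side (separate process): answer requests in a loop ---
# server = context.socket(zmq.REP)
# server.bind("tcp://*:5555")
# while True:
#     request = server.recv_pyobj()
#     server.send_pyobj(run_cnn(request))  # run_cnn is a placeholder
```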
As for the latency: only you can answer that, because it depends on where you're connecting from. Start up an instance in the region closest to you and run ping or mtr against it to find out the round-trip time. That's the absolute minimum you can achieve; your processing time goes on top of that.
I am a former employee of CENAPAD-UFC (National Centre of HPC, Federal University of Ceará), so I have something to say about outsourcing computing power.
CENAPAD has a big cluster that provides computational power for academic research. There, professors and students send in their jobs and data, define the desired output, and go drink a coffee or two while the cluster gets on with the hard work. After lots of FLOPs, the run ends and they retrieve the results via ssh from their laptops.
For big chunks of computation, you want to minimize any work that is not useful computation. One such thing is communication between detached computers. If you need to know when the computation has ended, let the HPC machine tell you that.
To compute things effectively, you may also want to go deeper into the machine and perform some kind of work distribution. I use OpenMP to distribute computation across threads within a single machine. To distribute between physically separate but nearby (latency-wise) computers, I use MPI. I also installed another cluster at UFC for a different department, where the researchers used only MPI.
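From Python, the same MPI idea is available through mpi4py. A minimal sketch of a master scattering work to workers and gathering the partial results (run with something like mpirun -n 4 python script.py; the parameter array and the summation are stand-ins):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    # Master: split the parameter sets, one slice per MPI process.
    work = np.array_split(np.arange(100, dtype="d"), size)
else:
    work = None

chunk = comm.scatter(work, root=0)      # each rank gets its own slice
partial = chunk.sum()                   # stand-in for the real computation
results = comm.gather(partial, root=0)  # master collects the partial results

if rank == 0:
    print("total:", sum(results))
```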
Maybe some reading about distributed/grid/cluster computing will help you:
https://en.m.wikipedia.org/wiki/SETI@home ; the first example of massive distributed computing that I ever heard of
https://en.m.wikipedia.org/wiki/Distributed_computing
https://en.m.wikipedia.org/wiki/Grid_computing
https://en.m.wikipedia.org/wiki/Supercomputer ; this is CENAPAD-like stuff
In my opinion, you want a grid-like computation, with your personal PC working as a master node that calls the EC2 slaves. In this scenario, use communication from master to slave only to send the program (if really needed) and the data, so that the master can get on with other work unrelated to the data it has sent; also, let the slave tell your master when the computation has reached its end.
I have a pipe from my Web server to my primary development desktop in order to keep a slot open for heavy CPU processes without paying a premium to Amazon or another cloud platform. I do, however, still use this machine for other personal things such as video encoding or gaming.
Is there a way to combine a nice value and a cpulimit value so that the process is capped at a maximum percentage of CPU but still has the highest priority, so it will absolutely run when requested? Say, for example, I wanted 25% of my CPU available on demand to the process no matter what I was doing on the machine at the time.
Ideally I would like it to allow a higher percentage during times when I am not using the machine, while setting a minimum that is always available.
Is there a clean way to do this? The only way that I found so far is by sticking the process in a separate virtual machine but it feels like I'm making things a whole lot more complicated than they need to be in order to make it run smoothly. On top of that, the ability to allow a higher percentage from a limited virtual machine currently doesn't exist as far as I know.
As a side note, I'm doing all this on a Mac, so the solution will have to be Unix-based. The server I'm using is Python's CherryPy, for easy expansion on new developments.
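To make what I'm after concrete, here is a hypothetical sketch of combining the two knobs, assuming the cpulimit tool is installed (and noting that a negative nice value requires root):

```python
import os
import subprocess

def run_capped(cmd, cpu_percent=25, niceness=-5):
    """Start cmd with a higher priority, then cap its CPU share with cpulimit."""
    # os.nice() in the child raises priority; negative values require root.
    proc = subprocess.Popen(cmd, preexec_fn=lambda: os.nice(niceness))
    # cpulimit -l <percent> -p <pid> throttles the already-running process.
    limiter = subprocess.Popen(["cpulimit", "-l", str(cpu_percent), "-p", str(proc.pid)])
    return proc, limiter

# e.g. run_capped(["python", "heavy_task.py"], cpu_percent=25)  # heavy_task.py is hypothetical
```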
Thank you in advance.
I am currently working on some simulation code written in C, which runs on different remote machines. While the C part is finished, I want to simplify my work by extending it with a Python simulation API and some kind of job-queue system, which should do the following:
1. Specify a set of parameters on which simulations should be performed and put them into a queue on a host computer.
2. Perform the simulations on the remote machines via workers.
3. Return the results to the host computer.
I had a look at different frameworks for accomplishing this task, and my first choice comes down to IPython.parallel. I had a look at the documentation and, from what I tested, it seems pretty easy to use. My approach would be to use a load-balanced view as explained at
http://ipython.org/ipython-doc/dev/parallel/parallel_task.html#creating-a-loadbalancedview-instance
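Roughly, what I have in mind is something like this minimal sketch against a load-balanced view (run_simulation and the parameter sets are placeholders for the wrapper around my C code):

```python
from IPython.parallel import Client

def run_simulation(params):
    # Placeholder: would call the compiled C code (e.g. via subprocess) and return results.
    return sum(params)

rc = Client()                     # connects to the ipcontroller
lview = rc.load_balanced_view()   # dynamic load balancing over all engines

parameter_sets = [(1, 2), (3, 4), (5, 6)]   # placeholder parameter queue
async_result = lview.map_async(run_simulation, parameter_sets)
results = async_result.get()      # blocks until all simulations have finished
```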
But what I don't see is:
What happens if, for example, the ipcontroller crashes? Is my job queue gone?
What happens if a remote machine crashes? Is there some kind of error handling?
Since I run relatively long simulations (1-2 weeks) I don't want my simulations to fail if some part of the system crashes. So is there maybe some way to handle this in IPython.parallel?
My second approach would be to use pyzmq and implement the job system from scratch.
In this case what would be the best zmq-pattern for this situation?
And last but not least, is there maybe a better framework for this scenario?
What lies behind the curtain is a somewhat more complex view of how to arrange the work-package flow alongside the (parallelised) number-crunching pipeline(s).
Whether the work-package amounts to many CPU-core-weeks, or the lump-sum volume of the job exceeds a few hundred thousand CPU-core-hours, the principles are similar and follow common sense.
Key Features
scalability of the computing performance of all resources involved (ideally a linear one)
ease of the task-submission role
fault resilience of the submitted task(s) (ideally with automated self-healing)
feasible TCO of access to / use of a sufficient pool of resources (upfront costs, recurring costs, adaptation costs, costs of speed)
Approaches to Solution
a home-brew architecture for a distributed, massively parallel, scheduler-based, self-healing computation engine
re-use of available grid-based computing resources
Based on my own experience solving a need for repetitive runs of a numerically intensive optimisation problem over a vast parameterSetVectorSPACE (which could not be decomposed into any trivialised GPU parallelisation scheme), the second approach proved more productive than an attempt to burn dozens of man-years on just another trial to re-invent a wheel.
Being in an academic environment, one may find it much easier to get acceptable access to resource pool(s) for processing the work-packages, while commercial entities may acquire the same, based on their acceptable budgeting thresholds.
My gut instinct is to suggest rolling your own solution for this because, like you said, otherwise you're depending on IPython not crashing.
I would run a simple Python service on each node which listens for run commands. When it receives one, it launches your C program. However, I suggest you ensure the C program is a true Unix daemon, so that when it runs it completely disconnects itself from Python. That way, if your node's Python instance crashes, you can still get data if the C program executes successfully. Have the C program write the output data to a file or database, and when the task is finished write "finished" to a "status" file or something similar. The Python service should monitor that file, and when "finished" appears it should retrieve the data and send it back to the server.
The central idea of this design is to have as few points of failure as possible. As long as the C program doesn't crash, you can still get the data one way or another. As far as handling system crashes, network disconnects, etc., that's up to you.
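Under those assumptions, a minimal sketch of the node-side Python service might look like this (the file paths, the "finished" convention, and the host URL are all made up for illustration):

```python
import time
import urllib2  # Python 2 era; the requests library would also work

STATUS_FILE = "/var/run/sim/status"       # the C daemon writes "finished" here
OUTPUT_FILE = "/var/run/sim/output.dat"   # the C daemon writes its results here
HOST_URL = "http://host.example.com:8080/results"  # placeholder collection endpoint

def wait_and_upload(poll_seconds=30):
    # Poll the status file written by the C daemon. Because the daemon is detached,
    # this loop can be restarted after a Python crash without losing the results.
    while True:
        try:
            if open(STATUS_FILE).read().strip() == "finished":
                break
        except IOError:
            pass
        time.sleep(poll_seconds)
    data = open(OUTPUT_FILE, "rb").read()
    urllib2.urlopen(HOST_URL, data)   # POST the result data back to the host

if __name__ == "__main__":
    wait_and_upload()
```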
I'm creating an app in several different python web frameworks to see which has the better balance of being comfortable for me to program in and performance. Is there a way of reporting the memory usage of a particular app that is being run in virtualenv?
If not, how can I find the average, maximum and minimum memory usage of my web framework apps?
It depends on how you're going to run the application in your environment. There are many different ways to run Python web apps; recently popular methods seem to be Gunicorn and uWSGI. So you'd be best off running the application as you would in your environment, and you could simply use a process monitor to see how much memory and CPU is being used by the process running your application.
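For example, a rough sketch using the psutil library to sample the worker process while you load-test the app (the PID and sampling window are whatever fits your setup):

```python
import time
import psutil

def sample_memory(pid, seconds=60, interval=1.0):
    """Return min/avg/max resident memory (in MB) of a process over a sampling window."""
    proc = psutil.Process(pid)
    samples = []
    for _ in range(int(seconds / interval)):
        samples.append(proc.memory_info().rss / (1024.0 * 1024.0))
        time.sleep(interval)
    return min(samples), sum(samples) / len(samples), max(samples)

# e.g. sample_memory(12345) while you fire requests at the app with ab or siege
```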
I'll second Matt W's note about the application environment being a major factor (Gunicorn, uWSGI, nginx->paster/pserve, eventlet, Apache+mod_wsgi, etc.).
I'll also add this: the year is 2012. In 1999, memory and CPU for stuff like this were huge concerns, but it's 2012 now. Computers are significantly more powerful, expanding them is much easier and cheaper, and frameworks are coded better.
You're essentially looking at benchmarking things that have no practical impact and will only be theoretically 'neat' and informative.
The performance bottlenecks on Python webapps are usually:
database communications bottleneck
database schema
concurrent connections / requests-per-second
In terms of the database communications bottleneck, the general approaches to solving it are:
communicate less
aggressive caching
optimize your SQL queries and result sets, so there's less data
upgrade your db infrastructure
dedicated machine(s)
cluster master/slave or shard
In terms of database schema, convenience comes at a price. It's faster to get certain things done in Django -- but you're going to be largely stuck with the schema it creates. Pyramid+SqlAlchemy is more flexible and you can build against a finely tuned database with it... but you're not going to get any of the automagic tools that Django gives.
For concurrent connections / requests per second, it's largely down to the environment: running the same app under paster, uWSGI, and other deployment strategies will give different results.
Here's a link to a good, but old, benchmark: http://nichol.as/benchmark-of-python-web-servers
You'll note there's a slide for peak memory usage there, and although there are a few outliers and a decent amount of clustering going on, the worst performer used 122MB. That's nothing.
You could interpret gevent as awesome for having 3MB compared to uWSGI's 15 or cogen's 122... but these are all a small fraction of a modern system's memory.
The frameworks have such a small overhead that they will barely be a factor in operating performance. Even the database portions are nothing. Reference this posting about SQLAlchemy (Why is SQLAlchemy insert with sqlite 25 times slower than using sqlite3 directly?), where the maintainer gives some impressive performance numbers: straight-up SQL generation was ~0.5s for 100k rows, while a full ORM with integrity checks etc. takes 16s for the same number of rows. That is nothing.
So, my point is simple. The two factors you should consider are:
how fast / comfortable can I program now
how fast / comfortable can I program a year from now (i.e. how likely is my project to grow 'technical debt' using this framework, and how much of a problem will that become)
Play with the frameworks to decide which one you like the most, but don't waste your time on performance testing, because all you're going to do is waste time.
The choice of hosting mechanism isn't the cause of memory usage; it is how you configure it, plus how fat a Python web application you decide to run.
The benchmark being quoted of:
http://nichol.as/benchmark-of-python-web-servers
is a good example of where benchmarks can get it quite wrong.
The configurations of the different hosting mechanisms in that benchmark were not comparable and so there is no way you can use the results to evaluate memory usage of each properly. I would not pay much attention to that benchmark if memory is your concern.
Ignoring memory, some of the other comments made about where the real bottlenecks are going to be are valid. For a lot more detail on this whole issue see my PyCon talk.
http://lanyrd.com/2012/pycon/spcdg/
I recently created a Python script that performed some natural language processing tasks and worked quite well in solving my problem. But it took 9 hours. I first investigated using Hadoop to break the problem down into steps, hoping to take advantage of the scalable parallel processing I'd get by using Amazon Web Services.
But a friend of mine pointed out that Hadoop is really for large amounts of data stored on disk, on which you want to perform many simple operations. In my situation I have a comparatively small initial data set (low hundreds of MBs) on which I perform many complex operations, taking up a lot of memory in the process and running for many hours.
What framework can I use in my script to take advantage of scalable clusters on AWS (or similar services)?
Parallel Python is one option for distributing things over multiple machines in a cluster.
This example shows how to do a MapReduce-like script using processes on a single machine. Secondly, if you can, try caching intermediate results. I did this for an NLP task and obtained a significant speed-up.
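As a rough illustration of both suggestions, here is a sketch that maps a stand-in NLP function over documents with multiprocessing on one machine and caches the intermediate results to disk (tokenize_document and the cache file name are placeholders):

```python
import os
import pickle
from multiprocessing import Pool

def tokenize_document(doc):
    # Stand-in for an expensive NLP step.
    return doc.lower().split()

def cached_map(func, items, cache_path):
    """Map func over items in parallel, re-using pickled results from a previous run."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    pool = Pool()                   # one worker per CPU core by default
    results = pool.map(func, items)
    pool.close()
    pool.join()
    with open(cache_path, "wb") as f:
        pickle.dump(results, f)
    return results

if __name__ == "__main__":
    docs = ["First document text", "Second document text"]
    tokens = cached_map(tokenize_document, docs, "tokens.pkl")
```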
My package, jug, might be very appropriate for your needs. Without more information, I can't really say what the code would look like, but I designed it for sub-Hadoop-sized problems.
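Without knowing the specifics, a generic sketch in the jug style might look something like this (process_document stands in for the expensive NLP step, and the machines are assumed to share a filesystem so they can see the same jugdata/ store):

```python
from jug import TaskGenerator

@TaskGenerator
def process_document(path):
    # Placeholder for the expensive NLP work on one document.
    return open(path).read().lower().split()

@TaskGenerator
def combine(results):
    # Merge the per-document results into a final summary.
    return sum(len(r) for r in results)

documents = ["doc1.txt", "doc2.txt", "doc3.txt"]   # placeholder corpus
partial = [process_document(d) for d in documents]
total = combine(partial)

# Run with:  jug execute thisfile.py
# Start it on as many machines as you like, as long as they share the jugdata/
# store (or another shared backend); jug hands out the pending tasks to them.
```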