I recently created a Python script that performs some natural language processing tasks and solves my problem quite well. But it takes 9 hours to run. I first investigated using Hadoop to break the problem into steps, hoping to take advantage of the scalable parallel processing I'd get from Amazon Web Services.
But a friend pointed out that Hadoop is really meant for large amounts of data stored on disk, over which you want to perform many simple operations. In my situation I have a comparatively small initial data set (low hundreds of MBs) on which I perform many complex operations, using a lot of memory in the process and taking many hours.
What framework can I use in my script to take advantage of scalable clusters on AWS (or similar services)?
Parallel Python is one option for distributing work across multiple machines in a cluster.
This example shows how to do a MapReduce-like script using processes on a single machine. Secondly, if you can, try caching intermediate results. I did this for an NLP task and obtained a significant speed-up.
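(Not the linked example itself, but a minimal single-machine sketch of the same MapReduce-style pattern with multiprocessing; the token-counting step is just a placeholder for the real NLP work.)

```python
from collections import Counter
from multiprocessing import Pool

def map_chunk(text):
    # "map" step: count tokens in one chunk of text (stand-in for the real NLP work)
    return Counter(text.split())

def reduce_counts(partials):
    # "reduce" step: merge the per-chunk results into one total
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    chunks = ["some text here", "more text here"]  # placeholder input chunks
    with Pool() as pool:                           # one worker per core by default
        partials = pool.map(map_chunk, chunks)
    print(reduce_counts(partials))
```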
My package, jug, might be very appropriate for your needs. Without more information, I can't really say what the code would look like, but I designed it for sub-Hadoop-sized problems.
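(Not from the original answer, but roughly the shape a jug script tends to take; the task functions here are placeholders.)

```python
# jugfile.py
from jug import TaskGenerator

@TaskGenerator
def analyse(document):
    return len(document.split())  # stand-in for the real NLP step

@TaskGenerator
def combine(results):
    return sum(results)

documents = ["first document", "second document"]  # placeholder inputs
final = combine([analyse(d) for d in documents])
```

You would then run `jug execute jugfile.py` in as many processes (or on as many machines sharing a filesystem) as you like; jug caches finished task results on disk, so completed work is not redone.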
A Python program of mine uses the multiprocessing module to parallelize iterations of a search problem. Among other things, each iteration loops over a CPU-expensive routine that is already optimized in Cython. Because this routine gets called many times within the loop, it significantly slows down the total runtime.
What is the recommended way to achieve a speed-up in this case? As the expensive process can't be further CPU-optimized, I've considered parallelizing the loop. However, as the loop lives in an already parallelized (by multiprocessing) program, I don't think this would be possible on the same machine.
My research on this has failed to find any best practices or any sort of direction.
As a quick way to see whether there is room to optimize your existing code, check your machine's CPU usage while the code is running (a one-liner for doing this from Python is sketched after the list below).
If all your cores are already at ~100%, then adding more processes isn't likely to improve things.
In that case you could:
1 - Try further algorithm optimisation (though the best bang for the buck is to profile your code first to see where it's actually slow). If you've already been using Cython, this is likely to have limited returns.
2 - Try a faster machine and/or one with more cores.
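A quick way to do the CPU-usage check mentioned above from Python (psutil is an assumption; top or htop shows the same thing with no code at all):

```python
import psutil

# per-core utilization sampled over one second, e.g. [98.0, 97.5, 12.0, ...]
print(psutil.cpu_percent(interval=1, percpu=True))
```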
Another approach, however (one that I've used), is to develop a serverless design and run the CPU-intensive, parallel parts of your algorithm using any of the cloud vendors' serverless offerings.
I've personally used AWS Lambda, where we parallelized our code to run as 200+ simultaneous Lambda invocations, which is roughly equivalent to a single machine with 200+ cores.
For us, this resulted in roughly a 50-100x increase in performance (measured as the reduction in total processing time) compared to running on an 8-core server.
You do have to do more work to implement a serverless deployment model, and then write wrapper code to manage everything, which isn't trivial. However, the ability to scale essentially without limit horizontally may well make sense for you.
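For a rough idea of what the fan-out side can look like, here is a minimal sketch using boto3; the function name and payload shape are assumptions, and the real wrapper code (batching, retries, error handling) is where most of the effort goes:

```python
import json
from concurrent.futures import ThreadPoolExecutor

import boto3

client = boto3.client("lambda")

def invoke(chunk):
    # synchronously invoke one Lambda worker on one unit of work
    response = client.invoke(
        FunctionName="my-worker-function",        # hypothetical function name
        InvocationType="RequestResponse",
        Payload=json.dumps({"chunk": chunk}),
    )
    return json.loads(response["Payload"].read())

work_units = [list(range(i, i + 10)) for i in range(0, 2000, 10)]  # 200 units
with ThreadPoolExecutor(max_workers=200) as pool:  # one thread per in-flight call
    results = list(pool.map(invoke, work_units))
```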
I have written a script to do some research on HTTP Archive data. The script needs to make HTTP requests to sites scraped by HTTP Archive in order to classify them into groups (Drupal, WordPress, etc.). The script is working really well; however, the list of sites I am handling is 300,000 long.
I would like to be able to complete the categorization of sites as fast as possible. I have experimented with running multiple instances of the script at the same time and it is working well with appropriate locks in place to prevent race conditions.
How can I max this out to get all of these operations completed as fast as possible? For instance, I am looking at spinning up a VPS with 8 CPUs and 16 GB RAM. How do I maximize these resources to make sure I'm using every bit of processing power possible? I may consider spinning up something more powerful, but I want to make sure I understand how to get the most out of it so I'm not wasting money.
Thanks!
The multiprocessing module is the best option; it lets you harness the full power of your 8 CPUs:
https://docs.python.org/3.3/library/multiprocessing.html
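As a rough sketch of what that can look like for this workload (the classification heuristics, the requests library, and the input file are all stand-ins for whatever your script already does):

```python
from multiprocessing import Pool

import requests  # assumed HTTP library; use whatever your script already uses

def classify_site(url):
    # stand-in heuristics for whatever your script actually checks
    try:
        resp = requests.get(url, timeout=10)
        if "wp-content" in resp.text:
            return url, "WordPress"
        if "Drupal" in resp.headers.get("X-Generator", ""):
            return url, "Drupal"
        return url, "unknown"
    except requests.RequestException:
        return url, "error"

if __name__ == "__main__":
    with open("sites.txt") as f:                  # hypothetical input list
        urls = [line.strip() for line in f if line.strip()]
    with Pool(processes=8) as pool:               # one worker per CPU
        for url, group in pool.imap_unordered(classify_site, urls, chunksize=100):
            print(url, group)
```

Since this workload spends most of its time waiting on the network, you may find you can raise the process count well above 8 (or use threads) without saturating the CPUs.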
I have just written a script that is intended to run 24/7 to update some files. However, it takes 3 minutes to update one file, so it would take 300 minutes to update 100 files.
Is it possible to run n instances of the script to manage n separate files to speed up the turnaround time?
Yes, it is possible. Use the multiprocessing module to start several concurrent processes. This has the advantage that, unlike with threads, you do not run into problems with the Global Interpreter Lock, as explained in the manual page. The manual page includes all the examples you will need to make your script execute in parallel. Of course this works best if the processes do not have to interact, which your example suggests is the case.
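A minimal sketch, assuming update_file() stands in for your existing per-file logic and the filenames are placeholders:

```python
from multiprocessing import Process

def update_file(path):
    print("updating", path)  # your ~3-minute update logic goes here

if __name__ == "__main__":
    files = ["a.dat", "b.dat", "c.dat"]  # placeholder file names
    workers = [Process(target=update_file, args=(path,)) for path in files]
    for w in workers:
        w.start()   # launch one process per file
    for w in workers:
        w.join()    # wait for them all to finish
```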
I suggest you first find out if there is any way to reduce the 3 minutes in a single thread.
The method I use to discover speedup opportunities is demonstrated here.
That will also tell you if you are purely I/O bound.
If you are completely I/O bound, and all files are on a single disk, parallelism won't help.
In that case, possibly storing the files on a solid-state drive would help.
On the other hand, if you are CPU bound, parallelism will help, as @hochl said.
Regardless, find the speedup opportunities and fix them.
I've never seen any good-size program that didn't have one or several of them.
That will give you one speedup factor, and parallelism will give you another, and the total speedup will be the product of those two factors.
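If you want a quick, code-level first look before trying the sampling approach linked above, the standard library's cProfile is one option (update_file here is a hypothetical stand-in for your per-file routine):

```python
import cProfile
import pstats

def update_file(path):
    ...  # your existing update logic

# profile one representative run and dump the stats to a file
cProfile.run("update_file('some_file.dat')", "update.prof")
pstats.Stats("update.prof").sort_stats("cumulative").print_stats(20)
```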
I'm new to using multiple CPUs to process jobs and was wondering if people could let me know the pros/cons of Parallel Python (or any other Python module) versus Hadoop Streaming.
I have a very large, CPU-intensive process that I would like to spread across several servers.
Since moving data gets harder and harder as its size grows, data locality becomes very important in parallel computing. Hadoop, as a map/reduce framework, maximizes the locality of the data being processed, and it also gives you a way to spread your data efficiently across your cluster (HDFS). So even if you use other parallel modules, as long as your data isn't local to the machines doing the processing, or you have to move the data across the cluster all the time, you won't get the maximum benefit from parallel computing. That's one of the key ideas behind Hadoop.
The main difference is that Hadoop is good at processing big data (dozens of gigabytes to terabytes of data). It provides a simple logical framework called MapReduce, which is well suited to data aggregation, and a distributed storage system called HDFS.
If your inputs are smaller than 1 gigabyte, you probably don't want to use Hadoop.
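For context, a Hadoop Streaming job just runs your scripts over stdin/stdout; a minimal word-count-style mapper sketch looks like this (the matching reducer and the streaming-jar invocation are omitted):

```python
#!/usr/bin/env python
# mapper.py -- Hadoop Streaming feeds input splits on stdin and groups the
# tab-separated key/value pairs printed to stdout by key before the reducer runs.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```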
Upon hearing that the scientific computing project I'm currently running for an investigator (it happens to be the stochastic tractography method described here) would take 4 months on our 50-node cluster, the investigator asked me to examine other options. The project currently uses Parallel Python to farm out chunks of a 4D array to different cluster nodes and put the processed chunks back together.
The jobs I'm currently working with are probably much too coarsely grained (5 seconds to 10 minutes each; I had to increase the default timeout in Parallel Python), and I estimate I could speed up the process by 2-4x by rewriting it to make better use of resources (splitting up the data and putting it back together is taking too long; that should be parallelized as well). Most of the work is done with NumPy arrays.
Let's assume that 2-4 times isn't enough, and I decide to get the code off of our local hardware. For high throughput computing like this, what are my commercial options and how will I need to modify the code?
You might be interested in PiCloud. I have never used it, but their offering apparently includes the Enthought Python Distribution, which covers the standard scientific libraries.
It's tough to say if this will work for your specific case, but the Parallel Python interface is pretty generic. So hopefully not too many changes would be needed. Maybe you can even write a custom scheduler class (implementing the same interface as PP). Actually that might be useful for many people, so maybe you can drum up some support in the PP forum.
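For reference, the core of that interface is just submitting callables and collecting results; here is a rough sketch of the farm-out/reassemble pattern the question describes (process_chunk, the array shape, and the chunk count are hypothetical):

```python
import numpy as np
import pp

def process_chunk(chunk):
    return chunk * 2.0  # placeholder for the real per-chunk computation

data = np.random.rand(8, 64, 64, 64)              # stand-in 4D array
job_server = pp.Server(ppservers=("*",))          # autodiscover cluster nodes
jobs = [job_server.submit(process_chunk, (chunk,), modules=("numpy",))
        for chunk in np.array_split(data, 4)]     # farm out 4 chunks
result = np.concatenate([job() for job in jobs])  # reassemble processed chunks
```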
The most obvious commercial options which come to mind are Amazon EC2 and the Rackspace Cloud. I have played with both and found the Rackspace API a little easier to use.
The good news is that you can prototype and play with their compute instances (short- or long-lived virtual machines running the OS of your choice) for very little investment, typically US$0.10/hr or so. You create them on demand and then release them back to the cloud when you are done, and you only pay for what you use. For example, I saw a demo on Django deployment using 6 Rackspace instances which took perhaps an hour and cost the speakers less than a dollar.
For your use case (it's not clear exactly what you meant by 'high throughput'), you will have to look at your budget and your computing needs, as well as your total network throughput (you pay for that, too). A few small-scale tests and a simple spreadsheet calculation should tell you whether it's really practical or not.
There are Python APIs for both the Rackspace Cloud and Amazon EC2. Whichever you use, I recommend the Python-based Fabric for automated deployment and configuration of your instances.
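As a small illustration of that last point (Fabric's Connection API is an assumption here, and the paths, user, and address are placeholders):

```python
from fabric import Connection

def deploy(host):
    # push a code bundle to a fresh instance and start a worker process
    c = Connection(host, user="ubuntu")            # hypothetical login user
    c.put("worker.tar.gz", "/tmp/worker.tar.gz")   # upload your code bundle
    c.run("mkdir -p ~/app && tar -xzf /tmp/worker.tar.gz -C ~/app")
    c.run("cd ~/app && nohup python worker.py > worker.log 2>&1 &", pty=False)

deploy("203.0.113.10")  # placeholder instance address
```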