Clarification: As per some of the comments, I should clarify that this is intended as a simple framework to allow execution of programs that are naturally parallel (so-called embarrassingly parallel programs). It isn't, and never will be, a solution for tasks which require communication or synchronisation between processes.
I've been looking for a simple process-based parallel programming environment in Python that can execute a function on multiple CPUs on a cluster, with the major criterion being that it needs to be able to execute unmodified Python code. The closest I found was Parallel Python, but pp does some pretty funky things, which can cause the code to not be executed in the correct context (with the appropriate modules imported etc).
I finally got tired of searching, so I decided to write my own. What I came up with is actually quite simple. The problem is, I'm not sure if what I've come up with is simple because I've failed to think of a lot of things. Here's what my program does:
I have a job server which hands out jobs to nodes in the cluster.
The jobs are handed out to servers listening on nodes by passing a dictionary that looks like this:
{
'moduleName':'some_module',
'funcName':'someFunction',
'localVars': {'someVar':someVal,...},
'globalVars':{'someOtherVar':someOtherVal,...},
'modulePath':'/a/path/to/a/directory',
'customPathHasPriority':aBoolean,
'args':(arg1,arg2,...),
'kwargs':{'kw1':val1, 'kw2':val2,...}
}
moduleName and funcName are mandatory, and the others are optional.
A node server takes this dictionary and does:
sys.path.append(modulePath)
globals()[moduleName]=__import__(moduleName, localVars, globalVars)
returnVal = globals()[moduleName].__dict__[funcName](*args, **kwargs)
On getting the return value, the server then sends it back to the job server which puts it into a thread-safe queue.
When the last job returns, the job server writes the output to a file and quits.
I'm sure there are niggles that need to be worked out, but is there anything obvious wrong with this approach? On first glance, it seems robust, requiring only that the nodes have access to the filesystem(s) containing the .py file and the dependencies. Using __import__ has the advantage that the code in the module is automatically run, and so the function should execute in the correct context.
Any suggestions or criticism would be greatly appreciated.
EDIT: I should mention that I've got the code-execution bit working, but the server and job server have yet to be written.
I have actually written something that probably satisfies your needs: jug. If it does not solve your problems, I promise you I'll fix any bugs you find.
The architecture is slightly different: workers all run the same code, but they effectively generate a similar dictionary and ask the central backend "has this been run?". If not, they run it (there is a locking mechanism too). The backend can simply be the filesystem if you are on an NFS system.
I myself have been tinkering with batch image manipulation across my computers, and my biggest problem was the fact that some things don't easily or natively pickle and transmit across the network.
for example: pygame's surfaces don't pickle. these I have to convert to strings by saving them in StringIO objects and then dumping it across the network.
If the data you are transmitting (eg your arguments) can be transmitted without fear, you should not have that many problems with network data.
Another thing comes to mind: what do you plan to do if a computer suddenly "disappears" while doing a task? while returning the data? do you have a plan for re-sending tasks?
Related
Disclaimer: I know this question will annoy some people because it's vague, theoretical, and has little code.
I have a AWS Lambda function in Python which reads a file of denormalized records off S3, formats its contents correctly, and then uploads that to DynamoDB with a batch write. It all works as advertised. I then tried to break up the uploading part of this pipeline into threads with the hope of more efficiently utilizing DynamoDBs write capacity. However, the multithread version is slower by about 50%. Since the code is very long I have included pseudocode.
NUM_THREADS = 4
for every line in the file:
Add line to list of lines
if we've read enough lines for a single thread:
Create thread that uploads list of lines
thread.start()
clear list of lines.
for every thread started:
thread.join()
Important notes and possible sources of the problem I've checked so far:
When testing this locally using DynamoDB Local, threading does make my program run faster.
If instead I use only 1 thread, or even if I use multiple threads but I join the thread right after I start it (effectively single threaded), the program completes much quicker. With 1 thread ~30s, multi thread ~45s.
I have no shared memory between threads, no locks, etc.
I have tried creating new DynamoDB connections for each thread and sharing one connection instead, with no effect.
I have confirmed that adding more threads does not overwhelm the write capacity of DynamoDB, since it makes the same number of batch write requests and I don't have more unprocessed items throughout execution than with a single thread.
Threading should improve the execution time since the program is network bound, even though Python threads do not really run on multiple cores.
I have tried reading the entire file first, and then spawning all the threads, thinking that perhaps it's better to not interrupt the disk IO, but to no effect.
I have tried both the Thread library as well as the Process library.
Again I know this question is very theoretical so it's probably hard to see the source of the issue, but is there some Lambda quirk I'm not aware of? Is there something I else I can try to help diagnose the issue? Any help is appreciated.
Nate, have you completely ruled out a problem on the Dynamodb end? The total number of write requests may be the same, but the number per second would be different with a multi-thread.
The console has some useful graphs to show if your writes (or batch writes) are being throttled at all. If you don't have the right 'back off, retry' logic in your Lambda function, Lambda will just try and try again and your problem gets worse.
One other thing, which might have been obvious to you (but not me!). I was under the impression that batch_writes saved you money on the capacity planning front. (That 200 writes in batches of 20 would only cost you 10 write units, for example. I could have sworn I heard an AWS guy mention this in a presentation, but that's beside the point.)
In fact the batch_writes save you some time, but nothing economically.
One last thought: I'd bet that Lambda processing time is cheaper than upping your Dynamodb write capacity. If you're in no particular rush for Lambda to finish, why not let it run its course on single-thread?
Good luck!
Turns out that the threading is faster, but only when the file reached a certain file size. I was originally work on a file size of about 1/2 MG. With a 10 MG file, the threaded version came out about 50% faster. Still unsure why it wouldn't work with the smaller file, maybe it just needs time to get a'cooking, you know what I mean? Computers are moody things.
As a backdrop I have good experience with python and dynamoDB along with using python's multiprocessing library. Since your file size was fairly small it may have been the setup time of the process that confused you about performance. If you haven't already, use python multiprocessing pools and use map or imap depending on your use case if you need to communicate any data back to the main thread. Using a pool is the darn simpliest way to run multiple processes in python. If you need your application to run faster as a priority you may want to look into using golang concurrency and you could always build the code into binary to use from within python. Cheers.
I am currently working on some simulation code written in C, which runs on different remote machines. While the C part is finished I want to simplify my work by extending it with a python simulation api and some kind of a job-queue system, which should do the following:
1.specifiy a set of parameters on which simulations should be performed and put them into a queue on a host computer
2.perform simulation on remote machines by workers
3.return results to host computer
I had a look at different frameworks for accomplishing this task and my first choice goes down to IPython.parallel. I had a look at the documentation and from what I tested out it seems pretty easy to use. My approach would be to use a load balanced view like explained at
http://ipython.org/ipython-doc/dev/parallel/parallel_task.html#creating-a-loadbalancedview-instance
But what I dont see is:
what happens i.e. if the ipcontroller crashes, is my job queue gone?
what happens if a remote machine crashes? is there some kind of error handling?
Since I run relatively long simulations (1-2 weeks) I don't want my simulations to fail if some part of the system crashes. So is there maybe some way to handle this in IPython.parallel?
My Second approach would be to use pyzmq and implement the jobsystem from scratch.
In this case what would be the best zmq-pattern for this situation?
And last but not least, is there maybe a better framework for this scenario?
What lies behind the curtain is a bit more complex view on how to arrange the work-package flow alongside the ( parallelised ) number-crunching pipeline(s).
Being the work-package of a many CPU-core-week(s),
or
being the lumpsum volume of the job above a few hundred-of-thousands of CPU-core-hours, the principles are similar and follow a common sense.
Key Features
scaleability of the computing performance of all resources involved ( ideally a linear one )
ease of task submission role
fault-resilience of submitted task(s) ( ideally with an automated self-healing )
feasible TCO cost of access to / use of a sufficient pool of resources ( upfront co$ts, recurring co$ts, adaptation$ co$ts, co$ts of $peed )
Approaches to Solution
home-brew architecture for a distributed massively parallel scheduler based self-healing computation engine
re-use of available grid-based computing resources
Based on own experience to solve a need for repetitive runs of numerical intensive optimisation problem over a vast parameterSetVectorSPACE ( which could not be de-composed into any trivialised GPU parallelisation scheme ), selection of the second approach has been validated to be more productive rather than an attempt to burn dozens of man*years in just-another-trial to re-invent a wheel.
Being in academia environment, one may get a way easier to an acceptable access to resources-pool(s) for processing the work-packages, while commercial entities may acquire the same, based on their acceptable budgeting tresholds.
My gut instinct is to suggest rolling your own solution for this, because like you said otherwise you're depending on IPython not crashing.
I would run a simple python service on each node which listens for run commands. When it receives one it launches your C program. However, I suggest you ensure the C program is a true Unix daemon, so when it runs it completely disconnects itself from python. That way if your node python instance crashes you can still get data if the C program executes successfully. Have the C program write the output data to a file or database, and when the task is finished write "finished" to a "status" or something similar. The python service should monitor that file and when finished is indicated it should retrieve the data and send it back to the server.
The central idea of this design is to have as few possible points of failure as possible. As long as the C program doesn't crash, you can still get the data one way or another. As far as handling system crashes, network disconnects, etc, that's up to you.
I'm working on a simple experiment in Python. I have a "master" process, in charge of all the others, and every single process has a connection via unix socket to the master process. I would like to be able for the master process to be able to monitor all of the sockets for a response - but there could theoretically be almost a hundred of them. How would threads impact the memory and performance of the application? What would be the best solution? Thanks a lot!
One hundred simultaneous threads might be pushing the reasonable limits of threading. If you find this is the cleanest way to organize your code, I'd say give it a try, but threading really doesn't scale very far.
What works better is to use a technique like select to wait for one of the sockets to be readable / writable / or has an error to report. This mechanism lets you go to sleep until something interesting happens, handle as many sockets have content to handle, and then go back to sleep again, all in a single thread of execution. Removing the multi-threading can often reduce chances for errors, and this style of programming should get you into the hundreds of connections no trouble. (If you want to go beyond about 100, I'd use the poll functionality instead of select -- constantly rebuilding the list of interesting file descriptors takes time that poll does not require.)
Something to consider is the Python Twisted Framework. They've gone to some length to provide a consistent way to hook callbacks onto events for this exact sort of programming. (If you're familiar with node.js, it's a bit like that, but Python.) I must admit a slight aversion to Twisted -- I never got very far in their documentation without being utterly baffled -- but a lot of people made it further in the docs than I did. You might find it a better fit than I have.
The easiest way to conduct comparative tests of threads versus processes for socket handling is to use the SocketServer in Python's standard library. You can easily switch approaches (while keeping everything else the same) by inheriting from either ThreadingMixIn or ForkingMixIn. Here is a simple example to get you started.
Another alternative is a select/poll approach using non-blocking sockets in a single process and a single thread.
If you're interested in software that is already fully developed and highly evolved, consider these high-performance Python based server packages:
The Twisted framework uses the async single process, single thread style.
The Tornado framework is similar (less evolved, less full featured, but easier to understand)
And Gunicorn which is a high-performance forking server.
This question already has answers here:
How can I sandbox Python in pure Python?
(7 answers)
Python, safe, sandbox [duplicate]
(9 answers)
Closed 9 years ago.
Let's say there is a server on the internet that one can send a piece of code to for evaluation. At some point server takes all code that has been submitted, and starts running and evaluating it. However, at some point it will definitely bump into "os.system('rm -rf *')" sent by some evil programmer. Apart from "rm -rf" you could expect people try using the server to send spam or dos someone, or fool around with "while True: pass" kind of things.
Is there a way to coop with such unfriendly/untrusted code? In particular I'm interested in a solution for python. However if you have info for any other language, please share.
If you are not specific to CPython implementation, you should consider looking at PyPy[wiki] for these purposes — this Python dialect allows transparent code sandboxing.
Otherwise, you can provide fake __builtin__ and __builtins__ in the corresponding globals/locals arguments to exec or eval.
Moreover, you can provide dictionary-like object instead of real dictionary and trace what untrusted code does with it's namespace.
Moreover, you can actually trace that code (issuing sys.settrace() inside restricted environment before any other code executed) so you can break execution if something will go bad.
If none of solutions is acceptable, use OS-level sandboxing like chroot, unionfs and standard multiprocess python module to spawn code worker in separate secured process.
You can check pysandbox which does just that, though the VM route is probably safer if you can afford it.
It's impossible to provide an absolute solution for this because the definition of 'bad' is pretty hard to nail down.
Is opening and writing to a file bad or good? What if that file is /dev/ram?
You can profile signatures of behavior, or you can try to block anything that might be bad, but you'll never win. Javascript is a pretty good example of this, people run arbitrary javascript code all the time on their computers -- it's supposed to be sandboxed but there's all sorts of security problems and edge conditions that crop up.
I'm not saying don't try, you'll learn a lot from the process.
Many companies have spent millions (Intel just spent billions on McAffee) trying to understand how to detect 'bad code' -- and every day machines running McAffe anti-virus get infected with viruses. Python code isn't any less dangerous than C. You can run system calls, bind to C libraries, etc.
I would seriously consider virtualizing the environment to run this stuff, so that exploits in whatever mechanism you implement can be firewalled one more time by the configuration of the virtual machine.
Number of users and what kind of code you expect to test/run would have considerable influence on choices btw. If they aren't expected to link to files or databases, or run computationally intensive tasks, and you have very low pressure, you could be almost fine by just preventing file access entirely and imposing a time limit on the process before it gets killed and the submission flagged as too expensive or malicious.
If the code you're supposed to test might be any arbitrary Django extension or page, then you're in for a lot of work probably.
You can try some generic sanbox such as Sydbox or Gentoo's sandbox. They are not Python-specific.
Both can be configured to restrict read/write to some directories. Sydbox can even sandbox sockets.
I think a fix like this is going to be really hard and it reminds me of a lecture I attended about the benefits of programming in a virtual environment.
If you're doing it virtually its cool if they bugger it. It wont solve a while True: pass but rm -rf / won't matter.
Unless I'm mistaken (and I very well might be), this is much of the reason behind the way Google changed Python for the App Engine. You run Python code on their server, but they've removed the ability to write to files. All data is saved in the "nosql" database.
It's not a direct answer to your question, but an example of how this problem has been dealt with in some circumstances.
I have a code-base that I'm looking to split up and add to by using threading, however I'm relatively new on how to handle it. Please before reading further respect my wish of NOT just re-writing this code and tossing it back at me with the problem solved. I would much rather work the problem out by someone pointing me in the right direction, than someone solving it FOR me; I don't learn well that way.
The fully functioning code-base is here -- It requires the mechanize and beautifulsoup libraries which can be installed via easy_install.
I've separated out all of my functions, and tried to keep the code as clean as possible (I'm sure there are some optimizations in there that I'll get reamed for, but the main problem is how to thread this.
My ultimate goal is to pack this into a thread, and then share cookies between other initialized browser objects in order to do other things while my original code is running 'backgrounded'.
I've tried thus:
class Recon(threading.Thread):
def __init__(self):
threading.Thread.__init__(self)
#Packed the stuff above my original while loop in here, minus functions.
def run(self):
#Packed my code past the while loop in here.
somevar = Recon()
somevar.start()
Problem I'm having is that, once I run the program it will run the things in init, but afterwards it just sits there and freezes on me. No traceback, no errors, just doesn't do anything, doesn't even return my command prompt back to my control.
Could I just get some tips, or a general flow of how to convert this? I got overwhelmed and deleted the code I was trying with so I don't have that example, but do I need to be prepending 'self.' to all of my variables? Do I need to just define my vars as global?
Here is a reproduction of what I'm having trouble with after having tried to convert the script to use threading.
As long as you have a single thread (as in the above snippet, where you instantiate Recon just once), it shouldn't matter much what you do where; but of course I imagine the reason you're introducing threading is to eventually move to having multiple threads active.
If that's the case, then the first key issue is to ensure that you never have two or more threads simultaneously trying to use the same shared system/resource -- for example, multiple threads writing at the same time to ReconFile, in the case of the code at the pastebin URL you mention.
The classic way to avoid such issues is to use locking, but my favorite way is quite different: make sure any such resource is accessed by only one dedicated thread, and use a Queue.Queue instance (intrinsically threadsafe) to have other threads post work-request to the dedicated thread (so instead of writing to ReconFile directly each other thread would make a list of lines to be written contiguously, then .put the list on the queue where the "recon file writing" worker thread is waiting via .get).
When you need to get results back from such actions (not the case here), the requesting thread would place its own personal "queue on which to return results" as part of the "work request packet" it puts to the worker thread's queue. I've presented much more detail about this recommended architecture in the threading chapter of "Python in a Nutshell" 2nd edition (and why, as the book's author, I would of course never recommend you perform an illegal download of a free pirate copy of my book, I can however mention there's plenty of sites offering such pirate copies for download -- the legal way to read my book for free is to sign up for a trial offer to O'Reilly's "safari" online books website).
This does not address the specific problem you're observing, since it's happening when you only have one thread around. I notice that thread is trying to perform lots of I/O on standard input and standard output, which is possibly problematic from a thread -- consider doing the input for a thread before you start it (in the main thread) and for needed output use Python's standard logging module, which is guaranteed to be thread-safe. Do you still observe problems then? If that's the case, then the next step is to pepper your code with logging.info calls so that you can pinpoint exactly where it's stalling -- and tell us about that, so we can try to help from there!