How to schedule jobs on client systems using Python? - python

I have 4 systems: one acts as the master and the other 3 as slaves. I want to execute a set of functions on the client systems and fetch the results back. I previously used Parallel Python for this, but as soon as the job_server is created it runs the given function across all systems. I want to assign a particular function to a particular client machine, and I have no idea how to proceed with the code. Is there a Python framework that allows this?

There is the multiprocessing module: http://docs.python.org/library/multiprocessing.html
You can find examples there that do exactly what you want (see Remote Managers).
As a side note: I soon ran into limitations in the multiprocessing module and now use Pyro as middleware. It's very simple and powerful, but requires a bit more work to set up a basic framework.
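A minimal sketch of the Remote Managers idea, under assumptions that are not in the answer above: the master owns one job queue per slave plus a shared result queue, and each slave connects and reads only its own queue. The names MasterManager, SlaveManager, get_jobs, get_results and FUNCTIONS are hypothetical. To stay self-contained, the server and both slaves run as threads on localhost; on a real cluster each slave would call run_slave() in its own process with the master's address.

```python
import queue
import threading
from multiprocessing.managers import BaseManager

# Hypothetical work functions; every machine needs the same definitions.
FUNCTIONS = {"square": lambda x: x * x, "negate": lambda x: -x}

# One job queue per slave lets the master choose exactly who runs what.
job_queues = {"slave_1": queue.Queue(), "slave_2": queue.Queue()}
results = queue.Queue()


class MasterManager(BaseManager):
    """Server side: owns the queues and exposes them over the network."""


class SlaveManager(BaseManager):
    """Client side: only needs to know the registered names."""


def get_jobs(name):
    return job_queues[name]


def get_results():
    return results


MasterManager.register("get_jobs", callable=get_jobs)
MasterManager.register("get_results", callable=get_results)
SlaveManager.register("get_jobs")
SlaveManager.register("get_results")


def run_slave(name, address, authkey):
    """Connect to the master, run one job from this slave's queue, report back."""
    manager = SlaveManager(address=address, authkey=authkey)
    manager.connect()
    jobs, out = manager.get_jobs(name), manager.get_results()
    func_name, arg = jobs.get()
    out.put((name, FUNCTIONS[func_name](arg)))


def demo():
    authkey = b"change-me"
    # Port 0 asks the OS for a free port; server.address holds the real one.
    server = MasterManager(address=("127.0.0.1", 0), authkey=authkey).get_server()
    threading.Thread(target=server.serve_forever, daemon=True).start()

    # Assign a specific function to a specific slave.
    job_queues["slave_1"].put(("square", 7))
    job_queues["slave_2"].put(("negate", 7))

    slaves = [
        threading.Thread(target=run_slave, args=(name, server.address, authkey))
        for name in job_queues
    ]
    for t in slaves:
        t.start()
    for t in slaves:
        t.join()
    return dict(results.get() for _ in slaves)
```

Calling demo() returns a dict mapping each slave name to the result of the function that was assigned to it. Jobs are sent as (function name, argument) pairs rather than as function objects, so nothing but plain data crosses the network.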

There was a talk at a Pycon in the past few years about a tool that managed remote processes via ssh, which sounded really cool and I meant to check out. I've forgotten what the tool was, but you might find it by looking at the Pycon talks for the past few years.
The Python Parallel Processing wiki page has a big list of tools that might help you.

I don't quite understand what you have now and what you're trying to do. Pyro is easy to use, pretty fast and well documented for letting multiple python processes (on multiple systems) interact with each other. Take a look at that. You'll need to ensure that something is running on all of them so they're listening for commands, but that's not too hard.

In Pythomnic you can have multiple processes on multiple servers exchanging synchronous calls like this:
result = pmnc("other_process").module.method(...) # invokes remote method
So you can either have 3 distinctly named slaves and pick the call target manually:
pmnc("slave_1").foo.bar() # calls bar() in foo.py on slave_1
or have identically named slaves and let the framework pick any available one (if they are interchangeable, of course):
pmnc("slave").foo.bar() # calls bar() in foo.py on any slave

Check out RPyC.

Related

distributed python programming

I am trying to split the execution of a python program to two different machines. I am wondering if there's a way to call the python interpreter on one machine from another. Not running a script on another machine, but rather split the task of execution to two machines.
Over the course of the next couple of months, I will be teaching myself distributed programming, and I thought this would be a good way to start.
I think the first step is to use one machine to call another machine and send it a piece of the program. Then the next step would be that both machines execute the same program together and communicate to avoid problems. The third step would be three machines, etc.
Advice, tips, and thoughts are all welcome...
Disclaimer: I am a developer of SCOOP.
Data-based technologies you may want to get acquainted with for distributed processing are the MPI standard (for multi-computer setups, using mpi4py [preferred] or pympi) and the standard multiprocessing module, which allows remote computation (but is awkward, from my point of view).
You should begin with task-based frameworks, though. They are simple and user-friendly, and both of these qualities were a major focus while creating SCOOP. You can try it with pip install -U scoop. On Windows, you may wish to install PyZMQ first using their executable installers. You can check the provided examples and play with the various parameters to see what makes performance degrade or improve. I encourage you to compare it to alternatives such as Celery for similar work.
Both of these frameworks allow remote launching of Python programs. More importantly, they do the parallel processing for you while you only need to feed them your tasks.
You may want to check Fabric for an easy way to set up your remote environments or even control or launch scripts remotely.
Check out Ray, which is a library for writing parallel and distributed Python.
Ray uses the same syntax to parallelize code on a single multicore machine and in the distributed setting.
If you add the @ray.remote decorator to a function, it can be executed asynchronously in parallel (on any machine in the cluster). Remote function invocations return futures, whose values can be retrieved with ray.get.
The same thing can be done with Python classes (instead of functions), see the documentation for actors.
import ray
import time
ray.init()
@ray.remote
def function(x):
    time.sleep(1)
    return x
args = [1, 2, 3, 4]
# Submit 4 tasks in parallel.
result_ids = [function.remote(x) for x in args]
# Retrieve the results. Assuming at least 4 cores,
# this will take 1 second.
results = ray.get(result_ids)
See the Ray documentation for more. Note, I'm one of the Ray developers.
There is an MPI version for Python [1] [2].
MPI (Message Passing Interface) is a standardized interface, and it is cool because you also find it in C, Java, (Fortran) etc.
It lets you communicate between processes that run remotely. You use these messages for synchronization and for passing information.
You also have collective operations, like broadcast, gather, and reduce.
Have a look at RPyC, you might find it useful.

Code interpreter in a web service

I'd like to build a website with a sandboxed interpreter (or compiler), either on the client side or on the server side, that can take short blocks of code (python/java/c/c++ any common language would do) as input and execute them.
What I want to build is a place where given a programming question, the user can type in the solution and we can run it through some test cases, to either approve the solution or provide a test case where it breaks.
Looking for pointers to libraries, existing implementation or a general idea.
Any help much appreciated.
There are many contest websites that do something like this-- TopCoder and Timus Online Judge are two examples. They don't have much information on the technology, however.
codepad.org is the closest to what you want to do. They run programs on heavily sandboxed and firewalled EC2 servers that are periodically wiped, to prevent exploits.
Codepad is at least partially based on geordi, an IRC bot designed to run arbitrary C++ programs. It uses Haskell and traps system calls to prevent harmful activity.
Of slightly less interest, one of Google App Engine's example projects is a Python shell. It relies on GAE's server-side sandboxing to prevent malicious activity.
In terms of interface, the simplest would be to do something like the International Olympiad in Informatics. Have people write a function with a certain name in the target language, then invoke that from your testing framework. Have simple functions that will let them request information from the framework, if necessary.
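The "function with a certain name" interface can be sketched roughly as follows. This is an illustration under stated assumptions, not how any of the sites above work: the agreed function name solve, the RUNNER script, and the helpers run_case and first_failure are all hypothetical. It runs each test case in a fresh interpreter with a timeout, which only isolates crashes and infinite loops. It is NOT a sandbox and must not be exposed to untrusted code without real sandboxing of the kind described above.

```python
import json
import subprocess
import sys

# Child-process script: load the submission from stdin, call solve(*args),
# and print the result as JSON. "solve" is the agreed-upon function name.
RUNNER = """
import json, sys
namespace = {}
exec(sys.stdin.read(), namespace)
args = json.loads(sys.argv[1])
print(json.dumps(namespace["solve"](*args)))
"""


def run_case(source, args, timeout=5.0):
    """Run one test case in a fresh interpreter; return the result or an error tag."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", RUNNER, json.dumps(args)],
            input=source, capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "<timeout>"
    if proc.returncode != 0:
        return "<error>"
    try:
        return json.loads(proc.stdout)
    except ValueError:
        # The submission printed something of its own to stdout.
        return "<bad output>"


def first_failure(source, cases, timeout=5.0):
    """Return the first (args, expected, got) that fails, or None if all pass."""
    for args, expected in cases:
        got = run_case(source, args, timeout)
        if got != expected:
            return args, expected, got
    return None
```

A submission is approved when first_failure() returns None; otherwise the returned tuple is the breaking test case that can be shown to the user. Arguments and results travel as JSON, so test cases should stick to lists, numbers, and strings.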
For Python you can compile PyPy in sandboxed mode which gives you a complete interpreter and full standard library but without the ability to execute arbitrary system calls. You can also limit the runtime and heap size of executed scripts.
Here's some code I wrote a while back to execute an arbitrary string containing a Python script in the pypy-sandbox binary and return the output. You can call this code from regular CPython.
Take a look at the paper An Enticing Environment for Programming which discusses building just such an environment.

Python - question regarding the concurrent use of `multiprocess`

I want to use Python's multiprocessing to do concurrent processing without using locks (locks to me are the opposite of multiprocessing) because I want to build up multiple reports from different resources at the exact same time during a web request (normally takes about 3 seconds but with multiprocessing I can do it in .5 seconds).
My problem is that, if I expose such a feature to the web and get 10 users pulling the same report at the same time, I suddenly have 60 interpreters open at the same time (which would crash the system). Is this just the common sense result of using multiprocessing, or is there a trick to get around this potential nightmare?
Thanks
If you're really worried about having too many instances you could think about protecting the call with a Semaphore object. If I understand what you're doing then you can use the threaded semaphore object:
from threading import Semaphore
sem = Semaphore(10)
with sem:
    make_multiprocessing_call()
I'm assuming that make_multiprocessing_call() will clean up after itself.
This way only 10 "extra" instances of Python will ever be open; if another request comes along, it will just have to wait until a previous one has completed. Unfortunately this won't be in "Queue" order ... or any order in particular.
Hope that helps
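To see the cap in action, here is a small sketch that measures how many calls are ever inside the semaphore at once. The function peak_concurrency and the time.sleep() stand-in for make_multiprocessing_call() are assumptions made for the demonstration.

```python
import threading
import time


def peak_concurrency(limit, requests):
    """Fire `requests` threads at a Semaphore(limit) and report the peak overlap."""
    sem = threading.Semaphore(limit)
    guard = threading.Lock()          # protects the counters below
    state = {"active": 0, "peak": 0}

    def handle_request():
        with sem:                     # at most `limit` threads get past here
            with guard:
                state["active"] += 1
                state["peak"] = max(state["peak"], state["active"])
            time.sleep(0.05)          # stand-in for make_multiprocessing_call()
            with guard:
                state["active"] -= 1

    threads = [threading.Thread(target=handle_request) for _ in range(requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]
```

With limit=3 and requests=10, the returned peak never exceeds 3: the other requests wait their turn instead of spawning more interpreters.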
You are barking up the wrong tree if you are trying to use multiprocessing to add concurrency to a network app. You are barking up a completely wrong tree if you're creating processes for each request. multiprocessing is not what you want (at least as a concurrency model).
There's a good chance you want an asynchronous networking framework like Twisted.
Locks are only ever necessary if you have multiple agents writing to a source. If they are just reading, locks are not needed (and, as you said, defeat the purpose of multiprocessing).
Are you sure that would crash the system? On a web server using CGI, each request spawns a new process, so it's not unusual to see thousands of simultaneous processes (granted, in Python one should use WSGI and avoid this), which do not crash the system.
I suggest you test your theory -- it shouldn't be difficult to manufacture 10 simultaneous accesses -- and see if your server really does crash.
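One way to manufacture simultaneous accesses is a small script that releases several threads at the same moment. The sketch below is an assumption-laden illustration: hammer() is a hypothetical helper, and ReportHandler is a trivial local stand-in for the real report endpoint so the example runs on its own. Point hammer() at your actual host, port, and path to test the real server.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def hammer(host, port, path="/", n=10):
    """Send n GET requests at (roughly) the same time; return the status codes."""
    barrier = threading.Barrier(n)
    statuses = [None] * n             # None marks a request that failed

    def worker(i):
        barrier.wait()                # release all workers together
        conn = http.client.HTTPConnection(host, port, timeout=10)
        try:
            conn.request("GET", path)
            statuses[i] = conn.getresponse().status
        finally:
            conn.close()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return statuses


class ReportHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the report endpoint."""

    def do_GET(self):
        body = b"report"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):     # keep the demo quiet
        pass


def demo():
    server = ThreadingHTTPServer(("127.0.0.1", 0), ReportHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return hammer("127.0.0.1", server.server_address[1])
    finally:
        server.shutdown()
        server.server_close()
```

Watching memory and process counts on the server while hammer() runs against the real endpoint shows whether 10 simultaneous reports are actually a problem.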

What are some good ways to do intermachine locking?

Our server cluster consists of 20 machines, each running 10 processes (pids) with 5 threads apiece. We'd like some way to prevent any two threads, in any pid, on any machine, from modifying the same object at the same time.
Our code's written in Python and runs on Linux, if that helps narrow things down.
Also, it's a pretty rare case that two such threads want to do this, so we'd prefer something that optimizes the "only one thread needs this object" case to be really fast, even if it means that the "one thread has locked this object and another one needs it" case isn't great.
What are some of the best practices?
If you want to synchronize across machines you need a Distributed Lock Manager.
I did some quick googling and came up with: Stackoverflow.
Unfortunately they only suggest a Java version, but it's a start.
If you are trying to synchronize access to files: your filesystem should already have some sort of locking service in place. If not, consider changing it.
I assume you came across this blog post http://amix.dk/blog/post/19386 during your googling?
The author demonstrates a simple interface to memcachedb, which he uses as a dummy distributed lock manager. It's a great idea, and memcache is probably one of the faster things you'll be able to interface with. Note that it does use the more recently added with statement.
Here is an example usage from his blog post:
from __future__ import with_statement
import memcache
from memcached_lock import dist_lock
client = memcache.Client(['127.0.0.1:11211'])
with dist_lock('test', client):
    print('Is there anybody out there!?')
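For intuition about how a cache can act as a lock manager, here is a sketch of one common technique: memcached's add operation stores a key only if it is absent, so whoever succeeds in adding the key holds the lock. This is not the blog post's implementation of dist_lock. FakeClient is a hypothetical in-memory stand-in for a memcached client so the example runs without a server, and add_lock is an illustrative name.

```python
import threading
import time
from contextlib import contextmanager


class FakeClient:
    """In-memory stand-in for a memcached client (hypothetical, for illustration)."""

    def __init__(self):
        self._data = {}
        self._guard = threading.Lock()

    def add(self, key, value):
        # Like memcached "add": store only if the key is absent.
        with self._guard:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def delete(self, key):
        with self._guard:
            self._data.pop(key, None)


@contextmanager
def add_lock(client, key, timeout=5.0, poll=0.01):
    """Hold a lock named `key` for the duration of the with block."""
    deadline = time.monotonic() + timeout
    while not client.add("lock:" + key, 1):
        if time.monotonic() > deadline:
            raise TimeoutError("could not lock %r" % key)
        time.sleep(poll)              # someone else holds it; retry shortly
    try:
        yield
    finally:
        client.delete("lock:" + key)  # release so the next waiter can proceed
```

The uncontended case costs a single add call, which suits the "only one thread needs this object" scenario from the question. Against a real cache, the key would also need an expiry time so that a crashed holder cannot block everyone forever.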
If you can get the complete infrastructure for a distributed lock manager, then go ahead and use that. But that infrastructure is not easy to set up! Here is a practical solution:
- Designate the node with the lowest IP address as the master node (that means if the node with the lowest IP address hangs, the node with the next-lowest IP address becomes the new master).
- Let all nodes contact the master node to get the lock on the object.
- Let the master node use native lock semantics to get the lock.
This will simplify things, unless you need a complete clustering infrastructure and DLM to do the job.
Write code using immutable objects. Write objects that implement the Singleton Pattern.
Use a stable distributed messaging technology such as IPC, web services, or XML-RPC.
I would take a look at Twisted. It has plenty of solutions for such tasks.
I wouldn't use threads in Python, especially given the GIL. I would use processes as the working applications and one of the comms technologies described above for communication between them.
Your singleton class could then live in one of these applications and be accessed via the comms technology of choice.
Not a fast solution with all the interfacing, but if done correctly should be stable.
There may be a better way of doing this, but I would use the Lock class from the threading module to access the "protected" objects in a with statement. Here is an example:
from __future__ import with_statement
from threading import Lock
mylock = Lock()
# The Lock itself is the context manager; acquire() only returns a bool.
with mylock:
    [ 'do things with protected data here' ]
[ 'the rest of the code' ]
For more examples of Lock usage, have a look here.
Edit: this solution isn't suitable for this question as threading.Lock is not distributed, sorry

How can I do synchronous rpc calls

I'm building a program that has a class used locally, but I want the same class to be used the same way over the network. This means I need to be able to make synchronous calls to any of its public methods. The class reads and writes files, so I think XML-RPC is too much overhead. I created a basic rpc client/server using the examples from twisted, but I'm having trouble with the client.
c = ClientCreator(reactor, Greeter)
c.connectTCP(self.host, self.port).addCallback(request)
reactor.run()
This works for a single call: when the data is received I call reactor.stop(). But if I make any more calls, the reactor won't restart. Is there something else I should be using for this? Maybe a different Twisted module or another framework?
(I'm not including the details of how the protocol works, because the main point is that I only get one call out of this.)
Addendum & Clarification:
I shared a google doc with notes on what I'm doing. http://docs.google.com/Doc?id=ddv9rsfd_37ftshgpgz
I have a version written that uses fuse and can combine multiple local folders into the fuse mount point. The file access is already handled within a class, so I want to have servers that give me network access to the same class. After continuing to search, I suspect pyro (http://pyro.sourceforge.net/) might be what I'm really looking for (simply based on reading their home page right now) but I'm open to any suggestions.
I could achieve similar results by using an NFS mount and combining it with my local folder, but I want all of the peers to have access to the same combined filesystem, so that would require every computer to be an NFS server with a number of NFS mounts equal to the number of computers in the network.
Conclusion:
I have decided to use RPyC, as it gave me exactly what I was looking for: a server that keeps an instance of a class that I can manipulate as if it were local. If anyone is interested, I put my project up on Launchpad (http://launchpad.net/dstorage).
If you're even considering Pyro, check out RPyC first, and re-consider XML-RPC.
Regarding Twisted: try leaving the reactor up instead of stopping it, and just ClientCreator(...).connectTCP(...) each time.
If you self.transport.loseConnection() in your Protocol you won't be leaving open connections.
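On the XML-RPC suggestion: the standard library can expose every public method of an instance with very little code, and each call on the client blocks until the result arrives. The sketch below is only an illustration under assumptions. Storage is a hypothetical stand-in for the asker's file-handling class, and the server runs in a background thread on localhost so the example is self-contained.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


class Storage:
    """Hypothetical stand-in for the class that is normally used locally."""

    def __init__(self):
        self._files = {}

    def write(self, name, text):
        self._files[name] = text
        return len(text)

    def read(self, name):
        return self._files[name]


def serve(instance, host="127.0.0.1", port=0):
    """Expose the public methods of `instance`; port 0 picks a free port."""
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_instance(instance)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def demo():
    server = serve(Storage())
    host, port = server.server_address
    remote = ServerProxy("http://%s:%d" % (host, port))
    try:
        remote.write("a.txt", "hello")   # synchronous call over the network
        return remote.read("a.txt")
    finally:
        server.shutdown()
        server.server_close()
```

The client code reads the same as local use of Storage, apart from how remote is obtained. Whether the XML encoding overhead matters depends on how large the file contents are.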
For a synchronous client, Twisted probably isn't the right option. Instead, you might want to use the socket module directly.
import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((self.host, self.port))
s.send(output)
data = s.recv(size)
s.close()
The recv() call might need to be repeated until you get an empty string, but this shows the basics.
Alternatively, you can rearrange your entire program to support asynchronous calls...
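The repeated recv() mentioned above can be written as a small loop. This sketch assumes the simplest protocol, where the peer closes its side of the connection when it has finished sending. recv_all is a hypothetical helper name.

```python
import socket


def recv_all(sock, bufsize=4096):
    """Read from sock until the peer closes the connection; return all bytes."""
    chunks = []
    while True:
        chunk = sock.recv(bufsize)
        if not chunk:                 # b"" means the peer closed its end
            break
        chunks.append(chunk)
    return b"".join(chunks)
```

If the connection has to stay open for further calls, waiting for an empty read no longer works, and the protocol needs a length prefix or a delimiter to mark where each reply ends.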
Why do you feel that it needs to be synchronous?
If you want to ensure that only one of these is happening at a time, invoke all of the calls through a DeferredSemaphore so you can rate limit the actual invocations (to any arbitrary value).
If you want to be able to run multiple streams of these at different times, but don't care about concurrency limits, then you should at least separate reactor startup and teardown from the invocations (the reactor should run throughout the entire lifetime of the process).
If you just can't figure out how to express your application's logic in a reactor pattern, you can use deferToThread and write a chunk of purely synchronous code -- although I would guess this would not be necessary.
If you are using Twisted you should probably know that:
- You will not be making synchronous calls to any network service.
- The reactor can only ever be run once, so do not stop it (by calling reactor.stop()) until your application is ready to exit.
I hope this answers your question. I personally believe that Twisted is exactly the correct solution for your use case, but that you need to work around your synchronicity issue.
Addendum & Clarification:
Part of what I don't understand is that when I call reactor.run() it seems to go into a loop that just watches for network activity. How do I continue running the rest of my program while it uses the network? If I can get past that, then I can probably work through the synchronicity issue.
That is exactly what reactor.run() does. It runs a main loop which is an event reactor. It will wait not only for network events, but for anything else you have scheduled to happen. With Twisted you will need to structure the rest of your application to deal with its asynchronous nature. Perhaps if we knew what kind of application it is, we could advise.
