Reading the stdout from slave nodes with ipcluster


I've set up a cluster using
ipcluster start --n=8
then accessed it using
from IPython.parallel import Client
c=Client()
dview=c[:]
e=[i for i in c]
I'm running processes on the slave nodes (e[0]-e[7]) which take a lot of time and I'd like them to send progress reports to the master so I can keep an eye on how far through they are.
There are two ways I can think to do this but so far I haven't been able to implement either of them, despite hours of trawling through question pages.
Either I want the nodes to push some data back to the master without being prompted, i.e. within the long process that runs on the nodes I implement a function which passes its progress to the master at regular intervals.
Or I could redirect the stdout of the nodes to that of the master and then just keep track of the progress using print. This is what I've been working on so far. Each node has its own stdout, so print doesn't do anything visible if run remotely. I've tried pushing sys.stdout to the nodes but this just closes it.
I can't believe I'm the only person who wants to do this so maybe I'm missing something very simple. How can I keep track of long processes happening remotely using ipython?

stdout is already captured, logged, and tracked, and arrives at Clients as it comes, before the result is complete.
IPython ships with an example script that monitors stdout/err of all engines, which can easily be tweaked to only monitor a subset of this information, etc.
In the Client itself, you can check the metadata dict for stdout/err (Client.metadata[msg_id].stdout) before results are done. Use Client.spin() to flush any incoming messages off of the zeromq sockets, to ensure this data is up-to-date.
If you want stdout to update frequently, make sure you call sys.stdout.flush() to guarantee that the stream is actually published at that point, rather than relying on implicit flushes, which may not happen until the work completes.
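A minimal sketch of the polling approach described above, assuming the classic IPython.parallel Client interface (spin() and the metadata dict, as mentioned in this answer). The names long_task and snapshot_stdout are made up for illustration, and the commented usage at the bottom needs a running ipcluster.

```python
def long_task(steps=5, delay=1.0):
    """Runs on an engine: print progress and flush so it is published promptly."""
    # Imports live inside the function so they resolve in the engine's namespace.
    import sys
    import time
    for i in range(steps):
        print("step %d of %d" % (i + 1, steps))
        sys.stdout.flush()  # do not rely on an implicit flush at the end
        time.sleep(delay)
    return steps


def snapshot_stdout(client, msg_ids):
    """Return {msg_id: stdout captured so far} for the given tasks."""
    client.spin()  # flush incoming messages off the zeromq sockets first
    return {msg_id: client.metadata[msg_id]["stdout"] for msg_id in msg_ids}


# Hypothetical usage against a running cluster (not executed here):
#
#   import time
#   from IPython.parallel import Client
#   c = Client()
#   ar = c[:].apply_async(long_task, 20)
#   while not ar.ready():
#       for msg_id, out in snapshot_stdout(c, ar.msg_ids).items():
#           print(msg_id, out.splitlines()[-1:] )
#       time.sleep(2)
```

The master only reads what the engines have already published, so the flush inside long_task is what makes the progress show up before the task finishes.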

Related

Can I allow my server process to restart without killing existing connections?

In an attempt to make my terminal-based program survive longer, I was told to look into forking the process off of system. I can't find much about specifying a PID from which to spawn a new process.
Is this possible in Linux? I am mainly a Windows guy.
My program is going to be dealing with sockets and if my application crashed then I would lose lots of information. I was under the impression that if it was forked from system the sockets would stay alive?
EDIT: Here is what I am trying to do. I have multiple computers that I want to communicate with. So I am building a program that lets me listen on a socket(simple). Then I will connect to it from each of my remote computers(simple).
Once I have a connection I want to open a new terminal, and use my program to start interacting with the remote computer(simple).
The questions came from this portion. The client shell will send all traffic to the main shell, which will then send it out to the remote computer. When a response is received, it goes to the main shell, which forwards it to the client shell.
The issue is keeping each client shell in the loop. I want all client shells to know who is connected to who on each client shell. So client shell 1 should tell me if I have a client shell 2, 3, 4, 5, etc and who is connected to it. This jumped into sharing resources between different processes. So I was thinking about using local sockets to send data between all these client shells. But then I ran into a problem if the main shell were to die, everything is lost. So I wanted a way to try and secure it.
If that makes sense.

So, you want to be able to reload a program without losing your open socket connections?
The first thing to understand is that when a process exits, all open file descriptors are closed. This includes socket connections. Running as a daemon does not change that. A process becomes a daemon by becoming independent of your terminal session, so that it will continue to run when your terminal session ends. But, like any other process, when a daemon terminates for any reason (normal exit, crash, kill, machine restart, etc.), all connections to it cease to exist. BTW this is not specific to unix, Windows is the same.
So, the short answer to your question is NO, there's no way to tell unix/linux to not close your sockets when your process stops, it will close them and that's that.
The long answer is, there are a few ways to re-engineer things to get around this:
1) You can have your program exec() itself when you send it a special message or signal (eg SIGHUP). In unix, exec (or its several variants), does not end or start any process, it simply loads code into the current process and starts execution. The new code takes the place of the old within the same process. Since the process remains the same, any open files remain open. However you will lose any data that you had in memory, so the sockets will be open, but your program will know nothing about them. On startup you'd have to use various system calls to discover which descriptors are open in your process and whether any of them are socket connections to clients. One way to get around this would be to pass critical information as command line arguments or environment variables which can be passed through the exec() call and thus preserved for use of the new code when it starts executing.
Keep in mind that this only works when the process calls exec ITSELF while it is still running. So you cannot recover from a crash or any other cause of your process ending: your connections will be gone. But this method does solve the problem of wanting to load new code without losing your connections.
2) You can bypass the issue by dividing your server (master) into two processes. The first (call it the "proxy") accepts the TCP connections from the clients and keeps them open. The proxy can never exit, so it should be kept so simple that you'll rarely want to change that code. The second process runs the "worker", which is the code that implements your application logic. All the code you might want to change often should go in the worker. Now all you need to do is establish interprocess communication between the proxy and the worker, and make sure that if the worker exits, there's enough information in the proxy to re-establish your application state when the worker starts up again. In a really simple, low volume application, the mechanism can be as simple as the proxy doing a fork() + exec() of the worker each time it needs to do something. A fancier way to do this, which I have used with good results, is a unix domain datagram (SOCK_DGRAM) socket. The proxy receives messages from the clients and forwards them to the worker through the datagram socket; the worker does the work and sends the result back to the proxy, which in turn forwards it back to the client. This works well because as long as the proxy is running and has opened the unix domain socket, the worker can restart at will. Shared memory can also work as a way to communicate between proxy and worker.
3) You can use a unix domain socket together with the sendmsg() and recvmsg() functions and an SCM_RIGHTS control message to pass not the client data itself, but the open socket file descriptors from the old instance to the new. This is the standard way to pass open file descriptors between unrelated processes. Using this mechanism, there are all sorts of strategies you can implement. For example, you could start a new instance of your master program, and have it connect (via a unix domain socket) to the old instance and transfer all the sockets over. Then your old instance can exit. Or, you can use the proxy/worker model, but instead of passing messages through the proxy, you can just have the proxy hand the socket descriptor to the worker via the unix domain socket between them, and then the worker can talk directly to the client using that descriptor. Or, you could have your master send all its socket file descriptors to another "stash" process that holds on to them in case the master needs to restart. There are all sorts of architectures possible. Keep in mind that the operating system just provides the ability to ship the descriptors around; all the other logic you have to code yourself.
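To make option 3 a bit more concrete, here is a rough Python sketch of handing an open socket over a unix domain socket with an SCM_RIGHTS control message. The helpers send_fd and recv_fd are made-up names (they follow the pattern shown in the Python socket documentation), and for brevity both "instances" live in one process here; in practice the two ends would belong to the old and the new server process.

```python
import array
import socket


def send_fd(channel, fd):
    """Send one open file descriptor over a unix domain socket."""
    ancillary = [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))]
    channel.sendmsg([b"F"], ancillary)  # at least one byte of real data is required


def recv_fd(channel):
    """Receive one file descriptor sent with send_fd()."""
    fds = array.array("i")
    _msg, ancdata, _flags, _addr = channel.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore any trailing bytes that do not form a whole int.
            fds.frombytes(data[:len(data) - (len(data) % fds.itemsize)])
    return fds[0]


def handover_demo():
    # Channel between the "old" and the "new" server instance.
    old_end, new_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    # Stand-in for an established client connection held by the old instance.
    client_side, server_side = socket.socketpair()

    send_fd(old_end, server_side.fileno())
    server_side.close()  # the old instance lets go; the connection stays open

    adopted = socket.socket(fileno=recv_fd(new_end))
    adopted.sendall(b"hello from the new instance")
    data = client_side.recv(1024)

    for s in (adopted, client_side, old_end, new_end):
        s.close()
    return data
```

The receiving side gets a new descriptor number that refers to the same open connection, so the client never notices the handover. The new instance still has to be told what each descriptor is for; that bookkeeping is up to you.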
4) You can accept that no matter how careful you are, inevitably connections will be lost. Networks are unreliable, programs crash sometimes, machines are restarted. So rather than going to significant effort to make sure your connections don't close, you can instead engineer your system to recover when they inevitably do.
The simplest approach to this would be: Since your clients know who they wish to connect to, you could have your client processes run a loop where, if the connection to the master is lost for any reason, they periodically try to reconnect (let's say every 10-30 seconds), until they succeed. So all the master has to do is to open up the rendezvous (listening) socket and wait, and the connections will be re-established from every client that is still out there running. The client then has to re-send any information it has which is necessary to re-establish proper state in the master.
The list of connected computers can be kept in the memory of the master, there is no reason to write it to disk or anywhere else, since when the master exits (for any reason), those connections don't exist anymore. Any client can then connect to your server (master) process and ask it for a list of clients that are connected.
Personally, I would take this last approach. Since it seems that in your system, the connections themselves are much more valuable than the state of the master, being able to recover them in the event of a loss would be the first priority.
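A minimal sketch of the client-side reconnect loop from option 4, assuming plain TCP. The function name connect_with_retry and its parameters are made up for illustration; the 10 second default simply mirrors the interval suggested above.

```python
import socket
import time


def connect_with_retry(host, port, retry_delay=10.0, max_attempts=None):
    """Keep trying to reach the master until it answers.

    Returns a connected socket, or None if max_attempts is set and exhausted.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        try:
            return socket.create_connection((host, port), timeout=5)
        except OSError:
            # Master is down or unreachable: wait, then try again.
            time.sleep(retry_delay)
    return None
```

After a successful reconnect the client would re-send whatever the master needs to rebuild its state (for example, its own name), as described above.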
In any case, since it seems that the role of the master is simply to pass data back and forth among clients, this would be a good application of "asynchronous" socket I/O using the select() or poll() functions. These let you communicate over multiple sockets in one process without blocking. Here's a good example of a poll() based server that accepts multiple connections:
https://www.ibm.com/support/knowledgecenter/ssw_ibm_i_71/rzab6/poll.htm
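For a Python flavour of the same idea, here is a rough sketch using the standard selectors module, which wraps select()/poll()/epoll(). It is an assumption-laden toy: it echoes data back to the sender instead of relaying between clients, and make_listener and run_once are made-up names.

```python
import selectors
import socket


def make_listener(host="127.0.0.1", port=0):
    """Create a non-blocking listening socket (port 0 = pick any free port)."""
    listener = socket.socket()
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen()
    listener.setblocking(False)
    return listener


def run_once(sel, listener, timeout=1.0):
    """Handle whatever sockets are ready right now, then return."""
    for key, _events in sel.select(timeout):
        if key.fileobj is listener:
            conn, _addr = listener.accept()
            conn.setblocking(False)
            sel.register(conn, selectors.EVENT_READ)
        else:
            conn = key.fileobj
            data = conn.recv(4096)
            if data:
                # Toy behaviour: echo. A real master would look up the
                # destination and forward there; large writes would also
                # need buffering instead of a bare sendall().
                conn.sendall(data)
            else:
                sel.unregister(conn)  # peer closed the connection
                conn.close()
```

A server would register the listener with a selectors.DefaultSelector() and call run_once in a loop; all connections are then served from one process without threads.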
As far as running your process "off System" goes: in Unix/Linux this is referred to as running as a daemon. In *ix, these processes are children of process id 1, the init process, which is the first process that starts when the system starts. You can't tell your process to become a child of init; this happens automatically when the existing parent exits. All "orphaned" processes are adopted by init. Since there are many easily found examples of writing a unix daemon (at this point the code you need to write to do this has become pretty standardized), I won't paste any code here, but here's one good example I found: http://web.archive.org/web/20060603181849/http://www.linuxprofilm.com/articles/linux-daemon-howto.html#ss4.1
If your linux distribution uses systemd (a recent replacement for init in some distributions), then you can run it as a systemd service, which is systemd's idea of a daemon, but it does some of the work for you (for better or for worse: there are a lot of complaints about systemd, and wars have been fought over it).

Forking from your own program is one approach; however, a much simpler one is to create a service. A service is a little wrapper around your program that deals with keeping it running, restarting it if it fails, and providing ways to start and stop it.
This link shows you how to write a service. Although it's specifically for a web server application, the same logic can be applied to anything.
https://medium.com/@benmorel/creating-a-linux-service-with-systemd-611b5c8b91d6
Then to start the program you would write:
sudo systemctl start my_service_name
To stop it:
sudo systemctl stop my_service_name
To view its outputs:
sudo journalctl -u my_service_name

Is it possible to create a long running process in NodeJs

Is it possible to create a long running process in NodeJs to handle many background operations without interrupting the main thread; something like Celery in Python.
Hint, it's highly preferable to be able to manage that long-running process, in case of failure, or need to be restarted, away from the main process.

http://nodejs.org/api/child_process.html is the right API to create long-running processes. You will have complete control over the child processes (access to stdin/out/err, can send signals, etc.). This approach, however, requires that your node process is the parent of those children. If you want the child to outlive the parent, take a look at options.detached during child creation (followed by child.unref()).
Please note, however, that Node.js is extremely well suited to avoiding such an architecture. Typically Node.js does all the background work in the main thread. I've written apps with lots of traffic (thousands of requests per second), with DB, Redis and RabbitMQ access all from the main thread and without any child processes, and it worked fine, as it should, thanks to Node's evented I/O system.
I generally use the child_process API only to launch separate executables (e.g. ffmpeg to transcode a video file); apart from such scenarios, separate processes are probably not what you want.
There is also the cluster API, which allows a single master to handle numerous worker processes, though I think it isn't what you're looking for either.

You can create a child process to handle your background operations, and then use messages to pass data between the new process and your main thread.
http://nodejs.org/api/child_process.html
Update
It looks like you need a job queue server, something like beanstalkd http://kr.github.io/beanstalkd/ + https://www.npmjs.com/package/fivebeans.

How to detect unresponsive/frozen processes?

I have several scripts that I use to do some web crawling. They are always running, and should never stop. However, after about a week, they systematically "freeze": there is no output anymore, no response to Ctrl+C or anything. The only way is to kill the process and restart it.
I suspect that these issues come from the library I use for retrieving the data (urllib2), but the issue is very hard to reproduce.
I am thus wondering how I could check the state of the process and kill/restart it automatically if it is frozen. I was thinking of creating a PID file, and update it regularly. Another script could then periodically check the last modification date of this PID file, and restart the process if it's too old. I could use something like Monit to do the monitoring.
Is this how I should do it? Is there another best practice/common way for checking the responsiveness of a process?

If you have a process that is always running, has no connected terminal, and is the process group leader - that is a daemon. You undoubtedly know all that.
There are some de facto practices in coding programs like that. One is to have a signal handler which takes SIGHUP and forces the program to reinitialize itself. This means closing all of the open log files, rereading config scripts, etc. I do not know how applicable that is to your problem but it sometimes solves issues like frozen daemons at my work.
You can customize the idea by employing the SIGUSR1 and SIGUSR2 signals to do special things, like writing status to a file, or anything else. Since signals arrive asynchronously, the trap statement in shell scripts and signal handlers in Python will interrupt the normal program flow, do "stuff", and then resume.
In your case you may want the program to fork/exec itself and then kill the parent.
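A small sketch of the signal-handler idea in Python, under some assumptions: the status file path is passed in by the caller, the state dict and its keys are invented for illustration, and "reinitialize" is reduced to bumping a counter where a real program would reopen logs and reread its config.

```python
import os
import signal
import time

# Hypothetical program state that the main loop would keep up to date.
state = {"pages_fetched": 0, "reloads": 0}


def install_handlers(status_path):
    """Register SIGUSR1 (dump status) and SIGHUP (reinitialize) handlers."""

    def write_status(signum, frame):
        with open(status_path, "w") as f:
            f.write("pid=%d pages_fetched=%d written_at=%d\n"
                    % (os.getpid(), state["pages_fetched"], int(time.time())))

    def reinitialize(signum, frame):
        # Placeholder: close and reopen log files, reread configuration, etc.
        state["reloads"] += 1

    signal.signal(signal.SIGUSR1, write_status)
    signal.signal(signal.SIGHUP, reinitialize)
```

From a shell, kill -USR1 <pid> would then refresh the status file. One caveat: Python runs these handlers only when the interpreter gets control back, so a process stuck inside a blocking C call will not answer. A watchdog can treat a status file that does not update after SIGUSR1 as a sign that the process is frozen.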

Have python process talk back on SIGUSR1 call

I have a Python script that can run for a long time in the background, and I am trying to find a way of getting a status update from it. Basically we're considering sending it a SIGUSR1 signal and then having it report back a status update.
Catching the signal in Python is not the issue; there is lots of information about that.
But how do I get information back to the process initiating the signal? It seems that there is no way for the receiving process to figure out the pid of the initiating process, which could provide a way to send information back. A single reply message is enough here (along the lines of 'busy uploading; at 55% now; will finish at such a time'); a continuing update would be fantastic but not necessary.
What I've come up with is to write this data to a temporary file with a predetermined name. This has the issue of leaving stale files behind, so it needs some kind of clean-up routine. But that sounds like a hack. Is there anything better available?
The way the running process is signalled doesn't matter; it doesn't have to be kill -SIGUSR1 pid. Any way to communicate with it would do, as long as the communication can be initiated from a new process that's started after the main process runs, possibly running as a different user.

Signals are not designed to be general inter-process communication mechanisms that allow for passing data. They can't do much more than provide a notification. What the target process does in response can be fairly general (generating output to a particular file that the sender then knows to go look at, for example), but passing data directly back to the sender would require a different mechanism like a pipe, shared memory, message queue, etc. Also note that, in general, a process receiving a signal can't really determine who sent the signal, so it wouldn't know where to send a response anyway.
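As one example of such a "different mechanism", here is a rough sketch where the long-running script answers status queries on a unix domain socket instead of reacting to a signal. The names STATUS, start_status_server and query_status are invented for illustration, and the socket path is whatever the two sides agree on.

```python
import os
import socket
import socketserver
import threading

# The main program updates this as it makes progress.
STATUS = {"text": "starting"}


class StatusHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # Send the current status; the connection closes when we return.
        self.request.sendall(STATUS["text"].encode())


def start_status_server(path):
    """Serve status queries on a unix socket from a background thread."""
    if os.path.exists(path):
        os.unlink(path)  # remove a stale socket left by a previous run
    server = socketserver.UnixStreamServer(path, StatusHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def query_status(path):
    """Run by the asking process: connect, read until EOF, return the text."""
    chunks = []
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(path)
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode()
```

The asking process can be started at any time after the main one, and the reply goes straight back over the connection it opened, so nobody needs to know anybody's pid. Access for other users would be governed by the permissions on the socket file.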

Python "Task Server"

My question is: which python framework should I use to build my server?
Notes:
This server talks HTTP with its clients: GET and POST (via pyAMF)
Clients "submit" "tasks" for processing and, then, sometime later, retrieve the associated "task_result"
submit and retrieve might be separated by days - different HTTP connections
The "task" is a lump of XML describing a problem to be solved, and a "task_result" is a lump of XML describing an answer.
When a server gets a "task", it queues it for processing
The server manages this queue and, when tasks get to the top, organises that they are processed.
the processing is performed by a long running (15 mins?) external program (via subprocess) which is fed the task XML and which produces a "task_result" lump of XML which the server picks up and stores (for later Client retrieval).
it serves a couple of basic HTML pages showing the Queue and processing status (admin purposes only)
I've experimented with twisted.web, using SQLite as the database and threads to handle the long running processes.
But I can't help feeling that I'm missing a simpler solution. Am I? If you were faced with this, what technology mix would you use?

I'd recommend using an existing message queue. There are many to choose from (see below), and they vary in complexity and robustness.
Also, avoid threads: let your processing tasks run in a different process (why do they have to run in the webserver?)
By using an existing message queue, you only need to worry about producing messages (in your webserver) and consuming them (in your long running tasks). As your system grows you'll be able to scale up by just adding webservers and consumers, and worry less about your queuing infrastructure.
Some popular python implementations of message queues:
http://code.google.com/p/stomper/
http://code.google.com/p/pyactivemq/
http://xph.us/software/beanstalkd/

I'd suggest the following. (Since it's what we're doing.)
A simple WSGI server (wsgiref or werkzeug). The HTTP requests coming in will naturally form a queue. No further queueing needed. You get a request, you spawn the subprocess as a child and wait for it to finish. A simple list of children is about all you need.
I used a modification of the main "serve forever" loop in wsgiref to periodically poll all of the children to see how they're doing.
A simple SQLite database can track request status. Even this may be overkill because your XML inputs and results can just lie around in the file system.
That's it. Queueing and threads don't really enter into it. A single long-running external process is too complex to coordinate. It's simplest if each request is a separate, stand-alone, child process.
If you get immense bursts of requests, you might want a simple governor to prevent creating thousands of children. The governor could be a simple queue, built using a list with append() and pop(). Every request goes in, but only requests that fit within some "max number of children" limit are taken out.
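A minimal sketch of such a governor, with some liberties taken: it uses collections.deque rather than a plain list (so taking from the front is cheap), and the class name Governor and its methods are made up. Each task is just the argv of the external program to run.

```python
import subprocess
import sys
from collections import deque


class Governor:
    """Queue child-process launches so that at most max_children run at once."""

    def __init__(self, max_children):
        self.max_children = max_children
        self.pending = deque()  # argv lists waiting for a free slot
        self.running = []       # Popen objects that have not finished yet

    def submit(self, argv):
        """Every request goes in; it starts as soon as there is room."""
        self.pending.append(argv)
        self.poll()

    def poll(self):
        """Drop finished children, then start pending ones up to the limit."""
        self.running = [p for p in self.running if p.poll() is None]
        while self.pending and len(self.running) < self.max_children:
            self.running.append(subprocess.Popen(self.pending.popleft()))


if __name__ == "__main__":
    gov = Governor(max_children=2)
    for _ in range(5):
        gov.submit([sys.executable, "-c", "import time; time.sleep(0.5)"])
    print("running:", len(gov.running), "pending:", len(gov.pending))
```

Calling poll() from the server's periodic loop (like the modified "serve forever" loop mentioned above) keeps the queue draining as children finish.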

My reaction is to suggest Twisted, but you've already looked at this. Still, I stick by my answer. Without knowing your personal pain points, I can at least share some things that helped me reduce almost all of the deferred-madness that arises when you have several dependent, blocking actions you need to perform for a client.
Inline callbacks (lightly documented here: http://twistedmatrix.com/documents/8.2.0/api/twisted.internet.defer.html) provide a means to make long chains of deferreds much more readable (to the point of looking like straight-line code). There is an excellent example of the complexity reduction this affords here: http://blog.mekk.waw.pl/archives/14-Twisted-inlineCallbacks-and-deferredGenerator.html
You don't always have to get your bulk processing to integrate nicely with Twisted. Sometimes it is easier to break a large piece of your program off into a stand-alone, easily testable/tweakable/implementable command line tool and have Twisted invoke this tool in another process. Twisted's ProcessProtocol provides a fairly flexible way of launching and interacting with external helper programs. Furthermore, if you suddenly decide you want to cloudify your application, it is not all that big of a deal to use a ProcessProtocol to simply run your bulk processing on a remote server (random EC2 instances perhaps) via ssh, assuming you have the keys set up already.

You can have a look at celery

It seems any python web framework will suit your needs. I work with a similar system on a daily basis and I can tell you, your solution with threads and SQLite for queue storage is about as simple as you're going to get.
Assuming order doesn't matter in your queue, then threads should be acceptable. It's important to make sure you don't create race conditions with your queues or, for example, have two of the same job type running simultaneously. If this is the case, I'd suggest a single threaded application to do the items in the queue one by one.
