dealing with subprocess that floods stdout - python

I'm dealing with a subprocess that occasionally goes into an infinite loop and floods stdout with garbage. I generally need to capture stdout, except in those cases.
This discussion gives a way to limit the amount of time a subprocess takes, but the problem is that for a reasonable timeout it can produce GBs of output before being killed.
Is there a way to limit the amount of output that's captured from the process?

If you can't detect when the flooding occurs, chances are nobody else can. Since you do the capturing, you are of course free to limit your capturing, but that requires that you know when the looping has occurred.
Perhaps you can use rate-limiting, if the "regular" rate is lower than the one observed when the spamming occurs?

You can connect the subprocess's stdout to a file-like object that limits the amount of data it will pass to the real stdout when you call Popen. The file-like object could be a FIFO or a cStringIO.
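For illustration, here is a minimal sketch of one way to cap what you capture: read the pipe yourself in chunks and kill the process once a limit is exceeded. The command name and the 10 MB cap are placeholders, not anything taken from the question.

import subprocess

MAX_CAPTURE = 10 * 1024 * 1024                        # hypothetical 10 MB cap
proc = subprocess.Popen(["./flaky_program"],          # placeholder command
                        stdout=subprocess.PIPE)

chunks, total = [], 0
while True:
    chunk = proc.stdout.read1(65536)   # return as soon as some data arrives
    if not chunk:                      # EOF: the process closed stdout
        break
    total += len(chunk)
    if total > MAX_CAPTURE:            # flood detected: stop capturing
        proc.kill()                    # and kill the runaway process
        break
    chunks.append(chunk)
proc.wait()
output = b"".join(chunks)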

Related

Capture stderr in ProcessPoolExecutor

I'm using ProcessPoolExecutor to execute an external command; is there any way to capture stderr when the process is finished (similar to subprocess)? I'm capturing the executor.submit() future result, but that returns 0 or 1.
This might not be an answer, but it points in that direction and is far too long for a comment, so here goes.
I would say no to that. You might be able to achieve this by tinkering with stderr file descriptors, redirecting them to a stream of your own and returning that as the worker result, but I am wondering if ProcessPoolExecutor is suitable for your task (of course, not knowing what it is).
A subprocess created by a process pool does not finish like a subprocess created by yourself. It stays alive waiting for more work to arrive until you close the pool. If your worker produces stdout or stderr, they go to the same place where your main process directs its stdout and stderr.
Your workers will also process many different tasks. If your pool size is four and you submit ten tasks, how do you then decipher from a plain stderr capture which task created what message?
I have a hunch this needs to be redesigned. You would be able to raise exceptions in your workers and then later capture those from your future objects. Or it might be that your task is something where a pool is just not suitable. If subprocesses do what you want them to do, why not use them instead?
Pools are good for parallelising repetitive tasks that return and receive modest amounts of data (implemented as queues, which are not miracle performers), with a standard interface and standardised output/error handling. Pools simplify your code by providing the routine part. If your subtasks require different inputs, if their outputs and error handling vary greatly, or if there is a lot of data to be transmitted, you might be better off building the parallel processing part yourself.
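One workable compromise, sketched here on the assumption that the worker itself can launch the external command: run it with subprocess inside the worker and return the return code and stderr as the task result, so each future carries its own error output. The commands below are placeholders.

import subprocess
from concurrent.futures import ProcessPoolExecutor

def run_command(cmd):
    # Run the external command inside the worker and hand back
    # its return code and stderr as this task's result.
    completed = subprocess.run(cmd, capture_output=True, text=True)
    return completed.returncode, completed.stderr

if __name__ == "__main__":
    commands = [["ls", "-l"], ["ls", "/nonexistent"]]   # placeholder commands
    with ProcessPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(run_command, cmd) for cmd in commands]
        for future in futures:
            code, err = future.result()
            print(code, err)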

Python - Limit amount of data subprocess.Popen can produce

I found lots of similar questions asking about the size of an object at run time in Python. Some of the answers suggest setting a limit on the amount of memory of a sub-process. I do not want to set a limit on the memory of the sub-process. Here is what I want --
I'm using subprocess.Popen() to execute an external program. I can, very well, get standard output and error with process.stdout.readlines() and process.stderr.readlines() after the process is complete.
I have a problem when an erroneous program gets into an infinite loop and keeps producing output. Since subprocess.Popen() stores output data in memory, this infinite loop quickly eats up all the memory and the program slows down.
One solution is to run the command with a timeout. But programs take variable time to complete, and a large timeout defeats the purpose for a program that normally finishes quickly but has gone into an infinite loop.
Is there any simple way to put an upper limit, say 200 MB, on the amount of data the command can produce? If it exceeds the limit, the command should get killed.
First: it is not subprocess.Popen() storing the data, it is the pipe between "us" and "our" subprocess.
You shouldn't use readlines() in this case, as it will buffer the data indefinitely and only return it as a list at the end (in this case, it is indeed this function that stores the data).
If you do something like
bytes = lines = 0
for line in process.stdout:
    bytes += len(line)
    lines += 1
    if bytes > 200000000 or lines > 10000:
        # handle the described situation
        break
you can handle the situation as you wanted in your question. But you shouldn't forget to kill the subprocess afterwards in order to stop it from producing further data.
But if you want to take care of stderr as well, you'd have to try to replicate process.communicate()'s behaviour with select() etc., and act appropriately.
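A rough sketch of that select()-based approach (POSIX-only; the command name is a placeholder and the 200 MB limit comes from the question):

import select
import subprocess

LIMIT = 200_000_000                                    # 200 MB, as in the question
proc = subprocess.Popen(["./some_program"],            # placeholder command
                        stdout=subprocess.PIPE,
                        stderr=subprocess.PIPE)

captured = {proc.stdout: b"", proc.stderr: b""}
open_pipes = [proc.stdout, proc.stderr]
while open_pipes:
    readable, _, _ = select.select(open_pipes, [], [])
    for pipe in readable:
        data = pipe.read1(65536)       # read whatever is available, up to 64 KB
        if not data:                   # EOF on this pipe
            open_pipes.remove(pipe)
            continue
        captured[pipe] += data
        if sum(len(v) for v in captured.values()) > LIMIT:
            proc.kill()                # over the limit: stop the flood
            open_pipes = []
            break
proc.wait()
stdout_data, stderr_data = captured[proc.stdout], captured[proc.stderr]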
There doesn't seem to be an easy answer to what you want.
http://linux.about.com/library/cmd/blcmdl2_setrlimit.htm
rlimit has flags to limit memory, CPU time, or the number of open files, but apparently nothing to limit the amount of I/O.
You should handle the case manually as already described.

When should I use `wait` instead of `communicate` in subprocess?

In the documentation of wait (http://docs.python.org/2/library/subprocess.html#subprocess.Popen.wait), it says:
Warning: This will deadlock when using stdout=PIPE and/or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use communicate() to avoid that.
From this, I think communicate() could replace all usage of wait() if the return code is not needed. And even when stdout or stdin are not PIPE, I can also replace wait() with communicate().
Is that right? Thanks!
I suspect (the docs don't explicitly state it as of 2.6) that in the case where you don't use PIPEs, communicate() is reduced to wait(). So if you don't use PIPEs, it should be OK to replace wait().
In the case where you do use PIPEs, you can overflow the memory buffer (see the communicate() note) just as you can fill up the OS pipe buffer, so neither one is going to work if you're dealing with a lot of output.
On a practical note, I had communicate() (at least in 2.4) give me one character per line from programs whose output is line-based, which wasn't useful, to put it mildly.
Also, what do you mean by "retcode is not needed"? -- I believe it sets Popen.returncode just as wait() does.
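To illustrate the distinction the answer draws, a minimal sketch (ls -l is just a stand-in command):

import subprocess

# Without pipes, wait() is fine: output goes straight to the terminal.
proc = subprocess.Popen(["ls", "-l"])
retcode = proc.wait()

# With pipes, use communicate(): it drains stdout/stderr while waiting,
# so the child can never block on a full OS pipe buffer.
proc = subprocess.Popen(["ls", "-l"],
                        stdout=subprocess.PIPE,
                        stderr=subprocess.PIPE)
out, err = proc.communicate()
retcode = proc.returncode          # communicate() sets returncode just like wait()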

Do Python threads wait for standard output?

If you run a couple of threads but they all have to print to the same stdout, does this mean they have to wait on each other? So, say, if all 4 threads have something to write, do they have to pause and wait for stdout to be free so they can get on with their work?
Deep deep (deep deep deep...) down in the OS's system calls, yes. Modern OSes have thread-safe terminal printing routines which usually just lock around the critical sections that do the actual device access (or buffer access, depending on what you're writing into and what its settings are). These waits are very short, however. Keep in mind that this is I/O you're dealing with here, so the wait times are likely to be negligible relative to actual I/O execution.
It depends. If stdout is a pipe, each pipe gets a 4KB buffer which you can override when the pipe is created. Buffers are flushed when the buffer is full or with a call to flush().
If stdout is a terminal, output is usually line buffered. So until you print a newline, all threads can write to their buffers. When the newline is written, the whole buffer is dumped on the console and all other threads that are writing newlines at the same time have to wait.
Since threads do other things than writing newlines, each thread gets some CPU. So even in the worst case, the congestion should be pretty small.
There is one exception, though: writing a lot of data, or writing to a slow console (like the Linux kernel debug console, which uses the serial port). When the console can't cope with the amount of data, more and more threads will hang in the write of the newline, waiting for the buffers to flush.
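As a small illustration of the contention being described, here is a sketch in which four threads share stdout and an explicit lock makes the (normally implicit and very short) serialization visible; the thread names and counts are arbitrary:

import threading

print_lock = threading.Lock()      # serialize access to stdout explicitly

def worker(name):
    for i in range(3):
        # Each thread briefly waits here if another thread holds the lock,
        # mirroring the short waits described above at the OS level.
        with print_lock:
            print(f"{name}: line {i}")

threads = [threading.Thread(target=worker, args=(f"t{n}",)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()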

run external program in a loop with max time limit

I wish to have a Python script that runs an external program in a loop, sequentially. I also want to limit each execution of the program to a maximum running time; if it is exceeded, the program should be killed. What is the best way to accomplish this?
Thanks!
To run an external program from Python you'll normally want to use the subprocess module.
You could "roll your own" subprocess handling using os.fork() and os.execve() (or one of its exec* cousins) ... with any file descriptor plumbing and signal handling magic you like. However, the subprocess.Popen() function has implemented and exposed most of the features for what you'd want to do for you.
To arrange for the program to die after a given period of time you can have your Python script kill it after the timeout. Naturally you'll want to check to see if the process has already completed before then. Here's a dirt stupid example (using the split function from the shlex module for additional readability:
from shlex import split as splitsh
import subprocess
import time
TIMEOUT=10
cmd = splitsh('/usr/bin/sleep 60')
proc = subprocess.Popen(cmd)
time.sleep(TIMEOUT)
pstatus = proc.poll()
if pstatus is None:
    proc.kill()
    # Could use os.kill() to send a specific signal
    # such as HUP or TERM, check status again and
    # then resort to proc.kill() or os.kill() for
    # SIGKILL only if necessary
As noted there are a few ways to kill your subprocess. Note that I check for "is None" rather than testing pstatus for truth. If your process completed with an exit value of zero (conventionally indicating that no error occurred) then a naïve test of the proc.poll() results would conflate that completion with the still running process status.
There are also a few ways to determine if sufficient time has passed. In this example we sleep, which is somewhat silly if there's anything else we could be doing. That just leaves our Python process (the parent of your external program) laying about idle.
You could capture the start time using time.time() then launch your subprocess, then do other work (launch other subprocesses, for example) and check the time (perhaps in a loop of other activity) until your desired timeout has been exceeded.
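A sketch of that polling approach, reusing the timeout and the stand-in command from the example above:

import subprocess
import time

TIMEOUT = 10                           # seconds
cmd = ["/usr/bin/sleep", "60"]         # stand-in for the external program

proc = subprocess.Popen(cmd)
start = time.time()
while proc.poll() is None:             # still running?
    if time.time() - start > TIMEOUT:
        proc.kill()                    # timeout exceeded: kill it
        break
    # ... do other useful work here instead of just sleeping ...
    time.sleep(0.1)                    # avoid a busy loop
proc.wait()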
If any of your other activity involves file or socket (network) operations then you'd want to consider using the select module as a way to return a list of file descriptors which are readable, writable or ready with "exceptional" events. The select.select() function also takes an optional "timeout" value. A call to select.select([],[],[],x) is essentially the same as time.sleep(x) (in the case where we aren't providing any file descriptors for it to select among).
In lieu of select.select() it's also possible to use the fcntl module to set your file descriptor into a non-blocking mode and then use os.read() (NOT the normal file object .read() methods, but the lower-level functionality from the os module). Again, it's better to use the higher-level interfaces where possible and only resort to the lower-level functions when you must. If you use non-blocking I/O then all your os.read() or similar operations must be done within exception-handling blocks, since Python will represent the EWOULDBLOCK condition as an OSError (exception) like: "OSError: [Errno 11] Resource temporarily unavailable" (Linux). The precise number of the error might vary from one OS to another. However, it should be portable (at least for POSIX systems) to use the EWOULDBLOCK value from the errno module.
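A compressed sketch of that fcntl/os.read() pattern (POSIX-only; the command name is a placeholder):

import errno
import fcntl
import os
import subprocess

proc = subprocess.Popen(["./some_program"], stdout=subprocess.PIPE)  # placeholder
fd = proc.stdout.fileno()

# Switch the pipe's file descriptor into non-blocking mode.
flags = fcntl.fcntl(fd, fcntl.F_GETFL)
fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)

try:
    data = os.read(fd, 65536)          # low-level read, not proc.stdout.read()
except OSError as exc:
    if exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
        data = b""                     # nothing available right now; try later
    else:
        raise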
(I realize I'm going down a rathole here, but information on how your program can do something useful while your child processes are running external programs is a natural extension of how to manage the timeouts for them).
Ugly details about non-blocking file I/O (including portability issues with MS Windows) have been discussed here in the past: Stackoverflow: non-blocking read on a stream in Python
As others have commented, it's better to provide more detailed questions and include short, focused snippets of code which show what effort you've already undertaken. Usually you won't find people here inclined to write tutorials rather than answers.
If you are able to use Python 3.3, subprocess.call() accepts a timeout argument.
From the docs:
subprocess.call(args, *, stdin=None, stdout=None, stderr=None, shell=False, timeout=None)
subprocess.call(["ls", "-l"])
0
subprocess.call("exit 1", shell=True)
1
Should do the trick.
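For the timeout use case in the question, a minimal sketch (sleep 60 stands in for the external program):

import subprocess

try:
    # timeout (new in 3.3) kills the child and raises if it runs too long
    retcode = subprocess.call(["/usr/bin/sleep", "60"], timeout=10)
except subprocess.TimeoutExpired:
    retcode = None                     # the program exceeded the 10-second limit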
