Understanding Python subprocess.run blocking

I see a number of posts here about the subprocess module, and it looks like it has changed quite a bit over the years. From reading the documentation, I think I understand the answer to my question, but I'm asking for either confirmation or for someone to tell me what I'm missing. I'm using Python 3.6.3.
I have a simple use case for subprocess. I need to build and execute a command. I need to capture stdout and stderr.
I do not need live results.
The size of stdout and stderr will be small relative to the server's available memory, so I don't think I need to be concerned about memory issues.
While stdout and stderr will be small, I expect multiple concurrent instances of the parent process to be invoked by a job scheduler.
I've been reading about deadlocks and pipes. Basically, the documentation warns "don't use stdout=PIPE because it can cause a problem with the OS pipe buffer".
But I think I can use subprocess.run and pass in
stdout=subprocess.PIPE
stderr=subprocess.PIPE
I think this option is different from the stdout=PIPE usage the documentation warns against, because subprocess.run uses Popen.communicate() under the covers, which takes care of making sure the OS pipe doesn't fill up. Thus no deadlock problems, and after .run() finishes I can post-process stdout and stderr.
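In other words, what I'm planning boils down to something like this (the command here is just a placeholder):

import subprocess

# Placeholder for the command the job scheduler will actually build.
command = ["ls", "-l"]

result = subprocess.run(command,
                        stdout=subprocess.PIPE,
                        stderr=subprocess.PIPE)

# Post-process the captured output once the child has exited.
print(result.returncode)
print(result.stdout.decode())
print(result.stderr.decode())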
Is this an accurate assessment or am I missing something?
Update #1:
process = Popen(command, stdout=PIPE, stderr=PIPE)
stdout, stderr = process.communicate()
In this case, are the following two things true?
the potential deadlock issues are eliminated by use of .communicate()
.communicate() will allow the process to fully complete
This code seems pretty simple, but I've seen an array of other code that is far more complicated and I'm not sure if that code is solving a problem that I do not have.

Related

getting entire stdout of multiprocessing.Process and displaying it in real time in the current console

I am using Python 3 and the multiprocessing module. Is there any way to grab the entire stdout and stderr of a multiprocessing.Process and redirect it to the current sys.stdout of the calling process?
A common approach here on SO is to redirect the output to a temporary stdout file, but those answers all focus on retrieving the output after the process has finished. I am trying to achieve both: capturing the output and displaying it in real time in the current console.
The subprocess module would make this much easier; on the other hand, only multiprocessing.Process allows me to pass certain objects to the function/process without pickling them myself.
Any hint is highly appreciated.
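One possible sketch of the live-redirect idea (not from the thread): route the child's prints through a multiprocessing.Pipe and echo them in the parent as they arrive. The worker function and its output are invented for illustration, and only stdout is handled here.

import multiprocessing
import sys

class _PipeWriter:
    # Minimal file-like object that forwards writes over a Connection.
    def __init__(self, conn):
        self.conn = conn
    def write(self, text):
        if text:
            self.conn.send(text)
    def flush(self):
        pass

def worker(conn):
    # Invented worker: anything it prints travels through the pipe.
    sys.stdout = _PipeWriter(conn)
    print("hello from the child")
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = multiprocessing.Pipe()
    p = multiprocessing.Process(target=worker, args=(child_conn,))
    p.start()
    child_conn.close()          # the parent only reads from its own end
    while True:
        try:
            sys.stdout.write(parent_conn.recv())   # echo in real time
        except EOFError:        # raised once the child closes its end
            break
    p.join()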

Terminate a process created with 'subprocess.run'

How can I terminate a process created with subprocess.run in Python 3?
The documentation of subprocess.run is here, but it doesn't specify it.
The documentation of the return-value is here, but there's no hint for it in there either.
With subprocess.Popen it's easy:
p = subprocess.Popen(...)
...
p.terminate()
How can I do the same when using subprocess.run?
You cannot, since control only returns to the Python interpreter once the process has ended.
You could try to get hold of the PID while running in a thread and kill it, but...
For those cases, Popen is the best solution, as it lets you control input/output and the end of your process.
From the documentation:
The underlying process creation and management in this module is handled by the Popen class. It offers a lot of flexibility so that developers are able to handle the less common cases not covered by the convenience functions.
Note that the documentation starts by describing run, then Popen, then the older check_call, check_output ... calls
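For illustration, a minimal sketch of the Popen route with a timeout (the sleep command is only a stand-in for a long-running child):

import subprocess

# Stand-in for a long-running child process.
p = subprocess.Popen(["sleep", "60"],
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)
try:
    stdout, stderr = p.communicate(timeout=5)   # give it 5 seconds
except subprocess.TimeoutExpired:
    p.terminate()                               # ask the child to stop
    stdout, stderr = p.communicate()            # reap it and collect what it wrote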

Interleaving stdout from Popen with messages from ZMQ recv

Is there a best-practices approach to poll for the stdout/stderr from a subprocess.Popen as well as a zmq socket?
In my case, I have my main program spawning a Popen subprocess. The subprocess publishes messages via zmq which I then want to subscribe to in my main program.
Waiting on multiple zmq sockets is not complicated with zmq.Poller, but when I want to interleave this with the output from the subprocess itself, I am unsure how best to do it without risking blocking waits or busy loops.
In the end, I would like to use it like so:
process = Popen([prog, '--publish-to', 'tcp://127.0.0.1:89890'],
                stdout=subprocess.PIPE, stderr=subprocess.PIPE, ...)

for (origin, data) in interleave(process, 'tcp://127.0.0.1:89890'):
    if origin == 'STDOUT': pass
    if origin == 'STDERR': pass
    if origin == 'ZMQ': pass
prog --publish-to tcp://127.0.0.1:89890 will then open a zmq.PUB socket and publish data, whereas the interleave function will subscribe to this and also poll for stdout and stderr, yielding whatever data reaches it first.
I know how to define interleave with multiple daemon threads and queues, but I don't know whether this approach has caveats with regard to lazy reading (i.e. stdout might not be processed until the end of the program?) or other things I have not yet thought about (it also seems to be quite a bit of overhead for such a task).
I will be thankful for all ideas or insights.
I aim for at least Python 3.3/3.4 but if this turns out to be much easier with the new async/await tools, I could also use Python 3.5 for the code.
Use zmq.Poller: http://pyzmq.readthedocs.io/en/latest/api/zmq.html#polling. You can register zmq sockets and native file descriptors (e.g. process.stdout.fileno() and process.stderr.fileno()) there, and it will wait until input is available on at least one of the registered sources.
I don't know whether it works on Windows; you'd have to try.
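A minimal sketch of what such an interleave generator might look like with zmq.Poller, reusing the process/endpoint arguments from the question. Registering raw pipe file descriptors like this is Unix-only, and readline() here is a simplification (a partial line could still block briefly):

import subprocess
import zmq

def interleave(process, endpoint):
    # Subscribe to everything the child publishes on the given endpoint.
    ctx = zmq.Context.instance()
    sub = ctx.socket(zmq.SUB)
    sub.connect(endpoint)
    sub.setsockopt(zmq.SUBSCRIBE, b'')

    poller = zmq.Poller()
    poller.register(sub, zmq.POLLIN)
    poller.register(process.stdout.fileno(), zmq.POLLIN)   # raw fds: Unix only
    poller.register(process.stderr.fileno(), zmq.POLLIN)

    while process.poll() is None:
        events = dict(poller.poll(timeout=100))             # milliseconds
        if sub in events:
            yield ('ZMQ', sub.recv())
        if process.stdout.fileno() in events:
            yield ('STDOUT', process.stdout.readline())
        if process.stderr.fileno() in events:
            yield ('STDERR', process.stderr.readline())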

What is the best way to capture output from a process using python?

I am using python's subprocess module to start a new process. I would like to capture the output of the new process in real time so I can do things with it (display it, parse it, etc.). I have seen many examples of how this can be done, some use custom file-like objects, some use threading and some attempt to read the output until the process has completed.
File Like Objects Example (click me)
I would prefer not to use custom file-like objects because I want to allow users to supply their own values for stdin, stdout and stderr.
Threading Example (click me)
I do not really understand why threading is required, so I am reluctant to follow this example. If someone can explain why the threading example makes sense, I would be happy to listen. However, this example also restricts users from supplying their own stdout and stderr values.
Read Output Example (see below)
The example which makes the most sense to me is to read the stdout, stderr until the process has finished. Here is some example code:
import subprocess
import sys

# Start a process which prints the options to the python program.
process = subprocess.Popen(
    ["python", "-h"],
    bufsize=1,
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    universal_newlines=True,  # text mode, so readline() returns str and the "" sentinel works
)
# While the process is running, display the output to the user.
while True:
    # Read standard output data.
    for stdout_line in iter(process.stdout.readline, ""):
        # Display standard output data.
        sys.stdout.write(stdout_line)
    # Read standard error data.
    for stderr_line in iter(process.stderr.readline, ""):
        # Display standard error data.
        sys.stderr.write(stderr_line)
    # If the process is complete - exit loop.
    if process.poll() is not None:
        break
My question is,
Q. Is there a recommended approach for capturing the output of a process using python?
First, your design is a bit silly, since you can do the same thing like this:
process = subprocess.Popen(
    ["python", "-h"],
    bufsize=1,
    stdout=sys.stdout,
    stderr=sys.stderr
)
… or, even better:
process = subprocess.Popen(
    ["python", "-h"],
    bufsize=1
)
However, I'll assume that's just a toy example, and you might want to do something more useful.
The main problem with your design is that it won't read anything from stderr until stdout is done.
Imagine you're driving an MP3 player that prints each track name to stdout, and logging info to stderr, and you want to play 10 songs. Do you really want to wait 30 minutes before displaying any of the logging to your users?
If that is acceptable, then you might as well just use communicate, which takes care of all of the headaches for you.
Plus, even if it's acceptable for your model, are you sure you can queue up that much unread data in the pipe without it blocking the child? On every platform?
Just breaking up the loop to alternate between the two won't help, because you could end up blocking on stdout.readline() for 5 minutes while stderr is piling up.
So that's why you need some way to read from both at once.
How do you read from two pipes at once?
This is the same problem (but smaller) as handling 1000 network clients at once, and it has the same solutions: threading, or multiplexing (and the various hybrids, like doing green threads on top of a multiplexor and event loop, or using a threaded proactor, etc.).
The best sample code for the threaded version is communicate from the 3.2+ source code. It's a little complicated, but if you want to handle all of the edge cases properly on both Windows and Unix there's really no avoiding a bit of complexity.
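A stripped-down sketch of the threaded idea (far less careful than the stdlib's communicate; the command and labels are placeholders):

import queue
import subprocess
import sys
import threading

def _pump(stream, label, q):
    # Each reader thread blocks on its own pipe and forwards lines to a shared queue.
    for line in iter(stream.readline, b""):
        q.put((label, line))
    stream.close()

process = subprocess.Popen(["python", "-h"],
                           stdout=subprocess.PIPE, stderr=subprocess.PIPE)
q = queue.Queue()
readers = [threading.Thread(target=_pump, args=(process.stdout, "STDOUT", q), daemon=True),
           threading.Thread(target=_pump, args=(process.stderr, "STDERR", q), daemon=True)]
for t in readers:
    t.start()

# Drain the queue as data arrives, until both pipes hit EOF.
while any(t.is_alive() for t in readers) or not q.empty():
    try:
        label, line = q.get(timeout=0.1)
    except queue.Empty:
        continue
    sys.stdout.write("%s: %s" % (label, line.decode()))
process.wait()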
For multiplexing, you can use the select module, but keep in mind that this only works on Unix (you can't select on pipes on Windows), and it's buggy without 3.2+ (or the subprocess32 backport), and to really get all the edge cases right you need to add a signal handler to your select. Unless you really, really don't want to use threading, this is the harder answer.
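And a bare-bones multiplexing sketch with select (Unix-only for pipes, and ignoring the edge cases mentioned above; the command is again a placeholder):

import os
import select
import subprocess
import sys

process = subprocess.Popen(["python", "-h"],
                           stdout=subprocess.PIPE, stderr=subprocess.PIPE)
# Map each child pipe to the stream we want to echo it to.
pipes = {process.stdout: sys.stdout, process.stderr: sys.stderr}
while pipes:
    # Block until at least one pipe has data (does not work for pipes on Windows).
    readable, _, _ = select.select(list(pipes), [], [])
    for stream in readable:
        data = os.read(stream.fileno(), 4096)
        if not data:              # EOF: stop watching this pipe
            del pipes[stream]
            continue
        pipes[stream].write(data.decode())
process.wait()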
But the easy answer is to use someone else's implementation. There are a dozen or more modules on PyPI specifically for async subprocesses. Alternatively, if you already have a good reason to write your app around an event loop, just about every modern event-loop-driven async networking library (including the stdlib's asyncio) includes subprocess support out of the box, that works on both Unix and Windows.
Is there a recommended approach for capturing the output of a process using python?
It depends on who you're asking; a thousand Python developers might have a thousand different answers… or at least half a dozen. If you're asking what the core devs would recommend, I can take a guess:
If you don't need to capture it asynchronously, use communicate (but make sure to upgrade to at least 3.2 for important bug fixes). If you do need to capture it asynchronously, use asyncio (which requires 3.4).
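For completeness, a rough sketch of the asyncio route, written with the 3.5+ async/await syntax (the command is a placeholder):

import asyncio

async def capture():
    # Launch the child with pipes for stdout and stderr.
    proc = await asyncio.create_subprocess_exec(
        "python", "-h",
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE)

    async def echo(stream, label):
        # Print each line as soon as the child writes it.
        while True:
            line = await stream.readline()
            if not line:
                break
            print(label, line.decode(), end="")

    await asyncio.gather(echo(proc.stdout, "STDOUT:"),
                         echo(proc.stderr, "STDERR:"))
    return await proc.wait()

# On Windows before Python 3.8 you would need the ProactorEventLoop for subprocess support.
loop = asyncio.get_event_loop()
loop.run_until_complete(capture())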

When should I use `wait` instead of `communicate` in subprocess?

In the documentation of wait (http://docs.python.org/2/library/subprocess.html#subprocess.Popen.wait), it says:
Warning
This will deadlock when using stdout=PIPE and/or stderr=PIPE and the
child process generates enough output to a pipe such that it blocks
waiting for the OS pipe buffer to accept more data. Use communicate()
to avoid that.
From this, I think communicate could replace all usage of wait() if the return code is not needed. And even when stdout or stdin are not PIPE, I can also replace wait() with communicate().
Is that right? Thanks!
I suspect (the docs don't explicitly state it as of 2.6) that in the case where you don't use PIPEs, communicate() reduces to wait(), so if you don't use PIPEs it should be OK to replace wait().
In the case where you do use PIPEs you can overflow the memory buffer (see the communicate() note) just as you can fill up the OS pipe buffer, so neither one is going to work if you're dealing with a lot of output.
On a practical note, I had communicate (at least in 2.4) give me one character per line from programs whose output is line-based; that wasn't useful, to put it mildly.
Also, what do you mean by "retcode is not needed"? -- I believe it sets Popen.returncode just as wait() does.
