If I'm using subprocess.Popen, I can use communicate() for small outputs.
But if the subprocess is going to take substantial time and produce substantial output, I want to access it as streaming data.
Is there a way to do this? The Python docs say:
Warning: Use communicate() rather than .stdin.write, .stdout.read or .stderr.read to avoid deadlocks due to any of the other OS pipe buffers filling up and blocking the child process.
I would really like to access a process's output as a file-like object:
with someMagicFunction(['path/to/some/command','arg1','arg2','arg3']) as outpipe:
    # pass outpipe into some other function that takes a file-like object
but can't figure out how to do this.
communicate() is a convenience method that starts background threads to read stdout and stderr. You can read stdout yourself, but you need to decide what to do with stderr. If you don't care about errors, you could add the parameter stderr=open(os.devnull, 'wb'), or send them to a file with stderr=open('somefile', 'wb'). Otherwise, create your own background thread to do the read. It turns out that shutil already has a suitable function (shutil.copyfileobj), so we can use it:
import subprocess
import threading
import shutil
import io

err_buf = io.BytesIO()
proc = subprocess.Popen(['ls', '-l'],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
err_thread = threading.Thread(target=shutil.copyfileobj,
                              args=(proc.stderr, err_buf))
err_thread.start()
for line in proc.stdout:
    print(line.decode('utf-8'), end='')
retval = proc.wait()
err_thread.join()
print('error:', err_buf.getvalue())
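If you want the file-like interface from the question, this pattern can be wrapped in a context manager. Here's a minimal sketch (someMagicFunction is the question's hypothetical name; the stderr handling matches the code above):
import contextlib
import io
import shutil
import subprocess
import threading

@contextlib.contextmanager
def someMagicFunction(cmd):
    # Yield the child's stdout as a file-like object while a background
    # thread drains stderr; clean up the thread and the process on exit.
    err_buf = io.BytesIO()
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    err_thread = threading.Thread(target=shutil.copyfileobj,
                                  args=(proc.stderr, err_buf))
    err_thread.start()
    try:
        yield proc.stdout
    finally:
        proc.stdout.close()
        err_thread.join()
        proc.wait()

with someMagicFunction(['ls', '-l']) as outpipe:
    for line in outpipe:
        print(line.decode('utf-8'), end='')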
Related
Here's a simple case that I'm trying to solve for all situations.
I am running a subprocess for performing a certain task, and I don't expect it to ask for stdin, but in rare cases that I might not even expect, it might try to read.
I would like to prevent it from hanging in that case.
Here is a classic example:
import subprocess
p = subprocess.Popen(["unzip", "-tqq", "encrypted.zip"])
p.wait()
This will hang forever.
I have already tried adding
stdin=open(os.devnull)
and similar approaches.
I will post if I find a good solution.
It would be enough for me to receive an exception in the parent process instead of hanging on communicate()/wait() endlessly.
Update: it seems the problem might be even more complicated than I initially expected. The subprocess (in the password case and others) reads from other file descriptors, like /dev/tty, to interact with the terminal. This might not be as easy to solve as I thought.
If your child process may ask for a password, then it may do so outside of the standard input/output/error streams if a tty is available; see the first reason in Q: Why not just use a pipe (popen())?
As you've noticed, creating a new session prevents the subprocess from using the parent's tty. For example, if you have an ask-password.py script:
#!/usr/bin/env python
"""Ask for password. It defaults to working with a terminal directly."""
from getpass import getpass
try:
    _ = getpass()
except EOFError:
    pass  # ignore
else:
    assert 0
Then, to call it as a subprocess so that it does not hang waiting for the password, you could use the start_new_session=True parameter:
#!/usr/bin/env python3
import subprocess
import sys
subprocess.check_call([sys.executable, 'ask-password.py'],
                      stdin=subprocess.DEVNULL, start_new_session=True,
                      stderr=subprocess.DEVNULL)
stderr is redirected here too because getpass() uses it as a fallback, to print warnings and the prompt.
To emulate start_new_session=True on Unix on Python 2, you could use preexec_fn=os.setsid.
To emulate subprocess.DEVNULL on Python 2, you could use DEVNULL=open(os.devnull, 'r+b', 0) or pass stdin=PIPE and close it immediately using .communicate():
#!/usr/bin/env python2
import os
import sys
from subprocess import Popen, PIPE
Popen([sys.executable, 'ask-password.py'],
      stdin=PIPE, preexec_fn=os.setsid,
      stderr=PIPE).communicate()  # NOTE: assumes small output on stderr
Note: you don't need .communicate() unless you use subprocess.PIPE. check_call() is perfectly safe if you use an object with a real file descriptor (.fileno()), such as the one returned by open(os.devnull, ..). The redirection occurs before the child process is executed (after fork(), before exec()); there is no reason to use .communicate() instead of check_call() here.
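For completeness, a sketch of the first Python 2 option (a real os.devnull file object with check_call(); this reuses the ask-password.py script from above):
#!/usr/bin/env python2
import os
import sys
from subprocess import check_call

DEVNULL = open(os.devnull, 'r+b', 0)  # readable and writable, unbuffered
check_call([sys.executable, 'ask-password.py'],
           stdin=DEVNULL, stderr=DEVNULL,
           preexec_fn=os.setsid)  # emulate start_new_session=True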
Apparently the culprit is the direct usage of /dev/tty and such.
On Linux at least, one solution is to add to the Popen call the following parameter:
preexec_fn=os.setsid
which causes a new session ID to be set and disallows reading from the tty directly. I will probably use the following code (the stdin close is just in case):
import subprocess
import os
p = subprocess.Popen(["unzip", "-tqq", "encrypted.zip"],
                     stdin=subprocess.PIPE, preexec_fn=os.setsid)
p.stdin.close()  # just in case
p.wait()
The last two lines can be replaced by one call:
p.communicate()
since communicate() closes the stdin file after sending all the supplied input.
Simple and elegant, it seems.
Alternatively:
import subprocess
import os
p = subprocess.Popen(["unzip", "-tqq", "encrypted.zip"],
                     stdin=open(os.devnull), preexec_fn=os.setsid)
p.communicate()
To simplify my question, here's a Python script:
from subprocess import Popen, PIPE
proc = Popen(['./mr-task.sh'], shell=True, stdout=PIPE, stderr=PIPE)
while True:
    out = proc.stdout.readline()
    print(out)
Here's mr-task.sh; it starts a MapReduce job:
hadoop jar xxx.jar some-conf-we-don't-need-to-care
When I run ./mr-task.sh directly, I can see the log printed on the screen, something like:
14/12/25 14:56:44 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/12/25 14:56:44 INFO snappy.LoadSnappy: Snappy native library loaded
14/12/25 14:57:01 INFO mapred.JobClient: Running job: job_201411181108_16380
14/12/25 14:57:02 INFO mapred.JobClient: map 0% reduce 0%
14/12/25 14:57:28 INFO mapred.JobClient: map 100% reduce 0%
But I can't get this output when running the Python script. I tried removing shell=True and fetching stderr instead, but still got nothing.
Does anyone have any idea why this happens?
You could redirect stderr to stdout:
from subprocess import Popen, PIPE, STDOUT
proc = Popen(['./mr-task.sh'], stdout=PIPE, stderr=STDOUT, bufsize=1)
for line in iter(proc.stdout.readline, b''):
    print line,
proc.stdout.close()
proc.wait()
See Python: read streaming input from subprocess.communicate().
In my real program I redirect stderr to stdout and read from stdout, so bufsize is not needed, is it?
The redirection of stderr to stdout and bufsize are unrelated. Changing bufsize might affect the time performance (the default is bufsize=0, i.e., unbuffered, on Python 2). Unbuffered I/O might be 10..100 times slower. As usual, you should measure the time performance if it is important.
Calling Popen.wait/communicate after the subprocess has terminated is just for reaping the zombie process, and the two methods have no difference in that case, correct?
The difference is that proc.communicate() closes the pipes before reaping the child process. It releases file descriptors (a finite resource) to be used by other files in your program.
About the buffer: if the output fills the buffer's maximum size, will the subprocess hang? Does that mean that with the default bufsize=0 setting I need to read from stdout as soon as possible so that the subprocess doesn't block?
No. It is a different buffer. bufsize controls the buffer inside the parent that is filled/drained when you call the .readline() method. There won't be a deadlock whatever bufsize is.
The code (as written above) won't deadlock no matter how much output the child might produce.
The code in #falsetru's answer can deadlock because it creates two pipes (stdout=PIPE, stderr=PIPE) but it reads only from one pipe (proc.stderr).
There are several buffers between the child and the parent, e.g., C stdio's stdout buffer (a libc buffer inside the child process, inaccessible from the parent) and the child's stdout OS pipe buffer (inside the kernel; the parent process may read the data from there). These buffers are fixed in size; they won't grow if you put more data into them. If stdio's buffer overflows (e.g., during a printf() call) then the data is pushed downstream into the child's stdout OS pipe buffer. If nobody reads from the pipe then this OS pipe buffer fills up and the child blocks (e.g., on the write() system call) trying to flush the data.
To be concrete, I've assumed a C stdio-based program and a POSIXy OS.
The deadlock happens because the parent tries to read from the stderr pipe, which stays empty because the child is busy trying to flush its stdout. Thus both processes hang.
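For illustration, a minimal sketch that reproduces this hang (it assumes a POSIX system with the yes utility, which floods stdout):
from subprocess import Popen, PIPE

# 'yes' writes to stdout forever and never writes to stderr, so this
# readline() blocks while the child eventually blocks on a full stdout
# pipe buffer: both processes hang.
proc = Popen(['yes'], stdout=PIPE, stderr=PIPE)
proc.stderr.readline()  # deadlock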
One possible reason is that the output is printed to standard error instead of standard output.
Try to replace stdout with stderr:
from subprocess import Popen, PIPE
proc = Popen(['./mr-task.sh'], stdout=PIPE, stderr=PIPE)
while True:
    out = proc.stderr.readline()  # <----
    if not out:
        break
    print(out)
I need to call a shell script from Python.
The problem is that the shell script will ask a couple of questions along the way until it is finished.
I can't find a way to do so using subprocess! (Using pexpect seems a bit overkill since I only need to start the script and send a couple of YES answers to it.)
PLEASE don't suggest ways that require modification to the shell script!
Using the subprocess library, you can tell the Popen class that you want to manage the standard input of the process like this:
import subprocess
shellscript = subprocess.Popen(["shellscript.sh"], stdin=subprocess.PIPE)
Now shellscript.stdin is a file-like object on which you can call write:
shellscript.stdin.write("yes\n")
shellscript.stdin.close()
returncode = shellscript.wait() # blocks until shellscript is done
You can also get standard out and standard error from a process by setting stdout=subprocess.PIPE and stderr=subprocess.PIPE, but you shouldn't use PIPEs for both standard input and standard output, because deadlock could result. (See the documentation.) If you need to pipe in and pipe out, use the communicate method instead of the file-like objects:
shellscript = subprocess.Popen(["shellscript.sh"], stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = shellscript.communicate("yes\n") # blocks until shellscript is done
returncode = shellscript.returncode
I have the following code in a loop:
while True:
    # Define shell_command
    p1 = Popen(shell_command, shell=shell_type, stdout=PIPE, stderr=PIPE,
               preexec_fn=os.setsid)
    result = p1.stdout.read()
    # Define condition
    if condition:
        break
where shell_command is something like ls (it just prints stuff).
I have read in different places that I can close/terminate/exit a Popen object in a variety of ways, e.g.:
p1.stdout.close()
p1.stdin.close()
p1.terminate()
p1.kill()
My question is:
What is the proper way of closing a subprocess object once we are done using it?
Considering the nature of my script, is there a way to open a subprocess object only once and reuse it with different shell commands? Would that be more efficient in any way than opening new subprocess objects each time?
Update
I am still a bit confused about the sequence of steps to follow depending on whether I use p1.communicate() or p1.stdout.read() to interact with my process.
From what I understood in the answers and the comments:
If I use p1.communicate() I don't have to worry about releasing resources, since communicate() would wait until the process is finished, grab the output and properly close the subprocess object
If I follow the p1.stdout.read() route (which I think fits my situation, since the shell command is just supposed to print stuff) I should call things in this order:
p1.wait()
p1.stdout.read()
p1.terminate()
Is that right?
What is the proper way of closing a subprocess object once we are done using it?
stdout.close() and stdin.close() will not terminate a process unless it exits itself on end of input or on write errors.
.terminate() and .kill() both do the job, with kill being a bit more "drastic" on POSIX systems, as SIGKILL is sent, which cannot be ignored by the application. Specific differences are explained in this blog post, for example. On Windows, there's no difference.
Also, remember to .wait() and to close the pipes after killing a process to avoid zombies and force the freeing of resources.
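As a sketch of that escalation (Python 3.3+, where wait() accepts a timeout; sleep stands in for an arbitrary child):
import subprocess
from subprocess import Popen, PIPE

p = Popen(['sleep', '100'], stdout=PIPE)
p.terminate()              # SIGTERM on POSIX; the child may handle or ignore it
try:
    p.wait(timeout=5)      # give the child a chance to exit cleanly
except subprocess.TimeoutExpired:
    p.kill()               # SIGKILL cannot be ignored
    p.wait()               # reap the dead child
p.stdout.close()           # release the pipe's file descriptor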
A special case that is often encountered is processes which read from STDIN and write their result to STDOUT, closing themselves when EOF is encountered. With these kinds of programs, it's often sensible to use communicate():
>>> p = Popen(["sort"], stdin=PIPE, stdout=PIPE)
>>> p.communicate("4\n3\n1")
('1\n3\n4\n', None)
>>> p.returncode
0
This can also be used for programs which print something and exit right after:
>>> p = Popen(["ls", "/home/niklas/test"], stdin=PIPE, stdout=PIPE)
>>> p.communicate()
('file1\nfile2\n', None)
>>> p.returncode
0
Considering the nature of my script, is there a way to open a subprocess object only once and reuse it with different shell commands? Would that be more efficient in any way than opening new subprocess objects each time?
I don't think the subprocess module supports this and I don't see what resources could be shared here, so I don't think it would give you a significant advantage.
Considering the nature of my script, is there a way to open a subprocess object only once and reuse it with different shell commands?
Yes.
#!/usr/bin/env python
from __future__ import print_function
import uuid
import random
from subprocess import Popen, PIPE, STDOUT

MARKER = str(uuid.uuid4())
shell_command = 'echo a'

p = Popen('sh', stdin=PIPE, stdout=PIPE, stderr=STDOUT,
          universal_newlines=True)  # decode output as utf-8, newline is '\n'

while True:
    # write next command
    print(shell_command, file=p.stdin)
    # insert MARKER into stdout to separate output from different shell_command
    print("echo '%s'" % MARKER, file=p.stdin)

    # read command output
    for line in iter(p.stdout.readline, MARKER + '\n'):
        if line.endswith(MARKER + '\n'):
            print(line[:-len(MARKER) - 1])
            break  # command output ended without a newline
        print(line, end='')

    # exit on condition
    if random.random() < 0.1:
        break

# cleanup
p.stdout.close()
if p.stderr:
    p.stderr.close()
p.stdin.close()
p.wait()
Put the while True loop inside try: ... finally: to perform the cleanup in case of exceptions. On Python 3.2+, you could use a with Popen(...): block instead.
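For example, the Python 3.2+ context-manager form might look like this (a sketch; on exiting the with block the pipes are closed and the child is reaped):
from subprocess import Popen, PIPE, STDOUT

with Popen('sh', stdin=PIPE, stdout=PIPE, stderr=STDOUT,
           universal_newlines=True) as p:
    print('echo a', file=p.stdin)
    p.stdin.close()            # flush and send EOF so 'sh' exits
    for line in p.stdout:
        print(line, end='')
# the with block calls wait() and closes the remaining pipes here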
Would that be more efficient in any way than opening new subprocess objects each time?
Does it matter in your case? Don't guess. Measure it.
The "correct" order is:
Create a thread to read stdout (and a second one to read stderr, unless you merged them into one).
Write commands to be executed by the child to stdin. If you're not reading stdout at the same time, writing to stdin can block.
Close stdin (this is the signal for the child that it can now terminate by itself whenever it is done)
When stdout returns EOF, the child has terminated. Note that you need to synchronize the stdout reader thread and your main thread.
Call wait() to see if there was a problem and to clean up the child process.
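A minimal sketch of these steps, assuming a cat-like child that echoes stdin and exits when it reads EOF:
import threading
from subprocess import Popen, PIPE

def drain(pipe, sink):
    # step 1: read stdout in a background thread until EOF
    for chunk in iter(pipe.readline, b''):
        sink.append(chunk)

p = Popen(['cat'], stdin=PIPE, stdout=PIPE)
output = []
reader = threading.Thread(target=drain, args=(p.stdout, output))
reader.start()
p.stdin.write(b'hello\n')  # step 2: write while the reader drains stdout
p.stdin.close()            # step 3: EOF tells the child it can finish
reader.join()              # step 4: the reader ends at the child's EOF
p.stdout.close()
p.wait()                   # step 5: check for problems and reap the child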
If you need to stop the child process for any reason (maybe the user wants to quit), then you can:
Close stdin if the child terminates when it reads EOF.
Kill the child with terminate(). This is the correct solution for child processes which ignore stdin.
If the child doesn't respond, try kill().
In all three cases, you must call wait() to clean up the dead child process.
It depends on what you expect the process to do; you should always call p1.wait() in order to avoid zombies. Other steps depend on the behaviour of the subprocess: if it produces any output, you should consume it (e.g. with p1.stdout.read(), but this would eat lots of memory) and only then call p1.wait(); or you may wait for some timeout and call p1.terminate() to kill the process if you think it isn't working as expected, and possibly call p1.wait() to clean up the zombie.
Alternatively, p1.communicate(...) would do the I/O handling and waiting for you (but not the killing).
Subprocess objects aren't supposed to be reused.
I have the following chunk of Python code (running v2.7) that results in MemoryError exceptions being thrown when I work with large (several GB) files:
myProcess = Popen(myCmd, shell=True, stdout=PIPE, stderr=PIPE)
myStdout, myStderr = myProcess.communicate()
sys.stdout.write(myStdout)
if myStderr:
    sys.stderr.write(myStderr)
In reading the documentation to Popen.communicate(), there appears to be some buffering going on:
Note: The data read is buffered in memory, so do not use this method if the data size is large or unlimited.
Is there a way to disable this buffering, or force the cache to be cleared periodically while the process runs?
What alternative approach should I use in Python for running a command that streams gigabytes of data to stdout?
I should note that I need to handle output and error streams.
I think I found a solution:
myProcess = Popen(myCmd, shell=True, stdout=PIPE, stderr=PIPE)
for ln in myProcess.stdout:
    sys.stdout.write(ln)
for ln in myProcess.stderr:
    sys.stderr.write(ln)
This seems to get my memory usage down enough to get through the task.
Update
I have recently found a more flexible way of handling data streams in Python, using threads. It's interesting that Python is so poor at something that shell scripts can do so easily!
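For example, a sketch of that thread-based approach (Python 2, matching the question; pump is a made-up helper name, and myCmd is the command from the question):
import sys
import threading
from subprocess import Popen, PIPE

def pump(src, dst):
    # Copy one child stream to ours line by line, never holding
    # more than a line in memory at a time.
    for line in iter(src.readline, b''):
        dst.write(line)
    src.close()

myProcess = Popen(myCmd, shell=True, stdout=PIPE, stderr=PIPE)
threads = [threading.Thread(target=pump, args=(myProcess.stdout, sys.stdout)),
           threading.Thread(target=pump, args=(myProcess.stderr, sys.stderr))]
for t in threads:
    t.start()
for t in threads:
    t.join()
myProcess.wait()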
What I would probably do instead, if I needed to read the stdout for something that large, is send it to a file on creation of the process.
with open(my_large_output_path, 'w') as fo:
    with open(my_large_error_path, 'w') as fe:
        myProcess = Popen(myCmd, shell=True, stdout=fo, stderr=fe)
Edit: If you need to stream, you could try making a file-like object and passing it to stdout and stderr. (I haven't tried this, though.) You could then read (query) from the object as it's being written.
For those whose application hangs after a certain amount of time when using Popen, please see my case below.
A rule of thumb: if you're not going to use the stderr and stdout streams, don't pass them as pipes in the Popen parameters, because those pipe buffers will fill up and cause you a lot of problems.
If you need them for a certain amount of time and you need to keep the process running, then you can close those streams at any time.
try:
    p = Popen(COMMAND, stdout=PIPE, stderr=PIPE)
    # After using stdout and stderr
    p.stdout.close()
    p.stderr.close()
except Exception as e:
    pass