python subprocess module hangs for spark-submit command when writing STDOUT - python

I have a python script that is used to submit spark jobs using the spark-submit tool. I want to execute the command and write the output both to STDOUT and a logfile in real time. I'm using Python 2.7 on an Ubuntu server.
This is what I have so far in my SubmitJob.py script
#!/usr/bin/python
import subprocess

# Submit the command
def submitJob(cmd, log_file):
    with open(log_file, 'w') as fh:
        process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        while True:
            output = process.stdout.readline()
            if output == '' and process.poll() is not None:
                break
            if output:
                print output.strip()
                fh.write(output)
        rc = process.poll()
        return rc
if __name__ == "__main__":
    cmdList = ["dse", "spark-submit", "--spark-master", "spark://127.0.0.1:7077", "--class", "com.spark.myapp", "./myapp.jar"]
    log_file = "/tmp/out.log"
    exit_status = submitJob(cmdList, log_file)
    print "job finished with status ", exit_status
The strange thing is, when I execute the same command directly in the shell, it works fine and produces output on screen as the program proceeds.
So it looks like something is wrong in the way I'm using subprocess.PIPE for stdout and writing the file.
What's the currently recommended way to use the subprocess module to write to stdout and a log file in real time, line by line? I see a bunch of options online but am not sure which is correct or current.
thanks

Figured out what the problem was.
I was trying to redirect both stdout and stderr to the pipe so they could be displayed on screen. This blocks stdout when stderr output is present. If I remove the stderr=subprocess.STDOUT argument from Popen, it works fine. So for spark-submit it looks like you don't need to redirect stderr explicitly, as it already reaches the terminal on its own.
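For reference, a minimal sketch of the fixed loop (the stream_job name is mine, and a trivial python -c command stands in for the real spark-submit cmdList; stderr is left untouched so it flows straight to the terminal):

```python
import os
import subprocess
import sys
import tempfile

def stream_job(cmd, log_file):
    """Run cmd, mirroring its stdout to the screen and a log file.

    stderr is deliberately not redirected: it goes straight to the
    terminal, so a tool that logs heavily to stderr (as spark-submit
    does) cannot stall the stdout reading loop.
    """
    with open(log_file, 'w') as fh:
        process = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                   universal_newlines=True)
        for line in process.stdout:
            sys.stdout.write(line)
            fh.write(line)
    return process.wait()

# Stand-in command; the real call would pass the spark-submit cmdList.
log_path = os.path.join(tempfile.gettempdir(), "out.log")
rc = stream_job([sys.executable, "-c", "print('hello')"], log_path)
```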

To print the Spark log, one can call the commandList given by user330612:
cmdList = ["spark-submit", "--spark-master", "spark://127.0.0.1:7077", "--class", "com.spark.myapp", "./myapp.jar"]
Then it can be printed using subprocess; remember to use communicate() to prevent deadlocks (https://docs.python.org/2/library/subprocess.html):
Warning: Deadlock when using stdout=PIPE and/or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use communicate() to avoid that.
Here below is the code to print the log.
import subprocess
p = subprocess.Popen(cmdList, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = p.communicate()
stderr = stderr.splitlines()
stdout = stdout.splitlines()
for line in stderr:
    print line  # now it can be printed line by line, or written to a file, for the log
for line in stdout:
    print line  # for the output
More information about subprocess and printing lines can be found at:
https://pymotw.com/2/subprocess/

Related

python Popen print the output of the external application and save it (log it) at the same time

I am trying to run an external application using Popen and print the output in the console or a separate console (better), and at the same time save the output to a file. There is no user interaction via console; app.bat just writes its data and should terminate automatically when execution is finished.
Running the following command will result only in printing the results in the python console.
p = subprocess.Popen("app.bat --x --y", shell=False)
If I add stdout as a file, I can redirect the output to the file, but nothing is written to the console, which gives users no feedback (and the feedback needs to be in real time, not after execution, because the app runs for approximately 1-3 minutes).
file_ = open("output.txt", "w+")
p = subprocess.Popen("app.bat --x --y", shell=False, stdout=file_)
Therefore, my question is how to run the external app and at the same time write in the console and in the file?
For what you want to do I'd encourage you to use the logging module.
A good starter here is https://docs.python.org/2/howto/logging-cookbook.html
It even describes your use case almost exactly.
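As a hedged sketch, the cookbook recipe boils down to attaching two handlers to one logger; the logger name, filename, and format below are illustrative:

```python
import logging

logger = logging.getLogger("app")
logger.setLevel(logging.DEBUG)

# One handler writes to a file, the other to the console (stderr).
file_handler = logging.FileHandler("output.txt", mode="w")
file_handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
console_handler = logging.StreamHandler()

logger.addHandler(file_handler)
logger.addHandler(console_handler)

logger.info("a line from app.bat")  # appears on screen and in output.txt
```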
If you want to post-process the output of your Popen() call, you should typically redirect stdout to PIPE and then read the output from there. This will allow you to e.g. both write to file and to screen:
import subprocess

logfile = 'output.txt'
command = ['app.bat', '--x', '--y']
p = subprocess.Popen(command, stdout=subprocess.PIPE, universal_newlines=True)
with open(logfile, 'w+') as f:
    for line in p.stdout:
        print(line.rstrip())
        f.write(line)
Now, this will block until app.bat finishes, which may be exactly what you want. But, if you want your Python script to continue to run, and have app.bat run in the background, you can start a thread that will handle your subprocess stdout:
import subprocess
import threading

logfile = 'output.txt'
command = ['app.bat', '--x', '--y']

def writer(p, logfile):
    with open(logfile, 'w+') as f:
        for line in p.stdout:
            print(line.rstrip())
            f.write(line)

p = subprocess.Popen(command, stdout=subprocess.PIPE, universal_newlines=True)
t = threading.Thread(target=writer, args=(p, logfile))
t.start()
# Other commands while app.bat runs
t.join()

How to debug subprocess call in Python/Django

In my Python/Django code I call gdalbuildvrt process. This process should create a .VRT file, but it does not. In order to check it, I write the subprocess output to a debug file. This is how I do it:
process = subprocess.Popen(["gdalbuildvrt", mosaic, folder], stdout=subprocess.PIPE)
stdout = process.communicate()[0]
with open(os.path.join(os.path.dirname(os.path.abspath(__file__)), "debug.txt"), 'w') as file:
    file.write('{}'.format(stdout) + " -> " + mosaic)
As a result I see this in debug.txt file:
b'0...10...20...30...40...50...60...70...80...90...100 - done.\n' -> /var/www/vwrapper/accent/accent/layers/raw/mosaic.vrt
So, as you can see, the first part of the debug statement says that it's OK:
0...10...20...30...40...50...60...70...80...90...100 - done.
And the second part says that /var/www/vwrapper/accent/accent/layers/raw/mosaic.vrt should be created. However, when I go to the target folder and refresh it, I see no mosaic.vrt file there. So what may be wrong, and how can I fix it? I should add that on a Windows machine it works 100% of the time, but on CentOS it does not.
process = subprocess.Popen(["gdalbuildvrt", mosaic, folder],
                           stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = process.communicate()
ret = process.returncode
or
process = subprocess.Popen(["gdalbuildvrt", mosaic, folder],
                           stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
which will redirect stderr to stdout.
Then log both of those: all error reporting should appear on stderr, not stdout, and the exit status will be available via process.returncode.
You could also probably use one of the higher-level functions, like subprocess.check_call().
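For instance (a sketch using the standard true and false commands as stand-ins for the gdalbuildvrt call; the run_checked helper is hypothetical), check_call raises CalledProcessError on a non-zero exit status instead of letting a failure pass silently:

```python
import subprocess

def run_checked(cmd):
    """Return 0 on success, or the child's non-zero exit code on failure."""
    try:
        subprocess.check_call(cmd)
        return 0
    except subprocess.CalledProcessError as e:
        return e.returncode

ok = run_checked(["true"])    # exits 0, so check_call returns normally
bad = run_checked(["false"])  # exits 1, so CalledProcessError is raised
```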

python subprocess.call output is not interleaved

I have a python (v3.3) script that runs other shell scripts. My python script also prints messages like "About to run script X" and "Done running script X".
When I run my script I'm getting all the output of the shell scripts separate from my print statements. I see something like this:
All of script X's output
All of script Y's output
All of script Z's output
About to run script X
Done running script X
About to run script Y
Done running script Y
About to run script Z
Done running script Z
My code that runs the shell scripts looks like this:
print( "running command: " + cmnd )
ret_code = subprocess.call( cmnd, shell=True )
print( "done running command")
I wrote a basic test script and do *not* see this behaviour. This code does what I would expect:
print("calling")
ret_code = subprocess.call("/bin/ls -la", shell=True )
print("back")
Any idea on why the output is not interleaved?
Thanks. This works but has one limitation: you can't see any output until after the command completes. I found an answer to another question (here) that uses Popen but also lets me see the output in real time. Here's what I ended up with:
import subprocess
import sys

cmd = ['/media/sf_git/test-automation/src/SalesVision/mswm/shell_test.sh', '4', '2']

print('running command: "{0}"'.format(cmd))  # output the command.

# Here, we join the STDERR of the application with the STDOUT of the application.
process = subprocess.Popen(cmd, bufsize=1, universal_newlines=True,
                           stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
for line in iter(process.stdout.readline, ''):
    line = line.replace('\n', '')
    print(line)
    sys.stdout.flush()

process.wait()  # Wait for the underlying process to complete.
errcode = process.returncode  # Harvest its returncode, if needed.
print('Script ended with return code of: ' + str(errcode))
This uses Popen and allows me to see the progress of the called script.
It has to do with STDOUT and STDERR buffering. You should be using subprocess.Popen to redirect STDOUT and STDERR from your child process into your application. Then, as needed, output them. Example:
import subprocess

cmd = ['ls', '-la']
print('running command: "{0}"'.format(cmd))  # output the command.

# Here, we join the STDERR of the application with the STDOUT of the application.
process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
out, err = process.communicate()  # Waits for the process and captures STDOUT/STDERR.
errcode = process.returncode  # Harvest its returncode, if needed.
print(out)
print('done running command')
Additionally, I wouldn't use shell=True unless it's really required. It forces subprocess to fire up a whole shell environment just to run a command. It's usually better to pass the argument list directly to Popen, and to supply any variables the child needs via its env parameter.
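A sketch of what that looks like in practice (the MY_FLAG variable and the python -c child are stand-ins for the real command):

```python
import os
import subprocess
import sys

# Pass an argument list instead of a shell string, and extend the
# child's environment through env rather than through the shell.
env = dict(os.environ, MY_FLAG="1")
proc = subprocess.Popen(
    [sys.executable, "-c", "import os; print(os.environ['MY_FLAG'])"],
    stdout=subprocess.PIPE, universal_newlines=True, env=env)
out, _ = proc.communicate()
```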

python, iterate on subprocess.Popen() stdout/stderr

There are a lot of similar posts, but I didn't find an answer.
On GNU/Linux, with Python and the subprocess module, I use the following code to iterate over the stdout/stderr of a command launched with subprocess:
import subprocess

class Shell:
    """
    Run a command and iterate over the stdout/stderr lines.
    """
    def __init__(self):
        pass

    def __call__(self, args, cwd='./'):
        p = subprocess.Popen(args,
                             cwd=cwd,
                             stdout=subprocess.PIPE,
                             stderr=subprocess.STDOUT,
                             )
        while True:
            line = p.stdout.readline()
            self.code = p.poll()
            if line == '':
                if self.code is not None:
                    break
                else:
                    continue
            yield line

# example of use
args = ["./foo"]
shell = Shell()
for line in shell(args):
    # do something with line
    print line,
This works fine... except when the command executed is python, for example `args = ['python', 'foo.py']`, in which case the output is not flushed but printed only when the command has finished.
Is there a solution?
Check out How to flush output of Python print?.
You need to run the python subprocess with the -u option:
-u     Force stdin, stdout and stderr to be totally unbuffered. On
       systems where it matters, also put stdin, stdout and stderr in
       binary mode. Note that there is internal buffering in xreadlines(),
       readlines() and file-object iterators ("for line in sys.stdin")
       which is not influenced by this option. To work around this, you
       will want to use "sys.stdin.readline()" inside a "while 1:" loop.
Or, if you have control over the python sub-process script you can use sys.stdout.flush() to flush the output every time you print.
import sys
sys.stdout.flush()
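Putting the two together, here is a sketch of launching the child interpreter with -u (an inline -c script stands in for foo.py):

```python
import subprocess
import sys

# -u makes the child's stdout unbuffered, so each print reaches the
# pipe immediately instead of being held in the child's stdio buffer.
child = [sys.executable, "-u", "-c", "print('line 1'); print('line 2')"]
p = subprocess.Popen(child, stdout=subprocess.PIPE, universal_newlines=True)
lines = [line.rstrip() for line in p.stdout]
p.wait()
```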

Subprocess.communicate prints newline to standard output

I have a script that calls ffprobe, parses its output and outputs it to the console.
Here's a stripped-down version of it without the parsing code and command-line options:
"""Invoke ffprobe to query a video file and parse the output"""
def ffprobe(fname):
    import subprocess as sub
    import re
    p = sub.Popen(['ffprobe', fname], stderr=sub.PIPE)
    stdout, stderr = p.communicate()

def main():
    ffprobe("foo.mp4")
    #print options.formatstr % locals()

if __name__ == '__main__':
    main()
You can see that the only print statement in my code is commented out, so the program shouldn't really output anything. However, this is what I get:
mpenkov@misha-desktop:~/co/youtube$ python ffprobe.py foo.mp4

mpenkov@misha-desktop:~/co/youtube$ python ffprobe.py foo.mp4

mpenkov@misha-desktop:~/co/youtube$ python ffprobe.py foo.mp4
A newline is mysteriously output by each invocation. Where is it coming from, and how can I deal with it?
There appears to be a similar SO question, except it's not using the communicate call (http://stackoverflow.com/questions/7985334/python-subprocess-proc-stderr-read-introduce-extra-lines).
I cannot reproduce the problem, so maybe it depends on the file you're passing to ffprobe.
Anyway, from what I see, stdout isn't being captured, so maybe the problem is just that ffprobe is printing a new line character to stdout.
To confirm this, please replace:
p = sub.Popen(['ffprobe', fname], stderr=sub.PIPE)
stdin, stderr = p.communicate()
with:
p = sub.Popen(['ffprobe', fname], stdout=sub.PIPE, stderr=sub.PIPE)
stdout, stderr = p.communicate()
In the new version, stdout is captured and the output from p.communicate is correctly named since it returns stdout not stdin.
