I have some code that spawns at most 4 processes at a time. It looks for any new jobs that have been submitted and, if any exist, runs the Python code:
for index, row in enumerate(rows):
    if index < 4:
        dirs = row[0]
        dirName = os.path.join(homeFolder, dirs)
        logFile = os.path.join(dirName, dirs + ".log")
        proc = subprocess.Popen(["python", "test.py", dirs], stdout=open(logFile, 'w'))
I have a few questions:
When I try to write the output or errors to the log file, nothing is written until the process finishes. Is it possible to write the output to the file while the process runs? That would help me see what stage it is at.
When one process finishes, I want the next job in the queue to be started, rather than waiting for all child processes to finish before the daemon starts any new ones.
Any help will be appreciated. Thanks!
For question 2 you can take a look at http://docs.python.org/library/multiprocessing.html
Concerning point 1, try to adjust the buffering used for the log file:
open(logFile,'w', 1) # line-buffered (writes to the file after each logged line)
open(logFile,'w', 0) # unbuffered (should immediately write to the file)
If it suits your needs, choose line-buffered rather than unbuffered.
Concerning your general problem, as #Tichodroma suggests, you should have a try with Python's multiprocessing module.
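A minimal sketch of what that could look like with multiprocessing.Pool, reusing the rows and homeFolder variables from your snippet (run_job is a hypothetical wrapper, not part of your code); the pool keeps at most 4 jobs running and starts the next queued one as soon as a worker frees up:
import os
import subprocess
from multiprocessing import Pool

def run_job(dirs):
    dirName = os.path.join(homeFolder, dirs)
    logFile = os.path.join(dirName, dirs + ".log")
    log = open(logFile, 'w', 1)          # line-buffered, so the log fills as the job runs
    try:
        return subprocess.call(["python", "test.py", dirs], stdout=log, stderr=log)
    finally:
        log.close()

pool = Pool(processes=4)                 # never more than 4 jobs at once
exit_codes = pool.map(run_job, [row[0] for row in rows])
pool.close()
pool.join()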
I have a script that prints colored output if it is on a tty. A bunch of them execute in parallel, so I can't attach their stdout to a tty. I don't have control over the script's code either (to force coloring), so I want to fake it via a pty. My code:
invocation = get_invocation()
master, slave = pty.openpty()
subprocess.call(invocation, stdout=slave)
print string_from_fd(master)
And I can't figure out what should be in string_from_fd. For now, I have something like
def string_from_fd(fd):
    return os.read(fd, 1000)
It works, but that number 1000 looks strange. I think the output can be quite large, and any fixed number there could be insufficient. I tried a lot of solutions from Stack Overflow, but none of them work (they print nothing or hang forever).
I am not very familiar with file descriptors and all that, so any clarification if I'm doing something wrong would be much appreciated.
Thanks!
This won't work for long outputs: subprocess.call will block once the PTY's buffer is full. That's why subprocess.communicate exists, but that won't work with a PTY.
The standard/easiest solution is to use the external module pexpect, which uses PTYs internally. For example,
pexpect.spawn("/bin/ls --color=auto").read()
will give you the ls output with color codes.
If you'd like to stick to subprocess, then you must use subprocess.Popen for the reason stated above. You are right in your assumption that by passing 1000, you read at most 1000 bytes, so you'll have to use a loop.
os.read blocks if there is nothing to read and waits for data to appear. The catch is how to recognize when the process terminated: in this case, you know that no more data will arrive, and the next call to os.read will block forever. Luckily, the operating system helps you detect this situation: if all file descriptors to the pseudo terminal that could be used for writing are closed, then os.read will either return an empty string or return an error, depending on the OS. You can check for this condition and exit the loop when this happens.
The final piece to understanding the following code is to understand how open file descriptors and subprocess go together: subprocess.Popen internally calls fork(), which duplicates the current process including all open file descriptors, and then within one of the two execution paths calls exec(), which terminates the current process in favour of a new one. In the other execution path, control returns to your Python script. So after calling subprocess.Popen there are two valid file descriptors for the slave end of the PTY: one belongs to the spawned process, one to your Python script. If you close yours, then the only file descriptor that could be used to send data to the master end belongs to the spawned process. Upon its termination, it is closed, and the PTY enters the state where calls to read on the master end fail.
Here's the code:
import os
import pty
import subprocess
master, slave = pty.openpty()
process = subprocess.Popen("/bin/ls --color", shell=True, stdout=slave,
                           stdin=slave, stderr=slave, close_fds=True)
os.close(slave)
output = []
while True:
    try:
        data = os.read(master, 1024)
    except OSError:
        break
    if not data:
        break
    output.append(data)  # In Python 3, append ".decode()" to os.read()
output = "".join(output)
I have a piece of code that is starting a process then reading from stdout to see if it has loaded OK.
After that, I'd ideally like to redirect the output to /dev/null or something else that discards it. I was wondering (A) what the best practice is in this situation and (B) what will happen to the writing process if the pipe becomes full? Will it block when the pipe is full and not being read/cleared?
If the aim is to redirect to /dev/null would it be possible to show me how to to this with python and subprocess.Popen?
proc = subprocess.Popen(command, stderr=subprocess.PIPE)
while True:
    if init_string in proc.stderr.readline():
        break
proc.stderr.redirect ??
As far as I know, there is no way to close and reopen file descriptors of a child process after it has started executing. And yes, there is a limited buffer in the OS, so if you don't consume anything from the pipe, eventually the child process will block. That means you'll just have to keep reading from the pipe until it's closed from the write end.
If you want your program to continue doing something useful in the meantime, consider moving the data-consuming part to a separate thread (untested):
from threading import Thread

def read_all_from_pipe(pipe):
    for line in pipe:  # assuming it's line-based
        pass

Thread(target=lambda: read_all_from_pipe(proc.stderr)).start()
There may be other ways to solve your problem, though. Why do you need to wait for some particular output in the first place? Shouldn't the child just die with a nonzero exit code if it didn't "load OK"? Can you instead check that the child is doing what it should, rather than that it's printing some arbitrary output?
If you would like to discard all the output:
python your_script.py > /dev/null
However, if you want to do it from Python you can use:
import sys
sys.stdout = open('file', 'w')
print 'this goes to file'
Every time you print, the standard output is redirected to the file "file"; change that to /dev/null or any file you want and you will get the result you're after.
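If you specifically want the subprocess's own output discarded via subprocess.Popen, one option (a sketch, with a hypothetical command) is to open os.devnull and pass it as stdout/stderr when the process is started; note that, as explained above, this has to be set up at launch time since the child's descriptors can't be swapped afterwards (on Python 3.3+ subprocess.DEVNULL can be used instead of opening os.devnull yourself):
import os
import subprocess

devnull = open(os.devnull, 'w')
proc = subprocess.Popen(["some_command", "--some-flag"],  # hypothetical command
                        stdout=devnull, stderr=devnull)
proc.wait()
devnull.close()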
I have kind of a newbie question regarding python and disk writes. Basically I am executing some popen processes in sequence where the second process takes output from the first as its input file. For example:
p = subprocess.Popen(["mysqldump", "--single-transaction", "-u",
database_username, "--password="+database_password, "--databases",
"--host", server_address, database_name, ],
stdout = open( outputfile, 'w') , stderr=subprocess.PIPE)
error = p.stderr.read()
Then
p2 = subprocess.Popen(["tar", "-C", backup_destination,
                       "--remove-files", "--force-local", "-czf", gzipoutputfile,
                       mysqlfilename], stderr=subprocess.PIPE)
error2 = p2.stderr.read()
This usually finishes fine in sequence without any problems. Note that the second process reads from the file the first process generates. Every once in a while I'll get an error on the second subprocess that says "tar: host-ucpsom_2012-2014-08-05-0513.mysql: file changed as we read it".
I am assuming this is because there are some cached disk writes from the first process, and the file is actually still being written to disk after the first process has terminated and is no longer in memory.
So, my question is: is there an elegant way to wait for the cached disk writes to complete before reading from this file? One thing I thought of was to check the size of the file on disk, wait a couple of seconds, then check the size again and, if they are the same, assume it is done being written. But I feel there has to be a more elegant way to solve this problem. Would anybody be able to advise in this regard? I appreciate you taking the time to answer my question.
Call p.wait() (or another call which indirectly waits for exit, such as communicate()) before invoking p2.
Calling only p.stderr.read() waits for p to close its stderr channel; however, a program can close its stderr before closing the rest of its file descriptors (which is, for each individual file handle, the step that triggers flush to the VFS layer) and exiting.
If your filesystem is NFS on Linux, ensure that the sync flag is in use (as opposed to the default async), so that operations are complete on the remote end before the local end proceeds.
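A minimal sketch of the fix, reusing the commands and variable names from your snippet:
# Run mysqldump and wait for it to exit before starting tar.
p = subprocess.Popen(["mysqldump", "--single-transaction", "-u",
                      database_username, "--password=" + database_password,
                      "--databases", "--host", server_address, database_name],
                     stdout=open(outputfile, 'w'), stderr=subprocess.PIPE)
_, error = p.communicate()   # blocks until mysqldump has exited and drains stderr

p2 = subprocess.Popen(["tar", "-C", backup_destination, "--remove-files",
                       "--force-local", "-czf", gzipoutputfile, mysqlfilename],
                      stderr=subprocess.PIPE)
_, error2 = p2.communicate()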
Try using a flag (lock) file: have the first process create it and remove it when it shuts down; its removal is the sign that the work of the first process is complete.
Most of the examples I've seen with os.fork and the subprocess/multiprocessing modules show how to fork a new instance of the calling Python script or a chunk of Python code. What would be the best way to spawn a set of arbitrary shell commands concurrently?
I suppose I could just use subprocess.call or one of the Popen commands and pipe the output to a file, which I believe will return immediately, at least to the caller. I know this is not that hard to do; I'm just trying to figure out the simplest, most Pythonic way to do it.
Thanks in advance
All calls to subprocess.Popen return immediately to the caller. It's the calls to wait and communicate which block. So all you need to do is spin up a number of processes using subprocess.Popen (set stdin to /dev/null for safety), and then one by one call communicate until they're all complete.
Naturally I'm assuming you're just trying to start a bunch of unrelated (i.e. not piped together) commands.
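A minimal sketch of that pattern, with a couple of hypothetical commands:
import os
import subprocess

devnull = open(os.devnull, 'r')
commands = [["sleep", "2"], ["ls", "-l"]]                           # hypothetical commands
procs = [subprocess.Popen(cmd, stdin=devnull) for cmd in commands]  # all start immediately
for p in procs:
    p.communicate()                                                 # wait for each one in turn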
I like to use PTYs instead of pipes. For a bunch of processes where I only want to capture error messages I did this.
RNULL = open('/dev/null', 'r')
WNULL = open('/dev/null', 'w')
logfile = open("myprocess.log", "a", 1)
REALSTDERR = sys.stderr
sys.stderr = logfile
This next part was in a loop spawning about 30 processes.
sys.stderr = REALSTDERR
master, slave = pty.openpty()
self.subp = Popen(self.parsed, shell=False, stdin=RNULL, stdout=WNULL, stderr=slave)
sys.stderr = logfile
After this I had a select loop which collected any error messages and sent them to the single log file. Using PTYs meant that I never had to worry about partial lines getting mixed up because the line discipline provides simple framing.
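The select loop itself isn't shown above; a rough sketch of what it could look like, assuming masters is a list of the PTY master fds collected while spawning (that list is my addition, not in the code above):
import os
import select

while masters:
    readable, _, _ = select.select(masters, [], [])
    for fd in readable:
        try:
            data = os.read(fd, 1024)
        except OSError:            # the child exited and its slave end is closed
            data = ''
        if not data:
            masters.remove(fd)
            os.close(fd)
        else:
            logfile.write(data)    # logfile as opened above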
There is no best for all possible circumstances. The best depends on the problem at hand.
Here's how to spawn a process and save its output to a file combining stdout/stderr:
import os
import subprocess
import sys

def spawn(cmd, output_file):
    on_posix = 'posix' in sys.builtin_module_names
    return subprocess.Popen(cmd, close_fds=on_posix, bufsize=-1,
                            stdin=open(os.devnull, 'rb'),
                            stdout=output_file,
                            stderr=subprocess.STDOUT)
To spawn multiple processes that can run in parallel with your script and each other:
processes, files = [], []
try:
    for i, cmd in enumerate(commands):
        files.append(open('out%d' % i, 'wb'))
        processes.append(spawn(cmd, files[-1]))
finally:
    for p in processes:
        p.wait()
    for f in files:
        f.close()
Note: cmd is a list everywhere.
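For example, commands could be something like this (hypothetical commands):
commands = [["ls", "-l"], ["sleep", "5"], ["du", "-sh", "."]]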
I suppose, I could just use subprocess.call or one of the Popen commands and pipe the output to a file, which I believe will return immediately, at least to the caller.
That's not a good way to do it if you want to process the data.
In this case, better do
sp = subprocess.Popen(['ls', '-l'], stdout=subprocess.PIPE)
and then use sp.communicate() or read directly via sp.stdout.read().
If the data is to be processed in the calling program at a later time, there are two ways to go:
You can retrieve the data as soon as possible, maybe via a separate thread, reading it and storing it somewhere where the consumer can get it (see the sketch after this list).
You can have the producing subprocess block and retrieve the data from it when you need it. The subprocess produces as much data as fits into the pipe buffer (usually 64 KiB) and then blocks on further writes. As soon as you need the data, you read() from the subprocess object's stdout (maybe stderr as well) and use it - or, again, you use sp.communicate() at that later time.
Way 1 would be the way to go if producing the data takes a long time, so that your program would otherwise have to wait.
Way 2 is preferable if the amount of data is quite large and/or the data is produced so fast that intermediate buffering would make no sense.
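A rough sketch of way 1, reading the output in a background thread and handing it to the consumer through a queue (the reader function and queue here are illustrative, not a fixed recipe):
import subprocess
import threading
import Queue                     # named "queue" on Python 3

def reader(pipe, q):
    for line in iter(pipe.readline, ''):
        q.put(line)              # store each line for the consumer
    pipe.close()

sp = subprocess.Popen(['ls', '-l'], stdout=subprocess.PIPE)
q = Queue.Queue()
t = threading.Thread(target=reader, args=(sp.stdout, q))
t.daemon = True
t.start()

# ... later, when the data is actually needed:
sp.wait()
t.join()
while not q.empty():
    line = q.get()
    # process line here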
See an older answer of mine, including code snippets, that does the following:
Uses processes, not threads, for blocking I/O because they can be terminated more reliably via p.terminate()
Implements a retriggerable timeout watchdog that restarts counting whenever some output happens
Implements a long-term timeout watchdog to limit overall runtime
Can feed in stdin (although I only need to feed in one-time short strings)
Can capture stdout/stderr by the usual Popen means (only stdout is coded, with stderr redirected to stdout, but they can easily be separated)
It's almost realtime because it only checks every 0.2 seconds for output. But you could decrease this or remove the waiting interval easily
Lots of debugging printouts are still enabled, to see what's happening when.
For spawning multiple concurrent commands, you would need to alter the class RunCmd to instantiate multiple read-output/write-input queues and to spawn multiple Popen subprocesses.
I've recently needed to write a script that performs an os.fork() to split into two processes. The child process becomes a server process and passes data back to the parent process using a pipe created with os.pipe(). The child closes the 'r' end of the pipe and the parent closes the 'w' end of the pipe, as usual. I convert the returns from pipe() into file objects with os.fdopen.
The problem I'm having is this: The process successfully forks, and the child becomes a server. Everything works great and the child dutifully writes data to the open 'w' end of the pipe. Unfortunately the parent end of the pipe does two strange things:
A) It blocks on the read() operation on the 'r' end of the pipe.
B) It fails to read any data that was put on the pipe unless the 'w' end is entirely closed.
I immediately thought that buffering was the problem and added pipe.flush() calls, but these didn't help.
Can anyone shed some light on why the data doesn't appear until the writing end is fully closed? And is there a strategy to make the read() call non blocking?
This is my first Python program that forked or used pipes, so forgive me if I've made a simple mistake.
Are you using read() without specifying a size, or treating the pipe as an iterator (for line in f)? If so, that's probably the source of your problem - read() is defined to read until the end of the file before returning, rather than just read what is available for reading. That will mean it will block until the child calls close().
In the example code linked to, this is OK - the parent is acting in a blocking manner, and just using the child for isolation purposes. If you want to continue, then either use non-blocking IO as in the code you posted (but be prepared to deal with half-complete data), or read in chunks (eg r.read(size) or r.readline()) which will block only until a specific size / line has been read. (you'll still need to call flush on the child)
It looks like treating the pipe as an iterator uses some further buffering as well, so "for line in r:" may not give you what you want if you need each line to be consumed immediately. It may be possible to disable this, but just specifying 0 for the buffer size in fdopen doesn't seem to be sufficient.
Here's some sample code that should work:
import os, sys, time
r, w = os.pipe()
r, w = os.fdopen(r, 'r', 0), os.fdopen(w, 'w', 0)

pid = os.fork()
if pid:  # Parent
    w.close()
    while 1:
        data = r.readline()
        if not data:
            break
        print "parent read: " + data.strip()
else:  # Child
    r.close()
    for i in range(10):
        print >>w, "line %s" % i
        w.flush()
        time.sleep(1)
Using
fcntl.fcntl(readPipe, fcntl.F_SETFL, os.O_NONBLOCK)
Before invoking the read() solved both problems. The read() call no longer blocks, and the data appears after just a flush() on the writing end.
I see you have solved the problem of blocking i/o and buffering.
A note if you decide to try a different approach: subprocess is the equivalent / a replacement for the fork/exec idiom. It seems like that's not what you're doing: you have just a fork (not an exec) and exchanging data between the two processes -- in this case the multiprocessing module (in Python 2.6+) would be a better fit.
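A minimal sketch of that approach with multiprocessing, mirroring the pipe example above (the server function is just a placeholder for the child logic):
from multiprocessing import Process, Pipe

def server(conn):
    # placeholder for the child/server work
    for i in range(10):
        conn.send("line %s" % i)
    conn.close()

parent_conn, child_conn = Pipe()
p = Process(target=server, args=(child_conn,))
p.start()
child_conn.close()               # the parent keeps only its own end
try:
    while True:
        print parent_conn.recv()
except EOFError:                 # raised once the child has closed its end
    pass
p.join()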
The "parent" vs. "child" part of fork in a Python application is silly. It's a legacy from 16-bit unix days. It's an affectation from a day when fork/exec and exec were Important Things to make the most of a tiny little processor.
Break your Python code into two separate parts: parent and child.
The parent part should use subprocess to run the child part.
A fork and exec may happen somewhere in there -- but you don't need to care.
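For example, the parent part might look something like this, where child.py is a hypothetical script holding the child/server code and printing its results to stdout:
import subprocess
import sys

# Run the child part as its own process and read what it prints.
child = subprocess.Popen([sys.executable, "child.py"], stdout=subprocess.PIPE)
while True:
    line = child.stdout.readline()
    if not line:
        break
    print "from child: " + line.strip()
child.wait()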