How to handle piped `head` on huge input with python subprocess?

When a series of commands is piped in Linux, the pipeline is handled efficiently, i.e. the earlier processes are terminated once the last process in the pipeline has terminated. For instance:
cat filename | head -n 1
zcat filename | head -n 1
hadoop fs -cat /some/path | head -n 1
In each of the above, the cat command would take considerable time on its own, yet the combined command finishes quickly. How is this done internally? Are the first commands (the cat commands) sent SIGTERM or SIGKILL by the OS as soon as head terminates?
I wanted to do something similar in Python and was wondering what should be the best way to do it. I am trying to do the following:
from subprocess import Popen, PIPE

p1 = Popen(['hadoop', 'fs', '-cat', path], stdout=PIPE)
p2 = Popen(['head', '-n', str(num_lines)], stdin=p1.stdout, stdout=PIPE)
p2.communicate()
p1.kill() or p1.terminate()
Is this efficient?

Actually, I believe that the process is being sent SIGPIPE when head closes. From Wikipedia:
SIGPIPE
The SIGPIPE signal is sent to a process when it attempts to write to a pipe without a process connected to the other end.
Also, from a few answers to a question about SIGPIPE:
...
You see, when the file descriptor with the pending write is closed, the SIGPIPE happens right then. While the write will return -1 eventually, the whole point of the signal is to notify you asynchronously that the write is no longer possible. This is part of what makes the whole elegant co-routine structure of pipes work in UNIX.
...
https://stackoverflow.com/a/8369516/2334407
...
https://www.gnu.org/software/libc/manual/html_mono/libc.html
This link says:
A pipe or FIFO has to be open at both ends simultaneously. If you read from a pipe or FIFO file that doesn't have any processes writing to it (perhaps because they have all closed the file, or exited), the read returns end-of-file. Writing to a pipe or FIFO that doesn't have a reading process is treated as an error condition; it generates a SIGPIPE signal, and fails with error code EPIPE if the signal is handled or blocked.
...
https://stackoverflow.com/a/18971899/2334407
I think it is to get the error handling correct without requiring a lot of code in everything writing to a pipe.
Some programs ignore the return value of write(); without SIGPIPE they would uselessly generate all output.
Programs that check the return value of write() likely print an error message if it fails; this is inappropriate for a broken pipe as it is not really an error for the whole pipeline.
https://stackoverflow.com/a/8370870/2334407
Now, to answer your question about the best way to do it: I'd say don't send any signals. Instead, read as much data as you need and then simply close the pipe. The kernel will then clean up for you, delivering SIGPIPE to the writing process the next time it tries to write.
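For instance, a minimal sketch along those lines, assuming the same hadoop fs -cat command and num_lines variable from the question:

from subprocess import Popen, PIPE

num_lines = 1          # how many lines we actually need (as in the question)
path = '/some/path'    # placeholder path from the question

p1 = Popen(['hadoop', 'fs', '-cat', path], stdout=PIPE)

# Read only as much as we need from the pipe.
lines = [p1.stdout.readline() for _ in range(num_lines)]

# Closing our end of the pipe means the producer's next write() fails
# with EPIPE/SIGPIPE, so the kernel effectively stops it for us.
p1.stdout.close()
p1.wait()

Note that the producer isn't killed at the instant the pipe is closed; it receives SIGPIPE (or sees EPIPE) the next time it tries to write, which for a command producing a lot of output happens almost immediately.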

Related

Python other way to wait for an event

I want my program to wait until a specific file contains text instead of an empty string. Another program writes data to the file. When I run the first program, my computer starts overheating because of the while loop that continuously checks the file content. What can I do instead of that loop?
A better solution would be to start that process from within your Python script:
from subprocess import call
retcode = call(['myprocess', 'arg1', 'arg2', 'argN'])
Check whether retcode is zero: zero means success, i.e. your process ran with no problems. You could also use os.system instead of subprocess.call. Once the process has finished, you know you can read the file.
Why this method is better than monitoring files?
The process might fail and there might be no output in the file you're trying to read from.
In that scenario, your script will check the file again and again, looking for data, which wastes kernel I/O time. Nothing guarantees that the process will succeed every time.
The process may receive signals (e.g. STOP and CONT). If the process receives the STOP signal, the kernel stops it and there might be nothing you can read from the output file, especially if you intend to read all the data at once, as when you're sorting a file. Once the process receives the CONT signal, it starts running again. Basically, this means your Python script could be trying to read from the file while the process is stopped.
The disadvantage of this method is that the process needs to finish before your Python script can process the output from the file. subprocess.call blocks: the next line won't be executed by the Python interpreter until the spawned process finishes. You could instead use subprocess.Popen, which is non-blocking. Even better, if possible, redirect the output of the process to stdout, use Popen to read the process's output from its stdout, and then write that output to a file from the Python script, as in the sketch below.
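As a rough, hypothetical sketch of that last suggestion ('myprocess' and 'output.txt' are just placeholders, not real names from the question):

from subprocess import Popen, PIPE

# Read the process output directly instead of polling a file.
proc = Popen(['myprocess', 'arg1', 'arg2', 'argN'], stdout=PIPE)

with open('output.txt', 'wb') as out:   # hypothetical destination file
    for line in proc.stdout:            # blocks until data arrives; no busy loop
        out.write(line)

retcode = proc.wait()
if retcode != 0:
    raise RuntimeError('myprocess failed with exit code %d' % retcode)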

Should I always close stdout explicitly?

I am trying to integrate a small Win32 C++ program which reads from stdin and writes the decoded result (~128 KB) to the output stream.
I read the entire input into a buffer with
while (std::cin.get(c)) { }
Afterwards I write the entire output to stdout.
Everything works fine when I run the application from the command line, e.g. test.exe < input.bin > output.bin; however, this small app is supposed to be run from Python.
I expect that Python's subprocess communicate() is supposed to be used; the docs say:
Interact with process: Send data to stdin. Read data from stdout and
stderr, until end-of-file is reached. Wait for process to terminate.
So communicate() will wait until end-of-file before waiting for my app to finish. Is EOF supposed to happen when my application exits? Or should I explicitly call fclose(stderr) and fclose(stdout)?
Don't close stdout
In the general case, it is actually wrong, since it is possible to register a function with atexit() which tries to write to stdout, and this will break if stdout is closed.
When the process terminates, all handles are closed by the operating system automatically. This includes stdout, so you are not responsible for closing it manually.
(Technically, the C++ runtime will normally try to flush and close all C++ streams before the OS even has a chance to get involved, but the OS absolutely must close any handles which the runtime, for whatever reason, misses.)
In specialized circumstances, it may be useful to close standard streams (for example, when daemonizing), but it should be done with great care. It's usually a good idea to redirect to or from the null device (/dev/null on Unix, nul on Windows) so that code expecting to interact with those streams will still work. On Unix, this is done with freopen(3); Windows has an equivalent function, but it's part of the POSIX API and may not work well with standard Windows I/O.
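On the Python side, a minimal sketch using the test.exe and file names from the question might look like this; communicate() returns once the child exits, because that is when EOF appears on the pipe:

import subprocess

# communicate() sends stdin, reads stdout until EOF, then waits for the child.
# EOF arrives when the child exits and the OS closes its end of the pipe,
# so no explicit fclose(stdout) is needed inside the C++ program.
proc = subprocess.Popen(['test.exe'],
                        stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE)
with open('input.bin', 'rb') as f:
    out, _ = proc.communicate(f.read())
with open('output.bin', 'wb') as f:
    f.write(out)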

Python subprocess.call() spawning new process for every call

I am trying to send .mp4 files to an mp4 tagging application. My problem is that on Windows, every time I call subprocess.call() or subprocess.Popen(), a new process is spawned.
What I want is to open the file in the existing process if the process is already running... is this possible or will it depend on how the process being called handles new process calls?
here is what I have:
def sendToTagger(self, file):
    msg = "-- " + self.getDateStamp() + "-- Sending " + os.path.basename(file) + " to Tagger...\r\n"
    self.logFile.write(msg)
    print(msg)
    p = subprocess.Popen(['C:\\Program Files (x86)\\Tagger\\Tagger.exe', file], shell=False, stdin=None, stdout=None)
It has to spawn a new process, since you are calling an external command that is not native to your Python code. But you can, if you wish, wait for the process to complete by calling p.wait().
subprocess.Popen always opens a new process (that is its purpose). You need to determine how Tagger.exe allows another program to programmatically request it to open a new file. In the simplest case you can communicate with it over stdin (in which case you need to set stdin, and possibly stdout, to PIPE), as sketched below. However, your program may require some other method of inter-process communication (IPC), such as sockets or shared memory. I am not familiar with the methods on Windows, but if Tagger is a graphical program there is a good chance that you will need to do something more sophisticated.
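Purely as an illustration of the stdin approach, and only if Tagger.exe actually accepts file names on its standard input (an assumption, not something known about this program), it could look roughly like this:

import subprocess

# Start Tagger once and keep it running. ASSUMPTION: it accepts one
# file name per line on stdin; check its documentation first.
tagger = subprocess.Popen(['C:\\Program Files (x86)\\Tagger\\Tagger.exe'],
                          stdin=subprocess.PIPE)

def send_to_tagger(path):
    tagger.stdin.write((path + '\r\n').encode())
    tagger.stdin.flush()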

Python process dies on Ubuntu: what does the error code mean and is there a fix?

I am running multiple copies of the same python script on an Amazon EC2 Ubuntu instance. Each copy in turn launches the same child Python script using the solution proposed here
From time to time some of these child processes die. subprocess.check_output raises an exception and reports the error code -9. I ran the child process directly from the prompt, and after running for some time the process dies with a not-so-detailed message: Killed.
Questions:
What does -9 mean?
How can I find out more about what went wrong? Specifically, my suspicion is that it might be caused by the machine getting overloaded by the several copies of the same script running at the same time. At the same time, the specific child process that I ran directly appears to be dying every time it's launched, directly or not, and more or less at the same moment (i.e. after processing more or less the same amount of input data). Python is not producing any error messages.
Assuming I have no bugs in the Python code, what can I do to try to prevent the crashes?
check_output() accumulates the output from the subprocess in memory. If the process generates enough output, it might be killed by the OOM killer due to the large RAM consumption.
If you don't need the output, you could use check_call() instead and discard the output:
import os
from subprocess import check_call, STDOUT
DEVNULL = open(os.devnull, "r+b")
check_call([command], stdout=DEVNULL, stderr=STDOUT)
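On Python 3.3 and newer, the same thing can be written without opening os.devnull by hand, using the built-in subprocess.DEVNULL constant ('your-command' is a placeholder for the same command as above):

from subprocess import check_call, DEVNULL, STDOUT

# DEVNULL replaces the manual open(os.devnull, "r+b") above
check_call(['your-command', 'arg1'], stdout=DEVNULL, stderr=STDOUT)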
-9 means the process was killed with signal 9 (SIGKILL), which cannot be caught or ignored; the process quits immediately.
For example if you're trying to kill a process you could enter in your terminal:
ps aux | grep processname
or just this to get a list of all processes: ps aux
Once you have the pid of the process you want to terminate, you'd type kill -9 followed by the pid:
kill -9 1234
My memory is a little foggy when it comes to logs, but I'd look around in /var/log/ and see if you find anything, or check dmesg.
As far as preventing crashes in your Python code, have you tried any exception handling?
Exceptions in Python
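As a small, hedged sketch of what that exception handling could look like around check_output (the command line is just a placeholder):

import subprocess

try:
    # placeholder command line; substitute the actual child script
    output = subprocess.check_output(['python', 'child_script.py'])
except subprocess.CalledProcessError as e:
    # On POSIX a negative returncode means the child was killed by that signal;
    # -9 is SIGKILL (e.g. from the OOM killer).
    if e.returncode == -9:
        print('child was killed by SIGKILL, possibly the OOM killer')
    else:
        print('child failed with exit code %d' % e.returncode)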

Starting and Controlling an External Process via STDIN/STDOUT with Python

I need to launch an external process that is to be controlled via messages sent back and forth via stdin and stdout. Using subprocess.Popen I am able to start the process but am unable to control the execution via stdin as I need to.
The flow of what I'm trying to complete is to:
Start the external process
Iterate for some number of steps
Tell the external process to complete the next processing step by writing a new-line character to its stdin
Wait for the external process to signal it has completed the step by writing a new-line character to its stdout
Close the external process's stdin to indicate to the external process that execution has completed.
I have come up with the following so far:
process = subprocess.Popen([PathToProcess], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
for i in xrange(StepsToComplete):
    print "Forcing step # %s" % i
    process.communicate(input='\n')
When I run the above code, the '\n' is not communicated to the external process, and I never get beyond step #0. The code blocks at process.communicate() and does not proceed any further. Am I using the communicate() method incorrectly?
Also how would I implement the "wait until the external process writes a new line" piece of functionality?
process.communicate(input='\n') is wrong. As you'll notice from the Python docs, it writes your string to the child's stdin, then reads all output from the child until the child exits. From docs.python.org:
Popen.communicate(input=None): Interact with process: Send data to stdin. Read data from stdout and stderr, until end-of-file is reached. Wait for process to terminate. The optional input argument should be a string to be sent to the child process, or None, if no data should be sent to the child.
Instead, you want to just write to the stdin of the child. Then read from it in your loop.
Something more like:
process = subprocess.Popen([PathToProcess], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
for i in xrange(StepsToComplete):
    print "Forcing step # %s" % i
    process.stdin.write("\n")
    process.stdin.flush()                 # harmless here; needed if the pipe is buffered
    result = process.stdout.readline()    # blocks until the child writes a full line
This will do something more like what you want.
You could use Twisted, by using reactor.spawnProcess and LineReceiver.
