I am having trouble finding a way to use the output of the Linux sort command as input to my Python script.
For example, I would like to iterate through the result of sort -mk1 <(cat file1.txt) <(cat file2.txt)
Normally I would use Popen and iterate through it using next and stdout.readline(), something like:
import os
import subprocess

class Reader():
    def __init__(self):
        self.proc = subprocess.Popen(['sort -mk1', '<(', 'cat file1.txt', ')', '<(', 'cat file2.txt', ')'], stdout=subprocess.PIPE)

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            line = self.proc.stdout.readline()
            if not line:
                raise StopIteration
            return line
p = Reader()
for line in p:
    # only print certain lines based on some filter
With the above, I would get an error: No such file or directory: 'sort -mk1'
After doing some research, I guess I can't use Popen, and have to use os.execl to utilize /bin/bash.
So now I try below:
import os
import subprocess

class Reader():
    def __init__(self):
        self.proc = os.execl('/bin/bash', '/bin/bash', '-c', 'set -o pipefail; sort -mk1 <(cat file1.txt) <(cat file2.txt)')

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            line = self.proc.stdout.readline()
            if not line:
                raise StopIteration
            return line
p = Reader()
for line in p:
    # only print certain lines based on some filter
The problem with this is that it actually prints all the lines right away. I guess one solution is to pipe the results to a file and then iterate through that file in Python, but saving the output just to filter it seems unnecessary. Yes, I could use other Linux commands such as awk, but I would like to use Python for the further processing.
So questions are:
Is there a way to make the first solution with Popen work?
How can I iterate through the output of sort using the second solution?
If you want to use shell features, you have to use shell=True. If you want to use Bash features, you have to make sure the shell you run is Bash.
self.proc = subprocess.Popen(
    'sort -mk1 <(cat file1.txt) <(cat file2.txt)',
    stdout=subprocess.PIPE,
    shell=True,
    executable='/bin/bash')
Notice how with shell=True the first argument to Popen and friends is a single string (and vice versa: without shell=True, you have to parse the command line into tokens yourself).
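For example, shlex.split() from the standard library performs that tokenization the way a POSIX shell would:

import shlex

args = shlex.split("sort -mk1 'file one.txt' file2.txt")
print(args)  # ['sort', '-mk1', 'file one.txt', 'file2.txt']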
Of course, the cats are useless but if you replace them with something which the shell performs easily and elegantly and which you cannot easily replace with native Python code, this is probably the way to go.
In brief, <(command) is a Bash process substitution; the shell will run command in a subprocess, and replace the argument with the device name of the open file handle where the process generates its output. So sort will see something like
sort -mk1 /dev/fd/63 /dev/fd/64
where /dev/fd/63 is a pipe where the first command's output is available, and /dev/fd/64 is the read end of the other command's standard output.
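Putting it all together, here is a minimal sketch of the Reader class using this approach (assuming file1.txt and file2.txt exist; the filter in the loop is a placeholder):

import subprocess

class Reader:
    def __init__(self):
        self.proc = subprocess.Popen(
            'sort -mk1 <(cat file1.txt) <(cat file2.txt)',
            stdout=subprocess.PIPE,
            shell=True,
            executable='/bin/bash')

    def __iter__(self):
        return self

    def __next__(self):
        line = self.proc.stdout.readline()
        if not line:
            raise StopIteration
        return line

for line in Reader():
    print(line.decode(), end='')  # apply whatever filter you need here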
There are quite a lot of problems in your scripts.
First, your Popen call won't work, for several reasons:
The first element of the argument list is supposed to be the command to run, and you passed sort -mk1; there is no file by that name. You should pass sort as the command, and -mk1 as a separate argument.
Process substitution, <( command ), is handled by the shell, which does something like running the command, creating a FIFO, and substituting the name of the FIFO on the command line. Passing these tokens directly to sort is not going to work; sort will probably just treat <( as a filename.
Your second way using os.exec* won't work either, because os.exec* replaces your current process. Hence it will never continue to the next statement in your Python script.
In your case, there seems to be no reason to use process substitution. Why can't you simply do something like subprocess.Popen(['sort', '-mk1', 'filename1', 'filename2'])?
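For instance, a minimal sketch that merges two pre-sorted files and filters the result line by line, with no shell involved (the file names and the b'foo' filter are placeholders):

import subprocess

proc = subprocess.Popen(['sort', '-mk1', 'file1.txt', 'file2.txt'],
                        stdout=subprocess.PIPE)
for line in proc.stdout:
    if b'foo' in line:  # only print certain lines based on some filter
        print(line.decode(), end='')
proc.wait()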
I do not understand why you are doing sort -mk1 <(cat file); sort can operate on files directly. Look at check_output; it will make your life simpler:
output = subprocess.check_output('ls', universal_newlines=True)
for line in output.splitlines():  # iterate over lines, not characters
    print(line)
You will, of course, have to deal with the exceptions; the documentation has the details.
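For example, a sketch of that error handling (check_output raises CalledProcessError on a non-zero exit status; the file names are placeholders):

import subprocess

try:
    output = subprocess.check_output(['sort', '-mk1', 'file1.txt', 'file2.txt'],
                                     universal_newlines=True)
except subprocess.CalledProcessError as exc:
    print('sort failed with exit status', exc.returncode)
else:
    for line in output.splitlines():
        print(line)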
Related
I am attempting to call a bash script via the subprocess Popen function inside a for loop. My intent is that with each iteration, a new string commit from an array out is passed as an argument to the Popen call. The command invokes a bash script that dumps a text file identified by the variable commit and greps certain lines from that particular text. However, I can't get the output to flush out in the Python for loop. Right now, only the grepped data from the final commit in out is being passed into my final data structure (a pandas DataFrame).
accuracy_dictionary = {}

for commit in out:
    accuracy_dictionary.setdefault(commit, {})
    p2 = subprocess.Popen(['~/Desktop/find_accuracies.sh', commit], encoding='utf-8', shell=True, stdout=subprocess.PIPE)
    outputstring = p2.stdout.read()

    # This part below is less critical to the problem at hand
    # I'm putting the data from each file in a dictionary
    for acc_type_line in outputstring.split('\n'):
        accuracy = acc_type_line.split(': ')
        if accuracy != ['']:
            acc_type = accuracy[0]
            value = accuracy[1]
            accuracy_dictionary[commit][acc_type] = float(value)

acc_data = pd.DataFrame.from_dict(accuracy_dictionary).T
Here is the bash script that is being called:
"find_accuracies.sh":
#!/bin/sh
COMMIT=$1
git show $COMMIT:blahblahfolder/blahblah.txt | grep --line-buffered 'accuracy'
acc_data comes back as a DataFrame with nrows=len(out), indexed by the unique commits, but the value is exactly the same in every row for each acc_type.
How can I call the file "find_accuracies.sh" with the subprocess command and have it flush the unique values of each file for each commit?
I hope this helps address the immediate problem you're seeing: here you should really use communicate with subprocess.PIPE, as it waits for the command to finish and gives you all of its output:
outputstring = p2.communicate()[0]
You can also use convenient method like check_output to the same effect:
outputstring = subprocess.check_output(['~/Desktop/find_accuracies.sh', commit],
                                       encoding='utf-8', shell=True)
Or, in Python 3, using run should also do:
p2 = subprocess.run(['~/Desktop/find_accuracies.sh', commit],
                    encoding='utf-8', shell=True, stdout=subprocess.PIPE)
outputstring = p2.stdout
Now a few more comments, hints, and suggestions:
I am a little surprised it works for you at all, since using shell=True with a list of arguments should (see the paragraph starting with "On POSIX with shell=True" in the docs) make commit an argument of the underlying sh wrapped around your script call, and not of the script itself. In any case, you can (and I would suggest you do) drop the shell and leave the HOME resolution to Python:
from pathlib import Path

executable = Path.home().joinpath('Desktop/find_accuracies.sh')
p2 = subprocess.run([executable, commit],
                    encoding='utf-8', stdout=subprocess.PIPE)
outputstring = p2.stdout
You can (or, for Python < 3.5, must) use os.path.expanduser('~/Desktop/find_accuracies.sh') instead of Path.home() to get the script path. On the other hand, for >= 3.7 you could replace stdout=subprocess.PIPE with capture_output=True.
And last but not least: it seems a bit unnecessary to call a bash script (especially double-wrapped in an sh call like in the original example) just to run git through grep when we already have a Python script to process the information. I would actually try to run the corresponding git command directly, get the bulk of its output, and extract the bits of interest in the Python script itself.
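A rough sketch of that idea, reusing the out array and the path from the question and doing both the grep and the float parsing in Python (it assumes each matching line looks like 'acc_type: value', as the original parsing code does):

import subprocess

accuracy_dictionary = {}
for commit in out:
    result = subprocess.run(
        ['git', 'show', '%s:blahblahfolder/blahblah.txt' % commit],
        stdout=subprocess.PIPE, encoding='utf-8')
    accuracy_dictionary[commit] = {}
    for line in result.stdout.splitlines():
        if 'accuracy' in line:  # what the grep in the script did
            acc_type, value = line.split(': ', 1)
            accuracy_dictionary[commit][acc_type] = float(value)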
I'm trying to write a Python script that starts a subprocess, and writes to the subprocess stdin. I'd also like to be able to determine an action to be taken if the subprocess crashes.
The process I'm trying to start is a program called nuke, which has its own built-in version of Python that I'd like to be able to submit commands to, and then tell it to quit after the commands execute. So far I've worked out that if I start Python from the command prompt and then start nuke as a subprocess, I can type in commands to nuke interactively. But I'd like to be able to put this all in a script so that the master Python program can start nuke, write to its standard input (and thus into its built-in version of Python), and tell it to do snazzy things. So I wrote a script that starts nuke like this:
subprocess.call(["C:/Program Files/Nuke6.3v5/Nuke6.3", "-t", "E:/NukeTest/test.nk"])
Then nothing happens because nuke is waiting for user input. How would I now write to standard input?
I'm doing this because I'm running a plugin with nuke that causes it to crash intermittently when rendering multiple frames. So I'd like this script to be able to start nuke, tell it to do something and then if it crashes, try again. So if there is a way to catch a crash and still be OK then that'd be great.
It might be better to use communicate:
from subprocess import Popen, PIPE, STDOUT
p = Popen(['myapp'], stdout=PIPE, stdin=PIPE, stderr=PIPE)
stdout_data = p.communicate(input='data_to_write')[0]
"Better", because of this warning:
Use communicate() rather than .stdin.write, .stdout.read or .stderr.read to avoid deadlocks due to any of the other OS pipe buffers filling up and blocking the child process.
To clarify some points:
As jro has mentioned, the right way is to use subprocess.communicate.
Yet, when feeding the stdin using subprocess.communicate with input, you need to initiate the subprocess with stdin=subprocess.PIPE according to the docs.
Note that if you want to send data to the process’s stdin, you need to create the Popen object with stdin=PIPE. Similarly, to get anything other than None in the result tuple, you need to give stdout=PIPE and/or stderr=PIPE too.
Also qed has mentioned in the comments that for Python 3.4 you need to encode the string, meaning you need to pass Bytes to the input rather than a string. This is not entirely true. According to the docs, if the streams were opened in text mode, the input should be a string (source is the same page).
If streams were opened in text mode, input must be a string. Otherwise, it must be bytes.
So, if the streams were not opened explicitly in text mode, then something like below should work:
import subprocess
command = ['myapp', '--arg1', 'value_for_arg1']
p = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
output = p.communicate(input='some data'.encode())[0]
I've left the stderr value above deliberately as STDOUT as an example.
That being said, sometimes you might want the output of another process rather than building it up from scratch. Let's say you want to run the equivalent of echo -n 'CATCH\nme' | grep -i catch | wc -m. This should normally return the number of characters in 'CATCH' plus a newline character, which results in 6. The point of the echo here is to feed the CATCH\nme data to grep. So we can feed the data to grep through stdin in the Python subprocess chain as a variable, and then pass the stdout as a PIPE to the wc process's stdin (getting rid of the extra newline character in the meantime):
import subprocess
what_to_catch = 'catch'
what_to_feed = 'CATCH\nme'
# We create the first subprocess, note that we need stdin=PIPE and stdout=PIPE
p1 = subprocess.Popen(['grep', '-i', what_to_catch], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
# We immediately run the first subprocess and get the result
# Note that we encode the data, otherwise we'd get a TypeError
p1_out = p1.communicate(input=what_to_feed.encode())[0]
# Well the result includes an '\n' at the end,
# if we want to get rid of it in a VERY hacky way
p1_out = p1_out.decode().strip().encode()
# We create the second subprocess, note that we need stdin=PIPE
p2 = subprocess.Popen(['wc', '-m'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
# We run the second subprocess feeding it with the first subprocess' output.
# We decode the output to convert to a string
# We still have a '\n', so we strip that out
output = p2.communicate(input=p1_out)[0].decode().strip()
This is somewhat different than the response here, where you pipe two processes directly without adding data directly in Python.
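For reference, a sketch of that direct piping, with every stage a real process and no data fed from Python (printf stands in for echo -n):

import subprocess

p1 = subprocess.Popen(['printf', 'CATCH\nme'], stdout=subprocess.PIPE)
p2 = subprocess.Popen(['grep', '-i', 'catch'],
                      stdin=p1.stdout, stdout=subprocess.PIPE)
p1.stdout.close()  # let p1 receive SIGPIPE if p2 exits early
p3 = subprocess.Popen(['wc', '-m'], stdin=p2.stdout, stdout=subprocess.PIPE)
p2.stdout.close()
output = p3.communicate()[0].decode().strip()
print(output)  # 6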
Hope that helps someone out.
Since Python 3.5, there is the subprocess.run() function, which provides a convenient way to initialize and interact with Popen() objects. run() takes an optional input argument, through which you can pass things to stdin (like you would using Popen.communicate(), but all in one go).
Adapting jro's example to use run() would look like:
import subprocess
p = subprocess.run(['myapp'], input='data_to_write', capture_output=True, text=True)
After execution, p will be a CompletedProcess object. By setting capture_output to True, we make available a p.stdout attribute which gives us access to the output, if we care about it. text=True tells it to work with regular strings rather than bytes. If you want, you might also add the argument check=True to make it throw an error if the exit status (accessible regardless via p.returncode) isn't 0.
This is the "modern"/quick-and-easy way to do this.
One can write data to the subprocess object on-the-fly, instead of collecting all the input in a string beforehand to pass through the communicate() method.
This example sends a list of animals names to the Unix utility sort, and sends the output to standard output.
import sys, subprocess
p = subprocess.Popen('sort', stdin=subprocess.PIPE, stdout=sys.stdout)
for v in ('dog', 'cat', 'mouse', 'cow', 'mule', 'chicken', 'bear', 'robin'):
    p.stdin.write(v.encode() + b'\n')
p.communicate()
Note that writing to the process is done via p.stdin.write(v.encode()). I tried using print(v.encode(), file=p.stdin), but that failed with the message TypeError: a bytes-like object is required, not 'str'. I haven't figured out how to get print() to work with this.
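One approach that should make print() usable (my addition, not part of the original answer): open the pipe in text mode with universal_newlines=True (spelled text=True on Python 3.7+), so that p.stdin accepts str:

import sys, subprocess

p = subprocess.Popen('sort', stdin=subprocess.PIPE, stdout=sys.stdout,
                     universal_newlines=True)
for v in ('dog', 'cat', 'mouse', 'cow', 'mule', 'chicken', 'bear', 'robin'):
    print(v, file=p.stdin)  # the text-mode pipe accepts str; print() adds the '\n'
p.communicate()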
You can provide a file-like object to the stdin argument of subprocess.call().
The documentation for the Popen object applies here.
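For example, a minimal sketch (input.txt and myapp are placeholders):

import subprocess

# feed the contents of an existing file to the child's standard input
with open('input.txt', 'rb') as infile:
    subprocess.call(['myapp'], stdin=infile)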
To capture the output, you should instead use subprocess.check_output(), which takes similar arguments. From the documentation:
>>> subprocess.check_output(
... "ls non_existent_file; exit 0",
... stderr=subprocess.STDOUT,
... shell=True)
'ls: non_existent_file: No such file or directory\n'
I am trying to use subprocess.Popen to run 'cat test.txt | grep txt', but it's not working. In my code I call subprocess.Popen twice:
1: First I use it to run a tshark command which redirects the command output to a text file (test.txt). This works fine. (Defined in the function get_all_tshark_out in the code below.)
2: Then I use subprocess.Popen to run 'cat test.txt | grep txt' to extract lines from this file to perform some validation. This didn't work for me. (Defined in the function get_uniq_sessions in the code below.)
To make sure it's not because of a buffer overflow, I am also flushing stdout, but that didn't help. Below is my code:
import sys
import subprocess
import logging

def get_all_tshark_out(logger, tcpdump, port):
    command = """tshark -r "%s" -odiameter.tcp.ports:"%s" -R 'diameter.cmd.code == 272 and diameter.flags.request==0 and !tcp.analysis.retransmission and diameter.flags.T == 0' -Tpdml -Tfields -ediameter.Session-Id | sort > test.txt""" % (tcpdump, port)
    p_out = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
    sys.stdout.flush()
    sys.stderr.flush()
    return 1

def get_uniq_sessions(logger, id='1234', uniqlog_file='test.txt'):
    command = "cat " + uniqlog_file + " | grep " + id
    print command
    p_out = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
    print "PPPPP", p_out
    output = p_out.stdout.read()
    p_out.wait()
    command_out_list = output.strip().split("\n")
    sys.stdout.flush()
    print "%%%", output, p_out.stderr.read()
    print len(command_out_list)
    if p_out.stderr.read():
        logger.error("\"%s\" Error happened while trying to execute \"%s\"" % (p_out.stderr.read().strip(), command))
        sys.exit(1)
    elif command_out_list[0] == '' and len(command_out_list) == 1:
        logger.error("No Sessions belongs to %s campaign ID please provide proper input as Campaign ID" % id)
        sys.exit(1)
    else:
        return command_out_list
How do I fix this?
TL;DR Both of your subprocess.Popen() calls are broken; use one of the wrapper methods in subprocess instead, and/or use Python's built-in facilities instead of external tools.
Is there a particular reason you use a useless use of cat? Just subprocess.Popen(['grep', id, uniqlog_file]) would be much simpler, and not require shell=True -- but of course, Python by itself is excellent at reading files and checking whether each line contains a string.
def get_uniq_sessions(logger, id='1234', uniqlog_file='test.txt'):
    matches = []
    with open(uniqlog_file, 'r') as handle:
        for line in handle:
            if id in line:
                matches.append(line)
    return matches
Your functions should probably not call sys.exit(); instead, raise an exception, or just return None -- that way, the calling code can decide how to handle errors and exceptions.
Your remaining subprocess.Popen() only coincidentally works as long as there is a limited amount of output. You should probably use subprocess.call instead, which exists precisely for the purpose of running a subprocess under controlled conditions while checking for errors.
The key observation here is that Popen() itself merely spawns the subprocess. You need to interact with it and wait() for it in order to make sure it succeeds and returns all its output. The call and various check_* methods in the subprocess module do this for you; Popen() is useful mainly when you outgrow the facilities of those canned wrappers, but also somewhat more tricky to get right, especially the first time.
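For instance, a sketch of the grep step as a single checked call (note that grep exits with a non-zero status when nothing matches, which check_call reports as an error):

import subprocess

# raises CalledProcessError instead of silently carrying on
subprocess.check_call(['grep', '1234', 'test.txt'])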
The tshark command would not require a shell=True if you picked it apart into a list yourself, and did the sorting and writing to a file in Python. If the sole purpose of the output file is to open it again from Python, I would recommend reading the raw output into a Python string and doing all the remaining processing in Python.
def get_all_tshark_out(logger, tcpdump, port):
    output = subprocess.check_output(['tshark', '-r', str(tcpdump),
        '-odiameter.tcp.ports:{0}'.format(port), '-R',
        'diameter.cmd.code == 272 and diameter.flags.request==0 '
        'and !tcp.analysis.retransmission and diameter.flags.T == 0',
        '-Tpdml', '-Tfields', '-ediameter.Session-Id'])
    # sort the individual lines, not the characters of the raw string
    return sorted(output.splitlines())
... and now your get_uniq_sessions function is basically a one-liner:
session = [x for x in get_all_tshark_out() if '1234' in x]
import os
test = os.system("ls /etc/init.d/ | grep jboss- | grep -vw jboss-")
for row in test:
    print row
For some reason this gives a TypeError: iteration over non-sequence error.
When I do a print test without the for loop, it gives a list of the jboss instances, plus a "0" at the bottom. The heck?
os.system() returns the exit code of the process, not the result of the grep commands. This is always an integer. In the meantime, the output of the process itself is not redirected, so it writes to stdout directly (bypassing Python).
You cannot iterate over an integer.
You should use the subprocess.check_output() function instead if you wanted to retrieve the stdout output of the command.
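A minimal sketch with check_output, keeping the same pipeline (check_output raises CalledProcessError if the last grep matches nothing, since that makes it exit non-zero):

import subprocess

test = subprocess.check_output(
    "ls /etc/init.d/ | grep jboss- | grep -vw jboss-", shell=True)
for row in test.decode().splitlines():
    print(row)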
In this case, you'd be better off using os.listdir() and code the whole search in Python instead:
for filename in os.listdir('/etc/init.d/'):
    if 'jboss-' in filename and not filename.startswith('jboss-'):
        print filename
I've interpreted the grep -vw jboss- command as filtering out filenames that start with jboss; adjust as needed.
The problem is that os.system returns the exit code. If you want to capture the output, you can use subprocess.Popen:
import subprocess

p = subprocess.Popen("ls", stdout=subprocess.PIPE)  # no trailing comma here, which would make p a tuple
out, err = p.communicate()
files = out.split('\n')
Also note that the use of the subprocess module is encouraged:
The subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using this [os.system] function.
If you do not have to resort to the shell, a pure Python solution, as @Martijn Pieters suggests, seems preferable.
How do I perform logging of all activities that are done by a Python script and all scripts that are called from it?
I had several Bash scripts but have now written a Python script which calls all of these Bash scripts. I would like to have all output produced by these scripts stored in some file.
The script is an interactive Python script, i.e. it contains raw_input lines, so I couldn't just do python script.py | tee log.txt for the overall Python script, since for some reason the questions are then not shown on the screen.
Here is an excerpt from the script which calls one of the shell scripts.
cmd = "somescript.sh"
try:
    retvalue = subprocess.check_call(cmd, shell=True)
except subprocess.CalledProcessError:
    print("script command has failed")
    sys.exit("exit from script")
What do you think could be done here?
Edit
Two subquestions based on Alex's answer:
How do I get the answers to the questions stored in the output file as well? For example, on the line ok = raw_input(prompt) the user will be asked a question, and I would like the answer logged as well.
I read about Popen and communicate and didn't use them since communicate buffers the data in memory. Here the amount of output is big, and I need to take care of standard error along with standard output. Do you know if this is possible to handle with the Popen and communicate method as well?
Making Python's own prints go to both the terminal and a file is not hard:
>>> import sys
>>> class tee(object):
... def __init__(self, fn='/tmp/foo.txt'):
... self.o = sys.stdout
... self.f = open(fn, 'w')
... def write(self, s):
... self.o.write(s)
... self.f.write(s)
...
>>> sys.stdout = tee()
>>> print('hello world!')
hello world!
>>>
$ cat /tmp/foo.txt
hello world!
This should work both in Python 2 and Python 3.
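One caveat (my own note, not part of the original answer): code that calls sys.stdout.flush() will fail on this minimal tee; forwarding the call fixes that:

# add inside the tee class shown above
def flush(self):
    self.o.flush()
    self.f.flush()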
To similarly direct the output from subcommands, don't use
retvalue = subprocess.check_call(cmd, shell=True)
which lets cmd's output go to its regular "standard output", but rather grab and re-emit it yourself, as follows:
p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
so, se = p.communicate()
print(so)
retvalue = p.returncode
assuming you don't care about standard-error (only standard-output) and the amount of output from cmd is reasonably small (since .communicate buffers that data in memory) -- it's easy to tweak if either assumption doesn't correspond to what you exactly want.
Edit: the OP has now clarified the specs in a long comment to this answer:
How do I get the answers to the questions stored in the output file as well? For example, on the line ok = raw_input(prompt) the user will be asked a question, and I would like the answer logged as well.
Use a function such as:
def echoed_input(prompt):
    response = raw_input(prompt)
    sys.stdout.f.write(response + '\n')  # raw_input strips the newline, so add it back for the log
    return response
instead of just raw_input in your application code (of course, this is written specifically to cooperate with the tee class I showed above).
I read about Popen and communicate and didn't use them since they buffer the data in memory. Here the amount of output is big and I need to take care of standard error along with standard output. Do you know if this is possible to handle with the Popen and communicate method as well?
communicate is fine as long as you don't get more output (and standard-error) than comfortably fits in memory, say a few gigabytes at most depending on the kind of machine you're using.
If this hypothesis is met, just recode the above as, instead:
p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,
                     stderr=subprocess.STDOUT)
so, se = p.communicate()
print(so)
retvalue = p.returncode
i.e., just redirect the subcommand's stderr to get mixed into its stdout.
If you DO have to worry about gigabytes (or whatever) coming at you, then
p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,
                     stderr=subprocess.STDOUT)
for line in p.stdout:
    sys.stdout.write(line)
p.wait()
retvalue = p.returncode
(which gets and emits one line at a time) may be preferable (this depends on cmd not expecting anything from its standard input, of course... because, if it is expecting anything, it's not going to get it, and the problem starts to become challenging;-).
Python has a tracing module: trace. Usage: python -m trace --trace file.py
If you want to capture the output of any script, then on a *nix-y system you can redirect stdout and stderr to a file:
./script.py >> /tmp/outputs.txt 2>> /tmp/outputs.txt
If you want everything done by the scripts, not just what they print, then the python trace module won't trace things done by external scripts that your python executes. The only thing that can trace every action done by a program would be something like DTrace, if you are lucky enough to have a system that supports it. (OS X Instruments are based on DTrace)