grep: write error: Broken pipe with subprocess - python

I get a couple of "grep: write error" messages when I run this code.
What am I missing?
This is only part of it:
while d <= datetime.datetime(year, month, daysInMonth[month]):
    day = d.strftime("%Y%m%d")
    print day
    results = [day]
    first=subprocess.Popen("grep -Eliw 'Algeria|Bahrain' "+ monthDir +"/"+day+"*.txt | grep -Eliw 'Protest|protesters' "+ monthDir +"/"+day+"*.txt", shell=True, stdout=subprocess.PIPE, )
    output1=first.communicate()[0]
    d += delta
    day = d.strftime("%Y%m%d")
    second=subprocess.Popen("grep -Eliw 'Algeria|Bahrain' "+ monthDir +"/"+day+"*.txt | grep -Eliw 'Protest|protesters' "+ monthDir +"/"+day+"*.txt", shell=True, stdout=subprocess.PIPE, )
    output2=second.communicate()[0]
    articleList = (output1.split('\n'))
    articleList2 = (output2.split('\n'))
    results.append( len(articleList)+len(articleList2))
    w.writerow(tuple(results))
    d += delta

When you do
A | B
in a shell, process A's output is piped into process B as input. If process B shuts down before reading all of process A's output (e.g. because it found what it was looking for, which is the function of the -l option), then process A may complain that its output pipe was prematurely closed.
These errors are basically harmless, and you can work around them by redirecting stderr in the subprocesses to /dev/null.
A better approach, though, may simply be to use Python's powerful regex capabilities to read the files:
import glob
import re
def fileContains(fn, pat):
    # Return True as soon as any line of the file matches the pattern.
    with open(fn) as f:
        for line in f:
            if re.search(pat, line):
                return True
    return False
first = []
for file in glob.glob(monthDir +"/"+day+"*.txt"):
    if fileContains(file, 'Algeria|Bahrain') and fileContains(file, 'Protest|protesters'):
        first.append(file)
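The grep commands in the question also use -i (ignore case) and -w (whole words), which a plain re.search does not reproduce. A minimal sketch of one way to approximate them, reading each file only once; the helper names compile_words and file_matches_all are made up for illustration, and \b word boundaries are only roughly equivalent to grep's -w:

```python
import re


def compile_words(alternatives):
    """Build a whole-word, case-insensitive pattern, roughly like grep -Eiw."""
    return re.compile(r"\b(?:%s)\b" % alternatives, re.IGNORECASE)


def file_matches_all(path, patterns):
    """Return True if every compiled pattern matches somewhere in the file."""
    pending = list(patterns)
    with open(path, errors="replace") as handle:
        for line in handle:
            # Drop each pattern as soon as some line satisfies it.
            pending = [p for p in pending if not p.search(line)]
            if not pending:
                return True  # stop reading early, like grep -l
    return not pending


# Hypothetical usage with the variables from the question:
# patterns = [compile_words("Algeria|Bahrain"), compile_words("Protest|protesters")]
# first = [fn for fn in glob.glob(monthDir + "/" + day + "*.txt")
#          if file_matches_all(fn, patterns)]
```

Because nothing is piped between processes, no "Broken pipe" message can occur with this approach.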

To find the files matching two patterns, the command structure should be:
grep -l pattern1 $(grep -l pattern2 files)
$(command) substitutes the output of the command into the command line.
So your script should be:
first=subprocess.Popen("grep -Eliw 'Algeria|Bahrain' $(grep -Eliw 'Protest|protesters' "+ monthDir +"/"+day+"*.txt)", shell=True, stdout=subprocess.PIPE, )
and similarly for second

If you are just looking for whole words, you could use the count() member function:
# assuming names is a list of filenames
for fn in names:
    with open(fn) as infile:
        text = infile.read().lower()
    # remove punctuation
    text = text.replace(',', '')
    text = text.replace('.', '')
    words = text.split()
    print "Algeria:", words.count('algeria')
    print "Bahrain:", words.count('bahrain')
    print "protesters:", words.count('protesters')
    print "protest:", words.count('protest')
If you want more powerful filtering, use re.
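As a rough illustration of the re-based variant, the sketch below tokenizes with a regular expression instead of stripping individual punctuation characters, then counts with collections.Counter. The function name count_words and the token pattern are assumptions, not something from the answer above:

```python
import re
from collections import Counter


def count_words(text, wanted):
    """Count whole-word, case-insensitive occurrences of each wanted word."""
    # Treat any run of letters or apostrophes as a word; everything else
    # (commas, full stops, quotes, ...) acts as a separator.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return {word: counts[word] for word in wanted}
```

For example, count_words(infile.read(), ["algeria", "bahrain", "protest", "protesters"]) would give one count per word for a single file.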

Add a stderr argument to the Popen call. Which values are available depends on the Python version; subprocess.STDOUT, used here, also works on Python versions below 3. Note that it merges the error messages into the captured output rather than discarding them.
first=subprocess.Popen("grep -Eliw 'Algeria|Bahrain' "+ monthDir +"/"+day+"*.txt | grep -Eliw 'Protest|protesters' "+ monthDir +"/"+day+"*.txt", shell=True, stdout=subprocess.PIPE, stderr = subprocess.STDOUT)


Python - Reading Powershell script output, using subprocess, want to receive stdout line by line

I've been trying to read stdout from a PowerShell script that runs for a little while, generating output based on the number of computers it's pinging.
I want the data to stream into the text box, but with everything I've tried, I only seem to be able to get all the output at once.
I've been trying not to use subprocess.communicate(), as it also seems to give all the output at once.
Here is the Code:
from tkinter import *
import os
from subprocess import Popen, PIPE
window = Tk()
window.title( 'PowerShell Script Temp' )
frame = Frame(window)
fldrPath = r"C:/Users/firstname.lastname/Downloads/Powershell Development/Monthly Scans/"
listbox = Listbox(frame)
listbox.configure(width=50)
for name in os.listdir(fldrPath):
    listbox.insert('end', name)
def selection():
    fileList = listbox.curselection()
    for file in fileList:
        os.chdir(fldrPath)
        # Right here is the problematic section
        with Popen(["powershell.exe", '-File', fldrPath + '\\' + listbox.get(file)], stdout=PIPE, bufsize=1,
                   universal_newlines=True) as p:
            for line in p.stdout:
                output.insert('end', line)
                print(line, end='')
output = Text(window, width=75, height=6, wrap=WORD, background="white")
btn = Button(frame, text='Run Script', command=selection)
btn.pack(side=RIGHT, padx=5)
listbox.pack(side=LEFT)
frame.pack(padx=30, pady=30)
output.pack(fill=BOTH, expand=1, padx=5, pady=5)
window.mainloop()
It doesn't stream line by line; it just dumps everything after the PS script has finished.
I've tried quite a few other things:
Attempt 1
proc = subprocess.Popen(["powershell.exe", '-File', fldrPath + '\\' + listbox.get(file)], shell=True, stdout=PIPE)
while proc.poll() is None:
    data = proc.stdout.readline() # Alternatively proc.stdout.read(1024)
    print(data)
    text_box.insert('end', data)
Attempt 2
outpipe = subprocess.Popen(["powershell.exe", '-File', fldrPath + '\\' + listbox.get(file)],
                           stdin=subprocess.PIPE,
                           stdout=subprocess.PIPE,
                           stderr=subprocess.PIPE).communicate()[0]
text_box.insert("end", outpipe)
Attempt 3
with Popen(["powershell.exe", '-File', fldrPath + '\\' + listbox.get(file)], stdout=PIPE, bufsize=1,
           universal_newlines=True) as p, StringIO() as buf:
    for line in p.stdout:
        print(line, end='')
        buf.write(line)
    outpipe = buf.getvalue()
    output.insert('end', outpipe)
Attempt 4
proc = subprocess.Popen(["powershell.exe", '-File', fldrPath + '\\' + listbox.get(file)], shell=True,
                        stdout=subprocess.PIPE)
while proc.poll() is None:
    p = proc.stdout.readline()
    output.insert('end', p)
I've actually attempted far more, but can't seem to remember them. Lots of fors and withs...I'm a bit tired of it at this point 😔. I can get this thing to print() the way I want, but not to stream into the text box; it's almost as if .insert() is too slow.
If someone could point me in the right direction, I'd greatly appreciate it. I've tried almost every similar thread on here that I could find, and nothing seems to do the trick.
Trying to get the data to stream into the text box, but after all I've
tried, I only seem to be able to get it to give all the output at
once.
This code here:
def selection():
    fileList = listbox.curselection()
    for file in fileList:
        os.chdir(fldrPath)
        # Right here is the problematic section
        with Popen(["powershell.exe", '-File', fldrPath + '\\' + listbox.get(file)], stdout=PIPE, bufsize=1,
                   universal_newlines=True) as p:
            for line in p.stdout:
                output.insert('end', line)
                print(line, end='')
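runs to completion inside a single callback, so the mainloop cannot redraw the window until it has finished; only then do you see what it has done. To update the text box while the script is still running, you could use the after method of tkinter and check for new output at a given interval.
Like in this example here.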
The tkinter mainloop works like this
    Start
      |
      |<----------------------------------------------------------+
      v                                                           ^
Do I have    No[*]   Calculate how             Sleep for at       |
work to do?  ----->  long I may sleep  ----->  most that much --->|
      |                                        time               |
      | Yes                                                       |
      |                                                           |
      v                                                           |
Do one callback                                                   |
      |                                                           |
      +-----------------------------------------------------------+
Another suggestion would be to use another thread.
Also you should read that answer here for .communicate().
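Combining the two suggestions (a worker thread plus after) could look roughly like the sketch below. It is only an outline under some assumptions: the helper names stream_lines, start_stream, drain and poll are invented here, widget stands for the Text widget called output in the question, and the 100 ms polling interval is arbitrary. The thread only reads the pipe; all widget updates stay in the main thread.

```python
import queue
import subprocess
import threading


def stream_lines(cmd, out_queue):
    """Run cmd and push each stdout line onto out_queue as it arrives."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, bufsize=1,
                          universal_newlines=True) as proc:
        for line in proc.stdout:
            out_queue.put(line)
    out_queue.put(None)  # sentinel: the process has finished


def start_stream(cmd):
    """Start the reader in a background thread and return its queue."""
    out_queue = queue.Queue()
    threading.Thread(target=stream_lines, args=(cmd, out_queue),
                     daemon=True).start()
    return out_queue


def drain(out_queue, handle_line):
    """Hand over all queued lines; return False once the sentinel is seen."""
    while True:
        try:
            line = out_queue.get_nowait()
        except queue.Empty:
            return True  # nothing more for now, keep polling
        if line is None:
            return False
        handle_line(line)


def poll(widget, out_queue):
    """Tk-side glue: insert pending lines, then reschedule via after()."""
    if drain(out_queue, lambda line: widget.insert('end', line)):
        widget.after(100, poll, widget, out_queue)
```

Inside selection() this would be used as q = start_stream(["powershell.exe", "-File", path]) followed by poll(output, q), so the callback returns immediately and the mainloop stays free to redraw. The script itself may still buffer its output, which this sketch cannot influence.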
Please try the following:
import os
command1 = "please type the command whose output you want to capture"
output1 = os.popen(command1).read()
print(output1)
I'm not sure, but I hope it works out for you.

Custom Popen.communicate method gives wrong output

Let's start by considering this code:
proc_stdin.py
import sys
if __name__ == '__main__':
    for i, line in enumerate(sys.stdin):
        sys.stdout.write(line)
test.py
import subprocess
def run_bad(target, input=None):
    proc = subprocess.Popen(
        target,
        universal_newlines=True,
        shell=True,
        stderr=subprocess.STDOUT,
        stdin=subprocess.PIPE if input else subprocess.DEVNULL,
        stdout=subprocess.PIPE,
    )
    if input:
        proc.stdin.write(input)
        proc.stdin.flush()
        proc.stdin.close()
    lines = []
    for line in iter(proc.stdout.readline, ""):
        line = line.rstrip("\n")
        lines.append(line)
    proc.stdout.close()
    ret_code = proc.wait()
    return "\n".join(lines)
def run_good(target, input):
    return subprocess.Popen(
        target,
        universal_newlines=True,
        shell=True,
        stderr=subprocess.STDOUT,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    ).communicate(input=input)[0]
if __name__ == '__main__':
    lst = [
        "",
        "token1",
        "token1\n",
        "token1\r\n",
        "token1\n\n",
        "token1\r\n\ntoken2",
        "token1 token2",
        "token1\ntoken2",
        "token1\r\ntoken2",
        "token1\n\ntoken2",
        "token1\r\n\ntoken2",
        "token1 \ntoken2\ntoken2\n"
    ]
    cmd = "python proc_stdin.py"
    for inp in lst:
        a, b = run_bad(cmd, inp), run_good(cmd, inp)
        if a != b:
            print("Error: {} vs {}".format(repr(a), repr(b)))
        else:
            print("ok: {}".format(repr(a)))
Output:
ok: ''
ok: 'token1'
Error: 'token1' vs 'token1\n'
Error: 'token1\n' vs 'token1\n\n'
Error: 'token1\n' vs 'token1\n\n'
ok: 'token1\n\n\ntoken2'
ok: 'token1 token2'
ok: 'token1\ntoken2'
ok: 'token1\n\ntoken2'
ok: 'token1\n\ntoken2'
ok: 'token1\n\n\ntoken2'
Error: 'token1 \ntoken2\ntoken2' vs 'token1 \ntoken2\ntoken2\n'
My question is: why is the output of run_bad and run_good not equal in all cases? How would you change the run_bad function so that its output equals that of run_good?
You may also wonder why I'm not using Popen.communicate directly for this particular case, or other helpers from the subprocess module. Well, in the real-world case I'm creating a plugin for SublimeText3, which forces me to stick to python3.3 (so I can't use many of the modern subprocess goodies), plus I'd like to inject some callbacks while reading the lines from stdout, and that's something I can't do with the Popen.communicate method (as far as I know).
Thanks in advance.
If you strip newlines from every line and then add them back between the lines, what happens to the last newline (if any)? (There’s no final, empty line after a final newline because your iter discards it.) This is why Python’s readline (or line iteration) function includes the newlines: they’re necessary to represent the end of the file accurately.
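One way to act on this is to leave each line exactly as readline returns it and join the pieces with an empty string. The sketch below is an assumption about what a fixed version might look like, not the asker's final code; the name run_lines and the on_line callback parameter are invented to show where per-line callbacks could be injected. It uses only calls available in Python 3.3.

```python
import subprocess


def run_lines(target, input=None, on_line=None):
    """Like run_bad, but keeps each line's newline so nothing is lost."""
    proc = subprocess.Popen(
        target,
        universal_newlines=True,
        shell=True,
        stderr=subprocess.STDOUT,
        stdin=subprocess.PIPE if input else subprocess.DEVNULL,
        stdout=subprocess.PIPE,
    )
    if input:
        proc.stdin.write(input)
        proc.stdin.close()  # closing flushes and signals end of input
    chunks = []
    for line in iter(proc.stdout.readline, ""):
        if on_line is not None:
            on_line(line)  # hypothetical hook for per-line callbacks
        chunks.append(line)  # newline (if any) is still attached
    proc.stdout.close()
    proc.wait()
    return "".join(chunks)
```

Like run_bad, this writes all of the input before reading any output, so a child that produces a lot of output for a large input could fill the pipe and block; communicate() avoids that by handling both directions at once.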

How to solve a subprocess that contains a '|'

This code does not work.
I wrote it like this:
str = "curl -s 'URL_ADDRESS' | tail -1".split()
p = subprocess.Popen(str,stdout=subprocess.PIPE).stdout
data = p.read()
p.close()
print(data)
But the result is b''.
What's the problem with this?
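Without shell=True there is no shell to interpret '|', so the '|', 'tail' and '-1' are passed to curl as plain arguments. If you use subprocess, connect two processes yourself instead of using '|', like this.
This will solve the problem.
# pass the URL without shell quotes: no shell is involved, so the quotes would become part of the argument
curl = ["curl", "-s", "URL_ADDRESS"]
tail = "tail -1".split()
temp = subprocess.Popen(curl, stdout=subprocess.PIPE).stdout
temp1 = subprocess.Popen(tail, stdin=temp, stdout=subprocess.PIPE).stdout
temp.close()
data = temp1.read()
temp1.close()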
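The same wiring can be wrapped in a small helper that also waits for both processes. This is a sketch only; the function name pipe_two is hypothetical, and it simply follows the "replacing shell pipeline" pattern of keeping the Popen objects around instead of only their stdout handles:

```python
import subprocess


def pipe_two(producer_cmd, consumer_cmd):
    """Run producer_cmd | consumer_cmd without a shell; return the output."""
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(consumer_cmd, stdin=producer.stdout,
                                stdout=subprocess.PIPE)
    # Close our copy so the producer sees a broken pipe if the consumer
    # exits early, as it would in a real shell pipeline.
    producer.stdout.close()
    output = consumer.communicate()[0]
    producer.wait()
    return output
```

With the commands from the question this would be called as pipe_two(["curl", "-s", "URL_ADDRESS"], ["tail", "-1"]), and the result is a bytes object.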

Comparing two files and removing all whitespaces

Is there a more elegant way of comparing these two files?
Right now I am getting the following error message: syntax error near unexpected token (... diff <( tr -d ' '.
result = Popen("diff <( tr -d ' \n' <" + file1 + ") <( tr -d ' \n' <"
+ file2 + ") | wc =l", shell=True, stdout=PIPE).stdout.read()
Python seems to read "\n" as a literal character.
The constructs you are using are interpreted by bash and do not form a standalone statement that you can pass to system() or exec().
<( ${CMD} )
< ${FILE}
${CMD1} | ${CMD2}
As such, you will need to wire up the redirection and pipelines yourself, or call on bash to interpret the line for you (as @wizzwizz4 suggests).
A better solution would be to use something like difflib that will perform this internally to your process rather than calling on system() / fork() / exec().
Using difflib.unified_diff will give you a similar result:
import difflib
def read_file_no_blanks(filename):
    with open(filename, 'r') as f:
        lines = f.readlines()
    for line in lines:
        if line == '\n':
            continue
        yield line
def count_differences(diff_lines):
    diff_count = 0
    for line in diff_lines:
        if line[0] not in [ '-', '+' ]:
            continue
        if line[0:3] in [ '---', '+++' ]:
            continue
        diff_count += 1
    return diff_count
a_lines = list(read_file_no_blanks('a'))
b_lines = list(read_file_no_blanks('b'))
diff_lines = difflib.unified_diff(a_lines, b_lines)
diff_count = count_differences(diff_lines)
print('differences: %d' % ( diff_count ))
This fails because you are using bash syntax, while shell=True runs the command through /bin/sh, in the same way as the C system() call; /bin/sh is not necessarily bash.
If you wish to do it this way, either write a shell script or use the following (note that shell=True is dropped, since bash is now invoked explicitly, and that wc takes -l, not =l):
result = Popen(['bash', '-c',
                "diff <( tr -d ' \n' <" + file1 + ") <( tr -d ' \n' <"
                + file2 + ") | wc -l"], stdout=PIPE).stdout.read()
This is not an elegant solution, however, since it is relying on the GNU coreutils and bash. A more elegant solution would be pure Python. You could do this with the difflib module and the re module.
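A pure-Python version can be quite short if the goal is only to know whether the two files are equal once whitespace is ignored. The sketch below makes that assumption; the function names are invented, and unlike tr -d ' \n' it drops every kind of whitespace (tabs and carriage returns too), which may or may not be what you want.

```python
def strip_whitespace(path):
    """Return the file's text with all whitespace characters removed."""
    with open(path) as handle:
        # str.split() with no arguments splits on any run of whitespace.
        return "".join(handle.read().split())


def same_ignoring_whitespace(path_a, path_b):
    """True if both files contain the same non-whitespace characters."""
    return strip_whitespace(path_a) == strip_whitespace(path_b)
```

same_ignoring_whitespace(file1, file2) then plays the role of checking whether the diff ... | wc -l pipeline printed 0.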

subprocess open file and pipe awk command [duplicate]

This is my input file format:
#SRR2056440.1 1 length=100
TGTAGGTCTGAGCAGCTTGTCCTGGCTGTGTCCATGTCAGAGCAACGGCCCAAGTCTGGGTCTGGGGGGGAAGGTGTCATGGAGCCCCCTACGATTCCCA
+SRR2056440.1 1 length=100
BCBFFFEFHHHHHJJJJJJIJJJJJJJJIJHHIJJIIJJJJJIJJIJJJJJJJJFHIJJJHHHHHHFDDDBDDD>>ACDEDDDDDDDDDDDDDDDDDEDD
#SRR2056440.2 2 length=100
CTGCCGCCACCGCAGCAGCCACAGGCAGAGGAGGACGAGGACGACTGGGAATCGTAGGGGGCTCCATGACACCTTCCCCCCCAGACCCAGACTTGGGCCA
+SRR2056440.2 2 length=100
CCCFFFFFHHHHHJJJJJJJJJJJIJIJIGJGGIGGJIJJEHFEDDDDDDDDDDABDDDDDDDDDDDDDDADDDDDDDDDDDCDDDDDDBBDDCDDBDD#
#SRR2056440.3 3 length=100
TCTGCCGCCACCGCAGCAGCCACAGGCAGAGGAGGACGAGGACGACTGGGAATCGTAGGGGGCTCCATGACACCTTCCCCCCCAGACCCAGACTTGGGCC
+SRR2056440.3 3 length=100
CCCFFFFFHGHHHJJJJJIJJJJJJIJJIJJJIJJIIIGIJ<CDBCDDDDDDDDDDDDDDDDDDDDDDDDDDDDDCDDDDDDDDDDDDDDDDDDCDCBDD
This is the command I want to execute:
cat input.fq | awk 'NR%4==2{sum+=length($0);nr++;sumsq+=length($0)*length($0)}END{printf"%.1f\t%.1f\n",sum/nr,sqrt(sumsq/nr-(sum/nr)**2)}'
And the output of the command:
100.0 0.0
I want to execute that command inside a Python script using subprocess. I have made several attempts but I can't figure it out. This is my latest try:
awk_comm = r"""'NR%4==2{sum+=length($0);nr++;sumsq+=length($0)*length($0)}END{printf"%.1f\t%.1f\n",sum/nr,sqrt(sumsq/nr-(sum/nr)**2)}'"""
cmd = ['cat', 'input.fq', '|', 'awk', awk_comm]
p2 = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
out1, err = p2.communicate()
EDIT:
I can't see any error in the output. It gets stuck, running forever.
The following works for me.
>>> awk_comm = r"""cat input.fq | awk 'NR%4==2{sum+=length($0);nr++;sumsq+=length($0)*length($0)}END{printf"%.1f\t%.1f\n",sum/nr,sqrt(sumsq/nr-(sum/nr)**2)}'"""
>>> p2 = subprocess.Popen(awk_comm, stdout=subprocess.PIPE,shell=True)
>>> res = p2.communicate()
>>> res
('100.0\t0.0\n', None)
There's no point to shell=True here. Just set up your subprocess.Popen object to do everything you'd otherwise use the shell for:
# the original awk code, with whitespace added for readability
awk_command = r"""
NR%4==2 {
  sum+=length($0);
  nr++;
  sumsq+=length($0)*length($0)
}
END {
  printf "%.1f\t%.1f\n", sum/nr, sqrt(sumsq/nr-(sum/nr)**2)
}
"""
p2 = subprocess.Popen(
    ['awk', awk_command],
    stdin=open('input.fq', 'r'),  # pass a file handle to input.fq directly on awk's stdin
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE)
out1, err = p2.communicate()
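By default, Python doesn't use the shell to run commands, but pipes are evaluated by the shell, so you need shell=True. With shell=True the command must also be passed as a single string: on POSIX, given a list, only the first element (cat) is run as the command, and cat without a file argument waits on stdin forever, which is why your attempt hangs.
p2 = subprocess.Popen("cat input.fq | awk " + awk_comm, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)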
You can use the commands module to achieve this (Python 2 only; the module was removed in Python 3, where subprocess.getoutput is the equivalent):
import commands
awk_comm = r"""'NR%4==2{sum+=length($0);nr++;sumsq+=length($0)*length($0)}END{printf"%.1f\t%.1f\n",sum/nr,sqrt(sumsq/nr-(sum/nr)**2)}'"""
p1 = commands.getoutput('cat input.fq | awk ' + awk_comm)
print p1
Hope this helps
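For comparison, the same statistics can be computed without any subprocess at all. This swaps awk for plain Python, so it is an alternative technique rather than an answer to the subprocess question. The function name read_length_stats is made up, and the sketch assumes the input really has four lines per record, as in the sample above.

```python
import math


def read_length_stats(lines):
    """Mean and population standard deviation of every 4th line's length,
    starting at line 2 (the awk condition NR%4==2)."""
    lengths = [len(line.rstrip("\n"))
               for number, line in enumerate(lines, start=1)
               if number % 4 == 2]
    count = len(lengths)
    mean = sum(lengths) / count
    variance = sum(x * x for x in lengths) / count - mean ** 2
    # Guard against tiny negative values caused by rounding.
    return mean, math.sqrt(max(variance, 0.0))
```

With the input file from the question it would be used as: with open('input.fq') as fq: print("%.1f\t%.1f" % read_length_stats(fq)).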
