I am trying to run the grep command from my Python module using the subprocess library. Since I am doing this operation on a doc file, I am using the third-party tool catdoc to get the content as plain text. I want to store the content in a file. I don't know where I am going wrong, but the program fails to generate the plain text file and eventually to get the grep result. I have gone through the error log, but it's empty. Thanks for all the help.
def search_file(name, keyword):
    # Extract and save the text from the doc file
    catdoc_cmd = ['catdoc', '-w', name, '>', 'testing.txt']
    catdoc_process = subprocess.Popen(catdoc_cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
    output = catdoc_process.communicate()[0]
    grep_cmd = []
    # Search for the keyword in the text file
    grep_cmd.extend(['grep', '%s' % keyword, 'testing.txt'])
    print grep_cmd
    p = subprocess.Popen(grep_cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
    stdoutdata = p.communicate()[0]
    print stdoutdata
On UNIX, specifying shell=True will cause the first argument to be treated as the command to execute, with all subsequent arguments treated as arguments to the shell itself. Thus, the > won't have any effect (since with /bin/sh -c, all arguments after the command are ignored).
Therefore, you should actually use
catdoc_cmd = ['catdoc -w "%s" > testing.txt' % name]
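With that, the whole pipeline is parsed by the shell, so the redirection takes effect. A minimal sketch of how it would be invoked (note the quoting around %s only guards against spaces in name, not other shell metacharacters):
catdoc_cmd = 'catdoc -w "%s" > testing.txt' % name
catdoc_process = subprocess.Popen(catdoc_cmd, shell=True, stderr=subprocess.PIPE)
catdoc_process.communicate()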
A better solution, though, would probably be to just read the text out of the subprocess' stdout, and process it using re or Python string operations:
catdoc_cmd = ['catdoc', '-w', name]
catdoc_process = subprocess.Popen(catdoc_cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
for line in catdoc_process.stdout:
    if keyword in line:
        print line.strip()
I think you're trying to pass the > to the shell, but that's not going to work the way you've done it. If you want to spawn a process, you should arrange for its standard output to be redirected. Fortunately, that's really easy to do: open the file you want the output to go to for writing, and pass it to Popen via the stdout keyword argument instead of PIPE (which attaches stdout to a pipe that you read with communicate()).
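For example (a minimal sketch of that approach, reusing testing.txt from the question):
with open('testing.txt', 'w') as outfile:
    catdoc_process = subprocess.Popen(['catdoc', '-w', name],
                                      stdout=outfile,
                                      stderr=subprocess.PIPE)
    catdoc_process.communicate()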
Related
I am running a PHP script through Python/Django, as it contains legacy code for a client.
Data is passed through to the PHP script via JSON, and after the script runs, a string is returned for display, like so:
proc = subprocess.Popen(["php -f script.php " + json.dumps(data_for_php, sort_keys=True)], shell=True, stdout=subprocess.PIPE)
script_response = proc.stdout.read()
return HttpResponse(script_response)
The issue I am having is something in this process is corrupting the data.
i.e. one JSON field from data_for_php has the key and value 'xxx_amount': u'$350,000.00', but ,000.00 comes back as the value in script_response.
It's not doing this for anything else.
I have been doing a bit of debugging and have determined that json.dumps(data_for_php, sort_keys=True) is not causing the issue, also data_for_php is good too.
It leads me to believe that this command, proc.stdout.read(), is somehow mutating $350 to (space).
Note: same thing is occurring for other dictionary values.
Update
I have been led to believe that the process I am using is a command-line script inside Python. When the command is called, the JSON variables are being passed inside the command-line script. This is probably the issue. Looking for a solution.
$350 in bash is a variable; in the shell it gets replaced by its value, which is not defined. Adding single quotes around the dump should do the trick to avoid interpreting special characters:
proc = subprocess.Popen(["php -f script.php '" + json.dumps(data_for_php, sort_keys=True) + "'"], shell=True, stdout=subprocess.PIPE)
script_response = proc.stdout.read()
return HttpResponse(script_response)
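A variant that avoids the shell entirely (a sketch, untested against the real script.php) passes the arguments as a list, so no quoting is needed and no shell expansion can happen:
proc = subprocess.Popen(['php', '-f', 'script.php',
                         json.dumps(data_for_php, sort_keys=True)],
                        stdout=subprocess.PIPE)
script_response = proc.stdout.read()
return HttpResponse(script_response)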
I want to extract scene change timestamps using the scene change detection from ffmpeg. I have to run it on a few hundred videos, so I wanted to use a Python subprocess to loop over all the contents of a folder.
My problem is that the command I was using to get these values for a single video involves redirecting the output to a file, which does not seem to be an option from inside a subprocess call.
This is my code:
p=subprocess.check_output(["ffmpeg", "-i", sourcedir+"/"+name+".mpg","-filter:v", "select='gt(scene,0.4)',showinfo\"","-f","null","-","2>","output"])
This one tells me ffmpeg needs an output.
output = "./result/"+name
p=subprocess.check_output(["ffmpeg", "-i", sourcedir+"/"+name+".mpg","-filter:v", "select='gt(scene,0.4)',metadata=print:file=output","-an","-f","null","-"])
This one gives me no error but doesn't create the file.
This is the original command that I use directly with ffmpeg:
ffmpeg -i input.flv -filter:v "select='gt(scene,0.4)',showinfo" -f null - 2> ffout
I just need the output of this command to be written to a file. Can anyone see how I could make it work?
Is there a better way than subprocess? Or just another way? Would it be easier in C?
You can redirect the stderr output directly from Python, without any need for shell=True, which can lead to shell injection.
It's as simple as:
with open(output_path, 'w') as f:
    subprocess.check_call(cmd, stderr=f)
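Applied to the ffmpeg command from the question, that might look like this (a sketch; the paths and the 0.4 threshold are taken from the question, and the quotes around the filter are dropped because no shell is involved):
output_path = './result/' + name
cmd = ['ffmpeg', '-i', sourcedir + '/' + name + '.mpg',
       '-filter:v', "select='gt(scene,0.4)',showinfo",
       '-f', 'null', '-']
with open(output_path, 'w') as f:
    subprocess.check_call(cmd, stderr=f)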
Things are easier in your case if you use the shell argument of the subprocess command, and it should behave the same. When using the shell, you can pass in a string as the command rather than a list of args.
cmd = "ffmpeg -i {0} -filter:v \"select='gt(scene,0.4)',showinfo\" -f null - 2> {1}".format(inputName, outputFile)
p = subprocess.check_output(cmd, shell=True)
If you want to pass arguments, you can easily format your string.
I am trying to read a GTF file and then edit it (using subprocess, grep and awk) before loading it into pandas.
The file has header lines (indicated by #), so I need to grep those and remove them first. I can do it in Python, but I want to introduce grep into my pipeline to make processing more efficient.
I tried doing:
import subprocess
from io import StringIO
gtf_file = open('chr2_only.gtf', 'r').read()
gtf_update = subprocess.Popen(["grep '^#' " + StringIO(gtf_file)], shell=True)
and
gtf_update = subprocess.Popen(["grep '^#' " + gtf_file], shell=True)
Both of these attempts throw an error; for the first one it was:
Traceback (most recent call last):
File "/home/everestial007/PycharmProjects/stitcher/pHASE-Stitcher-Markov/markov_final_test/phase_to_vcf.py", line 39, in <module> gtf_update = subprocess.Popen(["grep '^#' " + StringIO(gtf_file)], shell=True)
TypeError: Can't convert '_io.StringIO' object to str implicitly
However, if I specify the filename directly it works:
gtf_update = subprocess.Popen(["grep '^#' chr2_only.gtf"], shell=True)
and the output is:
<subprocess.Popen object at 0x7fc12e5ea588>
#!genome-build v.1.0
#!genome-version JGI8X
#!genome-date 2008-12
#!genome-build-accession GCA_000004255.1
#!genebuild-last-updated 2008-12
Could someone please provide different examples for problems like this, and also explain why I am getting the error, and why/how it would be possible to run subprocess directly on content already loaded into memory?
I also tried using subprocess with call, check_call, check_output, etc., but I've gotten several different error messages, like these:
OSError: [Errno 7] Argument list too long
and
Subprocess in Python: File Name too long
Here is a possible solution that allows you to send a string to grep. Essentially, you declare in the Popen constructor that you want to communicate with the called program via stdin and stdout. You then send the input via communicate and receive the output as the return value of communicate.
#!/usr/bin/python
import subprocess
gtf_file = open('chr2_only.gtf', 'r').read()
gtf_update = subprocess.Popen(["grep '^#' "], shell=True, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
# stdout, stderr (but the latter is empty)
gtf_filtered, _ = gtf_update.communicate(gtf_file)
print gtf_filtered
Note that it is wise not to use shell=True. Therefore, the Popen line should be written as
gtf_update = subprocess.Popen(["grep", '^#'], shell=False, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
The rationale is that you don't need the shell to parse the arguments to a single executable. So you avoid unnecessary overhead. It is also better from a security point of view, at least if some argument is potentially unsafe as it comes from a user (think of a filename containing |). (This is obviously not the case here.)
Note that from a performance point of view, I expect that reading the file directly with grep is faster than first reading the file with Python and then sending it to grep.
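For example, this lets grep open the file itself instead of piping it through Python (a sketch using the filename from the question):
gtf_update = subprocess.Popen(['grep', '^#', 'chr2_only.gtf'], stdout=subprocess.PIPE)
gtf_filtered, _ = gtf_update.communicate()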
I am trying to run the bash command pdfcrack in Python on a remote server. This is my code:
bashCommand = "pdfcrack -f pdf123.pdf > myoutput.txt"
import subprocess
process = subprocess.Popen(bashCommand.split(), stdout=subprocess.PIPE)
output, error = process.communicate()
I, however, get the following error message:
Non-option argument myoutput2.txt
Error: file > not found
Can anybody see my mistake?
The first argument to Popen is a list containing the command name and its arguments. > is not an argument to the command, though; it is shell syntax. You could simply pass the entire line to Popen and instruct it to use the shell to execute it:
process = subprocess.Popen(bashCommand, shell=True)
(Note that since you are redirecting the output of the command to a file, there is no reason to set its standard output to a pipe, because there will be nothing to read.)
A better solution, though, is to let Python handle the redirection.
process = subprocess.Popen(['pdfcrack', '-f', 'pdf123.pdf'], stdout=subprocess.PIPE)
with open('myoutput.txt', 'w') as fh:
    for line in process.stdout:
        fh.write(line)
        # Do whatever else you want with line
Also, don't use str.split as a replacement for the shell's word splitting. A valid command line like pdfcrack -f "foo bar.pdf" would be split into the incorrect list ['pdfcrack', '-f', '"foo', 'bar.pdf"'], rather than the correct list ['pdfcrack', '-f', 'foo bar.pdf'].
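If you do need to split a shell-style command line, shlex.split from the standard library handles the quoting correctly (a quick illustration):
import shlex
print shlex.split('pdfcrack -f "foo bar.pdf"')
# prints ['pdfcrack', '-f', 'foo bar.pdf']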
> is interpreted by the shell, but is not valid otherwise.
So this would work (don't split; use the string as-is):
process = subprocess.Popen(bashCommand, shell=True)
(and stdout=subprocess.PIPE isn't useful since all output is redirected to the output file)
But it would be better to use native Python for the redirection to the output file, passing the arguments as a list (which handles quote protection if needed):
with open("myoutput.txt", "w") as f:
    process = subprocess.Popen(["pdfcrack", "-f", "pdf123.pdf"], stdout=subprocess.PIPE)
    f.write(process.stdout.read())
    process.wait()
Your mistake is the > in the command.
It isn't treated as a redirection to a file, because normally bash does that, and here you are running the command without bash.
Try shell=True if you want to use bash. Then you don't have to split the command into a list.
subprocess.Popen("pdfcrack -f pdf123.pdf > myoutput.txt", shell=True)
I'm trying to write an svn pre-commit hook in python. Part of this involves checking the diff file to see if there are any actual file changes (as opposed to just property changes).
I have a working grep command which I can execute fine in the shell:
grep "^\(Added: \|Modified: \|Deleted: \)" diff filename | grep -v 'svn:'
However, when I put it through subprocess.Popen, it escapes all my backslashes, which knackers the regexp:
Executing command: ['grep', '"^\\(Added: \\|Modified: \\|Deleted: \\)"', ...]
How do I avoid this?
NB: I'm aware that I can pipe results between subprocesses and I can do the two greps that way. I need help getting the first one working first though :/
NB2: I also tried using filterdiff --clean instead and couldn't get it to work. Searching for Added, Modified or Deleted lines, removing those containing 'svn:', and checking that I had some results left did seem to work, though.
Python code:
command = ['grep', '"^\(Added: \|Modified: \|Deleted: \)"', filename]
sys.stdout.write('Executing command: %s\n' % (command))
p = subprocess.Popen(command,
                     stdin=subprocess.PIPE,
                     stdout=subprocess.PIPE,
                     stderr=subprocess.STDOUT,
                     shell=True)
data = p.stdout.read()
if len(data) == 0:
    sys.stdout.write("Diff does not contain any file modifications.\n")
    exit(0)
You need to consider what you want grep to see in its command line arguments.
The first argument needs to be the literal string "^\(Added: \|Modified: \|Deleted: \)", so that means that it shouldn't include the double quotes but should include the backslashes.
The way to express this kind of string is to use Python raw strings:
command = ['grep', r'^\(Added: \|Modified: \|Deleted: \)', filename]
A good way to check what you're actually running is to replace grep with echo, so you can at least see what arguments the command receives.
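For instance (a quick sanity check, not part of the hook itself):
command = ['echo', r'^\(Added: \|Modified: \|Deleted: \)', filename]
p = subprocess.Popen(command, stdout=subprocess.PIPE)
print p.communicate()[0]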