Unwanted new lines using python Popen and pandoc to parse html? - python

I am trying to convert several pieces of html to latex using python and pandoc, and I have got stuck on a couple of problems.
To let my python script talk to pandoc I use subprocess.Popen, redirecting stdout to a file that I later include in a latex template.
If I use the classic way of calling Popen:
from subprocess import Popen, PIPE, STDOUT
filedesc = open('myfile.tex','w')
args = ['pandoc', '-f', 'html', '-t', 'latex']
p = Popen(args, stdout=PIPE, stdin=PIPE, stderr=STDOUT)
outp, err = p.communicate(input=html)
filedesc.write(outp)
I get the lines with an additional new line where there shouldn't be any:
> \textbf{M. John Harrison} (Rugby, Warckwickshire, 1945) is a contemporary
>
> English writer.
This is (mysteriously?) easy to solve by changing stdout=PIPE to the file descriptor:
from subprocess import Popen, PIPE, STDOUT
filedesc = open('myfile.tex','w')
args = ['pandoc', '-f', 'html', '-t', 'latex']
p = Popen(args, stdout=filedesc, stdin=PIPE, stderr=STDOUT)
outp, err = p.communicate(input=html)
# not needed
# filedesc.write(outp)
But if I want to use a string buffer, the same problem occurs, since I cannot use it as the stdout parameter.
Any idea on how to stop Popen/pandoc from doing this?
Thanks!

Well, it seems to be a side effect of newline handling on Windows rather than a bug in Python's PIPE.
I am running this code on Windows, where new lines use the CR+LF (\r\n) style rather than the (cleaner) Unix-style LF (\n).
When I feed pandoc a large html text, it wraps its output every time the standard column width is reached, and the text that comes back through the pipe has a CR+LF at each of those points. Writing that text to a file opened in text mode then translates every \n into \r\n once more, so each line break ends up as \r\r\n, which shows up as an extra blank line. This is what made my output look so weird.
The dirty solution I have implemented is to add a replace('\r\n','\n') before writing the output, but I am not sure it's the most elegant one.
from subprocess import Popen, PIPE, STDOUT
html = '<p><b>Some random html code</b> longer than 80 columns ... </p>'
filedesc = open('myfile.tex','w')
args = ['pandoc', '-f', 'html', '-t', 'latex']
p = Popen(args, stdout=PIPE, stdin=PIPE, stderr=STDOUT)
outp, err = p.communicate(input=html)
filedesc.write(outp.replace('\r\n','\n'))
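A possible alternative to the manual replace, sketched here on the assumption that the blank lines come from CR+LF sequences being newline-translated a second time when written to a text-mode file on Windows: let subprocess decode the output in universal-newlines mode, then write the file with newline translation switched off. The child below is a small Python stand-in for pandoc (not the real tool), and `CHILD`, `run_text` and `write_lf` are hypothetical names used only for illustration.

```python
import subprocess
import sys

# Stand-in for pandoc: a child process that emits CR+LF line endings.
CHILD = "import sys; sys.stdout.buffer.write(b'line one\\r\\nline two\\r\\n')"


def run_text(cmd, input_text=""):
    # universal_newlines=True decodes the output and folds CR+LF into LF.
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                            universal_newlines=True)
    out, _ = proc.communicate(input_text)
    return out


def write_lf(path, text):
    # newline="\n" stops Python from turning LF back into CR+LF on Windows.
    with open(path, "w", newline="\n") as filedesc:
        filedesc.write(text)


latex = run_text([sys.executable, "-c", CHILD])
print(repr(latex))
```

With the real tool, the command list would be the `['pandoc', '-f', 'html', '-t', 'latex']` from the question and `input_text` the html string.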

Related

closing python command subprocesses

I want to continue with further commands after the subprocess has finished. I have the following code, but fsutil is not executed. How can I do it?
import os
from subprocess import Popen, PIPE, STDOUT
os.system('mkdir c:\\temp\\vhd')
p = Popen( ["diskpart"], stdin=PIPE, stdout=PIPE )
p.stdin.write("create vdisk file=c:\\temp\\vhd\\test.vhd maximum=2000 type=expandable\n")
p.stdin.write("attach vdisk\n")
p.stdin.write("create partition primary size=10\n")
p.stdin.write("format fs=ntfs quick\n")
p.stdin.write("assign letter=r\n")
p.stdin.write("exit\n")
p.stdout.close
os.system('fsutil file createnew r:\dummy.txt 6553600') #this doesn´t get executed
At the least, I think you need to change your code to look like this:
import os
from subprocess import Popen, PIPE
os.system('mkdir c:\\temp\\vhd')
p = Popen(["diskpart"], stdin=PIPE, stdout=PIPE, stderr=PIPE)
p.stdin.write("create vdisk file=c:\\temp\\vhd\\test.vhd maximum=2000 type=expandable\n")
p.stdin.write("attach vdisk\n")
p.stdin.write("create partition primary size=10\n")
p.stdin.write("format fs=ntfs quick\n")
p.stdin.write("assign letter=r\n")
p.stdin.write("exit\n")
results, errors = p.communicate()
os.system('fsutil file createnew r:\\dummy.txt 6553600')
From the documentation for Popen.communicate():
Interact with process: Send data to stdin. Read data from stdout and stderr, until end-of-file is reached. Wait for process to terminate. The optional input argument should be a string to be sent to the child process, or None, if no data should be sent to the child.
You could replace the p.communicate() with p.wait(), but there is this warning in the documentation for Popen.wait()
Warning This will deadlock when using stdout=PIPE and/or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use communicate() to avoid that.
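As a rough sketch of the same idea, the whole command script can be handed to communicate() in one go instead of several stdin.write() calls. communicate() then closes stdin, reads until end-of-file and waits for the child. The child here is a Python stand-in that just echoes each command (it is not diskpart), and `CHILD` and `run_script` are hypothetical names.

```python
import subprocess
import sys

# Stand-in for an interactive tool such as diskpart: echoes each command it reads.
CHILD = "import sys\nfor line in sys.stdin:\n    print('ran: ' + line.strip())\n"


def run_script(cmd, commands):
    # Join the commands into one script and send it in a single call.
    script = "\n".join(commands) + "\n"
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, universal_newlines=True)
    # communicate() writes stdin, closes it, reads to EOF and waits for exit.
    out, err = proc.communicate(script)
    return proc.returncode, out, err


returncode, out, err = run_script([sys.executable, "-c", CHILD],
                                  ["attach vdisk", "exit"])
# Only at this point has the child finished, so a follow-up step
# (fsutil in the question) can safely start.
print(returncode, out.splitlines())
```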

Substitution of subprocess.PIPE in Python?

I am using the subprocess module to work with the output of Linux commands. Below is my code.
import subprocess
import sys
file_name = 'myfile.txt'
p = subprocess.Popen("grep \"SYSTEM CONTROLLER\" "+ file_name, stdout=subprocess.PIPE, shell=True)
(output, err) = p.communicate()
print output.strip()
p = subprocess.Popen("grep \"controller\|worker\" "+ file_name, stdout=subprocess.PIPE, shell=True)
(output, err) = p.communicate()
lines = output.rstrip().split("\n")
print lines
My program hangs while executing the second subprocess, i.e.
p = subprocess.Popen("grep \"controller\|worker\""+ file_name,stdout=subprocess.PIPE, shell=True)
I found out that the process hangs because the buffer behind subprocess.PIPE fills up, which blocks the child from writing any further.
Is there any way to avoid the full-buffer situation so that my program keeps executing without hanging?
The actual issue is that a whitespace is missing between the pattern and the filename, so grep receives no file argument and waits for input on standard input (stdin).
Both the "buffer full" theory (.communicate() is not susceptible to it) and the p.stdout.read() suggestion (it fixes nothing: it also loads the output into memory and, unlike .communicate(), it fails if more than one pipe is used) are red herrings here.
Drop shell=True and use a list argument for the command:
#!/usr/bin/env python
from subprocess import Popen, PIPE
p = Popen(["grep", r"controller\|worker", file_name], stdout=PIPE)
output = p.communicate()[0]
if p.returncode == 0:
    print('found')
elif p.returncode == 1:
    print('not found')
else:
    print('error')
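To see what the missing space does to the command, here is a small sketch that uses shlex.split as an approximation of how a POSIX shell splits the two command strings into words (the real shell has more rules, so treat it as an illustration only). Without the space, the file name is glued onto the pattern and grep is left without a file to read.

```python
import shlex

file_name = "myfile.txt"

# Command string as built in the hanging call: no space before the file name.
broken = 'grep "controller\\|worker"' + file_name
# Same command with the separating space restored.
fixed = 'grep "controller\\|worker" ' + file_name

print(shlex.split(broken))
# ['grep', 'controller\\|workermyfile.txt']
print(shlex.split(fixed))
# ['grep', 'controller\\|worker', 'myfile.txt']
```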
As it says at https://docs.python.org/3/library/subprocess.html#subprocess.Popen.communicate:
Note: The data read is buffered in memory, so do not use this method
if the data size is large or unlimited.
Instead, use the file objects to read the text as it is produced:
output = p.stdout.read()
As long as no other pipes (e.g. stderr) fill up while you are reading, the process shouldn't be blocked.
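If the output really is large, one option is to consume it line by line rather than in a single read(), so that only one line is held in memory at a time. A minimal sketch, assuming a single pipe (stdout only) and using a Python child as a stand-in for grep; `CHILD` and `count_matching_lines` are hypothetical names.

```python
import subprocess
import sys

# Stand-in child that prints 1000 lines: "row 0" ... "row 999".
CHILD = "for i in range(1000): print('row', i)"


def count_matching_lines(cmd, needle):
    hits = 0
    with subprocess.Popen(cmd, stdout=subprocess.PIPE,
                          universal_newlines=True) as proc:
        # Iterating over the pipe yields lines as the child produces them.
        for line in proc.stdout:
            if needle in line:
                hits += 1
    # Leaving the with-block waits for the child, so returncode is set here.
    return hits, proc.returncode


print(count_matching_lines([sys.executable, "-c", CHILD], "row 99"))
```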

Concatenate command of an external shell software in python [duplicate]

If I do the following:
import subprocess
from cStringIO import StringIO
subprocess.Popen(['grep','f'],stdout=subprocess.PIPE,stdin=StringIO('one\ntwo\nthree\nfour\nfive\nsix\n')).communicate()[0]
I get:
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/build/toolchain/mac32/python-2.4.3/lib/python2.4/subprocess.py", line 533, in __init__
    (p2cread, p2cwrite,
  File "/build/toolchain/mac32/python-2.4.3/lib/python2.4/subprocess.py", line 830, in _get_handles
    p2cread = stdin.fileno()
AttributeError: 'cStringIO.StringI' object has no attribute 'fileno'
Apparently a cStringIO.StringIO object doesn't quack close enough to a file duck to suit subprocess.Popen. How do I work around this?
Popen.communicate() documentation:
Note that if you want to send data to the process’s stdin, you need to create the Popen object with stdin=PIPE. Similarly, to get anything other than None in the result tuple, you need to give stdout=PIPE and/or stderr=PIPE too.
Replacing os.popen*
pipe = os.popen(cmd, 'w', bufsize)
# ==>
pipe = Popen(cmd, shell=True, bufsize=bufsize, stdin=PIPE).stdin
Warning Use communicate() rather than stdin.write(), stdout.read() or stderr.read() to avoid deadlocks due to any of the other OS pipe buffers filling up and blocking the child process.
So your example could be written as follows:
from subprocess import Popen, PIPE, STDOUT
p = Popen(['grep', 'f'], stdout=PIPE, stdin=PIPE, stderr=STDOUT)
grep_stdout = p.communicate(input=b'one\ntwo\nthree\nfour\nfive\nsix\n')[0]
print(grep_stdout.decode())
# -> four
# -> five
# ->
On Python 3.5+ (3.6+ for encoding), you could use subprocess.run to pass input as a string to an external command and get back its exit status and its output as a string in one call:
#!/usr/bin/env python3
from subprocess import run, PIPE
p = run(['grep', 'f'], stdout=PIPE,
input='one\ntwo\nthree\nfour\nfive\nsix\n', encoding='ascii')
print(p.returncode)
# -> 0
print(p.stdout)
# -> four
# -> five
# ->
I figured out this workaround:
>>> p = subprocess.Popen(['grep','f'],stdout=subprocess.PIPE,stdin=subprocess.PIPE)
>>> p.stdin.write(b'one\ntwo\nthree\nfour\nfive\nsix\n') #expects a bytes type object
>>> p.communicate()[0]
'four\nfive\n'
>>> p.stdin.close()
Is there a better one?
There's a beautiful solution if you're using Python 3.4 or better. Instead of the stdin argument, use the input argument, which accepts a bytes object:
output_bytes = subprocess.check_output(
["sed", "s/foo/bar/"],
input=b"foo",
)
This works for check_output and run, but not call or check_call for some reason.
In Python 3.7+, you can also add text=True to make check_output take a string as input and return a string (instead of bytes):
output_string = subprocess.check_output(
["sed", "s/foo/bar/"],
input="foo",
text=True,
)
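One thing to keep in mind with check_output is that it raises CalledProcessError whenever the command exits with a non-zero status, and grep exits with 1 when nothing matches. A minimal sketch of handling that case, using a Python child as a stand-in for `grep f` (the names `CHILD` and `filter_lines` are hypothetical):

```python
import subprocess
import sys

# Stand-in for `grep f`: prints matching lines, exits with 1 if none matched.
CHILD = (
    "import sys\n"
    "hits = [line for line in sys.stdin if 'f' in line]\n"
    "sys.stdout.writelines(hits)\n"
    "sys.exit(0 if hits else 1)\n"
)


def filter_lines(text):
    try:
        return subprocess.check_output([sys.executable, "-c", CHILD],
                                       input=text, universal_newlines=True)
    except subprocess.CalledProcessError as exc:
        if exc.returncode == 1:
            return ""  # "no match" is not treated as an error here
        raise  # any other exit status is a real failure


print(repr(filter_lines("one\ntwo\nthree\nfour\nfive\nsix\n")))
print(repr(filter_lines("one\ntwo\n")))
```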
I'm a bit surprised nobody suggested creating a pipe, which is in my opinion by far the simplest way to pass a string to stdin of a subprocess:
import os
import subprocess
read, write = os.pipe()
os.write(write, b"stdin input here")  # os.write() needs bytes on Python 3
os.close(write)
subprocess.check_call(['your-command'], stdin=read)
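One caveat with the os.pipe approach: all the data is written before the child starts reading, so the write can block once the input is larger than the OS pipe buffer. A possible variation, sketched below, moves the write into a helper thread so the child can drain the pipe while it is being filled. `feed_via_pipe` and `CHILD` are hypothetical names, and the child is a Python stand-in that only counts the bytes it receives.

```python
import os
import subprocess
import sys
import threading

# Stand-in child: reports how many bytes arrived on stdin.
CHILD = "import sys; print(len(sys.stdin.buffer.read()))"


def feed_via_pipe(cmd, data):
    read_fd, write_fd = os.pipe()

    def writer():
        try:
            # Closing the write end signals end-of-file to the child.
            with os.fdopen(write_fd, "wb") as sink:
                sink.write(data)
        except BrokenPipeError:
            pass  # the child stopped reading early

    thread = threading.Thread(target=writer)
    thread.start()
    try:
        proc = subprocess.run(cmd, stdin=read_fd, stdout=subprocess.PIPE)
    finally:
        os.close(read_fd)  # unblocks the writer if the child exited early
        thread.join()
    return proc.stdout


print(feed_via_pipe([sys.executable, "-c", CHILD], b"x" * 200000))
```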
I am using python3 and found out that you need to encode your string before you can pass it into stdin:
p = Popen(['grep', 'f'], stdout=PIPE, stdin=PIPE, stderr=PIPE)
out, err = p.communicate(input='one\ntwo\nthree\nfour\nfive\nsix\n'.encode())
print(out)
Apparently a cStringIO.StringIO object doesn't quack close enough to
a file duck to suit subprocess.Popen
I'm afraid not. The pipe is a low-level OS concept, so it absolutely requires a file object that is represented by an OS-level file descriptor. Your workaround is the right one.
from subprocess import Popen, PIPE
from tempfile import SpooledTemporaryFile as tempfile
f = tempfile()
f.write('one\ntwo\nthree\nfour\nfive\nsix\n')
f.seek(0)
print Popen(['/bin/grep','f'],stdout=PIPE,stdin=f).stdout.read()
f.close()
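The file-descriptor requirement can be checked directly. The sketch below tests whether an object exposes a usable fileno() and then feeds a child from a temporary file, which does have one. It is written for Python 3, the child is a Python stand-in for `grep f`, and `has_real_fd`, `grep_f_from_tempfile` and `CHILD` are hypothetical names.

```python
import io
import subprocess
import sys
import tempfile

# Stand-in for `grep f`: copies lines containing "f" from stdin to stdout.
CHILD = "import sys; sys.stdout.writelines(l for l in sys.stdin if 'f' in l)"


def has_real_fd(stream):
    # In-memory streams raise io.UnsupportedOperation (an OSError/ValueError).
    try:
        stream.fileno()
    except (AttributeError, OSError, ValueError):
        return False
    return True


def grep_f_from_tempfile(text):
    with tempfile.TemporaryFile(mode="w+") as tmp:
        tmp.write(text)
        tmp.seek(0)  # flush and rewind so the child reads from the start
        return subprocess.check_output([sys.executable, "-c", CHILD],
                                       stdin=tmp, universal_newlines=True)


print(has_real_fd(io.StringIO("one\n")))
print(repr(grep_f_from_tempfile("one\ntwo\nthree\nfour\nfive\nsix\n")))
```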
"""
Ex: Dialog (2-way) with a Popen()
"""
import subprocess

p = subprocess.Popen('Your Command Here',
                     stdout=subprocess.PIPE,
                     stderr=subprocess.STDOUT,
                     stdin=subprocess.PIPE,
                     shell=True,
                     bufsize=0)
p.stdin.write('START\n')
out = p.stdout.readline()
while out:
    line = out
    line = line.rstrip("\n")
    if "WHATEVER1" in line:
        pr = 1
        p.stdin.write('DO 1\n')
        out = p.stdout.readline()
        continue
    if "WHATEVER2" in line:
        pr = 2
        p.stdin.write('DO 2\n')
        out = p.stdout.readline()
        continue
    """
    ..........
    """
    out = p.stdout.readline()
p.wait()
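A Python 3 flavoured sketch of the same request/reply pattern, assuming the child answers every request with exactly one line and flushes after each reply (otherwise readline() would block). The child is a Python stand-in that upper-cases what it receives; `CHILD` and `dialog` are hypothetical names.

```python
import subprocess
import sys

# Stand-in child: answers every request line with its upper-cased form.
CHILD = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(line.strip().upper(), flush=True)\n"
)


def dialog(cmd, requests):
    replies = []
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                            universal_newlines=True, bufsize=1)
    try:
        for request in requests:
            proc.stdin.write(request + "\n")
            proc.stdin.flush()  # make sure the child sees the request now
            replies.append(proc.stdout.readline().rstrip("\n"))
    finally:
        proc.stdin.close()  # end-of-file lets the child finish its loop
        proc.stdout.close()
        proc.wait()
    return replies


print(dialog([sys.executable, "-c", CHILD], ["start", "do 1"]))
```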
On Python 3.7+ do this:
my_data = "whatever you want\nshould match this f"
subprocess.run(["grep", "f"], text=True, input=my_data)
and you'll probably want to add capture_output=True to get the output of running the command as a string.
On older versions of Python, replace text=True with universal_newlines=True:
subprocess.run(["grep", "f"], universal_newlines=True, input=my_data)
Beware that Popen.communicate(input=s) may give you trouble if s is too big, because apparently the parent process will buffer it before forking the child subprocess, meaning it needs "twice as much" used memory at that point (at least according to the "under the hood" explanation and linked documentation found here). In my particular case, s was a generator that was first fully expanded and only then written to stdin, so the parent process was huge right before the child was spawned, and no memory was left to fork it:
File "/opt/local/stow/python-2.7.2/lib/python2.7/subprocess.py", line 1130, in _execute_child
self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
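One way to keep the parent small in a situation like this, sketched under the assumption that the input can be produced chunk by chunk: start the child first, then write the generator's chunks to stdin as they are produced while a helper thread drains stdout, so neither pipe fills up. This only avoids expanding the generator up front; it does not change how fork itself behaves. `pipe_chunks` and `CHILD` are hypothetical names, and the child is a Python stand-in that counts input lines.

```python
import subprocess
import sys
import threading

# Stand-in child: counts the lines it receives on stdin.
CHILD = "import sys; print(sum(1 for _ in sys.stdin.buffer))"


def pipe_chunks(cmd, chunks):
    # Spawn the child before any input has been materialized.
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    collected = []
    # Drain stdout in the background so the child never blocks on a full pipe.
    reader = threading.Thread(target=lambda: collected.append(proc.stdout.read()))
    reader.start()
    try:
        for chunk in chunks:  # chunks are produced lazily, one at a time
            proc.stdin.write(chunk)
        proc.stdin.close()
    except BrokenPipeError:
        pass  # the child stopped reading early
    reader.join()
    proc.stdout.close()
    proc.wait()
    return collected[0]


lines = (b"line %d\n" % i for i in range(50000))
print(pipe_chunks([sys.executable, "-c", CHILD], lines))
```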
This is overkill for grep, but through my journeys I've learned about the Linux command expect and the Python library pexpect:
expect: dialogue with interactive programs
pexpect: Python module for spawning child applications; controlling them; and responding to expected patterns in their output.
import pexpect
child = pexpect.spawn('grep f', timeout=10)
child.sendline('text to match')
print(child.before)
Working with interactive shell applications like ftp is trivial with pexpect
import pexpect
child = pexpect.spawn ('ftp ftp.openbsd.org')
child.expect ('Name .*: ')
child.sendline ('anonymous')
child.expect ('Password:')
child.sendline ('noah@example.com')
child.expect ('ftp> ')
child.sendline ('ls /pub/OpenBSD/')
child.expect ('ftp> ')
print child.before # Print the result of the ls command.
child.interact() # Give control of the child to the user.
import time
from subprocess import Popen, PIPE, STDOUT
p = Popen(['grep', 'f'], stdout=PIPE, stdin=PIPE, stderr=STDOUT)
p.stdin.write('one\n')
time.sleep(0.5)
p.stdin.write('two\n')
time.sleep(0.5)
p.stdin.write('three\n')
time.sleep(0.5)
testresult = p.communicate()[0]
time.sleep(0.5)
print(testresult)

python subprocess missing arguments

I have been trying to get something like this to work for a while. The code below doesn't seem to pass the argument to the C program arg_count, which outputs argc = 1 when I'm pretty sure it should be 2. Running ./arg_count -arg from the shell outputs 2...
I have tried with another arg (so it would output 3 in the shell) and it still outputs 1 when called via subprocess.
import subprocess
pipe = subprocess.Popen(["./args/Release/arg_count", "-arg"], shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = pipe.communicate()
result = out.decode()
print "Result : ",result
print "Error : ",err
Any idea where I'm falling over? I'm running Linux, by the way.
From the documentation:
The shell argument (which defaults to False) specifies whether to use
the shell as the program to execute. If shell is True, it is
recommended to pass args as a string rather than as a sequence.
Thus,
pipe = subprocess.Popen("./args/Release/arg_count -arg", shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
should give you what you want.
If shell=True then your call is equivalent to:
from subprocess import Popen, PIPE
proc = Popen(['/bin/sh', '-c', "./args/Release/arg_count", "-arg"],
stdout=PIPE, stderr=PIPE)
i.e., -arg is passed to the shell itself and not your program. Drop shell=True to pass -arg to the program:
proc = Popen(["./args/Release/arg_count", "-arg"],
stdout=PIPE, stderr=PIPE)
If you don't need to capture stderr separately from stdout then you could use check_output():
from subprocess import check_output, STDOUT
output = check_output(["./args/Release/arg_count", "-arg"]) # or
output_and_errors = check_output(["./args/Release/arg_count", "-arg"],
stderr=STDOUT)
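The difference can be checked with a small experiment. In this sketch a Python one-liner stands in for the C program and reports how many arguments it received, and a second helper shows (on POSIX systems only) that with shell=True the items after the first end up as the shell's own positional parameters. `SCRIPT`, `argc_seen_by_program` and `shell_dollar_zero` are hypothetical names.

```python
import subprocess
import sys

# Stand-in for the C program: reports how many arguments it received (argc).
SCRIPT = "import sys; print(len(sys.argv))"


def argc_seen_by_program(extra_args):
    # No shell involved: every list item reaches the program itself.
    cmd = [sys.executable, "-c", SCRIPT] + list(extra_args)
    return int(subprocess.check_output(cmd))


def shell_dollar_zero(extra):
    # POSIX only: with shell=True the first item is the command string and
    # the next item becomes $0 of the shell, not an argument of the command.
    out = subprocess.check_output(['echo "$0"', extra], shell=True,
                                  universal_newlines=True)
    return out.strip()


print(argc_seen_by_program(["-arg"]))
print(shell_dollar_zero("marker"))
```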

Popen.communicate escapes a string I send to stdin

I am trying to spawn a process using Popen and send it a particular string to its stdin.
I have:
pipe = subprocess.Popen(cmd, shell=True, stdin=subprocess.PIPE)
pipe.communicate( my_stdin_str.encode(encoding='ascii') )
pipe.stdin.close()
However, the second line actually escapes the whitespace in my_stdin_str. For example, if I have:
my_stdin_str="This is a string"
The process will see:
This\ is\ a\ string
How can I prevent this behaviour?
I can't reproduce it on Ubuntu:
from subprocess import Popen, PIPE
shell_cmd = r"perl -pE's/.\K/-/g'"
p = Popen(shell_cmd, shell=True, stdin=PIPE)
p.communicate("This $PATH is a string".encode('ascii'))
In this case shell=True is unnecessary:
from subprocess import Popen, PIPE
cmd = ["perl", "-pE" , r"s/.\K/-/g"]
p = Popen(cmd, stdin=PIPE)
p.communicate("This $PATH is a string".encode('ascii'))
Both produce the same output:
T-h-i-s- -$-P-A-T-H- -i-s- -a- -s-t-r-i-n-g-
Unless you know you need it for some reason, don't run with "shell=True" in general (which, without testing, sounds like what's going on here).
