subprocess.run does run properly pandoc - python

If you run pandoc directly with minimal example no problem:
$ cat workfile.md
This is a **test** see ![alt text](https://github.com/adam-p/markdown-here/raw/master/src/common/images/icon48.png)
$ pandoc workfile.md
<p>This is a <strong>test</strong> see <img src="https://github.com/adam-p/markdown-here/raw/master/src/common/images/icon48.png" alt="alt text" /></p>
But if you call it via subprocess.run, then it fails. This minimal example:
import subprocess, os
path = 'workfile.md'
contents = "This is a **test** see ![alt text](https://github.com/adam-p/markdown-here/raw/master/src/common/images/icon48.png)"
with open(path, 'w') as f:
f.write(contents)
pbody = subprocess.run(["pandoc", "{}".format(path)], check=True, stdout=subprocess.PIPE)
print("**** pbody: ", pbody)
gives us
**** pbody: CompletedProcess(args=['pandoc', 'workfile.md'], returncode=0, stdout=b'\n')

One of the things that Python (and all other programming languages) do to increase performance of common operations is to maintain a buffer for things like file printing. Depending on how much you write to a file, not all of it will get written immediately, which lets the language reduce the amount of (slow) operations it has to do to actually get things on the disk. If you call f.flush() after f.write(contents), you should see pandoc pick up the actual contents of the file.
There's a further layer of buffering that's also worth noting- your operating system may have an updated version of the file in memory, but may not have actually written it to disk. If you're writing a server, you may also want to call os.fsync, which will force the OS to write it to disk.

Related

tail and less commands not monitoring file in real time

I'm looking for a way to monitor a file that is written to by a program on Linux. I found the tail -F command in here, and also recommended was less +FG. I tested it by running tail -F file in one terminal, and a simple python script:
import time
for i in range(20):
print i
time.sleep(0.5)
in another. I redirected the output to the file:
python script.py >> file
I expected that tail would track the file contents and update the display in fixed intervals, instead it only shows what was written to the file after the command terminates.
The same thing happens with less +FG and also if I watch the output from cat. I've also tried using the usual redirect which truncates the file > instead of >>. Here it says the file was truncated, but still does not track it in real time.
Any idea why this doesn't work? (It's suggested here that it might be due to buffered writes, but since my script runs over 10 seconds, I suspect this might not be the cause)
Edit: In case it matters, I'm running Linux Mint 18.1
Python's standard out is buffered. If when you close the script / script is done, you see all the output - that's definitely buffer issue.
You can use this instead:
import time
import sys
for i in range(20):
sys.stdout.write('%d\n' % i)
sys.stdout.flush()
time.sleep(0.5)
I've tested it and it prints values in real time. To overcome buffer issue, after each .write() method I use .flush() force "flushing" the buffer.
Additional options from the comments:
Use the original print statement with sys.stdout.flush() afterwords
Run the python script with python -u for unbuffered binary stdout and stderr
Regarding jon1467 answer (sorry can't comment your answer), your understanding of redirection is wrong.
Try this :
dd if=/dev/urandom > test.txt
while looking at the file size with :
ls -l test.txt
You'll see the file grow while dd is running.
Vinny's answer is correct, python standard output is buffered.
The more common way to the "buffering effect" you notice is by flushing the stdout as Vinny showed you.
You could also use -u option to disable buffering for the whole python process, or you could just reopen standard output with a buffer size of 0 as below (in python2 at least):
sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0)

How do I pipe to a file or shell program via Pythons subprocess?

I am working with some fairly large gzipped text files that I have to unzip, edit and re-zip. I use Pythons gzip module for unzipping and zipping, but I have found that my current implementation is far from optimal:
input_file = gzip.open(input_file_name, 'rb')
output_file = gzip.open(output_file_name, 'wb')
for line in input_file:
# Edit line and write to output_file
This approach is unbearably slow – probably because there is a huge overhead involved in doing per line iteration with the gzip module: I initially also run a line-count routine where I - using the gzip module - read chunks of the file and then count the number of newline chars in each chunk and that is very fast!
So one of the optimizations should definitely be to read my files in chunks and then only do per line iterations once the chunks have been unzipped.
As an additional optimization, I have seen a few suggestions to unzip in a shell command via subprocess. Using this approach, the equivalent of the first line in the above could be:
from subprocess import Popen, PIPE
file_input = Popen(["zcat", fastq_filename], stdout=PIPE)
input_file = file_input.stdout
Using this approach input_file becomes a file-like object. I don't know exactly how it is different to a real file object in terms of available attributes and methods, but one difference is that you obviously cannot use seek since it is a stream rather than a file.
This does run faster and it should - unless you run your script in a single core machine the claim is. The latter must mean that subprocess automatically ships different threads to different cores if possible, but I am no expert there.
So now to my current problem: I would like to zip my output in a similar fashion. That is, instead of using Pythons gzip module, I would like to pipe it to a subprocess and then call the shell gzip. This way I could potentially get reading, editing and writing in separate cores, which sounds wildly effective to me.
I have made a puny attempt at this, but attempting to write to output_file resulted in an empty file. Initially, I create an empty file using the touch command because Popen fails if the file does not exist:
call('touch ' + output_file_name, shell=True)
output = Popen(["gzip", output_file_name], stdin=PIPE)
output_file = output.stdin
Any help is greatly appreciated, I am using Python 2.7 by the way. Thanks.
Here is a working example of how this can be done:
#!/usr/bin/env python
from subprocess import Popen, PIPE
output = ['this', 'is', 'a', 'test']
output_file_name = 'pipe_out_test.txt.gz'
gzip_output_file = open(output_file_name, 'wb', 0)
output_stream = Popen(["gzip"], stdin=PIPE, stdout=gzip_output_file) # If gzip is supported
for line in output:
output_stream.stdin.write(line + '\n')
output_stream.stdin.close()
output_stream.wait()
gzip_output_file.close()
If our script only wrote to console and we wanted the output zipped, a shell command equivalent of the above could be:
script_that_writes_to_console | gzip > output.txt.gz
You meant output_file = gzip_process.stdin. After that you can use output_file as you've used gzip.open() object previously (no-seeking).
If the result file is empty then check that you call output_file.close() and gzip_process.wait() at the end of your Python script. Also, the usage of gzip may be incorrect: if gzip writes the compressed output to its stdout then pass stdout=gzip_output_file where gzip_output_file = open(output_file_name, 'wb', 0).

How does multitail buffer its output?

This may not be the best wording for the question. I am trying to see 2 files at once on my screen.
I run:
multitail ~/path/to/somefile.err ~/path/to/somefile.out
I have a python script with the following lines:
sys.stdout = open('~/path/to/somefile.out', 'a')
sys.stderr = open('~/path/to/somefile.err', 'a')
My multitail command seems to only output my .out file, regardless of which order I put the files in the command.
I verified that my script is indeed writing to the files. What is also interesting, was that when I run the following command:
echo "text" >> ~/path/to/somefile.err
All of a sudden I see all the output from the .err file in the multitail screen (including that which didn't show up before)!
What is going on here that I can not see?
P.S. this is my first time using multitail so maybe I overlooked something simple. If it means anything, I am using CentOS 7.
You need to pass either buffering=0 (for unbuffered) or buffering=1 (for line-buffered - probably what you want) in your call to open.
The default is buffering=-1, which is equivalent to something like buffering=512 with a value depending on the system, so nothing will be written to the file until 512 (or whatever) bytes are written.
Alternatively, you could leave buffering set to its default value, and call .flush() every time you want the data to appear in the file.
When you use >> in the shell, that will close the file when the command exits, and closing implies a flush. (You can defer the close by using exec >> file.txt)

Writing append only gzipped log files in Python

I am building a service where I log plain text format logs from several sources (one file per source). I do not intend to rotate these logs as they must be around forever.
To make these forever around files smaller I hope I could gzip them in fly. As they are log data, the files compress very well.
What is a good approach in Python to write append-only gzipped text files, so that the writing can be later resumed when service goes on and off? I am not that worried about losing few lines, but if gzip container itself breaks down and the file becomes unreadable that's no no.
Also, if it's no go, I can simply write them in as plain text without gzipping if it's not worth of the hassle.
Note: On unix systems you should seriously consider using an external program, written for this exact task:
logrotate (rotates, compresses, and mails system logs)
You can set the number of rotations so high, that the first file would be deleted in 100 years or so.
In Python 2, logging.FileHandler takes an keyword argument encoding that can be set to bz2 or zlib.
This is because logging uses the codecs module, which in turn treats bz2 (or zlib) as encoding:
>>> import codecs
>>> with codecs.open("on-the-fly-compressed.txt.bz2", "w", "bz2") as fh:
... fh.write("Hello World\n")
$ bzcat on-the-fly-compressed.txt.bz2
Hello World
Python 3 version (although the docs mention bz2 as alias, you'll actually have to use bz2_codec - at least w/ 3.2.3):
>>> import codecs
>>> with codecs.open("on-the-fly-compressed.txt.bz2", "w", "bz2_codec") as fh:
... fh.write(b"Hello World\n")
$ bzcat on-the-fly-compressed.txt.bz2
Hello World

Running a command line from python and piping arguments from memory

I was wondering if there was a way to run a command line executable in python, but pass it the argument values from memory, without having to write the memory data into a temporary file on disk. From what I have seen, it seems to that the subprocess.Popen(args) is the preferred way to run programs from inside python scripts.
For example, I have a pdf file in memory. I want to convert it to text using the commandline function pdftotext which is present in most linux distros. But I would prefer not to write the in-memory pdf file to a temporary file on disk.
pdfInMemory = myPdfReader.read()
convertedText = subprocess.<method>(['pdftotext', ??]) <- what is the value of ??
what is the method I should call and how should I pipe in memory data into its first input and pipe its output back to another variable in memory?
I am guessing there are other pdf modules that can do the conversion in memory and information about those modules would be helpful. But for future reference, I am also interested about how to pipe input and output to the commandline from inside python.
Any help would be much appreciated.
with Popen.communicate:
import subprocess
out, err = subprocess.Popen(["pdftotext", "-", "-"], stdout=subprocess.PIPE).communicate(pdf_data)
os.tmpfile is useful if you need a seekable thing. It uses a file, but it's nearly as simple as a pipe approach, no need for cleanup.
tf=os.tmpfile()
tf.write(...)
tf.seek(0)
subprocess.Popen( ... , stdin = tf)
This may not work on Posix-impaired OS 'Windows'.
Popen.communicate from subprocess takes an input parameter that is used to send data to stdin, you can use that to input your data. You also get the output of your program from communicate, so you don't have to write it into a file.
The documentation for communicate explicitly warns that everything is buffered in memory, which seems to be exactly what you want to achieve.

Categories