Running a command line from python and piping arguments from memory

I was wondering if there is a way to run a command line executable from python, passing it argument values from memory, without having to write the in-memory data to a temporary file on disk. From what I have seen, it seems that subprocess.Popen(args) is the preferred way to run programs from inside python scripts.
For example, I have a pdf file in memory. I want to convert it to text using the commandline function pdftotext which is present in most linux distros. But I would prefer not to write the in-memory pdf file to a temporary file on disk.
pdfInMemory = myPdfReader.read()
convertedText = subprocess.<method>(['pdftotext', ??]) <- what is the value of ??
What is the method I should call, and how should I pipe in-memory data into its stdin and pipe its output back into another variable in memory?
I am guessing there are other pdf modules that can do the conversion in memory and information about those modules would be helpful. But for future reference, I am also interested about how to pipe input and output to the commandline from inside python.
Any help would be much appreciated.

With Popen.communicate (note that stdin must also be a PIPE so that communicate has somewhere to send the data):
import subprocess
out, err = subprocess.Popen(["pdftotext", "-", "-"], stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE).communicate(pdf_data)

os.tmpfile is useful if you need something seekable. It uses a file, but it's nearly as simple as the pipe approach, and there is no need for cleanup.
tf=os.tmpfile()
tf.write(...)
tf.seek(0)
subprocess.Popen( ... , stdin = tf)
This may not work on the POSIX-impaired OS 'Windows'.
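On Python 3, where os.tmpfile no longer exists, tempfile.TemporaryFile does the same job. A minimal sketch of the same idea with it (assuming, as above, that pdftotext accepts - for stdin and stdout):
import subprocess
import tempfile

# pdfInMemory is the data read earlier; the temp file is deleted automatically
# when it is closed, so no explicit cleanup is needed here either.
tf = tempfile.TemporaryFile()
tf.write(pdfInMemory)
tf.seek(0)
out, err = subprocess.Popen(["pdftotext", "-", "-"],
                            stdin=tf, stdout=subprocess.PIPE).communicate()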

Popen.communicate from subprocess takes an input parameter that is used to send data to stdin; you can use that to feed in your data. You also get the output of your program back from communicate, so you don't have to write it to a file.
The documentation for communicate explicitly warns that everything is buffered in memory, which seems to be exactly what you want to achieve.
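A minimal sketch of that approach, applied to the asker's pdftotext case (assuming, as above, that pdftotext accepts - for stdin and stdout):
import subprocess

# Both stdin and stdout are pipes, so communicate can push the in-memory PDF
# in and collect the converted text back out.
proc = subprocess.Popen(["pdftotext", "-", "-"],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)
convertedText, _ = proc.communicate(input=pdfInMemory)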

Related

Python's sub-process returns truncated output when using PIPE to read very long outputs

We have a rasterization utility developed in NodeJS that converts an HTML string to the Base64 of the rendered HTML page. The way we are using it is by running the utility with the subprocess module and then reading its STDOUT through a PIPE. The basic code that implements this is as follows:
from subprocess import run, PIPE
result = run(['capture', tmp_file.name, '--type', 'jpeg'], stdout=PIPE, stderr=PIPE, check=True)
output = result.stdout.decode('utf-8')
The output contains the Base64 string of the rendered HTML page. As the Base64 is very large for large pages, I have noticed that for some HTML pages the output is truncated and incomplete. This happens randomly, so the Base64 could be correct for a page one time but truncated the next. It is important to mention here that I'm currently using threading (10 threads) to convert HTML to Base64 images concurrently, so that might play a role here.
I analyzed this in detail and found out that, under the hood, the subprocess.run method uses the _communicate method which in turn uses the os.read() method to read from the PIPE. I printed its output and found out that it's also truncated and that's why STDOUT is truncated. Strange behavior altogether.
Finally, I was able to solve this by using a file handle instead of the PIPE and it works perfectly.
with open(output_filename, 'w+') as out_file:
    result = run(['capture', tmp_file.name, '--type', 'jpeg'], stdout=out_file, stderr=PIPE, check=True)
I'm just curious why the PIPE fails to deliver the complete output, and why it happens only some of the time.
When you run a command with subprocess, it is executed as a child process (a shell is only involved if you pass shell=True).
When you use PIPE for stdout, the data passes through an OS pipe whose kernel buffer is limited (typically 64 KiB on Linux), not through an unbounded variable; anything beyond that blocks the writer until the reader drains it, and if the output is not read completely it will appear truncated.
The best way to handle large data is to capture the output in a file.
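For example, a sketch of the file-based capture, reusing the question's capture command and reading the result back into memory afterwards:
from subprocess import run, PIPE

with open(output_filename, 'w+') as out_file:
    run(['capture', tmp_file.name, '--type', 'jpeg'],
        stdout=out_file, stderr=PIPE, check=True)
    out_file.seek(0)          # the child wrote through the inherited descriptor,
    output = out_file.read()  # so rewind before reading the result back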

In Python 3 on Windows, how can I set NTFS compression on a file? Nothing I've googled has gotten me even close to an answer

(Background: On an NTFS partition, files and/or folders can be set to "compressed", like it's a file attribute. They'll show up in blue in Windows Explorer, and will take up less disk space than they normally would. They can be accessed by any program normally, compression/decompression is handled transparently by the OS - this is not a .zip file. In Windows, setting a file to compressed can be done from a command line with the "Compact" command.)
Let's say I've created a file called "testfile.txt", put some data in it, and closed it. Now, I want to set it to be NTFS compressed. Yes, I could shell out and run Compact, but is there a way to do it directly in Python code instead?
In the end, I ended up cheating a bit and simply shelling out to the command line Compact utility. Here is the function I ended up writing. Errors are ignored, and it returns the output text from the Compact command, if any.
def ntfscompress(filename):
    import subprocess
    _compactcommand = 'Compact.exe /C /I /A "{}"'.format(filename)
    try:
        _result = subprocess.run(_compactcommand, timeout=86400,
                                 stdout=subprocess.PIPE,
                                 stderr=subprocess.STDOUT, text=True)
        return _result.stdout
    except:
        return ''
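For completeness, the same attribute can in principle be set without shelling out, by issuing the Win32 FSCTL_SET_COMPRESSION control code through ctypes. This is only a rough, untested sketch of that approach, using the standard Win32 constants, not a drop-in replacement for the function above:
import ctypes
from ctypes import wintypes

FSCTL_SET_COMPRESSION = 0x9C040      # standard Win32 FSCTL code
COMPRESSION_FORMAT_DEFAULT = 1

kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
kernel32.CreateFileW.restype = wintypes.HANDLE
kernel32.CreateFileW.argtypes = (wintypes.LPCWSTR, wintypes.DWORD, wintypes.DWORD,
                                 wintypes.LPVOID, wintypes.DWORD, wintypes.DWORD,
                                 wintypes.HANDLE)
kernel32.DeviceIoControl.argtypes = (wintypes.HANDLE, wintypes.DWORD,
                                     wintypes.LPVOID, wintypes.DWORD,
                                     wintypes.LPVOID, wintypes.DWORD,
                                     ctypes.POINTER(wintypes.DWORD), wintypes.LPVOID)
kernel32.CloseHandle.argtypes = (wintypes.HANDLE,)

def ntfscompress_ioctl(filename):
    # Open the existing file for read/write access
    # (0x80000000 | 0x40000000 = GENERIC_READ | GENERIC_WRITE, 3 = OPEN_EXISTING).
    handle = kernel32.CreateFileW(filename, 0x80000000 | 0x40000000,
                                  0x1 | 0x2, None, 3, 0, None)
    if handle == wintypes.HANDLE(-1).value:  # INVALID_HANDLE_VALUE
        raise ctypes.WinError(ctypes.get_last_error())
    try:
        fmt = ctypes.c_ushort(COMPRESSION_FORMAT_DEFAULT)
        returned = wintypes.DWORD(0)
        # Ask the filesystem to compress the file with the default format.
        ok = kernel32.DeviceIoControl(handle, FSCTL_SET_COMPRESSION,
                                      ctypes.addressof(fmt), ctypes.sizeof(fmt),
                                      None, 0, ctypes.byref(returned), None)
        if not ok:
            raise ctypes.WinError(ctypes.get_last_error())
    finally:
        kernel32.CloseHandle(handle)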

Capturing output file of ffmpeg with vid.stab in python into a variable

I'm trying to write a python script to stabilize videos using ffmpeg and the vid.stab library.
My problem is that the output file doesn't seem to go through stdout, so using subprocess.Popen() returns an empty variable.
cmd1=["ffmpeg", "-i","./input.MOV", "-vf", "vidstabdetect=stepsize=6:shakiness=10:accuracy=15", "-f","null","pipe:1"]
p = subprocess.Popen(cmd1, stdout=subprocess.PIPE)
vectors, err = p.communicate()
The issue is that vidstabdetect takes an option called result and writes its output to whatever file is specified there, while stdout remains empty. (If no result is specified, it defaults to transforms.trf.)
Is there a way to get the contents of the result file?
When running the script with the code above it executes without error, but the file is created with the default name and the variable remains empty.
You need to point the filter's logging data at stdout, not the transcoded output from ffmpeg, which is what your current -f null pipe:1 does.
However, the vidstabdetect filter uses the POSIX fopen to open the destination for the transform data, unlike most other filters, which use the internal avio_open. With fopen, pipe:1 is not acceptable: on Windows you need CON, and on Linux, as you confirmed, /dev/stdout.
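On Linux that would look something like the sketch below (untested, but it follows directly from the above: the result option is pointed at /dev/stdout and the null-muxer output goes to -, so stdout carries only the transform data):
import subprocess

cmd = ["ffmpeg", "-i", "./input.MOV",
       "-vf", "vidstabdetect=stepsize=6:shakiness=10:accuracy=15:result=/dev/stdout",
       "-f", "null", "-"]
p = subprocess.Popen(cmd, stdout=subprocess.PIPE)
vectors, _ = p.communicate()   # vectors now holds the transform data, not the video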

How do I pipe to a file or shell program via Python's subprocess?

I am working with some fairly large gzipped text files that I have to unzip, edit and re-zip. I use Python's gzip module for unzipping and zipping, but I have found that my current implementation is far from optimal:
input_file = gzip.open(input_file_name, 'rb')
output_file = gzip.open(output_file_name, 'wb')
for line in input_file:
    # Edit line and write to output_file
This approach is unbearably slow, probably because there is a huge overhead involved in doing per-line iteration with the gzip module: I initially also ran a line-count routine where, using the gzip module, I read the file in chunks and then counted the number of newline characters in each chunk, and that was very fast!
So one of the optimizations should definitely be to read my files in chunks and then only do per line iterations once the chunks have been unzipped.
As an additional optimization, I have seen a few suggestions to unzip in a shell command via subprocess. Using this approach, the equivalent of the first line in the above could be:
from subprocess import Popen, PIPE
file_input = Popen(["zcat", fastq_filename], stdout=PIPE)
input_file = file_input.stdout
Using this approach input_file becomes a file-like object. I don't know exactly how it is different to a real file object in terms of available attributes and methods, but one difference is that you obviously cannot use seek since it is a stream rather than a file.
This does run faster, and it should, unless you run your script on a single-core machine, or so the claim goes. The latter must mean that subprocess automatically ships different threads to different cores if possible, but I am no expert there.
So now to my current problem: I would like to zip my output in a similar fashion. That is, instead of using Python's gzip module, I would like to pipe it to a subprocess and then call the shell gzip. This way I could potentially get reading, editing and writing in separate cores, which sounds wildly effective to me.
I have made a puny attempt at this, but attempting to write to output_file resulted in an empty file. Initially, I create an empty file using the touch command because Popen fails if the file does not exist:
call('touch ' + output_file_name, shell=True)
output = Popen(["gzip", output_file_name], stdin=PIPE)
output_file = output.stdin
Any help is greatly appreciated, I am using Python 2.7 by the way. Thanks.
Here is a working example of how this can be done:
#!/usr/bin/env python
from subprocess import Popen, PIPE
output = ['this', 'is', 'a', 'test']
output_file_name = 'pipe_out_test.txt.gz'
gzip_output_file = open(output_file_name, 'wb', 0)
output_stream = Popen(["gzip"], stdin=PIPE, stdout=gzip_output_file) # If gzip is supported
for line in output:
    output_stream.stdin.write(line + '\n')
output_stream.stdin.close()
output_stream.wait()
gzip_output_file.close()
If our script only wrote to console and we wanted the output zipped, a shell command equivalent of the above could be:
script_that_writes_to_console | gzip > output.txt.gz
You meant output_file = gzip_process.stdin. After that you can use output_file the way you used the gzip.open() object previously (just without seeking).
If the result file is empty then check that you call output_file.close() and gzip_process.wait() at the end of your Python script. Also, the usage of gzip may be incorrect: if gzip writes the compressed output to its stdout then pass stdout=gzip_output_file where gzip_output_file = open(output_file_name, 'wb', 0).
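Putting the pieces together for the original use case, a sketch of the full read-edit-write pipeline might look like this (Python 2.7, assuming zcat and gzip are on the PATH; input_file_name and output_file_name are placeholders):
from subprocess import Popen, PIPE

reader = Popen(["zcat", input_file_name], stdout=PIPE)
gzip_output_file = open(output_file_name, 'wb', 0)
writer = Popen(["gzip"], stdin=PIPE, stdout=gzip_output_file)

for line in reader.stdout:
    # edit the line here before writing it out
    writer.stdin.write(line)

writer.stdin.close()   # signal EOF so gzip can finish
writer.wait()
reader.wait()
gzip_output_file.close()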

using a python list as input for linux command that uses stdin as input

I am using python scripts to load data to a database bulk loader.
The input to the loader is stdin. I have been unable to get the correct syntax to call the unix-based bulk loader while passing it the contents of a python list to be loaded.
I have been reading about Popen and PIPE, but they have not been behaving as I expect.
The python list contains database records to be bulkloaded. In linux it would look similar to this:
echo "this is the string being written to the DB" | sql -c "COPY table FROM stdin"
What would be the correct way to replace the echo statement with a python list to be used with this command?
I do not have sample code for this process, as I have been experimenting with the features of Popen and PIPE with some very simple syntax and not obtaining the desired result.
Any help would be very much appreciated.
Thanks
If your data is short and simple, you could preformat the entire list and do it simply with subprocess like this:
import subprocess
data = ["list", "of", "stuff"]
proc = subprocess.Popen(["sql", "-c", "COPY table FROM stdin"], stdin=subprocess.PIPE)
proc.communicate("\n".join(data))
If the data is too big to preformat like this, then you can attempt to use the stdin pipe directly, though the subprocess module can be flaky with pipes if you also need to read from stdout/stderr.
for line in data:
    print >>proc.stdin, line
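Written out a little more completely, the streaming variant might look like this (Python 2 print syntax as above, same hypothetical sql command; closing stdin is what tells the loader the input has ended):
import subprocess

data = ["list", "of", "stuff"]
proc = subprocess.Popen(["sql", "-c", "COPY table FROM stdin"],
                        stdin=subprocess.PIPE)
for line in data:
    proc.stdin.write(line + "\n")
proc.stdin.close()   # close stdin so the loader sees end-of-input
proc.wait()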
