Avoid temp file in call to subprocess.run() - python

In a Python project, I need the output of an external (non-Python) command.
Let's call it identify_stuff.*
Command-line scenario
When called from the command line, this command requires a file name as an argument.
If its input is generated dynamically, we cannot pipe it into the command – this doesn't work:
cat input/* | ./identify_stuff > output.txt
cat input/* | ./identify_stuff - > output.txt
... it strictly requires a file name it can open, so one would have to write the output of the first command to a temporary file on disk, from which the second command can then read the data.
However, the identify_stuff program iterates over the input lines only once; no seeking or re-reading is involved.
So in Bash we can avoid the temporary file with the <(...) construct.
This works:
./identify_stuff <(cat input/*) > output.txt
This connects the output of the first command to a file descriptor exposed under a path like /dev/fd/N, which can be opened for reading just like the path of a regular file on disk.
Actual scenario: call from within Python
Now, instead of just cat input/*, the input text is created inside a Python program, which continues to run after the output of identify_stuff has been captured.
The natural choice for calling an external command is the standard library's subprocess.run().
For performance reasons, I would like to avoid creating a file on disk.
Is there any way to do this with the subprocess tools?
The stdin and input parameters of subprocess.run won't work, because the external command ignores STDIN and specifically requires a file-name argument.
*Actually, it's this tool: https://github.com/jakelever/Ab3P/blob/master/identify_abbr.C
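
A minimal sketch of one possible direction, assuming a Unix-like system where /dev/fd/N paths are available (not verified against Ab3P itself): hand the child process a pipe-backed /dev/fd/N path, which is essentially what bash's <(...) does. The redirection to output.txt stands in for however the result is captured:

import os
import subprocess

# Sketch only: "identify_stuff" and the generated lines are placeholders
# from the question; /dev/fd/N paths require a Unix-like OS.
def run_identify(lines):
    read_fd, write_fd = os.pipe()
    with open("output.txt", "wb") as out:
        proc = subprocess.Popen(
            ["./identify_stuff", f"/dev/fd/{read_fd}"],
            pass_fds=(read_fd,),   # let the child inherit the pipe's read end
            stdout=out,
        )
        os.close(read_fd)          # the parent only writes
        with os.fdopen(write_fd, "w") as writer:
            for line in lines:     # the dynamically generated input
                writer.write(line)
    return proc.wait()

Closing the write end (when the inner with block exits) is what signals end-of-input to the child.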

Related

feed uncompressed file to command line argument

Let's say I have a gzipped file, but my script only takes in an uncompressed file.
Without modifying the script to take in a compressed file, could I uncompress the file on the fly with bash?
For example:
python ../scripts/myscript.py --in (gunzip compressed_file.txt.gz)
You can use a process substitution, as long as the Python script doesn't try to seek backwards in the file:
python ../scripts/myscript.py --in <(gunzip compressed_file.txt.gz)
Python receives a file name as an argument; the name just doesn't refer to a simple file on disk. It can only be opened in read-only mode, and attempts to use the seek method will fail.
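For illustration (this stand-in for myscript.py is hypothetical, not taken from the question), a Python 3 reader that only moves forward through the file works fine with such a path, while seeking does not:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--in", dest="infile")
args = parser.parse_args()

# A /dev/fd/N path can be opened read-only and read sequentially...
with open(args.infile) as f:
    for line in f:
        print(line, end="")
    # ...but f.seek(0) here would raise OSError, because the "file"
    # is really a pipe.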
If you were using zsh instead of bash, you could use
python ../scripts/myscript.py --in =(gunzip compressed_file.txt.gz)
and Python would receive the name of an actual (temporary) file that could be used like any other file. Said file would be deleted by the shell after python exits, though.

Python to read stdin from other source continously

Is it possible to have Python read from stdin continuously from another source, such as a file? Basically I'm trying to have my script echo its stdin, and I'd like to use a file or another external source to interact with it (while it remains open).
An example might be (input.py):
#!/usr/bin/python
import sys
line = sys.stdin.readline()
while line:
    print line,
    line = sys.stdin.readline()
Executing this directly, I can continuously enter text and it echoes back while the script remains alive. If I use an external source, though, such as a file or input from bash, the script exits immediately after receiving the input:
$ echo "hello" | python input.py
hello
$
Ultimately what I'd like to do is:
$ tail -f file | python input.py
Then, whenever the file is updated, input.py should echo back anything that is added to the file, while remaining open. Maybe I'm approaching this the wrong way or I'm simply clueless, but is there a way to do it?
Use the -F option to tail to make it reopen the file if it gets renamed or deleted and a new file is created with the original name. Some editors write the file this way, and logfile rotation scripts also usually work this way (they rename the original file to filename.1, and create a new log file).
$ tail -F file | python input.py

Python Uudecode Call Corruption

I am working on extracting PDFs from SEC filings. They usually come like this:
SEC Filing Example
For whatever reason, when I save the raw PDF to a .txt file and then try to run
uudecode -o output_file.pdf input_file.txt
from Python's subprocess.call() function, or any other Python function that allows commands to be executed from the command line, the PDF files that are generated are corrupted. If I run the same command directly from the command line, there is no corruption.
When taking a closer look at the PDF file being output from the python script, it looks like the file ends prematurely. Is there some sort of output limit when executing a command line command from python?
Thanks!
This script worked fine for me running under Python 3.4.1 on Fedora 21 x86_64 with uudecode 4.15.2:
import subprocess
subprocess.call("uudecode -o output_file.pdf input_file.txt", shell=True)
Using the linked SEC filing (length: 173,141 B; sha1: e4f7fa2cbb3422411c2f2968d954d6bb9808b884), the decoded PDF (length: 124,557 B; sha1: 1676320e1d9923e14d19451c16688198bc93ca0d) appears correct when viewed.
There may be something else in your environment causing the problem. You may want to add additional details to your question.
Is there some sort of output limit when executing a command line command from python?
If by "output limit" you mean the size of the file being written by uudecode, then no. The only type of "output limit" you need to worry about when using the subprocess module is when you pass stdout=PIPE or stderr=PIPE when creating a child process. If the child process writes enough data to either of these streams, and your script does not regularly drain them, the child process will block (see the subprocess module documentation). In my test, uudecode wrote nothing to stdout or stderr.

Why doesn't my bash script read lines from a file when called from a python script?

I am trying to write a small program in bash, and part of it needs to get some values from a txt file, where the different values are separated by newlines, and then either add each line to a variable or add each line to one array.
So far I have tried this:
FILE=$"transfer_config.csv"
while read line
do
MYARRAY[$index]="$line"
index=$(($index+1))
done < $FILE
echo ${MYARRAY[0]}
This just produces a blank line though, and not what was on the first line of the config file.
I don't get any errors, which is why I am not sure why this is happening.
The bash script is called through a Python script using os.system("$HOME/bin/mcserver_config/server_transfer/down/createRemoteFolder"), but if I simply call it after the Python program has created the file which the bash script reads, it works.
I am almost 100% sure it is not an issue with the directories, because pwd at the top of the bash script shows it in the correct directory, and the python program is also creating the data file in the correct place.
Any help is much appreciated.
EDIT:
I also tried subprocess.call("path_to_script", shell=True) to see if it would make a difference; I knew it was unlikely, and it didn't.
I suspect that when calling the bash script from Python, having just created the file, you are not really finished with that file: you should either explicitly close the file or use a with construct.
Otherwise, the written data may still be sitting in a buffer (in the file object, in the OS, or wherever). Only closing (or at least flushing) the file ensures the data is actually in the file.
BTW, instead of os.system, you should use the subprocess module...
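A minimal sketch of the suggested fix (the config contents are made up; the script path is the one from the question):

import os
import subprocess

# The with block guarantees the data is flushed and the file is closed
# before the bash script tries to read it.
with open("transfer_config.csv", "w") as f:
    f.write("value1\nvalue2\n")    # example contents, made up

script = os.path.expandvars("$HOME/bin/mcserver_config/server_transfer/down/createRemoteFolder")
subprocess.call([script])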

use two pipelines for python input file argument and stdin streaming

Is there a one-liner approach to running the following python script in linux bash, without saving any temporary file (except /dev/std* ) ?
My Python script test.py takes a filename as an argument, but also reads sys.stdin as a streaming input.
#test.py
#!/usr/bin/python
import sys

fn = sys.argv[1]
checkofflist = []
with open(fn, 'r') as f:
    for line in f.readlines():
        checkofflist.append(line)

for line in sys.stdin:
    if line in checkofflist:
        pass  # do something to line
I would like to do something like
hadoop fs -cat inputfile.txt > /dev/stdout | cat streamingfile.txt | python test.py /dev/stdin
But of course this doesn't work, since the middle cat corrupts the intended /dev/stdin content. Being able to do this would be nice, since then I wouldn't need to save HDFS files locally every time I need to work with them.
I think what you're looking for is:
python test.py <( hadoop fs -cat inputfile.txt ) <streamingfile.txt
In bash, <( ... ) is Process Substitution. The command inside the parentheses is run with its output connected to a fifo or equivalent, and the name of the fifo (or /dev/fd/n if bash is able to use an unnamed pipe) is substituted as an argument. The tool sees a filename, which it can just open and use normally. (>(...) is also available, with input connected to a fifo, in case you want a named streaming output.)
Without relying on bash process substitution, you might also try
hadoop fs -cat inputfile.txt | python test.py streamingfile.txt
This provides streamingfile.txt as a command-line argument for test.py to use as a file name to open, as well as providing the contents of inputfile.txt on standard input.
