I am working on extracting PDFs from SEC filings. They usually come like this:
SEC Filing Example
For whatever reason, when I save the raw PDF data to a .txt file and then try to run
uudecode -o output_file.pdf input_file.txt
from Python's subprocess.call() (or any other Python function that executes command-line commands), the generated PDF files are corrupted. If I run the same command directly from the command line, there is no corruption.
Taking a closer look at the PDF file output by the Python script, it looks like the file ends prematurely. Is there some sort of output limit when executing a command-line command from Python?
Thanks!
This script worked fine for me running under Python 3.4.1 on Fedora 21 x86_64 with uudecode 4.15.2:
import subprocess
subprocess.call("uudecode -o output_file.pdf input_file.txt", shell=True)
Using the linked SEC filing (length: 173,141 B; sha1: e4f7fa2cbb3422411c2f2968d954d6bb9808b884), the decoded PDF (length: 124,557 B; sha1: 1676320e1d9923e14d19451c16688198bc93ca0d) appears correct when viewed.
There may be something else in your environment causing the problem. You may want to add additional details to your question.
Is there some sort of output limit when executing a command-line command from Python?
If by "output limit" you mean the size of the file being written by uudecode, then no. The only type of "output limit" you need to worry about when using the subprocess module is when you pass stdout=PIPE or stderr=PIPE when creating a child process. If the child process writes enough data to either of these streams, and your script does not regularly drain them, the child process will block (see the subprocess module documentation). In my test, uudecode wrote nothing to stdout or stderr.
Related
In a Python project, I need the output of an external (non-Python) command.
Let's call it identify_stuff.*
Command-line scenario
When called from command-line, this command requires a file name as an argument.
If its input is generated dynamically, we cannot pipe it into the command – this doesn't work:
cat input/* | ./identify_stuff > output.txt
cat input/* | ./identify_stuff - > output.txt
... it strictly requires a file name it can open, so one needs to create a temporary file on disk for the output of the first command, from where the second command can read the data.
However, the identify_stuff program iterates over the input lines only once; no seeking or re-reading is involved.
So in Bash we can avoid the temporary file with the <(...) construct.
This works:
./identify_stuff <(cat input/*) > output.txt
This pipes the output of the first command to a device at a path of the form /dev/fdX, which can be opened for reading just like the path of a regular file on disk.
Actual scenario: call from within Python
Now, instead of just cat input/*, the input text is created inside a Python program, which continues to run after the output of identify_stuff has been captured.
The natural choice for calling an external command is the standard library's subprocess.run().
For performance reasons, I would like to avoid creating a file on-disk.
Is there any way to do this with the subprocess tools?
The stdin and input parameters of subprocess.run won't work, because the external command ignores STDIN and specifically requires a file-name argument.
*Actually, it's this tool: https://github.com/jakelever/Ab3P/blob/master/identify_abbr.C
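A sketch of one possible approach on Linux, assuming identify_stuff accepts /dev/fd/N paths (which the Bash <(...) trick already suggests): create an anonymous pipe with os.pipe(), feed it from a thread, and pass the read end's /dev/fd path to the child:

import os
import subprocess
import threading

def run_identify_stuff(lines):
    r, w = os.pipe()
    def feed():
        with os.fdopen(w, "w") as pipe_in:  # closing w signals EOF to the child
            for line in lines:
                pipe_in.write(line + "\n")
    writer = threading.Thread(target=feed)
    writer.start()
    result = subprocess.run(
        ["./identify_stuff", "/dev/fd/%d" % r],
        pass_fds=(r,),            # keep the read end open in the child
        stdout=subprocess.PIPE,
    )
    os.close(r)
    writer.join()
    return result.stdout

Writing from a separate thread avoids a deadlock in case the pipe buffer fills before the child starts reading.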
I am attempting to write a (Bash) shell script that wraps around a third-party python script and captures all output (errors and stdout) into a log file, and also restarts the script with a new batch of data each time it completes successfully. I'm doing this on a standard Linux distribution, but hopefully this solution can be platform-independent.
So here's a simplified version of the shell script, omitting everything except the logging:
#!/bin/bash
/home/me/script.py &>> /home/me/logfile
The problem is the third-party python script's output is mostly on a single line, which is being refreshed periodically (~every 90 seconds) by use of a carriage return ("\r"). Here's an example of the type of output I mean:
#!/usr/bin/env python3
import time

tracker = 1
print("This line is captured in the logfile because it ends with a newline")
while tracker < 5:
    print(" This output isn't captured in the log file. Tracker = " + str(tracker), end="\r")
    tracker += 1
    time.sleep(1)
print("This line does get captured. Script is done. ")
How can I write a simple shell script to capture the output each time it is refreshed, or at least to periodically capture the current output as it would appear on the screen if I were running the script in the terminal?
Obviously I could try to modify the python script to change its output behavior, but the actual script I'm using is very complex and I think beyond my abilities to do that easily.
The program should have disabled this behavior when output is not a tty.
The output is already captured completely, it's just that you see all the updates at once when you cat the file. Open it in a text editor and see for yourself.
To make the file easier to work with, you can replace the carriage returns with line feeds on the way into the log:
/home/me/script.py 2>&1 | tr '\r' '\n' >> /home/me/logfile
If the process normally produces output right away, but not with this command, you can disable Python's output buffering.
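For reference, a sketch of the tty check mentioned above, as the script's author could have written it:

import sys

# Use "\r" for in-place updates only when a human is watching;
# fall back to real newlines when output is redirected to a file.
end = "\r" if sys.stdout.isatty() else "\n"
print("Tracker = 3", end=end, flush=True)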
I'm looking for a way to monitor a file that is written to by a program on Linux. I found the tail -F command here, and less +FG was also recommended. I tested it by running tail -F file in one terminal, and a simple Python script:
import time
for i in range(20):
    print i
    time.sleep(0.5)
in another. I redirected the output to the file:
python script.py >> file
I expected tail to track the file contents and update the display at fixed intervals; instead, it only shows what was written to the file after the command terminates.
The same thing happens with less +FG, and also if I watch the output of cat. I've also tried the usual truncating redirect > instead of >>. Then tail reports that the file was truncated, but it still does not track it in real time.
Any idea why this doesn't work? (It's suggested here that it might be due to buffered writes, but since my script runs over 10 seconds, I suspect this might not be the cause)
Edit: In case it matters, I'm running Linux Mint 18.1
Python's standard output is buffered. If you only see all the output once you close the script or the script finishes, that's definitely a buffering issue.
You can use this instead:
import time
import sys

for i in range(20):
    sys.stdout.write('%d\n' % i)
    sys.stdout.flush()
    time.sleep(0.5)
I've tested it, and it prints values in real time. To overcome the buffering issue, I call .flush() after each .write() to force the buffer to be flushed.
Additional options from the comments:
Use the original print statement with sys.stdout.flush() afterwards (see the Python 3 sketch after this list)
Run the python script with python -u for unbuffered binary stdout and stderr
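In Python 3, print() itself accepts a flush argument, which folds the first option's two calls into one:

import time

for i in range(20):
    print(i, flush=True)  # flush=True pushes each line out immediately
    time.sleep(0.5)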
Regarding jon1467's answer (sorry, I can't comment on your answer), your understanding of redirection is wrong.
Try this :
dd if=/dev/urandom > test.txt
while looking at the file size with :
ls -l test.txt
You'll see the file grow while dd is running.
Vinny's answer is correct: Python's standard output is buffered.
The more common way to avoid the "buffering effect" you noticed is to flush stdout, as Vinny showed you.
You could also use the -u option to disable buffering for the whole Python process, or you could reopen standard output with a buffer size of 0, as below (in Python 2 at least):
sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0)
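In Python 3.7 and later, the closest equivalent (assuming sys.stdout is the usual text stream) is:

import sys

sys.stdout.reconfigure(line_buffering=True)  # flush automatically on every newline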
So I've got a Python script that, at its core, makes .7z archives of selected directories for the purpose of backing up data. For simplicity's sake I've simply invoked 7-Zip through the Windows command line, like so:
def runcompressor(target, contents):
    print("Compressing {}...".format(contents))
    archive = currentmodule
    archive += "{}\\{}.7z".format(target, target)
    os.system('7z u "{}" "{}" -mx=9 -mmt=on -ssw -up1q0r2x2y2z1w2'.format(archive, contents))
    print("Done!")
This creates a new archive if one doesn't exist and updates the old one if it does. But if something goes wrong the archive will be corrupted, and if this command hits an existing, corrupted archive, it just gives up. Now, 7-Zip has a command for testing the integrity of an archive, but the documentation says nothing about its output, and then there is the trouble of capturing that output in Python.
Is there a way I can test the archives first, to determine if they've been corrupted?
The 7z executable returns a value of two or greater if it encounters a problem. In a batch script, you would generally use errorlevel to detect this. Unfortunately, os.system() under Windows gives the return value of the command interpreter used to run your program, not the exit value of your program itself.
If you want the latter, you're probably going to have to get your hands a little dirtier with the subprocess module, rather than using the os.system() call.
If you have Python 3.5 (or better), this is as simple as:
import subprocess as sp
x = sp.run(['7z', 'a', 'junk.7z', 'junk.txt'], stdout=sp.PIPE, stderr=sp.STDOUT)
print(x.returncode)
The junk.txt in my case is a real file, but junk.7z is just a copy of one of my text files, hence an invalid archive. The return code from the program is 2, so it's easy to detect that something went wrong.
If you print out x rather than just x.returncode, you'll see something like (reformatted and with \r\n sequences removed for readability):
CompletedProcess(
args=['7z', 'a', 'junk.7z', 'junk.txt'],
returncode=2,
stdout=b'
7-Zip [64] 9.20 Copyright (c) 1999-2010 Igor Pavlov 2010-11-18
Error: junk.7z is not supported archive
System error:
Incorrect function.
'
)
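Building on that, here is a sketch of an integrity check using 7-Zip's t (test) command; archive_ok and backup.7z are made-up names, and 7z is assumed to be on the PATH:

import subprocess as sp

def archive_ok(path):
    # 7z exits with 0 on success and 2 or greater on errors.
    result = sp.run(['7z', 't', path], stdout=sp.PIPE, stderr=sp.STDOUT)
    return result.returncode == 0

if not archive_ok('backup.7z'):
    print('Archive failed the integrity test; rebuild it from scratch.')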
(On OS X 10.10.1.) I am trying to use a paired-end merger (Casper) within a Python script. I'm using os.system (I don't want to use the subprocess or pexpect modules). In my script, here is the line that doesn't work:
os.system("casper %s %s -o %s"%(filein[0],filein[1],fileout))
#filein[0]: input file 1
#filein[1]: input file 2
#fileout: output prefix (default==casper)
Once my script is launched, only the first two string parameters of this command are interpreted, but not the third one, resulting in an output file with the default prefix name. Since my function iterates through a lot of fastq files, they are all merged into a single "casper.fastq" file.
I tried messing with the part of the command that doesn't work (right after -o), putting in a meaningless string, and still it executed with no error and produced the default output. Here is the "messed up" line:
os.system("casper %s %s -ldkfnlqdskgfno %s"%(filein[0],filein[1],fileout))
Could anybody help in understanding what the heck is going on?
Print the command before executing it, to check whether it was assembled correctly (e.g. file names may need to be quoted).
Then execute the printed command directly in a shell to see whether it is misinterpreted.
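A sketch of that advice applied to the line in question; the input values are hypothetical stand-ins, and shlex.quote guards against file names that need quoting:

import os
import shlex

filein = ["reads_R1.fastq", "reads_R2.fastq"]  # hypothetical input files
fileout = "sample42"                           # hypothetical output prefix

cmd = "casper %s %s -o %s" % (
    shlex.quote(filein[0]),
    shlex.quote(filein[1]),
    shlex.quote(fileout),
)
print(cmd)      # inspect exactly what the shell will run
os.system(cmd)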