Automating PDF generation - python

What would be a solid tool to use for generating PDF reports? Particularly, we are interested in creating interactive PDFs that have video, like the example found here.
Right now we are using Python and reportlab to generate PDFs, but have not explored the library completely (mostly because the license pricing is a little prohibitive)
We have been looking at the Adobe's SDK and iText libraries but it's hard to say what the capabilities of either are.
The ability to generate a document from a template PDF would be a plus.
Any pointers or comments will be appreciated.
Thanks,

Recently, I needed to create PDF reports for a Django application; a ReportLab license was available, but I ended up choosing LaTeX. The benefit of this approach is that we could use Django templates to generate the LaTeX source, and not get over encumbered writing lots of code for the many reports we needed to create. Plus, we could take advantage of the relatively much more concise LaTeX syntax (which does have it's many quirks and is not suitable for every purpose).
This snippet provides a general overview of the approach. I found it necessary to make some changes, which I have provided at the end of this question. The main addition is detection for Rerun LaTeX messages, which indicates an additional pass is required. Usage is as simple as:
def my_view(request):
pdf_stream = process_latex(
'latex_template.tex',
context=RequestContext(request, {'context_obj': context_obj})
)
return HttpResponse(pdf_stream, content_type='application/pdf')
It is possible to embed videos in LaTeX generated PDFs, however I do not have any experience with it. Here is a top Google result.
This solution does require spawning a new process (pdflatex), so if you want a pure Python solution keep looking.
import os
from subprocess import Popen, PIPE
from tempfile import NamedTemporaryFile
from django.template import loader, Context
class LaTeXException(Exception):
pass
def process_latex(template, context={}, type='pdf', outfile=None):
"""
Processes a template as a LaTeX source file.
Output is either being returned or stored in outfile.
At the moment only pdf output is supported.
"""
t = loader.get_template(template)
c = Context(context)
r = t.render(c)
tex = NamedTemporaryFile()
tex.write(r)
tex.flush()
base = tex.name
names = dict((x, '%s.%s' % (base, x)) for x in (
'log', 'aux', 'pdf', 'dvi', 'png'))
output = names[type]
stdout = None
if type == 'pdf' or type == 'dvi':
stdout = pdflatex(base, type)
elif type == 'png':
stdout = pdflatex(base, 'dvi')
out, err = Popen(
['dvipng', '-bg', '-transparent', names['dvi'], '-o', names['png']],
cwd=os.path.dirname(base), stdout=PIPE, stderr=PIPE
).communicate()
os.remove(names['log'])
os.remove(names['aux'])
# pdflatex appears to ALWAYS return 1, never returning 0 on success, at
# least on the version installed from the Ubuntu apt repository.
# so instead of relying on the return code to determine if it failed,
# check if it successfully created the pdf on disk.
if not os.path.exists(output):
details = '*** pdflatex output: ***\n%s\n*** LaTeX source: ***\n%s' % (
stdout, r)
raise LaTeXException(details)
if not outfile:
o = file(output).read()
os.remove(output)
return o
else:
os.rename(output, outfile)
def pdflatex(file, type='pdf'):
out, err = Popen(
['pdflatex', '-interaction=nonstopmode', '-output-format', type, file],
cwd=os.path.dirname(file), stdout=PIPE, stderr=PIPE
).communicate()
# If the output tells us to rerun, do it by recursing over ourself.
if 'Rerun LaTeX.' in out:
return pdflatex(file, type)
else:
return out

I suggest to use https://github.com/mreiferson/py-wkhtmltox to render HTML to PDF.
And use any tool you choose to render reports as HTML. I like http://www.makotemplates.org/

Related

How can I retrieve a file's details in Windows using the Standard Library for Python 2

I need to read the details for a file in Windows so that I can interrogate the file's 'File version' as displayed in the Details tab of the File Properties window.
I haven't found anything in the standard library that makes this very easy to accomplish but figured if I could find the right windows function, I could probably accomplish it using ctypes.
Does anyone have any exemplary code or can they point me to a Windows function that would let me read this info. I took a look a GetFileAttributes already, but that wasn't quite right as far as I could tell.
Use the win32 api Version Information functions from ctypes. The api is a little fiddly to use, but I also want this so have thrown together a quick script as an example.
usage: version_info.py [-h] [--lang LANG] [--codepage CODEPAGE] path
Can also use as a module, see the VersionInfo class. Checked with Python 2.7 and 3.6 against a few files.
import array
from ctypes import *
def get_file_info(filename, info):
"""
Extract information from a file.
"""
# Get size needed for buffer (0 if no info)
size = windll.version.GetFileVersionInfoSizeA(filename, None)
# If no info in file -> empty string
if not size:
return ''
# Create buffer
res = create_string_buffer(size)
# Load file informations into buffer res
windll.version.GetFileVersionInfoA(filename, None, size, res)
r = c_uint()
l = c_uint()
# Look for codepages
windll.version.VerQueryValueA(res, '\\VarFileInfo\\Translation',
byref(r), byref(l))
# If no codepage -> empty string
if not l.value:
return ''
# Take the first codepage (what else ?)
codepages = array.array('H', string_at(r.value, l.value))
codepage = tuple(codepages[:2].tolist())
# Extract information
windll.version.VerQueryValueA(res, ('\\StringFileInfo\\%04x%04x\\'
+ info) % codepage, byref(r), byref(l))
return string_at(r.value, l.value)

How to unlock a "secured" (read-protected) PDF in Python?

In Python I'm using pdfminer to read the text from a pdf with the code below this message. I now get an error message saying:
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages
raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)
PDFTextExtractionNotAllowed: Text extraction is not allowed: <cStringIO.StringO object at 0x7f79137a1
ab0>
When I open this pdf with Acrobat Pro it turns out it is secured (or "read protected"). From this link however, I read that there's a multitude of services which can disable this read-protection easily (for example pdfunlock.com. When diving into the source of pdfminer, I see that the error above is generated on these lines.
if check_extractable and not doc.is_extractable:
raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)
Since there's a multitude of services which can disable this read-protection within a second, I presume it is really easy to do. It seems that .is_extractable is a simple attribute of the doc, but I don't think it is as simple as changing .is_extractable to True..
Does anybody know how I can disable the read protection on a pdf using Python? All tips are welcome!
================================================
Below you will find the code with which I currently extract the text from non-read protected.
def getTextFromPDF(rawFile):
resourceManager = PDFResourceManager(caching=True)
outfp = StringIO()
device = TextConverter(resourceManager, outfp, codec='utf-8', laparams=LAParams(), imagewriter=None)
interpreter = PDFPageInterpreter(resourceManager, device)
fileData = StringIO()
fileData.write(rawFile)
for page in PDFPage.get_pages(fileData, set(), maxpages=0, caching=True, check_extractable=True):
interpreter.process_page(page)
fileData.close()
device.close()
result = outfp.getvalue()
outfp.close()
return result
I had some issues trying to get qpdf to behave in my program. I found a useful library, pikepdf, that is based on qpdf and automatically converts pdfs to be extractable.
The code to use this is pretty straightforward:
import pikepdf
pdf = pikepdf.open('unextractable.pdf')
pdf.save('extractable.pdf')
As far as I know, in most cases the full content of the PDF is actually encrypted, using the password as the encryption key, and so simply setting .is_extractable to True isn't going to help you.
Per this thread:
Does a library exist to remove passwords from PDFs programmatically?
I would recommend removing the read-protection with a command-line tool such as qpdf (easily installable, e.g. on Ubuntu use apt-get install qpdf if you don't have it already):
qpdf --password=PASSWORD --decrypt SECURED.pdf UNSECURED.pdf
Then open the unlocked file with pdfminer and do your stuff.
For a pure-Python solution, you can try using PyPDF2 and its .decrypt() method, but it doesn't work with all types of encryption, so really, you're better off just using qpdf - see:
https://github.com/mstamy2/PyPDF2/issues/53
I used below code using pikepdf and able to overwrite.
import pikepdf
pdf = pikepdf.open('filepath', allow_overwriting_input=True)
pdf.save('filepath')
In my case there was no password, but simply setting check_extractable=False circumvented the PDFTextExtractionNotAllowed exception for a problematic file (that opened fine in other viewers).
Full disclosure, I am one of the maintainers of pdfminer.six. It is a community-maintained version of pdfminer for python 3.
This issue was fixed in 2020 by disabling the check_extractable by default. It now shows a warning instead of raising an error.
Similar question and answer here.
The 'check_extractable=True' argument is by design.
Some PDFs explicitly disallow to extract text, and PDFMiner follows the directive. You can override it (giving check_extractable=False), but do it at your own risk.
If you want to unlock all pdf files in a folder without renaming them, you may use this code:
import glob, os, pikepdf
p = os.getcwd()
for file in glob.glob('*.pdf'):
file_path = os.path.join(p, file).replace('\\','/')
init_pdf = pikepdf.open(file_path)
new_pdf = pikepdf.new()
new_pdf.pages.extend(init_pdf.pages)
new_pdf.save(str(file))
In pikepdf library it is impossible to overwrite the existing file by saving it with the same name. In contrast, you would like to copy the pages to the newly created empty pdf file, and save it.
I too faced the same problem of parsing the secured pdf but it has got resolved using pikepdf library. I tried this library on my jupyter notebbok and on windows os but it gave errors but it worked smoothly on Ubuntu
If you've forgotten the password to your PDF, below is a generic script which tries a LOT of password combinations on the same PDF. It uses pikepdf, but you can update the function check_password to use something else.
Usage example:
I used this when I had forgotten a password on a bank PDF. I knew that my bank always encrypts these kind of PDFs with the same password-structure:
Total length = 8
First 4 characters = an uppercase letter.
Last 4 characters = a number.
I call script as follows:
check_passwords(
pdf_file_path='/Users/my_name/Downloads/XXXXXXXX.pdf',
combination=[
ALPHABET_UPPERCASE,
ALPHABET_UPPERCASE,
ALPHABET_UPPERCASE,
ALPHABET_UPPERCASE,
NUMBER,
NUMBER,
NUMBER,
NUMBER,
]
)
Password-checking script:
(Requires Python3.8, with libraries numpy and pikepdf)
from typing import *
from itertools import product
import time, pikepdf, math, numpy as np
from pikepdf import PasswordError
ALPHABET_UPPERCASE: Sequence[str] = tuple('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
ALPHABET_LOWERCASE: Sequence[str] = tuple('abcdefghijklmnopqrstuvwxyz')
NUMBER: Sequence[str] = tuple('0123456789')
def as_list(l):
if isinstance(l, (list, tuple, set, np.ndarray)):
l = list(l)
else:
l = [l]
return l
def human_readable_numbers(n, decimals: int = 0):
n = round(n)
if n < 1000:
return str(n)
names = ['', 'thousand', 'million', 'billion', 'trillion', 'quadrillion']
n = float(n)
idx = max(0,min(len(names)-1,
int(math.floor(0 if n == 0 else math.log10(abs(n))/3))))
return f'{n/10**(3*idx):.{decimals}f} {names[idx]}'
def check_password(pdf_file_path: str, password: str) -> bool:
## You can modify this function to use something other than pike pdf.
## This function should throw return True on success, and False on password-failure.
try:
pikepdf.open(pdf_file_path, password=password)
return True
except PasswordError:
return False
def check_passwords(pdf_file_path, combination, log_freq: int = int(1e4)):
combination = [tuple(as_list(c)) for c in combination]
print(f'Trying all combinations:')
for i, c in enumerate(combination):
print(f"{i}) {c}")
num_passwords: int = np.product([len(x) for x in combination])
passwords = product(*combination)
success: bool | str = False
count: int = 0
start: float = time.perf_counter()
for password in passwords:
password = ''.join(password)
if check_password(pdf_file_path, password=password):
success = password
print(f'SUCCESS with password "{password}"')
break
count += 1
if count % int(log_freq) == 0:
now = time.perf_counter()
print(f'Tried {human_readable_numbers(count)} ({100*count/num_passwords:.1f}%) of {human_readable_numbers(num_passwords)} passwords in {(now-start):.3f} seconds ({human_readable_numbers(count/(now-start))} passwords/sec). Latest password tried: "{password}"')
end: float = time.perf_counter()
msg: str = f'Tried {count} passwords in {1000*(end-start):.3f}ms ({count/(end-start):.3f} passwords/sec). '
msg += f"Correct password: {success}" if success is not False else f"All {num_passwords} passwords failed."
print(msg)
Comments
Obviously, don't use this to break into PDFs which are not your own. I hold no responsibility over how you use this script or any consequences of using it.
A lot of optimizations can be made.
Right now check_password uses pikepdf, which loads the file from disk for every "check". This is really slow, ideally it should run against an in-memory copy. I haven't figured out a way to do that, though.
You can probably speed this up a LOT by calling qpdf directly using C++, which is much better than Python for this kind of stuff.
I would avoid multi-processing here, since we're calling the same qpdf binary (which is normally a system-wide installation), which might become the bottleneck.

How to write OS X Finder Comments from python?

I'm working on a python script that creates numerous images files based on a variety of inputs in OS X Yosemite. I am trying to write the inputs used to create each file as 'Finder comments' as each file is created so that IF the the output is visually interesting I can look at the specific input values that generated the file. I've verified that this can be done easily with apple script.
tell application "Finder" to set comment of (POSIX file "/Users/mgarito/Desktop/Random_Pixel_Color/2015-01-03_14.04.21.png" as alias) to {Val1, Val2, Val3} as Unicode text
Afterward, upon selecting the file and showing its info (cmd+i) the Finder comments clearly display the expected text 'Val1, Val2, Val2'.
This is further confirmed by running mdls [File/Path/Name] before and after the applescript is used which clearly shows the expected text has been properly added.
The problem is I can't figure out how to incorporate this into my python script to save myself.
Im under the impression the solution should* be something to the effect of:
VarList = [Var1, Var2, Var3]
Fiele = [File/Path/Name]
file.os.system.add(kMDItemFinderComment, VarList)
As a side note I've also look at xattr -w [Attribute_Name] [Attribute_Value] [File/Path/Name] but found that though this will store the attribute, it is not stored in the desired location. Instead it ends up in an affiliated pList which is not what I'm after.
Here is my way to do that.
First you need to install applescript package using pip install applescript command.
Here is a function to add comments to a file:
def set_comment(file_path, comment_text):
import applescript
applescript.tell.app("Finder", f'set comment of (POSIX file "{file_path}" as alias) to "{comment_text}" as Unicode text')
and then I'm just using it like this:
set_comment('/Users/UserAccountName/Pictures/IMG_6860.MOV', 'my comment')
After more digging, I was able to locate a python applescript bundle: https://pypi.python.org/pypi/py-applescript
This got me to a workable answer, though I'd still prefer to do this natively in python if anyone has a better option?
import applescript
NewFile = '[File/Path/Name]' <br>
Comment = "Almost there.."
AddComment = applescript.AppleScript('''
on run {arg1, arg2}
tell application "Finder" to set comment of (POSIX file arg1 as alias) to arg2 as Unicode text
return
end run
''')
print(AddComment.run(NewFile, Comment))
print("Done")
This is the function to get comment of a file.
def get_comment(file_path):
import applescript
return applescript.tell.app("Finder", f'get comment of (POSIX file "{file_path}" as alias)').out
print(get_comment('Your Path'))
Another approach is to use appscript, a high-level Apple event bridge that is sadly no longer officially supported but still works (and saw an updated release in Jan. 2021). Here is an example of reading and setting the comment on a file:
import appscript
import mactypes
# Get a handle on the Finder.
finder = appscript.app('Finder')
# Tell Finder to select the file.
file = finder.items[mactypes.Alias("/path/to/a/file")]
# Print the current comment
comment = file.comment()
print("Current comment: " + comment)
# Set a new comment.
file.comment.set("New comment")
# Print the current comment again to verify.
comment = file.comment()
print("Current comment: " + comment)
Despite that the author of appscript recommends against using it in new projects, I used it recently to create a command-line utility called Urial for the specialized purpose of writing and updating URIs in Finder comments. Perhaps its code can serve as an an additional example of using appscript to manipulate Finder comments.

An efficient way to convert document to pdf format

I have been trying to find the efficient way to convert document e.g. doc, docx, ppt, pptx to pdf. So far i have tried docsplit and oowriter, but both took > 10 seconds to complete the job on pptx file having size 1.7MB. Can any one suggest me a better way or suggestions to improve my approach?
What i have tried:
from subprocess import Popen, PIPE
import time
def convert(src, dst):
d = {'src': src, 'dst': dst}
commands = [
'/usr/bin/docsplit pdf --output %(dst)s %(src)s' % d,
'oowriter --headless -convert-to pdf:writer_pdf_Export %(dst)s %(src)s' % d,
]
for i in range(len(commands)):
command = commands[i]
st = time.time()
process = Popen(command, stdout=PIPE, stderr=PIPE, shell=True) # I am aware of consequences of using `shell=True`
out, err = process.communicate()
errcode = process.returncode
if errcode != 0:
raise Exception(err)
en = time.time() - st
print 'Command %s: Completed in %s seconds' % (str(i+1), str(round(en, 2)))
if __name__ == '__main__':
src = '/path/to/source/file/'
dst = '/path/to/destination/folder/'
convert(src, dst)
Output:
Command 1: Completed in 11.91 seconds
Command 2: Completed in 11.55 seconds
Environment:
Linux - Ubuntu 12.04
Python 2.7.3
More tools result:
jodconverter took 11.32 seconds
Try calling unoconv from your Python code, it took 8 seconds on my local machine, I don't know if it's fast enough for you:
time unoconv 15.\ Text-Files.pptx
real 0m8.604s
Pandoc is a wonderful tool capable of doing what you'd like quickly. Since you're using Popen to effectively shell out the command for the tool, it doesn't matter what language the tool is written in (Pandoc is written in Haskell).
Unfortunately I don't have the time to do a full benchmark, but you may want to check out xtopdf, my Python toolkit for PDF creation. It doesn't do the full range of conversions you want, and some of the conversions have limitations, but it may be of use. xtopdf links:
Online presentation about xtopdf - a good summary of what it is, what it does, platforms, features, users, uses etc.: http://slid.es/vasudevram/xtopdf
xtopdf on Bitbucket: https://bitbucket.org/vasudevram/xtopdf
Many blog posts showing how to use xtopdf for various purpose, including many that show how to use it to convert different input formats to PDF: http://jugad2.blogspot.com/search/label/xtopdf
HTH,
Vasudev Ram
For doc and docx (but not ppt/pptx), you could try our independent (but commercial) high fidelity rendering engine online at OnlineDemo/docx_to_pdf
By "high fidelity", I mean it is designed from the ground up to have the same line and paragraph breaks, tab stops etc etc as Microsoft Word.

Make Sphinx generate RST class documentation from pydoc

I'm currently migrating all existing (incomplete) documentation to Sphinx.
The problem is that the documentation uses Python docstrings (the module is written in C, but it probably does not matter) and the class documentation must be converted into a form usable for Sphinx.
There is sphinx.ext.autodoc, but it automatically puts current docstrings to the document. I want to generate a source file in (RST) based on current docstrings, which I could then edit and improve manually.
How would you transform docstrings into RST for Sphinx?
The autodoc does generate RST only there is no official way to get it out of it. The easiest hack to get it was by changing sphinx.ext.autodoc.Documenter.add_line method to emit me the line it gets.
As all I want is one time migration, output to stdout is good enough for me:
def add_line(self, line, source, *lineno):
"""Append one line of generated reST to the output."""
print(self.indent + line)
self.directive.result.append(self.indent + line, source, *lineno)
Now autodoc prints generated RST on stdout while running and you can simply redirect or copy it elsewhere.
monkey patching autodoc so it works without needing to edit anything:
import sphinx.ext.autodoc
rst = []
def add_line(self, line, source, *lineno):
"""Append one line of generated reST to the output."""
rst.append(line)
self.directive.result.append(self.indent + line, source, *lineno)
sphinx.ext.autodoc.Documenter.add_line = add_line
try:
sphinx.main(['sphinx-build', '-b', 'html', '-d', '_build/doctrees', '.', '_build/html'])
except SystemExit:
with file('doc.rst', 'w') as f:
for line in rst:
print >>f, line
As far as I know there are no automated tools to do this. My approach would therefore be to write a small script that reads relevant modules (based on sphinc.ext.autodoc) and throws doc strings into a file (formatted appropriately).

Categories