How to include a page from an existing PDF in a generated PDF document in Python

I am using the ReportLab toolkit in Python to generate reports in PDF format. I want to include some predefined parts of documents that have already been published as PDF in the generated PDF file. Is this possible in ReportLab or another Python library, and if so, how?
I know I can use other tools such as PDF Toolkit (pdftk), but I am looking for a Python-based solution.

I'm currently using PyPDF to read, write, and combine existing PDFs and ReportLab to generate new content. Using the two packages together seemed to work better than any single package I was able to find.

If you want to place existing PDF pages in your ReportLab documents, I recommend pdfrw. Unlike PageCatcher, it is free.
I've used it for several projects where I needed to add barcodes etc. to existing documents, and it works very well. There are a couple of examples on the project page of how to use it with ReportLab.
A couple of things to note though:
If the source PDF contains errors (due to the originating program following the PDF spec imperfectly for example), pdfrw may fail even though something like Adobe Reader has no apparent problems reading the PDF. pdfrw is currently not very fault tolerant.
Also, pdfrw works by being completely agnostic to the actual content of the PDF page you are placing. So, for example, you wouldn't be able to use pdfrw to inspect a page to see if it contains a certain string of text in the lower right-hand corner. However, if you don't need to do anything like that, you should be fine.

ReportLab also has a commercial add-on for this, PageCatcher.

Related

HTML to RTF string using Python

I am looking for a way to convert HTML text to an RTF string. Are there any libraries that do this? I get HTML content dynamically in my project and need it rendered in RTF format. I am using an HTML parser to convert the HTML text to a plain string and then trying to use PyRTF to convert it to RTF. Is there a better way to do this? Thanks in advance.

RTF seems a dicey format to convert from/to. I've tried cutting and pasting among applications on Mac OS X, for example, where RTF is something of a lingua franca. Some of those apps are Microsoft apps (relevant in that RTF is a Microsoft-developed format), others are not. Even basic formatting information like font size, font face, line spacing, and list styling (ordered or unordered) is jumbled when copying from one ostensibly RTF-speaking app to another. Simply put, it's a mess.
I have searched for ways to programmatically read, write, and transform RTF, preferably from Python. I found a number of packages on PyPI, but trying them out has been a disappointing experience. They would support RTF 1.5, say, when the current version is 1.9.1. RTF has been around a long time, but a 2005-vintage spec is not very recent. There were lots of gotchas and incompatibilities. LOTS.
Now, I'm not saying it's impossible, or that there aren't other libraries out there that would do the trick. I have not tried the zopyx.convert mentioned by others here, for example. Maybe it's great. But looking at its dependencies--Java, FOP, etc.--it looks like a pretty complex (and thus likely fragile) toolchain. I read its code on github, and the Python is really only there as a coordination veneer. It organizes external tools XFC, XINC, FOP, and PrinceXML--three of the four of which are commercial software. That includes the key XFC part that deals with RTF. Color me skeptical.
There are two converters that I've found worth a look. If you're using a Mac, the textutil command-line program is actually one of the better and simpler tools I've seen:
textutil -convert html filename.rtf -output filename.html
The other formatting engine that's worth considering is LibreOffice. It's free, open source, reasonably amenable to automation, and a decent foundation as an interoperability hub. That's not just a guess; I've built complex, multi-format document workflows around it.
I would question why you're trying to get into RTF. That seems like a document format you'd be trying to escape from. But if you need to go there, textutil and LibreOffice are the least-worst mechanisms I've found.
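For what it's worth, here is a rough sketch of driving textutil from Python. It uses the same `-convert` and `-output` flags as the command above, but with the direction reversed (HTML in, RTF out) to match the question. The function names are invented for illustration, and the runner only works on macOS, where textutil is available.

```python
import shutil
import subprocess


def textutil_command(src_path, dst_path, target_format="rtf"):
    """Build the textutil argument list (hypothetical helper)."""
    return ["textutil", "-convert", target_format, src_path, "-output", dst_path]


def convert_with_textutil(src_path, dst_path, target_format="rtf"):
    """Run textutil, failing early with a clear error if it is not on PATH."""
    if shutil.which("textutil") is None:
        raise RuntimeError("textutil not found; it ships with macOS only")
    subprocess.run(textutil_command(src_path, dst_path, target_format), check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.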

There is a wonderful python library that comes as a tarball.
You can download it at https://pypi.python.org/pypi/zopyx.convert2/2.4.5.
Good luck!

I see this question is over a year old, but figured I'd contribute anyway. I recently had a similar requirement, and turned to PyRTF, a small but powerful Python module that can construct RTF documents from a text file. You could use Beautiful Soup to scrape the HTML, going down the parse tree tag by tag, and use the PyRTF API to construct appropriate objects (table, cell, paragraph, section or document).
The API itself is quite granular, and allows for a whole bunch of custom formatting (font text, alignment, color, headers, footers etc.)
Hope this helps.
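To show the shape of the tag-by-tag idea, here is a minimal sketch. Note that it swaps in the standard library's `html.parser` for Beautiful Soup and writes raw RTF control words by hand instead of going through the PyRTF API, so it is a stand-in for the approach rather than an example of either library. It only handles bold, italic, underline, line breaks, and paragraphs; the names `HtmlToRtf`, `escape_rtf`, and `html_to_rtf` are made up.

```python
from html.parser import HTMLParser


def escape_rtf(text):
    """Escape RTF special characters and encode non-ASCII as \\uN? units."""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # \uN expects signed 16-bit UTF-16 code units; '?' is the fallback.
            raw = ch.encode("utf-16-le")
            for i in range(0, len(raw), 2):
                unit = int.from_bytes(raw[i:i + 2], "little", signed=True)
                out.append("\\u%d?" % unit)
    return "".join(out)


class HtmlToRtf(HTMLParser):
    # Supported inline tags mapped to (opening, closing) RTF control words.
    TAGS = {
        "b": ("\\b ", "\\b0 "), "strong": ("\\b ", "\\b0 "),
        "i": ("\\i ", "\\i0 "), "em": ("\\i ", "\\i0 "),
        "u": ("\\ul ", "\\ulnone "),
    }

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.TAGS:
            self.parts.append(self.TAGS[tag][0])
        elif tag == "br":
            self.parts.append("\\line ")

    def handle_endtag(self, tag):
        if tag in self.TAGS:
            self.parts.append(self.TAGS[tag][1])
        elif tag == "p":
            self.parts.append("\\par\n")

    def handle_data(self, data):
        self.parts.append(escape_rtf(data))


def html_to_rtf(html_text):
    """Convert a small HTML fragment to an RTF string."""
    parser = HtmlToRtf()
    parser.feed(html_text)
    parser.close()
    return "{\\rtf1\\ansi " + "".join(parser.parts) + "}"
```

Tables, lists, fonts, and colors are where a real library earns its keep; this sketch ignores any tag it does not recognize and keeps only the text.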

Are any Python PDF libraries able to access objects and groups in an existing PDF file?

I'm building a server application which accepts PDF files that artists have exported from Adobe Illustrator. Each file acts as an art "deck" containing several similar pieces of artwork, with each layer being a separate piece of art in the deck.
I'd like to be able to programmatically access those layers and separate them out into their own PDF files, single-page documents in this case. Reading and creating PDFs is pretty easy with Python PDF libraries like ReportLab, pyPDF, and pyx. However, none of these libraries allows manipulating an existing PDF at the layer/group/object level.
Have I missed something?

Have a look at pdfminer. I haven't used it much so I don't know if it does what you want, but from what I've seen it's quite powerful, and it is one of the more popular python PDF libraries.

Edit pdf file with PDFMiner

I was wondering if it's possible to edit an existing PDF file with PDFMiner. It seems to be a powerful tool, but the documentation is poor or nonexistent.
I found some examples, but they don't match my goal. I want to make a search engine that changes the color of my keywords in the PDF file.

PDFMiner is not for altering existing PDF files, but for extracting text and metadata from them. The closest solution to what you're looking for using PDFMiner would probably be to use the included pdf2txt.py tool to extract the text and then mark that up to highlight your keywords.
There's also the simple option of just using a PDF viewer with the built-in ability to find and highlight multiple search terms. I think Adobe Acrobat can do it, but I'm not sure about others.
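The "extract, then mark up" step could look something like the sketch below. It assumes the text has already been pulled out of the PDF (for instance, saved from pdf2txt.py) and only shows the marking part, producing HTML with `<mark>` tags. The function name `highlight_keywords` is made up, and matching is case-insensitive by assumption.

```python
import html
import re


def highlight_keywords(text, keywords):
    """Return HTML in which each keyword occurrence is wrapped in <mark>."""
    if not keywords:
        return html.escape(text)
    # Longest keywords first so that overlapping terms prefer the longer match.
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in ordered), re.IGNORECASE)

    pieces = []
    last = 0
    for match in pattern.finditer(text):
        pieces.append(html.escape(text[last:match.start()]))
        pieces.append("<mark>" + html.escape(match.group(0)) + "</mark>")
        last = match.end()
    pieces.append(html.escape(text[last:]))
    return "".join(pieces)


print(highlight_keywords("PDF tools & pdf tips", ["pdf"]))
# prints: <mark>PDF</mark> tools &amp; <mark>pdf</mark> tips
```

The result is a separate HTML rendering of the text, not a modified PDF, which is in line with what PDFMiner can offer.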

No, pdfminer doesn't support editing.
However, it might be a lot easier if you don't try to modify the pdf, but use PDFOpenParameters instead: http://partners.adobe.com/public/developer/en/acrobat/PDFOpenParameters.pdf
You can use url fragment identifiers like this:
http://www.example.com/test.pdf#search=foo
Or even when opening Acrobat on the command line (Windows example):
AcroRd32.exe /A "search=foo" test.pdf
You could also open the PDF at a specific page and highlight a certain area of that page (but not different areas on different pages at the same time).
(ok, I know it's not really a solution for the question you asked, but if this is sufficient for your needs, it's a lot simpler)
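If the links are generated from Python, a small helper along these lines could build them. This is only a sketch: `pdf_open_url` is a made-up name, it covers just the `search` and `page` parameters, and it deliberately sticks to single-word searches since how a given viewer treats multi-word terms is not something the sketch tries to handle.

```python
from urllib.parse import quote


def pdf_open_url(base_url, search=None, page=None):
    """Append PDF open parameters to a URL as a fragment (hypothetical helper)."""
    params = []
    if page is not None:
        params.append("page=%d" % page)
    if search:
        params.append("search=" + quote(search))
    # Multiple open parameters are joined with '&' after the '#'.
    return base_url + ("#" + "&".join(params) if params else "")


print(pdf_open_url("http://www.example.com/test.pdf", search="foo"))
# prints: http://www.example.com/test.pdf#search=foo
```

Whether the parameters are honored depends on the viewer that ends up opening the file.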

Generating & Merging PDF Files in Python

I want to automatically generate booking confirmation PDF files in Python. Most of the content will be static (i.e. logos, booking terms, phone numbers), with a few dynamic bits (dates, costs, etc).
From the user side, the simplest way to do this would be to start with a PDF file with the static content, and then using python to just add the dynamic parts. Is this a simple process?
From a bit of searching, it seems that I can use ReportLab for creating content and pyPdf for merging PDFs together. Is this the best approach? Or is there a really funky way that I haven't come across yet?
Thanks!

"From the user side, the simplest way to do this would be to start with a PDF file with the static content, and then using python to just add the dynamic parts. Is this a simple process?"
Unfortunately no. There are several tools that are good at producing PDFs from scratch (most commonly for Python, ReportLab), but they don't generally load existing PDFs. You would have to include generating code for any boilerplate text, lines, blocks, shapes and images, rather than this being freely editable by the user.
On the other side there's pyPdf which can load PDFs, collate the pages, and extract some of the information, but can't really add new content. You can ‘merge’ pages into one, but you'd still have to create the extra information overlay as a page in ReportLab first.

Look into docutils and reStructuredText. You could quickly write out your PDF document in reST and then compile the PDF using rst2pdf.py.
I've used this; it creates very beautiful documents, and the markup is extensible! Later you could take the same source and run it through rst2html to create a website out of it!
Take a look here:
http://docutils.sourceforge.net/docs/user/rst/quickref.html
http://code.google.com/p/rst2pdf/
Good luck
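For the booking confirmation case, the reST route could be sketched like this: fill the dynamic bits into a reST string, write it to a file, and hand it to the rst2pdf command-line tool. The function names and the document layout are invented for illustration, and `build_pdf` assumes rst2pdf is installed and on PATH.

```python
import subprocess


def booking_rst(name, date, cost):
    """Return a small reST document for one booking (hypothetical layout)."""
    title = "Booking confirmation"
    return "\n".join([
        title,
        "=" * len(title),  # reST title underline must be as long as the title
        "",
        "- Name: " + name,
        "- Date: " + date,
        "- Cost: " + cost,
        "",
        "Static booking terms and contact details go here as ordinary paragraphs.",
        "",
    ])


def build_pdf(rst_path, pdf_path):
    """Compile a reST file to PDF with the rst2pdf command-line tool."""
    subprocess.run(["rst2pdf", rst_path, "-o", pdf_path], check=True)
```

The static text then lives in the reST source rather than in an existing PDF, so this trades the "start from a designed PDF" workflow for one where the whole document is generated.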

You could generate a document through, for example, TeX, or OpenOffice, or whatever gives you the most comfortable bindings and then print the document with a pdf printer.
This allows you not to have to figure out where to put fields precisely or figure out what to do if your content overflows the space allocated for it.

What program to write pdf including other pdf on Linux from Python?

On an Ubuntu server, I want to create PDFs which include other static PDFs. I have tried using ReportLab with pyPdf. Ideally I would use ReportLab to do the whole thing, but importing the PDFs requires their PageCatcher add-on, which has a large recurring fee.
So I use pyPdf to merge a page created with ReportLab and my other pdfs. The problem is that even though this looks fine in Acrobat and Foxit, part of one of the pages prints garbled on a Xerox 7400 color printer. I can't figure out the issue, but would be willing to buy a more integrated solution if it existed and was reasonably priced. I thought PDF Creator Pilot was it until I saw that it was Windows only.
So is there a reasonably priced ($1K or less) solution or a different suggestion?

I have had a lot of success with the Java library iText. They have a great library of samples for pretty much anything you could think of doing with PDF files. This example is for concatenating PDF files and sounds like it would do what you need: http://itextpdf.com/examples/index.php?page=example&id=123. There is also PDFBox which is another great Java based PDF manipulation library.
I realize that you are looking for a Python-based solution, but there may not be many other options. If you are using the Jython interpreter instead of CPython, integrating iText should be trivial. If not, then you could consider calling out to it as a separate process. I realize that may not be ideal for your situation, but I figured I would mention it as an option.

Another non-Python answer. If you are just merging pages, then pdftk does that well (along with a lot of other things).
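If calling out to a separate process is acceptable, pdftk can still be driven from a Python script. A minimal sketch, assuming pdftk is installed on the server; the helper names are made up, and the command uses pdftk's `cat` operation to concatenate the inputs in order.

```python
import subprocess


def pdftk_cat_command(inputs, output):
    """Build the pdftk argument list that concatenates inputs into output."""
    return ["pdftk", *inputs, "cat", "output", output]


def merge_with_pdftk(inputs, output):
    """Run pdftk; raises CalledProcessError if the merge fails."""
    subprocess.run(pdftk_cat_command(inputs, output), check=True)


print(pdftk_cat_command(["cover.pdf", "terms.pdf"], "merged.pdf"))
# prints: ['pdftk', 'cover.pdf', 'terms.pdf', 'cat', 'output', 'merged.pdf']
```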
