Are any Python PDF libraries able to access objects and groups in an existing PDF file? - python

I'm building a server application which accepts PDF files that artists have exported from Adobe Illustrator. Each file acts as an art "deck" containing several similar pieces of artwork, with each layer being a separate piece of art in the deck.
I'd like to be able to programmatically access those layers and separate them out into their own PDF files, single page documents in this case. Reading and creating PDFs is pretty easy with Python PDF libraries like ReportLab, pyPDF, and pyx. However, none of these libraries allow manipulating of an existing PDF at the layer/group/object level.
Have I missed something?

Have a look at pdfminer. I haven't used it much so I don't know if it does what you want, but from what I've seen it's quite powerful, and it is one of the more popular python PDF libraries.

Related

XLSX to XML with schema map

I have built a couple basic workflows using XML tools on top of XLSX workbooks that are mapped to an XML schema. You would enter data into the spreadsheet, export the XML and I had some scripts that would then work with the data.
Now I'm trying to eliminate that step and build a more integrated and portable tool that others could use easily by moving from XSLT/XQuery to Python. I would still like to use Excel for the data entry, but have the Python script read the XLSX file directly.
I found a bunch of easy to use libraries to read from Excel but they need to explicitly state what cells the data is in, like range('A1:C2') etc. The useful thing about using the XML maps was that users could resize or even move tables to fit different rows and rename sheets. Is their a library that would let me select tables as units?
Another approach I tried was to just uncompress the XLSX and just parse the XML directly. The problem with that is that our data is quite complex (taking up to 30-50 sheets) and parsing that in the uncompressed XLSX structure is really daunting. I did find my XML schema within the uncompressed XLSX, so is there any way to reformat the data into this schema outside of Excel? (basically what Excel does when I save a workbook as an .xml file)
The Excel format is pretty complicated with dependencies between components – you can't for example be sure of that the order of the worksheets in the folder worksheets has any bearing to what the file looks like in Excel.
I don't really understand exactly what you're trying to do but the existing libraries present an interface for client code that hides the XML layer. If you don't want that you'll have to root around for the parts that you find useful. In openpyxl you want to look at the stuff in openpyxl/reader specifically worksheet.py.
However, you might have better luck using lxml as this (using libxml2 in the background) will allow you load a single XML into Python and manipulate it directly using the .objectify() method. We don't do this in openpyxl because XML trees consume a lot of memory (and many people have very large worksheets) but the library for working with Powerpoint shows just how easy this can be.

create office files from python

We have a project in python with django.
We need to generate complex word, excel and pdf files.
For the rest of our projects which were done in PHP we used PHPexcel ,
PHPWord and tcpdf for PDF.
What libraries for python would you recommend for creating this kind of files ? (for excel and word its imortant to use the open xml file format xlsx , docx)
Python-docx may help ( https://github.com/mikemaccana/python-docx ).
Python doesn't have highly-developed tools to manipulate word documents. I've found the java library xdocreport ( https://code.google.com/p/xdocreport/ ) to be the best by far for Word reporting. Because I need to generate PCL, which is efficiently done via FOP I also use docx4j.
To integrate this with my python, I use the spark framework to wrap it up with a simple web service, and use requests on the python side to talk to the service.
For excel, there's openpyxl, which actually is a python port of PHPexcel, afaik. I haven't used it yet, but it sounds ok to me.
I would recommend using Docutils. It takes reStructuredText files and converts them to a range of output files. Included in the package are HTML, LaTeX and .odf file writers but in the sandbox there are a whole load of other writers for writing to other formats, see for example, the WordML writer (disclaimer: I haven't used it).
The advantage of this solution is that you can write plain text (reStructuredText) master files, which are human readable as is, and then convert to a range of other file formats as required.
Whilst not a Python solution, you should also look at Pandoc a Haskell library which supports a much wider range of output and input formats than docutils. One major advantage of Pandoc over Docutils is that you can do the reverse translation, i.e. WordML to reStructuredText. You can try Pandoc here.
I have never used any libraries for this, but you can change the extension of any docx, xlsx file to zip, and see the magic!
Generating openxml files is as simple as generating couple of XML files (you can use templates) and zipping it.
Simplest way to generate PDF is to generate HTML (with CSS+images) and convert it using wkhtmltopdf tool.

Generating & Merging PDF Files in Python

I want to automatically generate booking confirmation PDF files in Python. Most of the content will be static (i.e. logos, booking terms, phone numbers), with a few dynamic bits (dates, costs, etc).
From the user side, the simplest way to do this would be to start with a PDF file with the static content, and then using python to just add the dynamic parts. Is this a simple process?
From doing a bit of search, it seems that I can use reportlab for creating content and pyPdf for merging PDF's together. Is this the best approach? Or is there a really funky way that I haven't come across yet?
Thanks!
From the user side, the simplest way to do this would be to start with a PDF file with the static content, and then using python to just add the dynamic parts. Is this a simple process?
Unfortunately no. There are several tools that are good at producing PDFs from scratch (most commonly for Python, ReportLab), but they don't generally load existing PDFs. You would have to include generating code for any boilerplate text, lines, blocks, shapes and images, rather than this being freely editable by the user.
On the other side there's pyPdf which can load PDFs, collate the pages, and extract some of the information, but can't really add new content. You can ‘merge’ pages into one, but you'd still have to create the extra information overlay as a page in ReportLab first.
Look into docutils and reSTructuredText. You could quickly write out your PDF document in reST and then compile the PDF using rst2pdf.py
I've used this, it creates very beautiful documents and the markup is extensible! Later you could take the same code and run it into rst2html to create a website out if it!
Take a look here:
http://docutils.sourceforge.net/docs/user/rst/quickref.html
http://code.google.com/p/rst2pdf/
Good luck
You could generate a document through, for example, TeX, or OpenOffice, or whatever gives you the most comfortable bindings and then print the document with a pdf printer.
This allows you not to have to figure out where to put fields precisely or figure out what to do if your content overflows the space allocated for it.

What program to write pdf including other pdf on Linux from Python?

On an Ubuntu server, I want to create pdfs which include other static pdfs. I have tried using ReportLab with pyPdf. Ideally I would use ReportLab to do the whole thing, but in order to import the pdfs requires their PageCatcher which has a large recurring fee.
So I use pyPdf to merge a page created with ReportLab and my other pdfs. The problem is that even though this looks fine in Acrobat and Foxit, part of one of the pages prints garbled on a Xerox 7400 color printer. I can't figure out the issue, but would be willing to buy a more integrated solution if it existed and was reasonably priced. I thought PDF Creator Pilot was it until I saw that it was Windows only.
So is there a reasonably priced ($1K or less) solution or a different suggestion?
I have had a lot of success with the Java library iText. They have a great library of samples for pretty much anything you could think of doing with PDF files. This example is for concatenating PDF files and sounds like it would do what you need: http://itextpdf.com/examples/index.php?page=example&id=123. There is also PDFBox which is another great Java based PDF manipulation library.
I realize that you are looking for a Python based solution but there may not be many other options. If you are using the Jython interpreter instead of CPython, integrating in iText should be trivial. If not, then you could consider calling out to it as a separate process. I realize that may not be idea for your situation but I figured I would mention it as an option.
Another non-Python answer. If you are just merging pages, then pdftk does that well (along with a lot of other things).

How to include page in PDF in PDF document in Python

I am using reportlab toolkit in Python to generate some reports in PDF format. I want to use some predefined parts of documents already published in PDF format to be included in generated PDF file. Is it possible (and how) to accomplish this in reportlab or in python library?
I know I can use some other tools like PDF Toolkit (pdftk) but I am looking for Python-based solution.
I'm currently using PyPDF to read, write, and combine existing PDF's and ReportLab to generate new content. Using the two package seemed to work better than any single package I was able to find.
If you want to place existing PDF pages in your Reportlab documents I recommend pdfrw. Unlike PageCatcher it is free.
I've used it for several projects where I need to add barcodes etc to existing documents and it works very well. There are a couple of examples on the project page of how to use it with Reportlab.
A couple of things to note though:
If the source PDF contains errors (due to the originating program following the PDF spec imperfectly for example), pdfrw may fail even though something like Adobe Reader has no apparent problems reading the PDF. pdfrw is currently not very fault tolerant.
Also, pdfrw works by being completely agnostic to the actual content of the PDF page you are placing. So for example, you wouldn't be able to use pdfrw inspect a page to see if it contains a certain string of text in the lower right-hand corner. However if you don't need to do anything like that you should be fine.
There is an add-on for ReportLab — PageCatcher.

Categories