I use Pisa/xhtml2pdf in my Django apps to generate pdf from an HTML source. That is:
I generate the HTML file with all the print-specific formatting (e.g. page breaks, header, footer, etc.)
I convert this HTML into pdf using Pisa
This process works, but it is slow (especially when dealing with long tables) and I must write my HTML/CSS within Pisa's features and limitations.
The question is: is this the right way to generate PDFs from a web application (i.e. create HTML and then convert it to PDF), or is there a more direct way, that is, to "write" the PDF with a more suitable language?
WeasyPrint author here. The point of using HTML/CSS to generate PDF (vs. using a lower-level PDF library directly) is to get automatic layout. It lets you specify high-level constraints like h1 { page-break-after: avoid } and let the layout engine figure it out, rather than specifying the absolute position of everything. The former is much more maintainable when you make changes to your documents.
Some tools like rst2pdf have their own stylesheet syntax, but that’s just a bad way of re-inventing CSS.
But yes, dumping complex stylesheets made for screen might not give great results. It’s better to build the stylesheets with print in mind, or even use completely different stylesheets with @media print in CSS or <link media="print"> in HTML.
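A minimal sketch of that idea, assuming WeasyPrint is installed. The `build_document` and `render_pdf` helpers, the `.screen-only` class, and the page size are invented for illustration; only `HTML(string=...).write_pdf()` is WeasyPrint's own call.

```python
from html import escape

# Print-oriented rules: a page box, a break constraint, and a print-only block.
PRINT_CSS = """
@page { size: A4; margin: 2cm }
h1 { page-break-after: avoid }        /* keep a heading with the text after it */
@media print {
    .screen-only { display: none }    /* hide on-screen navigation on paper */
}
"""


def build_document(title, paragraphs):
    """Return a complete HTML page that carries its own print stylesheet."""
    body = "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><meta charset='utf-8'><title>{escape(title)}</title>\n"
        f"<style>{PRINT_CSS}</style></head>\n"
        f"<body><h1>{escape(title)}</h1>\n{body}\n</body></html>"
    )


def render_pdf(html):
    """Render HTML to PDF bytes. WeasyPrint is imported lazily (optional dependency)."""
    from weasyprint import HTML

    return HTML(string=html).write_pdf()
```

Calling `render_pdf(build_document("Report", ["First paragraph."]))` would return the PDF as bytes; the layout engine decides where the page breaks go.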
I think generating a PDF from HTML with libraries like Pisa or http://weasyprint.org/ is the simplest approach, because it takes care of inserting images, CSS, barcodes (with Pisa), etc.
If you want to write the PDF yourself, take a look at Reportlab, but it will take much longer to implement. In both cases I suggest always generating the PDF in the background with celery or python-rq, so that the request is not blocked while the file is built.
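Celery and python-rq both need a broker and a worker process, so the sketch below uses the standard library's `concurrent.futures` as a stand-in. It only shows the shape of the idea: the request handler queues the job and returns at once, and the PDF is collected later. `render_pdf`, `start_pdf_job`, and `fetch_pdf` are hypothetical names, and `render_pdf` is a placeholder for whichever converter you use.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a real task queue; a thread pool lives inside the web process.
executor = ThreadPoolExecutor(max_workers=2)


def render_pdf(html):
    """Hypothetical placeholder for the slow HTML-to-PDF conversion."""
    return b"%PDF-placeholder " + html.encode("utf-8")


def start_pdf_job(html):
    """Queue the conversion and return a handle immediately."""
    return executor.submit(render_pdf, html)


def fetch_pdf(job, timeout=None):
    """Block until the job has finished (or timed out) and return the bytes."""
    return job.result(timeout=timeout)
```

With a real queue, `start_pdf_job` would become the enqueue call (for example Celery's `task.delay(...)` or RQ's `queue.enqueue(...)`), and the result would be fetched by a later request.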
Pisa is known to have various issues, especially with long tables. In general one should avoid using Pisa. Other options are:
using Reportlab directly
z3c.rml (Reportlab template language clone)
commercial alternatives:
PrinceXML
PDFreactor
The general rule when it comes to PDF production: you get what you pay for.
Converters like Pisa or Apache FOP are half-baked solutions that work for simple cases but suck in general.
You can also use the Qt WebKit rendering engine to create PDFs from HTML with http://code.google.com/p/wkhtmltopdf/ and django-wkhtmltopdf.
The advantage is that you can write the HTML and CSS as you would normally for WebKit. This works well if you are outputting an existing web page but may be less appropriate if generating PDFs from scratch.
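As a rough sketch, wkhtmltopdf can be driven from Python through `subprocess`, assuming the `wkhtmltopdf` binary is installed and on the PATH. The helper names and the default page size are made up for this example.

```python
import shutil
import subprocess


def wkhtmltopdf_command(source, output_pdf, page_size="A4"):
    """Build the argument list; source may be a URL or a local HTML file."""
    return ["wkhtmltopdf", "--page-size", page_size, source, output_pdf]


def html_to_pdf(source, output_pdf):
    """Run the conversion, failing early if the binary is missing."""
    if shutil.which("wkhtmltopdf") is None:
        raise RuntimeError("wkhtmltopdf is not on PATH")
    subprocess.run(wkhtmltopdf_command(source, output_pdf), check=True)
```

Passing the arguments as a list avoids shell quoting problems when the source is a URL with query parameters.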
I am looking for a way to convert HTML text to an RTF string. Are there any libraries that do this job? I get HTML content dynamically in my project and need it to be rendered in RTF format. I am using an HTML parser to convert the HTML text to a plain string and am then trying to use PyRTF for conversion to RTF format. Is there any better way that this can be done? Thanks in advance.
RTF seems a dicey format to convert from/to. I've tried cutting and pasting among applications on Mac OS X, for example, where RTF is something of a lingua franca. Some of those apps are Microsoft apps (relevant in that RTF is a Microsoft-developed format), others are not. Even basic formatting information like font size, font face, line spacing, and list styling (ordered or unordered) is jumbled when copying from one ostensibly RTF-speaking app to another. Simply put, it's a mess.
I have searched for ways to programmatically read, write, and transform RTF, preferably from Python. I found a number of packages on PyPI, but trying them out has been a disappointing experience. They would support RTF 1.5, say, when the current version is 1.9.1. RTF has been around a long time, but a 2005-vintage spec is not very recent. There were lots of gotchas and incompatibilities. LOTS.
Now, I'm not saying it's impossible, or that there aren't other libraries out there that would do the trick. I have not tried the zopyx.convert mentioned by others here, for example. Maybe it's great. But looking at its dependencies--Java, FOP, etc.--it looks like a pretty complex (and thus likely fragile) toolchain. I read its code on github, and the Python is really only there as a coordination veneer. It organizes external tools XFC, XINC, FOP, and PrinceXML--three of the four of which are commercial software. That includes the key XFC part that deals with RTF. Color me skeptical.
There are two converters that I've found are worth a look: If you're using a Mac, the textutil command line program is actually one of the better and simpler tools I've seen.
textutil -convert html filename.rtf -output filename.html
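The command above goes from RTF to HTML; for the direction asked about in the question, the same tool can be pointed the other way. Below is a small sketch for calling it from Python, assuming macOS with `textutil` available. The function names are invented for the example.

```python
import shutil
import subprocess


def textutil_command(source, target, fmt="rtf"):
    """Build a textutil call that converts source into target in the given format."""
    return ["textutil", "-convert", fmt, source, "-output", target]


def convert_with_textutil(source, target, fmt="rtf"):
    """Run textutil; it ships with macOS only, so check for it first."""
    if shutil.which("textutil") is None:
        raise RuntimeError("textutil not found (it is a macOS tool)")
    subprocess.run(textutil_command(source, target, fmt), check=True)
```

For example, `convert_with_textutil("page.html", "page.rtf")` would write the RTF version next to the HTML file.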
The other formatting engine that's worth considering is LibreOffice. It's free, open source, reasonably amenable to automation, and a decent foundation as an interoperability hub. That's not just a guess; I've built complex, multi-format document workflows around it.
I would question why you're trying to get into RTF. That seems like a document format you'd be trying to escape from. But if you need to go there, textutil and LibreOffice are the least-worst mechanisms I've found.
There is a wonderful python library that comes as a tarball.
You can download it at https://pypi.python.org/pypi/zopyx.convert2/2.4.5.
Good luck!
I see this question is over a year old, but figured I'd contribute anyway. I recently had a similar requirement, and turned to PyRTF, a small but powerful Python module that can construct RTF documents from a text file. You could use Beautiful Soup to scrape the HTML, going down the parse tree tag by tag, and use the PyRTF API to construct appropriate objects (table, cell, paragraph, section or document).
The API itself is quite granular, and allows for a whole bunch of custom formatting (font text, alignment, color, headers, footers etc.)
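To show what "going down the tree tag by tag" can look like, here is a rough sketch that does not use PyRTF or Beautiful Soup at all. It swaps in the standard library's `html.parser` and writes a handful of RTF control words by hand, so it runs without extra packages. It only knows bold, italic, paragraphs, and line breaks; the tag table, class name, and font are assumptions made for the example.

```python
import struct
from html.parser import HTMLParser

# HTML tag -> (RTF emitted at the opening tag, RTF emitted at the closing tag)
TAG_TO_RTF = {
    "b": ("\\b ", "\\b0 "),
    "strong": ("\\b ", "\\b0 "),
    "i": ("\\i ", "\\i0 "),
    "em": ("\\i ", "\\i0 "),
    "p": ("", "\\par\n"),
}


def rtf_escape(text):
    """Escape RTF special characters and write non-ASCII as \\uN? escapes."""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # \uN takes signed 16-bit UTF-16 code units; "?" is the fallback char.
            for (unit,) in struct.iter_unpack("<h", ch.encode("utf-16-le")):
                out.append("\\u{}?".format(unit))
    return "".join(out)


class HtmlToRtf(HTMLParser):
    """Collect RTF fragments while the parser walks the HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\\line ")
        elif tag in TAG_TO_RTF:
            self.parts.append(TAG_TO_RTF[tag][0])

    def handle_endtag(self, tag):
        if tag in TAG_TO_RTF:
            self.parts.append(TAG_TO_RTF[tag][1])

    def handle_data(self, data):
        self.parts.append(rtf_escape(data))


def html_to_rtf(html):
    """Return a small RTF document for the supported subset of HTML."""
    parser = HtmlToRtf()
    parser.feed(html)
    parser.close()
    header = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Helvetica;}}\n"
    return header + "".join(parser.parts) + "}"
```

A library such as PyRTF replaces the hand-written control words with document, section, and paragraph objects, which matters as soon as tables, lists, or styles are involved.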
Hope this helps.
I generate PDFs with the xhtml2pdf Python package. The output is not optimal. I use floating divs in order to place images and text on the page. In HTML this works, but after PDF rendering, images and text are placed underneath each other, which is not what I want. From surfing the web I learned that the ReportLab package that is used by xhtml2pdf cannot handle floating divs. Does a workaround exist? I have tried WebKit rendering via Qt but the resulting PDFs are of low quality, i.e. character spacing is completely wrong.
If you cannot achieve the results you need with xhtml2pdf, I suggest you use ReportLab directly. ReportLab contains support for RML, ReportLab's own markup language that lets you easily create formatted text, and has a support library called Platypus that makes layout fairly simple using Python objects to represent document parts and page layouts.
The reason you are having problems, by the way, is that xhtml2pdf has to essentially act like an HTML rendering engine that outputs to PDF rather than to the screen. Just as it took a long time and a lot of effort to make good rendering engines for browsers, it will likely take a lot of effort for xhtml2pdf to reach similar quality. This isn't to say that xhtml2pdf is bad, just that it's going to take time for it to be as good as rendering in a browser, and if PDF output for its own sake is what you really are interested in, I think using ReportLab directly is a better choice.
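As one possible way to get an image and text next to each other with Platypus, the sketch below puts them in the two cells of a single-row table instead of relying on floats. It assumes ReportLab is installed; the helper names, sizes, and the choice of a table are illustrative, not the only way to do it.

```python
def column_widths(frame_width, image_width, padding=6):
    """Split the frame into an image column (plus cell padding) and a text column."""
    image_column = image_width + 2 * padding
    return [image_column, frame_width - image_column]


def build_side_by_side_pdf(target, image_path, text,
                           image_width=150, image_height=100):
    """Write a one-page PDF with the image on the left and the text on the right."""
    # ReportLab is imported lazily so the width helper works without it.
    from reportlab.lib.pagesizes import A4
    from reportlab.lib.styles import getSampleStyleSheet
    from reportlab.platypus import (Image, Paragraph, SimpleDocTemplate,
                                    Table, TableStyle)

    styles = getSampleStyleSheet()
    doc = SimpleDocTemplate(target, pagesize=A4)
    row = [Image(image_path, width=image_width, height=image_height),
           Paragraph(text, styles["BodyText"])]
    table = Table([row], colWidths=column_widths(doc.width, image_width))
    table.setStyle(TableStyle([("VALIGN", (0, 0), (-1, -1), "TOP")]))
    doc.build([table])
```

The paragraph wraps inside its own column, so long text stays beside the image rather than dropping underneath it.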
I have XML configs which are very complex. They are validated using DTDs. I am looking for an application which reads the DTDs and provides a GUI for writing the XML files, so that we don't have to write them by hand. If nothing like this exists already, how should I start developing one?
Did you try Eclipse?
You can edit XML files according to the DTD.
If your audience is not development-oriented, you can have a look at Editix XML Editor (the free version is already fully usable) or Notepad++ (a strange mix between the classic Windows Notepad and Eclipse ;) ).
Otherwise, if you want to write some code, have a look at wxPython; it has some widgets to edit/view HTML and XML.
Take a look at oXygen XML Author. It's not intended for programmers/developers like oXygen XML Editor is. It's a WYSIWYG XML authoring tool. You can create CSS stylesheets to display the XML in a way that is easy for the authors to create/modify the XML.
It's not free and you'll have to create CSS for display, but it is much better than trying to develop something from scratch. It also supports creating templates so that authors can start with a base XML file and then modify/update.
I want to automatically generate booking confirmation PDF files in Python. Most of the content will be static (i.e. logos, booking terms, phone numbers), with a few dynamic bits (dates, costs, etc).
From the user side, the simplest way to do this would be to start with a PDF file with the static content, and then using python to just add the dynamic parts. Is this a simple process?
From a bit of searching, it seems that I can use reportlab for creating content and pyPdf for merging PDFs together. Is this the best approach? Or is there a really funky way that I haven't come across yet?
Thanks!
From the user side, the simplest way to do this would be to start with a PDF file with the static content, and then using python to just add the dynamic parts. Is this a simple process?
Unfortunately no. There are several tools that are good at producing PDFs from scratch (most commonly for Python, ReportLab), but they don't generally load existing PDFs. You would have to include generating code for any boilerplate text, lines, blocks, shapes and images, rather than this being freely editable by the user.
On the other side there's pyPdf which can load PDFs, collate the pages, and extract some of the information, but can't really add new content. You can ‘merge’ pages into one, but you'd still have to create the extra information overlay as a page in ReportLab first.
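The overlay approach can be sketched roughly as follows: ReportLab draws the dynamic text on an otherwise empty page, and that page is merged onto the static template. This example assumes ReportLab and pypdf (a maintained successor to pyPdf) are installed; the old pyPdf spelled the same calls `PdfFileReader`, `PdfFileWriter`, `addPage`, and `mergePage`. The field names and coordinates are invented.

```python
import io


def layout_fields(booking):
    """Map booking values to (x, y, text) positions in points (made-up layout)."""
    return [
        (72, 700, "Name: {}".format(booking["name"])),
        (72, 680, "Date: {}".format(booking["date"])),
        (72, 660, "Total: {:.2f}".format(booking["cost"])),
    ]


def make_overlay(fields):
    """Draw the dynamic text on an empty A4 page and return it as a stream."""
    from reportlab.lib.pagesizes import A4
    from reportlab.pdfgen import canvas

    buffer = io.BytesIO()
    page = canvas.Canvas(buffer, pagesize=A4)
    for x, y, text in fields:
        page.drawString(x, y, text)
    page.save()
    buffer.seek(0)
    return buffer


def fill_template(template_path, output_path, booking):
    """Merge the overlay onto the first page of the static template PDF."""
    from pypdf import PdfReader, PdfWriter

    overlay_page = PdfReader(make_overlay(layout_fields(booking))).pages[0]
    template_page = PdfReader(template_path).pages[0]
    template_page.merge_page(overlay_page)

    writer = PdfWriter()
    writer.add_page(template_page)
    with open(output_path, "wb") as handle:
        writer.write(handle)
```

The user can keep editing the static template in any PDF tool, as long as the blank areas stay where `layout_fields` expects them.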
Look into docutils and reStructuredText. You could quickly write out your PDF document in reST and then compile the PDF using rst2pdf.py
I've used this; it creates very beautiful documents and the markup is extensible! Later you could take the same source and run it through rst2html to create a website out of it!
Take a look here:
http://docutils.sourceforge.net/docs/user/rst/quickref.html
http://code.google.com/p/rst2pdf/
Good luck
You could generate a document through, for example, TeX, or OpenOffice, or whatever gives you the most comfortable bindings and then print the document with a pdf printer.
This allows you not to have to figure out where to put fields precisely or figure out what to do if your content overflows the space allocated for it.
I am using reportlab toolkit in Python to generate some reports in PDF format. I want to use some predefined parts of documents already published in PDF format to be included in generated PDF file. Is it possible (and how) to accomplish this in reportlab or in python library?
I know I can use some other tools like PDF Toolkit (pdftk) but I am looking for Python-based solution.
I'm currently using PyPDF to read, write, and combine existing PDFs and ReportLab to generate new content. Using the two packages together seemed to work better than any single package I was able to find.
If you want to place existing PDF pages in your Reportlab documents I recommend pdfrw. Unlike PageCatcher it is free.
I've used it for several projects where I need to add barcodes etc to existing documents and it works very well. There are a couple of examples on the project page of how to use it with Reportlab.
A couple of things to note though:
If the source PDF contains errors (due to the originating program following the PDF spec imperfectly for example), pdfrw may fail even though something like Adobe Reader has no apparent problems reading the PDF. pdfrw is currently not very fault tolerant.
Also, pdfrw works by being completely agnostic to the actual content of the PDF page you are placing. So for example, you wouldn't be able to use pdfrw to inspect a page to see if it contains a certain string of text in the lower right-hand corner. However, if you don't need to do anything like that you should be fine.
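For reference, the pattern for placing existing pages with pdfrw looks roughly like this, assuming pdfrw and ReportLab are installed. Each source page is wrapped as a form XObject and painted onto a ReportLab canvas, and then a label is drawn on top. `stamp_origin`, `stamp_pdf`, and the margin are hypothetical; a plain text label stands in for the barcode mentioned above.

```python
def stamp_origin(page_width, stamp_width, margin=36):
    """Bottom-left (x, y) of a stamp placed in the lower right-hand corner."""
    return (page_width - stamp_width - margin, margin)


def stamp_pdf(source_path, output_path, label):
    """Copy every page of source_path and draw label in its lower-right corner."""
    from pdfrw import PdfReader
    from pdfrw.buildxobj import pagexobj
    from pdfrw.toreportlab import makerl
    from reportlab.pdfgen.canvas import Canvas

    pages = [pagexobj(page) for page in PdfReader(source_path).pages]
    canvas = Canvas(output_path)
    for page in pages:
        width, height = float(page.BBox[2]), float(page.BBox[3])
        canvas.setPageSize((width, height))
        canvas.doForm(makerl(canvas, page))  # paint the existing page as-is
        x, y = stamp_origin(width, canvas.stringWidth(label))
        canvas.drawString(x, y, label)
        canvas.showPage()
    canvas.save()
```

Because the page is copied as an opaque object, the code never looks at its text, which matches the limitation described above.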
There is an add-on for ReportLab — PageCatcher.