Generating large PDFs using Wkhtmltopdf/pdfkit + Python/Flask - python

I have to generate a large PDF from a web app. The PDF is built from a large data set of email content for clients. Right now it is written in PHP: I loop over every item in the data set, create an individual HTML page for each client, and then add all those pages one by one to wkhtmltopdf via the add page option.
This is obviously not very elegant, and PHP dies when the input is very big, for example 1000 clients. The idea behind this PDF is that we regularly have to send physical mail to our clients, so we want to create one big file, print it, put each client's pages in an envelope, and mail them.
I'm now redoing this in Python instead of PHP. I am also not sure what coding practice I should follow to make sure the PDF is generated in the fastest and most efficient manner.
Here are a couple of options I thought about:
Create one big variable
I'm wondering whether I can build up a single big variable, write its entire contents to an HTML file in one go, and then use that file to create the PDF with wkhtmltopdf. However, this would be one really big variable and the RAM might go nuts.
Write to only one file
I'm not sure how I would implement this, but maybe instead of creating a bunch of HTML files, I should create just one file and keep appending to the bottom of it?
Stick with current concept?
Maybe the exact same programming design/concept will magically work well with Python
...?
Any or all of the options I have thought of may be completely wrong and flawed, though.
EDIT: Writing to one file cannot work. Since these mails have to be sent physically, I need to make sure the content for each client starts on a new page, and if I write a single big file, there is no way I would be able to do that.

As far as wkhtmltopdf is concerned, page breaking depends a lot on your content and your requirements. I need specific page breaks, but I don't have a "one piece of content must always fit on exactly one page" limitation; if you do, don't bother with it. Also, if you have very specific styling rules, it might be difficult depending on what the styles are. The largest PDFs I've done with wkhtmltopdf are only 100 pages or so, so I can't comment on the sizes.
What I would do with wkhtmltopdf is format the content like this:
<head>
<style>
.pb { page-break-before: always; }
</style>
</head>
<body>
<div id="mail1" class="pb">...</div>
<div id="mail2" class="pb">...</div>
<!-- etc etc -->
</body>
This ensures that each email starts from a new page. Then feed that output to wkhtmltopdf using the desired styles and cli options and check if everything worked out as planned. This test should be very quick to do.
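A minimal Python sketch of that test, assuming each mail is already available as an HTML fragment and that the `wkhtmltopdf` binary is on the PATH. The names `write_mails_html` and `html_to_pdf` are made up for illustration. The fragments are written to disk one at a time, so the whole document never has to sit in a single variable:

```python
import subprocess
from pathlib import Path

HTML_HEAD = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<style>
.pb { page-break-before: always; }
</style>
</head>
<body>
"""
HTML_TAIL = "</body>\n</html>\n"


def write_mails_html(mail_fragments, out_path):
    """Stream HTML fragments into one file, one .pb div per mail.

    mail_fragments can be any iterable (a generator keeps memory use flat).
    Fragments are assumed to be trusted HTML and are written as-is.
    Returns the number of mails written.
    """
    count = 0
    with Path(out_path).open("w", encoding="utf-8") as f:
        f.write(HTML_HEAD)
        for fragment in mail_fragments:
            count += 1
            f.write(f'<div id="mail{count}" class="pb">{fragment}</div>\n')
        f.write(HTML_TAIL)
    return count


def html_to_pdf(html_path, pdf_path):
    """Run wkhtmltopdf in its basic form: wkhtmltopdf <input> <output>."""
    subprocess.run(["wkhtmltopdf", str(html_path), str(pdf_path)], check=True)
```

Calling `write_mails_html(fragments, "mails.html")` followed by `html_to_pdf("mails.html", "mails.pdf")` would be the whole pipeline. Whether wkhtmltopdf copes with 1000 clients in one input file is something only a test run can show.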
Additionally, if the HTML is extremely simple and you can always rely on it being in a specific format (you could validate it with a simple XML schema), you could try iTextSharp and manually transform the HTML. I haven't done it and it sounds horrible, but it might work for you; iTextSharp is quite fast.

Related

Converting PDF to any parse-able format

I have a PDF file which consists of tables which can spread across various pages and may have text in between. An example of it can be found here.
I am able to convert the PDF to any format but the output files are not in any way parse-able i.e. I cannot extract data out of it as they are scattered. Here are the links to the output files which I created using pdftotext and pdftohtml.
Is there a way to extract data in a more suitable way?
Thanks in advance.
The general answer is no. pdf is a format intended for visual presentation and printing, and there is no guarantee that the contents will be in any particular order let alone structured as a table in any way other than what appears when the pdf is rendered onto paper or a screen. Sometimes there is even deliberate obfuscation to prevent anyone doing what you are attempting.
In this case it appears to be possible to cut and paste the contents of each table element. For a small number of similar files that is almost certainly the quickest thing to do. Open the pdf on the left hand of your screen, a spreadsheet or data-entry program on the right hand, then cut and paste. For a medium number - tens, hundreds? - it's probably cheapest to hire a temp to do the donkey-work. For a large number - thousands? - it would be possible to create a program to automate this process, but definitely not easy. I might think about using human input via the mouse to identify the corners of the table and the horizontal / vertical divisions, then generating cut and paste operations via control of the human interface devices. Don't ask me how. I'd have to find out if I had to do this, and I'd much rather not. It's a WOMBAT.
Whatever form of analysis you did on the pdf contents would certainly not generalize to other pdfs created by different organisations using different software, and possibly not even by the same organisation using the same process but merely a later release of the same software.
Following in the line of @nigel222, it really depends on the PDF how easily you can get the data out in some useful way.
It is best if the PDF is structured (has a document structure, created when the PDF was written). In this case, you can access the structure, and you are all set.
As structure is a fundamental necessity of an accessible PDF, you may try to "massage" the document by applying the various "make accessible" utilities floating around; definitely something to follow.
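When a PDF has no usable structure but its tables happen to be laid out regularly, one fragile heuristic is to work on layout-preserved text, for example the output of `pdftotext -layout`, and split each line on runs of two or more spaces. The sketch below assumes that kind of input. The function names are made up for illustration, and as noted above, nothing here will generalize to PDFs produced by other software:

```python
import re


def split_layout_line(line):
    """Split one line of layout-preserved text into cells on runs of 2+ spaces."""
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]


def rows_with_n_columns(text, n):
    """Keep only the lines that split into exactly n cells (likely table rows)."""
    rows = []
    for line in text.splitlines():
        cells = split_layout_line(line)
        if len(cells) == n:
            rows.append(cells)
    return rows
```

Lines of running text between the tables usually contain only single spaces, so they collapse into one cell and are filtered out. A table cell that itself contains a double space, or a column that is empty in some row, will break this.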

Need suggestions for designing a content organizer

I am trying to write a Python program which takes content and categorizes it based on tags. I am using Nepomuk to tag files and PyQt for the GUI. The problem is that I am unable to decide how to save the content. Right now, I am saving each entry individually to a text file in a folder. When I need to read the contents, I tell the program to get all the files in that folder and then perform a read operation on each file. Since the number of files is small for now (fewer than 20), this approach is decent enough. But I am worried that when the file count increases, this method will become inefficient. Is there any other method to save content efficiently?
Thanks in advance.
You could use the sqlite3 module from the stdlib. Data will be stored in a single file. The code might be even simpler than the code used for reading all the ad hoc text files by hand.
You could always export the data in a format suitable for sharing in your case.
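A minimal sketch of what that could look like with sqlite3. The table layout and the function names are assumptions for illustration, and the tags are stored in the database here rather than read from Nepomuk:

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the single-file store; ':memory:' is handy for tests."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS entries (
            id   INTEGER PRIMARY KEY,
            body TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS tags (
            entry_id INTEGER NOT NULL REFERENCES entries(id),
            tag      TEXT NOT NULL,
            PRIMARY KEY (entry_id, tag)
        );
    """)
    return conn


def add_entry(conn, body, tags):
    """Insert one entry with its tags and return the new entry id."""
    with conn:  # commits on success, rolls back on an exception
        cur = conn.execute("INSERT INTO entries (body) VALUES (?)", (body,))
        entry_id = cur.lastrowid
        conn.executemany(
            "INSERT OR IGNORE INTO tags (entry_id, tag) VALUES (?, ?)",
            [(entry_id, tag) for tag in tags],
        )
    return entry_id


def entries_with_tag(conn, tag):
    """Return the bodies of all entries carrying the given tag, oldest first."""
    rows = conn.execute(
        "SELECT e.body FROM entries e JOIN tags t ON t.entry_id = e.id "
        "WHERE t.tag = ? ORDER BY e.id",
        (tag,),
    )
    return [body for (body,) in rows]
```

Passing a real path such as `open_store("content.db")` keeps everything in one file on disk, and looking up entries by tag no longer requires opening every file in a folder.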

Hide information in a PDF file in Python

In Python, I have PDF files generated by ReportLab. Now I need to extract some pages from those PDFs and hide confidential information.
I can create a PDF file with blacked-out spots and use pyPdf's mergePage to overlay it, but people can still select and copy-paste the information under the blacked-out spots.
Is there a way to make those spots completely confidential?
For example, I need to hide addresses on the pages. How would I do it?
Thanks,
Basically you'll have to remove the corresponding text drawing commands in the PDF's page content stream. It's much easier to generate the pages twice, once with the confidential information, once without them.
It might be possible (I don't know ReportLab well enough) to specially craft the PDF in a way that makes the confidential information more easily accessible for deletion (e.g. as separate XObjects). Still, you'd have to do pretty low-level operations on the PDF, which I would advise against.
(Sorry, I was not able to log on when I posted the question...)
Unfortunately, the document cannot be regenerated at will (context sensitive), and those PDF files (about 35) are 3000+ pages.
I was thinking about using pdf2ps and then ps2pdf to convert back, but a lot of quality is lost.
pdf2ps -dLanguageLevel=3 input.pdf - | ps2pdf14 - output.pdf
And if I use "pdftops" instead, the text is still selectable. If there is a way to make it non-selectable like with "pdf2ps" but with better quality, that will do too.

Call macro from Python script?

One of our page templates is made up of a bunch of macros. These items are a bunch of html tables.
Now, I want a couple of these tables in a Python script to create a PDF. Is there a way call a macro from a Python script and get back the HTML that is produced?
If so, can you explain?
Thanks
Eric
Maybe you could create a new template including (use-macro) just the macros you want to access from python and then use z3c.pt.pagetemplate.PageTemplateFile() to render it?
Actually, it might be possible (and certainly easier) to use chameleon.zpt.template.PageTemplate('<div tal:use-macro="<your-macro-here>" />'), but I've never done this myself.
I'd probably use urllib.urlopen(url), pull the data from the page back to python and use BeautifulSoup to pull the table(s) out of the HTML... And then render that to PDF with XHTML2PDF (pisa.ho).
There might be a simpler way but for me, this would be the least stressful approach.
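A rough sketch of the fetch-and-extract step, with two substitutions stated plainly: it uses `urllib.request.urlopen` (the Python 3 home of `urllib.urlopen`), and it uses the standard library's `html.parser` in place of BeautifulSoup so the example has no third-party dependency. The class and function names are made up for illustration, and rendering the result to PDF is left out:

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TableExtractor(HTMLParser):
    """Collect the raw HTML of every top-level <table> element."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.tables = []   # finished table fragments
        self._depth = 0    # nesting depth of <table> tags
        self._parts = []   # pieces of the table currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._depth += 1
        if self._depth:
            self._parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if self._depth:
            self._parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self._depth:
            return
        self._parts.append(f"</{tag}>")
        if tag == "table":
            self._depth -= 1
            if self._depth == 0:
                self.tables.append("".join(self._parts))
                self._parts = []

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)

    def handle_entityref(self, name):
        if self._depth:
            self._parts.append(f"&{name};")

    def handle_charref(self, name):
        if self._depth:
            self._parts.append(f"&#{name};")


def extract_tables(page_html):
    """Return a list with the HTML source of each top-level table."""
    parser = TableExtractor()
    parser.feed(page_html)
    parser.close()
    return parser.tables


def fetch_tables(url, encoding="utf-8"):
    """Download a rendered page and return its tables (encoding is assumed)."""
    with urlopen(url) as response:
        return extract_tables(response.read().decode(encoding))
```

Nested tables are kept inside their outer table rather than being listed separately. BeautifulSoup is more forgiving with malformed markup, so it is probably the better choice if it is available.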

Generating & Merging PDF Files in Python

I want to automatically generate booking confirmation PDF files in Python. Most of the content will be static (i.e. logos, booking terms, phone numbers), with a few dynamic bits (dates, costs, etc).
From the user side, the simplest way to do this would be to start with a PDF file with the static content, and then using python to just add the dynamic parts. Is this a simple process?
From doing a bit of search, it seems that I can use reportlab for creating content and pyPdf for merging PDF's together. Is this the best approach? Or is there a really funky way that I haven't come across yet?
Thanks!
"From the user side, the simplest way to do this would be to start with a PDF file with the static content, and then using python to just add the dynamic parts. Is this a simple process?"
Unfortunately no. There are several tools that are good at producing PDFs from scratch (most commonly for Python, ReportLab), but they don't generally load existing PDFs. You would have to include generating code for any boilerplate text, lines, blocks, shapes and images, rather than this being freely editable by the user.
On the other side there's pyPdf which can load PDFs, collate the pages, and extract some of the information, but can't really add new content. You can ‘merge’ pages into one, but you'd still have to create the extra information overlay as a page in ReportLab first.
Look into docutils and reStructuredText. You could quickly write out your PDF document in reST and then compile the PDF using rst2pdf.py.
I've used this; it creates very beautiful documents and the markup is extensible! Later you could take the same source and run it through rst2html to create a website out of it!
Take a look here:
http://docutils.sourceforge.net/docs/user/rst/quickref.html
http://code.google.com/p/rst2pdf/
Good luck
You could generate a document through, for example, TeX, or OpenOffice, or whatever gives you the most comfortable bindings and then print the document with a pdf printer.
This allows you not to have to figure out where to put fields precisely or figure out what to do if your content overflows the space allocated for it.
