Adding text over existing PDFs using reportlab - python

I'm interested in filling out existing PDF forms programmatically. All I really need to do is pull information from user input and then place the appropriate text over an existing PDF in the appropriate locations. I can already do this with reportlab by feeding the same sheet of paper into a printer twice, but this really rubs me the wrong way.
I'm tempted to just personally reverse engineer each existing PDF and draw every line and character myself before adding the user-inputted text, but I wanted to check to see if there was an easy way to take an existing PDF and set it as a background for some extra text. I'd really prefer to use python as it's the only language I feel comfortable with.
I also realize that I could just scan the document itself and use the resulting raster image as a background, but I would prefer the precision of vector graphics.
It seems like ReportLab has a commercial product with this functionality, and the specific function I'm looking for is in it (copyPages), but it seems like overkill to pay for a four-figure product for a single, simple function for a nonprofit use.

If the PDF forms are real AcroForms, you can use iText to fill them. I don't know whether there are ports other than iText (Java, the original) and iTextSharp (C#), but it's easy to use and free if you don't mind open-sourcing your solution. You can take a look at this sample code or the following Java snippet:
String formFile = "/path/to/myform.pdf";
String newFile = "/path/to/output.pdf";
PdfReader reader = new PdfReader(formFile);
FileOutputStream outStream = new FileOutputStream(newFile);
PdfStamper stamper = new PdfStamper(reader, outStream);
AcroFields fields = stamper.getAcroFields();
// fill the form
fields.setField("name", "Shane");
fields.setField("url", "http://stackoverflow.com");
// PDF infos
HashMap<String, String> infoDoc = new HashMap<String, String>();
infoDoc.put("Title", "your title here");
infoDoc.put("Author", "JRE ;)");
stamper.setMoreInfo(infoDoc);
// Flatten the PDF & cleanup
stamper.setFormFlattening(true);
stamper.close();
reader.close();
outStream.close();

If you just want to add some text to a pre-printed form, you can scan it as a JPG and then use that JPG as the background.
See page 15 of the reportlab manual; you just call drawImage.

It sounds like you just need to place an existing PDF in the background of a ReportLab PDF that you are generating. The free pdfrw library can do this easily. Take a look at the Example Tools page for some specific examples of this technique.
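A minimal sketch of that approach, assuming pdfrw and reportlab are both installed: reportlab draws only the new text onto an empty page, and pdfrw stamps that page on top of the untouched form. The function names, the field positions and the US Letter page size are my own assumptions for illustration, so measure the real form before relying on them.

```python
from io import BytesIO

POINTS_PER_INCH = 72.0


def from_top_left(x_in, y_in, page_height_pt):
    """Convert inches measured from the top-left corner to PDF points.

    PDF user space has its origin at the bottom-left, so the y axis flips.
    """
    return x_in * POINTS_PER_INCH, page_height_pt - y_in * POINTS_PER_INCH


def fill_form(template_path, output_path, fields, page_size=(612.0, 792.0)):
    """Draw text over page 1 of template_path and write output_path.

    fields: list of (x_in, y_in, text) tuples. The positions are
    hypothetical and have to be measured on the real form.
    """
    # Third-party imports stay local so the helper above works without them.
    from pdfrw import PageMerge, PdfReader, PdfWriter
    from reportlab.pdfgen import canvas

    # 1. Draw only the new text onto an otherwise empty in-memory page.
    buffer = BytesIO()
    overlay = canvas.Canvas(buffer, pagesize=page_size)
    overlay.setFont("Helvetica", 11)
    for x_in, y_in, text in fields:
        x, y = from_top_left(x_in, y_in, page_size[1])
        overlay.drawString(x, y, text)
    overlay.save()

    # 2. Stamp that page on top of the existing form, which stays vector.
    template = PdfReader(template_path)
    stamp = PdfReader(fdata=buffer.getvalue()).pages[0]
    PageMerge(template.pages[0]).add(stamp).render()
    PdfWriter(output_path, trailer=template).write()
```

Something like fill_form("form.pdf", "filled.pdf", [(1.5, 2.0, "Shane")]) would then write the name 1.5 inches from the left edge and 2 inches from the top of the first page.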

Related

Converting PDF to any parse-able format

I have a PDF file which consists of tables which can spread across various pages and may have text in between. An example of it can be found here.
I am able to convert the PDF to any format but the output files are not in any way parse-able i.e. I cannot extract data out of it as they are scattered. Here are the links to the output files which I created using pdftotext and pdftohtml.
Is there a way to extract data in a more suitable way?
Thanks in advance.
The general answer is no. pdf is a format intended for visual presentation and printing, and there is no guarantee that the contents will be in any particular order let alone structured as a table in any way other than what appears when the pdf is rendered onto paper or a screen. Sometimes there is even deliberate obfuscation to prevent anyone doing what you are attempting.
In this case it appears to be possible to cut and paste the contents of each table element. For a small number of similar files that is almost certainly the quickest thing to do. Open the pdf on the left hand of your screen, a spreadsheet or data-entry program on the right hand, then cut and paste. For a medium number - tens, hundreds? - it's probably cheapest to hire a temp to do the donkey-work. For a large number - thousands? - it would be possible to create a program to automate this process, but definitely not easy. I might think about using human input via the mouse to identify the corners of the table and the horizontal / vertical divisions, then generating cut and paste operations via control of the human interface devices. Don't ask me how. I'd have to find out if I had to do this, and I'd much rather not. It's a WOMBAT.
Whatever form of analysis you did on the pdf contents would certainly not generalize to other pdfs created by different organisations using different software, and possibly not even by the same organisation using the same process but merely a later release of the same software.
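For what it's worth, once the positions of the vertical divisions are known (however they were obtained, by hand or otherwise), assigning words to columns is the easy part. A rough sketch, assuming you already have each word's x coordinate for one table row; the function name and the input format are hypothetical, and as noted above the divider positions would not carry over to a differently produced PDF.

```python
from bisect import bisect_right


def split_into_cells(words, dividers):
    """Assign each (x, text) word to the column whose x-range contains it.

    dividers: sorted x positions of the vertical lines between columns.
    Returns one string per column, with words joined left to right.
    """
    cells = [[] for _ in range(len(dividers) + 1)]
    for x, text in sorted(words):
        # bisect_right counts how many dividers lie at or left of x,
        # which is exactly the index of the column the word falls in.
        cells[bisect_right(dividers, x)].append(text)
    return [" ".join(cell) for cell in cells]


row = [(300.0, "42"), (50.0, "Widget"), (95.0, "A"), (180.0, "blue")]
print(split_into_cells(row, dividers=[150.0, 250.0]))
```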
Following in the line of @nigel222, it really depends on the PDF how easily you can get the data out in some useful way.
It is best if the PDF is structured (has a document structure, created when the PDF was written). In this case, you can access the structure, and you are all set.
As structure is a fundamental necessity of an accessible PDF, you may try to "massage" the document by applying the various "make accessible" utilities floating around; definitely something to follow.

Modify and create PDF using Python

I have created a really nice looking invitation letter in word (.doc/.docx). Now, I need to personalize it for 1,000 people with their names and associated QR codes.
I tried working with pyfpdf and reportlab but it seems like in order to use these packages I have to re-generate the whole invitation letter along with text and graphics. I'm not sure if I will be able to generate an equally visually appealing letter as I have now in word (at least not without a lot of effort).
Is there a better way, where I use word template as input, fill-in the name and QR code and generate PDF?
If you are prepared to do the QR code and personalization in reportlab, then pdfrw (disclaimer: I am the author) will let you either merge the PDFs after the fact (similar to a watermarking operation), or bring the PDF you generate from Word into reportlab as a form XObject (similar to an image). You can use it for a background.
You should try using the Microsoft Word MailMerge feature which will probably do exactly what you want from within Word itself.
PDF editing is a very complex beast, as is docx editing. The majority of companies who offer PDF "support" use PDF APIs, since the software to edit and create PDF documents is so complex it's a retailable product in itself.
You can use MailMerge either to print or to email the PDF to lots of people at once with custom settings for each person.

how to create docx files with python

I am trying to take my data and put it in tables in either Microsoft Word or LibreOffice Writer.
I need to be able to change the background of cells within the table and I need to be able to change the page property to 'landscape'.
I have been looking for a library with simple code (I am a beginner in coding) but I did not find one for what I need to do.
Have you heard of anything that would work for me? If there are examples of how to use it, that would make it easier for me to learn.
Check out this project
And here is a great quick-start guide
It's pretty simple to use. I haven't tested this, but it should work:
from docx import Document

document = Document()
r = 2  # number of rows you want
c = 2  # number of columns you want
table = document.add_table(rows=r, cols=c)
# Set your style; look at the help documentation for the available names
table.style = 'LightShading-Accent1'
for y in range(r):
    for x in range(c):
        cell = table.cell(y, x)  # look up the cell at row y, column x
        cell.text = 'text goes here'
document.save('demo.docx')  # save the document
I don't think you can set the page orientation property with this library, but what you could do is create a blank Word document that is in landscape yourself, store it in the working directory and make a copy of it every time you generate this document.
The previous answer is a good one, but there is another way: create the document in Word then hack the xml in Python to insert the content you want. I have done this several times. In fact, my current invoicing program works this way.
Disadvantages: Conditional formatting and numbered lists will require some real xml knowledge.
Advantages: No limitations or intricate style definitions to manage. EVERYTHING is supported, because it's all done in Word.
Here's the workflow:
create a *.docx document with marker text where you want your headings, table cells, etc.
Use lxml to find those markers and copy their parent elements (along with their formatting).
Use those found elements to create templates
Insert your data into those templates, and assemble the whole thing like a jigsaw puzzle.
Zip your xml into a new *.docx file.
I have a short sample project showing the procedure. Docx2Python does most of the work for you.
https://github.com/ShayHill/replace_docx_tables
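To give a feel for the last steps of that workflow, here is a stripped-down sketch using only the standard library. It swaps the lxml element copying for a plain text replace inside word/document.xml, which is a much cruder technique than the one in the linked project: it only works when Word has stored each marker as one unbroken piece of text, since Word sometimes splits text across several runs. The function name and the {{NAME}} marker style are my own assumptions.

```python
import zipfile
from xml.sax.saxutils import escape


def fill_docx(template_path, output_path, values):
    """Copy a .docx archive, replacing marker text in word/document.xml.

    values: dict mapping marker strings (for example '{{NAME}}') to the
    text that should replace them.
    """
    with zipfile.ZipFile(template_path) as src:
        with zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as dst:
            for item in src.infolist():
                data = src.read(item.filename)
                if item.filename == "word/document.xml":
                    xml = data.decode("utf-8")
                    for marker, value in values.items():
                        # Escape &, < and > so the result is still valid XML.
                        xml = xml.replace(marker, escape(value))
                    data = xml.encode("utf-8")
                # Every other part of the archive is copied unchanged.
                dst.writestr(item, data)
```

A call such as fill_docx("template.docx", "invoice.docx", {"{{NAME}}": "Smith & Sons"}) leaves styles, headers and images alone, because only the main document part is touched.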

Generating & Merging PDF Files in Python

I want to automatically generate booking confirmation PDF files in Python. Most of the content will be static (i.e. logos, booking terms, phone numbers), with a few dynamic bits (dates, costs, etc).
From the user side, the simplest way to do this would be to start with a PDF file with the static content, and then using python to just add the dynamic parts. Is this a simple process?
From doing a bit of search, it seems that I can use reportlab for creating content and pyPdf for merging PDF's together. Is this the best approach? Or is there a really funky way that I haven't come across yet?
Thanks!
"From the user side, the simplest way to do this would be to start with a PDF file with the static content, and then using python to just add the dynamic parts. Is this a simple process?"
Unfortunately no. There are several tools that are good at producing PDFs from scratch (most commonly for Python, ReportLab), but they don't generally load existing PDFs. You would have to include generating code for any boilerplate text, lines, blocks, shapes and images, rather than this being freely editable by the user.
On the other side there's pyPdf which can load PDFs, collate the pages, and extract some of the information, but can't really add new content. You can ‘merge’ pages into one, but you'd still have to create the extra information overlay as a page in ReportLab first.
Look into docutils and reStructuredText. You could quickly write out your PDF document in reST and then compile the PDF using rst2pdf.py
I've used this; it creates very beautiful documents and the markup is extensible! Later you could take the same source and run it through rst2html to create a website out of it!
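For the booking confirmation case, a rough sketch of how that could look: the static text lives in a reST template, Python fills in the dynamic bits, and the rst2pdf command compiles the result. The template wording, field names and function names are made up for illustration, and build_pdf assumes rst2pdf is installed and on your PATH.

```python
import subprocess
from string import Template

# Static content with $placeholders for the dynamic fields (hypothetical).
CONFIRMATION = Template("""\
Booking confirmation
====================

:Guest: $guest
:Arrival: $arrival
:Total: $total

Please bring this confirmation with you.
""")


def render_confirmation(guest, arrival, total):
    """Fill the dynamic fields into the static reST text."""
    return CONFIRMATION.substitute(guest=guest, arrival=arrival, total=total)


def build_pdf(rst_text, rst_path, pdf_path):
    """Write the reST source to disk and compile it with rst2pdf."""
    with open(rst_path, "w", encoding="utf-8") as handle:
        handle.write(rst_text)
    # Requires the rst2pdf command-line tool to be installed.
    subprocess.run(["rst2pdf", rst_path, "-o", pdf_path], check=True)
```

You would call render_confirmation once per booking and pass the result to build_pdf; logos and layout then go into an rst2pdf stylesheet rather than into the Python code.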
Take a look here:
http://docutils.sourceforge.net/docs/user/rst/quickref.html
http://code.google.com/p/rst2pdf/
Good luck
You could generate a document through, for example, TeX, or OpenOffice, or whatever gives you the most comfortable bindings and then print the document with a pdf printer.
This allows you not to have to figure out where to put fields precisely or figure out what to do if your content overflows the space allocated for it.

how to extract formatted text content from PDF

How can I extract the text content (not images) from a PDF while (roughly) maintaining the style and layout like Google Docs can?
To extract the text from the PDF AND get its position you can use PDFMiner. PDFMiner can also export the PDF directly to HTML, keeping the text in the right position.
I don't know your use case, but there are a lot of problems you can encounter when doing this, because PDF is presentation oriented rather than content oriented and the text flow is not continuous. So, if you want the text to be editable, it will not be an easy task.
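To illustrate the "text flow is not continuous" problem: the fragments usually come out as independent pieces with coordinates, in no particular order, and you have to rebuild the lines yourself. A rough sketch of one way to do that, assuming you already have (x, y, text) tuples from whatever extractor you use; this input format and the tolerance value are my own assumptions, not PDFMiner's API.

```python
def group_into_lines(fragments, y_tolerance=2.0):
    """Rebuild a rough reading order from (x, y, text) fragments.

    y is measured from the bottom of the page, as in PDF user space, so a
    larger y means higher on the page. Fragments whose y values differ by
    at most y_tolerance are treated as belonging to the same line.
    """
    lines = []  # each entry is [reference_y, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within each line, order the fragments from left to right.
    return [" ".join(t for _, t in sorted(items)) for _, items in lines]


fragments = [
    (200.0, 700.5, "world"),
    (72.0, 700.0, "Hello"),
    (72.0, 680.0, "Second"),
    (130.0, 680.0, "line"),
]
print(group_into_lines(fragments))
```

This handles simple single-column pages; multi-column layouts, rotated text and tables need more than a y-coordinate comparison.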
Have you tried pyPDF or ReportLab PDF libraries? I personally have not used them but you can have a go at them. here is useful too
Xpdf has a utility called pdftotext that does a great job. http://foolabs.com/xpdf/download.html
If you want to do it just like Google:
Google converts the PDF to an image, and then overlays the image, where text used to be, with JavaScript highlightable areas (which is about like Voodoo magic). The areas appear to be text when you scroll over them with your cursor, but they're not. This might not help you to know, but that's how they do it. If you want to reverse engineer it, you might start with https://www.mercurial-scm.org/ On the home page, they do the same thing with JavaScript to make the text highlightable and copyable. You can extract the text from the PDF, and find its location on the page with one of the libraries mentioned in the other answers. Then you can overlay an extracted image of the file with the same style of JavaScript areas.
If you don't have your heart set on doing this with python, Ghostscript can do this for you. Check out pdf2ascii (a script that comes with GS) to get the plain text. Styles are more complicated as they can be specified in a few different ways.
Acrobat Professional can do the job. In the "File" menu, choose export. Then, choose Text.
