Create PDF from Word template in Python - python

I need to generate reports in Python. The report needs to contain a header and footer on each page, some text that won't change, some text that's dynamic, and some charts.
I've created a template using Word, and I'm looking for a way of replacing placeholders such as [+my_placeholder+] with text content/charts/whatever.
Is there anything that can let me use Word documents (or something Word can write) as a template to create a PDF in Python? Since I've already created sample reports in Word I want to reuse what I've got instead of having to recreate them using ReportLab or HTML (I know about ReportLab, pyPDF and also xhtml2pdf).

For the first part, editing Word templates, have a look at this question: Reading/Writing MS Word files in Python
There is a package for editing Word files called python-docx.
For converting Word files into PDF documents, there are a variety of tools, including command-line tools. So you could use Python to edit your document and then call a converter tool from your Python code to create the PDF.
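The two steps described above could look roughly like the sketch below. It assumes python-docx for the editing step and LibreOffice's soffice command-line converter for the PDF step; the answer does not name a specific converter, so that choice is an assumption, and any other converter could be swapped in. The [+name+] placeholder syntax comes from the question; the function names are made up for illustration.

```python
import re
import subprocess

# Placeholder syntax from the question, e.g. [+my_placeholder+]
PLACEHOLDER = re.compile(r"\[\+(\w+)\+\]")


def fill_text(text, values):
    """Replace each [+key+] with values[key]; unknown keys are left as-is."""
    return PLACEHOLDER.sub(lambda m: str(values.get(m.group(1), m.group(0))), text)


def fill_docx(template_path, output_path, values):
    """Fill placeholders in the body paragraphs and save a new .docx."""
    from docx import Document  # third-party: pip install python-docx

    doc = Document(template_path)
    for paragraph in doc.paragraphs:
        for run in paragraph.runs:
            # Only works when a placeholder sits entirely inside one run
            run.text = fill_text(run.text, values)
    doc.save(output_path)


def pdf_command(docx_path, out_dir):
    """Build the LibreOffice command line (assumes soffice is on PATH)."""
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(docx_path)]


def docx_to_pdf(docx_path, out_dir):
    """Run the converter; raises CalledProcessError if it fails."""
    subprocess.run(pdf_command(docx_path, out_dir), check=True)
```

Two limits of this sketch: Word sometimes splits a placeholder across several runs, in which case it will not be matched, and text in headers, footers, and tables has to be walked separately. Charts cannot be inserted by text replacement at all.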

Related

Python - Split multi document pdf by keyword and save to its own pdf file

I have to take a PDF file that contains multiple documents, identify where each document starts and ends using provided key phrases, and then save each document to a separate PDF. I'm pretty sure I need the PyPDF2 module to split the PDF into separate files, but I'm not sure how to identify the start and end of each document from the key phrases. Code would help, but mostly I need clarity on the approach.
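One possible approach, sketched under assumptions: extract the text of every page, treat each page containing the start phrase as the first page of a new document, and write each page range to its own file. The sketch assumes a single start phrase and no separate end phrase, and it uses the PyPDF2 2.x class names (PdfReader, PdfWriter). The function names and the output file pattern are hypothetical.

```python
def split_ranges(page_texts, start_phrase):
    """Return (first_page, last_page) index pairs, one per document.

    A new document starts on every page whose text contains start_phrase.
    Pages before the first match are not assigned to any document.
    """
    starts = [i for i, text in enumerate(page_texts) if start_phrase in text]
    ends = [s - 1 for s in starts[1:]] + [len(page_texts) - 1]
    return list(zip(starts, ends))


def split_pdf(path, start_phrase, out_pattern="document_{:03d}.pdf"):
    """Write each detected document to its own PDF file."""
    from PyPDF2 import PdfReader, PdfWriter  # third-party: pip install PyPDF2

    reader = PdfReader(path)
    # extract_text() may return nothing for image-only pages
    texts = [page.extract_text() or "" for page in reader.pages]
    for n, (first, last) in enumerate(split_ranges(texts, start_phrase), start=1):
        writer = PdfWriter()
        for i in range(first, last + 1):
            writer.add_page(reader.pages[i])
        with open(out_pattern.format(n), "wb") as fh:
            writer.write(fh)
```

Keeping the range detection separate from the PDF handling makes it easy to check the key-phrase logic on plain strings before touching any files.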

Extracting content from DOCX into Python Code

I have been learning how to create DOCX files using Python.
However, I have a document whose regular editing I want to automate using Python. The editing (deleting or adding) needs to be based on terms found in an Excel spreadsheet.
The document is around 25 pages long, with different formats, tables, paragraphs, headings, and some images. Is there a way to extract all of this into Python code, so that I can then use terms from an Excel spreadsheet to decide what to print or leave in the DOCX file?
Main concern is DOCX content --> Python CODE
Example:
If the document I was reading contained only a paragraph saying "Test",
the tool would generate new code that reads:
document.add_paragraph('Test')
It depends on what you want to do with the text. If you want to put it back "in place" in a docx, have a look at python-docx or edit the XML itself.
If you're willing to rebuild the document tree structure from a pile of text, several Python libraries will pull the text for you (python-docx, docx2txt, docx2python).
Here's how you might edit the text with docx2python:
from docx2python import docx2python
from docx2python.iterators import enum_paragraphs
content = docx2python('input.docx').document
# transforming_function is a placeholder for your own text-editing function
for (i, j, k), paragraph in enum_paragraphs(content):
    content[i][j][k] = transforming_function(paragraph)
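For the literal "DOCX content to Python code" case in the question, a minimal sketch could read the paragraphs with python-docx and emit one document.add_paragraph(...) line per paragraph. This is an illustration, not a feature of any of the libraries above: the function names are hypothetical, and only plain body paragraphs are handled (tables, heading styles, and images are ignored).

```python
def paragraphs_to_code(texts, var="document"):
    """Turn paragraph strings into python-docx source code, one call per paragraph."""
    lines = ["from docx import Document", "", f"{var} = Document()"]
    # repr() produces a correctly quoted and escaped Python string literal
    lines += [f"{var}.add_paragraph({text!r})" for text in texts]
    return "\n".join(lines)


def docx_to_code(path):
    """Read body paragraphs from a .docx and return code that recreates them."""
    from docx import Document  # third-party: pip install python-docx

    return paragraphs_to_code(p.text for p in Document(path).paragraphs)
```

For a document containing only the paragraph "Test", the last generated line is document.add_paragraph('Test'), matching the example in the question.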

Extracting text from text documents with Poedit

I am making a quiz app which reads data from text files. The app works fine, but I now want to translate it into English (from my native language). I can do that for strings defined in source files (.py), such as text on buttons, but I have trouble extracting the translatable text from the text documents that hold all my questions and possible answers.
I am using the gettext module with Python, with _ or _( as the keyword marking translatable strings (which I have set in Poedit under Properties - Sources Keywords).
I have also set paths of my translatable sources to . (all files in that directory) and even tried setting those .txt files specifically for extracting.
My text file looks like this (one line of one file):
_(Koliko je 2/0?);_(0):_(ni definirano):_(2);_(ni definirano)
I tried to find which document types Poedit extracts text from but did not find anything other than "from source" - should it be able to extract from .txt files or not? If not, how should I name them?
As I said, it does extract strings from my .py files, so it is working otherwise.
Poedit can't magically know the syntax of your homegrown file format, so simply adding .txt files can't possibly do anything. You'll have to write a custom extractor (see how xgettext works for reference) or switch to some standard syntax:
Be sufficiently similar to a supported programming language, such as C (where as luck would have it, ; and : are both valid syntax elements, although using e.g. , instead of : would be safer):
_("Koliko je 2/0?");_("0"):_("ni definirano"):_("2");_("ni definirano")
Use an XML-based format, for which xgettext supports extraction rules described with ITS.
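As a rough idea of what a custom extractor for the original file format might look like, the sketch below pulls every _(...) token out of a line and writes the strings as gettext template entries. It assumes the marked strings never contain a closing parenthesis, and it omits the header entry that a complete template file normally starts with. All function names are hypothetical.

```python
import re

# Matches _(text) as used in the question's data files; text must not contain ")"
MARKED = re.compile(r"_\(([^)]*)\)")


def extract_strings(line):
    """Return the marked strings of one line, in order of appearance."""
    return MARKED.findall(line)


def po_escape(s):
    """Escape backslashes and double quotes for a msgid line."""
    return s.replace("\\", "\\\\").replace('"', '\\"')


def to_pot(strings):
    """Format unique, non-empty strings as msgid/msgstr entries."""
    seen = []
    for s in strings:
        if s and s not in seen:
            seen.append(s)
    return "\n".join(f'msgid "{po_escape(s)}"\nmsgstr ""\n' for s in seen)
```

Applied to the sample line from the question, extract_strings yields five strings, of which four are unique and end up in the template.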
I was confronted with the same problem when trying to extract strings to be translated from an options.txt file in a WordPress plugin. The only solution I found was to copy that options.txt file to options.php, which PoEdit was able to search for strings. When the translation operation is finished, the options.txt file can then be deleted.

Python Slate Library: PDF text extraction concatenating words

Just trying to extract the text from a PDF in Python, using the Slate Library and PyPDF2. Unfortunately some PDFs are being output with multiple words merged/concatenated together. This seems to happen intermittently, for example for some PDFs words are extracted with the spaces between them correctly, whereas others are not.
One example of a PDF where words are not extracted correctly is included and available for download (SO wouldn't let me upload it) here. The output from
slate.PDF(open(name, 'rb')).text()
is (or at least a segment is):
,notonadhocprocedures,andcanbeusedwithdatacollectedatmul-tiplespatialresolutions(Kulldorff1999).Ifdataontheabundanceofataxonovertimeareavailable,thesedatacanbeincorporatedintoanSTPSanalysistoincreasethesensitivityandreliabilityofthemodeltodetectsightingclusters,
where of course the first comma-separated token should be "not on ad hoc procedures".
Does anybody know why this is happening, or have a better idea of a library to use for PDF text extraction?
Thanks for the help!

Modify and create PDF using Python

I have created a really nice looking invitation letter in Word (.doc/.docx). Now, I need to personalize it for 1,000 people with their names and associated QR codes.
I tried working with pyfpdf and reportlab, but it seems that in order to use these packages I have to regenerate the whole invitation letter along with its text and graphics. I'm not sure I will be able to generate a letter as visually appealing as the one I have now in Word (at least not without a lot of effort).
Is there a better way, where I use the Word template as input, fill in the name and QR code, and generate a PDF?
If you are prepared to do the QR code and personalization in reportlab, then pdfrw (disclaimer: I am the author) will let you either merge the PDFs after the fact (similar to a watermarking operation), or bring the PDF you generate from Word into reportlab as a form XObject (similar to an image), so that you can use it as a background.
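A minimal sketch of the first option (merging after the fact) might look like this. It assumes the Word letter has already been exported once to a single-page PDF, draws only the recipient's name with reportlab (the QR code is left out), and follows the pattern of pdfrw's watermark example. The coordinates, font, file names, and helper names are all placeholders, and the overlay page size should match the template's.

```python
import re


def safe_filename(name, index):
    """Build a filesystem-friendly output name for one recipient."""
    slug = re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_")
    return f"invitation_{index:04d}_{slug}.pdf"


def make_overlay(path, name, x=72, y=720):
    """Create a one-page PDF containing only the recipient's name."""
    from reportlab.pdfgen import canvas  # third-party: pip install reportlab

    c = canvas.Canvas(path)  # default page size; adjust to match the letter
    c.setFont("Helvetica", 14)
    c.drawString(x, y, name)  # x, y in points from the bottom-left corner
    c.save()


def merge_overlay(template_pdf, overlay_pdf, out_path):
    """Stamp the overlay on top of the first page of the exported letter."""
    from pdfrw import PdfReader, PdfWriter, PageMerge  # pip install pdfrw

    overlay = PageMerge().add(PdfReader(overlay_pdf).pages[0])[0]
    template = PdfReader(template_pdf)
    PageMerge(template.pages[0]).add(overlay).render()
    PdfWriter(out_path, trailer=template).write()
```

Looping over the 1,000 recipients then comes down to calling make_overlay and merge_overlay once per name, with safe_filename supplying the output path.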
You should try using the Microsoft Word MailMerge feature which will probably do exactly what you want from within Word itself.
PDF editing is a very complex beast, as is docx editing. The majority of companies who offer PDF "support" use PDF APIs, since the software to edit and create PDF documents is so complex it's a retailable product in itself.
You can use MailMerge either to print or to email the PDF to lots of people at once with custom settings for each person.
