How to create docx files with Python

I am trying to take my data and put it in tables in either Microsoft Word or LibreOffice Writer.
I need to be able to change the background color of cells within the table, and I need to be able to set the page orientation to 'landscape'.
I have been looking for a library with simple code (I am a beginner in coding), but I have not found one that does what I need.
Do you know of one? Examples of how to use it would make it easier for me to learn.

Check out this project
And here is a great quick-start guide
It's pretty simple to use. I haven't tested this, but it should work:
from docx import Document

document = Document()
r = 2  # number of rows you want
c = 2  # number of columns you want
table = document.add_table(rows=r, cols=c)
# Set your table style; recent python-docx releases look styles up by display name.
# See the library documentation for the available styles.
table.style = 'Light Shading Accent 1'
for y in range(r):
    for x in range(c):
        cell = table.cell(y, x)  # look up the cell at row y, column x
        cell.text = 'text goes here'
document.save('demo.docx')  # save the document
I don't think you can set the page orientation property with this library, but you could create a blank Word document in landscape yourself, store it in the working directory, and make a copy of it every time you generate this document.

The previous answer is a good one, but there is another way: create the document in Word then hack the xml in Python to insert the content you want. I have done this several times. In fact, my current invoicing program works this way.
Disadvantages: Conditional formatting and numbered lists will require some real xml knowledge.
Advantages: No limitations or intricate style definitions to manage. EVERYTHING is supported, because it's all done in Word.
Here's the workflow:
1. Create a *.docx document with marker text where you want your headings, table cells, etc.
2. Use lxml to find those markers and copy their parent elements (along with their formatting).
3. Use those found elements to create templates.
4. Insert your data into those templates, and assemble the whole thing like a jigsaw puzzle.
5. Zip your XML into a new *.docx file.
I have a short sample project showing the procedure. Docx2Python does most of the work for you.
https://github.com/ShayHill/replace_docx_tables
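The unzip, edit, and re-zip part of that workflow can be sketched with only the standard library. This is not the code from the linked project, and it swaps the element-copying step for plain text substitution. It assumes each marker (the `{{name}}` syntax is just a convention chosen here) sits whole inside a single text run, which Word does not guarantee, and that `word/document.xml` is UTF-8 encoded. The function name `fill_markers` is hypothetical:

```python
import zipfile
from xml.sax.saxutils import escape


def fill_markers(template_path, output_path, values):
    """Copy a .docx archive, substituting marker text in word/document.xml.

    `values` maps marker strings (e.g. "{{name}}") to replacement text.
    """
    with zipfile.ZipFile(template_path) as src, \
            zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                xml = data.decode("utf-8")
                for marker, text in values.items():
                    # Escape &, < and > so the result stays well-formed XML.
                    xml = xml.replace(marker, escape(text))
                data = xml.encode("utf-8")
            # Every other part (styles, images, relationships) is copied as-is.
            dst.writestr(item.filename, data)
```

For example, `fill_markers('template.docx', 'invoice.docx', {'{{name}}': 'Alice'})` would write a copy of the template with the marker replaced. Repeating rows or whole tables needs the element-level approach described above.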

Related

How to copy-paste docx (containing text, image and formatting) to another docx template in a specific spot?

I'm currently struggling with the following problem:
I'm working on a project that requires me to generate documents (docx) for auto-documentation purposes; they contain text, images, and formatting. Each part of my documentation (there are almost 15 parts) is generated as its own standalone docx.
I have a global documentation template (docx) that needs to be filled with these standalone documents at specific locations (for technical reasons, I can't generate the complete document directly).
Do you know a way to copy and paste these standalone documents into the global documentation template?
Thank you!
I tried using Quick Parts from Word with the MailMerge package, but it seems that I can only fill these Quick Parts with text.
I found ways to copy one docx into another wholesale, but I didn't find a way to paste it at a specific spot.

Missing document text when using python-docx

I am using python-docx 0.8.6 and Python 3.6 to perform a simple search/replace operation.
I'm having a problem where not all of the document's text appears when iterating over doc.paragraphs.
For debugging I have tried:
from docx import Document

doc = Document(input_file)  # input_file is the path to the .docx
fullText = []
for para in doc.paragraphs:
    fullText.append(para.text)
print('\n'.join(fullText))
Which only seems to print out about half of the file's contents.
There are no tables or special formatting in the file. Is there any reason why so much of the document's contents cannot be read by python-docx?
Edit: the missing text is contained within a mail merge field if that makes any difference
The mail merge field does make a difference. Unfortunately, python-docx is not sophisticated enough to know which "container" elements hold displayable text and which do not. So it only reports paragraphs (and tables) that are at the "top" level.
This is also a limitation when it comes to revision marks, for example, which have two or more pieces of text of which only one appears, depending on the revision marks setting (show original, show latest after edits, etc.).
The only way around it with python-docx is to navigate the XML yourself, although some of the domain objects in python-docx can be handy, like Paragraph, etc. once you've gotten hold of the elements you want.
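As a rough illustration of navigating the XML yourself, the sketch below skips python-docx entirely and reads `word/document.xml` with the standard library, collecting every `w:t` text node under each paragraph regardless of what wraps it. The function name `all_paragraph_text` is made up for this example. It ignores tabs and line breaks, and text inside a text box nested in a paragraph would be reported twice:

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace, in ElementTree's {uri}tag notation.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def all_paragraph_text(docx_path):
    """Return the text of every w:p element, however deeply it is nested."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W + "p"):
        # Join every w:t descendant, including runs wrapped in container
        # elements such as w:fldSimple (merge fields), w:ins or w:hyperlink.
        paragraphs.append("".join(t.text or "" for t in p.iter(W + "t")))
    return paragraphs
```

Deleted revision text lives in `w:delText` rather than `w:t`, so this returns the "latest" version of tracked changes.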

Modify and create PDF using Python

I have created a really nice-looking invitation letter in Word (.doc/.docx). Now I need to personalize it for 1,000 people with their names and associated QR codes.
I tried working with pyfpdf and reportlab, but it seems that to use these packages I would have to regenerate the whole invitation letter, text and graphics included. I'm not sure I could produce a letter as visually appealing as the one I have now in Word (at least not without a lot of effort).
Is there a better way, where I use the Word template as input, fill in the name and QR code, and generate a PDF?
If you are prepared to do the QR code and personalization in reportlab, then pdfrw (disclaimer: I am the author) will let you either merge the PDFs after the fact (similar to a watermarking operation) or bring the PDF you generate from Word into reportlab as a form XObject (similar to an image), which you can then use as a background.
You should try using the Microsoft Word MailMerge feature which will probably do exactly what you want from within Word itself.
PDF editing is a very complex beast, as is docx editing. The majority of companies who offer PDF "support" use PDF APIs, since the software to edit and create PDF documents is so complex it's a retailable product in itself.
You can use MailMerge either to print or to email the PDF to lots of people at once with custom settings for each person.

Is there a way to automate specific data extraction from a number of pdf files and add them to an excel sheet?

I regularly have to go through a list of PDF files, search for specific data, and add it to an Excel sheet for later review. As there are around 50 PDF files per month, doing it manually is both time-consuming and frustrating.
Can the process be automated on Windows with Python or any other scripting language? I would like to put all the PDF files in a folder and run a script that generates an Excel sheet with all the data added. The PDF files I work with are tabular and have similar structures.
Yes. And no. And maybe.
The problem here is not extracting something from a PDF document. Extracting something is almost always possible and there are plenty of tools available to extract content from a PDF document. Text, images, whatever you need.
The major problem (and the reason for the "no" or "maybe") is that PDF in general is not a structured file format. It doesn't care about columns, paragraphs, tables, sentences or even words. In the general case it cares only about characters on a page in a specific location.
This means that in the general case you cannot query a PDF document and ask it for every paragraph or for the third sentence in the fifth paragraph. You can ask a library to get all of the text or all of the text in a specific location. And then you have to hope the library is able to extract the text you need in a legible format, because it doesn't even have to be the case that you can copy and paste or otherwise extract understandable characters from a PDF file. Many PDF files don't even contain enough information for that.
So... If you have a certain type of document and you can test that it predictably behaves a certain way with a certain extraction engine, then yes, you can extract information from a PDF file.
If the PDF files you receive are different all the time, or the layout on the page is totally different every time, then the answer is probably that you cannot reliably extract the information you want.
As a side note:
There are certain types of PDF documents that are easier to handle than others so if you're lucky that might make your life easier. Two examples:
Many PDF files will in fact contain textual information in such a way that it can be extracted in a legible way. PDF files that follow certain standards (such as PDF/A-1a, PDF/A-2a or PDF/A-2u etc...) are even required to be created this way.
Some PDF files are "tagged" which means they contain additional structural information that allows you to extract information in an easier and more meaningful way. This structure would in fact identify paragraphs, images, tables etc and if the tagging was done in a good way it could make the job of content extraction much easier.
You could use pdf2text2 in Python to extract data from your PDF.
Alternatively you can use pdftotext that is part of the Xpdf suite
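A rough sketch of the whole loop, assuming the `pdftotext` command-line tool is on your PATH. The field names and regular expressions in `FIELD_PATTERNS` are purely hypothetical placeholders for whatever your PDFs actually contain, and the output is a CSV file (which Excel opens directly) rather than a native .xlsx workbook:

```python
import csv
import re
import subprocess
from pathlib import Path

# Hypothetical field patterns; adjust them to the layout of your own PDFs.
FIELD_PATTERNS = {
    "invoice_no": re.compile(r"Invoice No\.?\s*:?\s*(\S+)"),
    "total": re.compile(r"Total\s*:?\s*([\d.,]+)"),
}


def pdf_to_text(pdf_path):
    """Run pdftotext and return the extracted text ('-' sends it to stdout)."""
    result = subprocess.run(
        ["pdftotext", "-layout", "-enc", "UTF-8", str(pdf_path), "-"],
        capture_output=True, encoding="utf-8", check=True,
    )
    return result.stdout


def parse_fields(text):
    """Return the first match of each pattern, or '' when a field is missing."""
    row = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        row[name] = match.group(1) if match else ""
    return row


def folder_to_csv(folder, csv_path):
    """Write one CSV row per PDF found in `folder`."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file", *FIELD_PATTERNS])
        writer.writeheader()
        for pdf in sorted(Path(folder).glob("*.pdf")):
            writer.writerow({"file": pdf.name, **parse_fields(pdf_to_text(pdf))})
```

Empty cells in the resulting sheet are a useful signal: they show which files did not behave predictably with this extraction engine and need a manual look.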

Adding text over existing PDFs using reportlab

I'm interested in filling out existing PDF forms programmatically. All I really need to do is pull information from user input and then place the appropriate text over an existing PDF in the appropriate locations. I can already do this with reportlab by feeding the same sheet of paper into a printer, twice, but this just really rubs me the wrong way.
I'm tempted to just personally reverse engineer each existing PDF and draw every line and character myself before adding the user-inputted text, but I wanted to check to see if there was an easy way to take an existing PDF and set it as a background for some extra text. I'd really prefer to use python as it's the only language I feel comfortable with.
I also realize that I could just scan the document itself and use the resulting raster image as a background, but I would prefer the precision of vector graphics.
It seems like ReportLab has a commercial product with this functionality, and the specific function I'm looking for is in it (copyPages), but it seems like overkill to pay for a four-figure product to get a single, simple function for a nonprofit use.
If the PDF forms are real AcroForms, you can use iText to fill them. I don't know whether there are ports other than iText (Java, the original) and iTextSharp (C#), but it's easy to use and free if you don't mind open-sourcing your solution. You can take a look at this sample code (Java snippet):
String formFile = "/path/to/myform.pdf";
String newFile = "/path/to/output.pdf";
PdfReader reader = new PdfReader(formFile);
FileOutputStream outStream = new FileOutputStream(newFile);
PdfStamper stamper = new PdfStamper(reader, outStream);
AcroFields fields = stamper.getAcroFields();
// fill the form
fields.setField("name", "Shane");
fields.setField("url", "http://stackoverflow.com");
// PDF infos
HashMap<String, String> infoDoc = new HashMap<String, String>();
infoDoc.put("Title", "your title here");
infoDoc.put("Author", "JRE ;)");
stamper.setMoreInfo(infoDoc);
// Flatten the PDF & cleanup
stamper.setFormFlattening(true);
stamper.close();
reader.close();
outStream.close();
If you just want to add some text over a pre-printed sheet, you can scan it as a JPG and then use that JPG as the background.
See page 15 of the ReportLab manual; just call drawImage.
It sounds like you just need to place an existing PDF in the background of a Reportlab PDF that you are generating. The free PDFRW library can do this easily. Take a look at the Example Tools page for some specific examples of this technique.
