Extracting content from DOCX into Python Code

Extracting content from DOCX into Python Code - python

I have been learning how to create DOCX files using Python.
However, I have a document that I want to automate the regular editing by using python. The editing (deleting or adding) needs to be conducted based on terms found in an excel spreadsheet.
The Document I have is around 25 pages, with different formats, tables, paragraphs, headings, and some images. Is there a way to extract all of this into python code, where I can then add terms based on an excel spreadsheet on what to print or leave in the docx file?
Main concern is DOCX content --> Python CODE
Example:
If the document I was reading only contains a paragraph saying "Test"
The code would generate a separate new code that would state:
document.add_paragraph('Test')

Depends what you want to do with the text. If you want to put it back "in-place" to a docx, you'll want to have a look at python-docx or edit the xml itself.
If you're willing to rebuild the tree document structure from some pile of text, several python libraries will pull the text for you (python-docx, docx2txt, docx2python).
Here's how you might edit the text in docx2python
from docx2python import docx2python
from docx2python.iterators import enum_paragraphs
content = docx2python('input.docx').document
for (i, j, k), paragraph in enum_paragraphs(content):
content[i][j][k] = transforming_function(paragraph)

Related

How to copy-paste docx (containing text, image and formatting) to another docx template in a specific spot?

I'm currently struggling with the following point :
I work on a project that needed me to generate Documents (docx) for autodocumentation purposes, they contain text/images/formatting. These documents are singulars, it means that each part of my documentation (there are almost 15 parts) has its docx.
I have a global documentation template (docx) that expects to be filled by these singular documents (for some technical reason, I just can't directly generate a complete document) at specific locations
**Do you know a way to copy-paste these singular documents in the global documentation template ?
Thank you !
I tried using QuickParts from Word + MailMerge package but it seems that I can only fill these QuickParts with text.
I found ways to copy-paste a docx to another in a raw manner, I didn't find a way to copy-paste in a specific spot.

python-docx cannot read a table inserted from excel

I need to deal with tables in many word files. Some of them are created in word table format, which can be read using python-docx.
However, some of them are inserted from excel. I don't know why python-docx cannot read them. Here is piece of code I wrote for test. As you can see in the terminal, there is nothings in the list variable 'tables'.
import docx
from docx import Document
docFile = 'a.docx'
document = Document(docFile)
tables = document.tables
print(tables)
Anyone can help? Thanks a lot!

I'm fighting the same issue using Pages on OSX to create a .docx template. I've found that Format > Arrange > Object Placement needs to be set to Move with text for the table, changing it to have any alignment or formatting causes the tables to disappear in python and be read as paragraphs that contain nothing. Looking at the XML of both and the python-docx code I'm suspicious of w:tblInd but I'm not clued up enough to go much further. I see recent GitHub issues covering this so hopefully will get sorted.
example on OSX:

How to retrieve highlighted sequences of an .epub file in Python?

I'm used to highlight the important parts of the .epub books I read on my Kobo reader, and I'd like to write a script extracting these highlighted parts, and saving them in a .txt file.
I've checked out epub library documentation, but I couldn't find anything relevant to my problem.
Can anyone give me some tips on how to select and save the highlighted parts of my epub files? Is there a specific tag I should search in the file?

The highlights and the annotations are stored in the KoboReader.sqlite file, not in each EPUB file.
You might want to check my script out:
http://github.com/pettarin/export-kobo
or search for Calibre plugins (there are several ones that can export highlights/annotations).

how to create docx files with python

I am trying to take my data and put it in tables in either microsoft words or libreoffice writer.
I need to be able to change the background of cells within the table and I need to be able to change the page property to 'landscape'.
I have been looking for a library with simple code ( I am a beginner in coding ) but I did not find one for what I need to do.
Have you heard of anything for me ? If there are example on how to use it that would make it easier for me to learn it.

Check out this project
And here is a great quick-start guide
It's pretty simple to use, i haven't tested this, but it should work:
from docx import Document
document = Document()
r = 2 # Number of rows you want
c = 2 # Number of collumns you want
table = document.add_table(rows=r, cols=c)
table.style = 'LightShading-Accent1' # set your style, look at the help documentation for more help
for y in range(r):
for x in range(c):
cell.text = 'text goes here'
document.save('demo.docx') # Save document
It don't think you can set the page orientation property with this library, but what you could do is create a blank word document that is in landscape yourself, store it in the working directory and make a copy of it every time you generate this document.

The previous answer is a good one, but there is another way: create the document in Word then hack the xml in Python to insert the content you want. I have done this several times. In fact, my current invoicing program works this way.
Disadvantages: Conditional formatting and numbered lists will require some real xml knowledge.
Advantages: No limitations or intricate style definitions to manage. EVERYTHING is supported, because it's all done in Word.
Here's the workflow:
create a *.docx document with marker text where you want your headings, table cells, etc.
Use lxml to find those markers and copy their parent elements (along with their formatting).
Use those found elements to create templates
Insert your data into those templates, and assemble the whole thing like a jigsaw puzzle.
Zip your xml into a new *.docx file.
I have a short sample project showing the procedure. Docx2Python does most of the work for you.
https://github.com/ShayHill/replace_docx_tables

Create PDF from Word template in Python

I need to generate reports in python. The report needs to contain a header and footer on each page, some text that won't change, some text that's dynamic, and some charts.
I've created a template using Word, and I'm looking for a way of replacing placeholders such as [+my_placeholder+] with text content/charts/whatever.
Is there anything that can let me use Word documents (or something Word can write) as a template to create a PDF in Python? Since I've already created sample reports in Word I want to reuse what I've got instead of having to recreate them using ReportLab or HTML (I know about ReportLab, pyPDF and also xhtml2pdf).

about the first part, editing word templates, you can have a look at this question: Reading/Writing MS Word files in Python
There is a package for editing word files, called python-docx.
For converting word files into pdf documents, there should be a variety of different tools, including command line tools. So you could use python to edit your document and then use a converter tool that you call from your python code to create the pdf.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extracting content from DOCX into Python Code - python

Related

How to copy-paste docx (containing text, image and formatting) to another docx template in a specific spot?

python-docx cannot read a table inserted from excel

How to retrieve highlighted sequences of an .epub file in Python?

how to create docx files with python

Create PDF from Word template in Python

Categories

Resources