Extracting tables from a DOCX Word document in python

I'm trying to extract the content of tables in a DOCX Word document, and I'm new to XML/XPath.
from docx import *
document = opendocx('someFile.docx')
tableList = document.xpath('/w:tbl')
This raises an "XPathEvalError: Undefined namespace prefix" error. I expect it's only the first of several I'll hit while developing the script. Unfortunately, I couldn't find a tutorial for python-docx.
Could you kindly provide an example of table extraction?

After some back and forth, we found that a namespace map was needed for this to work correctly. The xpath method is the appropriate solution; it just needs to have the document's namespaces passed in.
The lxml documentation for the xpath method has the details: it accepts a namespaces dictionary that maps each prefix to its full namespace URI.
As explained by mgierdal in his comment above:
tblList = document.xpath('//w:tbl', namespaces=document.nsmap) works
like a dream. So, as I understand it w: is a shorthand that has to be
expanded to the full namespace name, and the dictionary for that is
provided by document.nsmap.
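
To illustrate the prefix-to-namespace mapping described above, here is a minimal sketch. It deliberately uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages; the idea (pass a dictionary that maps "w" to the full URI) is the same. SAMPLE_XML is a hypothetical, heavily trimmed stand-in for a real word/document.xml.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace; "w" is only the conventional prefix for it.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NSMAP = {"w": W_NS}

# Hypothetical, heavily trimmed stand-in for word/document.xml.
SAMPLE_XML = f"""
<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:tbl>
      <w:tr>
        <w:tc><w:p><w:r><w:t>A1</w:t></w:r></w:p></w:tc>
        <w:tc><w:p><w:r><w:t>B1</w:t></w:r></w:p></w:tc>
      </w:tr>
    </w:tbl>
  </w:body>
</w:document>
"""

root = ET.fromstring(SAMPLE_XML)

# The second argument tells the query engine what the "w:" prefix stands for.
tables = root.findall(".//w:tbl", NSMAP)
cell_texts = [t.text for t in tables[0].findall(".//w:t", NSMAP)]
print(len(tables), cell_texts)
```

Note the leading ".//": like "//w:tbl" in the quoted comment, it searches the whole tree, whereas "/w:tbl" would only match a table sitting at the document root.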

You can extract the tables from a DOCX file using python-docx. Check the following code:
from docx import Document
document = Document(file_path)
tables = document.tables

First install python-docx, as mentioned by @abdulsaboor:
pip install python-docx
Then this code should do:
from docx import Document
document = Document('myfile.docx')
for table in document.tables:
    print()
    for row in table.rows:
        for cell in row.cells:
            print(cell.text, end=' ')
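
If the cell values are needed as data rather than printed, the same loops can collect them into nested lists. This is only a sketch: table_to_rows and extract_tables are hypothetical helper names, and the stand-in table below merely mimics the .rows / .cells / .text attributes used in the answer so that the helper can be tried without a real document.

```python
from types import SimpleNamespace  # only used for the stand-in table below


def table_to_rows(table):
    """Return a table as a list of rows, each row a list of cell strings."""
    return [[cell.text.strip() for cell in row.cells] for row in table.rows]


def extract_tables(path):
    """Load a .docx file and return each table in document.tables as nested lists."""
    from docx import Document  # python-docx; imported lazily on purpose
    return [table_to_rows(table) for table in Document(path).tables]


# Hypothetical stand-in mimicking a python-docx table object.
fake_table = SimpleNamespace(rows=[
    SimpleNamespace(cells=[SimpleNamespace(text="Name"), SimpleNamespace(text="Qty")]),
    SimpleNamespace(cells=[SimpleNamespace(text="Bolt "), SimpleNamespace(text="12")]),
])
rows = table_to_rows(fake_table)
print(rows)
```

With a real file, extract_tables('myfile.docx') would return one nested list per table.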

Related

python-docx Replace a URL with custom content

I want to write a tool that is able to process a docx file and replace urls that match a specific pattern, with different content (e.g. headings, tables, and text).
Finding the matching URL is simple enough
from docx import Document
from docx.opc.constants import RELATIONSHIP_TYPE as RT
doc = Document('input.docx')
rels = doc.part.rels
# iter_hyperlink_rels: helper whose definition is not shown in this post
for r in iter_hyperlink_rels(rels):
    print(r)
But I'm not sure how to remove that element and place my own content at that specific location. How can I do this?
In essence, I'm trying to implement a sort of macro processor that replaces specific tags within a document with content generated using these tags.
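
The helper iter_hyperlink_rels is not defined in the question. The sketch below is only a guess at what such a helper might look like; the pattern parameter and the (rId, url) return shape are assumptions. It relies on the reltype and target_ref attributes of python-docx relationship objects, and spells out the hyperlink relationship type as a string so the example runs without python-docx installed.

```python
from types import SimpleNamespace  # only for the stand-in data below

# Value of docx.opc.constants.RELATIONSHIP_TYPE.HYPERLINK, spelled out here
# so the sketch runs without python-docx installed.
HYPERLINK_RELTYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
)
IMAGE_RELTYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
)


def iter_hyperlink_rels(rels, pattern=""):
    """Yield (rId, url) for hyperlink relationships whose URL contains pattern."""
    for r_id, rel in rels.items():
        if rel.reltype == HYPERLINK_RELTYPE and pattern in rel.target_ref:
            yield r_id, rel.target_ref


# Hypothetical stand-in for doc.part.rels (a mapping of rId -> relationship).
fake_rels = {
    "rId1": SimpleNamespace(reltype=HYPERLINK_RELTYPE,
                            target_ref="https://example.com/macro/table1"),
    "rId2": SimpleNamespace(reltype=IMAGE_RELTYPE,
                            target_ref="media/image1.png"),
}
matches = list(iter_hyperlink_rels(fake_rels, pattern="/macro/"))
print(matches)
```

Returning the rId alongside the URL may be useful for the replacement step, since a w:hyperlink element in the document body refers to its target through that id.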

Reading core_properties using python-docx

I'm trying to read the last_saved_by attribute on docx files. I've followed the comments on Github, and from this question. It seems that support has been added, but the documentation isn't very clear for me.
I've entered the following code into my script (Notepad++):
import docx
document = Document()
core_properties = document.core_properties
core_properties.author = 'Foo B. Baz'
document.save('new-filename.docx')
I only get an error message at the end:
NameError: name 'Document' is not defined
I'm not sure where I'm going wrong. :(
When I enter it line by line through python itself, the problem seems to come up from the second line.
I'm using Python 3.4, and docx 0.8.6

Figured out where I was going wrong, for those that want to know:
from docx import Document
import docx
document = Document('mine.docx')
core_properties = document.core_properties
print(core_properties.author)
There'll be a more succinct way of doing this, I'm sure (importing docx twice seems redundant for a start) - but it works, so I'm happy! :)

If the only thing you need from the docx module is Document, then you only need to use
from docx import Document
If you use more than that, you could use
import docx
document = docx.Document()
Importing specific names from the docx module is your choice; either way, you don't need two lines that import from (or import) docx, although having both is not expensive.
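
The question asks about "last saved by"; as far as I can tell, python-docx exposes that value as core_properties.last_modified_by. The sketch below collects a few core properties into a dictionary. The helper names and the choice of fields are assumptions, and the stand-in object only mimics the attributes so the helper can be tried without a real file.

```python
from types import SimpleNamespace  # only used for the stand-in below

# Attribute names as found on python-docx's core_properties object.
FIELDS = ("author", "last_modified_by", "created", "modified")


def summarize_core_properties(props, fields=FIELDS):
    """Return selected core properties as a dict; missing ones become None."""
    return {name: getattr(props, name, None) for name in fields}


def read_core_properties(path):
    """Open a .docx file and summarize its core properties."""
    from docx import Document  # python-docx; imported lazily on purpose
    return summarize_core_properties(Document(path).core_properties)


# Hypothetical stand-in mimicking document.core_properties.
fake_props = SimpleNamespace(author="Foo B. Baz", last_modified_by="Jane Doe")
summary = summarize_core_properties(fake_props)
print(summary)
```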

Modifying text in docx through Python

I am currently trying to open an existing Word document, modify a line by inserting a raw input at the end, then save it as an overwrite to the original "DOCX". I am using the "DOCX" module. I'm able to create a new document, write in it, then save it... however cannot figure out how to modify an existing line in an existing "DOCX".
doc = docx.Document()
paragraph = doc.add_paragraph()
So far, I've tried this. The problem is that the paragraph I need to modify is paragraph 0, and this code places my text at the bottom of the page in a new paragraph.
import docx
paragraph = doc.add_paragraph()
doc = docx.Document("C:\Users\xxx\Desktop\test.docx")
doc.add_paragraph("hello")
docx.text.paragraph.Paragraph object at 0x03697170
doc.save("C:\Users\xxx\Desktop\test.docx")
How can I go about instructing python to write at the end of an existing string in a paragraph then save it overwriting the original?

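import docx
# raw strings keep the backslashes in the Windows path intact
doc = docx.Document(r"C:\Users\xxx\Desktop\test.docx")
# append a run to the end of the first paragraph
doc.paragraphs[0].add_run("hello")
doc.save(r"C:\Users\xxx\Desktop\test.docx")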
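
A side note on the Windows paths used in this thread: in an ordinary Python string literal a backslash starts an escape sequence, so "\t" becomes a tab, and a "\U" (as in "C:\Users") is rejected outright by Python 3. A small sketch with a hypothetical path shows the difference a raw string makes:

```python
# "\t" inside an ordinary string literal is a tab character, so the path is
# silently corrupted; the raw string keeps every backslash as typed.
plain = "C:\temp\test.docx"
raw = r"C:\temp\test.docx"

print(repr(plain))
print(repr(raw))
print(len(plain), len(raw))
```

Forward slashes or doubled backslashes ("C:\\temp\\test.docx") are common alternatives to the r prefix.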

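Do you mean something like this?
from docx import Document
existing_docx = Document(r'path_to_existing.docx')
for paragraph in existing_docx.paragraphs:
    paragraph.text = paragraph.text + your_text  # your_text: the string to append
existing_docx.save(r'same_path_or_another.docx')
With the code above, the new docx will have your text appended to the end of every paragraph. Note that assigning to paragraph.text replaces the paragraph's existing runs with a single run, so character-level formatting inside the paragraph is lost.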

To get paragraph 0 try:
para = doc.paragraphs[0]

Update the TOC (table of content) of MS Word .docx documents with Python

I use the Python package "python-docx" to modify the structure and content of MS Word .docx documents. The package lacks the ability to update the TOC (table of contents) [Python: Create a "Table Of Contents" with python-docx/lxml].
Are there workarounds to update the TOC of a document? I thought about using "win32com.client" from the python package "pywin32" [https://pypi.python.org/pypi/pypiwin32] or a comparable pypi package offering "cli control" capabilities for MS Office.
I tried the following:
I changed the document.docx to document.docm and implemented the following macro [http://word.tips.net/T000301_Updating_an_Entire_TOC_from_a_Macro.html]:
Sub update_TOC()
If ActiveDocument.TablesOfContents.Count = 1 Then _
ActiveDocument.TablesOfContents(1).Update
End Sub
If I change the content (add/remove headings) and run the macro, the TOC is updated. I save the document and I am happy.
I then implemented the following Python code, which should be equivalent to the macro:
import win32com.client
def update_toc(docx_file):
    word = win32com.client.DispatchEx("Word.Application")
    doc = word.Documents.Open(docx_file)
    toc_count = doc.TablesOfContents.Count
    if toc_count == 1:
        toc = doc.TablesOfContents(1)
        toc.Update
        print('TOC should have been updated.')
    else:
        print('TOC has not been updated for sure...')
update_toc(docx_file) is called in a higher-level script (which manipulates the TOC-relevant content of the document). After this function call the document is saved (doc.Save()), closed (doc.Close()) and the Word instance is closed (word.Quit()). However, the TOC is not updated.
Does MS Word perform additional actions after macro execution that I did not consider?

Here is a snippet to update the TOC of a Word 2013 .docx document that contains only one table of contents (e.g. just a TOC of headings, no table of figures, etc.). If the script update_toc.py is run from the command prompt (Windows 10, command prompt not "running as admin") using python update_toc.py, the system installation of Python opens the file doc_with_toc.docx in the same directory, updates the TOC (in my case the headings) and saves the changes to the same file. The document must not be open in another instance of Word 2013 and must not be write-protected. Be aware that this script does not do the same thing as selecting the whole document content and pressing the F9 key.
Content of update_toc.py:
import win32com.client
import inspect, os

def update_toc(docx_file):
    word = win32com.client.DispatchEx("Word.Application")
    doc = word.Documents.Open(docx_file)
    doc.TablesOfContents(1).Update()
    doc.Close(SaveChanges=True)
    word.Quit()

def main():
    script_dir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
    file_name = 'doc_with_toc.docx'
    file_path = os.path.join(script_dir, file_name)
    update_toc(file_path)

if __name__ == "__main__":
    main()
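
One possible refinement of the script above, offered as a sketch rather than a tested recipe: wrap the COM calls in try/finally so that the hidden Word process is closed even if opening or updating fails. The helper names absolute_docx_path and update_first_toc are hypothetical; the Word calls are the same ones used in the answers here (Documents.Open, TablesOfContents, Update, Close, Quit) plus doc.Save(), which the question mentions.

```python
import os


def absolute_docx_path(path):
    """Return an absolute path, as the script above also builds before opening."""
    return os.path.abspath(os.path.expanduser(path))


def update_first_toc(docx_file):
    """Update the first table of contents and save, always quitting Word."""
    import win32com.client  # pywin32, Windows only; imported lazily

    word = win32com.client.DispatchEx("Word.Application")
    try:
        doc = word.Documents.Open(absolute_docx_path(docx_file))
        try:
            if doc.TablesOfContents.Count >= 1:
                doc.TablesOfContents(1).Update()
            doc.Save()
        finally:
            # Changes were saved explicitly above; close without prompting.
            doc.Close(SaveChanges=False)
    finally:
        word.Quit()


print(absolute_docx_path("doc_with_toc.docx"))
```

On a Windows machine with Word and pywin32 installed, update_first_toc("doc_with_toc.docx") would be the intended call.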

I autogenerate a docx file with the docxtpl Python package.
This document contains many autogenerated tables.
I need to update the entire document after template generation (so that my generated table numbers are refreshed, as well as the table of contents, the table of figures and the table of tables).
I am not fluent in VBA and didn't know which functions to use for these updates. To find them, I recorded a Word macro with the "Record Macro" button.
I translated the autogenerated code to Python, and here is the result.
I think this approach can help with performing any Word operation through Python.
import win32com.client

def DocxUpdate(docx_file):
    word = win32com.client.DispatchEx("Word.Application")
    doc = word.Documents.Open(docx_file)
    # update all figure / table numbers
    word.ActiveDocument.Fields.Update()
    # update table of contents / figures / tables
    word.ActiveDocument.TablesOfContents(1).Update()
    word.ActiveDocument.TablesOfFigures(1).Update()
    word.ActiveDocument.TablesOfFigures(2).Update()
    doc.Close(SaveChanges=True)
    word.Quit()

To update the TOC, this worked for me:
import win32com.client
word = win32com.client.DispatchEx("Word.Application")
Selection = word.Selection
Selection.Fields.Update()

Read Docx files via python

Does anyone know a python library to read docx files?
I have a word document that I am trying to read data from.

There are a couple of packages that let you do this.
Check
python-docx.
docx2txt (note that it does not seem to work with .doc). As per this, it seems to get more info than python-docx.
From original documentation:
import docx2txt
# extract text
text = docx2txt.process("file.docx")
# extract text and write images in /tmp/img_dir
text = docx2txt.process("file.docx", "/tmp/img_dir")
textract (which works via docx2txt).
Since .docx files are simply .zip archives with a different extension, their contents can be accessed with ordinary zip tools.
This is a significant difference from .doc files, and the reason why some (or all) of the above do not work with .doc files.
In that case, you would likely have to convert doc -> docx first. antiword is an option.
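
To make the "a .docx is a zip archive" point concrete, here is a sketch that reads paragraph text using only the standard library (zipfile plus ElementTree). The function name read_docx_paragraphs is hypothetical. The in-memory archive built below contains only word/document.xml, so it is a stand-in for illustration and not a file Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_docx_paragraphs(file_or_path):
    """Return the text of each w:p element found in word/document.xml."""
    with zipfile.ZipFile(file_or_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread over one or more w:t elements.
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")))
    return paragraphs


# Build a tiny in-memory archive that holds only word/document.xml.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "word/document.xml",
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t> world</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>",
    )
buffer.seek(0)
print(read_docx_paragraphs(buffer))
```

The same function accepts a path to a real .docx file; paragraphs inside table cells are included because the search walks the whole tree.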

python-docx can read as well as write.
import docx
doc = docx.Document('myfile.docx')
allText = []
for docpara in doc.paragraphs:
    allText.append(docpara.text)
Now all paragraphs will be in the list allText.
Thanks to "How to Automate the Boring Stuff with Python" by Al Sweigart for the pointer.

See this library that allows for reading docx files https://python-docx.readthedocs.io/en/latest/
You should use the python-docx library available on PyPI. Then you can use the following:
import docx
doc = docx.Document('myfile.docx')
allText = []
for docpara in doc.paragraphs:
    allText.append(docpara.text)

A quick search of PyPI turns up the docx package.

import docx

def main():
    try:
        doc = docx.Document('test.docx')  # Creating word reader object.
        data = ""
        fullText = []
        for para in doc.paragraphs:
            fullText.append(para.text)
        data = '\n'.join(fullText)
        print(data)
    except IOError:
        print('There was an error opening the file!')
        return

if __name__ == '__main__':
    main()
And don't forget to install python-docx first (pip install python-docx).
