Extracting .docx data, images and structure

Extracting .docx data, images and structure - python

Good day SO,
I have a task where I need to extract specific parts of a document template (For automation purposes). While I am able to traverse, and know the current position, of the document during traversal (via checking for Regex, keywords, etc.), I am unable to extract:
The structure of the document
Detect Images that are in-between text
Am I able to obtain, for example, an array of the structure of the document below?
['Paragraph1','Paragraph2','Image1','Image2','Paragraph3','Paragraph4','Image3','Image4']
My current implementation is shown below:
from docx import Document
document = docx.Document('demo.docx')
text = []
for x in document.paragraphs:
if x.text != '':
text.append(x.text)
Using the code above, I am able to obtain all the Text data from the document, but I am unable to detect the type of text (Header or Normal), and I am unable to detect any Images. I am currently using python-docx.
My main problem is to obtain the position of the image within the document (i.e. between paragraphs) so that I can re-create another document, using text and images extracted. This task requires me to know where the image appears in the document, and where to insert the image in the new document.
Any help is greatly appreciated, thank you :)

For extracting the structure of the paragraph and heading you can use the built-in objects in python-docx. Check this code.
from docx import Document
document = docx.Document('demo.docx')
text = []
style = []
for x in document.paragraphs:
if x.text != '':
style.append(x.style.name)
text.append(x.text)
with x.style.name you can get the styling of text in your document.
You can't get the information regarding images in python-docx. For that, you need to parse the xml. Check XML ouput by
for elem in document.element.getiterator():
print(elem.tag)
Let me know if you need anything else.
For extracting image name and its location use this.
tags = []
text = []
for t in doc.element.getiterator():
if t.tag in ['{http://schemas.openxmlformats.org/wordprocessingml/2006/main}r', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}t','{http://schemas.openxmlformats.org/drawingml/2006/picture}cNvPr','{http://schemas.openxmlformats.org/wordprocessingml/2006/main}drawing']:
if t.tag == '{http://schemas.openxmlformats.org/drawingml/2006/picture}cNvPr':
print('Picture Found: ',t.attrib['name'])
tags.append('Picture')
text.append(t.attrib['name'])
elif t.text:
tags.append('text')
text.append(t.text)
You can check previous and next text from text list and their tag from the tag list.

Related

I am unable to parse through the corrections in the words document through the docx library

I have been trying to parse through the text document, which is a compared document and i want to extract the text in the document which was corrected, which is distinguished by the strike on the error..
To be able to view the corrections, I'd have to set Review->All markup and set showmarkup.
But when I parse through the text using the library, I don't seem to find the striked text in the text returned.
following is the code:
import docx
filename='123.docx'
def getText(filename):
doc = docx.Document(filename)
fullText = []
for para in doc.paragraphs:
fullText.append(para.text)
return '\n'.join(fullText)
getText(filename)
I was planning on using the stike attribute in the library to extract the corrected words, but it dosent seem to parse through the compared doc.
The correction looks like this:
slow it is slow
The text returned by the parse only has "it is slow"

Copy specific word section to new document in python

As the title says, I'm trying to copy a select section of text to a new word document. Basically, I have a bunch of annual reports with sections that are systemically named (i.e. Project 1, Project 2, etc..) I want to be able to search for a select section and copy that section into a report for the individual project. I've been looking through docx documentation and aspose.words documentation. This is the closest I've gotten to what I'm looking for but it's still not quite right:
# For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Python-via-.NET
doc = aw.Document(docs_base.my_dir + "Big document.docx")
for i in range(0, doc.sections.count) :
# Split a document into smaller parts, in this instance, split by section.
section = doc.sections[i].clone()
newDoc = aw.Document()
newDoc.sections.clear()
newSection = newDoc.import_node(section, True).as_section()
newDoc.sections.add(newSection)
# Save each section as a separate document.
newDoc.save(docs_base.artifacts_dir + f"SplitDocument.by_sections_{i}.docx")

I suspect by saying section you mean just content tat starts with some heading paragraph. This does not correspond to Section node in Aspose.Words Document Object Model. Section in MS Word are used if you need to have different page setup or headers/footers within the document.
It is difficult to say without your document, but I suppose there is only one section in your document. It looks like you need to extract content form the document based on the styles.
Also, I have create a simple code example that demonstrates the basic technique:
import aspose.words as aw
# Geneare document with heading paragraphs (just for demonstrations purposes).
doc = aw.Document()
builder = aw.DocumentBuilder(doc)
for i in range(0, 5):
builder.paragraph_format.style_identifier = aw.StyleIdentifier.HEADING1
builder.writeln("Project {0}".format(i))
builder.paragraph_format.style_identifier = aw.StyleIdentifier.NORMAL
for j in range(0, 10):
builder.writeln("This is the project {0} content {1}, each project section will be extracted into a separate document.".format(i, j))
doc.save("C:\\Temp\\out.docx")
# Now split the document by heading paragraphs.
# Code is just for demonstration purposes and supposes there is only one section in the document.
subDocumentIndex = 0
subDocument = doc.clone(False).as_document()
for child in doc.first_section.body.child_nodes:
if child.node_type == aw.NodeType.PARAGRAPH and child.as_paragraph().paragraph_format.style_identifier == aw.StyleIdentifier.HEADING1:
if not subDocument.has_child_nodes:
subDocument.ensure_minimum()
else:
# save subdocument
subDocument.save("C:\\Temp\\sub_document_{0}.docx".format(subDocumentIndex))
subDocumentIndex = subDocumentIndex+1
subDocument = doc.clone(False).as_document()
subDocument.ensure_minimum()
# Remove body content.
subDocument.first_section.body.remove_all_children()
# import current node to the subdocument.
dst_child = subDocument.import_node(child, True, aw.ImportFormatMode.USE_DESTINATION_STYLES)
subDocument.first_section.body.append_child(dst_child)
# save the last document.
subDocument.save("C:\\Temp\\sub_document_{0}.docx".format(subDocumentIndex))

Python docx paragraph in textbox

Is there any way to access and manipulate text in an existing docx document in a textbox with python-docx?
I tried to find a keyword in all paragraphs in a document by iteration:
doc = Document('test.docx')
for paragraph in doc.paragraphs:
if '<DATE>' in paragraph.text:
print('found date: ', paragraph.text)
It is found if placed in normal text, but not inside a textbox.

A workaround for textboxes that contain only formatted text is to use a floating, formatted table. It can be styled almost like a textbox (frames, colours, etc.) and is easily accessible by the docx API.
doc = Document('test.docx')
for table in doc.tables:
for row in table.rows:
for cell in row.cells:
for paragraph in cell.paragraphs:
if '<DATE>' in paragraph.text:
print('found date: ', paragraph.text)

Not via the API, not yet at least. You'd have to uncover the XML structure it lives in and go down to the lxml level and perhaps XPath to find it. Something like this might be a start:
body = doc._body
# assuming differentiating container element is w:textBox
text_box_p_elements = body.xpath('.//w:textBox//w:p')
I have no idea whether textBox is the actual element name here, you'd have to sort that out with the rest of the XPath path details, but this approach will likely work. I use similar approaches frequently to work around features that aren't built into the API yet.
opc-diag is a useful tool for inspecting the XML. The basic approach is to create a minimally small .docx file containing the type of thing you're trying to locate. Then use opc-diag to inspect the XML Word generates when you save the file:
$ opc browse test.docx document.xml
http://opc-diag.readthedocs.org/en/latest/index.html

Adding hyperlinks in Microsoft word using python

I've been searching for the last few hours and cannot find a library that allows me to add hyperlinks to a word document using python. In my ideal world I'd be able to manipulate a word doc using python to add hyperlinks to footnotes which link to internal documents. Python-docx doesn't seem to have this feature.
It breaks down into 2 questions. 1) Is there a way to add hyperlinks to word docs using python? 2) Is there a way to manipulate footnotes in word docs using python?
Does anyone know how to do this or any part of this?

Hyperlinks can be added using the win32com package:
import win32com.client
#connect to Word (start it if it isn't already running)
wordapp = win32com.client.Dispatch("Word.Application")
#add a new document
doc = wordapp.Documents.Add()
#add some text and turn it into a hyperlink
para = doc.Paragraphs.Add()
para.Range.Text = "Adding hyperlinks in Microsoft word using python"
doc.Hyperlinks.Add(Anchor=para.Range, Address="http://stackoverflow.com/questions/34636391/adding-hyperlinks-in-microsoft-word-using-python")
#In theory you should be able to also pass in a TextToDisplay argument to the above call but I haven't been able to get this to work
#The workaround is to insert the link text into the document first and then convert it into a hyperlink

# How to insert hyperlinks into an existing MS Word document using win32com:
# Use the same call as in the example above to connect to Word:
wordapp = win32com.client.Dispatch("Word.Application")
# Open the input file where you want to insert the hyperlinks:
wordapp.Documents.Open("my_input_file.docx")
# Select the currently active document
doc = wordapp.ActiveDocument
# For my application, I want to replace references to identifiers in another
# document with the general format of "MSS-XXXX", where X is any digit, with
# hyperlinks to local html pages that capture the supporting details...
# First capture the entire document's content as text
docText = doc.Content.text
# Search for all identifiers that match the format criteria in the document:
mss_ids_to_link = re.findall('MSS-[0-9]+', docText)
# Now loop over all the identifier strings that were found, construct the link
# address for each html page to be linked, select the desired text where I want
# to insert the hyperlink, and then apply the link to the correct range of
# characters:
for linkIndex in range(len(mss_ids_to_link)):
current_string_to_link = mss_ids_to_link[linkIndex]
link_address = html_file_pathname + \
current_string_to_link + '.htm'
if wordapp.Selection.Find.Execute(FindText=current_string_to_link, \
Address=link_address) == True:
doc.Hyperlinks.Add(Anchor=wordapp.Selection.Range, \
Address=link_address)
# Save off the result:
doc.SaveAs('my_input_file.docx')

Add an image in a specific position in the document (.docx)?

I use Python-docx to generate Microsoft Word document.The user want that when he write for eg: "Good Morning every body,This is my %(profile_img)s do you like it?"
in a HTML field, i create a word document and i recuper the picture of the user from the database and i replace the key word %(profile_img)s by the picture of the user NOT at the END OF THE DOCUMENT. With Python-docx we use this instruction to add a picture:
document.add_picture('profile_img.png', width=Inches(1.25))
The picture is added to the document but the problem that it is added at the end of the document.
Is it impossible to add a picture in a specific position in a microsoft word document with python? I've not found any answers to this in the net but have seen people asking the same elsewhere with no solution.
Thanks (note: I'm not a hugely experiance programmer and other than this awkward part the rest of my code will very basic)

Quoting the python-docx documentation:
The Document.add_picture() method adds a specified picture to the end of the document in a paragraph of its own. However, by digging a little deeper into the API you can place text on either side of the picture in its paragraph, or both.
When we "dig a little deeper", we discover the Run.add_picture() API.
Here is an example of its use:
from docx import Document
from docx.shared import Inches
document = Document()
p = document.add_paragraph()
r = p.add_run()
r.add_text('Good Morning every body,This is my ')
r.add_picture('/tmp/foo.jpg')
r.add_text(' do you like it?')
document.save('demo.docx')

well, I don't know if this will apply to you but here is what I've done to set an image in a specific spot to a docx document:
I created a base docx document (template document). In this file, I've inserted some tables without borders, to be used as placeholders for images. When creating the document, first I open the template, and update the file creating the images inside the tables. So the code itself is not much different from your original code, the only difference is that I'm creating the paragraph and image inside a specific table.
from docx import Document
from docx.shared import Inches
doc = Document('addImage.docx')
tables = doc.tables
p = tables[0].rows[0].cells[0].add_paragraph()
r = p.add_run()
r.add_picture('resized.png',width=Inches(4.0), height=Inches(.7))
p = tables[1].rows[0].cells[0].add_paragraph()
r = p.add_run()
r.add_picture('teste.png',width=Inches(4.0), height=Inches(.7))
doc.save('addImage.docx')

Here's my solution. It has the advantage on the first proposition that it surrounds the picture with a title (with style Header 1) and a section for additional comments. Note that you have to do the insertions in the reverse order they appear in the Word document.
This snippet is particularly useful if you want to programmatically insert pictures in an existing document.
from docx import Document
from docx.shared import Inches
# ------- initial code -------
document = Document()
p = document.add_paragraph()
r = p.add_run()
r.add_text('Good Morning every body,This is my ')
picPath = 'D:/Development/Python/aa.png'
r.add_picture(picPath)
r.add_text(' do you like it?')
document.save('demo.docx')
# ------- improved code -------
document = Document()
p = document.add_paragraph('Picture bullet section', 'List Bullet')
p = p.insert_paragraph_before('')
r = p.add_run()
r.add_picture(picPath)
p = p.insert_paragraph_before('My picture title', 'Heading 1')
document.save('demo_better.docx')

This is adopting the answer written by Robᵩ while considering more flexible input from user.
My assumption is that the HTML field mentioned by Kais Dkhili (orignal enquirer) is already loaded in docx.Document(). So...
Identify where is the related HTML text in the document.
import re
## regex module
img_tag = re.compile(r'%\(profile_img\)s') # declare pattern
for _p in enumerate(document.paragraphs):
if bool(img_tag.match(_p.text)):
img_paragraph = _p
# if and only if; suggesting img_paragraph a list and
# use append method instead for full document search
break # lose the break if want full document search
Replace desired image into placeholder identified as img_tag = '%(profile_img)s'
The following code is after considering the text contains only a single run
May be changed accordingly if condition otherwise
temp_text = img_tag.split(img_paragraph.text)
img_paragraph.runs[0].text = temp_text[0]
_r = img_paragraph.add_run()
_r.add_picture('profile_img.png', width = Inches(1.25))
img_paragraph.add_run(temp_text[1])
and done. document.save() it if finalised.
In case you are wondering what to expect from the temp_text...
[In]
img_tag.split(img_paragraph.text)
[Out]
['This is my ', ' do you like it?']

I spend few hours in it. If you need to add images to a template doc file using python, the best solution is to use python-docx-template library.
Documentation is available here
Examples available in here

This is variation on a theme. Letting I be the paragraph number in the specific document then:
p = doc.paragraphs[I].insert_paragraph_before('\n')
p.add_run().add_picture('Fig01.png', width=Cm(15))

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extracting .docx data, images and structure - python

Related

I am unable to parse through the corrections in the words document through the docx library

Copy specific word section to new document in python

Python docx paragraph in textbox

Adding hyperlinks in Microsoft word using python

Add an image in a specific position in the document (.docx)?

Categories

Resources