Copy specific word section to new document in python - python

As the title says, I'm trying to copy a select section of text to a new word document. Basically, I have a bunch of annual reports with sections that are systemically named (i.e. Project 1, Project 2, etc..) I want to be able to search for a select section and copy that section into a report for the individual project. I've been looking through docx documentation and aspose.words documentation. This is the closest I've gotten to what I'm looking for but it's still not quite right:
# For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Python-via-.NET
doc = aw.Document(docs_base.my_dir + "Big document.docx")
for i in range(0, doc.sections.count) :
# Split a document into smaller parts, in this instance, split by section.
section = doc.sections[i].clone()
newDoc = aw.Document()
newDoc.sections.clear()
newSection = newDoc.import_node(section, True).as_section()
newDoc.sections.add(newSection)
# Save each section as a separate document.
newDoc.save(docs_base.artifacts_dir + f"SplitDocument.by_sections_{i}.docx")

I suspect by saying section you mean just content tat starts with some heading paragraph. This does not correspond to Section node in Aspose.Words Document Object Model. Section in MS Word are used if you need to have different page setup or headers/footers within the document.
It is difficult to say without your document, but I suppose there is only one section in your document. It looks like you need to extract content form the document based on the styles.
Also, I have create a simple code example that demonstrates the basic technique:
import aspose.words as aw
# Geneare document with heading paragraphs (just for demonstrations purposes).
doc = aw.Document()
builder = aw.DocumentBuilder(doc)
for i in range(0, 5):
builder.paragraph_format.style_identifier = aw.StyleIdentifier.HEADING1
builder.writeln("Project {0}".format(i))
builder.paragraph_format.style_identifier = aw.StyleIdentifier.NORMAL
for j in range(0, 10):
builder.writeln("This is the project {0} content {1}, each project section will be extracted into a separate document.".format(i, j))
doc.save("C:\\Temp\\out.docx")
# Now split the document by heading paragraphs.
# Code is just for demonstration purposes and supposes there is only one section in the document.
subDocumentIndex = 0
subDocument = doc.clone(False).as_document()
for child in doc.first_section.body.child_nodes:
if child.node_type == aw.NodeType.PARAGRAPH and child.as_paragraph().paragraph_format.style_identifier == aw.StyleIdentifier.HEADING1:
if not subDocument.has_child_nodes:
subDocument.ensure_minimum()
else:
# save subdocument
subDocument.save("C:\\Temp\\sub_document_{0}.docx".format(subDocumentIndex))
subDocumentIndex = subDocumentIndex+1
subDocument = doc.clone(False).as_document()
subDocument.ensure_minimum()
# Remove body content.
subDocument.first_section.body.remove_all_children()
# import current node to the subdocument.
dst_child = subDocument.import_node(child, True, aw.ImportFormatMode.USE_DESTINATION_STYLES)
subDocument.first_section.body.append_child(dst_child)
# save the last document.
subDocument.save("C:\\Temp\\sub_document_{0}.docx".format(subDocumentIndex))

Related

Python docx not finding all sections

I need this code to be able to work with existing documents I have, but it is only finding one footer in each document. I need it to find all 3 (first page, even, and odd).
Here is the link to my test document: http://www.filedropper.com/testdoc_1
from docx import Document
document = Document('C:/Users/username/Desktop/SPECS/TEST DOC.docx')
print(len(document.sections))
for section in document.sections:
footer = section.footer
footer.paragraphs[0].text = footer.paragraphs[0].text.replace("ISSUED FOR CONSTRUCTION", "DESIGN DEVELOPMENT")
document.save('C:/Users/username/Desktop/SPECS/TEST DOC.docx')
So I answered my question by reading this documentation: https://python-docx.readthedocs.io/en/latest/api/section.html
The default is the odd page footer. To access the other footers (if you enabled different first page and even footers) you have to use the below code:
first_page_footer = section.first_page_footer
even_page_footer = section.even_page_footer
odd_page_footer = section.footer

Extracting .docx data, images and structure

Good day SO,
I have a task where I need to extract specific parts of a document template (For automation purposes). While I am able to traverse, and know the current position, of the document during traversal (via checking for Regex, keywords, etc.), I am unable to extract:
The structure of the document
Detect Images that are in-between text
Am I able to obtain, for example, an array of the structure of the document below?
['Paragraph1','Paragraph2','Image1','Image2','Paragraph3','Paragraph4','Image3','Image4']
My current implementation is shown below:
from docx import Document
document = docx.Document('demo.docx')
text = []
for x in document.paragraphs:
if x.text != '':
text.append(x.text)
Using the code above, I am able to obtain all the Text data from the document, but I am unable to detect the type of text (Header or Normal), and I am unable to detect any Images. I am currently using python-docx.
My main problem is to obtain the position of the image within the document (i.e. between paragraphs) so that I can re-create another document, using text and images extracted. This task requires me to know where the image appears in the document, and where to insert the image in the new document.
Any help is greatly appreciated, thank you :)
For extracting the structure of the paragraph and heading you can use the built-in objects in python-docx. Check this code.
from docx import Document
document = docx.Document('demo.docx')
text = []
style = []
for x in document.paragraphs:
if x.text != '':
style.append(x.style.name)
text.append(x.text)
with x.style.name you can get the styling of text in your document.
You can't get the information regarding images in python-docx. For that, you need to parse the xml. Check XML ouput by
for elem in document.element.getiterator():
print(elem.tag)
Let me know if you need anything else.
For extracting image name and its location use this.
tags = []
text = []
for t in doc.element.getiterator():
if t.tag in ['{http://schemas.openxmlformats.org/wordprocessingml/2006/main}r', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}t','{http://schemas.openxmlformats.org/drawingml/2006/picture}cNvPr','{http://schemas.openxmlformats.org/wordprocessingml/2006/main}drawing']:
if t.tag == '{http://schemas.openxmlformats.org/drawingml/2006/picture}cNvPr':
print('Picture Found: ',t.attrib['name'])
tags.append('Picture')
text.append(t.attrib['name'])
elif t.text:
tags.append('text')
text.append(t.text)
You can check previous and next text from text list and their tag from the tag list.

Python Docx - how to number headings?

There is a good example for Python Docx.
I have used multiple document.add_heading('xxx', level=Y) and can see when I open the generated document in MS Word that the levels are correct.
What I don't see is numbering, such a 1, 1.1, 1.1.1, etc I just see the heading text.
How can I display heading numbers, using Docx ?
Alphanumeric heading prefixes are automatically created based on the outline style and level of the heading. Set the outline style and insert the correct level and you will get the numbering.
From documentation:
_NumberingStyle objects class docx.styles.style._NumberingStyle[source] A numbering style. Not yet
implemented.
However, if you set the heading like this:
paragraph.style = document.styles['Heading 1']
then it should default to the latent numbering style of that heading.
There is a great work around with python docx for achieving complex operations like headings enumerations. Here how to proceed:
Create a new blank document in Word and define your complex individual multilevel list there .
Save the blank document as my_template.docx.
Create the document in your python script based on your template:
from docx import Document
document = Document("my_template.docx")
document.add_heading('My first numbered heading', level=1)
And, voila, you can generate your perfect document with numbered headings.
this answer will realy help you
first you need to new a without number header like this
paragraph = document.add_paragraph()
paragraph.style = document.styles['Heading 4']
then you will have xml word like this
<w:pPr>
<w:pStyle w:val="4"/>
</w:pPr>
then you can access xml word "pStyle" property and change it using under code
header._p.pPr.pStyle.set(qn('w:val'), u'4FDD')
final, open word file you will get what you want !!!
def __str__(self):
if self.nivel == 1:
return str(Level.count_1)+'.- '+self.titulo
elif self.nivel==2: #Imprime si es del nivel 2
return str(Level.count_1)+'.'+str(Level.count_2)+'.- '+self.titulo
elif self.nivel==3: #Imprime si es del nivel 3
return str(Level.count_1)+'.'+str(Level.count_2)+'.'+str(Level.count_3)+'.- '+self.titulo

Adding hyperlinks in Microsoft word using python

I've been searching for the last few hours and cannot find a library that allows me to add hyperlinks to a word document using python. In my ideal world I'd be able to manipulate a word doc using python to add hyperlinks to footnotes which link to internal documents. Python-docx doesn't seem to have this feature.
It breaks down into 2 questions. 1) Is there a way to add hyperlinks to word docs using python? 2) Is there a way to manipulate footnotes in word docs using python?
Does anyone know how to do this or any part of this?
Hyperlinks can be added using the win32com package:
import win32com.client
#connect to Word (start it if it isn't already running)
wordapp = win32com.client.Dispatch("Word.Application")
#add a new document
doc = wordapp.Documents.Add()
#add some text and turn it into a hyperlink
para = doc.Paragraphs.Add()
para.Range.Text = "Adding hyperlinks in Microsoft word using python"
doc.Hyperlinks.Add(Anchor=para.Range, Address="http://stackoverflow.com/questions/34636391/adding-hyperlinks-in-microsoft-word-using-python")
#In theory you should be able to also pass in a TextToDisplay argument to the above call but I haven't been able to get this to work
#The workaround is to insert the link text into the document first and then convert it into a hyperlink
# How to insert hyperlinks into an existing MS Word document using win32com:
# Use the same call as in the example above to connect to Word:
wordapp = win32com.client.Dispatch("Word.Application")
# Open the input file where you want to insert the hyperlinks:
wordapp.Documents.Open("my_input_file.docx")
# Select the currently active document
doc = wordapp.ActiveDocument
# For my application, I want to replace references to identifiers in another
# document with the general format of "MSS-XXXX", where X is any digit, with
# hyperlinks to local html pages that capture the supporting details...
# First capture the entire document's content as text
docText = doc.Content.text
# Search for all identifiers that match the format criteria in the document:
mss_ids_to_link = re.findall('MSS-[0-9]+', docText)
# Now loop over all the identifier strings that were found, construct the link
# address for each html page to be linked, select the desired text where I want
# to insert the hyperlink, and then apply the link to the correct range of
# characters:
for linkIndex in range(len(mss_ids_to_link)):
current_string_to_link = mss_ids_to_link[linkIndex]
link_address = html_file_pathname + \
current_string_to_link + '.htm'
if wordapp.Selection.Find.Execute(FindText=current_string_to_link, \
Address=link_address) == True:
doc.Hyperlinks.Add(Anchor=wordapp.Selection.Range, \
Address=link_address)
# Save off the result:
doc.SaveAs('my_input_file.docx')

Add an image in a specific position in the document (.docx)?

I use Python-docx to generate Microsoft Word document.The user want that when he write for eg: "Good Morning every body,This is my %(profile_img)s do you like it?"
in a HTML field, i create a word document and i recuper the picture of the user from the database and i replace the key word %(profile_img)s by the picture of the user NOT at the END OF THE DOCUMENT. With Python-docx we use this instruction to add a picture:
document.add_picture('profile_img.png', width=Inches(1.25))
The picture is added to the document but the problem that it is added at the end of the document.
Is it impossible to add a picture in a specific position in a microsoft word document with python? I've not found any answers to this in the net but have seen people asking the same elsewhere with no solution.
Thanks (note: I'm not a hugely experiance programmer and other than this awkward part the rest of my code will very basic)
Quoting the python-docx documentation:
The Document.add_picture() method adds a specified picture to the end of the document in a paragraph of its own. However, by digging a little deeper into the API you can place text on either side of the picture in its paragraph, or both.
When we "dig a little deeper", we discover the Run.add_picture() API.
Here is an example of its use:
from docx import Document
from docx.shared import Inches
document = Document()
p = document.add_paragraph()
r = p.add_run()
r.add_text('Good Morning every body,This is my ')
r.add_picture('/tmp/foo.jpg')
r.add_text(' do you like it?')
document.save('demo.docx')
well, I don't know if this will apply to you but here is what I've done to set an image in a specific spot to a docx document:
I created a base docx document (template document). In this file, I've inserted some tables without borders, to be used as placeholders for images. When creating the document, first I open the template, and update the file creating the images inside the tables. So the code itself is not much different from your original code, the only difference is that I'm creating the paragraph and image inside a specific table.
from docx import Document
from docx.shared import Inches
doc = Document('addImage.docx')
tables = doc.tables
p = tables[0].rows[0].cells[0].add_paragraph()
r = p.add_run()
r.add_picture('resized.png',width=Inches(4.0), height=Inches(.7))
p = tables[1].rows[0].cells[0].add_paragraph()
r = p.add_run()
r.add_picture('teste.png',width=Inches(4.0), height=Inches(.7))
doc.save('addImage.docx')
Here's my solution. It has the advantage on the first proposition that it surrounds the picture with a title (with style Header 1) and a section for additional comments. Note that you have to do the insertions in the reverse order they appear in the Word document.
This snippet is particularly useful if you want to programmatically insert pictures in an existing document.
from docx import Document
from docx.shared import Inches
# ------- initial code -------
document = Document()
p = document.add_paragraph()
r = p.add_run()
r.add_text('Good Morning every body,This is my ')
picPath = 'D:/Development/Python/aa.png'
r.add_picture(picPath)
r.add_text(' do you like it?')
document.save('demo.docx')
# ------- improved code -------
document = Document()
p = document.add_paragraph('Picture bullet section', 'List Bullet')
p = p.insert_paragraph_before('')
r = p.add_run()
r.add_picture(picPath)
p = p.insert_paragraph_before('My picture title', 'Heading 1')
document.save('demo_better.docx')
This is adopting the answer written by Robᵩ while considering more flexible input from user.
My assumption is that the HTML field mentioned by Kais Dkhili (orignal enquirer) is already loaded in docx.Document(). So...
Identify where is the related HTML text in the document.
import re
## regex module
img_tag = re.compile(r'%\(profile_img\)s') # declare pattern
for _p in enumerate(document.paragraphs):
if bool(img_tag.match(_p.text)):
img_paragraph = _p
# if and only if; suggesting img_paragraph a list and
# use append method instead for full document search
break # lose the break if want full document search
Replace desired image into placeholder identified as img_tag = '%(profile_img)s'
The following code is after considering the text contains only a single run
May be changed accordingly if condition otherwise
temp_text = img_tag.split(img_paragraph.text)
img_paragraph.runs[0].text = temp_text[0]
_r = img_paragraph.add_run()
_r.add_picture('profile_img.png', width = Inches(1.25))
img_paragraph.add_run(temp_text[1])
and done. document.save() it if finalised.
In case you are wondering what to expect from the temp_text...
[In]
img_tag.split(img_paragraph.text)
[Out]
['This is my ', ' do you like it?']
I spend few hours in it. If you need to add images to a template doc file using python, the best solution is to use python-docx-template library.
Documentation is available here
Examples available in here
This is variation on a theme. Letting I be the paragraph number in the specific document then:
p = doc.paragraphs[I].insert_paragraph_before('\n')
p.add_run().add_picture('Fig01.png', width=Cm(15))

Categories