Is there any way to access and manipulate text in an existing docx document in a textbox with python-docx?
I tried to find a keyword in all paragraphs in a document by iteration:
from docx import Document
doc = Document('test.docx')
for paragraph in doc.paragraphs:
    if '<DATE>' in paragraph.text:
        print('found date: ', paragraph.text)
It is found if placed in normal text, but not inside a textbox.
A workaround for textboxes that contain only formatted text is to use a floating, formatted table. It can be styled almost like a textbox (frames, colours, etc.) and is easily accessible by the docx API.
doc = Document('test.docx')
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                if '<DATE>' in paragraph.text:
                    print('found date: ', paragraph.text)
Not via the API, not yet at least. You'd have to uncover the XML structure it lives in and go down to the lxml level and perhaps XPath to find it. Something like this might be a start:
body = doc.element.body  # the underlying <w:body> lxml element, which supports .xpath()
# assuming differentiating container element is w:textBox
text_box_p_elements = body.xpath('.//w:textBox//w:p')
I have no idea whether textBox is the actual element name here, you'd have to sort that out with the rest of the XPath path details, but this approach will likely work. I use similar approaches frequently to work around features that aren't built into the API yet.
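For what it's worth, in the files I have looked at, text-box paragraphs sit inside a w:txbxContent element, so that is the name I would try first; treat it as an assumption and confirm it against the XML of your own file. Below is a minimal standard-library sketch that works on the raw document.xml rather than through python-docx. The function name text_box_paragraph_texts is made up for illustration.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def text_box_paragraph_texts(document_xml):
    """Return the text of every paragraph found inside a w:txbxContent element.

    `document_xml` is the content of word/document.xml (str or bytes).
    Assumption: text boxes keep their paragraphs under w:txbxContent.
    """
    root = ET.fromstring(document_xml)
    texts = []
    for p in root.findall(".//w:txbxContent//w:p", NS):
        # A paragraph's visible text is spread over its w:t elements.
        texts.append("".join(t.text or "" for t in p.iter("{%s}t" % W_NS)))
    return texts


# Usage with a real file (the path is hypothetical):
#   import zipfile
#   xml = zipfile.ZipFile("test.docx").read("word/document.xml")
#   print(text_box_paragraph_texts(xml))
```

With python-docx objects the equivalent query would presumably be doc.element.body.xpath('.//w:txbxContent//w:p'). Word may store a text box twice (a drawing plus a fallback copy), in which case each paragraph can show up twice in the result.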
opc-diag is a useful tool for inspecting the XML. The basic approach is to create a minimally small .docx file containing the type of thing you're trying to locate. Then use opc-diag to inspect the XML Word generates when you save the file:
$ opc browse test.docx document.xml
http://opc-diag.readthedocs.org/en/latest/index.html
I have to extract all the text in a nested table (tables inside a table inside a table) from a Word document. I'm unable to do it using python-docx, maybe due to my lack of knowledge.
Please suggest some code examples.
You will want some sort of recursion. The basic idea is:
def iter_paragraphs_of_tables(tables):
    for table in tables:
        for row in table.rows:
            for cell in row.cells:
                yield from cell.paragraphs
                yield from iter_paragraphs_of_tables(cell.tables)
for paragraph in iter_paragraphs_of_tables(document.tables):
    print(paragraph.text)
This is Python 3; if you're on Python 2 you'll need to expand the yield from statements into, for example:
yield from cell.paragraphs
# --- becomes ---
for paragraph in cell.paragraphs:
    yield paragraph
python-docx seems to be more of a write/modify library for .docx files, so you may want to try PyPDF2 (https://pythonhosted.org/PyPDF2/). Note that PyPDF2 reads PDF files, so the document would first have to be saved as a PDF. I don't really understand the table-inside-a-table part; I guess the tables are nested in the Word document? If that's the case, just read the file with PyPDF2 and put the words that you want to keep in a table. I wish you the best time reading.
I have a question about using Python to identify text with certain features in a Word document.
I wish to extract text that is bold and has quotation marks around it, for example:
" This is a "sentence" in word document. "
How can I identify the word "sentence" in python?
This is what I have at the moment:
from docx import Document
document = Document(filepath)
short_list = []
for paragraph in document.paragraphs:
    for run in paragraph.runs:
        if run.bold:
            short_list.append(run.text)
Thank you all for your help!
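One way to narrow the bold runs down to quoted words, as a rough sketch rather than a tested recipe: join the bold runs of each paragraph and pull out whatever sits between quotation marks. This assumes the quotation marks themselves are bold and that bold is set directly on the run (run.bold is None when bold is inherited from a style). The names QUOTED and bold_quoted_phrases are made up for illustration.

```python
import re

# Matches text between straight or curly double quotes.
QUOTED = re.compile('["“]([^"“”]+)["”]')


def bold_quoted_phrases(paragraphs):
    """Collect quoted phrases that appear inside explicitly bold runs.

    `paragraphs` is any iterable of objects with a `.runs` list whose items
    have `.text` and `.bold`, e.g. `document.paragraphs` from python-docx.
    """
    found = []
    for paragraph in paragraphs:
        # Keep only runs whose bold flag is explicitly True.
        bold_text = "".join(run.text for run in paragraph.runs if run.bold)
        found.extend(QUOTED.findall(bold_text))
    return found


# Usage with the question's variable:
#   short_list = bold_quoted_phrases(document.paragraphs)
```

If the quotation marks are not bold in your document, you would instead have to look at the text of the neighbouring runs, which this sketch does not attempt.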
I would assume you cannot.
A docx file is in fact a zip file, and according to the documentation of the Python docx module, the Document object represents the document.xml part of the file. Unfortunately, footnotes are stored in a different part: footnotes.xml.
As the module declares its development status on PyPI as 3-alpha, I suppose that it cannot currently process footnotes.
IMHO, you should first check whether there is already an open issue about this, and if so comment on it, or else file a new issue on the project page.
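In the meantime, because the file is just a zip archive, the footnote text can be read without python-docx at all. The sketch below uses only the standard library and assumes the footnotes live in the usual part name, word/footnotes.xml; the function name read_footnotes is made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_footnotes(docx_file):
    """Return the text of each footnote in a .docx (path or file-like object)."""
    with zipfile.ZipFile(docx_file) as package:
        # Assumption: footnotes are stored under the conventional part name.
        if "word/footnotes.xml" not in package.namelist():
            return []
        root = ET.fromstring(package.read("word/footnotes.xml"))
    notes = []
    for footnote in root.iter(W + "footnote"):
        # Separator lines are stored as pseudo-footnotes; skip them.
        if footnote.get(W + "type") in ("separator", "continuationSeparator"):
            continue
        notes.append("".join(t.text or "" for t in footnote.iter(W + "t")))
    return notes


# Usage (the path is hypothetical):
#   for note in read_footnotes("test.docx"):
#       print(note)
```

This only reads the text; it does not tell you where in the body each footnote is referenced.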
Try the example code below:
for paragraph in document.paragraphs:
    if 'sea' in paragraph.text:
        print(paragraph.text)
        # note: assigning paragraph.text replaces all runs, so character formatting is lost
        paragraph.text = 'new text containing ocean'
To search in tables as well, you would need to use something like:
for table in document.tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                if 'sea' in paragraph.text:
                    ...
See How to use python-docx to replace text in a Word document and save
Good day SO,
I have a task where I need to extract specific parts of a document template (for automation purposes). While I am able to traverse the document and know my current position in it during traversal (by checking for regexes, keywords, etc.), I am unable to extract:
The structure of the document
Images that sit in between text
Am I able to obtain, for example, an array of the structure of the document below?
['Paragraph1','Paragraph2','Image1','Image2','Paragraph3','Paragraph4','Image3','Image4']
My current implementation is shown below:
from docx import Document
document = Document('demo.docx')
text = []
for x in document.paragraphs:
    if x.text != '':
        text.append(x.text)
Using the code above, I am able to obtain all the Text data from the document, but I am unable to detect the type of text (Header or Normal), and I am unable to detect any Images. I am currently using python-docx.
My main problem is to obtain the position of the image within the document (i.e. between paragraphs) so that I can re-create another document, using text and images extracted. This task requires me to know where the image appears in the document, and where to insert the image in the new document.
Any help is greatly appreciated, thank you :)
To extract the structure of paragraphs and headings you can use the built-in objects in python-docx. Check this code.
from docx import Document
document = Document('demo.docx')
text = []
style = []
for x in document.paragraphs:
    if x.text != '':
        style.append(x.style.name)
        text.append(x.text)
With x.style.name you can get the name of the style applied to each paragraph in your document.
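To tell headings from normal text, one option is to test that style name. A small sketch, assuming the document uses Word's built-in English style names ("Title", "Heading 1", "Heading 2", ...); the helper name classify_paragraphs is made up for illustration, and custom or localized style names would need their own test.

```python
def classify_paragraphs(paragraphs):
    """Return (kind, text) pairs, where kind is 'Heading' or 'Normal'.

    `paragraphs` is any iterable of objects with `.text` and `.style.name`,
    e.g. `document.paragraphs` from python-docx.
    """
    structure = []
    for paragraph in paragraphs:
        if not paragraph.text:
            continue  # skip empty paragraphs
        name = paragraph.style.name
        # Assumption: headings use the built-in "Title" / "Heading N" styles.
        is_heading = name == 'Title' or name.startswith('Heading')
        structure.append(('Heading' if is_heading else 'Normal', paragraph.text))
    return structure


# Usage:
#   for kind, text in classify_paragraphs(document.paragraphs):
#       print(kind, text)
```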
You can't get information about images through the python-docx API. For that, you need to parse the XML. Check the XML output with
for elem in document.element.iter():
    print(elem.tag)
Let me know if you need anything else.
For extracting image name and its location use this.
tags = []
text = []
for t in document.element.iter():
    if t.tag in ['{http://schemas.openxmlformats.org/wordprocessingml/2006/main}r', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}t','{http://schemas.openxmlformats.org/drawingml/2006/picture}cNvPr','{http://schemas.openxmlformats.org/wordprocessingml/2006/main}drawing']:
        if t.tag == '{http://schemas.openxmlformats.org/drawingml/2006/picture}cNvPr':
            print('Picture Found: ', t.attrib['name'])
            tags.append('Picture')
            text.append(t.attrib['name'])
        elif t.text:
            tags.append('text')
            text.append(t.text)
You can check the previous and next entries in the text list, and their tags in the tags list, to work out where each picture sits.
I have been using python docx library and oxml to automate some changes to my tables in my word document. Unfortunately, no matter what I do, I cannot wrap the text in the table cells.
I managed to manipulate the 'autofit' and 'fit-text' properties of my table, but none of them contribute to wrapping the text in the cells. I can see that there is a "w:noWrap" in the XML of my Word document, and no matter what I do I cannot manipulate or remove it. I believe it is what controls word wrapping in my table.
For example, in this case I am adding a table. I can fit text in a cell and set autofit to 'true', but cannot for the life of me wrap the text:
from docx import Document
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
doc = Document()
table = doc.add_table(5,5)
table.autofit = True # Does Autofit but not wrapping
tc = table.cell(0,0)._tc # As a test, fit text to cell 0,0
tcPr = tc.get_or_add_tcPr()
tcFitText = OxmlElement('w:tcFitText')
tcFitText.set(qn('w:val'),"true")
tcPr.append(tcFitText) #Does fitting but no wrapping
doc.save('demo.docx')
I would appreciate any help or hints.
The <w:noWrap> element appears to be a child of <w:tcPr>, the element that controls table cell properties.
You should be able to access it from the table cell element using XPath:
tc = table.cell(0, 0)._tc
noWraps = tc.xpath(".//w:noWrap")
The noWraps variable here will then be a list containing zero or more <w:noWrap> elements, in your case probably one.
Deleting it is probably the simplest approach, which you can accomplish like this:
if noWraps:  # ---skip following code if list is empty---
    noWrap = noWraps[0]
    noWrap.getparent().remove(noWrap)
You can also take the approach of setting the value of the w:val attribute of the w:noWrap element, but then you have to get into specifying the Clark name of the attribute namespace, which adds some extra fuss and doesn't really produce a different outcome unless for some reason you want to keep that element around.
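If you do want to keep the element, the attribute route could look roughly like the sketch below. It spells out the Clark name by hand and relies only on the element API that lxml and the standard library share; the helper names clark and switch_off_no_wrap are made up for illustration. As far as I know, python-docx's qn('w:val') builds the same Clark-name string.

```python
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def clark(local_name):
    """Build the Clark name '{namespace}local' for the w: namespace."""
    return "{%s}%s" % (W_NS, local_name)


def switch_off_no_wrap(tc):
    """Set w:val="false" on every w:noWrap under a table-cell element.

    `tc` is an element such as `table.cell(0, 0)._tc`.
    Returns the number of w:noWrap elements that were changed.
    """
    changed = 0
    for no_wrap in tc.iter(clark("noWrap")):
        no_wrap.set(clark("val"), "false")
        changed += 1
    return changed


# Usage with python-docx:
#   switch_off_no_wrap(table.cell(0, 0)._tc)
```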
Using the pptx module,
tbl.cell(3,3).text
is a writable, but not a readable attribute. Is there a way to just read the text in a PowerPoint table? I'm trying to avoid COM and pptx is a great module, but lacks this particular feature.
Thanks!
At present, you'll need to go a level deeper to get text out of a cell using python-pptx. The cell.text 'getter' is on the roadmap.
Something like this should get it done:
cell = tbl.cell(3, 3)
# spelled `textframe` in early python-pptx releases, `text_frame` in later ones
paragraphs = cell.text_frame.paragraphs
for paragraph in paragraphs:
    for run in paragraph.runs:
        print(run.text)
Let me know if you need more to go on.
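If you would rather get the cell contents back as one string than print run by run, the loop can be wrapped in a small helper. This is only a sketch: cell_text is a made-up name, and it assumes the attribute is spelled text_frame as in later python-pptx releases. As far as I know, recent releases also make cell.text itself readable, in which case the helper is unnecessary.

```python
def cell_text(cell):
    """Return the text of a table cell, one line per paragraph.

    `cell` is any object with `.text_frame.paragraphs`, each paragraph
    having a `.runs` list of objects with `.text` (e.g. a python-pptx cell).
    """
    lines = []
    for paragraph in cell.text_frame.paragraphs:
        # Runs within one paragraph are concatenated without a separator.
        lines.append("".join(run.text for run in paragraph.runs))
    return "\n".join(lines)


# Usage:
#   print(cell_text(tbl.cell(3, 3)))
```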