Using the pptx module,
tbl.cell(3,3).text
is a writable, but not a readable attribute. Is there a way to just read the text in a PowerPoint table? I'm trying to avoid COM and pptx is a great module, but lacks this particular feature.
Thanks!
At present, you'll need to go a level deeper to get text out of a cell using python-pptx. The cell.text 'getter' is on the roadmap.
Something like this should get it done:
cell = tbl.cell(3, 3)
# Note: newer python-pptx releases spell this attribute text_frame.
paragraphs = cell.textframe.paragraphs
for paragraph in paragraphs:
    for run in paragraph.runs:
        print(run.text)
Let me know if you need more to go on.
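If you want the text back as a single string rather than printed run by run, the loop above can be wrapped in a small helper. This is only a sketch: `cell_text` is a made-up name, not part of python-pptx, and it assumes the cell exposes its text frame either as `text_frame` (newer releases) or as `textframe` (the name used above).

```python
def cell_text(cell):
    """Return the text of a table cell, one line per paragraph.

    `cell_text` is a hypothetical helper, not a python-pptx function.
    """
    # Newer python-pptx releases name the attribute `text_frame`;
    # the snippet above was written against the older `textframe`.
    frame = getattr(cell, "text_frame", None)
    if frame is None:
        frame = cell.textframe
    lines = []
    for paragraph in frame.paragraphs:
        # A paragraph's text is the concatenation of its runs.
        lines.append("".join(run.text for run in paragraph.runs))
    return "\n".join(lines)
```

Usage would then be something like `print(cell_text(tbl.cell(3, 3)))`.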
I want to process some PDF files with an NLP module, so I first want to strip all existing tables out of those files.
This is the code for extracting tables using pdfplumber:
import pdfplumber
pdf = pdfplumber.open("file.pdf")
page = pdf.pages[1]
table = page.extract_table()
But I want the inverse operation: extract only the text that is not part of a table.
Disclaimer: I am the author of pText, the library used in this answer.
Load the Document
Define a LocationFilter
A LocationFilter does pretty much what it says on the tin. It will listen to parsing events (like "render text" or "change font to") but it will only allow those to come through that fall within a given boundary.
Keep in mind the origin in PDF coordinates is at the lower left corner.
The LocationFilter in this example will therefore match only text in the lower-left corner of the page.
Add a SimpleTextExtraction to the LocationFilter
The next question is "what is the LocationFilter going to pass events to?"
In this case, you can start by trying a SimpleTextExtraction.
Putting it all together:
l0 = LocationFilter(0, 0, 100, 100)
l1 = SimpleTextExtraction()
l0.add_listener(l1)
# pdf_file_handle is an open handle to the PDF file; the filter (l0) is passed as the listener
doc = PDF.loads(pdf_file_handle, [l0])
After the Document has loaded, you can ask the SimpleTextExtraction for all the text on a given Page.
l1.get_text(0)
You can obtain pText either on GitHub or from PyPI.
There are many more examples; check them out to find out more about working with images.
Do you really have to stick with pdfplumber?
If not, I can suggest a better solution: use tabula instead.
Here is an answer to a similar question you can check out: tabula
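If you would rather stay with pdfplumber, one possible approach is to find the table bounding boxes first and then drop every object that falls inside one of them before extracting the text. A rough sketch, assuming the tables are ones that `find_tables()` can actually detect; `outside_all` and `text_without_tables` are made-up helper names, not pdfplumber functions:

```python
def outside_all(obj, bboxes):
    """True if the centre of a page object lies outside every bbox.

    Each bbox is (x0, top, x1, bottom), the layout pdfplumber uses.
    """
    h_mid = (obj["x0"] + obj["x1"]) / 2
    v_mid = (obj["top"] + obj["bottom"]) / 2
    return not any(
        x0 <= h_mid < x1 and top <= v_mid < bottom
        for (x0, top, x1, bottom) in bboxes
    )


def text_without_tables(path, page_number=0):
    """Extract the text of one page, skipping anything inside a detected table."""
    # Third-party import kept local so outside_all() has no dependencies.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number]
        bboxes = [table.bbox for table in page.find_tables()]
        # Page.filter keeps only the objects for which the test returns True.
        filtered = page.filter(lambda obj: outside_all(obj, bboxes))
        return filtered.extract_text()
```

Calling `text_without_tables("file.pdf", 1)` would then give the text of the second page with the table contents left out, provided the table detection picks up your tables.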
I have to extract all the text in a nested table (tables inside a table inside a table) from a Word document. I'm unable to do it using python-docx, maybe because of my lack of knowledge.
Please suggest some code examples.
You will want some sort of recursion. The basic idea is:
def iter_paragraphs_of_tables(tables):
    for table in tables:
        for row in table.rows:
            for cell in row.cells:
                yield from cell.paragraphs
                yield from iter_paragraphs_of_tables(cell.tables)

for paragraph in iter_paragraphs_of_tables(document.tables):
    print(paragraph.text)
This is Python 3; if you're on Python 2, you'll need to expand the yield from statements into, for example:
yield from cell.paragraphs
# --- becomes ---
for paragraph in cell.paragraphs:
    yield paragraph
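If it helps to see how deeply each piece of text is nested, the same recursion can carry a depth counter along. This is just a variation on the generator above, not something python-docx provides; `iter_table_text` is a made-up name, and it uses plain loops so it should run on Python 2 as well.

```python
def iter_table_text(tables, depth=0):
    """Yield (depth, text) for every non-empty paragraph in `tables`,
    descending into tables nested inside cells.
    """
    for table in tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    if paragraph.text.strip():
                        yield depth, paragraph.text
                # Tables nested in this cell are one level deeper.
                for item in iter_table_text(cell.tables, depth + 1):
                    yield item
```

Looping over `iter_table_text(document.tables)` and printing `"  " * depth + text` for each pair gives an indented outline of the nested tables.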
python-docx seems more like a library for writing and modifying .docx files; you may want to try PyPDF2 (https://pythonhosted.org/PyPDF2/). I don't really understand the table-inside-table part, though. I guess the table is nested in the Word document? If that's the case, just read the file with PyPDF2 and put the words that you want to keep in a table. I wish you the best time reading.
Good day SO,
I have a task where I need to extract specific parts of a document template (for automation purposes). While I am able to traverse the document and know my current position during traversal (by checking for regexes, keywords, etc.), I am unable to:
Extract the structure of the document
Detect images that sit between text
Am I able to obtain, for example, an array of the structure of the document below?
['Paragraph1','Paragraph2','Image1','Image2','Paragraph3','Paragraph4','Image3','Image4']
My current implementation is shown below:
from docx import Document
document = Document('demo.docx')
text = []
for x in document.paragraphs:
    if x.text != '':
        text.append(x.text)
Using the code above, I am able to obtain all the Text data from the document, but I am unable to detect the type of text (Header or Normal), and I am unable to detect any Images. I am currently using python-docx.
My main problem is to obtain the position of the image within the document (i.e. between paragraphs) so that I can re-create another document, using text and images extracted. This task requires me to know where the image appears in the document, and where to insert the image in the new document.
Any help is greatly appreciated, thank you :)
For extracting the structure of the paragraph and heading you can use the built-in objects in python-docx. Check this code.
from docx import Document
document = Document('demo.docx')
text = []
style = []
for x in document.paragraphs:
    if x.text != '':
        # style name, e.g. a heading style or the normal body style
        style.append(x.style.name)
        text.append(x.text)
With x.style.name you can get the style of each paragraph in your document.
You can't get the information regarding images directly from python-docx. For that, you need to parse the XML. Check the XML output with:
for elem in document.element.iter():
    print(elem.tag)
Let me know if you need anything else.
To extract each image's name and its location, use this:
W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PIC = '{http://schemas.openxmlformats.org/drawingml/2006/picture}'
tags = []
text = []
# document is the Document loaded above; walk every element in order
for t in document.element.iter():
    if t.tag in [W + 'r', W + 't', PIC + 'cNvPr', W + 'drawing']:
        if t.tag == PIC + 'cNvPr':
            print('Picture Found: ', t.attrib['name'])
            tags.append('Picture')
            text.append(t.attrib['name'])
        elif t.text:
            tags.append('text')
            text.append(t.text)
You can check the previous and next entries in the text list and their matching tags in the tags list.
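To get something close to the ['Paragraph1', 'Image1', ...] list from the question, the same idea can also be tried without python-docx, by reading word/document.xml straight out of the .docx archive with the standard library. Treat this as a sketch under assumptions: `body_structure` and `docx_structure` are made-up names, only top-level paragraphs are visited (tables are skipped), and only pictures stored as `w:drawing` elements are counted.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def body_structure(document_xml):
    """Return labels such as ['Paragraph1', 'Image1', ...] in document order.

    `document_xml` is the content of word/document.xml.
    """
    body = ET.fromstring(document_xml).find(W + "body")
    labels, n_par, n_img = [], 0, 0
    for p in body.findall(W + "p"):
        text_seen = False
        # iter() walks the paragraph's descendants in document order.
        for node in p.iter():
            if node.tag == W + "t" and node.text and not text_seen:
                # First text in this paragraph: record the paragraph once.
                n_par += 1
                labels.append("Paragraph%d" % n_par)
                text_seen = True
            elif node.tag == W + "drawing":
                n_img += 1
                labels.append("Image%d" % n_img)
    return labels


def docx_structure(path):
    """Read word/document.xml from a .docx file and label its contents."""
    with zipfile.ZipFile(path) as archive:
        return body_structure(archive.read("word/document.xml"))
```

`docx_structure('demo.docx')` should then return the labels in the order the text and pictures appear, which is the position information needed to rebuild the document.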
I have been using the python-docx library and oxml to automate some changes to the tables in my Word document. Unfortunately, no matter what I do, I cannot wrap the text in the table cells.
I managed to successfully manipulate the 'autofit' and 'fit-text' properties of my table, but none of them contribute to the wrapping of the text in the cells. I can see that there is a "w:noWrap" in the XML version of my Word document, and no matter what I do I cannot manipulate or remove it. I believe it is responsible for the word wrapping in my table.
For example, in this case I am adding a table. I can fit text in a cell and set autofit to 'true', but cannot for the life of me wrap the text:
from docx import Document
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
doc = Document()
table = doc.add_table(5,5)
table.autofit = True # Does Autofit but not wrapping
tc = table.cell(0,0)._tc # As a test, fit text to cell 0,0
tcPr = tc.get_or_add_tcPr()
tcFitText = OxmlElement('w:tcFitText')
tcFitText.set(qn('w:val'),"true")
tcPr.append(tcFitText) #Does fitting but no wrapping
doc.save('demo.docx')
I would appreciate any help or hints.
The <w:noWrap> element appears to be a child of <w:tcPr>, the element that controls table cell properties.
You should be able to access it from the table cell element using XPath:
tc = table.cell(0, 0)._tc
noWraps = tc.xpath(".//w:noWrap")
The noWraps variable here will then be a list containing zero or more <w:noWrap> elements, in your case probably one.
Deleting it is probably the simplest approach, which you can accomplish like this:
if noWraps:  # ---skip following code if list is empty---
    noWrap = noWraps[0]
    noWrap.getparent().remove(noWrap)
You can also take the approach of setting the value of the w:val attribute of the w:noWrap element, but then you have to get into specifying the Clark name of the attribute namespace, which adds some extra fuss and doesn't really produce a different outcome unless for some reason you want to keep that element around.
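For what it's worth, a Clark name is just the namespace URI in braces followed by the local name. The sketch below shows the attribute-setting alternative on a stand-in element built with the standard library rather than on a real python-docx cell; `clark` is a made-up helper, and the assumption that a `w:val` of "0" switches the setting off should be checked against your own document.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def clark(local_name, namespace=W_NS):
    """Build a Clark-notation name: '{namespace-uri}local-name'."""
    return "{%s}%s" % (namespace, local_name)


# Stand-in for a cell's <w:tcPr> that contains <w:noWrap/>.
tcPr = ET.fromstring('<w:tcPr xmlns:w="%s"><w:noWrap/></w:tcPr>' % W_NS)
noWrap = tcPr.find(clark("noWrap"))

# Keep the element but set w:val on it instead of deleting it.
noWrap.set(clark("val"), "0")
```

With python-docx itself, `qn('w:val')` from docx.oxml.ns builds the same string, so `noWrap.set(qn('w:val'), '0')` would be the equivalent call on the element found by the XPath query above.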
Is there any way to access and manipulate text in an existing docx document in a textbox with python-docx?
I tried to find a keyword in all paragraphs in a document by iteration:
doc = Document('test.docx')
for paragraph in doc.paragraphs:
    if '<DATE>' in paragraph.text:
        print('found date: ', paragraph.text)
It is found if placed in normal text, but not inside a textbox.
A workaround for textboxes that contain only formatted text is to use a floating, formatted table. It can be styled almost like a textbox (frames, colours, etc.) and is easily accessible by the docx API.
doc = Document('test.docx')
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                if '<DATE>' in paragraph.text:
                    print('found date: ', paragraph.text)
Not via the API, not yet at least. You'd have to uncover the XML structure it lives in and go down to the lxml level and perhaps XPath to find it. Something like this might be a start:
body = doc.element.body  # the underlying lxml <w:body> element
# assuming differentiating container element is w:textBox
text_box_p_elements = body.xpath('.//w:textBox//w:p')
I have no idea whether textBox is the actual element name here, you'd have to sort that out with the rest of the XPath path details, but this approach will likely work. I use similar approaches frequently to work around features that aren't built into the API yet.
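As a starting point for that digging, here is a standard-library sketch that pulls paragraph text out of text boxes without going through python-docx. It rests on an assumption you should confirm for your own file: as far as I can tell, Word keeps text box paragraphs under a `w:txbxContent` element. `text_box_paragraphs` and `text_boxes_in_docx` are made-up names.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def text_box_paragraphs(document_xml):
    """Return the text of each paragraph found inside a <w:txbxContent> element."""
    root = ET.fromstring(document_xml)
    texts = []
    for box in root.iter(W + "txbxContent"):
        for p in box.iter(W + "p"):
            # Join the runs, since a keyword can be split across several <w:t>.
            texts.append("".join(t.text or "" for t in p.iter(W + "t")))
    return texts


def text_boxes_in_docx(path):
    """Read word/document.xml from a .docx file and list its text box paragraphs."""
    with zipfile.ZipFile(path) as archive:
        return text_box_paragraphs(archive.read("word/document.xml"))
```

Something like `[t for t in text_boxes_in_docx('test.docx') if '<DATE>' in t]` would then find the keyword. The same text may show up twice if Word saved both a current and a fallback copy of the text box, so you may need to de-duplicate.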
opc-diag is a useful tool for inspecting the XML. The basic approach is to create a minimally small .docx file containing the type of thing you're trying to locate. Then use opc-diag to inspect the XML Word generates when you save the file:
$ opc browse test.docx document.xml
http://opc-diag.readthedocs.org/en/latest/index.html