python docx how to read text along with inline images?


I have a simple docx file with a PNG image inserted inline between the words "head" and "end".
I've tried:
>>> import docx
>>> x = docx.Document('12.docx')
>>> for p in x.paragraphs:
...     print(p.text)
...
headend
>>> list(x.inline_shapes)
[]
When I unzip 12.docx, I find the image at word/media/image1.png. So is there a way to get output like this:
>>> for p in x.paragraphs:
...     print(p.text_with_image_info)
...
head<word/media/image1.png>end

You should be able to get a list of inline shapes like this:
>>> [s for s in x.inline_shapes]
[<InlineShape object at 0x...>]
If none show up then you'd probably need to examine the XML to find out why it's not finding anything at the XPath location '//w:p/w:r/w:drawing/wp:inline'. That might yield an interesting finding if you're seeing an empty list there.
Regarding the bit about getting the text with image in document order, you'll need to go down to the lxml layer.
You can get the paragraph's lxml element (w:p) using Paragraph._p (also available as Paragraph._element). From there you can inspect the XML with the .xml property:
>>> p = paragraph._p
>>> p.xml
'<w:p> etc ...'
You'll need to iterate through the children of the w:p element; I expect you'll find primarily w:r (run) elements. Text is held below those in w:t elements, and a w:drawing element is a peer of w:t, if I'm not mistaken.
You can construct python-docx objects like InlineShape with the right child element to get access to a more convenient API once you've located the right bit.
So it's a bit of work but doable if you're up to working with lxml-level calls.
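As a rough illustration of the approach described above, here is a minimal sketch that walks a w:p element in document order and emits the text plus a marker for each embedded image. It uses the standard library's ElementTree on a hand-written, simplified XML fragment so that it runs without a .docx file. The helper name paragraph_text_with_images, the sample fragment, and the rels dictionary (relationship id mapped to target, as listed in word/_rels/document.xml.rels) are assumptions made for illustration, not part of python-docx.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by WordprocessingML and DrawingML.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
WP = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"
PIC = "http://schemas.openxmlformats.org/drawingml/2006/picture"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def paragraph_text_with_images(p, rels):
    """Hypothetical helper: text of one w:p element with <target> image markers.

    `rels` maps relationship ids (e.g. "rId4") to image targets.
    """
    parts = []
    # iter() visits the element and all descendants in document order.
    for node in p.iter():
        if node.tag == "{%s}t" % W:          # run text
            parts.append(node.text or "")
        elif node.tag == "{%s}blip" % A:     # reference to the picture part
            rid = node.get("{%s}embed" % R)
            parts.append("<%s>" % rels.get(rid, rid))
    return "".join(parts)


# Simplified stand-in for the paragraph XML of a document like 12.docx.
SAMPLE = (
    '<w:p xmlns:w="%s" xmlns:wp="%s" xmlns:a="%s" xmlns:pic="%s" xmlns:r="%s">'
    "<w:r><w:t>head</w:t></w:r>"
    "<w:r><w:drawing><wp:inline><a:graphic><a:graphicData>"
    '<pic:pic><pic:blipFill><a:blip r:embed="rId4"/></pic:blipFill></pic:pic>'
    "</a:graphicData></a:graphic></wp:inline></w:drawing></w:r>"
    "<w:r><w:t>end</w:t></w:r>"
    "</w:p>" % (W, WP, A, PIC, R)
)

paragraph = ET.fromstring(SAMPLE)
print(paragraph_text_with_images(paragraph, {"rId4": "media/image1.png"}))
# -> head<media/image1.png>end
```

The same loop should carry over to paragraph._p, since that is an lxml element with a compatible iter() method; the relationship targets would then have to come from the document part rather than a hand-built dictionary.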

Related

Xpath that returns the whole document in python

I am trying to debug some inherited code. So there is a line of code
elt = doc.xpath('body/div/pre[@id="bb flags"]')[0]
I want to see what the entire document looks like at that point, rather than just a specific piece. So what xpath should I insert there?
i.e.
elt_entire_document = doc.xpath(new xpath)
logging.info("The full document here is " + elt_entire_document.text)
Is this even possible with xpath, or is it more complicated than that?

Simply like this, based on our comments:
logging.info("The full document here is " + text)

Your question title seems to be asking about selecting a whole document, yet your question body seems to be asking about displaying a selected node...
Selecting the whole document via XPath
Selecting the whole document might mean any of the following:
/ selects the root node of the XML document.
/* selects the document element (aka the root element) of the XML document. (Its parent is the root node.)
string(/) selects the string-value of the XML document.
See also:
What is the difference between root node, root element and document element in XML?
How to pretty print XML from the command line?
Python pretty print subtree
How to use lxml and python to pretty print a subtree of an xml file?
how to get the full contents of a node using xpath & lxml?
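To make the three expressions above concrete, here is a small sketch assuming lxml is installed (the question's doc.xpath call suggests it is). The sample markup and variable names are made up for illustration.

```python
from lxml import etree

# Hypothetical stand-in for the document being debugged.
doc = etree.fromstring(
    '<body><div><pre id="bb flags">hello</pre></div></body>'
).getroottree()

root = doc.xpath("/*")[0]          # the document (root) element
all_text = doc.xpath("string(/)")  # string-value of the whole document

# Serialize the whole tree to see what the document looks like.
serialized = etree.tostring(root, pretty_print=True, encoding="unicode")
print(serialized)
print(all_text)
```

For the logging case in the question, passing the serialized string to logging.info is probably more useful than string(/), because it keeps the tags.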

Python element tree - extract text from element, stripping tags

With ElementTree in Python, how can I extract all the text from a node, stripping any tags in that element and keeping only the text?
For example, say I have the following:
<tag>
Some <a>example</a> text
</tag>
I want to return Some example text. How do I go about doing this? So far, the approaches I've taken have had fairly disastrous outcomes.

If you are running under Python 3.2+, you can use itertext.
itertext creates a text iterator which loops over this element and all subelements, in document order, and returns all inner text:
import xml.etree.ElementTree as ET
xml = '<tag>Some <a>example</a> text</tag>'
tree = ET.fromstring(xml)
print(''.join(tree.itertext()))
# -> 'Some example text'
If you are running in a lower version of Python, you can reuse the implementation of itertext() by attaching it to the Element class, after which you can call it exactly like above:
# original implementation of .itertext() for Python 2.7
def itertext(self):
    tag = self.tag
    if not isinstance(tag, basestring) and tag is not None:
        return
    if self.text:
        yield self.text
    for e in self:
        for s in e.itertext():
            yield s
        if e.tail:
            yield e.tail
# if necessary, monkey-patch the Element class
if 'itertext' not in ET.Element.__dict__:
    ET.Element.itertext = itertext
xml = '<tag>Some <a>example</a> text</tag>'
tree = ET.fromstring(xml)
print(''.join(tree.itertext()))
# -> 'Some example text'

As the documentation says, if you want to read only the text, without any intermediate tags, you have to recursively concatenate all text and tail attributes in the correct order.
However, recent-enough versions (including the ones in the stdlib in 2.7 and 3.2, but not 2.6 or 3.1, and the current released versions of both ElementTree and lxml on PyPI) can do this for you automatically in the tostring method:
>>> from xml.etree import ElementTree
>>> s = '''<tag>
...  Some <a>example</a> text
... </tag>'''
>>> t = ElementTree.fromstring(s)
>>> ElementTree.tostring(t, method='text')
'\n Some example text\n'
If you also want to strip whitespace from the text, you'll need to do so manually. In your simple case, that's easy:
>>> ElementTree.tostring(t, method='text').strip()
'Some example text'
In more complicated cases, however, where you want to strip out whitespace within intermediate tags, you'll probably have to fall back on recursively processing the texts and tails. That's not too hard; you just have to remember to deal with the possibility that the attributes may be None. For example, here's a skeleton you can hook your own code on:
def textify(t):
    s = []
    if t.text:
        s.append(t.text)
    for child in t:  # iterating an element yields its children
        s.append(textify(child))
    if t.tail:
        s.append(t.tail)
    return ''.join(s)
This version only works when text and tail are guaranteed to be a str or None. For trees you build up manually, that's not guaranteed to be true.
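If all you need is to collapse runs of whitespace, including whitespace inside intermediate tags, a shorter route than a hand-written recursion may be enough. The sketch below is one possible approach using itertext() and str.split(); the function name normalized_text and the sample input are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def normalized_text(element):
    """Join all text under `element` and collapse whitespace runs to one space."""
    raw = "".join(element.itertext())
    # split() with no arguments splits on any whitespace and drops empty pieces.
    return " ".join(raw.split())


tree = ET.fromstring("<tag>\n  Some <a>\n example </a>  text\n</tag>")
print(normalized_text(tree))
# -> Some example text
```

Note that this treats all whitespace alike, so it is not suitable where line breaks or indentation carry meaning.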

There is also a very simple solution if you can use XPath. It relies on XPath axes.
Suppose a node (such as a div) contains text as well as other nodes (such as a, center, or another div) that have text inside, or contains just text, and we want to select all the text in that div. This can be done with the following XPath: current_element.xpath("descendant-or-self::*/text()").extract(). The result is a list of all the text within the current element, with any inner tags stripped. (The .extract() call belongs to Scrapy selectors; with lxml, xpath() returns the list directly.)
What's nice about it is that no recursive function is needed. XPath takes care of all of this (using recursion itself, but for us it's as clean as it can be).
Here is a Stack Overflow question concerning this proposed solution.
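For reference, a minimal sketch of the same XPath with lxml instead of Scrapy, assuming lxml is installed. The sample markup is invented for illustration.

```python
from lxml import etree

div = etree.fromstring(
    "<div>Hello <a>big <b>wide</b></a> world<center>!</center></div>"
)

# Every text node of the element itself and of all its descendants,
# returned in document order.
texts = div.xpath("descendant-or-self::*/text()")
print(texts)
# -> ['Hello ', 'big ', 'wide', ' world', '!']
print("".join(texts))
# -> Hello big wide world!
```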

XML parsing in Python using Python 2 or 3

I'm just trying to write a simple program to allow me to parse some of the following XML.
So far in following examples I am not getting the results I'm looking for.
I encounter many of these XML files and I generally want the info after a handful of tags.
What's the best way, using ElementTree, to search for <Id> and grab whatever info is in that tag? I was trying things like
>>> for Reel in root.findall('Reel'):
...     id = Reel.findtext('Id')
...     print(id)
Is there a way to just look for every instance of <Id> and grab the urn: etc. that comes after it? Some code that traverses everything and looks for <what I want> and so on.
This is a very truncated version of what I usually deal with.
This didn't get what I wanted at all. Is there an easy way to just match <what I want> in any XML file and get the contents of that tag, or do I need to know the structure of the XML well enough to know its relation to root/child etc.?
<Reel>
<Id>urn:uuid:632437bc-73f9-49ca-b687-fdb3f98f430c</Id>
<AssetList>
<MainPicture>
<Id>urn:uuid:46afe8a3-50be-4986-b9c8-34f4ba69572f</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>340</IntrinsicDuration>
<EntryPoint>0</EntryPoint>
<Duration>340</Duration>
<FrameRate>24 1</FrameRate>
<ScreenAspectRatio>2048 858</ScreenAspectRatio>
</MainPicture>
<MainSound>
<Id>urn:uuid:1fce0915-f8c7-48a7-b023-36e204a66ed1</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>340</IntrinsicDuration>
<EntryPoint>0</EntryPoint>
<Duration>340</Duration>
</MainSound>
</AssetList>
</Reel>
@Mata, that worked perfectly, but when I tried to use it for different values in another XML file I fell flat on my face. For instance, what about this section of a file? I couldn't post the whole thing, unfortunately. What if I want to grab what comes after KeyId?
<?xml version="1.0" encoding="UTF-8" standalone="no" ?><DCinemaSecurityMessage xmlns="http://www.digicine.com/PROTO-ASDCP-KDM-20040311#" xmlns:dsig="http://www.w3.org/2000/09/xmldsig#" xmlns:enc="http://www.w3.org/2001/04/xmlenc#">
<!-- Generated by Wailua Version 0.3.20 -->
<AuthenticatedPublic Id="ID_AuthenticatedPublic">
<MessageId>urn:uuid:7bc63f4c-c617-4d00-9e51-0c8cd6a4f59e</MessageId>
<MessageType>http://www.digicine.com/PROTO-ASDCP-KDM-20040311#</MessageType>
<AnnotationText>SPIDERMAN-3_FTR_S_EN-XX_US-13_51_4K_PH_20070423_DELUXE ~ KDM for Quvis-10010.pem</AnnotationText>
<IssueDate>2007-04-29T04:13:43-00:00</IssueDate>
<Signer>
<dsig:X509IssuerName>dnQualifier=BzC0n/VV/uVrl2PL3uggPJ9va7Q=,CN=.deluxe-admin-c,OU=.mxf-j2c.ca.cinecert.com,O=.ca.cinecert.com</dsig:X509IssuerName>
<dsig:X509SerialNumber>10039</dsig:X509SerialNumber>
</Signer>
<RequiredExtensions>
<Recipient>
<X509IssuerSerial>
<dsig:X509IssuerName>dnQualifier=RUxyQle0qS7qPbcNRFBEgVjw0Og=,CN=SM.QuVIS.com.001,OU=QuVIS Digital Cinema,O=QuVIS.com</dsig:X509IssuerName>
<dsig:X509SerialNumber>363</dsig:X509SerialNumber>
</X509IssuerSerial>
<X509SubjectName>CN=SM MD LE FM.QuVIS_CinemaPlayer-3d_10010,OU=QuVIS,O=QuVIS.com,dnQualifier=3oBfjTfx1me0p1ms7XOX\+eqUUtE=</X509SubjectName>
</Recipient>
<CompositionPlaylistId>urn:uuid:336263da-e4f1-324e-8e0c-ebea00ff79f4</CompositionPlaylistId>
<ContentTitleText>SPIDERMAN-3_FTR_S_EN-XX_US-13_51_4K_PH_20070423_DELUXE</ContentTitleText>
<ContentKeysNotValidBefore>2007-04-30T05:00:00-00:00</ContentKeysNotValidBefore>
<ContentKeysNotValidAfter>2007-04-30T10:00:00-00:00</ContentKeysNotValidAfter>
<KeyIdList>
<KeyId>urn:uuid:9851b0f6-4790-0d4c-a69d-ea8abdedd03d</KeyId>
<KeyId>urn:uuid:8317e8f3-1597-494d-9ed8-08a751ff8615</KeyId>
<KeyId>urn:uuid:5d9b228d-7120-344c-aefc-840cdd32bbfc</KeyId>
<KeyId>urn:uuid:1e32ccb2-ab0b-9d43-b879-1c12840c178b</KeyId>
<KeyId>urn:uuid:44d04416-676a-2e4f-8995-165de8cab78d</KeyId>
<KeyId>urn:uuid:906da0c1-b0cb-4541-b8a9-86476583cdc4</KeyId>
<KeyId>urn:uuid:0fe2d73a-ebe3-9844-b3de-4517c63c4b90</KeyId>
<KeyId>urn:uuid:862fa79a-18c7-9245-a172-486541bef0c0</KeyId>
<KeyId>urn:uuid:aa2f1a88-7a55-894d-bc19-42afca589766</KeyId>
<KeyId>urn:uuid:59d6eeff-cd56-6245-9f13-951554466626</KeyId>
<KeyId>urn:uuid:14a13b1a-76ba-764c-97d0-9900f58af53e</KeyId>
<KeyId>urn:uuid:ccdbe0ae-1c3f-224c-b450-947f43bbd640</KeyId>
<KeyId>urn:uuid:dcd37f10-b042-8e44-bef0-89bda2174842</KeyId>
<KeyId>urn:uuid:9dd7103e-7e5a-a840-a15f-f7d7fe699203</KeyId>
</KeyIdList>
</RequiredExtensions>
<NonCriticalExtensions/>
</AuthenticatedPublic>
<AuthenticatedPrivate Id="ID_AuthenticatedPrivate"><enc:EncryptedKey xmlns:enc="http://www.w3.org/2001/04/xmlenc#">
<enc:EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p">
<ds:DigestMethod xmlns:ds="http://www.w3.org/2000/09/xmldsig#" Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/>
</enc:EncryptionMethod>

The expression Reel.findtext('Id') only matches direct children of Reel. If you want to find all Id tags in your xml document, you can just use:
ids = [id.text for id in Reel.findall(".//Id")]
This gives you a list of the text of every Id tag that is a descendant of Reel, at any depth.
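A small sketch of the difference, using a cut-down Reel element with placeholder ids (the uuid values below are made up):

```python
import xml.etree.ElementTree as ET

# Cut-down version of the question's XML with placeholder ids.
reel = ET.fromstring(
    "<Reel>"
    "<Id>urn:uuid:reel</Id>"
    "<AssetList>"
    "<MainPicture><Id>urn:uuid:picture</Id></MainPicture>"
    "<MainSound><Id>urn:uuid:sound</Id></MainSound>"
    "</AssetList>"
    "</Reel>"
)

# 'Id' only looks at direct children of Reel.
direct_id = reel.findtext("Id")
# './/Id' searches all descendants, in document order.
all_ids = [el.text for el in reel.findall(".//Id")]

print(direct_id)  # -> urn:uuid:reel
print(all_ids)    # -> ['urn:uuid:reel', 'urn:uuid:picture', 'urn:uuid:sound']
```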
edit:
Your updated example uses namespaces, in this case KeyId is in the default namespace (http://www.digicine.com/PROTO-ASDCP-KDM-20040311#), so to search for it you need to include it in your search:
from xml.etree import ElementTree
doc = ElementTree.parse('test.xml')
nsmap = {'ns': 'http://www.digicine.com/PROTO-ASDCP-KDM-20040311#'}
ids = [id.text for id in doc.findall(".//ns:KeyId", namespaces=nsmap)]
print(ids)
...
The XPath subset ElementTree supports is rather limited. If you want more complete support, you should use lxml instead; its XPath support is far more complete.
For example, using xpath to search for all KeyId tags (ignoring namespaces) and returning their text content directly:
from lxml import etree
doc = etree.parse('test.xml')
ids = doc.xpath(".//*[local-name()='KeyId']/text()")
print(ids)
...
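If lxml is not available, a similar namespace-agnostic search can be approximated with the standard library by comparing only the local part of each tag. This is a sketch under that assumption; the helper name texts_by_local_name and the shortened sample (with placeholder ids) are not from the original posts.

```python
import xml.etree.ElementTree as ET


def texts_by_local_name(root, local_name):
    """Collect the text of every element whose tag, minus namespace, matches."""
    found = []
    for node in root.iter():
        tag = node.tag
        if not isinstance(tag, str):  # skip comments / processing instructions
            continue
        # '{namespace}KeyId' -> 'KeyId'; tags without a namespace are unchanged.
        if tag.rpartition("}")[2] == local_name:
            found.append(node.text)
    return found


SAMPLE = (
    '<DCinemaSecurityMessage xmlns="http://www.digicine.com/PROTO-ASDCP-KDM-20040311#">'
    "<KeyIdList>"
    "<KeyId>urn:uuid:first</KeyId>"
    "<KeyId>urn:uuid:second</KeyId>"
    "</KeyIdList>"
    "</DCinemaSecurityMessage>"
)

root = ET.fromstring(SAMPLE)
print(texts_by_local_name(root, "KeyId"))
# -> ['urn:uuid:first', 'urn:uuid:second']
```

On Python 3.8 and later, root.findall(".//{*}KeyId") should give the same elements using ElementTree's namespace wildcard.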

It sounds like XPath might be right up your alley - it will let you query your XML document for exactly what you're looking for, as long as you know the structure.

Here's what I needed to do. This works for finding whatever I need.
>>> for node in tree.iter():  # getiterator() was removed in Python 3.9
...     if 'KeyId' in node.tag:
...         mylist = node.text  # the content of the tag, not its name
...         print(mylist)
...

how to strip all child tags in an xml tag but leaving the text to merge to the parent using lxml in python?

How can one tell etree.strip_tags() to strip all possible tags from a given tag element?
Do I have to map them myself, like:
STRIP_TAGS = [ALL TAGS...] # Is there a built in list or dictionary in lxml
# that gives you all tags?
etree.strip_tags(tag, *STRIP_TAGS)
Perhaps a more elegant approach I don't know of?
Example input:
parent_tag = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"
Desired Output:
# <parent>This is some text with multiple tags and sometimes they are nested.</parent>
or even better:
This is some text with multiple tags and sometimes they are nested.

You can use the lxml.html.clean module:
import lxml.html, lxml.html.clean
s = '<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'
tree = lxml.html.fromstring(s)
cleaner = lxml.html.clean.Cleaner(allow_tags=['parent'], remove_unknown_tags=False)
cleaned_tree = cleaner.clean_html(tree)
print(lxml.etree.tostring(cleaned_tree, encoding='unicode'))
# <parent>This is some text with multiple tags and sometimes they are nested.</parent>

This answer is a bit late, but I guess a simpler solution than the one provided by the initial answer by ars might be handy for safekeeping's sake.
Short Answer
Use the "*" argument when you call strip_tags() to specify all tags to be stripped.
Long Answer
Given your XML string, we can create an lxml Element:
>>> import lxml.etree
>>> s = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"
>>> parent_tag = lxml.etree.fromstring(s)
You can inspect that instance like so:
>>> parent_tag
<Element parent at 0x5f9b70>
>>> lxml.etree.tostring(parent_tag)
b'<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'
To strip out all the tags except the parent tag itself, use the etree.strip_tags() function like you suggested, but with a "*" argument:
>>> lxml.etree.strip_tags(parent_tag, "*")
Inspection shows that all child tags are gone:
>>> lxml.etree.tostring(parent_tag)
b'<parent>This is some text with multiple tags and sometimes they are nested.</parent>'
Which is your desired output. Note that this will modify the lxml Element instance itself! To make it even better (as you asked :-)) just grab the text property:
>>> parent_tag.text
'This is some text with multiple tags and sometimes they are nested.'
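If modifying the element in place is a concern, the text can also be collected without stripping anything. The following sketch uses itertext() from the standard library's ElementTree (lxml elements offer the same method) and leaves the tree untouched.

```python
import xml.etree.ElementTree as ET

s = (
    "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> "
    "and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"
)
parent_tag = ET.fromstring(s)

# itertext() yields every text and tail string in document order.
text = "".join(parent_tag.itertext())
print(text)
# -> This is some text with multiple tags and sometimes they are nested.

# The child elements are still there afterwards.
print(len(parent_tag))
# -> 3
```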

Getting non-contiguous text with lxml / ElementTree

Suppose I have this sort of HTML from which I need to select "text2" using lxml / ElementTree:
<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>
If I already have the div element as mydiv, then mydiv.text returns just "text1".
Using itertext() seems problematic or cumbersome at best since it walks the entire tree under the div.
Is there any simple/elegant way to extract a non-first text chunk from an element?

Well, lxml.etree provides full XPath support, which allows you to address the text items:
>>> import lxml.etree
>>> fragment = '<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>'
>>> div = lxml.etree.fromstring(fragment)
>>> div.xpath('./text()')
['text1', 'text2', 'text3']

Such text will be in the tail attributes of the children of your element. If your element were in elem then:
elem[0].tail
Would give you the tail text of the first child within the element, in your case the "text2" you are looking for.
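A quick check of this with the fragment from the question, here using the standard library's ElementTree (lxml behaves the same way for text and tail):

```python
import xml.etree.ElementTree as ET

div = ET.fromstring(
    "<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>"
)

print(div.text)     # text before the first child  -> text1
print(div[0].tail)  # text after the first span    -> text2
print(div[1].tail)  # text after the second span   -> text3
```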

As llasram said, any text not in the text attribute will be in the tail attributes of the child nodes.
As an example, here's the simplest way to extract all of the text chunks (first and otherwise) in a node:
html = '<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>'
import lxml.html # ...or lxml.etree as appropriate
div = lxml.html.fromstring(html)
texts = [div.text] + [child.tail for child in div]
# Result: texts == ['text1', 'text2', 'text3']
# ...and you are guaranteed that div[x].tail == texts[x+1]
# (which can be useful if you need to access or modify the DOM)
If you'd rather sacrifice that relation in order to keep texts from containing None or empty strings, you could use this instead:
texts = [div.text] + [child.tail for child in div if child.tail]
I haven't tested this with plain old stdlib ElementTree, but it should work with that too. (Something that only occurred to me once I saw Shane Holloway's lxml-specific solution.) I just prefer lxml because it's got better support for HTML's idiosyncrasies and I usually already have it installed for lxml.html.clean.

With lxml.html, use node.text_content() to get all of the text below a node as a single string.
