Getting non-contiguous text with lxml / ElementTree - python

Suppose I have this sort of HTML from which I need to select "text2" using lxml / ElementTree:
<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>
If I already have the div element as mydiv, then mydiv.text returns just "text1".
Using itertext() seems problematic or cumbersome at best since it walks the entire tree under the div.
Is there any simple/elegant way to extract a non-first text chunk from an element?

Well, lxml.etree provides full XPath support, which allows you to address the text items:
>>> import lxml.etree
>>> fragment = '<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>'
>>> div = lxml.etree.fromstring(fragment)
>>> div.xpath('./text()')
['text1', 'text2', 'text3']

Such text will be in the tail attributes of the children of your element. If your element were in elem then:
elem[0].tail
Would give you the tail text of the first child within the element, in your case the "text2" you are looking for.

As llasram said, any text not in the text attribute will be in the tail attributes of the child nodes.
As an example, here's the simplest way to extract all of the text chunks (first and otherwise) in a node:
html = '<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>'
import lxml.html # ...or lxml.etree as appropriate
div = lxml.html.fromstring(html)
texts = [div.text] + [child.tail for child in div]
# Result: texts == ['text1', 'text2', 'text3']
# ...and you are guaranteed that div[x].tail == texts[x+1]
# (which can be useful if you need to access or modify the DOM)
If you'd rather sacrifice that relation in order to prevent texts from potentially containing empty strings, you could use this instead:
texts = [div.text] + [child.tail for child in div if child.tail]
I haven't tested this with plain old stdlib ElementTree, but it should work with that too. (Something that only occurred to me once I saw Shane Holloway's lxml-specific solution) I just prefer LXML because it's got better support for HTML's ideosyncracies and I usually already have it installed for lxml.html.clean

Use node.text_content() to get all of the text below a node as a single string.

Related

XPath text() does not get the text of a link node

from lxml import etree
import requests
htmlparser = etree.HTMLParser()
f = requests.get('https://rss.orf.at/news.xml')
# without the ufeff this would fail because it tells me: "ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration."
tree = etree.fromstring('\ufeff'+f.text, htmlparser)
print(tree.xpath('//item/title/text()')) #<- this does produce a liste of titles
print(tree.xpath('//item/link/text()')) #<- this does NOT produce a liste of links why ?!?!
Okay this is a bit of mystery to me, and maybe I'm just overlooking the simplest thing, but the XPath '//item/link/text()' does only produce an empty list while '//item/title/text()' works exactly like expected. Does the <link> node hold any special purpose? I can select all of them with '//item/link' I just can't get the text() selector to work on them.
You're using etree.HTMLParser to parse an XML document. I suspect this was an attempt to deal with XML namespacing, but I think it's probably the wrong solution. It's possible treating the XML document as HTML is ultimately the source of your problem.
If we use the XML parser instead, everything pretty much works as expected.
First, if we look at the root element, we see that it sets a default namespace:
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:orfon="http://rss.orf.at/1.0/"
xmlns="http://purl.org/rss/1.0/"
>
That means when we see an item element in the document, it's actually an "item in the http://purl.org/rss/1.0/ namespace" element. We need to provide that namespace information in our xpath queries by passing in a namespaces dictionary and use a namespace prefix on the element names, like this:
>>> tree.xpath('//rss:item', namespaces={'rss': 'http://purl.org/rss/1.0/'})
[<Element {http://purl.org/rss/1.0/}item at 0x7f0497000e80>, ...]
Your first xpath expression (looking at /item/title/text()) becomes:
>>> tree.xpath('//rss:item/rss:title/text()', namespaces={'rss': 'http://purl.org/rss/1.0/'})
['Amnesty dokumentiert Kriegsverbrechen', ..., 'Moskauer Börse startet abgeschirmten Handel']
And your second xpath expression (looking at /item/link/text()) becomes:
>>> tree.xpath('//rss:item/rss:link/text()', namespaces={'rss': 'http://purl.org/rss/1.0/'})
['https://orf.at/stories/3255477/', ..., 'https://orf.at/stories/3255384/']
This makes the code look like:
from lxml import etree
import requests
f = requests.get('https://rss.orf.at/news.xml')
tree = etree.fromstring(f.content)
print(tree.xpath('//rss:item/rss:title/text()', namespaces={'rss': 'http://purl.org/rss/1.0/'}))
print(tree.xpath('//rss:item/rss:link/text()', namespaces={'rss': 'http://purl.org/rss/1.0/'}))
Note that by using f.content (which is a byte string) instead of f.text (a unicode string), we avoid the whole unicode parsing error.

python docx how to read text along with inline images?

I have a simple docx file like this(just insert a inline png file to text):
I've tried:
>>> x=docx.Document('12.docx')
>>> for p in x.paragraphs:
print(p.text)
headend
>>> list(x.inline_shapes)
[]
And I unzip 12.docx file, found word/media/image1.png is the location. So is there a way to get a output like:
>>> for p in x.paragraphs:
print(p.text_with_image_info)
head<word/media/image1.png>end
You should be able to get a list of inline shapes like this:
>>> [s for s in x.inline_shapes]
[<InlineShape object at 0x...>]
If none show up then you'd probably need to examine the XML to find out why it's not finding anything at the XPath location '//w:p/w:r/w:drawing/wp:inline'. That might yield an interesting finding if you're seeing an empty list there.
Regarding the bit about getting the text with image in document order, you'll need to go down to the lxml layer.
You can get the paragraph lxml element w:p using Paragraph._element. From there you can inspect the XML with the .xml property:
>>> p = paragraph._p
>>> p.xml
'<w:p> etc ...'
You'll need to iterate through the children of the w:p element, I expect you'll find primarily w:r (run) elements. Text is held below those in w:t elements and a w:drawing element is a peer of w:t if I'm not mistaken.
You can construct python-docx objects like InlineShape with the right child element to get access to a more convenient API once you've located the right bit.
So it's a bit of work but doable if you're up to working with lxml-level calls.

Python element tree - extract text from element, stripping tags

With ElementTree in Python, how can I extract all the text from a node, stripping any tags in that element and keeping only the text?
For example, say I have the following:
<tag>
Some <a>example</a> text
</tag>
I want to return Some example text. How do I go about doing this? So far, the approaches I've taken have had fairly disastrous outcomes.
If you are running under Python 3.2+, you can use itertext.
itertext creates a text iterator which loops over this element and all subelements, in document order, and returns all inner text:
import xml.etree.ElementTree as ET
xml = '<tag>Some <a>example</a> text</tag>'
tree = ET.fromstring(xml)
print(''.join(tree.itertext()))
# -> 'Some example text'
If you are running in a lower version of Python, you can reuse the implementation of itertext() by attaching it to the Element class, after which you can call it exactly like above:
# original implementation of .itertext() for Python 2.7
def itertext(self):
tag = self.tag
if not isinstance(tag, basestring) and tag is not None:
return
if self.text:
yield self.text
for e in self:
for s in e.itertext():
yield s
if e.tail:
yield e.tail
# if necessary, monkey-patch the Element class
if 'itertext' not in ET.Element.__dict__:
ET.Element.itertext = itertext
xml = '<tag>Some <a>example</a> text</tag>'
tree = ET.fromstring(xml)
print(''.join(tree.itertext()))
# -> 'Some example text'
As the documentation says, if you want to read only the text, without any intermediate tags, you have to recursively concatenate all text and tail attributes in the correct order.
However, recent-enough versions (including the ones in the stdlib in 2.7 and 3.2, but not 2.6 or 3.1, and the current released versions of both ElementTree and lxml on PyPI) can do this for you automatically in the tostring method:
>>> s = '''<tag>
... Some <a>example</a> text
... </tag>'''
>>> t = ElementTree.fromstring(s)
>>> ElementTree.tostring(s, method='text')
'\n Some example text\n'
If you also want to strip whitespace from the text, you'll need to do so manually. In your simple case, that's easy:
>>> ElementTree.tostring(s, method='text').strip()
'Some example text'
In more complicated cases, however, where you want to strip out whitespace within intermediate tags, you'll probably have to fall back on recursively processing the texts and tails. That's not too hard; you just have to remember to deal with the possibility that the attributes may be None. For example, here's a skeleton you can hook your own code on:
def textify(t):
s = []
if t.text:
s.append(t.text)
for child in t.getchildren():
s.extend(textify(child))
if t.tail:
s.append(t.tail)
return ''.join(s)
This version only works when text and tail are guaranteed to be a str or None. For trees you build up manually, that's not guaranteed to be true.
Aslo exists a very simple solution in case it's possible to use XPath. It's called XPath Axes: more about it can be found here.
When having a node (like a tag div) which itself contains text and other nodes as well (like tags a or center or another div) with text inside or it contains just text and we want to select all text in that div node, it's possible to do it with folowing XPath: current_element.xpath("descendant-or-self::*/text()").extract(). What we will get is a list of all texts within a current element, stripping tags inside, if there are any.
What's nice about it is that no recursive function is needed, XPath takes care of all of this (using recusion itself, but for us it's as clean as it only can be).
Here is StackOverflow question concerning this proposed solution.

how to strip all child tags in an xml tag but leaving the text to merge to the parens using lxml in python?

How can one tell etree.strip_tags() to strip all possible tags from a given tag element?
Do I have to map them myself, like:
STRIP_TAGS = [ALL TAGS...] # Is there a built in list or dictionary in lxml
# that gives you all tags?
etree.strip_tags(tag, *STRIP_TAGS)
Perhaps a more elegant approach I don't know of?
Example input:
parent_tag = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"
Desired Output:
# <parent>This is some text with multiple tags and sometimes they are nested.</parent>
or even better:
This is some text with multiple tags and sometimes they are nested.
You can use the lxml.html.clean module:
import lxml.html, lxml.html.clean
s = '<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'
tree = lxml.html.fromstring(s)
cleaner = lxml.html.clean.Cleaner(allow_tags=['parent'], remove_unknown_tags=False)
cleaned_tree = cleaner.clean_html(tree)
print lxml.etree.tostring(cleaned_tree)
# <parent>This is some text with multiple tags and sometimes they are nested.</parent>
This answer is a bit late, but I guess a simpler solution than the one provided by the initial answer by ars might be handy for safekeeping's sake.
Short Answer
Use the "*" argument when you call strip_tags() to specify all tags to be stripped.
Long Answer
Given your XML string, we can create an lxml Element:
>>> import lxml.etree
>>> s = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"
>>> parent_tag = lxml.etree.fromstring(s)
You can inspect that instance like so:
>>> parent_tag
<Element parent at 0x5f9b70>
>>> lxml.etree.tostring(parent_tag)
b'<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'
To strip out all the tags except the parent tag itself, use the etree.strip_tags() function like you suggested, but with a "*" argument:
>>> lxml.etree.strip_tags(parent_tag, "*")
Inspection shows that all child tags are gone:
>>> lxml.etree.tostring(parent_tag)
b'<parent>This is some text with multiple tags and sometimes they are nested.</parent>'
Which is your desired output. Note that this will modify the lxml Element instance itself! To make it even better (as you asked :-)) just grab the text property:
>>> parent_tag.text
'This is some text with multiple tags and sometimes they are nested.'

How to get all strings from all nested tags of a xml tag with python's lxml.etree library?

I have an xml file in which it is possible that the following occurs:
...
<a><b>This is</b> some text about <c>some</c> issue I have, parsing xml</a>
...
Edit: Let's assume, the tags could be nested more than only level, meaning
<a><b><c>...</c>...</b>...</a>
I came up with this using the python lxml.etree library.
context = etree.iterparse(PATH_TO_XML, dtd_validation=True, events=("end",))
for event, element in context:
tag = element.tag
if tag == "a":
print element.text # is empty :/
mystring = element.xpath("string()")
...
But somehow it goes wrong.
What I want is the whole string
"This is some text about some issue I have, parsing xml"
But I only get an empty string. Any suggestions? Thanks!
This question has been asked many times.
You can use lxml.html.text_content() method.
import lxml.html
t = lxml.html.fromstring("...")
t.text_content()
REF: Filter out HTML tags and resolve entities in python
OR use lxml.etree.strip_tags() method.
REF: In lxml, how do I remove a tag but retain all contents?

Categories