Adding attributes to existing elements, removing elements, etc with lxml - python

I parse in the XML using
from lxml import etree
tree = etree.parse('test.xml', etree.XMLParser())
Now I want to work on the parsed XML. I'm having trouble removing elements with namespaces or just elements in general such as
<rdf:description><dc:title>Example</dc:title></rdf:description>
and I want to remove that entire element, including everything inside its tags. I also want to add attributes to existing elements. The methods I need are in the Element class, but I have no idea how to use them with the ElementTree object here. Any pointers would definitely be appreciated, thanks.

You can get to the root element via this call: root=tree.getroot()
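Using that root element, you can use findall() and remove elements that match your criteria. Note that findall("title") only matches direct children of root that have no namespace, and root.remove() can only remove direct children of root: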
deleteThese = root.findall("title")
for element in deleteThese: root.remove(element)
Finally, you can see what your new tree looks like with this: etree.tostring(root, pretty_print=True)
Here is some info about how find/findall work:
http://infohost.nmt.edu/tcc/help/pubs/pylxml/class-ElementTree.html#ElementTree-find
To add an attribute to an element, try something like this:
root.attrib['myNewAttribute']='hello world'
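For the namespaced case in the question, a minimal sketch might look like the following. The sample document, the ns prefix mapping, and the attribute values are assumptions made up for illustration; findall(), getparent(), remove(), and set() are the lxml calls doing the work.

```python
from lxml import etree

# Hypothetical sample document modelled on the question's snippet.
xml_text = b"""<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description><dc:title>Example</dc:title></rdf:Description>
  <rdf:Description><dc:creator>Someone</dc:creator></rdf:Description>
</rdf:RDF>"""

# Prefix-to-URI mapping used only for searching; the prefixes are our choice.
ns = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

root = etree.fromstring(xml_text)

# Find every dc:title at any depth, then remove the element that wraps it.
# remove() must be called on the direct parent of the element being removed.
for title in root.findall(".//dc:title", namespaces=ns):
    description = title.getparent()
    description.getparent().remove(description)

# Add a plain attribute and a namespaced attribute to a remaining element.
remaining = root.find("rdf:Description", namespaces=ns)
remaining.set("myNewAttribute", "hello world")
remaining.set("{%s}about" % ns["rdf"], "http://example.com/item")

print(etree.tostring(root, pretty_print=True, encoding="unicode"))
```

Removing an element also removes everything nested inside it, so no separate cleanup of the children is needed.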

The remove method should do what you want:
>>> from lxml import etree
>>> from io import StringIO
>>> s = '<Root><Description><Title>foo</Title></Description></Root>'
>>> tree = etree.parse(StringIO(s))
>>> print(etree.tostring(tree.getroot(), encoding='unicode'))
<Root><Description><Title>foo</Title></Description></Root>
>>> title = tree.find('.//Title')
>>> title.getparent().remove(title)
>>> etree.tostring(tree.getroot())
b'<Root><Description/></Root>'
>>> print(etree.tostring(tree.getroot(), encoding='unicode'))
<Root><Description/></Root>

Related

XML parser that contains debug information

I'm looking for an XML parser for python that includes some debug information in each node, for instance the line number and column where the node began. Ideally, it would be a parser that is compatible with xml.etree.ElementTree.XMLParser, i.e., one that I can pass to xml.etree.ElementTree.parse.
I know these parsers don't actually produce the elements, so I'm not sure how this would really work, but it seems like such a useful thing that I'll be surprised if nobody has one. Syntax errors in the XML are one thing, but semantic errors in the resulting structure can be difficult to debug if you can't point to a specific location in the source file/string.
Point to an element by xpath (lxml - getpath)
lxml can report the XPath of an element in a document.
Given this test document:
>>> from lxml import etree
>>> xmlstr = """<root><rec id="01"><subrec>a</subrec><subrec>b</subrec></rec>
... <rec id="02"><para>graph</para></rec>
... </root>"""
>>> doc = etree.fromstring(xmlstr)
>>> doc
<Element root at 0x7f61040fd5f0>
We pick an element <para>graph</para>:
>>> para = doc.xpath("//para")[0]
>>> para
<Element para at 0x7f61040fd488>
An XPath only has meaning relative to a clear context; in this case the context is the root of the XML document:
>>> root = doc.getroottree()
>>> root
<lxml.etree._ElementTree at 0x7f610410f758>
Now we can ask which XPath leads from the root to the element we are interested in:
>>> root.getpath(para)
'/root/rec[2]/para'
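For the line-number part of the question, lxml elements also carry a sourceline attribute. The sketch below pairs it with getpath(); the sample document and the variable names are assumptions for illustration, and the sketch only covers line numbers, not columns.

```python
from lxml import etree

# Hypothetical three-line document; line numbers are counted from 1.
xml_text = """<root>
  <rec id="01"><subrec>a</subrec></rec>
  <rec id="02"><para>graph</para></rec>
</root>"""

doc = etree.fromstring(xml_text)
tree = doc.getroottree()

# Pair each element's XPath with the source line on which its tag starts.
locations = [(tree.getpath(el), el.sourceline) for el in doc.iter()]

for path, line in locations:
    print("%s (line %s)" % (path, line))
```

A semantic check that fails on some element can then report both the path and the line in its error message.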

Strip html tags - lxml.html.clean.clean_html doesn't work as expected

I want to strip all html tags from a string except some I specify.
If I call the constructor with default values everything works fine:
>>> cleaner = lxml.html.clean.Cleaner()
>>> cleaner.clean_html('''<i>italic</i><script>alert('');</script>''')
'<span><i>italic</i></span>'
But when I try to specify some tags, things don't work anymore:
>>> allowed_tags = ['i','s']
>>> cleaner = lxml.html.clean.Cleaner(remove_unknown_tags=False,allow_tags=allowed_tags)
>>> cleaner.clean_html('''<i>italic</i><s>strike</s>''')
'<span></span>'
So what am I doing wrong?
As a workaround, you can add span and div tags to allowed_tags.
UPD
lxml.html.clean.Cleaner converts the string to an HTML tree by calling fromstring, which checks whether the document has a single root node and adds a wrapper element if necessary. So you need to allow the span and div tags.
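The wrapping step can be observed without the cleaner at all. This is a small sketch, assuming only lxml.html.fromstring; the variable names are made up here, and the exact wrapper tag (span or div) may differ between lxml versions, so no specific output is claimed.

```python
import lxml.html

# Two sibling elements with no common root: fromstring has to invent one.
wrapper = lxml.html.fromstring('<i>italic</i><s>strike</s>')

# The wrapper is the element the cleaner later sees as the top-level tag.
print(wrapper.tag)
print([child.tag for child in wrapper])
```

If the printed wrapper tag is not in allow_tags, the cleaner treats it like any other disallowed tag, which is why adding span and div works as a workaround.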
It seems like a bug. I don't see it in lxml==2.3.3 version:
>>> from lxml.html import clean
>>> clean.clean_html('''<i>italic</i><script>alert('');</script>''')
'<span><i>italic</i></span>'
>>> c = clean.Cleaner(allow_tags='is', remove_unknown_tags=False)
>>> c.clean_html('''<i>italic</i><s>strike</s>''')
'<div><i>italic</i><s>strike</s></div>'

Python XML question

I have an XML document as a str. Now, in the XSD <foo> is unbounded, and while most of the time there is only 1, there COULD be more. I'm trying to use ElementTree, but am running into an issue:
>>> from xml.etree.ElementTree import fromstring
>>>
>>> xml_str = """<?xml version="1.0"?>
... <foo>
... <bar>
... <baz>Spam</baz>
... <qux>Eggs</qux>
... </bar>
... </foo>"""
>>> # Try to get the document
>>> el = fromstring(xml_str)
>>> el.findall('foo')
[]
>>> el.findall('bar')
[<Element 'bar' at 0x1004acb90>]
Clearly, I need to loop through the <foo>s, but because <foo> is at the root, I can't. Obviously, I could create an element called <root> and put el inside of it, but is there a more correct way of doing this?
Each XML document is supposed to have exactly one root element. You will need to adjust your XML if you want to support multiple foo elements.
Alas, wrapping the element in an ElementTree with tree = ElementTree(el) and trying tree.findall('//foo') doesn't seem to work either (it seems you can only search "beneath" an element, and even if the search is done from the full tree, it searches "beneath" the root). As ElementTree doesn't claim to really implement xpath, it's difficult to say whether this is intended or a bug.
Solution: without using lxml with full xpath support (el.xpath('//foo') for example), the easiest solution would be to use the Element.iter() method.
for foo in el.iter(tag='foo'):
    print(foo)
or if you want the results in a list:
list(el.iter(tag='foo'))
Note that you can't use complex paths in this way, just find all elements with a certain tagname, starting from (and including) the element.
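The difference between findall() and iter() on the root can be checked with the standard library alone. This sketch reuses the question's document without the XML declaration; the variable names direct_children and including_root are made up for illustration.

```python
from xml.etree.ElementTree import fromstring

xml_str = """<foo>
  <bar>
    <baz>Spam</baz>
    <qux>Eggs</qux>
  </bar>
</foo>"""

el = fromstring(xml_str)

# findall() searches below the element, so the root <foo> itself is not found.
direct_children = el.findall('foo')

# iter() starts with the element itself, so the root <foo> is included.
including_root = list(el.iter('foo'))

print(direct_children)
print([e.tag for e in including_root])
```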

How to strip all child tags in an XML tag but leave the text merged into the parent, using lxml in Python?

How can one tell etree.strip_tags() to strip all possible tags from a given tag element?
Do I have to map them myself, like:
STRIP_TAGS = [ALL TAGS...] # Is there a built in list or dictionary in lxml
# that gives you all tags?
etree.strip_tags(tag, *STRIP_TAGS)
Perhaps a more elegant approach I don't know of?
Example input:
parent_tag = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"
Desired Output:
# <parent>This is some text with multiple tags and sometimes they are nested.</parent>
or even better:
This is some text with multiple tags and sometimes they are nested.
You can use the lxml.html.clean module:
import lxml.html, lxml.html.clean
s = '<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'
tree = lxml.html.fromstring(s)
cleaner = lxml.html.clean.Cleaner(allow_tags=['parent'], remove_unknown_tags=False)
cleaned_tree = cleaner.clean_html(tree)
print(lxml.etree.tostring(cleaned_tree, encoding='unicode'))
# <parent>This is some text with multiple tags and sometimes they are nested.</parent>
This answer is a bit late, but I guess a simpler solution than the one provided by the initial answer by ars might be handy for safekeeping's sake.
Short Answer
Use the "*" argument when you call strip_tags() to specify all tags to be stripped.
Long Answer
Given your XML string, we can create an lxml Element:
>>> import lxml.etree
>>> s = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"
>>> parent_tag = lxml.etree.fromstring(s)
You can inspect that instance like so:
>>> parent_tag
<Element parent at 0x5f9b70>
>>> lxml.etree.tostring(parent_tag)
b'<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'
To strip out all the tags except the parent tag itself, use the etree.strip_tags() function like you suggested, but with a "*" argument:
>>> lxml.etree.strip_tags(parent_tag, "*")
Inspection shows that all child tags are gone:
>>> lxml.etree.tostring(parent_tag)
b'<parent>This is some text with multiple tags and sometimes they are nested.</parent>'
Which is your desired output. Note that this will modify the lxml Element instance itself! To make it even better (as you asked :-)) just grab the text property:
>>> parent_tag.text
'This is some text with multiple tags and sometimes they are nested.'
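If only the text is needed and the tree should stay untouched, joining the pieces from itertext() is a non-mutating alternative. The sketch below uses the standard library's xml.etree.ElementTree so it runs without lxml; lxml elements offer an itertext() method as well. The variable names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

s = ("<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag>"
     " and sometimes they <tt>are<bold> nested</bold></tt>.</parent>")

parent_tag = ET.fromstring(s)

# itertext() walks the element and its descendants in document order,
# yielding each text and tail chunk; the tree itself is not modified.
flat_text = "".join(parent_tag.itertext())

print(flat_text)
# The child elements are still present afterwards.
print([child.tag for child in parent_tag])
```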

Getting non-contiguous text with lxml / ElementTree

Suppose I have this sort of HTML from which I need to select "text2" using lxml / ElementTree:
<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>
If I already have the div element as mydiv, then mydiv.text returns just "text1".
Using itertext() seems problematic or cumbersome at best since it walks the entire tree under the div.
Is there any simple/elegant way to extract a non-first text chunk from an element?
Well, lxml.etree provides full XPath support, which allows you to address the text items:
>>> import lxml.etree
>>> fragment = '<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>'
>>> div = lxml.etree.fromstring(fragment)
>>> div.xpath('./text()')
['text1', 'text2', 'text3']
Such text will be in the tail attributes of the children of your element. If your element were in elem then:
elem[0].tail
Would give you the tail text of the first child within the element, in your case the "text2" you are looking for.
As llasram said, any text not in the text attribute will be in the tail attributes of the child nodes.
As an example, here's the simplest way to extract all of the text chunks (first and otherwise) in a node:
html = '<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>'
import lxml.html # ...or lxml.etree as appropriate
div = lxml.html.fromstring(html)
texts = [div.text] + [child.tail for child in div]
# Result: texts == ['text1', 'text2', 'text3']
# ...and you are guaranteed that div[x].tail == texts[x+1]
# (which can be useful if you need to access or modify the DOM)
If you'd rather sacrifice that relation in order to prevent texts from potentially containing empty strings, you could use this instead:
texts = [div.text] + [child.tail for child in div if child.tail]
I haven't tested this with plain old stdlib ElementTree, but it should work with that too. (Something that only occurred to me once I saw Shane Holloway's lxml-specific solution.) I just prefer lxml because it has better support for HTML's idiosyncrasies, and I usually already have it installed for lxml.html.clean.
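As a quick check of the stdlib claim, here is the same approach with xml.etree.ElementTree. It only works here because the fragment happens to be well-formed XML; the name second_chunk is made up for illustration.

```python
import xml.etree.ElementTree as ET

html = '<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>'
div = ET.fromstring(html)

# The element's own text, followed by the tail of each direct child.
texts = [div.text] + [child.tail for child in div]

# The chunk the question asks for is the tail of the first child.
second_chunk = div[0].tail

print(texts)
print(second_chunk)
```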
With lxml.html, use node.text_content() to get all of the text below a node, including the text of child elements, as a single string.
