Removing an element, but not the text after it - python

I have an XML file similar to this:
<root>
<a>Some <b>bad</b> text <i>that</i> I <u>do <i>not</i></u> want to keep.</a>
</root>
I want to remove all text in <b> or <u> elements (and descendants), and print the rest. This is what I tried:
from __future__ import print_function
import xml.etree.ElementTree as ET
tree = ET.parse('a.xml')
root = tree.getroot()
parent_map = {c: p for p in root.iter() for c in p}
for item in root.findall('.//b'):
    parent_map[item].remove(item)
for item in root.findall('.//u'):
    parent_map[item].remove(item)
print(''.join(root.itertext()).strip())
(I used the recipe in this answer to build the parent_map). The problem, of course, is that with remove(item) I'm also removing the text after the element, and the result is:
Some that I
whereas what I want is:
Some text that I want to keep.
Is there any solution?

If nothing better turns up, you can use clear() instead of remove(). clear() also resets the tail, so save the tail first and restore it afterwards:
import xml.etree.ElementTree as ET
data = """<root>
<a>Some <b>bad</b> text <i>that</i> I <u>do <i>not</i></u> want to keep.</a>
</root>"""
tree = ET.fromstring(data)
a = tree.find('a')
for element in a:
    if element.tag in ('b', 'u'):
        tail = element.tail  # clear() wipes text, attributes, children and tail
        element.clear()
        element.tail = tail
print(ET.tostring(tree, encoding='unicode'))
prints (see empty b and u tags):
<root>
<a>Some <b /> text <i>that</i> I <u /> want to keep.</a>
</root>
Also, here's a solution using xml.dom.minidom:
import xml.dom.minidom
data = """<root>
<a>Some <b>bad</b> text <i>that</i> I <u>do <i>not</i></u> want to keep.</a>
</root>"""
dom = xml.dom.minidom.parseString(data)
a = dom.getElementsByTagName('a')[0]
# Iterate over a copy: childNodes changes while children are removed
for child in list(a.childNodes):
    if getattr(child, 'tagName', '') in ('u', 'b'):
        a.removeChild(child)
print(dom.toxml())
prints:
<?xml version="1.0" ?><root>
<a>Some text <i>that</i> I want to keep.</a>
</root>
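If you would rather stay with ElementTree and remove the elements entirely, one option is to move each removed element's tail onto whatever precedes it (the previous sibling's tail, or the parent's text) before calling remove(). The following is a minimal sketch under that assumption; the helper name remove_keep_tail is made up for this example and is not part of ElementTree.

```python
import xml.etree.ElementTree as ET

def remove_keep_tail(root, tags):
    """Remove every element whose tag is in `tags`, keeping the text that follows it."""
    # Snapshot the parents first, because the tree is modified while we walk it.
    for parent in list(root.iter()):
        previous = None  # last sibling that was kept
        for child in list(parent):
            if child.tag in tags:
                tail = child.tail or ''
                if previous is None:
                    # No kept sibling before it: the tail joins the parent's text
                    parent.text = (parent.text or '') + tail
                else:
                    # Otherwise the tail joins the previous sibling's tail
                    previous.tail = (previous.tail or '') + tail
                parent.remove(child)
            else:
                previous = child

data = """<root>
<a>Some <b>bad</b> text <i>that</i> I <u>do <i>not</i></u> want to keep.</a>
</root>"""
root = ET.fromstring(data)
remove_keep_tail(root, {'b', 'u'})
# Collapse the doubled spaces left where the elements used to be
result = ' '.join(''.join(root.itertext()).split())
print(result)
```

The whitespace normalization on the last lines is optional; without it the text keeps two spaces wherever an element was removed.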

Related

Removing empty xml nodes

I have an XML file that I'm trying to remove empty nodes from with Python. When I test whether the value is, say, 'shark', it works. But when I check for it being None, the empty node is not removed.
for records in recordList:
    for fieldGroup in records:
        for field in fieldGroup:
            if field.text is None:
                fieldGroup.remove(field)
xpath is your friend here.
from lxml import etree
doc = etree.XML("""<root><a>1</a><b><c></c></b><d></d></root>""")
def remove_empty_elements(doc):
    # not(node()) matches elements with no children and no text
    for element in doc.xpath('//*[not(node())]'):
        element.getparent().remove(element)
Then:
>>> print(etree.tostring(doc, pretty_print=True, encoding='unicode'))
<root>
<a>1</a>
<b>
<c/>
</b>
<d/>
</root>
>>> remove_empty_elements(doc)
>>> print(etree.tostring(doc, pretty_print=True, encoding='unicode'))
<root>
<a>1</a>
<b/>
</root>
>>> remove_empty_elements(doc)
>>> print(etree.tostring(doc, pretty_print=True, encoding='unicode'))
<root>
<a>1</a>
</root>
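The loop in the question removes elements from a parent while iterating over it, which can make the iteration skip siblings. For comparison, here is a sketch that uses only the standard library and prunes in a single pass by visiting children before their parent. The function name prune_empty is an assumption of this example, and "empty" is taken to mean no children and no non-whitespace text; elements that only carry attributes would also be removed.

```python
import xml.etree.ElementTree as ET

def prune_empty(element):
    """Remove descendants that have no children and no non-whitespace text."""
    for child in list(element):  # iterate over a copy, since we remove as we go
        prune_empty(child)       # children first, so a parent emptied by pruning goes too
        if len(child) == 0 and not (child.text or '').strip():
            # Note: the child's tail is dropped along with it
            element.remove(child)

root = ET.fromstring("<root><a>1</a><b><c></c></b><d></d></root>")
prune_empty(root)
serialized = ET.tostring(root, encoding='unicode')
print(serialized)
```

Because the recursion is post-order, b is removed in the same call that removes c, so the function does not need to be run twice.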

lxml, get xml between elements

Given this sample XML:
<xml>
<pb facs="id1" />
<aa></aa>
<aa></aa>
<lot-of-xml></lot-of-xml>
<pb facs="id2" />
<bb></bb>
<bb></bb>
<lot-of-xml></lot-of-xml>
</xml>
I need to parse it and get all the content between the pb elements, saving each part into a separate external file.
Expected result:
$ cat id1
<aa></aa>
<aa></aa>
<lot-of-xml></lot-of-xml>
$ cat id2
<bb></bb>
<bb></bb>
<lot-of-xml></lot-of-xml>
What is the correct XPath axis to use?
from lxml import etree
xml = etree.parse("sample.xml")
for pb in xml.xpath('//pb'):
    filename = pb.xpath('@facs')[0]
    f = open(filename, 'w')
    content = ...  # HOW TO GET THE CONTENT HERE?
    f.write(content)
    f.close()
Is there any XPath expression to get all descendants and stop when a new pb is reached?
Do you want to extract the tags between two pb elements? If so, that is not directly possible: those tags are not inside pb but are siblings on the same level, because each pb is self-closed. If you close pb after the test tag instead, test becomes a child of pb.
In other words, if your XML is like this:
<xml>
<pb facs="id1">
<test></test>
</pb>
<test></test>
<pb facs="id2" />
<test></test>
<test></test>
</xml>
Then you can use
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
for child in root:
    for subchild in child:
        print(subchild)
to print the subchild ('test') that has pb as its parent.
If that's not the case (you just want to extract the attributes of the pb tags), you can use either of the two methods below.
With Python's built-in ElementTree:
import xml.etree.ElementTree as ET
tree = ET.parse('sample.xml')
root = tree.getroot()
for child in root:
    if child.get('facs'):
        print(child.get('facs'))
With the lxml library you can parse it like this:
from lxml import etree
tree = etree.parse('test.xml')
root = tree.getroot()
for child in root:
    if child.get('facs'):
        print(child.get('facs'))
OK, I tested this code:
lists = []
for node in tree.findall('*'):
    if node.tag == 'pb':
        lists.append([])
    else:
        lists[-1].append(node)
Output:
>>> lists
[[<Element test at 2967fa8>, <Element test at 2a89030>, <Element lot-of-xml at 2a89080>], [<Element test at 2a89170>, <Element test at 2a891c0>, <Element lot-of-xml at 2a89210>]]
Input file (just in case):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xml>
<pb facs="id1" />
<test></test>
<test></test>
<lot-of-xml></lot-of-xml>
<pb facs="id2" />
<test></test>
<test></test>
<lot-of-xml></lot-of-xml>
</xml>
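To get from the grouped elements to the file contents the question asks for, the same grouping idea can be combined with serialization. This is only a sketch using the standard library; the function name split_on_pb is hypothetical, and it assumes every pb is a direct child of the root and carries a facs attribute.

```python
import xml.etree.ElementTree as ET

def split_on_pb(root):
    """Map each pb's facs value to the serialized siblings that follow it."""
    chunks = {}
    current = None
    for node in root:
        if node.tag == 'pb':
            current = node.get('facs')
            chunks[current] = []
        elif current is not None:  # ignore anything before the first pb
            # tostring() includes the node's tail, so strip the trailing whitespace
            chunks[current].append(
                ET.tostring(node, encoding='unicode',
                            short_empty_elements=False).strip())
    return {name: '\n'.join(parts) for name, parts in chunks.items()}

sample = """<xml>
<pb facs="id1" />
<aa></aa>
<aa></aa>
<lot-of-xml></lot-of-xml>
<pb facs="id2" />
<bb></bb>
<bb></bb>
<lot-of-xml></lot-of-xml>
</xml>"""
chunks = split_on_pb(ET.fromstring(sample))
for name, content in chunks.items():
    print(name)
    print(content)
```

Writing each entry of chunks to a file named after its key would then produce the id1 and id2 files from the question.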

etree.strip_tags returning 'None' when trying to strip tag

Script:
print(entryDetails)
for i in range(len(entryDetails)):
    print(etree.tostring(entryDetails[i]).decode())
    print(etree.strip_tags(entryDetails[i], 'entry-details'))
Output:
[<Element entry-details at 0x234e0a8>, <Element entry-details at 0x234e878>]
<entry-details>2014-02-05 11:57:01</entry-details>
None
<entry-details>2014-02-05 12:11:05</entry-details>
None
How is etree.strip_tags failing to strip the entry-details tag? Is the dash in the tag name affecting it?
strip_tags() does not return anything. It strips off the tags in-place.
The documentation says: "Note that this will not delete the element (or ElementTree root element) that you passed even if it matches. It will only treat its descendants.".
Demo code:
from lxml import etree
XML = """
<root>
<entry-details>ABC</entry-details>
</root>"""
root = etree.fromstring(XML)
ed = root.xpath("//entry-details")[0]
print(ed)
print()
etree.strip_tags(ed, "entry-details")  # Has no effect
print(etree.tostring(root).decode())
print()
etree.strip_tags(root, "entry-details")
print(etree.tostring(root).decode())
Output:
<Element entry-details at 0x2123b98>
<root>
<entry-details>ABC</entry-details>
</root>
<root>
ABC
</root>

how can I select all descendants of a certain element with ElementTree in Python 3.3?

This is the sample data.
input.xml
<root>
<entry id="1">
<headword>go</headword>
<example>I <hw>go</hw> to school.</example>
</entry>
</root>
I'd like to put the <example> node and its descendants into an <examplegrp> element. That is,
output.xml
<root>
<entry id="1">
<headword>go</headword>
<examplegrp>
<example>I <hw>go</hw> to school.</example>
</examplegrp>
</entry>
</root>
My poor and incomplete script is:
import codecs
import xml.etree.ElementTree as ET
fin = codecs.open(r'input.xml', 'rb', encoding='utf-8')
data = ET.parse(fin)
root = data.getroot()
example = root.find('.//example')
for elem in example.iter():
    pass  # ...and then I don't know what to do
Here's an example of how it can be done:
text = """
<root>
<entry id="1">
<headword>go</headword>
<example>I <hw>go</hw> to school.</example>
</entry>
</root>
"""
import io
import lxml.etree
data = lxml.etree.parse(io.StringIO(text))
root = data.getroot()
for entry in root.xpath('//example/ancestor::entry[1]'):
    examplegrp = lxml.etree.SubElement(entry, "examplegrp")
    nodes = [node for node in entry.xpath('./example')]
    for node in nodes:
        entry.remove(node)
        examplegrp.append(node)
print(lxml.etree.tostring(root, pretty_print=True, encoding='unicode'))
which will output:
<root>
<entry id="1">
<headword>go</headword>
<examplegrp><example>I <hw>go</hw> to school.</example>
</examplegrp></entry>
</root>
http://docs.python.org/3/library/xml.dom.html?highlight=xml#node-objects
http://docs.python.org/3/library/xml.dom.html?highlight=xml#document-objects
You probably want to follow the pattern of creating a new element from the document and appending each result to it.
group = document.createElement(tagName)  # document is the DOM Document instance
for found in founds:
    group.appendChild(found)
Or something like this.
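Since the question asks about ElementTree specifically, here is a sketch of the same wrapping done with the standard library only. It assumes each <example> should get its own <examplegrp> at the position where the example used to be; grouping several examples into one wrapper would need a small change.

```python
import xml.etree.ElementTree as ET

text = """<root>
<entry id="1">
<headword>go</headword>
<example>I <hw>go</hw> to school.</example>
</entry>
</root>"""
root = ET.fromstring(text)
for entry in root.iter('entry'):
    # findall() returns a list, so modifying entry inside the loop is safe
    for example in entry.findall('example'):
        index = list(entry).index(example)  # remember where the example sat
        group = ET.Element('examplegrp')
        entry.remove(example)
        entry.insert(index, group)
        group.append(example)               # descendants such as <hw> move with it
        # Hand the whitespace after the example over to the wrapper
        group.tail, example.tail = example.tail, None
output = ET.tostring(root, encoding='unicode')
print(output)
```

Appending the <example> element itself is enough to carry all of its descendants along, so there is no need to iterate over them individually.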

I want to split text of a node and then put each of them into independent element

There is an XML file like this:
sample.xml
<root>
<keyword_group>
<headword>sell/buy</headword>
</keyword_group>
</root>
I'd like to split headword.text on '/' and then wrap each part in a <word> tag. Finally, I need to remove the <headword> tag. The output I expect is:
<root>
<keyword_group>
<word>sell</word>
<word>buy</word>
</keyword_group>
</root>
My ugly script is:
import lxml.etree as ET
xml = '''\
<root>
<keyword_group>
<headword>sell/buy</headword>
</keyword_group>
</root>
'''
root = ET.fromstring(xml)
headword = root.find('.//headword')
if headword is not None:
    words = headword.text.split('/')
    for word in words:
        ET.SubElement(headword, 'word')
        for wr in headword.iter('word'):
            if not wr.text:
                wr.text = word
    headword.text = ''
print(ET.tostring(root, encoding='unicode'))
But this is too complicated, and I failed to remove headword tags.
Using lxml:
import lxml.etree as ET
xml = '''\
<root>
<keyword_group>
<headword>sell/buy</headword>
</keyword_group>
</root>
'''
root = ET.fromstring(xml)
headword = root.find('.//headword')
if headword is not None:
    words = headword.text.split('/')
    parent = headword.getparent()
    parent.remove(headword)
    for word in words:
        ET.SubElement(parent, 'word').text = word
print(ET.tostring(root, encoding='unicode'))
yields
<root>
<keyword_group>
<word>sell</word><word>buy</word></keyword_group>
</root>
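The standard library's ElementTree has no getparent(), but its limited XPath support includes '..', which can be used to find the parents of the headword elements. The sketch below relies on that; it also copies the headword's tail onto each new <word>, an optional step that keeps every word on its own line as in the expected output.

```python
import xml.etree.ElementTree as ET

xml = '''\
<root>
<keyword_group>
<headword>sell/buy</headword>
</keyword_group>
</root>
'''
root = ET.fromstring(xml)
# './/headword/..' selects every element that has a headword child
for parent in root.findall('.//headword/..'):
    for headword in parent.findall('headword'):
        index = list(parent).index(headword)
        parent.remove(headword)
        for offset, word in enumerate((headword.text or '').split('/')):
            new = ET.Element('word')
            new.text = word
            new.tail = headword.tail  # reuse the newline that followed <headword>
            parent.insert(index + offset, new)
words = [w.text for w in root.iter('word')]
print(ET.tostring(root, encoding='unicode'))
```

Inserting at the headword's original index keeps the new elements in place even when the parent has other children after it.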
