python lxml using iterparse to edit and output xml - python

I've been experimenting with the lxml library for a while, and I can't figure out how to edit a document when I reach a certain XPath and then write the result back out as XML while parsing element by element.
Say we have this xml as an example:
<xml>
<items>
<pie>cherry</pie>
<pie>apple</pie>
<pie>chocolate</pie>
</items>
</xml>
While parsing, when I hit the XPath "/xml/items/pie", I would like to wrap each pie in a new item element, so the result looks like this:
<xml>
<items>
<item id="1"><pie>cherry</pie></item>
<item id="2"><pie>apple</pie></item>
<item id="3"><pie>chocolate</pie></item>
</items>
</xml>
The output would need to be written to a file incrementally as I hit each tag and edit the XML at certain XPaths. I could print the start tag, the text, any attributes, and then the end tag by hard-coding certain parts, but that would be very messy, and it would be nice to avoid that if possible.
Here's my guess code at this:
from lxml import etree
path=[]
count=0
context=etree.iterparse(file,events=('start','end'))
for event, element in context:
    if event=='start':
        path.append(element.tag)
        if '/'+'/'.join(path)=='/xml/items/pie':
            itemnode=etree.Element('item',id=str(count))
            itemnode.text=""
            element.addprevious(itemnode)#Not the right way to do it of course
            #write/print out xml here.
    else:
        element.clear()
        path.pop()
Edit: Also, I need to run through fairly big files, so I have to use iterparse.

Here's a solution using iterparse(). The idea is to catch all tag "start" events, remember the parent (items) tag, then for every pie tag create an item tag and put the pie into it:
from io import BytesIO
from lxml import etree
from lxml.etree import Element
data = b"""<xml>
<items>
<pie>cherry</pie>
<pie>apple</pie>
<pie>chocolate</pie>
</items>
</xml>"""
stream = BytesIO(data)
context = etree.iterparse(stream, events=("start", ))
for action, elem in context:
    if elem.tag == 'items':
        # remember the parent and restart the counter
        items = elem
        index = 1
    elif elem.tag == 'pie':
        # swap the pie for a new item, then move the pie inside it
        item = Element('item', {'id': str(index)})
        items.replace(elem, item)
        item.append(elem)
        index += 1
print(etree.tostring(context.root).decode())
prints:
<xml>
<items>
<item id="1"><pie>cherry</pie></item>
<item id="2"><pie>apple</pie></item>
<item id="3"><pie>chocolate</pie></item>
</items>
</xml>
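
The loop above still keeps the whole tree in memory (it serializes context.root at the end), which may not suit the large files mentioned in the question. A minimal sketch of writing the output incrementally, assuming lxml's etree.xmlfile incremental writer is available and that the outer xml/items skeleton is known in advance (the helper name wrap_pies and the hard-coded wrapper tags are illustrative assumptions, not part of the original answer):

```python
import copy
from io import BytesIO
from lxml import etree

def wrap_pies(source, sink):
    """Stream <pie> elements from source; write each one wrapped in <item> to sink."""
    with etree.xmlfile(sink, encoding="utf-8") as xf:
        # Assumption: the outer <xml><items> skeleton is known up front.
        with xf.element("xml"):
            with xf.element("items"):
                pies = etree.iterparse(source, events=("end",), tag="pie")
                for index, (_, pie) in enumerate(pies, start=1):
                    item = etree.Element("item", id=str(index))
                    # Work on a copy so the tree still being parsed is not rearranged.
                    wrapped = copy.deepcopy(pie)
                    wrapped.tail = None
                    item.append(wrapped)
                    xf.write(item)
                    # Release the parsed element and any already-handled siblings.
                    pie.clear()
                    while pie.getprevious() is not None:
                        del pie.getparent()[0]

source = BytesIO(b"<xml><items><pie>cherry</pie><pie>apple</pie>"
                 b"<pie>chocolate</pie></items></xml>")
sink = BytesIO()
wrap_pies(source, sink)
output = sink.getvalue()
```

In real use, source and sink would be file paths or open binary files rather than BytesIO objects. Copying each element costs a little time but avoids moving nodes out of a tree that the parser is still building.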

There is a cleaner way to make the modifications you need:
1. Iterate over the pie elements.
2. Make an item element for each one.
3. Use replace() to replace the pie element with the item.
4. Append the pie to the new item.
The lxml documentation describes replace() as follows:
replace(self, old_element, new_element): Replaces a subelement with the element passed as second argument.
from lxml import etree
from lxml.etree import XMLParser, Element
data = """<xml>
<items>
<pie>cherry</pie>
<pie>apple</pie>
<pie>chocolate</pie>
</items>
</xml>"""
tree = etree.fromstring(data, parser=XMLParser())
items = tree.find('.//items')
for index, pie in enumerate(items.xpath('.//pie'), start=1):
    # swap the pie for a new item, then move the pie inside it
    item = Element('item', {'id': str(index)})
    items.replace(pie, item)
    item.append(pie)
print(etree.tostring(tree, pretty_print=True).decode())
prints:
<xml>
<items>
<item id="1"><pie>cherry</pie></item>
<item id="2"><pie>apple</pie></item>
<item id="3"><pie>chocolate</pie></item>
</items>
</xml>

I would suggest using an XSLT template, as it seems a better fit for this task. XSLT is a little tricky until you get used to it, but if all you want is to generate some output from an XML document, it is a great tool.
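
As a rough illustration of the XSLT suggestion, here is a sketch that runs a stylesheet through lxml's etree.XSLT. The stylesheet is an assumption written for this example (an identity template plus one template that wraps each pie), not something from the original answer. Note that XSLT works on a fully loaded document, so on its own it does not address the memory concern for big files.

```python
from lxml import etree

# Hypothetical stylesheet: copy everything, but wrap each <pie> under
# <items> in an <item> whose id is its position among the pie siblings.
STYLESHEET = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <xsl:template match="items/pie">
    <item id="{count(preceding-sibling::pie) + 1}">
      <xsl:copy-of select="."/>
    </item>
  </xsl:template>
</xsl:stylesheet>"""

transform = etree.XSLT(etree.XML(STYLESHEET))
doc = etree.XML(b"<xml><items><pie>cherry</pie><pie>apple</pie>"
                b"<pie>chocolate</pie></items></xml>")
result = transform(doc)

ids = result.xpath("/xml/items/item/@id")
pies = result.xpath("/xml/items/item/pie/text()")
print(str(result))
```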

Related

lxml: Do not parse subtree but treat as binary content

I am working on XML content that contains elements which may hold potentially malformed XML/markup-like (e.g. HTML) content as text. For example:
<root>
<data>
<x>foo<y>bar</y>
</data>
<data>
<z>foo<y>bar</y>
</data>
</root>
Goal: I want lxml.etree to not attempt to parse anything under data-elements as XML but rather simply return it as bytes or str (can be in elem.text).
The files are big, and I wanted to use lxml.etree.iterparse to extract the contents found in data elements.
Initial Idea: A straightforward way to just get the contents of the element (in this case containing the data start- and end-tags) could be:
from io import BytesIO
data = BytesIO(b"""
<root>
<data>
<x>foo<y>bar</y>
</data>
<data>
<z>foo<y>bar</y>
</data>
</root>
""")
from lxml import etree
# see below why html=True
context = etree.iterparse(data, events=("end",), tag=("data",), html=True)
contents = [] # I don't keep lists in the "real" application
for event, elem in context:
    contents.append(etree.tostring(elem)) # get back the full content underneath data
The problem with this is that lxml.etree can run into issues parsing the children of data (for example, I already had to use html=True to avoid errors when HTML is stored under data). I know that lxml has custom element classes, but as I understand the documentation, they do not change the parsing behaviour of lxml.etree, which is dictated by libxml2.
Is there an easy way to tell lxml not to parse element content as children? The application itself benefits from other lxml functionality, which I would have to replicate if I wrote a custom extractor for data alone.
Or could there be a way to use XSLT to first transform the input for processing in lxml and later link the data back?
Does this work as expected?
The XML is modified by adding DTD and CDATA to specify that the content inside the data element has to be treated as character data.
import io
data = io.BytesIO(b'''<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE root [
<!ELEMENT root (data+)>
<!ELEMENT data (#PCDATA)>
]>
<root>
<data>
<![CDATA[
<x>foo<y>bar</y>
]]>
</data>
<data>
<![CDATA[
<z>foo<y>bar</y>
]]>
</data>
</root>
''')
from lxml import etree
context = etree.iterparse(data, events=("end",), tag=("data",), dtd_validation=True, load_dtd=True)
contents = [] # I don't keep lists in the "real" application
for event, elem in context:
    contents.append(etree.tostring(elem)) # get back the full content underneath data
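
With CDATA wrapping like the above, the raw markup can be read from elem.text, whereas etree.tostring(elem) would return it with the angle brackets escaped. A small sketch, assuming the input can be produced with CDATA sections in the first place (the compact sample data here is made up for the illustration, and the DTD is left out for brevity):

```python
from io import BytesIO
from lxml import etree

data = BytesIO(b"""<root>
<data><![CDATA[<x>foo<y>bar</y>]]></data>
<data><![CDATA[<z>foo<y>bar</y>]]></data>
</root>""")

contents = []
for event, elem in etree.iterparse(data, events=("end",), tag="data"):
    # The CDATA payload arrives as plain text; it is never parsed as child elements.
    contents.append(elem.text)
    elem.clear()  # release the element once its text has been captured
```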

How to get specific block(group) based on child node's value in XPath from the XML?

I am newbie for XPath. I have the following XML file.
Here my xml file:
<?xml version='1.0' encoding='utf-8'?>
<items>
<item>
<country>India</country>
<referenceId>IN375TP</referenceId>
<price>400</price>
</item>
<item>
<country>Australia</country>
<referenceId>AU120ED</referenceId>
<price>15</price>
</item>
<item>
<country>United Kingdom</country>
<referenceId>UK862RB</referenceId>
<price>20</price>
</item>
</items>
I want the following <item> tag as an output:
<item>
<country>Australia</country>
<referenceId>AU120ED</referenceId>
<price>15</price>
</item>
Note: Please use condition like /items/item[referenceId/text()="AU120ED"]
If you want to find the item by country, you can use an XPath predicate that selects the item whose country child has the given text:
from lxml.etree import parse, HTMLParser
xml = parse("check.xml",HTMLParser())
print(xml.find("//items//item[country='Australia']"))
<Element item at 0x7f40faa28950>
If you actually want to search by referenceId, just change the predicate to item[referenceid='AU120ED'] (the HTML parser lowercases element names, hence referenceid):
print(xml.find("//items//item[referenceid='AU120ED']"))
<Element item at 0x7f02c0c24998>
For xml:
from xml.etree import ElementTree as et
xml = et.parse("check.xml")
print(xml.find(".").find("./item[referenceId='AU120ED']"))
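
For the exact condition given in the question, a sketch using lxml's XML parser and xpath() could look like the following. The document is inlined here instead of being read from check.xml, purely so the example is self-contained.

```python
from lxml import etree

doc = etree.fromstring(b"""<?xml version='1.0' encoding='utf-8'?>
<items>
  <item><country>India</country><referenceId>IN375TP</referenceId><price>400</price></item>
  <item><country>Australia</country><referenceId>AU120ED</referenceId><price>15</price></item>
  <item><country>United Kingdom</country><referenceId>UK862RB</referenceId><price>20</price></item>
</items>""")

# The XML parser keeps the original case of element names, so referenceId matches as written.
matches = doc.xpath('/items/item[referenceId/text()="AU120ED"]')
for item in matches:
    print(etree.tostring(item, pretty_print=True).decode())
```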

Namespacing inside tags with Python lxml's SubElement

I'm using Python and the lxml library to produce an XML file that I want to look like this:
<item>
<a:text>hello</a:text>
</item>
However, I can't manage to produce this, I've tried the following code:
import lxml.etree as etree
item = etree.Element('item')
el = etree.SubElement(item, 'text', nsmap={'a': 'http://example.com/'})
But then I end up with:
<item>
<text xmlns="http://example.com/">hello</text>
</item>
I also tried this after some inspiration from the lxml namespaces documentation (http://lxml.de/tutorial.html#namespaces):
import lxml.etree as etree
item = etree.Element('item')
el = etree.SubElement(item, '{a}text')
But that gives me:
<item>
<ns1:text xmlns:ns1="a">hello</ns1:text>
</item>
Is there any way to get the XML format I need with lxml ?
The first thing to note is that this...
<item>
<a:text>hello</a:text>
</item>
...is not valid XML. a: is a namespace prefix, but somewhere you have to map it to an actual namespace, as in:
<item xmlns:a="http://example.com/">
<a:text>hello</a:text>
</item>
As you read in the lxml documentation, you can use the {namespace}element syntax to specify a namespace...but this uses an actual namespace, not a namespace prefix (which is why your second example did not work as expected).
You can get what I think you want like this:
>>> from lxml import etree
>>> item = etree.Element('item', nsmap={'a': 'http://example.com/'})
>>> e1 = etree.SubElement(item, '{http://example.com/}text')
Which gives you:
>>> print(etree.tostring(item, pretty_print=True).decode())
<item xmlns:a="http://example.com/">
<a:text/>
</item>
It's also worth noting that, from the perspective of XML, a prefixed element like the one above is equivalent to one that declares the same namespace as its default:
<item>
<text xmlns="http://example.com/">hello</text>
</item>
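
Putting the pieces together, a short sketch that also sets the text content so the result matches what the question asked for (the constant NS and the variable names are only illustrative):

```python
from lxml import etree

NS = "http://example.com/"

# Declare the prefix once on the parent; children in that namespace reuse it.
item = etree.Element("item", nsmap={"a": NS})
text = etree.SubElement(item, "{%s}text" % NS)
text.text = "hello"

xml = etree.tostring(item, pretty_print=True).decode()
print(xml)
```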

how to parse the second xml tree in a file

Suppose I have an XML file like
<?xml version="1.0" encoding="utf-8"?>
<items>
<?xml version="1.0" encoding="utf-8"?>
<items>
<item>
<price>1500</price>
<info> asfgfdff</info>
</item>
</items>
How do I parse so that the parser selects the recently updated xml tree?
import re
with open('file','r') as f:
    newestXml = []
    for line in f.readlines():
        # a new XML declaration starts a new document: drop what came before
        if re.search(r'^<\?xml',line):
            newestXml = [line]
        else:
            newestXml.append(line)
At the end of the loop, newestXml will contain all the lines from the last occurrence of <?xml to the end of the file.
Now you can combine the lines and use the xml parser to parse the xml.
Note - I can't check this code now, so it may contain small mistakes, but I hope the idea will help you.
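
A self-contained sketch of the same idea, including the final parsing step. It assumes each XML declaration starts at the beginning of a line and that the last document in the file is complete; the function name parse_last_document and the inlined sample lines are hypothetical.

```python
import re
from xml.etree import ElementTree as ET

def parse_last_document(lines):
    """Keep only the lines from the last '<?xml' declaration onward and parse them."""
    newest = []
    for line in lines:
        if re.match(r"<\?xml", line):
            newest = [line]      # a new declaration starts a new document
        else:
            newest.append(line)
    # Encode to bytes so the parser honours the declared encoding itself.
    return ET.fromstring("".join(newest).encode("utf-8"))

sample = [
    '<?xml version="1.0" encoding="utf-8"?>\n',
    "<items>\n",
    '<?xml version="1.0" encoding="utf-8"?>\n',
    "<items>\n",
    "<item><price>1500</price><info> asfgfdff</info></item>\n",
    "</items>\n",
]
root = parse_last_document(sample)
```

With a real file, the lines could come straight from iterating over the open file object.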

Python Minidom : Change Value of Node

I'm using Python's minidom library to try and manipulate some XML files. Here is an example file :
<document>
<item>
<link>http://www.this-is-a-url.com/</link>
<description>This is some information!</description>
</item>
<item>
<link>http://www.this-is-a-url.com/</link>
<description>This is some information!</description>
</item>
<item>
<link>http://www.this-is-a-url.com/</link>
<description>This is some information!</description>
</item>
</document>
What I need to do, is take the value in "description" and put it into "link" so both say "This is some information!". I've tried to do it like so:
#!/usr/bin/python
from xml.dom.minidom import parse
xmlData = parse("file.xml")
itmNode = xmlData.getElementsByTagName("item")
for n in itmNode:
    n.childNodes[1] = n.childNodes[3]
    n.childNodes[1].tagName = "link"
print(xmlData.toxml())
However "n.childNodes[1] = n.childNodes[3]" seems to link the two nodes together, so when I do "n.childNodes[1].tagName = "link"" to correct the name BOTH child nodes become "link" where before they were both "description".
Furthermore, if I use "n.childNodes[1].nodeValue" the changes don't work and the XML is printed in its original form. What am I doing wrong?
I'm not sure you can modify the DOM in place with xml.dom.minidom (creating the whole document from scratch with new values should work though).
Anyway, if you accept a solution based on xml.etree.ElementTree (I strongly recommend using it since it provides a friendlier interface), then you could use the following code:
from xml.etree.ElementTree import ElementTree, dump
tree = ElementTree()
tree.parse('file.xml')
items = tree.findall('item')
for item in items:
    link, description = list(item)
    link.text = description.text
dump(tree)
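
For completeness, an in-place change does seem possible with minidom as well. In the DOM, an element's nodeValue is None and its text lives in a child text node, which would explain why setting nodeValue on the element had no visible effect. A minimal sketch, assuming every link and description holds exactly one text node (the document is inlined and shortened here for the example):

```python
from xml.dom.minidom import parseString

doc = parseString(
    "<document><item>"
    "<link>http://www.this-is-a-url.com/</link>"
    "<description>This is some information!</description>"
    "</item></document>"
)

for item in doc.getElementsByTagName("item"):
    link = item.getElementsByTagName("link")[0]
    description = item.getElementsByTagName("description")[0]
    # Copy the text node's data rather than re-assigning the element nodes.
    link.firstChild.data = description.firstChild.data

result = doc.toxml()
print(result)
```

Looking elements up by tag name also avoids relying on childNodes indexes, which shift depending on the whitespace text nodes between elements.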
