I have the following example XML tree:
<main>
<section>
<list key="capital" value="sydney">
<items>
<item id="abc-123"></item>
<item id="abc-345"></item>
</items>
</list>
<list key="capital" value="tokyo">
<items>
<item id="def-678"></item>
<item id="def-901"></item>
</items>
</list>
</section>
</main>
Do you know how to run a query that will extract the "items" node under "list" with key="capital" and value="tokyo" (which should extract item nodes with id="def-678" and id="def-901")?
Thanks so much for your help!
You can use an XPath expression that xml.etree supports (see the documentation) via the find() or findall() method:
from xml.etree import ElementTree as ET
raw = '''your xml string here'''
root = ET.fromstring(raw)
result = root.findall(".//list[@key='capital'][@value='tokyo']/items/item")
Console test output:
>>> for r in result:
...     print(ET.tostring(r, encoding="unicode").strip())
...
<item id="def-678" />
<item id="def-901" />
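For reference, a self-contained Python 3 version of the above might look like the sketch below. It embeds the question's XML and collects the id attributes instead of printing serialized elements; the variable names are only illustrative.

```python
from xml.etree import ElementTree as ET

raw = """<main>
  <section>
    <list key="capital" value="sydney">
      <items>
        <item id="abc-123"></item>
        <item id="abc-345"></item>
      </items>
    </list>
    <list key="capital" value="tokyo">
      <items>
        <item id="def-678"></item>
        <item id="def-901"></item>
      </items>
    </list>
  </section>
</main>"""

root = ET.fromstring(raw)

# Two chained predicates: both attribute tests must hold for the same <list>
items = root.findall(".//list[@key='capital'][@value='tokyo']/items/item")
ids = [item.get("id") for item in items]
print(ids)
```

Dropping the trailing /item from the path would return the single items element instead of its children.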
I have a CSV file with a list of numbers with gaps in them, like:
0001
0002
0003
0005
0007
etc.
And I have an XML file with nodes with identifiers with a list of numbers without gaps, like:
<?xml version='1.0' encoding='utf-8'?>
<root>
<item>
<unitd>0001</unitd>
<unittitle>description of item 1</unittitle>
</item>
<item>
<unitd>0002</unitd>
<unittitle>description of item 2</unittitle>
</item>
<item>
<unitd>0003</unitd>
<unittitle>description of item 3</unittitle>
</item>
<item>
<unitd>0004</unitd>
<unittitle>description of item 4</unittitle>
</item>
<item>
<unitd>0005</unitd>
<unittitle>description of item 5</unittitle>
</item>
<item>
<unitd>0006</unitd>
<unittitle>description of item 6</unittitle>
</item>
<item>
<unitd>0007</unitd>
<unittitle>description of item 7</unittitle>
</item>
</root> <!-- added by edit -->
I want to add an extra element to the items of the XML file that have identifiers that can be found in the CSV file, like this:
<root>
<item>
<unitd>0001</unitd>
<unittitle>description of item 1</unittitle>
<link>link to extra info on item 1</link>
</item>
<item>
<unitd>0002</unitd>
<unittitle>description of item 2</unittitle>
<link>link to extra info on item 2</link>
</item>
<item>
<unitd>0003</unitd>
<unittitle>description of item 3</unittitle>
<link>link to extra info on item 3</link>
</item>
<item>
<unitd>0004</unitd>
<unittitle>description of item 4</unittitle>
</item>
<item>
<unitd>0005</unitd>
<unittitle>description of item 5</unittitle>
<link>link to extra info on item 5</link>
</item>
<item>
<unitd>0006</unitd>
<unittitle>description of item 6</unittitle>
</item>
<item>
<unitd>0007</unitd>
<unittitle>description of item 7</unittitle>
<link>link to extra info on item 7</link>
</item>
</root>
Can I do this with Python, and if so, how? Or is there a smarter way to handle this?
The smartest way to handle an XML-to-XML transformation is XSLT, which was designed for this exact purpose.
To transform your source XML into your desired destination XML, you can use this XSLT 1.0 stylesheet (named trans.xslt):
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" encoding="UTF-8"/>
<xsl:variable name="additional" select="document('a_link.xml')/LinkMapping" /> <!-- name of helper XML file -->
<!-- identity template -->
<xsl:template match="node()|@*" >
<xsl:copy>
<xsl:apply-templates select="node()|@*" />
</xsl:copy>
</xsl:template>
<!-- item transform template -->
<xsl:template match="item" >
<xsl:copy>
<xsl:copy-of select="node()|@*" />
<xsl:if test="$additional/map[@id=current()/unitd]">
<link>
<xsl:value-of select="$additional/map[@id=current()/unitd]/text()" />
</link>
</xsl:if>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
This template requires an additional XML file, a_link.xml, that maps each identifier from the CSV file to its link. Your example CSV does not show any such mapping, but converting the CSV to something like the format below should be no problem.
<LinkMapping>
<map id="0001">link to extra info on item 1</map>
<map id="0002">link to extra info on item 2</map>
<map id="0003">link to extra info on item 3</map>
<map id="0005">link to extra info on item 5</map>
<map id="0007">link to extra info on item 7</map>
</LinkMapping>
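One way to produce that helper file from the CSV is sketched below with the standard library only. The inline CSV text stands in for the real file, and link_for() is a hypothetical placeholder: the question does not say where the link text comes from, so replace it with whatever lookup applies.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Stand-in for the CSV file: one identifier per line (gaps at 0004 and 0006)
csv_text = "0001\n0002\n0003\n0005\n0007\n"


def link_for(unit_id):
    # Hypothetical: mirrors the placeholder text used in the question
    return f"link to extra info on item {int(unit_id)}"


mapping = ET.Element("LinkMapping")
for row in csv.reader(io.StringIO(csv_text)):
    if not row:
        continue  # skip blank lines
    unit_id = row[0].strip()
    entry = ET.SubElement(mapping, "map", id=unit_id)
    entry.text = link_for(unit_id)

mapping_xml = ET.tostring(mapping, encoding="unicode")
print(mapping_xml)

# To write the file the stylesheet expects:
# ET.ElementTree(mapping).write("a_link.xml", encoding="utf-8")
```

Keeping the identifiers as strings preserves the leading zeros, which matters because the stylesheet compares them with the text of unitd.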
The output of applying the above XSLT with the XML helper file is as desired.
So to use this with Python, you can refer to this SO answer which explains how to transform an XML file with XSLT.
Assuming that your XML file is named input.xml the code could look like this:
import lxml.etree as ET
dom = ET.parse("input.xml")
xslt = ET.parse("trans.xslt")
transform = ET.XSLT(xslt)
newdom = transform(dom)
print(ET.tostring(newdom, pretty_print=True).decode())
Now you should have gotten your desired result.
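If you would rather stay in Python without XSLT, a rough sketch using only xml.etree is shown below. It assumes the CSV identifiers have already been read into a set (ids_in_csv here is a hard-coded stand-in) and reuses the question's placeholder link text; both are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

source = """<root>
<item><unitd>0001</unitd><unittitle>description of item 1</unittitle></item>
<item><unitd>0004</unitd><unittitle>description of item 4</unittitle></item>
<item><unitd>0005</unitd><unittitle>description of item 5</unittitle></item>
</root>"""

# Stand-in for the identifiers read from the CSV file
ids_in_csv = {"0001", "0002", "0003", "0005", "0007"}

root = ET.fromstring(source)
for item in root.findall("item"):
    unit_id = item.findtext("unitd")
    if unit_id in ids_in_csv:
        # Append <link> as the last child of the matching <item>
        link = ET.SubElement(item, "link")
        link.text = f"link to extra info on item {int(unit_id)}"

linked = [item.findtext("unitd") for item in root.findall("item")
          if item.find("link") is not None]
print(linked)
print(ET.tostring(root, encoding="unicode"))
```

For a real file, ET.parse("input.xml") followed by tree.write(...) would replace the inline string.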
I am new to XPath. I have the following XML file:
<?xml version='1.0' encoding='utf-8'?>
<items>
<item>
<country>India</country>
<referenceId>IN375TP</referenceId>
<price>400</price>
</item>
<item>
<country>Australia</country>
<referenceId>AU120ED</referenceId>
<price>15</price>
</item>
<item>
<country>United Kingdom</country>
<referenceId>UK862RB</referenceId>
<price>20</price>
</item>
</items>
I want the following <item> tag as an output:
<item>
<country>Australia</country>
<referenceId>AU120ED</referenceId>
<price>15</price>
</item>
Note: Please use a condition like /items/item[referenceId/text()="AU120ED"]
If you want to find the item by country, you can use an XPath predicate that selects the item whose country child has the given text:
from lxml.etree import parse, HTMLParser
xml = parse("check.xml",HTMLParser())
print(xml.find("//items//item[country='Australia']"))
<Element item at 0x7f40faa28950>
If you actually want to search by referenceId, just change the predicate to item[referenceid='AU120ED'] (the HTML parser lowercases tag names, hence referenceid):
print(xml.find("//items//item[referenceid='AU120ED']"))
<Element item at 0x7f02c0c24998>
With the standard library's xml.etree, which keeps the original tag case:
from xml.etree import ElementTree as et
xml = et.parse("check.xml")
print(xml.find("./item[referenceId='AU120ED']"))
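To see the matched element itself rather than its repr, a small self-contained sketch (the XML is inlined here instead of being read from check.xml) could be:

```python
from xml.etree import ElementTree as ET

xml_text = """<items>
  <item>
    <country>India</country>
    <referenceId>IN375TP</referenceId>
    <price>400</price>
  </item>
  <item>
    <country>Australia</country>
    <referenceId>AU120ED</referenceId>
    <price>15</price>
  </item>
  <item>
    <country>United Kingdom</country>
    <referenceId>UK862RB</referenceId>
    <price>20</price>
  </item>
</items>"""

root = ET.fromstring(xml_text)

# [referenceId='AU120ED'] keeps the <item> whose referenceId child has that text
match = root.find("./item[referenceId='AU120ED']")
country = match.findtext("country")
print(country)
print(ET.tostring(match, encoding="unicode"))
```

As far as I know, ElementTree's limited XPath subset does not accept the referenceId/text()="AU120ED" spelling from the question; the child-text predicate above is its equivalent, while lxml's xpath() method should accept the full expression.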
I'm using Python and the lxml library to produce an XML file that I want to look like this:
<item>
<a:text>hello</a:text>
</item>
However, I can't manage to produce this. I've tried the following code:
import lxml.etree as etree
item = etree.Element('item')
el = etree.SubElement(item, 'text', nsmap={'a': 'http://example.com/'})
But then I end up with:
<item>
<text xmlns="http://example.com/">hello</text>
</item>
I also tried this after some inspiration from the lxml namespaces documentation (http://lxml.de/tutorial.html#namespaces):
import lxml.etree as etree
item = etree.Element('item')
el = etree.SubElement(item, '{a}text')
But that gives me:
<item>
<ns1:text xmlns:ns1="a">hello</ns1:text>
</item>
Is there any way to get the XML format I need with lxml?
The first thing to note is that this...
<item>
<a:text>hello</a:text>
</item>
...is not namespace-well-formed XML. a: is a namespace prefix, and somewhere you have to bind it to an actual namespace URI, as in:
<item xmlns:a="http://example.com/">
<a:text>hello</a:text>
</item>
As you read in the lxml documentation, you can use the {namespace}element syntax to specify a namespace...but this uses an actual namespace, not a namespace prefix (which is why your second example did not work as expected).
You can get what I think you want like this:
>>> from lxml import etree
>>> item = etree.Element('item', nsmap={'a': 'http://example.com/'})
>>> e1 = etree.SubElement(item, '{http://example.com/}text')
>>> e1.text = 'hello'
Which gives you:
>>> print(etree.tostring(item, pretty_print=True).decode())
<item xmlns:a="http://example.com/">
  <a:text>hello</a:text>
</item>
It's also worth noting that from the perspective of XML, the above is exactly equivalent to:
<item>
<text xmlns="http://example.com/">hello</text>
</item>
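That equivalence can be checked with a short sketch. It uses the standard library's xml.etree rather than lxml, purely so that it runs without extra dependencies: once parsed, both spellings yield the same namespace-qualified tag names.

```python
import xml.etree.ElementTree as ET

# Same namespace, declared once with a prefix and once as a default namespace
prefixed = '<item xmlns:a="http://example.com/"><a:text>hello</a:text></item>'
defaulted = '<item><text xmlns="http://example.com/">hello</text></item>'

tags_prefixed = [el.tag for el in ET.fromstring(prefixed).iter()]
tags_defaulted = [el.tag for el in ET.fromstring(defaulted).iter()]

print(tags_prefixed)
print(tags_prefixed == tags_defaulted)
```

The prefix is only a serialization detail; what a parser sees is the {namespace}localname pair.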
I have some SGML that looks like this
<!DOCTYPE sometype>
<ITEM>
<DATE>19-OCT-1987</DATE>
<TEXT>
<TITLE>I AM THE TITLE</TITLE>
<AUTHOR>I AM THE AUTHOR</AUTHOR>
<DATELINE>WHEN I WAS CREATED</DATELINE><BODY>
I WANT TO PRESERVE THIS TAG!
</BODY></TEXT>
</ITEM>
<ITEM>...
I tried to parse it with lxml.html, but it appears to strip the BODY tags, which I need to preserve. Next I tried lxml.etree, but as you can see there is no common parent element for all the ITEM tags. This is the code I'm currently using:
doc = """<!DOCTYPE sometype>
<ITEM>
<DATE>19-OCT-1987</DATE>
<TEXT>
<TITLE>I AM THE TITLE</TITLE>
<AUTHOR>I AM THE AUTHOR</AUTHOR>
<DATELINE>WHEN I WAS CREATED</DATELINE><BODY>
I WANT TO PRESERVE THIS TAG!
</BODY></TEXT>
</ITEM>"""
from lxml import etree
parser = etree.XMLParser(recover=True) # I have invalid HTML chars to ignore
sgml = etree.fromstring(doc, parser)
Now sgml is only the first ITEM element. I need it to be all of the ITEM elements. Any ideas? lxml.html does what I want, but it strips the BODY tags by default, and I haven't found a way to disable this behavior.
There isn't a common parent element? Just make one!
You can just rewrite them to have a parent element, say ROOT. Insert <ROOT> before the first <ITEM> and </ROOT> at the end of the document. It's pretty trivial to do programmatically, even if you have to preserve the actual on-disk content.
For example:
<!DOCTYPE sometype>
<ROOT>
<ITEM>
<DATE>19-OCT-1987</DATE>
<TEXT>
<TITLE>I AM THE TITLE</TITLE>
<AUTHOR>I AM THE AUTHOR</AUTHOR>
<DATELINE>WHEN I WAS CREATED</DATELINE><BODY>
I WANT TO PRESERVE THIS TAG!
</BODY></TEXT>
</ITEM>
<ITEM>
<DATE>19-OCT-1879</DATE>
<TEXT>
<TITLE>I AM THE TITLE</TITLE>
<AUTHOR>I AM THE AUTHOR</AUTHOR>
<DATELINE>WHEN I WAS CREATED</DATELINE><BODY>
I WANT TO PRESERVE THIS TAG!
</BODY></TEXT>
</ITEM>
<ITEM>
<DATE>19-OCT-9871</DATE>
<TEXT>
<TITLE>I AM THE TITLE</TITLE>
<AUTHOR>I AM THE AUTHOR</AUTHOR>
<DATELINE>WHEN I WAS CREATED</DATELINE><BODY>
I WANT TO PRESERVE THIS TAG!
</BODY></TEXT>
</ITEM>
</ROOT>
I've just tried this and it seems to do what you want. I saved the document as /tmp/goodfoo, read it into allcontent, and parsed it with b = lxml.etree.fromstring(allcontent). I then accessed the text you want to preserve like this: b.getchildren()[0].getchildren()[-1].getchildren()[-1].text
(that is, get the first ITEM, get its TEXT element, get the TEXT element's BODY element, and return any text content of the BODY element.)
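Doing the wrapping programmatically might look like the sketch below. It uses the standard library parser and so assumes the content is otherwise well-formed; for input with invalid characters, as in the question, the lxml parser with recover=True would take its place. The helper name wrap_items and the shortened sample are illustrative only.

```python
import xml.etree.ElementTree as ET


def wrap_items(raw, root_tag="ROOT"):
    """Insert <ROOT> before the first <ITEM> and </ROOT> at the very end."""
    start = raw.find("<ITEM>")
    if start == -1:
        raise ValueError("no <ITEM> element found")
    # Anything before the first <ITEM> (here the DOCTYPE) stays in front
    return f"{raw[:start]}<{root_tag}>{raw[start:]}</{root_tag}>"


raw = """<!DOCTYPE sometype>
<ITEM>
<DATE>19-OCT-1987</DATE>
<TEXT>
<TITLE>I AM THE TITLE</TITLE>
<DATELINE>WHEN I WAS CREATED</DATELINE><BODY>
I WANT TO PRESERVE THIS TAG!
</BODY></TEXT>
</ITEM>
<ITEM>
<DATE>19-OCT-1879</DATE>
<TEXT>
<TITLE>I AM THE TITLE</TITLE>
<DATELINE>WHEN I WAS CREATED</DATELINE><BODY>
I WANT TO PRESERVE THIS TAG!
</BODY></TEXT>
</ITEM>"""

root = ET.fromstring(wrap_items(raw))
bodies = [item.find("TEXT/BODY").text.strip() for item in root.findall("ITEM")]
print(len(root.findall("ITEM")), bodies)
```

Looking elements up by name (TEXT/BODY) is less fragile than chaining positional getchildren() calls if the element order ever changes.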
I can read tags, except when there is a prefix. I'm not having luck searching SO for a previous question.
I need to read media:content. I tried image = node.find("media:content").
RSS input:
<channel>
<title>Popular Photography in the last 1 week</title>
<item>
<title>foo</title>
<media:category label="Miscellaneous">photography/misc</media:category>
<media:content url="http://foo.com/1.jpg" height="375" width="500" medium="image"/>
</item>
<item> ... </item>
</channel>
I can read the sibling title tag:
from xml.etree import ElementTree
with open('cache1.rss', 'rt') as f:
    tree = ElementTree.parse(f)
for node in tree.findall('.//channel/item'):
    title = node.find("title").text
I've been using the docs, but I'm stuck on the 'prefix' part.
Here's an example of using XML namespaces with ElementTree:
>>> x = '''\
<channel xmlns:media="http://www.w3.org/TR/html4/">
<title>Popular Photography in the last 1 week</title>
<item>
<title>foo</title>
<media:category label="Miscellaneous">photography/misc</media:category>
<media:content url="http://foo.com/1.jpg" height="375" width="500" medium="image"/>
</item>
<item> ... </item>
</channel>
'''
>>> from xml.etree import ElementTree
>>> node = ElementTree.fromstring(x)
>>> for elem in node.findall('item/{http://www.w3.org/TR/html4/}category'):
...     print(elem.text)
...
photography/misc
media is an XML namespace prefix; it has to be declared somewhere earlier in the document with xmlns:media="...". See http://lxml.de/xpathxslt.html#namespaces-and-prefixes for how to define XML namespaces for use in XPath expressions in lxml.
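To read media:content itself with the standard library, you can pass a prefix-to-URI mapping to find(). The sketch below assumes the feed binds media to the Media RSS namespace http://search.yahoo.com/mrss/; that is an assumption, since the question does not show the declaration, so substitute whatever URI your feed's xmlns:media attribute actually contains. The second item is made up to show the missing-element case.

```python
import xml.etree.ElementTree as ET

rss = """<channel xmlns:media="http://search.yahoo.com/mrss/">
  <title>Popular Photography in the last 1 week</title>
  <item>
    <title>foo</title>
    <media:category label="Miscellaneous">photography/misc</media:category>
    <media:content url="http://foo.com/1.jpg" height="375" width="500" medium="image"/>
  </item>
  <item>
    <title>bar</title>
  </item>
</channel>"""

# Assumed namespace URI; the prefix used here only has to match the key in ns
ns = {"media": "http://search.yahoo.com/mrss/"}

root = ET.fromstring(rss)
urls = []
for item in root.findall("item"):
    content = item.find("media:content", ns)
    if content is not None:  # some items may carry no media:content
        urls.append(content.get("url"))

print(urls)
```

The mapping is resolved to the {uri}name form internally, so this is equivalent to writing find("{http://search.yahoo.com/mrss/}content").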