XML parsing with ElementTree produces wrong output - python

I want to parse an XML file with ElementTree, but at a certain tag the output is wrong:
<descriptions>
<description descriptionType="Abstract">Some Abstract Text
</description>
</descriptions>
So I parse it with the XML function
import xml.etree.ElementTree as ElementTree
root = ElementTree.XML(my_xml)
root.getchildren()[0].items()
and the outcome is:
Out: [('descriptionType', 'Abstract')]
Is there a problem with the XML, am I using ElementTree the wrong way, or is it a bug?

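I guess you want the element's text, so use
root.getchildren()[0].text
not
root.getchildren()[0].items()
The items() method returns only the attributes as (name, value) pairs, which is why you see [('descriptionType', 'Abstract')]. Note that getchildren() was removed in Python 3.9; root[0] or list(root)[0] does the same thing.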

It turns out that when an element has no child tags, its content is stored in the text attribute.
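A minimal sketch of the difference, reusing the XML from the question. Indexing with root[0] is used here instead of getchildren() so that it also runs on current Python versions; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

my_xml = """<descriptions>
<description descriptionType="Abstract">Some Abstract Text
</description>
</descriptions>"""

root = ET.fromstring(my_xml)
description = root[0]                # first child element of <descriptions>

attributes = description.items()     # attribute (name, value) pairs only
text = description.text.strip()      # character data inside the element

print(attributes)
print(text)
```

The strip() call only removes the trailing line break that sits before the closing tag in the source.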

Related

how to use fromstring for xml parsing by ElementTree using python?

xml code is this
<foo>
<bar key="value">text</bar>
</foo>
Python code is:
import xml.etree.ElementTree as ET
xml=ET.fromstring(contents)
xml.find('./bar').attrib['key']
Output: 'value'
What must be placed in contents in the above Python code to get that value as output?
If I write just contents, I get an error that contents is not defined.
It works if the XML is provided as a triple-quoted string, which lets you include unescaped quotes and line breaks inside the string.
import xml.etree.ElementTree as ET
contents = """
<foo>
<bar key="value">text</bar>
</foo>"""
xml = ET.fromstring(contents)
print(xml.find('./bar').attrib['key'])  # prints: value

Remove ns0 from XML

I have an XML file in which I would like to edit certain attributes. I can edit the attributes properly, but when I write the changes to the file, the tags have a strange "ns0" prefix added to them. How can I get rid of it? Below is what I have tried without success. I am working in Python and using lxml.
import xml.etree.ElementTree as ET
from xml.etree import ElementTree as etree
from lxml import etree, objectify
frag_xml_tree = ET.parse(xml_name)
frag_root = frag_xml_tree.getroot()
for e in frag_root:
    for elem in frag_root.iter(e):
        elem.attrib[frag_param_name] = update_val
etree.register_namespace("", "http://www.w3.org/2001")
frag_xml_tree.write(xml_name)
However, when I do this, I only get the error Invalid tag name u''. I thought this error came up if the xml tags started with digits but that is not the case with my xml. I am really stuck on how to proceed. Thanks
Actually, the way to do it seemed to be a combination of two things:
1. Import only the standard library module: import xml.etree.ElementTree as ET
2. Call ET.register_namespace("", NAMESPACE), where NAMESPACE is the namespace listed in the input XML, i.e. the URL after xmlns.
Here's the corrected code using only xml.etree.ElementTree instead of lxml:
import xml.etree.ElementTree as ET
frag_xml_tree = ET.parse(xml_name)
frag_root = frag_xml_tree.getroot()
for e in frag_root:
    for elem in frag_root.iter(e):
        elem.attrib[frag_param_name] = update_val
ET.register_namespace("", "http://www.w3.org/2001")
frag_xml_tree.write(xml_name)
The following snippet removes ns0 throughout the XML file, where NAMESPACE is the URL after xmlns in the input:
for i in range(0, len(list(root))):
    print(root[i])
ET.register_namespace("", NAMESPACE)
tree.write('TP_updated2.xml', xml_declaration=True, method='xml', encoding="utf8", default_namespace=None)
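To see the effect of register_namespace in isolation, here is a small sketch that serializes the same tree before and after the call. The namespace URL and the tiny document are made up for the illustration; substitute the URL after xmlns in your own file.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace and document, for illustration only.
NAMESPACE = "http://example.com/ns"
source = '<root xmlns="http://example.com/ns"><item id="1"/></root>'

root = ET.fromstring(source)

# Without a registered prefix, ElementTree invents ns0 when serializing.
before = ET.tostring(root, encoding="unicode")

# Map the namespace to the empty prefix so it is written as the default xmlns.
ET.register_namespace("", NAMESPACE)
after = ET.tostring(root, encoding="unicode")

print(before)
print(after)
```

The registration is global to the process and only affects serialization, so it can be called any time before write() or tostring().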

extracting result text from xml using Python

I have the following xml :
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns3:result xmlns:ns2="http://ws.def.com/">
<ns3:value>QWESW12323D2412123S</ns3:value>
</ns3:result>
and I want to parse it with Python and extract this text. I tried the following:
from xml.etree import ElementTree as etree
xml = etree.fromstring(data)
item = xml.find('ns3:value')
print(item)
but I get an empty item. Could someone help me achieve this with Python?
Use the syntax '{namespace-uri}value' to apply the namespace; the full URI goes inside the braces, not the prefix. However, as ns3 isn't defined (only ns2 is declared), I don't think this is actually valid XML.
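A sketch of both lookup styles, assuming the document is corrected so that the ns3 prefix is actually declared (here it is bound to the URL that the original binds to ns2, which is a guess about the intended document):

```python
import xml.etree.ElementTree as ET

# Assumption: ns3 is declared; the original declares only ns2.
data = """<ns3:result xmlns:ns3="http://ws.def.com/">
<ns3:value>QWESW12323D2412123S</ns3:value>
</ns3:result>"""

root = ET.fromstring(data)

# Style 1: put the namespace URI in braces in front of the tag name.
item = root.find("{http://ws.def.com/}value")

# Style 2: keep the prefix and pass a prefix-to-URI mapping.
same_item = root.find("ns3:value", {"ns3": "http://ws.def.com/"})

print(item.text)
```

The prefix used in the mapping does not have to match the one in the file; only the URI is compared.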

How to (push) parse XML files in Python?

I've already seen this question, but it's from 2009.
What's a simple modern way to handle XML files in Python 3?
For example, given this TLD (adapted from here):
<?xml version="1.0" encoding="UTF-8" ?>
<taglib>
<tlib-version>1.0</tlib-version>
<short-name>bar-baz</short-name>
<tag>
<name>present</name>
<tag-class>condpkg.IfSimpleTag</tag-class>
<body-content>scriptless</body-content>
<attribute>
<name>test</name>
<required>true</required>
<rtexprvalue>true</rtexprvalue>
</attribute>
</tag>
</taglib>
I want to parse TLD files (Java Server Pages Tag Library Descriptors) to obtain some sort of structure in Python (I have yet to decide on that part).
Hence, I need a push parser. But I won't do much more with it, so I'd prefer a simple API (I'm new to Python).
xml.etree.ElementTree is still there, in the standard library:
import xml.etree.ElementTree as ET
data = """your xml here"""
tree = ET.fromstring(data)
print(tree.find('tag/name').text) # prints "present"
If you look outside the standard library, there is the very popular and fast lxml module, which follows the ElementTree interface and supports Python 3:
from lxml import etree as ET
data = """your xml here"""
tree = ET.fromstring(data)
print(tree.find('tag/name').text) # prints "present"
In addition, lxml.objectify lets you work with the XML structure as if it were a Python object.
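As a rough sketch of how the TLD from the question could be turned into plain Python structures with the standard library: the parse_tld function and the dictionary keys below are invented for the example, and this builds the whole tree in memory rather than using a true push parser, which should be fine for small descriptor files.

```python
import xml.etree.ElementTree as ET

TLD = """<taglib>
  <tlib-version>1.0</tlib-version>
  <short-name>bar-baz</short-name>
  <tag>
    <name>present</name>
    <tag-class>condpkg.IfSimpleTag</tag-class>
    <body-content>scriptless</body-content>
    <attribute>
      <name>test</name>
      <required>true</required>
      <rtexprvalue>true</rtexprvalue>
    </attribute>
  </tag>
</taglib>"""


def parse_tld(text):
    """Return the tag library as nested dicts and lists (hypothetical layout)."""
    root = ET.fromstring(text)
    tags = []
    for tag in root.findall("tag"):
        attributes = [
            {
                "name": attr.findtext("name"),
                "required": attr.findtext("required") == "true",
            }
            for attr in tag.findall("attribute")
        ]
        tags.append({
            "name": tag.findtext("name"),
            "class": tag.findtext("tag-class"),
            "attributes": attributes,
        })
    return {"short_name": root.findtext("short-name"), "tags": tags}


library = parse_tld(TLD)
print(library["tags"][0]["name"])
```

findtext() returns None when a child element is missing, so optional TLD elements do not raise an error in this sketch.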

Python ElementTree - print out namespace definitions?

I'm using Python's ElementTree to parse some XML configuration files.
At the top of the file, I have a root element like this:
<?xml version="1.0" encoding="utf-8"?>
<sgx:FooConfig
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:foo="http://ns.au.firm.com/foo.xsd"
xmlns:bar="http://ns.au.firm.com/bar.xsd"
>
The problem is, the bar namespace can be set to one of two different XSDs, depending on the version of the configuration file.
I'm looking for a way to print out the namespace mapping using ElementTree, so I can check which of the two XSDs is being used - then I can get my code to handle the correct case.
Is there a way to print out all the namespace definitions using Python?
Cheers,
Victor
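What you have is not valid XML (the sgx prefix is undefined). Once the prefixes are declared, lxml exposes the mapping directly as nsmap; in the standard library, xml.etree.ElementTree.iterparse can report the declarations through its start-ns events.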
import lxml.etree as et
tree = et.XML(yourxml)
print(tree.nsmap)
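For a standard-library route, xml.etree.ElementTree.iterparse can emit a start-ns event for each namespace declaration. The sketch below assumes the root element is made valid by declaring the sgx prefix; the sgx URL is made up for the example.

```python
import io
import xml.etree.ElementTree as ET

# Assumption: the sgx prefix is declared; its URL here is hypothetical.
config = b"""<sgx:FooConfig
    xmlns:sgx="http://ns.au.firm.com/sgx.xsd"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:foo="http://ns.au.firm.com/foo.xsd"
    xmlns:bar="http://ns.au.firm.com/bar.xsd">
</sgx:FooConfig>"""

# Each start-ns event carries a (prefix, uri) pair.
nsmap = {}
for event, (prefix, uri) in ET.iterparse(io.BytesIO(config), events=("start-ns",)):
    nsmap[prefix] = uri

for prefix, uri in nsmap.items():
    print(prefix, uri)
```

Checking nsmap["bar"] then tells you which of the two XSDs the file uses. If a prefix is redeclared deeper in the document, the dictionary keeps only the last declaration seen.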
