Emitting namespace specifications with ElementTree in Python

I am trying to emit an XML file with element-tree that contains an XML declaration and namespaces. Here is my sample code:
from xml.etree import ElementTree as ET
ET.register_namespace('com',"http://www.company.com") #some name
# build a tree structure
root = ET.Element("STUFF")
body = ET.SubElement(root, "MORE_STUFF")
body.text = "STUFF EVERYWHERE!"
# wrap it in an ElementTree instance, and save as XML
tree = ET.ElementTree(root)
tree.write("page.xml",
xml_declaration=True,
method="xml" )
However, neither the <?xml ...?> declaration nor any namespace/prefix information appears in the output. I'm more than a little confused here.

Although the docs say otherwise, I was only able to get the <?xml ...?> declaration by specifying both xml_declaration and encoding.
You have to create the nodes in the namespace you've registered for the prefix to appear on them in the file. Here's a fixed version of your code:
from xml.etree import ElementTree as ET
ET.register_namespace('com',"http://www.company.com") #some name
# build a tree structure
root = ET.Element("{http://www.company.com}STUFF")
body = ET.SubElement(root, "{http://www.company.com}MORE_STUFF")
body.text = "STUFF EVERYWHERE!"
# wrap it in an ElementTree instance, and save as XML
tree = ET.ElementTree(root)
tree.write("page.xml",
xml_declaration=True,encoding='utf-8',
method="xml")
Output (page.xml)
<?xml version='1.0' encoding='utf-8'?><com:STUFF xmlns:com="http://www.company.com"><com:MORE_STUFF>STUFF EVERYWHERE!</com:MORE_STUFF></com:STUFF>
ElementTree doesn't pretty-print either. Here's pretty-printed output:
<?xml version='1.0' encoding='utf-8'?>
<com:STUFF xmlns:com="http://www.company.com">
<com:MORE_STUFF>STUFF EVERYWHERE!</com:MORE_STUFF>
</com:STUFF>
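If you want the library to produce the indentation itself, Python 3.9 and later provide ET.indent() in the standard library. The following is a minimal sketch under that assumption; on older interpreters the guard simply skips the indentation step and the output stays on one line.

```python
from xml.etree import ElementTree as ET

# Register the prefix so the serializer writes com: instead of ns0:
ET.register_namespace("com", "http://www.company.com")

root = ET.Element("{http://www.company.com}STUFF")
body = ET.SubElement(root, "{http://www.company.com}MORE_STUFF")
body.text = "STUFF EVERYWHERE!"

# ET.indent() was added in Python 3.9; it inserts whitespace in place
if hasattr(ET, "indent"):
    ET.indent(root, space="  ")

# encoding="unicode" returns a str instead of bytes
pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```

The same call works on an ElementTree instance before tree.write(), so it can be combined with xml_declaration and encoding as in the code above.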
You can also pass a default namespace to write(), in which case you don't need to register one:
from xml.etree import ElementTree as ET
# build a tree structure
root = ET.Element("{http://www.company.com}STUFF")
body = ET.SubElement(root, "{http://www.company.com}MORE_STUFF")
body.text = "STUFF EVERYWHERE!"
# wrap it in an ElementTree instance, and save as XML
tree = ET.ElementTree(root)
tree.write("page.xml",
xml_declaration=True,encoding='utf-8',
method="xml",default_namespace='http://www.company.com')
Output (pretty-print spacing is mine)
<?xml version='1.0' encoding='utf-8'?>
<STUFF xmlns="http://www.company.com">
<MORE_STUFF>STUFF EVERYWHERE!</MORE_STUFF>
</STUFF>

I've never been able to get the <?xml declaration out of the ElementTree libraries programmatically, so I'd suggest you try something like this.
from xml.etree import ElementTree as ET
root = ET.Element("STUFF")
# ElementTree writes unqualified attribute names verbatim, so this emits xmlns:com="..."
root.set('xmlns:com', 'http://www.company.com')
body = ET.SubElement(root, "MORE_STUFF")
body.text = "STUFF EVERYWHERE!"
# encoding='unicode' makes tostring() return str rather than bytes (Python 3)
with open('page.xml', 'w', encoding='utf-8') as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>' + ET.tostring(root, encoding='unicode'))
ElementTree implementations outside the standard library may specify namespaces differently, so if you decide to move to lxml, the way you declare them will be different.

Related

Unexpected results when parsing XML via lxml

The output of my XML parsing is not as expected.
The xml file
<?xml version="1.0"?>
<stationaer xsi:schemaLocation="http:/foo.bar" xmlns="http://foo.bar" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<einrichtung>
<name>Name</name>
</einrichtung>
<einrichtung>
<name>Name</name>
</einrichtung>
</stationaer>
I would expect to get something like root.tag == 'stationaer' and child.tag == 'einrichtung'.
See the output at the end.
This is the MWE:
#!/usr/bin/env python3
import pathlib
import lxml
from lxml import etree
import pandas
xml_src = '''<?xml version="1.0"?>
<stationaer xsi:schemaLocation="http:/foo.bar" xmlns="http://foo.bar" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<einrichtung>
<name>Name</name>
</einrichtung>
<einrichtung>
<name>Name</name>
</einrichtung>
</stationaer>
'''
# tree = etree.parse(file_path)
# root = tree.getroot()
root = etree.fromstring(xml_src)
print(repr(root.tag))
print(repr(root.text))
child = root[0]  # getchildren() is deprecated; index the element directly
print(repr(child.tag))
print(repr(child.text))
The output for root is
'{http://foo.bar}stationaer'
'\n '
and for child
'{http://foo.bar}einrichtung'
'\n '
I don't understand what's going on here and why that URL is in the output.
This is actually not unexpected. The elements in the XML document are bound to the http://foo.bar default namespace. The namespace is declared by xmlns="http://foo.bar" on the root element and the declaration is inherited by all descendants.
The special notation with the namespace URI enclosed in curly braces ({http://foo.bar}stationaer) is never used in XML documents, but it is used by lxml and ElementTree when printing element (tag) names. It can also be used when searching or creating elements that belong to a namespace.
More information:
https://www.w3.org/TR/xml-names/
https://lxml.de/tutorial.html#namespaces
https://docs.python.org/3/library/xml.etree.elementtree.html#parsing-xml-with-namespaces
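To make the curly-brace notation concrete, here is a small sketch using the standard library's xml.etree.ElementTree on a trimmed copy of the document above. The helper local_name() and the prefix "f" are names invented for this illustration; they are not part of either library.

```python
import xml.etree.ElementTree as ET

xml_src = """<stationaer xmlns="http://foo.bar">
  <einrichtung><name>Name</name></einrichtung>
  <einrichtung><name>Name</name></einrichtung>
</stationaer>"""

NS = "http://foo.bar"
root = ET.fromstring(xml_src)

# Search with the {uri}tag notation directly ...
children = root.findall("{%s}einrichtung" % NS)

# ... or with a prefix map; the prefix "f" is arbitrary and local to this call
names = [child.find("f:name", {"f": NS}).text for child in children]


def local_name(tag):
    """Hypothetical helper: drop the {uri} part of a qualified tag name."""
    return tag.rsplit("}", 1)[-1]


print(local_name(root.tag), names)
```

Stripping the namespace like this is only for display or comparison; searches still need the qualified name or a prefix map.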

How to get the content of child->child->child->child in XML file using Python

<?xml version="1.0" encoding="UTF-8"?>
<Document xmlns="urn:iso:std:iso:20022:tech:xsd:camt.056.001.01" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<FIToFIPmtCxlReq>
<Assgnmt>
<Id>TEST-ISO-81</Id>
<Assgnr>
<Agt>
<FinInstnId>
<BIC>CCCCGB2L</BIC>
</FinInstnId>
</Agt>
</Assgnr>
<Assgne>
<Agt>
<FinInstnId>
<BIC>MMMMGB2L</BIC>
</FinInstnId>
</Agt>
</Assgne>
<CreDtTm>2009-03-24T11:22:59</CreDtTm>
</Assgnmt>
<TxInf>
<CxlId>103012345</CxlId>
<Case>
<Id>ISO_TEST_CASE</Id>
<Cretr>
<Agt>
<FinInstnId>
<BIC>MMMMGB2L</BIC>
</FinInstnId>
</Agt>
</Cretr>
</Case>
</TxInf>
</Undrlyg>
</FIToFIPmtCxlReq>
</Document>
Here I want to get the content of "TxInf": all of its children, their children, and the data they contain.
What I have tried is:
import xml.etree.ElementTree as ET
from xml.etree import ElementTree
tree = ET.parse('R3-CAMT.056.001.07-ISO-V.XML')
root = tree.getroot()
for element in root.iter():
    if element.tag == "{urn:iso:std:iso:20022:tech:xsd:camt.056.001.01}TxInf":
        tree._setroot(element.tag)
print(root.tag)
print(root.attrib)
Please suggest whether I can change the root with _setroot or any other method.
Try something along these lines on your code to see if it works:
for r in root.findall(".//*"):
    if 'TxInf' in r.tag:
        print(ET.tostring(r))
By the way, it may be easier to do it with lxml, if you can use it.
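As an alternative to matching on a substring of the tag, the element can be looked up with a namespace map and wrapped in its own ElementTree, which avoids the private _setroot method. This is a sketch on a trimmed, well-formed copy of the sample; the prefix "c" and the variable names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

xml_src = """<Document xmlns="urn:iso:std:iso:20022:tech:xsd:camt.056.001.01">
  <FIToFIPmtCxlReq>
    <TxInf>
      <CxlId>103012345</CxlId>
      <Case><Id>ISO_TEST_CASE</Id></Case>
    </TxInf>
  </FIToFIPmtCxlReq>
</Document>"""

ns = {"c": "urn:iso:std:iso:20022:tech:xsd:camt.056.001.01"}
root = ET.fromstring(xml_src)

# Locate TxInf anywhere below the root, using the namespace prefix map
tx_inf = root.find(".//c:TxInf", ns)

# A new ElementTree whose root is TxInf; the original tree is left untouched
subtree = ET.ElementTree(tx_inf)

# Collect the text of every leaf element under TxInf, keyed by local tag name
leaves = {
    el.tag.rsplit("}", 1)[-1]: el.text
    for el in tx_inf.iter()
    if len(el) == 0
}
print(leaves)
```

subtree.write(...) would then serialize only the TxInf branch.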

Python: Xml parsing method

I have a problem with a python script which is used to parse a xml file. This is the xml file:
file.xml
<Tag1 SchemaVersion="1.1" xmlns="http://www.microsoft.com/axe">
<RandomTag>TextText</RandomTag>
<Tag2 xmlns="http://schemas.datacontract.org/2004/07">
<AnotherRandom>Abc</AnotherRandom>
</Tag2>
</Tag1>
I am using xml.etree.ElementTree to parse it. My task is to change the text inside RandomTag (in this case "TextText"). This is the Python code:
python code
import xml.etree.ElementTree as ET
customXmlFile = 'file.xml'
ns = {
'ns': 'http://www.microsoft.com/axe',
'sc': 'http://schemas.datacontract.org/2004/07/Microsoft.Assessments.Relax.ObjectModel_V1'
}
tree = ET.parse(customXmlFile)
root = tree.getroot()
node = root.find('ns:RandomTag', namespaces=ns)
node.text = 'NEW TEXT'
ET.register_namespace('', 'http://www.microsoft.com/axe')
tree.write(customXmlFile + ".new",
xml_declaration=True,
encoding='utf-8',
method="xml")
I don't get any runtime errors and the code works, but all the namespaces are declared on the first node (Tag1), and a prefix shortcut is used for Tag2 and AnotherRandom. Also, the SchemaVersion is deleted.
file.xml.new - output
<?xml version='1.0' encoding='utf-8'?>
<Tag1 xmlns="http://www.microsoft.com/axe" xmlns:ns1="http://schemas.datacontract.org/2004/07" SchemaVersion="1.1">
<RandomTag>NEW TEXT</RandomTag>
<ns1:Tag2>
<ns1:AnotherRandom>Abc</ns1:AnotherRandom>
</ns1:Tag2>
</Tag1>
file.xml.new - desired output
<Tag1 SchemaVersion="1.1" xmlns="http://www.microsoft.com/axe">
<RandomTag>TextText</RandomTag>
<Tag2 xmlns="http://schemas.datacontract.org/2004/07">
<AnotherRandom>NEW TEXT</AnotherRandom>
</Tag2>
</Tag1>
What should I change to get exactly the same XML format as at the beginning, with only that text changed?
This is a bit of a hack, but it will do roughly what you want. However, playing around with namespaces like this surely violates the XML standard. I suggest you check out lxml if you want better handling of namespaces.
You must call register_namespace() before parsing the file. Since a repeated call to that function with the same prefix overwrites the previous mapping, you must edit the internal dict manually to map a second URI to the empty prefix.
import xml.etree.ElementTree as ET
customXmlFile = 'test.xml'
ns = {'ns': 'http://www.microsoft.com/axe',
'sc': 'http://schemas.datacontract.org/2004/07/'}
ET.register_namespace('', 'http://www.microsoft.com/axe')
ET._namespace_map['http://schemas.datacontract.org/2004/07'] = ''
tree = ET.parse(customXmlFile)
root = tree.getroot()
node = root.find('ns:RandomTag', namespaces=ns)
node.text = 'NEW TEXT'
tree.write(customXmlFile + ".new",
xml_declaration=True,
encoding='utf-8',
method="xml")
For more information about this see:
http://effbot.org/zone/element-namespaces.htm
Saving XML files using ElementTree
Cannot write XML file with default namespace
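If an equivalent document with explicit prefixes is acceptable, the inner namespace can instead be registered under a real prefix through the public register_namespace() API. The sketch below shows how ElementTree then serializes the sample; the prefix "dc" is an assumption chosen for the example, and the declarations still end up on the root element rather than on Tag2.

```python
import xml.etree.ElementTree as ET

src = (
    '<Tag1 SchemaVersion="1.1" xmlns="http://www.microsoft.com/axe">'
    "<RandomTag>TextText</RandomTag>"
    '<Tag2 xmlns="http://schemas.datacontract.org/2004/07">'
    "<AnotherRandom>Abc</AnotherRandom>"
    "</Tag2>"
    "</Tag1>"
)

# Outer namespace becomes the default; the inner one gets the prefix "dc"
ET.register_namespace("", "http://www.microsoft.com/axe")
ET.register_namespace("dc", "http://schemas.datacontract.org/2004/07")

root = ET.fromstring(src)
root.find("{http://www.microsoft.com/axe}RandomTag").text = "NEW TEXT"

out = ET.tostring(root, encoding="unicode")
print(out)
```

The result is namespace-equivalent to the input, but not byte-for-byte identical, which is the limitation the hack above tries to work around.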

python elementree blank output

I am parsing XML output from vCloud, but I am not able to reach the values.
<?xml version="1.0" encoding="UTF-8"?>
<SupportedVersions xmlns="http://www.vmware.com/vcloud/versions" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.vmware.com/vcloud/versions http://10.10.6.12/api/versions/schema/versions.xsd">
<VersionInfo>
<Version>1.5</Version>
<LoginUrl>https://api.vcd.portal.skyscapecloud.com/api/sessions</LoginUrl>
<MediaTypeMapping>
<MediaType>application/vnd.vmware.vcloud.instantiateVAppTemplateParams+xml</MediaType>
<ComplexTypeName>InstantiateVAppTemplateParamsType</ComplexTypeName>
<SchemaLocation>http://api.vcd.portal.skyscapecloud.com/api/v1.5/schema/master.xsd</SchemaLocation>
</MediaTypeMapping>
<MediaTypeMapping>
<MediaType>application/vnd.vmware.admin.vmwProviderVdcReferences+xml</MediaType>
<ComplexTypeName>VMWProviderVdcReferencesType</ComplexTypeName>
<SchemaLocation>http://api.vcd.portal.skyscapecloud.com/api/v1.5/schema/vmwextensions.xsd</SchemaLocation>
</MediaTypeMapping>
<MediaTypeMapping>
<MediaType>application/vnd.vmware.vcloud.customizationSection+xml</MediaType>
<ComplexTypeName>CustomizationSectionType</ComplexTypeName>
<SchemaLocation>http://api.vcd.portal.skyscapecloud.com/api/v1.5/schema/master.xsd</SchemaLocation>
</MediaTypeMapping>
this is what I have been using
import xml.etree.ElementTree as ET
data = ET.fromstring(content)
versioninfo = data.findall("VersionInfo/Version")
print len(versioninfo)
print versioninfo.text
However, this gives a blank output. Any suggestions?
Try this:
import xml.etree.ElementTree as ET
data = ET.fromstring(content)
versioninfo = data.find(
"ns:VersionInfo/ns:Version",
namespaces={'ns':'http://www.vmware.com/vcloud/versions'})
print(versioninfo.text)
Use .find(), not .findall(), to return a single element.
Your XML uses namespaces. The full path to your desired element is '{http://www.vmware.com/vcloud/versions}VersionInfo/{http://www.vmware.com/vcloud/versions}Version'. By passing in the namespaces parameter, you can use the shortcut syntax ns:VersionInfo/ns:Version.
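The difference between the three lookups can be checked on a trimmed copy of the response. This is a minimal sketch; the variable names are chosen for the example and the sample keeps only the elements needed for the path.

```python
import xml.etree.ElementTree as ET

content = """<SupportedVersions xmlns="http://www.vmware.com/vcloud/versions">
  <VersionInfo>
    <Version>1.5</Version>
  </VersionInfo>
</SupportedVersions>"""

data = ET.fromstring(content)
uri = "http://www.vmware.com/vcloud/versions"

# Unqualified names match nothing, because every element is in the default namespace
unqualified = data.findall("VersionInfo/Version")

# Fully qualified path in {uri}tag notation
full = data.find("{%s}VersionInfo/{%s}Version" % (uri, uri))

# The same path using a prefix map
short = data.find("ns:VersionInfo/ns:Version", namespaces={"ns": uri})

print(len(unqualified), full.text, short.text)
```

Both qualified forms return the same element object; the prefix map only saves typing.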

parsing xml by python lxml tree.xpath

I am trying to parse a huge file; a sample is below. I am trying to get <Name>, but I can't.
It works only without this line:
<LevelLayout xmlns="http://schemas.datacontract.org/2004/07/ArcherTech.Common.Domain" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
xml2 = '''<?xml version="1.0" encoding="UTF-8"?>
<PackageLevelLayout>
<LevelLayouts>
<LevelLayout levelGuid="4a54f032-325e-4988-8621-2cb7b49d8432">
<LevelLayout xmlns="http://schemas.datacontract.org/2004/07/ArcherTech.Common.Domain" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<LevelLayoutSectionBase>
<LevelLayoutItemBase>
<Name>Tracking ID</Name>
</LevelLayoutItemBase>
</LevelLayoutSectionBase>
</LevelLayout>
</LevelLayout>
</LevelLayouts>
</PackageLevelLayout>'''
from lxml import etree
tree = etree.XML(xml2)
nodes = tree.xpath('/PackageLevelLayout/LevelLayouts/LevelLayout[@levelGuid="4a54f032-325e-4988-8621-2cb7b49d8432"]/LevelLayout/LevelLayoutSectionBase/LevelLayoutItemBase/Name')
print nodes
Your nested LevelLayout XML document uses a namespace. I'd use:
tree.xpath('.//LevelLayout[@levelGuid="4a54f032-325e-4988-8621-2cb7b49d8432"]//*[local-name()="Name"]')
to match the Name element with a shorter XPath expression (ignoring the namespace altogether).
The alternative is to use a prefix-to-namespace mapping and use those on your tags:
nsmap = {'acd': 'http://schemas.datacontract.org/2004/07/ArcherTech.Common.Domain'}
tree.xpath('/PackageLevelLayout/LevelLayouts/LevelLayout[@levelGuid="4a54f032-325e-4988-8621-2cb7b49d8432"]/acd:LevelLayout/acd:LevelLayoutSectionBase/acd:LevelLayoutItemBase/acd:Name',
namespaces=nsmap)
lxml's xpath method has a namespaces parameter. You can pass it a dict mapping namespace prefixes to namespace URIs. Then you can build XPaths that use the namespace prefix:
xml2 = '''<?xml version="1.0" encoding="UTF-8"?>
<PackageLevelLayout>
<LevelLayouts>
<LevelLayout levelGuid="4a54f032-325e-4988-8621-2cb7b49d8432">
<LevelLayout xmlns="http://schemas.datacontract.org/2004/07/ArcherTech.Common.Domain" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<LevelLayoutSectionBase>
<LevelLayoutItemBase>
<Name>Tracking ID</Name>
</LevelLayoutItemBase>
</LevelLayoutSectionBase>
</LevelLayout>
</LevelLayout>
</LevelLayouts>
</PackageLevelLayout>'''
namespaces={'ns': 'http://schemas.datacontract.org/2004/07/ArcherTech.Common.Domain',
'i': 'http://www.w3.org/2001/XMLSchema-instance'}
import lxml.etree as ET
# This is an lxml.etree._Element, not a tree, so don't call it tree
root = ET.XML(xml2)
nodes = root.xpath(
'''/PackageLevelLayout/LevelLayouts/LevelLayout[@levelGuid="4a54f032-325e-4988-8621-2cb7b49d8432"]
/ns:LevelLayout/ns:LevelLayoutSectionBase/ns:LevelLayoutItemBase/ns:Name''', namespaces = namespaces)
print(nodes)
yields
[<Element {http://schemas.datacontract.org/2004/07/ArcherTech.Common.Domain}Name at 0xb74974dc>]
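The same prefix-map idea carries over to the standard library's xml.etree.ElementTree, whose find()/findall() accept a namespaces dict and a limited XPath subset that includes [@attr='value'] predicates. The sketch below is an stdlib variant, not the lxml xpath() call shown above; paths are written relative to the root element, and the variable names are chosen for the example.

```python
import xml.etree.ElementTree as ET

xml2 = """<PackageLevelLayout>
  <LevelLayouts>
    <LevelLayout levelGuid="4a54f032-325e-4988-8621-2cb7b49d8432">
      <LevelLayout xmlns="http://schemas.datacontract.org/2004/07/ArcherTech.Common.Domain">
        <LevelLayoutSectionBase>
          <LevelLayoutItemBase>
            <Name>Tracking ID</Name>
          </LevelLayoutItemBase>
        </LevelLayoutSectionBase>
      </LevelLayout>
    </LevelLayout>
  </LevelLayouts>
</PackageLevelLayout>"""

ns = {"ns": "http://schemas.datacontract.org/2004/07/ArcherTech.Common.Domain"}
root = ET.fromstring(xml2)

# The outer LevelLayout is not in any namespace, so it needs no prefix
outer = root.find(
    "LevelLayouts/LevelLayout[@levelGuid='4a54f032-325e-4988-8621-2cb7b49d8432']"
)

# Everything below the inner LevelLayout inherits its default namespace
names = outer.findall(
    "ns:LevelLayout/ns:LevelLayoutSectionBase/ns:LevelLayoutItemBase/ns:Name", ns
)

# An unprefixed search finds nothing, which is the behaviour the question ran into
unprefixed = root.findall(".//Name")

print([n.text for n in names], unprefixed)
```

Functions such as local-name() are not available in this subset, so the shorter lxml expression above has no direct stdlib counterpart.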
