Python: Xml parsing method

Python: Xml parsing method - python

I have a problem with a python script which is used to parse a xml file. This is the xml file:
file.xml
<Tag1 SchemaVersion="1.1" xmlns="http://www.microsoft.com/axe">
<RandomTag>TextText</RandomTag>
<Tag2 xmlns="http://schemas.datacontract.org/2004/07">
<AnotherRandom>Abc</AnotherRandom>
</Tag2>
</Tag1>
I am using xml.etree.ElementTree as parsing method. My task is to change the tags between RandomTag (in this case "TextText"). This is the python code:
python code
import xml.etree.ElementTree as ET
customXmlFile = 'file.xml'
ns = {
'ns': 'http://www.microsoft.com/axe',
'sc': 'http://schemas.datacontract.org/2004/07/Microsoft.Assessments.Relax.ObjectModel_V1'
}
tree = ET.parse(customXmlFile)
root = tree.getroot()
node = root.find('ns:RandomTag', namespaces=ns)
node.text = 'NEW TEXT'
ET.register_namespace('', 'http://www.microsoft.com/axe')
tree.write(customXmlFile + ".new",
xml_declaration=True,
encoding='utf-8',
method="xml")
I don't have run time errors, the code works fine, but all the namespaces are defined in the first node (Tag1) and in AnotherRandom and Tag2 is used a shorcut. Also, the SchemaVersion is deleted.
file.xml.new - output
<?xml version='1.0' encoding='utf-8'?>
<Tag1 xmlns="http://www.microsoft.com/axe" xmlns:ns1="http://schemas.datacontract.org/2004/07" SchemaVersion="1.1">
<RandomTag>NEW TEXT</RandomTag>
<ns1:Tag2>
<ns1:AnotherRandom>Abc</ns1:AnotherRandom>
</ns1:Tag2>
</Tag1>
file.xml.new - desired output
<Tag1 SchemaVersion="1.1" xmlns="http://www.microsoft.com/axe">
<RandomTag>TextText</RandomTag>
<Tag2 xmlns="http://schemas.datacontract.org/2004/07">
<AnotherRandom>NEW TEXT</AnotherRandom>
</Tag2>
</Tag1>
What should I change to get exact the same format of XML as at the beggining with that only text changed?

This is a bit of a hack but will do kind of what you want. However, playing around with namespaces like this surely violates the XML standard. I suggest you check out lxml if you want better handling of namespaces.
You must call register_namespace() before parsing in the file. Since repeated calls to that function overwrite previous mapping, you must manually edit the internal dict.
import xml.etree.ElementTree as ET
customXmlFile = 'test.xml'
ns = {'ns': 'http://www.microsoft.com/axe',
'sc': 'http://schemas.datacontract.org/2004/07/'}
ET.register_namespace('', 'http://www.microsoft.com/axe')
ET._namespace_map['http://schemas.datacontract.org/2004/07'] = ''
tree = ET.parse(customXmlFile)
root = tree.getroot()
node = root.find('ns:RandomTag', namespaces=ns)
node.text = 'NEW TEXT'
tree.write(customXmlFile + ".new",
xml_declaration=True,
encoding='utf-8',
method="xml")
For more information about this see:
http://effbot.org/zone/element-namespaces.htm
Saving XML files using ElementTree
Cannot write XML file with default namespace

Related

Unexpected results when parsing XML via lxml

The output of my xml parsing is not es expected.
The xml file
<?xml version="1.0"?>
<stationaer xsi:schemaLocation="http:/foo.bar" xmlns="http://foo.bar" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<einrichtung>
<name>Name</name>
</einrichtung>
<einrichtung>
<name>Name</name>
</einrichtung>
</stationaer>
I would expect to get something like root.tag == 'stationaer' and child.tag = 'einrichtung'.
See the outpout at the end.
This is the MWE
#!/usr/bin/env python3
import pathlib
import lxml
from lxml import etree
import pandas
xml_src = '''<?xml version="1.0"?>
<stationaer xsi:schemaLocation="http:/foo.bar" xmlns="http://foo.bar" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<einrichtung>
<name>Name</name>
</einrichtung>
<einrichtung>
<name>Name</name>
</einrichtung>
</stationaer>
'''
# tree = etree.parse(file_path)
# root = tree.getroot()
root = etree.fromstring(xml_src)
print(repr(root.tag))
print(repr(root.text))
child = root.getchildren()[0]
print(repr(child.tag))
print(repr(child.text))
The output for root is
'{http://foo.bar}stationaer'
'\n '
and for child
'{http://foo.bar}einrichtung'
'\n '
I don't understand what's going on here and why that URL is in the output.

This is actually not unexpected. The elements in the XML document are bound to the http://foo.bar default namespace. The namespace is declared by xmlns="http://foo.bar" on the root element and the declaration is inherited by all descendants.
The special notation with the namespace URI enclosed in curly braces ({http://foo.bar}stationaer) is never used in XML documents, but it is used by lxml and ElementTree when printing element (tag) names. It can also be used when searching or creating elements that belong to a namespace.
More information:
https://www.w3.org/TR/xml-names/
https://lxml.de/tutorial.html#namespaces
https://docs.python.org/3/library/xml.etree.elementtree.html#parsing-xml-with-namespaces

Need some help generating XML with Python

I have some variables in Python that I need to store as XML. I have been using the python:LXML module for this so far. Not too experienced with it. Have tried playing around with various tutorials and docs, but I am at a dead end need some help.
Here is the python script:
root = etree.Element("root")
coins=etree.Element("coins")
doc=etree.ElementTree(coins)
coins.append(etree.Element("trader"))
coins.append(etree.Element("metal"))
coins.append(etree.Element("type"))
coins.append(etree.Element("price"))
coins[0].text="Gold.co.uk"
coins[0].attrib["variable"]=("GLDAG_MAPLE")
coins[1].text="Silver"
coins[2].text="Britannia"
coins[3].text=str(GLDAG_MAPLE)
doc.write('data.xml', pretty_print=True)
As of now it outputs this:
<coins>
<trader variable="GLDAG_MAPLE">Gold.co.uk</trader>
<metal>Silver</metal>
<type>Britannia</type>
<price>
£31.20
</price>
</coins>
However I would like it to look like this:
<root>
<coin>
<trader> Gold.co.uk </trader>
<type> Britannia </type>
<price> £31.20 </price>
</coin>
</root>
The tags and their sub-tags would be duplicated for every type of coin. I have no idea how to construct the XML so that the output looks like the third code-block. So far I have tried to follow other scripts that I have seen on github and other sites but modify them to suit my needs but my scripts keep failing or producing incorrect resaults for some reason.
If someone could help me out then that would be great!

You can simply append the Element to root:
from lxml import etree
coinItems = [
{'trader': 'Gold.co.uk', 'metal': 'Silver', 'type': 'Britannia'},
{'trader': 'copper.co.uk', 'metal': 'Copper', 'type': 'World'}
]
root = etree.Element("root")
for ci in coinItems:
coin=etree.Element("coin")
etree.SubElement(coin, "trader", {'variable': 'GLDAG_MAPLE'}).text = ci['trader'] # example how to use attributes!
etree.SubElement(coin, "metal").text = ci['metal']
etree.SubElement(coin, "type").text = ci['type']
root.append(coin)
fName = '/tmp/data.xml'
with open(fName, 'wb') as f:
# remove encoding here, in case you want escaped ASCII characters: £
f.write(etree.tostring(root, xml_declaration=True, encoding="utf-8", pretty_print=True))
print(open(fName).read())
Output:
<?xml version='1.0' encoding='utf-8'?>
<root>
<coin>
<trader variable="GLDAG_MAPLE">Gold.co.uk</trader>
<metal>Silver</metal>
<type>Britannia</type>
</coin>
<coin>
<trader variable="GLDAG_MAPLE">copper.co.uk</trader>
<metal>Copper</metal>
<type>World</type>
</coin>
</root>

I prefer using the lxml builder (https://lxml.de/api/lxml.builder.ElementMaker-class.html) because imho it is easier to see the structure of your XML document.
from lxml.builder import E
root = E.root(
E.coin(
E.trader("Gold.co.uk",
variable="GLDAG_MAPLE"),
E.metal("silver"),
E.price("£31.20")
)
)
You can then append the root element to your main document.

How to force ElementTree to keep xmlns attribute within its original element?

I have an input XML file:
<?xml version='1.0' encoding='utf-8'?>
<configuration>
<runtime name="test" version="1.2" xmlns:ns0="urn:schemas-microsoft-com:asm.v1">
<ns0:assemblyBinding>
<ns0:dependentAssembly />
</ns0:assemblyBinding>
</runtime>
</configuration>
...and Python script:
import xml.etree.ElementTree as ET
file_xml = 'test.xml'
tree = ET.parse(file_xml)
root = tree.getroot()
print (root.tag)
print (root.attrib)
element_runtime = root.find('.//runtime')
print (element_runtime.tag)
print (element_runtime.attrib)
tree.write(file_xml, xml_declaration=True, encoding='utf-8', method="xml")
...which gives the following output:
>test.py
configuration
{}
runtime
{'name': 'test', 'version': '1.2'}
...and has an undesirable side-effect of modifying XML into:
<?xml version='1.0' encoding='utf-8'?>
<configuration xmlns:ns0="urn:schemas-microsoft-com:asm.v1">
<runtime name="test" version="1.2">
<ns0:assemblyBinding>
<ns0:dependentAssembly />
</ns0:assemblyBinding>
</runtime>
</configuration>
My original script modifies XML so I do have to call tree.write and save edited file. But the problem is that ElementTree parser moves xmlns attribute from runtime element up to the root element configuration which is not desirable in my case.
I can't remove xmlns attribute from the root element (remove it from the dictionary of its attributes) as it is not listed in a list of its attributes (unlike the attributes listed for runtime element).
Why does xmlns attribute never gets listed within the list of attributes for any element?
How to force ElementTree to keep xmlns attribute within its original element?
I am using Python 3.5.1 on Windows.

xml.etree.ElementTree pulls all namespaces into the first element as it internally doesn't track on which element the namespace was declared originally.
If you don't want that, you'll have to write your own serialisation logic.
The better alternative would be to use lxml instead of xml.etree, because it preserves the location where a namespace prefix is declared.

Following #mata advice, here I give an answer with an example with code and xml file attached.
The xml input is as shown in the picture (original and modified)
The python codes check the NtnlCcy Name and if it is "EUR", convert the Price to USD (by multiplying EURUSD: = 1.2) and change the NtnlCcy Name to "USD".
The python code is as follows:
from lxml import etree
pathToXMLfile = r"C:\Xiang\codes\Python\afmreports\test_original.xml"
tree = etree.parse(pathToXMLfile)
root = tree.getroot()
EURUSD = 1.2
for Rchild in root:
print ("Root child: ", Rchild.tag, ". \n")
if Rchild.tag.endswith("Pyld"):
for PyldChild in Rchild:
print ("Pyld Child: ", PyldChild.tag, ". \n")
Doc = Rchild.find('{001.003}Document')
FinInstrNodes = Doc.findall('{001.003}FinInstr')
for FinInstrNode in FinInstrNodes:
FinCcyNode = FinInstrNode.find('{001.003}NtnlCcy')
FinPriceNode = FinInstrNode.find('{001.003}Price')
FinCcyNodeText = ""
if FinCcyNode is not None:
CcyNodeText = FinCcyNode.text
if CcyNodeText == "EUR":
PriceText = FinPriceNode.text
Price = float(PriceText)
FinPriceNode.text = str(Price * EURUSD)
FinCcyNode.text = "USD"
tree.write(r"C:\Xiang\codes\Python\afmreports\test_modified.xml", encoding="utf-8", xml_declaration=True)
print("\n the program runs to the end! \n")
As we compare the original and modified xml files, the namespace remains unchanged, the whole structure of the xml remains unchanged, only some NtnlCcy and Price Nodes have been changed, as desired.
The only minor difference we do not want is the first line. In the original xml file, it is <?xml version="1.0" encoding="UTF-8"?>, while in the modified xml file, it is <?xml version='1.0' encoding='UTF-8'?>. The quotation sign changes from double quotation to single quotation. But we think this minor difference should not matter.
The original file context will be attached for your easy test:
<?xml version="1.0" encoding="UTF-8"?>
<BizData xmlns="001.001">
<Hdr>
<AppHdr xmlns="001.002">
<Fr>
<Id>XXX01</Id>
</Fr>
<To>
<Id>XXX02</Id>
</To>
<CreDt>2019-10-25T15:38:30</CreDt>
</AppHdr>
</Hdr>
<Pyld>
<Document xmlns="001.003">
<FinInstr>
<Id>NLENX240</Id>
<FullNm>AO.AAI</FullNm>
<NtnlCcy>EUR</NtnlCcy>
<Price>9</Price>
</FinInstr>
<FinInstr>
<Id>NLENX681</Id>
<FullNm>AO.ABN</FullNm>
<NtnlCcy>USD</NtnlCcy>
<Price>10</Price>
</FinInstr>
<FinInstr>
<Id>NLENX320</Id>
<FullNm>AO.ING</FullNm>
<NtnlCcy>EUR</NtnlCcy>
<Price>11</Price>
</FinInstr>
</Document>
</Pyld>

python elementree blank output

I am parsing an XML output from VCloud, however I am not able to reach to the values
<?xml version="1.0" encoding="UTF-8"?>
<SupportedVersions xmlns="http://www.vmware.com/vcloud/versions" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.vmware.com/vcloud/versions http://10.10.6.12/api/versions/schema/versions.xsd">
<VersionInfo>
<Version>1.5</Version>
<LoginUrl>https://api.vcd.portal.skyscapecloud.com/api/sessions</LoginUrl>
<MediaTypeMapping>
<MediaType>application/vnd.vmware.vcloud.instantiateVAppTemplateParams+xml</MediaType>
<ComplexTypeName>InstantiateVAppTemplateParamsType</ComplexTypeName>
<SchemaLocation>http://api.vcd.portal.skyscapecloud.com/api/v1.5/schema/master.xsd</SchemaLocation>
</MediaTypeMapping>
<MediaTypeMapping>
<MediaType>application/vnd.vmware.admin.vmwProviderVdcReferences+xml</MediaType>
<ComplexTypeName>VMWProviderVdcReferencesType</ComplexTypeName>
<SchemaLocation>http://api.vcd.portal.skyscapecloud.com/api/v1.5/schema/vmwextensions.xsd</SchemaLocation>
</MediaTypeMapping>
<MediaTypeMapping>
<MediaType>application/vnd.vmware.vcloud.customizationSection+xml</MediaType>
<ComplexTypeName>CustomizationSectionType</ComplexTypeName>
<SchemaLocation>http://api.vcd.portal.skyscapecloud.com/api/v1.5/schema/master.xsd</SchemaLocation>
</MediaTypeMapping>
this is what I have been using
import xml.etree.ElementTree as ET
data = ET.fromstring(content)
versioninfo = data.findall("VersionInfo/Version")
print len(versioninfo)
print versioninfo.text
however this gives a blank output...any suggestions?

Try this:
import xml.etree.ElementTree as ET
data = ET.fromstring(content)
versioninfo = data.find(
"ns:VersionInfo/ns:Version",
namespaces={'ns':'http://www.vmware.com/vcloud/versions'})
print versioninfo.text
Use .find(), not .findall() to return a single element
Your XML uses namespaces. The full path to your desired object is: '{http://www.vmware.com/vcloud/versions}VersionInfo/{http://www.vmware.com/vcloud/versions}Version' By passing in the namespaces parameter, you are able to use the shortcut syntax: ns:VersionInfo/ns:Version.

Emitting namespace specifications with ElementTree in Python

I am trying to emit an XML file with element-tree that contains an XML declaration and namespaces. Here is my sample code:
from xml.etree import ElementTree as ET
ET.register_namespace('com',"http://www.company.com") #some name
# build a tree structure
root = ET.Element("STUFF")
body = ET.SubElement(root, "MORE_STUFF")
body.text = "STUFF EVERYWHERE!"
# wrap it in an ElementTree instance, and save as XML
tree = ET.ElementTree(root)
tree.write("page.xml",
xml_declaration=True,
method="xml" )
However, neither the <?xml tag comes out nor any namespace/prefix information. I'm more than a little confused here.

Although the docs say otherwise, I only was able to get an <?xml> declaration by specifying both the xml_declaration and the encoding.
You have to declare nodes in the namespace you've registered to get the namespace on the nodes in the file. Here's a fixed version of your code:
from xml.etree import ElementTree as ET
ET.register_namespace('com',"http://www.company.com") #some name
# build a tree structure
root = ET.Element("{http://www.company.com}STUFF")
body = ET.SubElement(root, "{http://www.company.com}MORE_STUFF")
body.text = "STUFF EVERYWHERE!"
# wrap it in an ElementTree instance, and save as XML
tree = ET.ElementTree(root)
tree.write("page.xml",
xml_declaration=True,encoding='utf-8',
method="xml")
Output (page.xml)
<?xml version='1.0' encoding='utf-8'?><com:STUFF xmlns:com="http://www.company.com"><com:MORE_STUFF>STUFF EVERYWHERE!</com:MORE_STUFF></com:STUFF>
ElementTree doesn't pretty-print either. Here's pretty-printed output:
<?xml version='1.0' encoding='utf-8'?>
<com:STUFF xmlns:com="http://www.company.com">
<com:MORE_STUFF>STUFF EVERYWHERE!</com:MORE_STUFF>
</com:STUFF>
You can also declare a default namespace and don't need to register one:
from xml.etree import ElementTree as ET
# build a tree structure
root = ET.Element("{http://www.company.com}STUFF")
body = ET.SubElement(root, "{http://www.company.com}MORE_STUFF")
body.text = "STUFF EVERYWHERE!"
# wrap it in an ElementTree instance, and save as XML
tree = ET.ElementTree(root)
tree.write("page.xml",
xml_declaration=True,encoding='utf-8',
method="xml",default_namespace='http://www.company.com')
Output (pretty-print spacing is mine)
<?xml version='1.0' encoding='utf-8'?>
<STUFF xmlns="http://www.company.com">
<MORE_STUFF>STUFF EVERYWHERE!</MORE_STUFF>
</STUFF>

I've never been able to get the <?xml tag out of the element tree libraries programatically so I'd suggest you try something like this.
from xml.etree import ElementTree as ET
root = ET.Element("STUFF")
root.set('com','http://www.company.com')
body = ET.SubElement(root, "MORE_STUFF")
body.text = "STUFF EVERYWHERE!"
f = open('page.xml', 'w')
f.write('<?xml version="1.0" encoding="UTF-8"?>' + ET.tostring(root))
f.close()
Non std lib python ElementTree implementations may have different ways to specify namespaces, so if you decide to move to lxml, the way you declare those will be different.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python: Xml parsing method - python

Related

Unexpected results when parsing XML via lxml

Need some help generating XML with Python

How to force ElementTree to keep xmlns attribute within its original element?

python elementree blank output

Emitting namespace specifications with ElementTree in Python

Categories

Resources