Round-tripping Python's ElementTree from/tostring drops namespaces

I've got a base XML string that I want to build off of, so the first thing I do is parse the XML string into an etree.
However, it looks like the other namespaces "d" and "m" are being ignored. I can successfully parse the string into an XML Element:
import xml.etree.ElementTree as ET
BASE = """<?xml version="1.0" encoding="utf-8" ?>
<feed
xml:base="https://www.nuget.org/api/v2/"
xmlns="http://www.w3.org/2005/Atom"
xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
>
</feed>
"""
a = ET.fromstring(BASE)
# <Element '{http://www.w3.org/2005/Atom}feed' at 0x000002264B03F778>
But when we convert back to string, we drop the "d" and "m" namespaces:
ET.tostring(a)
# Formatted manually for StackOverflow
# b'<ns0:feed
# xmlns:ns0="http://www.w3.org/2005/Atom"
# xml:base="https://www.nuget.org/api/v2/">
# </ns0:feed>'
So what's going on here?

It appears that unused namespaces are dropped. If you change your BASE to something like this:
BASE = """<?xml version="1.0" encoding="utf-8" ?>
<feed
xml:base="https://www.nuget.org/api/v2/"
xmlns="http://www.w3.org/2005/Atom"
xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
>
<m:properties>
<d:Id>NuGetTest</d:Id>
</m:properties>
</feed>
"""
You'll see the missing namespaces:
>>> ET.tostring(a)
b'<ns0:feed
xmlns:ns0="http://www.w3.org/2005/Atom"
xmlns:ns1="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
xmlns:ns2="http://schemas.microsoft.com/ado/2007/08/dataservices"
xml:base="https://www.nuget.org/api/v2/">
<ns1:properties>
<ns2:Id>NuGetTest</ns2:Id>
</ns1:properties>
</ns0:feed>'
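Note that the prefixes change: d becomes ns2, m becomes ns1. ElementTree does not keep the original prefixes after parsing. When serializing, it generates ns0, ns1, ... in the order in which each namespace is first used, unless a prefix has been registered with ET.register_namespace().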
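To keep the original prefixes, they can be registered before serializing. The following is a minimal sketch, assuming the m:properties/d:Id body from the modified BASE above; the constant names ATOM, D and M are only illustrative. Note that register_namespace() is global to the process, and namespaces that no element or attribute uses are still left out of the output.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
D = "http://schemas.microsoft.com/ado/2007/08/dataservices"
M = "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"

# Register the prefixes globally; "" makes ATOM the default namespace.
ET.register_namespace("", ATOM)
ET.register_namespace("d", D)
ET.register_namespace("m", M)

BASE = (
    '<feed xmlns="%s" xmlns:d="%s" xmlns:m="%s">'
    "<m:properties><d:Id>NuGetTest</d:Id></m:properties>"
    "</feed>" % (ATOM, D, M)
)

root = ET.fromstring(BASE)
# encoding="unicode" returns str instead of bytes
out = ET.tostring(root, encoding="unicode")
print(out)
```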

Related

XML file generating unwanted data

I am trying to write a few things to an XML file after reading them from a different XML file. Everything works smoothly, but a few unwanted tags appear in the XML file that I generate as output.
Here is what I have tried
from xml.etree import ElementTree as ET
from xml.dom.minidom import getDOMImplementation
from xml.dom.minidom import parseString
tree = ET.parse('C:\\Users\\ca33.xml')
root = tree.getroot()
impl = getDOMImplementation()
#print(root)
header = [root.find('header')]
for h in header:
    h1=(parseString(ET.tostring(h)).toprettyxml(''))
    #print(h1)
commands = root.findall(".//records//")
recs=[c for c in commands if c.find('soc_id')!=None and c.find('soc_id').text[:9]=='000001051']
bb=""
for rec in recs:
    aa=(parseString(ET.tostring(rec)).toprettyxml(''))
    bb=bb+aa
#print(bb)
newdoc = impl.createDocument(None, "file"+h1+bb, None)
newdoc.writexml(open('data.xml', 'w'),'\n'.join([line for line in newdoc.toprettyxml(indent=' '*2).split('\n') if line.strip()]))
I get the output data.xml file as.
<?xml version="1.0" ?><?xml version="1.0" ?>
<file<?xml version="1.0" ?>
<header>
<number_of_records>41</number_of_records>
</header>
<?xml version="1.0" ?>
<record>
<soc_id>00000105139E3B82</soc_id>
</record>
<?xml version="1.0" ?>
<soc_id>00000105139E3640</soc_id>
</record>
<?xml version="1.0" ?>
<header>
<number_of_records>41</number_of_records>
As you can see, many <?xml version="1.0" ?> tags are generated throughout the file, and at the end it starts writing the data again from the beginning, leaving a two-line gap.

As I understand it, you are reading an XML file and then writing part of its data to a different file, and you run into problems in the process. Something like this should work:
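from xml.etree import ElementTree as ET

tree = ET.parse('C:\\Users\\ca33.xml')
root = tree.getroot()

# Serialize the children of each <header>; tostring() returns ASCII bytes by default.
str_header = ''
for header_ex in root.findall('header'):
    str_header += ''.join(ET.tostring(c).decode('ascii') for c in header_ex)

# Serialize every record whose soc_id starts with the wanted prefix.
str_rec = ''
for record_ex in root.findall('records'):
    for c in record_ex:
        soc_id = c.find('soc_id')
        if soc_id is not None and (soc_id.text or '').startswith('000001051'):
            str_rec += ET.tostring(c).decode('ascii')

# Write the declaration once, then the assembled document; "with" closes the file.
with open("output.xml", "w") as f:
    f.write("<?xml version='1.0' encoding='ASCII' standalone='yes'?>\n")
    f.write("<file><header>" + str_header + "</header>" + str_rec + "</file>")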
Since you have not posted any sample input, I assume it has the structure shown in your output. I also assume that record is a tag with further child tags inside it, which is why I loop over it twice.
Also, remove the unnecessary imports from your code.
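The repeated <?xml version="1.0" ?> lines in the question's output most likely come from calling toprettyxml() on whole minidom documents: every parseString(...) call creates a Document, and serializing a Document prepends its own declaration. A small sketch of the difference, using a one-record snippet as assumed sample data:

```python
from xml.dom.minidom import parseString

snippet = "<record><soc_id>00000105139E3B82</soc_id></record>"
doc = parseString(snippet)

# Serializing the Document node prepends an XML declaration ...
doc_xml = doc.toprettyxml(indent="")
# ... while serializing only its root element does not.
elem_xml = doc.documentElement.toprettyxml(indent="")

print(doc_xml)
print(elem_xml)
```

Concatenating several doc_xml-style strings therefore yields one declaration per fragment, which matches the output shown in the question.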

keep original xml declaration when editing with python

my original xml file looks like this:
<?xml version="1.0" encoding="utf-8"?>
<foo/>
and I want to change it to
<?xml version="1.0" encoding="utf-8"?>
<foo>
<bar>confusing dev</bar>
</foo>
I am using xml.etree.ElementTree as suggested by this tutorial
with open('file.xml','r+b') as f:
    tree = etree.parse(f)
    f.seek(0,0)
    tree.write(f,xml_declaration=True)# default argument: encoding="us-ascii"
this outputs
<?xml version='1.0' encoding='us-ascii'?>
<foo/>
But how do I get the encoding of file.xml at runtime and pass it as an argument to tree.write? Or is there a better way to edit XML in Python? I just want to change some Element.text but keep the declaration and namespace unchanged.
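As far as I know, xml.etree.ElementTree does not expose the encoding declared in the parsed file, so one option is to pass the encoding explicitly when writing. The sketch below assumes the file is known to be UTF-8, as in the example above; the temporary path is a hypothetical stand-in for file.xml. Writing by path replaces the file, which also avoids leftover bytes when the new content is shorter than the old.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample file standing in for file.xml
path = os.path.join(tempfile.mkdtemp(), "file.xml")
with open(path, "w", encoding="utf-8") as f:
    f.write('<?xml version="1.0" encoding="utf-8"?>\n<foo/>')

tree = ET.parse(path)
bar = ET.SubElement(tree.getroot(), "bar")
bar.text = "confusing dev"

# Assumption: the encoding is known to be UTF-8 and is passed explicitly.
tree.write(path, encoding="utf-8", xml_declaration=True)

with open(path, encoding="utf-8") as f:
    result = f.read()
print(result)
```

ElementTree writes the declaration with single quotes, so the bytes are not identical to the original header even though the meaning is the same.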

parsing xml by python lxml tree.xpath

I am trying to parse a huge file; a sample is below. I want to get the <Name> element, but I can't.
It works only without this line:
<LevelLayout xmlns="http://schemas.datacontract.org/2004/07/ArcherTech.Common.Domain" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
xml2 = '''<?xml version="1.0" encoding="UTF-8"?>
<PackageLevelLayout>
<LevelLayouts>
<LevelLayout levelGuid="4a54f032-325e-4988-8621-2cb7b49d8432">
<LevelLayout xmlns="http://schemas.datacontract.org/2004/07/ArcherTech.Common.Domain" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<LevelLayoutSectionBase>
<LevelLayoutItemBase>
<Name>Tracking ID</Name>
</LevelLayoutItemBase>
</LevelLayoutSectionBase>
</LevelLayout>
</LevelLayout>
</LevelLayouts>
</PackageLevelLayout>'''
from lxml import etree
tree = etree.XML(xml2)
nodes = tree.xpath('/PackageLevelLayout/LevelLayouts/LevelLayout[@levelGuid="4a54f032-325e-4988-8621-2cb7b49d8432"]/LevelLayout/LevelLayoutSectionBase/LevelLayoutItemBase/Name')
print(nodes)

Your nested LevelLayout XML document uses a namespace. I'd use:
tree.xpath('.//LevelLayout[@levelGuid="4a54f032-325e-4988-8621-2cb7b49d8432"]//*[local-name()="Name"]')
to match the Name element with a shorter XPath expression (ignoring the namespace altogether).
The alternative is to use a prefix-to-namespace mapping and use those prefixes on your tags:
nsmap = {'acd': 'http://schemas.datacontract.org/2004/07/ArcherTech.Common.Domain'}
tree.xpath('/PackageLevelLayout/LevelLayouts/LevelLayout[@levelGuid="4a54f032-325e-4988-8621-2cb7b49d8432"]/acd:LevelLayout/acd:LevelLayoutSectionBase/acd:LevelLayoutItemBase/acd:Name',
           namespaces=nsmap)

lxml's xpath method has a namespaces parameter. You can pass it a dict mapping namespace prefixes to namespace URIs. Then you can build XPaths that use the namespace prefix:
xml2 = '''<?xml version="1.0" encoding="UTF-8"?>
<PackageLevelLayout>
<LevelLayouts>
<LevelLayout levelGuid="4a54f032-325e-4988-8621-2cb7b49d8432">
<LevelLayout xmlns="http://schemas.datacontract.org/2004/07/ArcherTech.Common.Domain" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<LevelLayoutSectionBase>
<LevelLayoutItemBase>
<Name>Tracking ID</Name>
</LevelLayoutItemBase>
</LevelLayoutSectionBase>
</LevelLayout>
</LevelLayout>
</LevelLayouts>
</PackageLevelLayout>'''
namespaces={'ns': 'http://schemas.datacontract.org/2004/07/ArcherTech.Common.Domain',
'i': 'http://www.w3.org/2001/XMLSchema-instance'}
import lxml.etree as ET
# This is an lxml.etree._Element, not a tree, so don't call it tree
root = ET.XML(xml2)
nodes = root.xpath(
    '''/PackageLevelLayout/LevelLayouts/LevelLayout[@levelGuid="4a54f032-325e-4988-8621-2cb7b49d8432"]
    /ns:LevelLayout/ns:LevelLayoutSectionBase/ns:LevelLayoutItemBase/ns:Name''', namespaces = namespaces)
print(nodes)
yields
[<Element {http://schemas.datacontract.org/2004/07/ArcherTech.Common.Domain}Name at 0xb74974dc>]
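The same prefix-mapping idea also works with the standard library's xml.etree.ElementTree, whose findall() supports a limited XPath subset. This is a sketch for comparison, not the lxml API used above; the sample is a condensed copy of xml2 and the prefix ns is arbitrary.

```python
import xml.etree.ElementTree as ET

NS_URI = "http://schemas.datacontract.org/2004/07/ArcherTech.Common.Domain"

# Condensed copy of the sample document from the question
xml2 = (
    "<PackageLevelLayout><LevelLayouts>"
    '<LevelLayout levelGuid="4a54f032-325e-4988-8621-2cb7b49d8432">'
    '<LevelLayout xmlns="%s">'
    "<LevelLayoutSectionBase><LevelLayoutItemBase>"
    "<Name>Tracking ID</Name>"
    "</LevelLayoutItemBase></LevelLayoutSectionBase>"
    "</LevelLayout></LevelLayout>"
    "</LevelLayouts></PackageLevelLayout>" % NS_URI
)

namespaces = {"ns": NS_URI}
root = ET.fromstring(xml2)

# Paths are relative to root here, so the root tag itself is not part of the path.
nodes = root.findall(
    "./LevelLayouts/LevelLayout[@levelGuid='4a54f032-325e-4988-8621-2cb7b49d8432']"
    "/ns:LevelLayout//ns:Name",
    namespaces,
)
print([n.text for n in nodes])
```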

XML header getting removed after processing with elementtree

I have an XML file and I used ElementTree to add a new tag to it. My XML file before processing is as follows:
<?xml version="1.0" encoding="utf-8"?>
<PackageInfo xmlns="http://someurlpackage">
<data ID="http://someurldata1">data1</data >
<data ID="http://someurldata2">data2</data >
<data ID="http://someurldata3">data3</data >
</PackageInfo>
I used the following Python code to add a new data tag and write it to my XML file:
tree = ET.ElementTree(xmlFile)
root = tree.getroot()
elem= ET.Element('data')
elem.attrib['ID']="http://someurldata4"
elem.text='data4'
root[1].append(elem)
tree = ET.ElementTree(root)
tree.write(xmlFile)
But the resulting XML file is missing <?xml version="1.0" encoding="utf-8"?> and looks like this:
<PackageInfo xmlns="http://someurlpackage">
<data ID="http://someurldata1">data1</data >
<data ID="http://someurldata2">data2</data >
<data ID="http://someurldata3">data3</data >
</PackageInfo>
Is there any way to include the XML header other than hardcoding the line?

It looks like you need optional arguments to the write method to output the declaration.
http://docs.python.org/library/xml.etree.elementtree.html#elementtree-elementtree-objects
tree.write(xmlfile,xml_declaration=True)
I'm afraid I'm not that familiar with xml.etree.ElementTree and its variation between Python releases.
Here's it working with lxml.etree:
>>> from lxml import etree
>>> sample = """<?xml version="1.0" encoding="utf-8"?>
... <PackageInfo xmlns="http://someurlpackage">
... <data ID="http://someurldata1">data1</data >
... <data ID="http://someurldata2">data2</data >
... <data ID="http://someurldata3">data3</data >
... </PackageInfo>"""
>>>
>>> doc = etree.XML(sample)
>>> data = doc.makeelement("data")
>>> data.attrib['ID'] = 'http://someurldata4'
>>> data.text = 'data4'
>>> doc.append(data)
>>> etree.tostring(doc,xml_declaration=True)
'<?xml version=\'1.0\' encoding=\'ASCII\'?>\n<PackageInfo xmlns="http://someurlpackage">\n<data ID="http://someurldata1">data1</data>\n<data ID="http://someurldata2">data2</data>\n<data ID="http://someurldata3">data3</data>\n<data ID="http://someurldata4">data4</data></PackageInfo>'
>>> etree.tostring(doc,xml_declaration=True,encoding='utf-8')
'<?xml version=\'1.0\' encoding=\'utf-8\'?>\n<PackageInfo xmlns="http://someurlpackage">\n<data ID="http://someurldata1">data1</data>\n<data ID="http://someurldata2">data2</data>\n<data ID="http://someurldata3">data3</data>\n<data ID="http://someurldata4">data4</data></PackageInfo>'

Try this:
tree.write(xmlFile, encoding="utf-8")
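Whether passing only an encoding emits the declaration depends on the Python version and the encoding. In current Python 3, write() skips the declaration for "utf-8" and "us-ascii" unless xml_declaration=True is also given. The sketch below checks this with an in-memory buffer; the serialize helper is hypothetical, and the sample is a shortened version of the PackageInfo document from the question.

```python
import io
import xml.etree.ElementTree as ET

NS = "http://someurlpackage"
# Keep the default namespace instead of a generated ns0 prefix (global setting).
ET.register_namespace("", NS)

root = ET.fromstring(
    '<PackageInfo xmlns="%s"><data ID="http://someurldata1">data1</data></PackageInfo>' % NS
)
elem = ET.SubElement(root, "{%s}data" % NS, {"ID": "http://someurldata4"})
elem.text = "data4"
tree = ET.ElementTree(root)

def serialize(tree, **kwargs):
    """Hypothetical helper: write the tree to memory and return the bytes."""
    buf = io.BytesIO()
    tree.write(buf, **kwargs)
    return buf.getvalue()

without_decl = serialize(tree, encoding="utf-8")
with_decl = serialize(tree, encoding="utf-8", xml_declaration=True)
print(without_decl)
print(with_decl)
```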

If you are using Python <= 2.6, there is no xml_declaration parameter in ElementTree.write():
def write(self, file, encoding="us-ascii"):
def _write(self, file, node, encoding, namespaces):
You can use lxml.etree instead (install lxml first).
Sample here:
from lxml import etree
document = etree.Element('outer')
node = etree.SubElement(document, 'inner')
print(etree.tostring(document, xml_declaration=True))
BTW: I find that it is not necessary to write the XML declaration. From "Is the XML declaration node mandatory?":
There is no XML declaration necessary for a document to be
successfully readable, since there are defaults for both version and
encoding (1.0 and UTF-8, respectively).
At least, it works even if AndroidManifest.xml does not have an XML declaration. I have tried :-)

Combine XML files similar to ConfigParser's multiple file support

I'm writing an application configuration module that uses XML in its files. Consider the following example:
<?xml version="1.0" encoding="UTF-8"?>
<Settings>
<PathA>/Some/path/to/directory</PathA>
<PathB>/Another/path</PathB>
</Settings>
Now, I'd like to override certain elements in a different file that gets loaded afterwards. Example of the override file:
<?xml version="1.0" encoding="UTF-8"?>
<Settings>
<PathB>/Change/this/path</PathB>
</Settings>
When querying the document (with overrides) with XPath, I'd like to get this as the element tree:
<?xml version="1.0" encoding="UTF-8"?>
<Settings>
<PathA>/Some/path/to/directory</PathA>
<PathB>/Change/this/path</PathB>
</Settings>
This is similar to what Python's ConfigParser does with its read() method, but done with XML. How can I implement this?

You could convert the XML into an instance of a Python class:
import lxml.etree as ET
class Settings(object):
    def __init__(self, text):
        # lxml rejects str input that carries an encoding declaration, so parse bytes
        root = ET.fromstring(text.encode('utf-8'))
        self.settings = dict((elt.tag, elt.text) for elt in root.xpath('/Settings/*'))
    def update(self, other):
        self.settings.update(other.settings)
text='''\
<?xml version="1.0" encoding="UTF-8"?>
<Settings>
<PathA>/Some/path/to/directory</PathA>
<PathB>/Another/path</PathB>
</Settings>'''
text2='''\
<?xml version="1.0" encoding="UTF-8"?>
<Settings>
<PathB>/Change/this/path</PathB>
</Settings>'''
s=Settings(text)
s2=Settings(text2)
s.update(s2)
print(s.settings)
yields
{'PathB': '/Change/this/path', 'PathA': '/Some/path/to/directory'}
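If the merged result should stay an element tree that can still be queried by path, as the question asks, the override can also be applied to the tree directly. This is a minimal sketch using the standard library, assuming a flat Settings document in which each child tag occurs at most once; merge_settings is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def merge_settings(base_root, override_root):
    """Copy each child of override_root into base_root, replacing text by tag."""
    for override in override_root:
        existing = base_root.find(override.tag)
        if existing is None:
            base_root.append(override)      # new setting: add it
        else:
            existing.text = override.text   # existing setting: override its value
    return base_root

base = ET.fromstring(
    "<Settings><PathA>/Some/path/to/directory</PathA>"
    "<PathB>/Another/path</PathB></Settings>"
)
override = ET.fromstring("<Settings><PathB>/Change/this/path</PathB></Settings>")

merged = merge_settings(base, override)
print(merged.findtext("PathA"))
print(merged.findtext("PathB"))
```

Nested settings or repeated tags would need a recursive merge and a rule for matching elements, which this sketch leaves out.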

Must you use XML? The same could be achieved much more simply with JSON.
Suppose this is the text from the first config file:
text='''
{
"PathB": "/Another/path",
"PathA": "/Some/path/to/directory"
}
'''
and this is the text from the second:
text2='''{
"PathB": "/Change/this/path"
}'''
Then, to merge the two, you simply load each into a dict and call update:
import json
config=json.loads(text)
config2=json.loads(text2)
config.update(config2)
print(config)
yields the Python dict:
{u'PathB': u'/Change/this/path', u'PathA': u'/Some/path/to/directory'}
