Removing ":" from namespace in a XML file parsing - python

I am trying to modify an XML file using xml.etree.ElementTree on Python 2.6.6 (due to restrictions) and am facing the ns0 issue. I looked at this issue and used ET._namespace_map[uri] = prefix as suggested, which removed ns0, but the element tags still have the : character. How do we remove it, and does it affect the validity of the XML file when we use it for further processing?
Example:
<?xml version="1.0" encoding="UTF-8" ?>
<Seed xmlns="http://www.example.com">
<TagA>
<TagB>B</TagB>
<TagC>c</TagC>
</TagA>
</Seed>
Script
import xml.etree.ElementTree as ET
tree = ET.parse('sample.xml')
root = tree.getroot()
try:
    # register_namespace() only exists from Python 2.7 onwards
    ET.register_namespace("", "http://www.example.com")
except AttributeError:
    # Python 2.6 fallback: write to the private prefix map directly
    def register_namespace(prefix, uri):
        ET._namespace_map[uri] = prefix
    register_namespace("", "http://www.example.com")
tree.write('sample.xml')
Note: I cannot use lxml, or any xml.etree features that are only supported from Python 2.7 onwards.

Related

Unexpected results when parsing XML via lxml

The output of my XML parsing is not as expected.
The xml file
<?xml version="1.0"?>
<stationaer xsi:schemaLocation="http:/foo.bar" xmlns="http://foo.bar" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<einrichtung>
<name>Name</name>
</einrichtung>
<einrichtung>
<name>Name</name>
</einrichtung>
</stationaer>
I would expect to get something like root.tag == 'stationaer' and child.tag == 'einrichtung'.
See the output at the end.
This is the MWE:
#!/usr/bin/env python3
import pathlib
import lxml
from lxml import etree
import pandas
xml_src = '''<?xml version="1.0"?>
<stationaer xsi:schemaLocation="http:/foo.bar" xmlns="http://foo.bar" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<einrichtung>
<name>Name</name>
</einrichtung>
<einrichtung>
<name>Name</name>
</einrichtung>
</stationaer>
'''
# tree = etree.parse(file_path)
# root = tree.getroot()
root = etree.fromstring(xml_src)
print(repr(root.tag))
print(repr(root.text))
child = root[0]  # indexing replaces the deprecated getchildren()
print(repr(child.tag))
print(repr(child.text))
The output for root is
'{http://foo.bar}stationaer'
'\n '
and for child
'{http://foo.bar}einrichtung'
'\n '
I don't understand what's going on here and why that URL is in the output.
This is actually not unexpected. The elements in the XML document are bound to the http://foo.bar default namespace. The namespace is declared by xmlns="http://foo.bar" on the root element and the declaration is inherited by all descendants.
The special notation with the namespace URI enclosed in curly braces ({http://foo.bar}stationaer) is never used in XML documents, but it is used by lxml and ElementTree when printing element (tag) names. It can also be used when searching or creating elements that belong to a namespace.
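As a minimal sketch of the searching and creating cases mentioned above, the snippet below uses the standard-library xml.etree.ElementTree rather than lxml, since both accept the same {uri}localname notation. The trimmed-down document and the names NS, unqualified, qualified and extra are illustrative assumptions, not part of the original question.

```python
import xml.etree.ElementTree as ET

# Trimmed-down version of the document from the question (assumption).
xml_src = """<stationaer xmlns="http://foo.bar">
  <einrichtung><name>Name</name></einrichtung>
  <einrichtung><name>Name</name></einrichtung>
</stationaer>"""

NS = "http://foo.bar"
root = ET.fromstring(xml_src)

# Searching: a bare name does not match elements bound to a namespace...
unqualified = root.findall("einrichtung")
# ...but the {uri}localname form does.
qualified = root.findall("{%s}einrichtung" % NS)

# Creating: the same notation places a new element in the namespace.
extra = ET.SubElement(root, "{%s}einrichtung" % NS)

print(len(unqualified), len(qualified), extra.tag)
# prints: 0 2 {http://foo.bar}einrichtung
```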
More information:
https://www.w3.org/TR/xml-names/
https://lxml.de/tutorial.html#namespaces
https://docs.python.org/3/library/xml.etree.elementtree.html#parsing-xml-with-namespaces

python iterate xml avoiding namespace

In my Python script I want to iterate over my XML file, searching for a specific element tag.
I have a problem related to the namespace of the root tag.
Below is my XML structure:
<?xml version="1.0" ?>
<rootTag xmlns="blablabla">
<tag_1>
<sub_tag_1>..something..</sub_tag_1>
</tag_1>
<tag_2>
<sub_tag_2>..something..</sub_tag_2>
</tag_2>
...and so on...
</rootTag>
Below is my Python script:
import xml.etree.ElementTree as ET
root = ET.fromstring(xml_taken_from_web)
print(root.tag)
The problem is that the output of print is:
{blablabla}rootTag
so when I iterate over it, tag_1, tag_2, and so on all carry the {blablabla} prefix, and I am not able to check the tag names directly.
I tried using a regular expression like this:
root = re.sub('^{.*?}', '', root.tag)
The problem is that root is then a string, so I can no longer iterate over it as I would over an Element.
How can I print only rootTag?
lxml's QName helper extracts the local name from a namespaced tag:
import xml.etree.ElementTree as ET
from lxml import etree
root = ET.fromstring(xml_taken_from_web)
print(etree.QName(root.tag).localname)
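If lxml is not available, roughly the same result can be had with the standard library alone by splitting each tag on the closing brace. This is only a sketch: the helper local_name and the sample document are hypothetical stand-ins, not something from the original answer. It leaves the Element objects untouched, so the tree can still be iterated.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the document fetched from the web.
xml_taken_from_web = """<rootTag xmlns="blablabla">
  <tag_1><sub_tag_1>a</sub_tag_1></tag_1>
  <tag_2><sub_tag_2>b</sub_tag_2></tag_2>
</rootTag>"""


def local_name(tag):
    """Return the tag without its '{namespace}' prefix, if it has one."""
    return tag.rsplit("}", 1)[-1]


root = ET.fromstring(xml_taken_from_web)

# root stays an Element, so iteration still works; only the names are stripped.
names = [local_name(el.tag) for el in root.iter()]
print(names)
# prints: ['rootTag', 'tag_1', 'sub_tag_1', 'tag_2', 'sub_tag_2']
```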

python elementree blank output

I am parsing XML output from vCloud, but I am not able to reach the values:
<?xml version="1.0" encoding="UTF-8"?>
<SupportedVersions xmlns="http://www.vmware.com/vcloud/versions" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.vmware.com/vcloud/versions http://10.10.6.12/api/versions/schema/versions.xsd">
<VersionInfo>
<Version>1.5</Version>
<LoginUrl>https://api.vcd.portal.skyscapecloud.com/api/sessions</LoginUrl>
<MediaTypeMapping>
<MediaType>application/vnd.vmware.vcloud.instantiateVAppTemplateParams+xml</MediaType>
<ComplexTypeName>InstantiateVAppTemplateParamsType</ComplexTypeName>
<SchemaLocation>http://api.vcd.portal.skyscapecloud.com/api/v1.5/schema/master.xsd</SchemaLocation>
</MediaTypeMapping>
<MediaTypeMapping>
<MediaType>application/vnd.vmware.admin.vmwProviderVdcReferences+xml</MediaType>
<ComplexTypeName>VMWProviderVdcReferencesType</ComplexTypeName>
<SchemaLocation>http://api.vcd.portal.skyscapecloud.com/api/v1.5/schema/vmwextensions.xsd</SchemaLocation>
</MediaTypeMapping>
<MediaTypeMapping>
<MediaType>application/vnd.vmware.vcloud.customizationSection+xml</MediaType>
<ComplexTypeName>CustomizationSectionType</ComplexTypeName>
<SchemaLocation>http://api.vcd.portal.skyscapecloud.com/api/v1.5/schema/master.xsd</SchemaLocation>
</MediaTypeMapping>
This is what I have been using:
import xml.etree.ElementTree as ET
data = ET.fromstring(content)
versioninfo = data.findall("VersionInfo/Version")
print len(versioninfo)
print versioninfo.text
However, this gives blank output. Any suggestions?
Try this:
import xml.etree.ElementTree as ET
data = ET.fromstring(content)
versioninfo = data.find(
"ns:VersionInfo/ns:Version",
namespaces={'ns':'http://www.vmware.com/vcloud/versions'})
print versioninfo.text
Use .find(), not .findall(), to return a single element. (.findall() returns a list, which has no .text attribute.)
Your XML uses namespaces. The full path to your desired object is '{http://www.vmware.com/vcloud/versions}VersionInfo/{http://www.vmware.com/vcloud/versions}Version'. By passing in the namespaces parameter, you can use the shortcut syntax ns:VersionInfo/ns:Version.
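The difference between the unqualified and the prefixed path can be checked with a short Python 3 sketch. The shortened content string and the example.invalid URL are made-up stand-ins for the real vCloud response, and the prefix ns is arbitrary: only the URI it maps to matters.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the vCloud response.
content = """<SupportedVersions xmlns="http://www.vmware.com/vcloud/versions">
  <VersionInfo>
    <Version>1.5</Version>
    <LoginUrl>https://example.invalid/api/sessions</LoginUrl>
  </VersionInfo>
</SupportedVersions>"""

ns = {"ns": "http://www.vmware.com/vcloud/versions"}
data = ET.fromstring(content)

# The unqualified path matches nothing because every element is namespaced.
missing = data.find("VersionInfo/Version")

# find() returns one Element (or None); findall() returns a list of Elements.
version = data.find("ns:VersionInfo/ns:Version", namespaces=ns)
all_versions = data.findall("ns:VersionInfo/ns:Version", namespaces=ns)

print(missing)                          # prints: None
print(version.text)                     # prints: 1.5
print([v.text for v in all_versions])   # prints: ['1.5']
```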

python lxml with py2exe

I have generated an XML document with DOM and I want to use lxml to pretty-print it.
This is my code for pretty-printing the XML:
def prettify_xml(xml_str):
    import lxml.etree as etree
    root = etree.fromstring(xml_str)
    xml_str = etree.tostring(root, pretty_print=True)
    return xml_str
My output should be an XML-formatted string.
I got this code from a post on Stack Overflow. It works flawlessly when I run it with Python itself, but when I convert my project to a binary created with py2exe (my binary is a Windows service with a named pipe), I have two problems:
First, my service was not starting. I solved this by adding lxml.etree to the includes option of the py2exe function; after that my service started properly.
Second, when XML generation is called, this is the error I see in my log:
'module' object has no attribute 'fromstring'
Where do I rectify this error? And is my solution to the first problem correct?
My XML generation code:
from xml.etree import ElementTree
from xml.dom import minidom
from xml.etree.ElementTree import Element, SubElement, tostring, XML
import lxml.etree
def prettify_xml(xml_str):
    root = lxml.etree.fromstring(xml_str)
    xml_str = lxml.etree.tostring(root, pretty_print=True)
    return xml_str
def dll_xml(status):
    try:
        xml_declaration = '<?xml version="1.0" standalone="no" ?>'
        rootTagName='response'
        root = Element(rootTagName)
        root.set('id' , 'rp001')
        parent = SubElement(root, 'command', opcode ='-ac')
        # Create children
        chdtag1Name = 'mode'
        chdtag1Value = 'repreport'
        chdtag2Name='status'
        chdtag2Value = status
        fullchildtag1 = ''+chdtag1Name+' value = "'+chdtag1Value+'"'
        fullchildtag2=''+chdtag2Name+' value="'+chdtag2Value+'"'
        children = XML('''<root><'''+fullchildtag1+''' /><'''+fullchildtag2+'''/></root> ''')
        # Add parent
        parent.extend(children)
        dll_xml_doc = xml_declaration + tostring(root)
        dll_xml_doc = prettify_xml(dll_xml_doc)
        return dll_xml_doc
    except Exception , error:
        log.error("xml_generation_failed : %s" % error)
Try using PyInstaller instead of py2exe. I converted your program to a binary .exe with no problem just by running python pyinstaller.py YourPath\xml_a.py.

Emitting namespace specifications with ElementTree in Python

I am trying to emit an XML file with element-tree that contains an XML declaration and namespaces. Here is my sample code:
from xml.etree import ElementTree as ET
ET.register_namespace('com',"http://www.company.com") #some name
# build a tree structure
root = ET.Element("STUFF")
body = ET.SubElement(root, "MORE_STUFF")
body.text = "STUFF EVERYWHERE!"
# wrap it in an ElementTree instance, and save as XML
tree = ET.ElementTree(root)
tree.write("page.xml",
xml_declaration=True,
method="xml" )
However, neither the <?xml declaration nor any namespace/prefix information comes out. I'm more than a little confused here.
Although the docs say otherwise, I only was able to get an <?xml> declaration by specifying both the xml_declaration and the encoding.
You have to declare nodes in the namespace you've registered to get the namespace on the nodes in the file. Here's a fixed version of your code:
from xml.etree import ElementTree as ET
ET.register_namespace('com',"http://www.company.com") #some name
# build a tree structure
root = ET.Element("{http://www.company.com}STUFF")
body = ET.SubElement(root, "{http://www.company.com}MORE_STUFF")
body.text = "STUFF EVERYWHERE!"
# wrap it in an ElementTree instance, and save as XML
tree = ET.ElementTree(root)
tree.write("page.xml",
xml_declaration=True,encoding='utf-8',
method="xml")
Output (page.xml)
<?xml version='1.0' encoding='utf-8'?><com:STUFF xmlns:com="http://www.company.com"><com:MORE_STUFF>STUFF EVERYWHERE!</com:MORE_STUFF></com:STUFF>
ElementTree doesn't pretty-print either. Here's pretty-printed output:
<?xml version='1.0' encoding='utf-8'?>
<com:STUFF xmlns:com="http://www.company.com">
<com:MORE_STUFF>STUFF EVERYWHERE!</com:MORE_STUFF>
</com:STUFF>
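On Python 3.9 and later the standard library can add that indentation itself through ET.indent(). The following is a sketch under that version assumption; the hasattr guard and the variable name pretty are additions for illustration, and on older interpreters the tree is simply serialized on one line as before.

```python
import xml.etree.ElementTree as ET

ET.register_namespace("com", "http://www.company.com")

root = ET.Element("{http://www.company.com}STUFF")
body = ET.SubElement(root, "{http://www.company.com}MORE_STUFF")
body.text = "STUFF EVERYWHERE!"

# ET.indent() only exists on Python 3.9+; skip it on older versions.
if hasattr(ET, "indent"):
    ET.indent(root, space="  ")

# encoding="unicode" makes tostring() return str instead of bytes.
pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```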
You can also declare a default namespace, in which case you don't need to register a prefix:
from xml.etree import ElementTree as ET
# build a tree structure
root = ET.Element("{http://www.company.com}STUFF")
body = ET.SubElement(root, "{http://www.company.com}MORE_STUFF")
body.text = "STUFF EVERYWHERE!"
# wrap it in an ElementTree instance, and save as XML
tree = ET.ElementTree(root)
tree.write("page.xml",
xml_declaration=True,encoding='utf-8',
method="xml",default_namespace='http://www.company.com')
Output (pretty-print spacing is mine)
<?xml version='1.0' encoding='utf-8'?>
<STUFF xmlns="http://www.company.com">
<MORE_STUFF>STUFF EVERYWHERE!</MORE_STUFF>
</STUFF>
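One caveat with default_namespace is that every element in the tree must be namespace-qualified, otherwise ElementTree raises a ValueError. The sketch below writes to an in-memory buffer so the result can be inspected without touching the disk. The names NS, buf, output and rejected are illustrative, and the behaviour described is that of Python 3's standard-library implementation.

```python
import io
import xml.etree.ElementTree as ET

NS = "http://www.company.com"

root = ET.Element("{%s}STUFF" % NS)
body = ET.SubElement(root, "{%s}MORE_STUFF" % NS)
body.text = "STUFF EVERYWHERE!"

# Serialize into a bytes buffer instead of a file on disk.
buf = io.BytesIO()
ET.ElementTree(root).write(
    buf, xml_declaration=True, encoding="utf-8", default_namespace=NS
)
output = buf.getvalue().decode("utf-8")
print(output)

# An element without a namespace cannot be written with default_namespace.
plain = ET.Element("PLAIN")
try:
    ET.ElementTree(plain).write(io.BytesIO(), default_namespace=NS)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

If a tree mixes qualified and unqualified elements, registering a prefix as in the earlier version is the safer route.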
I've never been able to get the <?xml tag out of the element tree libraries programmatically, so I'd suggest you try something like this:
from xml.etree import ElementTree as ET
root = ET.Element("STUFF")
# Declare the prefix by hand; a plain 'com' attribute is not a namespace declaration
root.set('xmlns:com', 'http://www.company.com')
body = ET.SubElement(root, "MORE_STUFF")
body.text = "STUFF EVERYWHERE!"
# ET.tostring() returns bytes by default on Python 3, so write in binary mode
with open('page.xml', 'wb') as f:
    f.write(b'<?xml version="1.0" encoding="UTF-8"?>' + ET.tostring(root))
ElementTree implementations outside the Python standard library may have different ways to specify namespaces, so if you decide to move to lxml, the way you declare them will be different.
