python iterate xml avoiding namespace

python iterate xml avoiding namespace - python

with my python script i want to iterate my xml file searching a specific element tag.
I have some problem related to the namespace of the root tag.
Below my XML structure:
<?xml version="1.0" ?>
<rootTag xmlns="blablabla">
<tag_1>
<sub_tag_1>..something..</sub_tag_1>
</tag_1>
<tag_2>
<sub_tag_2>..something..</sub_tag_2>
</tag_2>
...and so on...
</rootTag>
Below my PYTHON script:
import xml.etree.ElementTree as ET
root = ET.fromstring(xml_taken_from_web)
print(root.tag)
The problem is that output of print is:
{blablabla}rootTag
so when i iter over it all the tag_1, tag_2, and so on tags will have the {blablabla} string so i'm not able to make any check on the tag.
I tried using regular expression in this way
root = re.sub('^{.*?}', '', root.tag)
the problem is that root after that is a string type and so i cannot over it such an Element type
How can i print only rootTag ?

With that just use:
import xml.etree.ElementTree as ET
from lxml import etree
root = ET.fromstring(xml_taken_from_web)
print(etree.QName(root.tag).localname)

Related

Unexpected results when parsing XML via lxml

The output of my xml parsing is not es expected.
The xml file
<?xml version="1.0"?>
<stationaer xsi:schemaLocation="http:/foo.bar" xmlns="http://foo.bar" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<einrichtung>
<name>Name</name>
</einrichtung>
<einrichtung>
<name>Name</name>
</einrichtung>
</stationaer>
I would expect to get something like root.tag == 'stationaer' and child.tag = 'einrichtung'.
See the outpout at the end.
This is the MWE
#!/usr/bin/env python3
import pathlib
import lxml
from lxml import etree
import pandas
xml_src = '''<?xml version="1.0"?>
<stationaer xsi:schemaLocation="http:/foo.bar" xmlns="http://foo.bar" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<einrichtung>
<name>Name</name>
</einrichtung>
<einrichtung>
<name>Name</name>
</einrichtung>
</stationaer>
'''
# tree = etree.parse(file_path)
# root = tree.getroot()
root = etree.fromstring(xml_src)
print(repr(root.tag))
print(repr(root.text))
child = root.getchildren()[0]
print(repr(child.tag))
print(repr(child.text))
The output for root is
'{http://foo.bar}stationaer'
'\n '
and for child
'{http://foo.bar}einrichtung'
'\n '
I don't understand what's going on here and why that URL is in the output.

This is actually not unexpected. The elements in the XML document are bound to the http://foo.bar default namespace. The namespace is declared by xmlns="http://foo.bar" on the root element and the declaration is inherited by all descendants.
The special notation with the namespace URI enclosed in curly braces ({http://foo.bar}stationaer) is never used in XML documents, but it is used by lxml and ElementTree when printing element (tag) names. It can also be used when searching or creating elements that belong to a namespace.
More information:
https://www.w3.org/TR/xml-names/
https://lxml.de/tutorial.html#namespaces
https://docs.python.org/3/library/xml.etree.elementtree.html#parsing-xml-with-namespaces

parsing XML in python by using xml.etree.ElementTree

I get an XML file using the request module, then I want to use the xml.etree.ElementTree module to get the output of the element
core-usg-01
but I'm already confused how to do it, im stuck. I tried writing this simple code to get the sysname element, but I get an empty output.
Python code:
import xml.etree.ElementTree as ET
tree = ET.parse('usg.xml')
root = tree.getroot()
print(root.findall('sysname'))
XML file:
<rpc-reply xmlns="urn:ietf:params:xml:ns:netconf:base:1.0" message-id="1">
<data>
<system-state xmlns="urn:ietf:params:xml:ns:yang:ietf-system">
<sysname xmlns="urn:huawei:params:xml:ns:yang:huawei-system">
core-usg-01
</sysname>
</system-state>
</data>
</rpc-reply>

You need to iter() over the root to reach to the child.
for child in root.iter():
print (child.tag, child.attrib)
Which will give you the present children tags and their attributes.
{urn:ietf:params:xml:ns:netconf:base:1.0}rpc-reply {'message-id': '1'}
{urn:ietf:params:xml:ns:netconf:base:1.0}data {}
{urn:ietf:params:xml:ns:yang:ietf-system}system-state {}
{urn:huawei:params:xml:ns:yang:huawei-system}sysname {}
Now you need to loop to your desired tag using following code:
for child in root.findall('.//{urn:ietf:params:xml:ns:yang:ietf-system}system-state'):
temp = child.find('.//{urn:huawei:params:xml:ns:yang:huawei-system}sysname')
print(temp.text)
The output will look like this:
core-usg-01

Try the below one liner
import xml.etree.ElementTree as ET
xml = '''<rpc-reply xmlns="urn:ietf:params:xml:ns:netconf:base:1.0" message-id="1">
<data>
<system-state xmlns="urn:ietf:params:xml:ns:yang:ietf-system">
<sysname xmlns="urn:huawei:params:xml:ns:yang:huawei-system">
core-usg-01
</sysname>
</system-state>
</data>
</rpc-reply>'''
root = ET.fromstring(xml)
print(root.find('.//{urn:huawei:params:xml:ns:yang:huawei-system}sysname').text)
output
core-usg-01

Removing ":" from namespace in a XML file parsing

I am trying to modify XML file using xml.etree.ElementTree on Python 2.6.6 (due to restrictions) and facing ns0 issue. I looked at this issue and used ET._namespace_map[uri] = prefix as suggested which removed ns0 but the element tags still has the : value. How do we remove it or does it impact the validity of the XML file when we use if for further processing?
Example:
<?xml version="1.0" encoding="UTF-8" ?>
<Seed xmlns="http://www.example.com">
<TagA>
<TagB>B</TagB>
<TagC>c</TagC>
</TagA>
</Seed>
Script
import xml.etree.ElementTree as ET
tree = ET.parse('sample.xml')
root = tree.getroot()
try:
ET.register_namespace("","http://example.com")
except AttributeError:
def register_namespace(prefix, uri):
ET._namespace_map[uri] = prefix
register_namespace("","http://www.example.com")
tree.write('sample.xml')
Note: I could not use lxml or other xml.etree that is supported only from 2.7 version.

Create array of values from specific element in XML using Python

I have an XML file which has many elements. I would like to create a list/array of all the values which have a specific element name, in my case "pair:ApplicationNumber".
I've gone over a lot of the other questions however I am not able to find an answer. I know that I can do this by loading the text file and going over it using pandas however, I'm sure there's a much better way.
I was unsuccessful trying ElementTree as well as XML.Dom using minidom
My code currently looks as follows:
import os
from xml.dom import minidom
WindowsUser = os.getenv('username')
XMLPath = os.path.join('C:\\Users', WindowsUser, 'Downloads', 'ApplicationsByCustomerNumber.xml')
xmldoc = minidom.parse(XMLPath)
itemlist = xmldoc.getElementsByTagName('pair:ApplicationNumber')
for s in itemlist:
print(s.attributes['pair:ApplicationNumber'].value)
an example XML file looks as follows:
<?xml version="1.0" encoding="UTF-8"?>
<pair:PatentApplicationList xsi:schemaLocation="urn:us:gov:uspto:pair PatentApplicationList.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:pair="urn:us:gov:uspto:pair">
<pair:FileHeader>
<pair:FileCreationTimeStamp>2017-07-10T10:52:12.12</pair:FileCreationTimeStamp>
</pair:FileHeader>
<pair:ApplicationStatusData>
<pair:ApplicationNumber>62383607</pair:ApplicationNumber>
<pair:ApplicationStatusCode>20</pair:ApplicationStatusCode>
<pair:ApplicationStatusText>Application Dispatched from Preexam, Not Yet Docketed</pair:ApplicationStatusText>
<pair:ApplicationStatusDate>2016-09-16</pair:ApplicationStatusDate>
<pair:AttorneyDocketNumber>1354-T-02-US</pair:AttorneyDocketNumber>
<pair:FilingDate>2016-09-06</pair:FilingDate>
<pair:LastModifiedTimestamp>2017-05-30T21:40:37.37</pair:LastModifiedTimestamp>
<pair:CustomerNumber>122761</pair:CustomerNumber><pair:LastFileHistoryTransaction>
<pair:LastTransactionDate>2017-05-30</pair:LastTransactionDate>
<pair:LastTransactionDescription>Email Notification</pair:LastTransactionDescription> </pair:LastFileHistoryTransaction>
<pair:ImageAvailabilityIndicator>true</pair:ImageAvailabilityIndicator>
</pair:ApplicationStatusData>
<pair:ApplicationStatusData>
<pair:ApplicationNumber>62292372</pair:ApplicationNumber>
<pair:ApplicationStatusCode>160</pair:ApplicationStatusCode>
<pair:ApplicationStatusText>Abandoned -- Incomplete Application (Pre-examination)</pair:ApplicationStatusText>
<pair:ApplicationStatusDate>2016-11-01</pair:ApplicationStatusDate>
<pair:AttorneyDocketNumber>681-S-23-US</pair:AttorneyDocketNumber>
<pair:FilingDate>2016-02-08</pair:FilingDate>
<pair:LastModifiedTimestamp>2017-06-20T21:59:26.26</pair:LastModifiedTimestamp>
<pair:CustomerNumber>122761</pair:CustomerNumber><pair:LastFileHistoryTransaction>
<pair:LastTransactionDate>2017-06-20</pair:LastTransactionDate>
<pair:LastTransactionDescription>Petition Entered</pair:LastTransactionDescription> </pair:LastFileHistoryTransaction>
<pair:ImageAvailabilityIndicator>true</pair:ImageAvailabilityIndicator>
</pair:ApplicationStatusData>
<pair:ApplicationStatusData>
<pair:ApplicationNumber>62289245</pair:ApplicationNumber>
<pair:ApplicationStatusCode>160</pair:ApplicationStatusCode>
<pair:ApplicationStatusText>Abandoned -- Incomplete Application (Pre-examination)</pair:ApplicationStatusText>
<pair:ApplicationStatusDate>2016-10-26</pair:ApplicationStatusDate>
<pair:AttorneyDocketNumber>1526-P-01-US</pair:AttorneyDocketNumber>
<pair:FilingDate>2016-01-31</pair:FilingDate>
<pair:LastModifiedTimestamp>2017-06-15T21:24:13.13</pair:LastModifiedTimestamp>
<pair:CustomerNumber>122761</pair:CustomerNumber><pair:LastFileHistoryTransaction>
<pair:LastTransactionDate>2017-06-15</pair:LastTransactionDate>
<pair:LastTransactionDescription>Petition Entered</pair:LastTransactionDescription> </pair:LastFileHistoryTransaction>
<pair:ImageAvailabilityIndicator>true</pair:ImageAvailabilityIndicator>
</pair:ApplicationStatusData>
</pair:PatentApplicationList>

The XML in your example is expanding the "pair:" part of the tags according to the schema you've used, so it doesn't match 'pair:ApplicationNumber', even though it looks like it should.
I've used element tree to extract the application numbers as follows (I've just used a local XML file in my examples, rather than the full path in your code)
Example 1:
from xml.etree import ElementTree
tree = ElementTree.parse('ApplicationsByCustomerNumber.xml')
root = tree.getroot()
for item in root:
if 'ApplicationStatusData' in item.tag:
for child in item:
if 'ApplicationNumber' in child.tag:
print child.text
Example 2:
from xml.etree import ElementTree
tree = ElementTree.parse('ApplicationsByCustomerNumber.xml')
root = tree.getroot()
for item in root.iter('{urn:us:gov:uspto:pair}ApplicationStatusData'):
for child in item.iter('{urn:us:gov:uspto:pair}ApplicationNumber'):
print child.text
Hope this may be useful.

python elementree blank output

I am parsing an XML output from VCloud, however I am not able to reach to the values
<?xml version="1.0" encoding="UTF-8"?>
<SupportedVersions xmlns="http://www.vmware.com/vcloud/versions" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.vmware.com/vcloud/versions http://10.10.6.12/api/versions/schema/versions.xsd">
<VersionInfo>
<Version>1.5</Version>
<LoginUrl>https://api.vcd.portal.skyscapecloud.com/api/sessions</LoginUrl>
<MediaTypeMapping>
<MediaType>application/vnd.vmware.vcloud.instantiateVAppTemplateParams+xml</MediaType>
<ComplexTypeName>InstantiateVAppTemplateParamsType</ComplexTypeName>
<SchemaLocation>http://api.vcd.portal.skyscapecloud.com/api/v1.5/schema/master.xsd</SchemaLocation>
</MediaTypeMapping>
<MediaTypeMapping>
<MediaType>application/vnd.vmware.admin.vmwProviderVdcReferences+xml</MediaType>
<ComplexTypeName>VMWProviderVdcReferencesType</ComplexTypeName>
<SchemaLocation>http://api.vcd.portal.skyscapecloud.com/api/v1.5/schema/vmwextensions.xsd</SchemaLocation>
</MediaTypeMapping>
<MediaTypeMapping>
<MediaType>application/vnd.vmware.vcloud.customizationSection+xml</MediaType>
<ComplexTypeName>CustomizationSectionType</ComplexTypeName>
<SchemaLocation>http://api.vcd.portal.skyscapecloud.com/api/v1.5/schema/master.xsd</SchemaLocation>
</MediaTypeMapping>
this is what I have been using
import xml.etree.ElementTree as ET
data = ET.fromstring(content)
versioninfo = data.findall("VersionInfo/Version")
print len(versioninfo)
print versioninfo.text
however this gives a blank output...any suggestions?

Try this:
import xml.etree.ElementTree as ET
data = ET.fromstring(content)
versioninfo = data.find(
"ns:VersionInfo/ns:Version",
namespaces={'ns':'http://www.vmware.com/vcloud/versions'})
print versioninfo.text
Use .find(), not .findall() to return a single element
Your XML uses namespaces. The full path to your desired object is: '{http://www.vmware.com/vcloud/versions}VersionInfo/{http://www.vmware.com/vcloud/versions}Version' By passing in the namespaces parameter, you are able to use the shortcut syntax: ns:VersionInfo/ns:Version.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

python iterate xml avoiding namespace - python

With that just use: import xml.etree.ElementTree as ET from lxml import etree root = ET.fromstring(xml_taken_from_web) print(etree.QName(root.tag).localname)

Related

Unexpected results when parsing XML via lxml

parsing XML in python by using xml.etree.ElementTree

Removing ":" from namespace in a XML file parsing

Create array of values from specific element in XML using Python

python elementree blank output

Categories

Resources