XML accessing elements within the tree with Etree python

I'm trying to access information in an XML file using Python's ElementTree. The XML looks like this:
<events-data>
<dossier-event event-type="new" id="EVT_4573534">
<event-date>
<date>20220816</date>
</event-date>
<event-code>EPIDOSNWIAI</event-code>
<event-text event-text-type="DESCRIPTION">text</event-text>
</dossier-event>
</events-data>
<events-data>
<dossier-event event-type="new" id="EVT_4573535">
<event-date>
<date>20220402</date>
</event-date>
<event-code>EPIDOS PCT</event-code>
<event-text event-text-type="DESCRIPTION">text1</event-text>
</dossier-event>
</events-data>
I want to access <date>20220402</date> and retrieve its value, 20220402. My attempt looks like this:
root_events = ET.fromstring(response_events.content)
for element in root_events.iter('{http://myapi/register}date'):
    print(element.text)
The problem: there is an unknown number of <date>[date]</date> elements before and after this one that are not inside <events-data> or <event-date>. But if I try to list the tags, attributes, or text of <event-date>, the result is empty. Can someone explain how I can access only the dates inside something like
<event-date>
<date>20220402</date>
</event-date>

If you fix your XML to have a single root tag, an XPath query might work for you:
import xml.etree.ElementTree as ET
for event_date in ET.parse("sample.xml").getroot().findall(".//events-data/dossier-event/event-date/date"):
    print(event_date.text)
Produces the following output:
$ python sample.py
20220816
20220402
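
To make the "root tag" fix concrete, here is a minimal sketch that wraps the fragments in a wrapper element before parsing. The `<root>` wrapper name and the stray top-level `<date>` are assumptions added for illustration; the sample is a shortened version of the XML in the question. If the real document declares a namespace, as the `{http://myapi/register}` prefix in the question's attempt suggests, each step of the path would need that namespace as well.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the fragments from the question, plus one stray <date>
# (hypothetical) that is NOT inside <event-date>.
fragments = """
<events-data>
  <dossier-event event-type="new" id="EVT_4573534">
    <event-date><date>20220816</date></event-date>
  </dossier-event>
</events-data>
<events-data>
  <dossier-event event-type="new" id="EVT_4573535">
    <event-date><date>20220402</date></event-date>
  </dossier-event>
</events-data>
<date>19990101</date>
"""

# Wrap everything in a hypothetical <root> element so there is a single root.
root = ET.fromstring(f"<root>{fragments}</root>")

# Match only <date> elements that are direct children of <event-date>;
# the stray top-level <date> is skipped.
dates = [d.text for d in root.findall(".//event-date/date")]
print(dates)
```

Restricting the path to `event-date/date` is what filters out the unrelated `<date>` elements elsewhere in the tree.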

Related

parsing XML in python by using xml.etree.ElementTree

I get an XML file using the request module, and I want to use the xml.etree.ElementTree module to read the value of the element
<sysname>core-usg-01</sysname>
but I'm confused about how to do it and I'm stuck. I tried writing this simple code to get the sysname element, but I get an empty output.
Python code:
import xml.etree.ElementTree as ET
tree = ET.parse('usg.xml')
root = tree.getroot()
print(root.findall('sysname'))
XML file:
<rpc-reply xmlns="urn:ietf:params:xml:ns:netconf:base:1.0" message-id="1">
<data>
<system-state xmlns="urn:ietf:params:xml:ns:yang:ietf-system">
<sysname xmlns="urn:huawei:params:xml:ns:yang:huawei-system">
core-usg-01
</sysname>
</system-state>
</data>
</rpc-reply>

You can iter() over the root to see every element in the tree.
for child in root.iter():
    print(child.tag, child.attrib)
This prints the tag and attributes of the root and all of its descendants:
{urn:ietf:params:xml:ns:netconf:base:1.0}rpc-reply {'message-id': '1'}
{urn:ietf:params:xml:ns:netconf:base:1.0}data {}
{urn:ietf:params:xml:ns:yang:ietf-system}system-state {}
{urn:huawei:params:xml:ns:yang:huawei-system}sysname {}
The tags carry their namespace in braces, so the search path must include it. You can reach the desired tag with the following code:
for child in root.findall('.//{urn:ietf:params:xml:ns:yang:ietf-system}system-state'):
    temp = child.find('.//{urn:huawei:params:xml:ns:yang:huawei-system}sysname')
    print(temp.text.strip())
The output will look like this:
core-usg-01

Try the one-liner below:
import xml.etree.ElementTree as ET
xml = '''<rpc-reply xmlns="urn:ietf:params:xml:ns:netconf:base:1.0" message-id="1">
<data>
<system-state xmlns="urn:ietf:params:xml:ns:yang:ietf-system">
<sysname xmlns="urn:huawei:params:xml:ns:yang:huawei-system">
core-usg-01
</sysname>
</system-state>
</data>
</rpc-reply>'''
root = ET.fromstring(xml)
print(root.find('.//{urn:huawei:params:xml:ns:yang:huawei-system}sysname').text.strip())
Output:
core-usg-01
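
As an alternative to writing the full namespace URI in braces, ElementTree's find() and findall() accept a prefix-to-URI mapping. The sketch below uses the XML from the question; the prefixes `sys` and `hw` are arbitrary labels chosen for this example and do not appear in the document itself.

```python
import xml.etree.ElementTree as ET

xml = '''<rpc-reply xmlns="urn:ietf:params:xml:ns:netconf:base:1.0" message-id="1">
<data>
<system-state xmlns="urn:ietf:params:xml:ns:yang:ietf-system">
<sysname xmlns="urn:huawei:params:xml:ns:yang:huawei-system">
core-usg-01
</sysname>
</system-state>
</data>
</rpc-reply>'''

# Prefix names are made up for this sketch; only the URIs must match the XML.
ns = {
    "sys": "urn:ietf:params:xml:ns:yang:ietf-system",
    "hw": "urn:huawei:params:xml:ns:yang:huawei-system",
}

root = ET.fromstring(xml)
sysname = root.find(".//sys:system-state/hw:sysname", ns)

# The element text includes the surrounding newlines, so strip it.
name = sysname.text.strip()
print(name)
```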

Python reading xml

I am a newbie at Python programming. I have a requirement to read an XML structure and build a new SOAP request XML from it by adding a namespace. Here is an example.
This is the XML I get from another system:
<foo>
<bar>
<type foobar="1"/>
<type foobar="2"/>
</bar>
</foo>
I want the final result to look like this:
<?xml version="1.0"?>
<soa:foo xmlns:soa="https://www.w3schools.com/furniture">
<soa:bar>
<soa:type foobar="1"/>
<soa:type foobar="2"/>
</soa:bar>
</soa:foo>
I looked in the Python documentation but could not find how to do this.

One option is to use lxml to iterate over all of the elements and add the namespace URI to the .tag property.
You can use register_namespace() to bind the URI to the desired prefix.
Example:
from lxml import etree
tree = etree.parse("input.xml")
etree.register_namespace("soa", "https://www.w3schools.com/furniture")
for elem in tree.iter():
    elem.tag = f"{{https://www.w3schools.com/furniture}}{elem.tag}"
print(etree.tostring(tree, pretty_print=True).decode())
Printed output:
<soa:foo xmlns:soa="https://www.w3schools.com/furniture">
<soa:bar>
<soa:type foobar="1"/>
<soa:type foobar="2"/>
</soa:bar>
</soa:foo>
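
If lxml is not available, the same idea can be sketched with the standard library's xml.etree.ElementTree, which also provides register_namespace(). This is an illustrative variant, not the answer's original code; the input is inlined as a string instead of being read from input.xml, and the `SOA_URI` constant is a name introduced here. Note that ElementTree has no pretty_print option, so the whitespace of the result follows the input.

```python
import xml.etree.ElementTree as ET

SOA_URI = "https://www.w3schools.com/furniture"

xml = """<foo>
  <bar>
    <type foobar="1"/>
    <type foobar="2"/>
  </bar>
</foo>"""

# Bind the URI to the "soa" prefix so the serializer uses it
# instead of an auto-generated prefix such as ns0.
ET.register_namespace("soa", SOA_URI)

root = ET.fromstring(xml)
for elem in root.iter():
    # Clark notation: "{uri}localname" puts the element in the namespace.
    elem.tag = f"{{{SOA_URI}}}{elem.tag}"

result = ET.tostring(root, encoding="unicode")
print(result)
```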

Parsing nested attributes

Good day, dear developers.
I can't fully parse an XML file.
The structure looks like this:
<foo>
<bar1 id="1">
<bar2>
<foobar id="2">name1</foobar>
<foobar id="3">name2</foobar>
</bar2>
</bar1>
</foo>
I use the xml.etree library, with code like:
source.get('Id')
to get the first attribute.
To get a nested tag I use code like:
source.find('bar/foobar').text
The question is how to get the nested attributes (id = 2 and id = 3).
I get an error when I try to use a path with a slash:
source.get('bar/id')
Other attempts return only the first attribute, which I already have; the nested attributes also have the same name, id.
Thank you in advance for the help.

Below is a working example:
import xml.etree.ElementTree as ET
xml = '''<foo>
<bar1 id="1">
<bar2>
<foobar id="2">name1</foobar>
<foobar id="3">name2</foobar>
</bar2>
</bar1>
</foo>'''
root = ET.fromstring(xml)
ids = [f.attrib.get('id') for f in root.findall('.//foobar')]
print(ids)
Output:
['2', '3']

You need to specify a working XPath expression, like:
foobars = source.findall('bar1/bar2/foobar')
for elem in foobars:
    print(elem.get('id'))
Output:
2
3

It works now for a single bar1, but what if there are several bar1 elements? Like this:
<foo>
<bar1 id="1">
<bar2>
<foobar id="2">name1</foobar>
<foobar id="3">name2</foobar>
</bar2>
</bar1>
<bar1 id="2">
<bar2>
<foobar id="2">name3</foobar>
<foobar id="3">name4</foobar>
</bar2>
</bar1>
</foo>
The loop (findall => for) prints all four ids, but I need just the two that belong to each bar1.
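
One way to group the ids per bar1, sketched here under the assumption that the goal is one list of foobar ids for each bar1, is to loop over the bar1 elements first and run a relative search inside each one. The dictionary name `ids_by_bar1` is introduced for this example only.

```python
import xml.etree.ElementTree as ET

xml = '''<foo>
<bar1 id="1">
<bar2>
<foobar id="2">name1</foobar>
<foobar id="3">name2</foobar>
</bar2>
</bar1>
<bar1 id="2">
<bar2>
<foobar id="2">name3</foobar>
<foobar id="3">name4</foobar>
</bar2>
</bar1>
</foo>'''

root = ET.fromstring(xml)

# Map each bar1 id to the ids of the foobar elements nested inside it.
ids_by_bar1 = {}
for bar1 in root.findall('bar1'):
    # The path is relative to the current bar1, so only its own
    # foobar children are returned.
    ids_by_bar1[bar1.get('id')] = [f.get('id') for f in bar1.findall('bar2/foobar')]

print(ids_by_bar1)
```

Because findall() is called on each bar1 element rather than on the root, every search is scoped to that one subtree.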

XPath with LXML Element

I am trying to parse an XML document using lxml etree. The XML doc I am parsing looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.openarchives.org/OAI/2.0/">
<codeBook version="2.5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="ddi:codebook:2_5" xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd">
<docDscr>
<citation>
<titlStmt>
<titl>Test Title</titl>
</titlStmt>
<prodStmt>
<prodDate/>
</prodStmt>
</citation>
</docDscr>
<stdyDscr>
<citation>
<titlStmt>
<titl>Test Title 2</titl>
<IDNo agency="UKDA">101</IDNo>
</titlStmt>
<rspStmt>
<AuthEnty>TestAuthEntry</AuthEnty>
</rspStmt>
<prodStmt>
<copyright>Yes</copyright>
</prodStmt>
<distStmt/>
<verStmt>
<version date="">1</version>
</verStmt>
</citation>
<stdyInfo>
<subject>
<keyword>2009</keyword>
<keyword>2010</keyword>
<topcClas>CLASS</topcClas>
<topcClas>ffdsf</topcClas>
</subject>
<abstract>This is an abstract piece of text.</abstract>
<sumDscr>
<timePrd event="single">2020</timePrd>
<nation>UK</nation>
<anlyUnit>Test</anlyUnit>
<universe>test</universe>
<universe>hello</universe>
<dataKind>fdsfdsf</dataKind>
</sumDscr>
</stdyInfo>
<method>
<dataColl>
<timeMeth>test timemeth</timeMeth>
<dataCollector>test data collector</dataCollector>
<sampProc>test sampprocess</sampProc>
<deviat>test deviat</deviat>
<collMode>test collMode</collMode>
<sources/>
</dataColl>
</method>
<dataAccs>
<setAvail>
<accsPlac>Test accsPlac</accsPlac>
</setAvail>
<useStmt>
<restrctn>NONE</restrctn>
</useStmt>
</dataAccs>
<othrStdyMat>
<relPubl>122</relPubl>
<relPubl>12332</relPubl>
</othrStdyMat>
</stdyDscr>
</codeBook>
</metadata>
I wrote the following code to try and process it:
from lxml import etree
import pdb
f = open('/vagrant/out2.xml', 'r')
xml_str = f.read()
xml_doc = etree.fromstring(xml_str)
f.close()
From what I understand from the lxml xpath docs, I should be able to get the text from a specific element as follows:
xml_doc.xpath('/metadata/codeBook/docDscr/citation/titlStmt/titl/text()')
However, when I run this it returns an empty list.
The only xpath I can get to return something is using a wildcard:
xml_doc.xpath('*')
Which returns [<Element {ddi:codebook:2_5}codeBook at 0x7f8da8a413f8>].
I've read through the docs and I'm not understanding what is going wrong with this. Any help is appreciated.

You need to take the default namespaces into account, so instead of
xml_doc.xpath('/metadata/codeBook/docDscr/citation/titlStmt/titl/text()')
use
xml_doc.xpath(
    '/oai:metadata/ddi:codeBook/ddi:docDscr/ddi:citation/ddi:titlStmt/ddi:titl/text()',
    namespaces={
        'oai': 'http://www.openarchives.org/OAI/2.0/',
        'ddi': 'ddi:codebook:2_5'
    }
)

parse xml with lxml including namespace

I need to get some information from a specific tag using lxml.
The XML document looks like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<web-app xmlns="http://java.sun.com/xml/ns/j2ee"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://java.sun.com/xml/ns/j2ee http://java.sun.com/xml/ns/j2ee/web-app_2_4.xsd"
version="2.4">
<display-name>Community Bank</display-name>
<description>WebGoat for Cigital</description>
<context-param>
<param-name>PropertiesPath</param-name>
<param-value>/WEB-INF/properties.txt</param-value>
<description>This is the path to the properties file from the servlet root</description>
</context-param>
<servlet>
<servlet-name>Index</servlet-name>
<servlet-class>com.cigital.boi.servlet.index</servlet-class>
</servlet>
<servlet-mapping>
<servlet-name>Index</servlet-name>
<url-pattern>/index</url-pattern>
</servlet-mapping>
<servlet-mapping>
<servlet-name>Index</servlet-name>
<url-pattern>/index.html</url-pattern>
</servlet-mapping>
I want to read com.cigital.boi.servlet.index.
I have used this code to read everything under the servlet elements:
context = etree.parse(handle)
list = context.xpath('//servlet')
print(list)
The list contains nothing.
More info: iterating over the context, I found these lines:
<Element {http://java.sun.com/xml/ns/j2ee}servlet-name at 2ad19e6eca48>
<Element {http://java.sun.com/xml/ns/j2ee}servlet-class at 2ad19e6ecaf8>
I think the output is an empty list because I did not include the namespace in the search.
Please suggest how to read "com.cigital.boi.servlet.index" from the servlet-class tag.

Try the following:
from lxml import etree
context = etree.parse(handle)
print(next(x.text for x in context.xpath('.//*[local-name()="servlet-class"]')))
Alternative:
from lxml import etree
context = etree.parse(handle)
nsmap = context.getroot().nsmap.copy()
nsmap['xmlns'] = nsmap.pop(None)
print(next(x.text for x in context.xpath('.//xmlns:servlet-class', namespaces=nsmap)))
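
For comparison, the standard library's xml.etree.ElementTree (Python 3.8 and later) supports a `{*}` wildcard that matches a tag in any namespace, which plays a similar role to local-name() above. The sketch below uses a shortened, properly closed copy of the web.xml from the question; it illustrates the technique and is not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Shortened version of the document from the question, with the root closed.
xml = """<web-app xmlns="http://java.sun.com/xml/ns/j2ee" version="2.4">
  <servlet>
    <servlet-name>Index</servlet-name>
    <servlet-class>com.cigital.boi.servlet.index</servlet-class>
  </servlet>
</web-app>"""

root = ET.fromstring(xml)

# "{*}servlet-class" matches servlet-class in any namespace (or none).
# This wildcard form requires Python 3.8 or newer.
servlet_class = root.find(".//{*}servlet-class")
class_name = servlet_class.text
print(class_name)
```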
