How can i parse a xml file in python

How can i parse a xml file in python - python

I have an xml file(basically the file is a jenkins slave config.xml file) from where i have to get certain values.
So, i tried parsing the xml file using Element Tree something like this
tree = ET.parse(config.xml)
root = tree.getroot()
print root
for item in root.findall('slave'):
and then i am saving this parsed xml file in a text file, now i want to get the value within this tag
I can do it through bash but i want to know how can we do this in python
Here goes the bash code
cat test.xml | sed -n 's:.*<label>\(.*\)</label>.*:\1:p'
Here is a sample jenkins slave config.xml file
<slave>
<name>some_name</name>
<description/>
<remoteFS>some_value</remoteFS>
<numExecutors>xx</numExecutors>
<mode>EXCLUSIVE</mode>
<retentionStrategy class="xxxx"/>
<launcher class="xxxxx" plugin="xxxxx">
<host>xxx.x.x.xx</host>
<port>xx</port>
<credentialsId>xxxxxxx-xxx-xxxx-xxxx-xxxxxxxxxxxx</credentialsId>
<maxNumRetries>0</maxNumRetries>
<retryWaitTime>0</retryWaitTime>
<sshHostKeyVerificationStrategy class="hudson.plugins.sshslaves.verifiers.NonVerifyingKeyVerificationStrategy/></launcher>
<label>some_label</label>
</slave>
Similarly as label i need other values as well such as hostname, port etc.

You can iterate recursively using .iter() to find elements. Check the official documentation.
Here is an example to print the label and host text from slave node.
Update: The code.py is modified to additionally print class attribute value of launcher tag. It uses element.attrib to get the attributes of a tag. More can be found in official documentation of parsing XML.
test.xml:
<slave>
<name>some_name</name>
<description/>
<remoteFS>some_value</remoteFS>
<numExecutors>xx</numExecutors>
<mode>xxx</mode>
<retentionStrategy class="xxxx"/>
<launcher class="xxxxx" plugin="xxxxx">
<host>xxx.x.x.xx</host>
<port>xx</port>
<credentialsId>xxxxxxxx</credentialsId>
<maxNumRetries>x</maxNumRetries>
<retryWaitTime>x</retryWaitTime>
<sshHostKeyVerificationStrategy class="hudson.plugins.sshslaves.verifiers.NonVerifyingKeyVerificationStrategy"/>
</launcher>
<label>somelabel</label>
</slave>
code.py:
import xml.etree.ElementTree as ET
tree = ET.parse("test.xml")
root = tree.getroot()
for item in root.iter('slave'):
for label in item.iter("label"):
print label.text
for host in item.iter("host"):
print host.text
for launcher in item.iter("launcher"):
print launcher.attrib["class"]
Output:
somelabel
xxx.x.x.xx
xxxxx

Related

Update the xml using python3 at specific subelement?

I am trying to update below xml file in python 3 using import xml.etree.ElementTree as ET but not able to add anything between tags
Issue I am facing not able to get/fetch the tag after fileSets.
Can someone let me know how we could update the xml?
abc.xml
<assembly
xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="
http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2
http://maven.apache.org/xsd/assembly-1.1.2-xsd"
>
<id></id>
<formats>
<format>zip</format>
</formats>
<fileSets>
<fileSet>
<outputDirectory>/<outputDirectory>
<directory>../</directory>
<useDefaultExcludes>false</useDefaultExcludes>
<includes>
</includes>
</fileSet>
</fileSets>
</assembly>
Expected output:(file names will be added dynamically)
abc.xml
<assembly
xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="
http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2
http://maven.apache.org/xsd/assembly-1.1.2-xsd"
>
<id></id>
<formats>
<format>zip</format>
</formats>
<fileSets>
<fileSet>
<outputDirectory>/<outputDirectory>
<directory>../</directory>
<useDefaultExcludes>false</useDefaultExcludes>
<includes>
<include>abc.text</include>
<include>def.text</include>
<include>ghi.text</include>
</includes>
</fileSet>
</fileSets>
</assembly>
I am trying this and it prints me all four element inside this files but doesn't know how to access includes and then add something inside this abc.txt and so on.
import xml.etree.ElementTree as ET
tree = ET.parse(abc.xml)
root = tree.getroot()
for actor in root.findall('{http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2}fileSets'):
for name in actor.findall('{http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2}fileSet'):
print(name)

You don't have to do anything with fileSets orfileSet. Since you want to add children to includes, get that element directly.
import xml.etree.ElementTree as ET
# Ensure that the proper prefix is used in the output (in this case, no prefix at all)
ET.register_namespace("", "http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2")
tree = ET.parse("abc.xml")
# Find the 'includes' element (.// means search the whole document).
# {*} is a wildcard and matches any namespace (Python 3.8)
includes = tree.find(".//{*}includes")
# Create three new 'include' elements
include1 = ET.Element("include")
include1.text = "abc.text"
include2 = ET.Element("include")
include2.text = "def.text"
include3 = ET.Element("include")
include3.text = "ghi.text"
# Add the new elements as children of 'includes'
includes.append(include1)
includes.append(include2)
includes.append(include3)

Python reading xml

I am newbie on Python programming. I have requirement where I need to read the xml structure and build the new soap request xml by adding namespace like here is the example what I have
Below XML which i get from other system:
<foo>
<bar>
<type foobar="1"/>
<type foobar="2"/>
</bar>
</foo>
I want final result like below
<?xml version="1.0"?>
<soa:foo xmlns:soa="https://www.w3schools.com/furniture">
<soa:bar>
<soa:type foobar="1"/>
<soa:type foobar="2"/>
</soa:bar>
</soa:foo>
I tried to look in python document but not able to find

One option is to use lxml to iterate over all of the elements and add the namespace uri to the .tag property.
You can use register_namespace() to bind the uri to the desired prefix.
Example...
from lxml import etree
tree = etree.parse("input.xml")
etree.register_namespace("soa", "https://www.w3schools.com/furniture")
for elem in tree.iter():
elem.tag = f"{{https://www.w3schools.com/furniture}}{elem.tag}"
print(etree.tostring(tree, pretty_print=True).decode())
Printed output...
<soa:foo xmlns:soa="https://www.w3schools.com/furniture">
<soa:bar>
<soa:type foobar="1"/>
<soa:type foobar="2"/>
</soa:bar>
</soa:foo>

How can we parse xml data that contains nodes with xml namespace tags in python?

I am getting XML as a response so I want to parse it. I tried many python libraries but not get my desired results. So if you can help, it will be really appreciative.
The following code returns None:
xmlResponse = ET.fromstring(context.response_document)
a = xmlResponse.findall('.//Body')
print(a)
Sample XML Data:
<S:Envelope
xmlns:S="http://www.w3.org/2003/05/soap-envelope">
<S:Header>
<wsa:Action s:mustUnderstand="1"
xmlns:s="http://www.w3.org/2003/05/soap-envelope"
xmlns:wsa="http://www.w3.org/2005/08/addressing">urn:ihe:iti:2007:RegistryStoredQueryResponse
</wsa:Action>
</S:Header>
<S:Body>
<query:AdhocQueryResponse status="urn:oasis:names:tc:ebxml-regrep:ResponseStatusType:Success"
xmlns:query="urn:oasis:names:tc:ebxml-regrep:xsd:query:3.0">
<rim:RegistryObjectList
xmlns:rim="u`enter code here`rn:oasis:names:tc:ebxml-regrep:xsd:rim:3.0"/>
</query:AdhocQueryResponse>
</S:Body>
</S:Envelope>
I want to get status from it which is in Body. If you can suggest some changes of some library then please help me. Thanks

Given the following base code:
import xml.etree.ElementTree as ET
root = ET.fromstring(xml)
Let's build on top of it to get your desired output.
Your initial find for .//Body x-path returns NONE because it doesn't exist in your XML response.
Each tag in your XML has a namespace associated with it. More info on xml namespaces can be found here.
Consider the following line with xmlns value (xml-namespace):
<S:Envelope xmlns:S="http://www.w3.org/2003/05/soap-envelope">
The value of namespace S is set to be http://www.w3.org/2003/05/soap-envelope.
Replacing S in {S}Envelope with value set above will give you the resulting tag to find in your XML:
root.find('{http://www.w3.org/2003/05/soap-envelope}Envelope') #top most node
We would need to do the same for <S:Body>.
To get<S:Body> elements and it's child nodes you can do the following:
body_node = root.find('{http://www.w3.org/2003/05/soap-envelope}Body')
for response_child_node in list(body_node):
print(response_child_node.tag) #tag of the child node
print(response_child_node.get('status')) #the status you're looking for
Outputs:
{urn:oasis:names:tc:ebxml-regrep:xsd:query:3.0}AdhocQueryResponse
urn:oasis:names:tc:ebxml-regrep:ResponseStatusType:Success
Alternatively
You can also directly find all {query}AdhocQueryResponse in your XML using:
response_nodes = root.findall('.//{urn:oasis:names:tc:ebxml-regrep:xsd:query:3.0}AdhocQueryResponse')
for response in response_nodes:
print(response.get('status'))
Outputs:
urn:oasis:names:tc:ebxml-regrep:ResponseStatusType:Success

Create array of values from specific element in XML using Python

I have an XML file which has many elements. I would like to create a list/array of all the values which have a specific element name, in my case "pair:ApplicationNumber".
I've gone over a lot of the other questions however I am not able to find an answer. I know that I can do this by loading the text file and going over it using pandas however, I'm sure there's a much better way.
I was unsuccessful trying ElementTree as well as XML.Dom using minidom
My code currently looks as follows:
import os
from xml.dom import minidom
WindowsUser = os.getenv('username')
XMLPath = os.path.join('C:\\Users', WindowsUser, 'Downloads', 'ApplicationsByCustomerNumber.xml')
xmldoc = minidom.parse(XMLPath)
itemlist = xmldoc.getElementsByTagName('pair:ApplicationNumber')
for s in itemlist:
print(s.attributes['pair:ApplicationNumber'].value)
an example XML file looks as follows:
<?xml version="1.0" encoding="UTF-8"?>
<pair:PatentApplicationList xsi:schemaLocation="urn:us:gov:uspto:pair PatentApplicationList.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:pair="urn:us:gov:uspto:pair">
<pair:FileHeader>
<pair:FileCreationTimeStamp>2017-07-10T10:52:12.12</pair:FileCreationTimeStamp>
</pair:FileHeader>
<pair:ApplicationStatusData>
<pair:ApplicationNumber>62383607</pair:ApplicationNumber>
<pair:ApplicationStatusCode>20</pair:ApplicationStatusCode>
<pair:ApplicationStatusText>Application Dispatched from Preexam, Not Yet Docketed</pair:ApplicationStatusText>
<pair:ApplicationStatusDate>2016-09-16</pair:ApplicationStatusDate>
<pair:AttorneyDocketNumber>1354-T-02-US</pair:AttorneyDocketNumber>
<pair:FilingDate>2016-09-06</pair:FilingDate>
<pair:LastModifiedTimestamp>2017-05-30T21:40:37.37</pair:LastModifiedTimestamp>
<pair:CustomerNumber>122761</pair:CustomerNumber><pair:LastFileHistoryTransaction>
<pair:LastTransactionDate>2017-05-30</pair:LastTransactionDate>
<pair:LastTransactionDescription>Email Notification</pair:LastTransactionDescription> </pair:LastFileHistoryTransaction>
<pair:ImageAvailabilityIndicator>true</pair:ImageAvailabilityIndicator>
</pair:ApplicationStatusData>
<pair:ApplicationStatusData>
<pair:ApplicationNumber>62292372</pair:ApplicationNumber>
<pair:ApplicationStatusCode>160</pair:ApplicationStatusCode>
<pair:ApplicationStatusText>Abandoned -- Incomplete Application (Pre-examination)</pair:ApplicationStatusText>
<pair:ApplicationStatusDate>2016-11-01</pair:ApplicationStatusDate>
<pair:AttorneyDocketNumber>681-S-23-US</pair:AttorneyDocketNumber>
<pair:FilingDate>2016-02-08</pair:FilingDate>
<pair:LastModifiedTimestamp>2017-06-20T21:59:26.26</pair:LastModifiedTimestamp>
<pair:CustomerNumber>122761</pair:CustomerNumber><pair:LastFileHistoryTransaction>
<pair:LastTransactionDate>2017-06-20</pair:LastTransactionDate>
<pair:LastTransactionDescription>Petition Entered</pair:LastTransactionDescription> </pair:LastFileHistoryTransaction>
<pair:ImageAvailabilityIndicator>true</pair:ImageAvailabilityIndicator>
</pair:ApplicationStatusData>
<pair:ApplicationStatusData>
<pair:ApplicationNumber>62289245</pair:ApplicationNumber>
<pair:ApplicationStatusCode>160</pair:ApplicationStatusCode>
<pair:ApplicationStatusText>Abandoned -- Incomplete Application (Pre-examination)</pair:ApplicationStatusText>
<pair:ApplicationStatusDate>2016-10-26</pair:ApplicationStatusDate>
<pair:AttorneyDocketNumber>1526-P-01-US</pair:AttorneyDocketNumber>
<pair:FilingDate>2016-01-31</pair:FilingDate>
<pair:LastModifiedTimestamp>2017-06-15T21:24:13.13</pair:LastModifiedTimestamp>
<pair:CustomerNumber>122761</pair:CustomerNumber><pair:LastFileHistoryTransaction>
<pair:LastTransactionDate>2017-06-15</pair:LastTransactionDate>
<pair:LastTransactionDescription>Petition Entered</pair:LastTransactionDescription> </pair:LastFileHistoryTransaction>
<pair:ImageAvailabilityIndicator>true</pair:ImageAvailabilityIndicator>
</pair:ApplicationStatusData>
</pair:PatentApplicationList>

The XML in your example is expanding the "pair:" part of the tags according to the schema you've used, so it doesn't match 'pair:ApplicationNumber', even though it looks like it should.
I've used element tree to extract the application numbers as follows (I've just used a local XML file in my examples, rather than the full path in your code)
Example 1:
from xml.etree import ElementTree
tree = ElementTree.parse('ApplicationsByCustomerNumber.xml')
root = tree.getroot()
for item in root:
if 'ApplicationStatusData' in item.tag:
for child in item:
if 'ApplicationNumber' in child.tag:
print child.text
Example 2:
from xml.etree import ElementTree
tree = ElementTree.parse('ApplicationsByCustomerNumber.xml')
root = tree.getroot()
for item in root.iter('{urn:us:gov:uspto:pair}ApplicationStatusData'):
for child in item.iter('{urn:us:gov:uspto:pair}ApplicationNumber'):
print child.text
Hope this may be useful.

parse xml with lxml including namespace

I need to get some info after a specific tag in lxml.
the xml doc looks like this
<?xml version="1.0" encoding="ISO-8859-1"?>
<web-app xmlns="http://java.sun.com/xml/ns/j2ee"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://java.sun.com/xml/ns/j2ee http://java.sun.com/xml/
ns/j2ee/web-app_2_4.xsd"
version="2.4">
<display-name>Community Bank</display-name>
<description>WebGoat for Cigital</description>
<context-param>
<param-name>PropertiesPath</param-name>
<param-value>/WEB-INF/properties.txt</param-value>
<description>This is the path to the properties file from the servlet root</description>
</context-param>
<servlet>
<servlet-name>Index</servlet-name>
<servlet-class>com.cigital.boi.servlet.index</servlet-class>
</servlet>
<servlet-mapping>
<servlet-name>Index</servlet-name>
<url-pattern>/index</url-pattern>
</servlet-mapping>
<servlet-mapping>
<servlet-name>Index</servlet-name>
<url-pattern>/index.html</url-pattern>
</servlet-mapping>
I want to read com.cigital.boi.servlet.index .
I have used this code to read everything under servlets
context = etree.parse(handle)
list = parser.xpath('//servlet')
print list
list contains nothing
more info : iterating over the context field i found these lines.
<Element {http://java.sun.com/xml/ns/j2ee}servlet-name at 2ad19e6eca48>
<Element {http://java.sun.com/xml/ns/j2ee}servlet-class at 2ad19e6ecaf8>
I am thinking as I have not included name space while searching , output is empty list.
please suggest hoe to read "com.cigital.boi.servlet.index" in the servlet-class tag

Try following:
from lxml import etree
context = etree.parse(handle)
print next(x.text for x in context.xpath('.//*[local-name()="servlet-class"]'))
Alternative:
from lxml import etree
context = etree.parse(handle)
nsmap = context.getroot().nsmap.copy()
nsmap['xmlns'] = nsmap.pop(None)
print next(x.text for x in context.xpath('.//xmlns:servlet-class', namespaces=nsmap))

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How can i parse a xml file in python - python

Related

Update the xml using python3 at specific subelement?

Python reading xml

How can we parse xml data that contains nodes with xml namespace tags in python?

Create array of values from specific element in XML using Python

parse xml with lxml including namespace

Categories

Resources