Issue with python script while parsing pom file in project - python

I'm having an issue extracting the version number from a pom.xml with a Python script. It returns None when I run the script. Can someone help me with this?
Python Script:
import xml.etree.ElementTree as ET
tree = ET.parse('pom.xml')
root = tree.getroot()
releaseVersion = root.find("version")
print(releaseVersion)
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://maven.apache.org/POM/4.0.0"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
<artifactId>watcher</artifactId>
<version>0.0.1-SNAPSHOT</version>
<groupId>com.test</groupId>
<name>file</name>
<packaging>jar</packaging>
<parent>
<artifactId>spring-boot-starter-parent</artifactId>
<groupId>org.springframework.boot</groupId>
<relativePath/>
<version>2.6.1</version>
</parent>
</project>

You're not taking into account that all of your elements are in a default namespace defined by xmlns="http://maven.apache.org/POM/4.0.0" in your <project> element.
So you have to create your query with this namespace.
import xml.etree.ElementTree as ET
tree = ET.parse('pom.xml')
root = tree.getroot()
NS = { 'maven' : 'http://maven.apache.org/POM/4.0.0' }
releaseVersion = root.find("maven:version",NS)
print(releaseVersion.text)
Here NS = { ... } maps the prefix maven to the namespace URI, and that prefix is then used in the path expression passed to find(). The prefix name is arbitrary; only the URI has to match the one declared in the document.
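To make the difference concrete, here is a small self-contained sketch that parses a trimmed copy of the POM from a string. The helper name pom_version and the fallback to the parent's version are illustrative assumptions (Maven lets a module inherit its version from its parent), not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the POM from the question, inlined so the sketch runs as-is.
POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <artifactId>watcher</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <parent>
    <artifactId>spring-boot-starter-parent</artifactId>
    <version>2.6.1</version>
  </parent>
</project>"""

NS = {"maven": "http://maven.apache.org/POM/4.0.0"}


def pom_version(root):
    """Hypothetical helper: the project's own version, else the parent's."""
    version = root.findtext("maven:version", namespaces=NS)
    if version is None:
        version = root.findtext("maven:parent/maven:version", namespaces=NS)
    return version


root = ET.fromstring(POM)
unqualified = root.find("version")  # no namespace given, so nothing matches
print(unqualified)
print(pom_version(root))
```

Using findtext() instead of find(...).text avoids an AttributeError when the element is missing.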

Your pom.xml has a namespace xmlns="http://maven.apache.org/POM/4.0.0" in the project tag.
If you need to search by the fully qualified name, use the {namespace}tag form:
>>> root.find("{http://maven.apache.org/POM/4.0.0}version")
<Element '{http://maven.apache.org/POM/4.0.0}version' at 0x0000014635EC0A40>
If you don't care which namespace the element is in, you can search with the {*}tag wildcard (supported since Python 3.8):
>>> root.find("{*}version")
<Element '{http://maven.apache.org/POM/4.0.0}version' at 0x0000014635EC0A40>
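One thing to watch with the wildcard is the difference between a direct-child search and a descendant search, since a POM usually carries a second version element under parent. A minimal sketch, assuming Python 3.8 or later and a trimmed POM inlined as a string:

```python
import xml.etree.ElementTree as ET

POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <parent>
    <version>2.6.1</version>
  </parent>
  <version>0.0.1-SNAPSHOT</version>
</project>"""

root = ET.fromstring(POM)

# Direct children only: finds the project's own <version>.
own_version = root.findtext("{*}version")

# Descendant search: also matches <parent>/<version>, in document order.
all_versions = [el.text for el in root.iterfind(".//{*}version")]

print(own_version)
print(all_versions)
```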


Unexpected results when parsing XML via lxml

The output of my XML parsing is not as expected.
The xml file
<?xml version="1.0"?>
<stationaer xsi:schemaLocation="http:/foo.bar" xmlns="http://foo.bar" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<einrichtung>
<name>Name</name>
</einrichtung>
<einrichtung>
<name>Name</name>
</einrichtung>
</stationaer>
I would expect to get something like root.tag == 'stationaer' and child.tag == 'einrichtung'.
See the output at the end.
This is the MWE:
#!/usr/bin/env python3
import pathlib
import lxml
from lxml import etree
import pandas
xml_src = '''<?xml version="1.0"?>
<stationaer xsi:schemaLocation="http:/foo.bar" xmlns="http://foo.bar" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<einrichtung>
<name>Name</name>
</einrichtung>
<einrichtung>
<name>Name</name>
</einrichtung>
</stationaer>
'''
# tree = etree.parse(file_path)
# root = tree.getroot()
root = etree.fromstring(xml_src)
print(repr(root.tag))
print(repr(root.text))
child = root[0]  # indexing replaces the deprecated getchildren()
print(repr(child.tag))
print(repr(child.text))
The output for root is
'{http://foo.bar}stationaer'
'\n '
and for child
'{http://foo.bar}einrichtung'
'\n '
I don't understand what's going on here and why that URL is in the output.
This is actually not unexpected. The elements in the XML document are bound to the http://foo.bar default namespace. The namespace is declared by xmlns="http://foo.bar" on the root element and the declaration is inherited by all descendants.
The special notation with the namespace URI enclosed in curly braces ({http://foo.bar}stationaer) is never used in XML documents, but it is used by lxml and ElementTree when printing element (tag) names. It can also be used when searching or creating elements that belong to a namespace.
More information:
https://www.w3.org/TR/xml-names/
https://lxml.de/tutorial.html#namespaces
https://docs.python.org/3/library/xml.etree.elementtree.html#parsing-xml-with-namespaces
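If only the local name is needed, the {uri}name notation can be split by hand. The sketch below uses the standard library's ElementTree so that it runs without extra packages; split_tag is a hypothetical helper name. lxml uses the same tag notation and additionally offers etree.QName(element).localname for this purpose.

```python
import xml.etree.ElementTree as ET

XML_SRC = """<stationaer xmlns="http://foo.bar">
  <einrichtung>
    <name>Name</name>
  </einrichtung>
</stationaer>"""


def split_tag(tag):
    """Split '{uri}local' into (uri, local); uri is '' for un-namespaced tags."""
    if tag.startswith("{"):
        uri, _, local = tag[1:].partition("}")
        return uri, local
    return "", tag


root = ET.fromstring(XML_SRC)
child = root[0]
print(split_tag(root.tag))
print(split_tag(child.tag))
```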

How to get the content of child->child->child->child in XML file using Python

<?xml version="1.0" encoding="UTF-8"?>
<Document xmlns="urn:iso:std:iso:20022:tech:xsd:camt.056.001.01" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<FIToFIPmtCxlReq>
<Assgnmt>
<Id>TEST-ISO-81</Id>
<Assgnr>
<Agt>
<FinInstnId>
<BIC>CCCCGB2L</BIC>
</FinInstnId>
</Agt>
</Assgnr>
<Assgne>
<Agt>
<FinInstnId>
<BIC>MMMMGB2L</BIC>
</FinInstnId>
</Agt>
</Assgne>
<CreDtTm>2009-03-24T11:22:59</CreDtTm>
</Assgnmt>
<Undrlyg>
<TxInf>
<CxlId>103012345</CxlId>
<Case>
<Id>ISO_TEST_CASE</Id>
<Cretr>
<Agt>
<FinInstnId>
<BIC>MMMMGB2L</BIC>
</FinInstnId>
</Agt>
</Cretr>
</Case>
</TxInf>
</Undrlyg>
</FIToFIPmtCxlReq>
</Document>
Here I want to get the content of "TxInf": all of its children, their children, and the data they contain.
What I have tried so far:
import xml.etree.ElementTree as ET
from xml.etree import ElementTree
tree = ET.parse('R3-CAMT.056.001.07-ISO-V.XML')
root = tree.getroot()
for element in root.iter():
    if element.tag == "{urn:iso:std:iso:20022:tech:xsd:camt.056.001.01}TxInf":
        tree._setroot(element.tag)
print(root.tag)
print(root.attrib)
Please suggest whether I can change the root with _setroot, or whether there is another way to do this.
Try something along these lines on your code to see if it works:
for r in root.findall(".//*"):
    if 'TxInf' in r.tag:
        print(ET.tostring(r))
By the way, it may be easier to do it with lxml, if you can use it.
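A namespace-aware variant of the same idea, staying with the standard library: look the element up by its qualified name and then walk its subtree. This is a sketch on a trimmed copy of the document; the prefix c and the variable names are arbitrary choices made for the example.

```python
import xml.etree.ElementTree as ET

NS = {"c": "urn:iso:std:iso:20022:tech:xsd:camt.056.001.01"}

XML_SRC = """<Document xmlns="urn:iso:std:iso:20022:tech:xsd:camt.056.001.01">
  <FIToFIPmtCxlReq>
    <Undrlyg>
      <TxInf>
        <CxlId>103012345</CxlId>
        <Case>
          <Id>ISO_TEST_CASE</Id>
        </Case>
      </TxInf>
    </Undrlyg>
  </FIToFIPmtCxlReq>
</Document>"""

root = ET.fromstring(XML_SRC)
tx_inf = root.find(".//c:TxInf", NS)

# Collect (local name, text) for every element in the subtree with real text.
leaves = []
for el in tx_inf.iter():
    text = (el.text or "").strip()
    if text:
        leaves.append((el.tag.rpartition("}")[2], text))
print(leaves)

# A separate tree rooted at TxInf; the private _setroot() is not needed.
sub_tree = ET.ElementTree(tx_inf)
```

Note that _setroot() expects an Element, so passing element.tag (a string) as in the question cannot work.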

python lxml element attrib issue

I have to build a XML file that looks like the following:
<?xml version='1.0' encoding='ISO-8859-1'?>
<Document protocol="OCI" xmlns="C">
<sessionId>xmlns=874587878</sessionId>
<command xmlns="http://www.w3.org/2001/XMLSchema-instance" xsi:type="UserGetRegistrationListRequest">
<userId>data</userId>
</command>
</Document>
I got everything working except for the command attrib xsi:type="UserGetRegistrationListRequest"
I can't get the : in the attrib of the command element.
Can someone please help me with this issue?
I am using Python 3.5.
My current code is
from lxml import etree
root = etree.Element("Document", protocol="OCI", xmlns="C")
print(root.tag)
root.append(etree.Element("sessionId") )
sessionId=root.find("sessionId")
sessionId.text = "xmlns=78546587854"
root.append(etree.Element("command", xmlns="http://www.w3.org/2001/XMLSchema-instance",xsitype = "UserGetRegistrationListRequest" ) )
command=root.find("command")
userID = etree.SubElement(command, "userId")
userID.text = "data"
print(etree.tostring(root, pretty_print=True))
tree = etree.ElementTree(root)
tree.write('output.xml', pretty_print=True, xml_declaration=True, encoding="ISO-8859-1")
and this is what I get back:
<?xml version='1.0' encoding='ISO-8859-1'?>
<Document protocol="OCI" xmlns="C">
<sessionId>xmlns=78546587854</sessionId>
<command xmlns="http://www.w3.org/2001/XMLSchema-instance" xsitype="UserGetRegistrationListRequest">
<userId>data</userId>
</command>
</Document>
QName can be used to create the xsi:type attribute.
from lxml import etree
root = etree.Element("Document", protocol="OCI", xmlns="C")
# Create sessionId element
sessionId = etree.SubElement(root, "sessionId")
sessionId.text = "xmlns=78546587854"
# Create xsi:type attribute using QName
xsi_type = etree.QName("http://www.w3.org/2001/XMLSchema-instance", "type")
# Create command element, with xsi:type attribute
command = etree.SubElement(root, "command", {xsi_type: "UserGetRegistrationListRequest"})
# Create userId element
userID = etree.SubElement(command, "userId")
userID.text = "data"
Resulting XML (with the proper xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" declaration):
<?xml version='1.0' encoding='ISO-8859-1'?>
<Document protocol="OCI" xmlns="C">
<sessionId>xmlns=78546587854</sessionId>
<command xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="UserGetRegistrationListRequest">
<userId>data</userId>
</command>
</Document>
Note that the xsi prefix does not need to be explicitly defined in the Python code. lxml defines default prefixes for some well-known namespace URIs, including xsi for http://www.w3.org/2001/XMLSchema-instance.
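If lxml is not available, the same {uri}name attribute notation also works with the standard library's ElementTree. The following is a rough sketch rather than a drop-in replacement: the explicit register_namespace call is a precaution, and ElementTree writes the xmlns:xsi declaration on the element being serialized (here Document) instead of on command.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
ET.register_namespace("xsi", XSI)  # make sure the serializer uses "xsi"

root = ET.Element("Document", {"protocol": "OCI"})

# The attribute name is written as {namespace-uri}local-name.
command = ET.SubElement(
    root, "command", {"{%s}type" % XSI: "UserGetRegistrationListRequest"}
)
ET.SubElement(command, "userId").text = "data"

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```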

Python: ignoring namespaces in xml.etree.ElementTree?

How can I tell ElementTree to ignore namespaces in an XML file?
For example, I would prefer to query modelVersion (as in statement 1) rather than {http://maven.apache.org/POM/4.0.0}modelVersion (as in statement 2).
pom="""
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
</project>
"""
from xml.etree import ElementTree
ElementTree.register_namespace("","http://maven.apache.org/POM/4.0.0")
root = ElementTree.fromstring(pom)
print(1, root.findall('modelVersion'))
print(2, root.findall('{http://maven.apache.org/POM/4.0.0}modelVersion'))
1 []
2 [<Element '{http://maven.apache.org/POM/4.0.0}modelVersion' at 0x1006bff10>]
There appears to be no straightforward way to do this, so I'd simply wrap the find calls, e.g.
from xml.etree import ElementTree as ET
POM = """
<project xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://maven.apache.org/POM/4.0.0">
<modelVersion>4.0.0</modelVersion>
</project>
"""
NSPS = {'foo' : "http://maven.apache.org/POM/4.0.0"}
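# sic!
def findall(node, tag):
    return node.findall('foo:' + tag, NSPS)
root = ET.fromstring(POM)
# encoding='unicode' makes tostring() return str instead of bytes on Python 3
print([ET.tostring(el, encoding='unicode') for el in findall(root, 'modelVersion')])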
output:
['<ns0:modelVersion xmlns:ns0="http://maven.apache.org/POM/4.0.0">4.0.0</ns0:modelVersion>\n']
Here's what I'm presently doing, which makes me incredibly confident that there's a better way.
$ cat pom.xml |
tr '\n' ' ' |
sed 's/<project [^>]*>/<project>/' |
myprogram |
sed 's/<project>/<project xmlns="http:\/\/maven.apache.org\/POM\/4.0.0" xmlns:xsi="http:\/\/www.w3.org\/2001\/XMLSchema-instance" xsi:schemaLocation="http:\/\/maven.apache.org\/POM\/4.0.0 http:\/\/maven.apache.org\/maven-v4_0_0.xsd">/'
Rather than ignoring namespaces, another approach is to remove them from the parsed tree, so that there is nothing left to ignore. See nonagon's answer to this question (and my extension of it that also covers namespaces on attributes): Python ElementTree module: How to ignore the namespace of XML files to locate matching element when using the method "find", "findall"
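The general idea of stripping namespaces after parsing can be sketched as follows. This is an illustration of the technique, not a copy of the linked answer, and strip_namespaces is a hypothetical helper name. The namespace information is lost, so a tree written back out after this step no longer declares the Maven namespace.

```python
import xml.etree.ElementTree as ET

POM = """<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
</project>"""


def strip_namespaces(root):
    """Rewrite tags and attribute names in place, dropping any '{uri}' prefix."""
    for el in root.iter():
        if isinstance(el.tag, str) and el.tag.startswith("{"):
            el.tag = el.tag.partition("}")[2]
        for name in list(el.attrib):
            if name.startswith("{"):
                el.attrib[name.partition("}")[2]] = el.attrib.pop(name)
    return root


root = strip_namespaces(ET.fromstring(POM))
print(root.tag)
print(root.findtext("modelVersion"))
```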
Here's the equivalent solution without using the shell. Basic idea:
translate <project junk...> to <project>
perform "clean" processing without worrying about the namespace
translate <project> back to <project junk...>
with the new code:
pom="""
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
</project>
"""
short_project="""<project>"""
long_project="""<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">"""
import re
from xml.etree import ElementTree
# eliminate namespace specs
pom = re.compile('<project [^>]*>').sub(short_project, pom)
root = ElementTree.fromstring(pom)
ElementTree.dump(root)
print(1, root.findall('modelVersion'))
print(2, root.findall('{http://maven.apache.org/POM/4.0.0}modelVersion'))
mv = root.findall('modelVersion')
# restore the namespace specs
# (encoding="unicode" makes tostring() return str rather than bytes on Python 3)
pom = ElementTree.tostring(root, encoding="unicode")
pom = re.compile(short_project).sub(long_project, pom)

How to get node's value of an XML in Python?

Suppose I have an XML file:
<?xml version="1.0" encoding="utf-8" ?>
<configuration>
<quarkSettings>
<UpdatePath></UpdatePath>
<Version>Development</Version>
<Project>ABC</Project>
</quarkSettings>
</configuration>
Now I want to get the value of Project. I have written the following code:
import xml.etree.ElementTree as ET
doc1 = ET.parse("Configuration.xml")
for e in doc1.find("Project"):
    project = e.text
But it doesn't give the value.
I found the answer:
import xml.etree.ElementTree as ET
doc1 = ET.parse(get_path_for_config_Quark_Release)
root = doc1.getroot()
for element in root.findall("quarkSettings"):
    project = element.find("Project").text
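The original attempt fails because find("Project") only looks at direct children of the root, and Project sits one level deeper. A short sketch of two alternatives, with the configuration inlined as a string so it runs on its own:

```python
import xml.etree.ElementTree as ET

CONFIG = """<configuration>
  <quarkSettings>
    <UpdatePath></UpdatePath>
    <Version>Development</Version>
    <Project>ABC</Project>
  </quarkSettings>
</configuration>"""

root = ET.fromstring(CONFIG)

# find() only looks at direct children, so "Project" alone returns None here.
direct = root.find("Project")

# Spell out the path, or search all descendants with ".//".
by_path = root.findtext("quarkSettings/Project")
anywhere = root.findtext(".//Project")
print(direct, by_path, anywhere)
```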
