Find sibling element text value in XML using Python ElementTree - python

I am using Python 3.7 with Anaconda, and I am trying to get the text value from the XML element following the element with text = CX.PAIR1.BORROWER.FICO. So in this case I would like to return '779'.
As can be seen with the following XML segment, these tags are not unique and have no names or attributes, so only the contained text can be used to locate the data.
<CustomField>
<id>CustomField/63</id>
<fieldName>CX.WLS.RATETYPE</fieldName>
<stringValue>FIX30</stringValue>
</CustomField>
<CustomField>
<id>CustomField/64</id>
<fieldName>CX.PAIR1.BORROWER.FICO</fieldName>
<stringValue>779</stringValue>
<numericValue>779.0</numericValue>
</CustomField>
<CustomField>
<id>CustomField/65</id>
<fieldName>CX.PAIRS16</fieldName>
<stringValue>779</stringValue>
<numericValue>779.0</numericValue>
</CustomField>
I have tried various forms of this:
Borrower_FICO = root.find('.//*[fieldName = "CX.PAIR1.BORROWER.FICO"]/following-sibling::node()')
and..
Borrower_FICO = root.find('.//*[text() = "CX.PAIR1.BORROWER.FICO"]/following-sibling::node()')
but I can't seem to pull the data into my variable Borrower_FICO.
What am I getting wrong here?

following-sibling is not supported by ElementTree. It does work with lxml, though.
I converted to lxml and got it to work using a loop:
for elem in root.iter():
    if elem.text == "CX.PAIR1.BORROWER.FICO":
        Borrower_FICO = elem.getnext().text
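For anyone who would rather stay in the standard library: ElementTree has no getnext(), but a sibling lookup can be emulated by walking each parent's children. This is only a sketch under that assumption; next_sibling_text is a hypothetical helper name, not part of ElementTree, and the XML is a trimmed copy of the sample from the question wrapped in a root element.

```python
import xml.etree.ElementTree as ET

XML = """<r>
<CustomField>
<id>CustomField/64</id>
<fieldName>CX.PAIR1.BORROWER.FICO</fieldName>
<stringValue>779</stringValue>
<numericValue>779.0</numericValue>
</CustomField>
</r>"""


def next_sibling_text(root, wanted_text):
    """Return the text of the element right after the first element whose text equals wanted_text."""
    for parent in root.iter():
        children = list(parent)
        # The last child has no following sibling, so it is skipped.
        for index, child in enumerate(children[:-1]):
            if child.text == wanted_text:
                return children[index + 1].text
    return None


root = ET.fromstring(XML)
borrower_fico = next_sibling_text(root, "CX.PAIR1.BORROWER.FICO")
print(borrower_fico)
```

The helper returns None when nothing matches, so the caller can tell a missing field apart from an empty one.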

Below: (no need for lxml)
import xml.etree.ElementTree as ET
xml = '''<r><CustomField>
<id>CustomField/63</id>
<fieldName>CX.WLS.RATETYPE</fieldName>
<stringValue>FIX30</stringValue>
</CustomField>
<CustomField>
<id>CustomField/64</id>
<fieldName>CX.PAIR1.BORROWER.FICO</fieldName>
<stringValue>779</stringValue>
<numericValue>779.0</numericValue>
</CustomField>
<CustomField>
<id>CustomField/65</id>
<fieldName>CX.PAIRS16</fieldName>
<stringValue>779</stringValue>
<numericValue>779.0</numericValue>
</CustomField></r>'''
root = ET.fromstring(xml)
entry = root.find(".//CustomField[fieldName='CX.PAIR1.BORROWER.FICO']")
print(entry.find('stringValue').text)
output:
779
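If more than one custom field is needed, the same idea can be turned into a lookup table built once. A minimal sketch, assuming every CustomField has a fieldName and a stringValue child as in the sample; the variable name fields is made up for illustration.

```python
import xml.etree.ElementTree as ET

XML = """<r><CustomField>
<id>CustomField/63</id>
<fieldName>CX.WLS.RATETYPE</fieldName>
<stringValue>FIX30</stringValue>
</CustomField>
<CustomField>
<id>CustomField/64</id>
<fieldName>CX.PAIR1.BORROWER.FICO</fieldName>
<stringValue>779</stringValue>
<numericValue>779.0</numericValue>
</CustomField></r>"""

root = ET.fromstring(XML)

# Map each fieldName to its stringValue; findtext returns None if a child is missing.
fields = {
    custom_field.findtext("fieldName"): custom_field.findtext("stringValue")
    for custom_field in root.iter("CustomField")
}
print(fields["CX.PAIR1.BORROWER.FICO"])
```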

Related

Get Element Text Using xml mini dom python

I am trying to get the text of an element using minidom. In the following code I have also tried the getText() method as suggested here, but I am unable to get the desired output: I don't get the text value from the element I am working on. Following is my code.
import xml.dom.minidom
doc = xml.dom.minidom.parse("DL_INVOICE_DETAIL_TCB.xml")
results = doc.getElementsByTagName("G_TRANSACTIONS")
def getText(nodelist):
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
    return ''.join(rc)
for result in results:
    for element in result.getElementsByTagName("INVOICE_NUMBER"):
        print(element.nodeType)
        print(element.nodeValue)
Following is my XML sample
<LIST_G_TRANSACTIONS>
<G_TRANSACTIONS>
<INVOICE_NUMBER>31002</INVOICE_NUMBER>
<TRANSACTION_CLASS>Invoice</TRANSACTION_CLASS>
</G_TRANSACTIONS>
</LIST_G_TRANSACTIONS>
I am using the following
A minidom based answer
from xml.dom import minidom
xml = """\
<LIST_G_TRANSACTIONS>
<G_TRANSACTIONS>
<INVOICE_NUMBER>31002</INVOICE_NUMBER>
<TRANSACTION_CLASS>Invoice1</TRANSACTION_CLASS>
</G_TRANSACTIONS>
<G_TRANSACTIONS>
<INVOICE_NUMBER>31006</INVOICE_NUMBER>
<TRANSACTION_CLASS>Invoice2</TRANSACTION_CLASS>
</G_TRANSACTIONS>
</LIST_G_TRANSACTIONS>"""
dom = minidom.parseString(xml)
invoice_numbers = [int(x.firstChild.data) for x in dom.getElementsByTagName("INVOICE_NUMBER")]
print(invoice_numbers)
output
[31002, 31006]
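The reason firstChild.data works where nodeValue did not: in the DOM, an element node's nodeValue is None, and the text lives in a child text node. A small sketch to make that visible, using a one-element document cut down from the sample:

```python
from xml.dom import minidom

dom = minidom.parseString(
    "<G_TRANSACTIONS><INVOICE_NUMBER>31002</INVOICE_NUMBER></G_TRANSACTIONS>"
)
element = dom.getElementsByTagName("INVOICE_NUMBER")[0]

# The element itself carries no value; its first child is the text node.
element_value = element.nodeValue
text_node = element.firstChild
is_text_node = text_node.nodeType == text_node.TEXT_NODE
text_value = text_node.nodeValue  # same string as text_node.data

print(element_value, is_text_node, text_value)
```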
If using ElementTree is fine with you, here is the code:
import xml.etree.ElementTree as ET
xml = '''<LIST_G_TRANSACTIONS>
<G_TRANSACTIONS>
<INVOICE_NUMBER>31002</INVOICE_NUMBER>
<TRANSACTION_CLASS>Invoice1</TRANSACTION_CLASS>
</G_TRANSACTIONS>
<G_TRANSACTIONS>
<INVOICE_NUMBER>31006</INVOICE_NUMBER>
<TRANSACTION_CLASS>Invoice2</TRANSACTION_CLASS>
</G_TRANSACTIONS>
</LIST_G_TRANSACTIONS>'''
root = ET.fromstring(xml)
invoice_numbers = [entry.text for entry in list(root.findall('.//INVOICE_NUMBER'))]
print(invoice_numbers)
output
['31002', '31006']
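If the invoice number should stay paired with its transaction class, one option is to iterate over the G_TRANSACTIONS elements and read both children with findtext. This is just a sketch along the lines of the answer above; the name transactions is made up here.

```python
import xml.etree.ElementTree as ET

XML = """<LIST_G_TRANSACTIONS>
<G_TRANSACTIONS>
<INVOICE_NUMBER>31002</INVOICE_NUMBER>
<TRANSACTION_CLASS>Invoice1</TRANSACTION_CLASS>
</G_TRANSACTIONS>
<G_TRANSACTIONS>
<INVOICE_NUMBER>31006</INVOICE_NUMBER>
<TRANSACTION_CLASS>Invoice2</TRANSACTION_CLASS>
</G_TRANSACTIONS>
</LIST_G_TRANSACTIONS>"""

root = ET.fromstring(XML)

# One (number, class) tuple per G_TRANSACTIONS child of the root.
transactions = [
    (int(entry.findtext("INVOICE_NUMBER")), entry.findtext("TRANSACTION_CLASS"))
    for entry in root.findall("G_TRANSACTIONS")
]
print(transactions)
```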

How to get the xml element as a string with namespace using ElementTree in python?

I need to get the elements from xml as a string. I am trying with below xml format.
<xml>
<prot:data xmlns:prot="prot">
<product-id-template>
<prot:ProductId>PRODUCT_ID</prot:ProductId>
</product-id-template>
<product-name-template>
<prot:ProductName>PRODUCT_NAME</prot:ProductName>
</product-name-template>
<dealer-template>
<xsi:Dealer xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">DEALER</xsi:Dealer>
</dealer-template>
</prot:data>
</xml>
And I tried with below code:
from xml.etree import ElementTree as ET
def get_template(xpath, namespaces):
    tree = ET.parse('cdata.xml')
    elements = tree.getroot()
    for element in elements.findall(xpath, namespaces=namespaces):
        return element
namespace = {"prot" : "prot"}
aa = get_template(".//prot:ProductId", namespace)
print(ET.tostring(aa).decode())
Actual output:
<ns0:ProductId xmlns:ns0="prot">PRODUCT_ID</ns0:ProductId>
Expected output:
<prot:ProductId>PRODUCT_ID</prot:ProductId>
I should not remove the xmlns where it is present in the document, and it must not be added where it is absent. For example, product-id-template does not contain the xmlns, so it needs to be retrieved without xmlns; dealer-template contains the xmlns, so it needs to be retrieved with xmlns.
How to achieve this?
You can remove xmlns with regex.
import re
# ...
with_ns = ET.tostring(aa).decode()
no_ns = re.sub(r' xmlns(:\w+)?="[^"]+"', '', with_ns)
print(no_ns)
UPDATE: You can do a very wild thing. Although I can't recommend it, because I'm not a Python expert.
I just checked the source code and found that I can do this hack:
def my_serialize_xml(write, elem, qnames, namespaces,
                     short_empty_elements, **kwargs):
    ET._serialize_xml(write, elem, qnames,
                      None, short_empty_elements, **kwargs)
ET._serialize["xml"] = my_serialize_xml
I defined my_serialize_xml, which calls ElementTree._serialize_xml with namespaces=None. Then, in the dictionary ElementTree._serialize, I changed the value for the key "xml" to my_serialize_xml, so ElementTree.tostring will use my_serialize_xml. Both names are private to the module, so this may break in other Python versions.
If you want to try it, place the code above after from xml.etree import ElementTree as ET (but before using ET).
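A less invasive variation, offered only as a sketch: ET.register_namespace is a public function that tells the serializer which prefix to use for a namespace URI, so the ns0 prefix becomes prot again. The xmlns declaration is still emitted on the serialized element, so the regex from the answer above is applied afterwards. The XML here is a shortened copy of the question's document with prot:data as the root.

```python
import re
import xml.etree.ElementTree as ET

# Ask the serializer to use the prefix "prot" for the namespace URI "prot".
ET.register_namespace("prot", "prot")

XML = (
    '<prot:data xmlns:prot="prot">'
    "<product-id-template>"
    "<prot:ProductId>PRODUCT_ID</prot:ProductId>"
    "</product-id-template>"
    "</prot:data>"
)

root = ET.fromstring(XML)
product_id = root.find(".//prot:ProductId", {"prot": "prot"})

with_ns = ET.tostring(product_id, encoding="unicode")
# Strip the xmlns declaration that tostring adds to the serialized element.
no_ns = re.sub(r' xmlns(:\w+)?="[^"]+"', "", with_ns)
print(with_ns)
print(no_ns)
```

Note that register_namespace is global to the process, and the regex removes every xmlns declaration, so an element like xsi:Dealer that should keep its declaration would need separate handling.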

How to parse the xml with xmlns attribute using python

<?xml version="1.0" ?>
<school xmlns="loyo:22:2.2">
<profile>
<student xmlns="loyo:5:542">
<marks>
<mark java="java:/lo">
<ca1>200</ca1>
</mark>
</marks>
</student>
</profile>
</school>
I am trying to access the ca1 text. I am using etree but I cannot access it. I'm using the code below.
import xml.etree.ElementTree as ET
tree = ET.parse('mca.xml')
root = tree.getroot()
def getElementsData(xpath):
    elements = list()
    if root.findall(xpath):
        for elem in root.findall(xpath):
            elements.append(elem.text)
        return elements
    else:
        raise SystemExit("Invalid xpath provided")
t = getElementsData('.//ca1')
for i in t:
    print(i)
I tried to access it in different ways, but I don't know what the exact problem is. Is it a file type issue?
Your document has namespaces on nodes school and student, you need to incorporate the namespaces in your search. Since you are looking for ca1, which is under student, you will need to specify the namespace that student node has:
import xml.etree.ElementTree as ET
tree = ET.parse('mca.xml')
root = tree.getroot()
def getElementsData(xpath, namespaces):
    elements = root.findall(xpath, namespaces)
    if not elements:
        raise SystemExit("Invalid xpath provided")
    return elements
namespaces = {'ns_school': 'loyo:22:2.2', 'ns_student': 'loyo:5:542'}
elements = getElementsData('.//ns_student:ca1', namespaces)
for element in elements:
    print(element.text)
Notes
Since your namespaces have no prefixes, I gave them names such as ns_school and ns_student, but these names can be anything (e.g. ns1, mystudent, ...)
In a more complex system, I recommend raising some other kinds of errors and let the caller decide whether or not to exit.
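To see why the unprefixed search comes back empty, it may help to compare three searches on the same document. A small sketch using the XML from the question as an inline string: the plain tag name matches nothing, while both the prefix mapping and the {uri}tag (Clark notation) form find ca1, because ca1 inherits the default namespace declared on student.

```python
import xml.etree.ElementTree as ET

XML = """<school xmlns="loyo:22:2.2">
<profile>
<student xmlns="loyo:5:542">
<marks>
<mark java="java:/lo">
<ca1>200</ca1>
</mark>
</marks>
</student>
</profile>
</school>"""

root = ET.fromstring(XML)
namespaces = {"ns_student": "loyo:5:542"}

# Plain tag name: ElementTree looks for ca1 in no namespace and finds nothing.
unqualified = root.findall(".//ca1")
# Prefix mapped through the namespaces dict.
by_prefix = [elem.text for elem in root.findall(".//ns_student:ca1", namespaces)]
# Clark notation: the namespace URI written directly into the path.
by_clark = [elem.text for elem in root.findall(".//{loyo:5:542}ca1")]

print(unqualified, by_prefix, by_clark)
```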
How about traversing like this
import xml.etree.ElementTree
e = xml.etree.ElementTree.parse('test.xml').getroot()
data = e[0][0][0][0][0].text  # index children directly; Element.getchildren() was removed in Python 3.9
print(data)
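Try the following XPath. Note that xpath() is an lxml method (ElementTree has none), so tree must be parsed with lxml, and local-name() is needed because ca1 inherits the default namespace of student:
tree.xpath('//*[local-name()="ca1"]/text()')[0].strip()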

How to use xmltodict to get items out of an xml file

I am trying to easily access values from an xml file.
<artikelen>
<artikel nummer="121">
<code>ABC123</code>
<naam>Highlight pen</naam>
<voorraad>231</voorraad>
<prijs>0.56</prijs>
</artikel>
<artikel nummer="123">
<code>PQR678</code>
<naam>Nietmachine</naam>
<voorraad>587</voorraad>
<prijs>9.99</prijs>
</artikel>
..... etc
If I want to access the value ABC123, how do I get it?
import xmltodict
with open('8_1.html') as fd:
    doc = xmltodict.parse(fd.read())
print(doc[fd]['code'])
Using your example:
import xmltodict
with open('artikelen.xml') as fd:
    doc = xmltodict.parse(fd.read())
If you examine doc, you'll see it's an OrderedDict keyed by tag, with attributes stored under keys prefixed with @:
>>> doc
OrderedDict([('artikelen',
              OrderedDict([('artikel',
                            [OrderedDict([('@nummer', '121'),
                                          ('code', 'ABC123'),
                                          ('naam', 'Highlight pen'),
                                          ('voorraad', '231'),
                                          ('prijs', '0.56')]),
                             OrderedDict([('@nummer', '123'),
                                          ('code', 'PQR678'),
                                          ('naam', 'Nietmachine'),
                                          ('voorraad', '587'),
                                          ('prijs', '9.99')])])]))])
The root node is called artikelen, and there is a subnode artikel which is a list of OrderedDict objects, so if you want the code for every article, you would do:
codes = []
for artikel in doc['artikelen']['artikel']:
    codes.append(artikel['code'])
# >>> codes
# ['ABC123', 'PQR678']
If you specifically want the code only when nummer is 121, you could do this:
code = None
for artikel in doc['artikelen']['artikel']:
    if artikel['@nummer'] == '121':  # xmltodict prefixes attribute keys with @ by default
        code = artikel['code']
        break
That said, if you're parsing XML documents and want to search for a specific value like that, I would consider using XPath expressions, which are supported by ElementTree.
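As a rough illustration of that last suggestion, ElementTree's limited XPath support includes attribute predicates, so the lookup by nummer fits in one expression. This sketch uses the sample from the question as an inline string (with the closing tag added) instead of a file.

```python
import xml.etree.ElementTree as ET

XML = """<artikelen>
<artikel nummer="121">
<code>ABC123</code>
<naam>Highlight pen</naam>
<voorraad>231</voorraad>
<prijs>0.56</prijs>
</artikel>
<artikel nummer="123">
<code>PQR678</code>
<naam>Nietmachine</naam>
<voorraad>587</voorraad>
<prijs>9.99</prijs>
</artikel>
</artikelen>"""

root = ET.fromstring(XML)

# [@nummer='121'] keeps only the artikel whose nummer attribute equals "121".
code = root.findtext("artikel[@nummer='121']/code")
# All codes, in document order.
codes = [elem.text for elem in root.findall("artikel/code")]
print(code, codes)
```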
This uses xml.etree.ElementTree, with root being the parsed artikelen element.
You can try this:
for artikelobj in root.findall('artikel'):
    print(artikelobj.find('code').text)
If you want to extract a specific code based on the attribute 'nummer' of artikel, then you can try this (attribute values are strings, so compare against '121'):
for artikelobj in root.findall('artikel'):
    if artikelobj.get('nummer') == '121':
        print(artikelobj.find('code').text)
This will print only the code you want.
You can use the lxml package with an XPath expression.
from lxml import etree
tree = etree.parse("8_1.html")
expression = "/artikelen/artikel[1]/code"
l = tree.xpath(expression)
code = next(i.text for i in l)
print(code)
# ABC123
The thing to notice here is the expression. /artikelen is the root element. /artikel[1] chooses the first artikel element under the root (note that XPath indexes start at 1, not 0). /code is the child element under artikel[1]. You can read more in the lxml documentation and the XPath syntax reference.
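To read .xml files with attribute-style access, use lxml.objectify (plain lxml.etree elements do not expose children as attributes):
from lxml import objectify
root = objectify.parse(filename).getroot()
value = root.node1.node2.variable_name.text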

Accessing values in xml file with namespaces in python 2.7 lxml

I'm following this link to try to get values of several tags:
Parsing XML with namespace in Python via 'ElementTree'
In that answer, accessing the root tag like this is no problem:
import sys
from lxml import etree as ET
doc = ET.parse('file.xml')
namespaces_rdf = {'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'} # add more as needed
namespaces_dcat = {'dcat': 'http://www.w3.org/ns/dcat#'} # add more as needed
namespaces_dct = {'dct': 'http://purl.org/dc/terms/'}
print doc.findall('rdf:RDF', namespaces_rdf)
print doc.findall('dcat:Dataset', namespaces_dcat)
print doc.findall('dct:identifier', namespaces_dct)
OUTPUT:
[]
[<Element {http://www.w3.org/ns/dcat#}Dataset at 0x2269b98>]
[]
I can only access dcat:Dataset, and I can't see how to access the value of rdf:about, and later dct:identifier.
Of course, once I have accessed this info, I need to access the dcat:distribution info.
This is my example file, generated with ckanext-dcat:
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
xmlns:dct="http://purl.org/dc/terms/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dcat="http://www.w3.org/ns/dcat#"
>
<dcat:Dataset rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01">
<dct:identifier>ec631628-2f46-4f17-a685-d62a37466c01</dct:identifier>
<dct:description>FOO-Description</dct:description>
<dct:title>FOO-title</dct:title>
<dcat:keyword>keyword1</dcat:keyword>
<dcat:keyword>keyword2</dcat:keyword>
<dct:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2014-10-08T08:55:04.566618</dct:issued>
<dct:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2015-06-25T11:04:10.328902</dct:modified>
<dcat:distribution>
<dcat:Distribution rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/f5707551-6bf3-468f-9a96-b4184cc51d1f">
<dct:title>FOO-title-1</dct:title>
<dct:description>FOO-Description-1</dct:description>
<dcat:accessURL>http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/f5707551-6bf3-468f-9a96-b4184cc51d1f/download/myxls.xls</dcat:accessURL>
<dct:format>XLS</dct:format>
</dcat:Distribution>
</dcat:distribution>
<dcat:distribution>
<dcat:Distribution rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/74c1acc8-b2b5-441b-afb2-d072d0d00a7f">
<dct:format>XLS</dct:format>
<dct:title>FOO-title-2</dct:title>
<dct:description>FOO-Description-2</dct:description>
<dcat:accessURL>http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/74c1acc8-b2b5-441b-afb2-d072d0d00a7f/download/myxls.xls</dcat:accessURL>
</dcat:Distribution>
</dcat:distribution>
</dcat:Dataset>
</rdf:RDF>
Any idea on how to access this info??
Thanks
UPDATE:
Well, I need to access rdf:about in:
<dcat:Dataset rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01">
so with this code taken from:
Parse xml with lxml - extract element value
for node in doc.xpath('//dcat:Dataset', namespaces=namespaces):
    # Iterate over attributes
    for attrib in node.attrib:
        print '#' + attrib + '=' + node.attrib[attrib]
I get this output:
[<Element {http://www.w3.org/ns/dcat#}Dataset at 0x23d8ee0>]
#{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about=http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01
So, the question is:
How can I check whether the attribute is about before taking its value? In other files the tags have several attributes.
UPDATE 2: Fixed how I get the about value (Clark notation)
for node in doc.xpath('//dcat:Dataset', namespaces=namespaces):
    # Iterate over attributes
    for attrib in node.attrib:
        if attrib.endswith('about'):
            #do my jobs
Well, almost finished, but I have one last question: when I access a <dct:title>, I need to know which element it belongs to. I have:
<dcat:Dataset rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01">
<dct:title>FOO-title</dct:title>
<dcat:Distribution rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/f5707551-6bf3-468f-9a96-b4184cc51d1f">
<dct:title>FOO-title-1</dct:title>
<dcat:Distribution rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/74c1acc8-b2b5-441b-afb2-d072d0d00a7f">
<dct:title>FOO-title-2</dct:title>
If I do something like this I get:
for node in doc.xpath('//dct:title', namespaces=namespaces):
    print node.tag, node.text
{http://purl.org/dc/terms/}title FOO-title
{http://purl.org/dc/terms/}title FOO-title-1
{http://purl.org/dc/terms/}title FOO-title-2
Thanks
Use the xpath() method with the namespaces keyword argument:
namespaces = {
    'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'dcat': 'http://www.w3.org/ns/dcat#',
    'dct': 'http://purl.org/dc/terms/'
}
print(doc.xpath('//rdf:RDF', namespaces=namespaces))
print(doc.xpath('//dcat:Dataset', namespaces=namespaces))
print(doc.xpath('//dct:identifier', namespaces=namespaces))
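For the last part of the question (knowing which element each dct:title belongs to), one possible approach is to start from the elements that carry rdf:about and read the title that is their direct child, rather than collecting every title first. The sketch below uses the standard library's ElementTree instead of lxml so it runs without extra packages; the attribute is addressed in Clark notation, and the rdf:about URLs are shortened placeholders, not the ones from the question.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}
# Attribute names in a namespace are stored as {uri}name.
RDF_ABOUT = "{%s}about" % NAMESPACES["rdf"]

XML = """<rdf:RDF
    xmlns:dct="http://purl.org/dc/terms/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dcat="http://www.w3.org/ns/dcat#">
  <dcat:Dataset rdf:about="http://www.myweb.com/dataset/d1">
    <dct:title>FOO-title</dct:title>
    <dcat:distribution>
      <dcat:Distribution rdf:about="http://www.myweb.com/dataset/d1/resource/r1">
        <dct:title>FOO-title-1</dct:title>
      </dcat:Distribution>
    </dcat:distribution>
    <dcat:distribution>
      <dcat:Distribution rdf:about="http://www.myweb.com/dataset/d1/resource/r2">
        <dct:title>FOO-title-2</dct:title>
      </dcat:Distribution>
    </dcat:distribution>
  </dcat:Dataset>
</rdf:RDF>"""

root = ET.fromstring(XML)

titles_by_owner = {}
for elem in root.iter():
    about = elem.get(RDF_ABOUT)
    if about is None:
        continue  # only elements with rdf:about own a title here
    # "dct:title" without a leading .// matches direct children only.
    titles_by_owner[about] = elem.findtext("dct:title", namespaces=NAMESPACES)

for about, title in titles_by_owner.items():
    print(about, "->", title)
```

With lxml the same loop should work unchanged, and lxml elements additionally offer getparent() if you prefer to start from the titles and walk upward.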
