Finding element in xml with python - python

I am trying to parse XML before converting it's content into lists and then into CSV. Unfortunately, I think my search terms for finding the initial element are failing, causing subsequent searches further down the hierarchy. I am new to XML, so I've tried variations on namespace dictionaries and including the namespace references... The simplified XML is given below:
<?xml version="1.0" encoding="utf-8"?>
<StationList xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:add="http://www.govtalk.gov.uk/people/AddressAndPersonalDetails"
xmlns:com="http://nationalrail.co.uk/xml/common" xsi:schemaLocation="http://internal.nationalrail.co.uk/xml/XsdSchemas/External/Version4.0/nre-station-v4-0.xsd"
xmlns="http://nationalrail.co.uk/xml/station">
<Station xsi:schemaLocation="http://internal.nationalrail.co.uk/xml/XsdSchemas/External/Version4.0/nre-station-v4-0.xsd">
<ChangeHistory>
<com:ChangedBy>spascos</com:ChangedBy>
<com:LastChangedDate>2018-11-07T00:00:00.000Z</com:LastChangedDate>
</ChangeHistory>
<Name>Aber</Name>
</Station>​
The Code I am using to try to extract the com/...xml/station / ChangedBy element is below
tree = ET.parse(rootfilepath + "NRE_Station_Dataset_2019_raw.xml")
root = tree.getroot()
#get at the tags and their data
#for elem in tree.iter():
# print(f"this the tag {elem.tag} and this is the data: {elem.text}")
#open file for writing
station_data = open(rootfilepath + 'station_data.csv','w')
csvwriter = csv.writer(station_data)
station_head = []
count = 0
#inspiration for this code: http://blog.appliedinformaticsinc.com/how-to- parse-and-convert-xml-to-csv-using-python/
#this is where it goes wrong; some combination of the namespace and the tag can't find anything in line 27, 'StationList'
for member in root.findall('{http://nationalrail.co.uk/xml/station}Station'):
station = []
if count == 0:
changedby = member.find('{http://nationalrail.co.uk/xml/common}ChangedBy').tag
station_head.append(changedby)
name = member.find('{http://nationalrail.co.uk/xml/station}Name').tag
station_head.append(name)
count = count+1
changedby = member.find('{http://nationalrail.co.uk/xml/common}ChangedBy').text
station.append(changedby)
name = member.find('{http://nationalrail.co.uk/xml/station}Name').text
station.append(name)
csvwriter.writerow(station)
I have tried:
using dictionaries of namespaces but that results in nothing being found at all
using hard coded namespaces but that results in "Attribute Error: 'NoneType' object has no attribute 'tag'
Thanks in advance for all and any assistance.

First of all your XML is invalid (</StationList> is absent at the end of a file).
Assuming you have valid XML file:
<?xml version="1.0" encoding="utf-8"?>
<StationList xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:add="http://www.govtalk.gov.uk/people/AddressAndPersonalDetails"
xmlns:com="http://nationalrail.co.uk/xml/common" xsi:schemaLocation="http://internal.nationalrail.co.uk/xml/XsdSchemas/External/Version4.0/nre-station-v4-0.xsd"
xmlns="http://nationalrail.co.uk/xml/station">
<Station xsi:schemaLocation="http://internal.nationalrail.co.uk/xml/XsdSchemas/External/Version4.0/nre-station-v4-0.xsd">
<ChangeHistory>
<com:ChangedBy>spascos</com:ChangedBy>
<com:LastChangedDate>2018-11-07T00:00:00.000Z</com:LastChangedDate>
</ChangeHistory>
<Name>Aber</Name>
</Station>​
</StationList>
Then you can convert your XML to JSON and simply address to the required value:
import xmltodict
with open('file.xml', 'r') as f:
data = xmltodict.parse(f.read())
changed_by = data['StationList']['Station']['ChangeHistory']['com:ChangedBy']
Output:
spascos

Try lxml:
#!/usr/bin/env python3
from lxml import etree
ns = {"com": "http://nationalrail.co.uk/xml/common"}
with open("so.xml") as f:
tree = etree.parse(f)
for t in tree.xpath("//com:ChangedBy/text()", namespaces=ns):
print(t)
Output:
spascos

You can use Beautifulsoup which is an html and xml parser
from bs4 import BeautifulSoup
fd = open(rootfilepath + "NRE_Station_Dataset_2019_raw.xml")
soup = BeautifulSoup(fd,'lxml-xml')
for i in soup.findAll('ChangeHistory'):
print(i.ChangedBy.text)

Related

How to access UBL 2.1 xml tag using python

I need to access the tags in UBL 2.1 and modify them depend on the on the user input on python.
So, I used the ElementTree library to access the tags and modify them.
Here is a sample of the xml code:
<ns0:Invoice xmlns:ns0="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2" xmlns:ns1="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2" xmlns:ns2="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2">
<ns1:ProfileID>reporting:1.0</ns1:ProfileID>
<ns1:ID>0</ns1:ID>
<ns1:UUID>dbdf65eb-5d66-47e6-bb0c-a84bbf7baa30</ns1:UUID>
<ns1:IssueDate>2022-11-05</ns1:IssueDate>
The issue :
I want to access the tags but it is doesn't modifed and enter the loop
I tried both ways:
mytree = ET.parse('test.xml')
myroot = mytree.getroot()
for x in myroot.find({xmlns:ns1=urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2}IssueDate}"):
x.text = '1999'
mytree.write('test.xml')
mytree = ET.parse('test.xml')
myroot = mytree.getroot()
for x in myroot.iter('./Invoice/AllowanceCharge/ChargeIndicator'):
x.text = str('true')
mytree.write('test.xml')
None of them worked and modify the tag.
So the questions is : How can I reach the specific tag and modify it?
If you correct the namespace and the brakets in your for loop it works for a valid XML like (root tag must be closed!):
Input:
<?xml version="1.0" encoding="utf-8"?>
<ns0:Invoice xmlns:ns0="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2" xmlns:ns1="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2" xmlns:ns2="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2">
<ns1:ProfileID>reporting:1.0</ns1:ProfileID>
<ns1:ID>0</ns1:ID>
<ns1:UUID>dbdf65eb-5d66-47e6-bb0c-a84bbf7baa30</ns1:UUID>
<ns1:IssueDate>2022-11-05</ns1:IssueDate>
</ns0:Invoice>
Your repaired code:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
for elem in root.findall("{urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2}IssueDate"):
elem.text = '1999'
tree.write('test_changed.xml', encoding='utf-8', xml_declaration=True)
ET.dump(root)
Output:
<ns0:Invoice xmlns:ns0="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2" xmlns:ns1="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2">
<ns1:ProfileID>reporting:1.0</ns1:ProfileID>
<ns1:ID>0</ns1:ID>
<ns1:UUID>dbdf65eb-5d66-47e6-bb0c-a84bbf7baa30</ns1:UUID>
<ns1:IssueDate>1999</ns1:IssueDate>
</ns0:Invoice>

How to use ElemenTree for reading XML files in Python?

I've got an XML file which looks like this:
<?xml version="1.0"?>
-<Object>
<ID>Object_01</ID>
<Location>Manchester</Location>
<Date>01-01-2020</Date>
<Time>15u59m05s</Time>
-<Max_25Hz>
<25Hz>0.916631065043311</25Hz>
<25Hz>0.797958008447961</25Hz>
</Max_25Hz>
-<Max_75Hz>
<75Hz>1.96599232706463</75Hz>
<75Hz>1.48317837078523</75Hz>
</Max_75Hz>
</Object>
I still don't really understand the difference between attributes and text. With below code I tried to receive all the values using text.
import xml.etree.ElementTree as ET
root = r'c:\data\FF\Desktop\My_files\XML-files\Object_01.xml'
tree = ET.parse(root)
root = tree.getroot()
for elem in root:
for subelem in elem:
print(subelem.text)
Expected output:
Object_01
Manchester
01-01-2020
15u59m05s
0.916631065043311
0.797958008447961
1.96599232706463
1.48317837078523
Received output:
0.916631065043311
0.797958008447961
1.96599232706463
1.48317837078523
I tried to do to same with .attributes in the hope to receive all the 'column' names but then I received:
{}
{}
{}
{}
You can access them directly above the for-loop.
Ex:
tree = ET.ElementTree(ET.fromstring(X))
root = tree.getroot()
for elem in root:
print(elem.text) #! Access them Here
for subelem in elem:
print(subelem.text)
Output:
Object_01
Manchester
01-01-2020
15u59m05s
0.916631065043311
0.797958008447961
1.96599232706463
1.48317837078523
You could give a try to https://github.com/martinblech/xmltodict.
It is almost a replacement for json module. This allows to read an xml file into a python dict. This simplifies greatly accessing the xml content.
Something like:
from xmldict import *
root = r'c:\data\FF\Desktop\My_files\XML-files\Object_01.xml'
with open(root) as file:
xmlStr = file.read()
xmldict = xml.parse(xmlStr)
print (xmldict['Object']['Id'])

Python lxml: how to fetch XML tag names with xpath selector?

I'm trying to parse the following XML using Python and lxml:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/bind9.xsl"?>
<isc version="1.0">
<bind>
<statistics version="2.2">
<memory>
<summary>
<TotalUse>1232952256
</TotalUse>
<InUse>835252452
</InUse>
<BlockSize>598212608
</BlockSize>
<ContextSize>52670016
</ContextSize>
<Lost>0
</Lost>
</summary>
</memory>
</statistics>
</bind>
</isc>
The goal is to extract the tag name and text of every element under bind/statistics/memory/summary in order to produce the following mapping:
TotalUse: 1232952256
InUse: 835252452
BlockSize: 598212608
ContextSize: 52670016
Lost: 0
I've managed to extract the element values, but I can't figure out the xpath expression to get the element tag names.
A sample script:
from lxml import etree as et
def main():
xmlfile = "bind982.xml"
location = "bind/statistics/memory/summary/*"
label_selector = "??????" ## what to put here...?
value_selector = "text()"
with open(xmlfile, "r") as data:
xmldata = et.parse(data)
etree = xmldata.getroot()
statlist = etree.xpath(location)
for stat in statlist:
label = stat.xpath(label_selector)[0]
value = stat.xpath(value_selector)[0]
print "{0}: {1}".format(label, value)
if __name__ == '__main__':
main()
I know I could use value = stat.tag instead of stat.xpath(), but the script must be sufficiently generic to also process other pieces of XML where the label selector is different.
What xpath selector would return an element's tag name?
Simply use XPath's name(), and remove the zero index since this returns a string and not list.
from lxml import etree as et
def main():
xmlfile = "ExtractXPathTagName.xml"
location = "bind/statistics/memory/summary/*"
label_selector = "name()" ## what to put here...?
value_selector = "text()"
with open(xmlfile, "r") as data:
xmldata = et.parse(data)
etree = xmldata.getroot()
statlist = etree.xpath(location)
for stat in statlist:
label = stat.xpath(label_selector)
value = stat.xpath(value_selector)[0]
print("{0}: {1}".format(label, value).strip())
if __name__ == '__main__':
main()
Output
TotalUse: 1232952256
InUse: 835252452
BlockSize: 598212608
ContextSize: 52670016
Lost: 0
I think you don't need XPath for the two values, the element nodes have properties tag and text so use for instance a list comprehension:
[(element.tag, element.text) for element in etree.xpath(location)]
Or if you really want to use XPath
result = [(element.xpath('name()'), element.xpath('string()')) for element in etree.xpath(location)]
You could of course also construct a list of dictionaries:
result = [{ element.tag : element.text } for element in root.xpath(location)]
or
result = [{ element.xpath('name()') : element.xpath('string()') } for element in etree.xpath(location)]

LXML add an element into root

Im trying to take two elements from one file (file1.xml), and write them onto the end of another file (file2.xml). I am able to get them to print out, but am stuck trying to write them onto file2.xml! Help !
filename = "file1.xml"
appendtoxml = "file2.xml"
output_file = appendtoxml.replace('.xml', '') + "_editedbyed.xml"
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(filename, parser)
etree.tostring(tree)
root = tree.getroot()
a = root.findall(".//Device")
b = root.findall(".//Speaker")
for r in a:
print etree.tostring(r)
for e in b:
print etree.tostring(e)
NewSub = etree.SubElement (root, "Audio(just writes audio..")
print NewSub
I want the results of a, b to be added onto the end of outputfile.xml in the root.
Parse both the input file and the file you wish to append to.
Use root.append(elt) to append Element, elt, to root.
Then use tree.write to write the new tree to a file (e.g. appendtoxml):
Note: The links above point to documentation for xml.etree from the standard
library. Since lxml's API tries to be compatible with the standard library's
xml.etree, the standard library documentation applies to lxml as well (at
least for these methods). See http://lxml.de/api.html for information on where
the APIs differ.
import lxml.etree as ET
filename = "file1.xml"
appendtoxml = "file2.xml"
output_file = appendtoxml.replace('.xml', '') + "_editedbyed.xml"
parser = ET.XMLParser(remove_blank_text=True)
tree = ET.parse(filename, parser)
root = tree.getroot()
out_tree = ET.parse(appendtoxml, parser)
out_root = out_tree.getroot()
for path in [".//Device", ".//Speaker"]:
for elt in root.findall(path):
out_root.append(elt)
out_tree.write(output_file, pretty_print=True)
If file1.xml contains
<?xml version="1.0"?>
<root>
<Speaker>boozhoo</Speaker>
<Device>waaboo</Device>
<Speaker>anin</Speaker>
<Device>gigiwishimowin</Device>
</root>
and file2.xml contains
<?xml version="1.0"?>
<root>
<Speaker>jubal</Speaker>
<Device>crane</Device>
</root>
then file2_editedbyed.xml will contain
<root>
<Speaker>jubal</Speaker>
<Device>crane</Device>
<Device>waaboo</Device>
<Device>gigiwishimowin</Device>
<Speaker>boozhoo</Speaker>
<Speaker>anin</Speaker>
</root>

Parsing XML tags through Python and replace it using xml.dom.minidom

My XML file test.xml contains the following tags
<?xml version="1.0" encoding="ISO-8859-1"?>
<AppName>
<out>This is a sample output with <test>default</test> text </out>
<AppName>
I have written a python code which does the following till now:
from xml.dom.minidom import parseString
list = {'test':'example'}
file = open('test.xml','r')
data = file.read()
file.close()
dom = parseString(data)
if (len(dom.getElementsByTagName('out'))!=0):
xmlTag = dom.getElementsByTagName('out')[0].toxml()
out = xmlTag.replace('<out>','').replace('</out>','')
print out
The output of the following program is This is a sample output with <test>default</test> text
You will also notice i have a list with list = {'test':'example'} defined.
I want to check if in the out there is a tag which is listed in the list, will be replaced with the corresponding value, else the default value.
In this case, the output should be:
This is a sample output with example text
This will do more or less what you want:
from xml.dom.minidom import parseString, getDOMImplementation
test_xml = '''<?xml version="1.0" encoding="ISO-8859-1"?>
<AppName>
<out>This is a sample output with <test>default</test> text </out>
</AppName>'''
replacements = {'test':'example'}
dom = parseString(test_xml)
if (len(dom.getElementsByTagName('out'))!=0):
xmlTag = dom.getElementsByTagName('out')[0]
children = xmlTag.childNodes
text = ""
for c in children:
if c.nodeType == c.TEXT_NODE:
text += c.data
else:
if c.nodeName in replacements.keys():
text += replacements[c.nodeName]
else: # not text, nor a listed tag
text += c.toxml()
print text
Notice that I used replacements rather than list. In python terms, it's a dictionary, not a list, so that's a confusing name. It's also a builtin function, so you should avoid using it as a name.
If you want a dom object rather than just text, you'll need to take a different approach.

Categories