I know there are several XML parsers for Python, but I don't know which one would be good for parsing the XML output of MySQL; I haven't been successful yet. The output looks like:
<resultset statement="select * from table where id > 5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<row>
<field name="name">first</field>
<field name="login">2021-08-16 13:44:35</field>
</row>
<row>
<field name="name">second</field>
<field name="login">2021-08-18 13:44:35</field>
</row>
</resultset>
Because the structure is quite simple, I started writing my own parser, but I would guess something already exists that covers this case.
The output should be a list of dicts, with the column names as keys and the row's column contents as values.
See below:
import xml.etree.ElementTree as ET
xml = '''<resultset statement="select * from table where id > 5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<row>
<field name="name">first</field>
<field name="login">2021-08-16 13:44:35</field>
</row>
<row>
<field name="name">second</field>
<field name="login">2021-08-18 13:44:35</field>
</row>
</resultset>'''
data = []
root = ET.fromstring(xml)
for row in root.findall('.//row'):
    fields = []
    for field in row.findall('field'):
        fields.append((field.attrib['name'], field.text))
    data.append(fields)
print(data)
output
[[('name', 'first'), ('login', '2021-08-16 13:44:35')], [('name', 'second'), ('login', '2021-08-18 13:44:35')]]
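Since the question asks for a list of dicts rather than lists of tuples, the same loop can build one dict per row. This is a minimal sketch, assuming every field element carries a name attribute; the helper name rows_to_dicts and the constant XML_TEXT are made up for illustration.

```python
import xml.etree.ElementTree as ET

XML_TEXT = '''<resultset statement="select * from table where id > 5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<row>
<field name="name">first</field>
<field name="login">2021-08-16 13:44:35</field>
</row>
<row>
<field name="name">second</field>
<field name="login">2021-08-18 13:44:35</field>
</row>
</resultset>'''


def rows_to_dicts(xml_text):
    """Return one dict per <row>, mapping each field's name to its text."""
    root = ET.fromstring(xml_text)
    return [
        # An empty <field/> has no text, so its value comes back as None.
        {field.get('name'): field.text for field in row.findall('field')}
        for row in root.findall('row')
    ]


print(rows_to_dicts(XML_TEXT))
# [{'name': 'first', 'login': '2021-08-16 13:44:35'}, {'name': 'second', 'login': '2021-08-18 13:44:35'}]
```

All values stay strings; converting the login column to datetime objects would be a separate step.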
The structure of the code is as shown below:
This is an xml file
<ROOT>
<data>
<record>
<field name="Country or Area">Afghanistan</field>
<field name="Year">2020</field>
<field name="Item">Gross Domestic Product (GDP)</field>
<field name="Value">508.453721937094</field>
</record>
<record>
<field name="Country or Area">Afghanistan</field>
<field name="Year">2019</field>
<field name="Item">Gross Domestic Product (GDP)</field>
<field name="Value">496.940552822825</field>
</record>
</data>
</ROOT>
I have tried the following, along with other methods, but with no luck:
import pandas as pd
from lxml import objectify
xml = objectify.parse('GDP_pc.xml')
root = xml.getroot()
data = []
for i in range(len(root.getchildren())):
    data.append([child.text for child in root.getchildren()[i].getchildren()])
df = pd.DataFrame(data)
df.columns = ['Country or Area', 'Year', 'Item', 'Value']
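Have you tried the pandas function pd.read_xml()?
It reads an XML file and transforms it into a DataFrame (available since pandas 1.3).
Just do the following:
df = pd.read_xml('GDP_pc.xml')
Depending on the file's layout, you may need to pass the xpath argument to select the row elements. You can read more about it in the official documentation.
See below: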
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<ROOT>
<data>
<record>
<field name="Country or Area">Afghanistan</field>
<field name="Year">2020</field>
<field name="Item">Gross Domestic Product (GDP)</field>
<field name="Value">508.453721937094</field>
</record>
<record>
<field name="Country or Area">Afghanistan</field>
<field name="Year">2019</field>
<field name="Item">Gross Domestic Product (GDP)</field>
<field name="Value">496.940552822825</field>
</record>
</data>
</ROOT>'''
data = []
root = ET.fromstring(xml)
for rec in root.findall('.//record'):
    data.append({field.attrib['name']: field.text for field in rec.findall('field')})
df = pd.DataFrame(data)
print(df)
output
  Country or Area  Year                          Item             Value
0     Afghanistan  2020  Gross Domestic Product (GDP)  508.453721937094
1     Afghanistan  2019  Gross Domestic Product (GDP)  496.940552822825
I am new to Python and am looking for advice on the best approach to the following task:
I have an XML file that looks like this:
<component xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.spiritconsortium.org/XMLSchema/SPIRIT/1685-2009 http://www.spiritconsortium.org/XMLSchema/SPIRIT/1685-2009/index.xsd">
<memoryMaps>
<memoryMap>
<name>name</name>
<description>description</description>
<peripheral>
<name>periph</name>
<description>description</description>
<baseAddress>0x0</baseAddress>
<range>0x8</range>
<width>32</width>
<register>
<name>reg1</name>
<displayName>reg1</displayName>
<addressOffset>0x0</addressOffset>
<size>32</size>
<access>read-write</access>
<reset>
<value>0x00000002</value>
<mask>0xFFFFFFFF</mask>
</reset>
<field>
</field>
</register>
</peripheral>
</memoryMap>
</memoryMaps>
</component>
I want to replace the "reset" node with two separate nodes, "resetValue" and "resetMask", which keep the data from "value" and "mask" respectively, as follows:
........
<access>read-write</access>
<resetValue>0x00000002</resetValue>
<resetMask>0xFFFFFFFF</resetMask>
<field>
.............
I managed to parse my XML file successfully, but I don't know how to start on this first modification.
Thank you for any guidance.
The code below creates two sub-elements under 'register' and removes the now-unneeded 'reset' element. Note that SubElement appends the new elements at the end of 'register', after 'field':
import xml.etree.ElementTree as ET
xml = '''<component xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.spiritconsortium.org/XMLSchema/SPIRIT/1685-2009 http://www.spiritconsortium.org/XMLSchema/SPIRIT/1685-2009/index.xsd">
<memoryMaps>
<memoryMap>
<name>name</name>
<description>description</description>
<peripheral>
<name>periph</name>
<description>description</description>
<baseAddress>0x0</baseAddress>
<range>0x8</range>
<width>32</width>
<register>
<name>reg1</name>
<displayName>reg1</displayName>
<addressOffset>0x0</addressOffset>
<size>32</size>
<access>read-write</access>
<reset>
<value>0x00000002</value>
<mask>0xFFFFFFFF</mask>
</reset>
<field>
</field>
</register>
</peripheral>
</memoryMap>
</memoryMaps>
</component>'''
root = ET.fromstring(xml)
register = root.find('.//register')
value = register.find('.//reset/value').text
mask = register.find('.//reset/mask').text
v = ET.SubElement(register, 'resetValue')
v.text = value
m = ET.SubElement(register, 'resetMask')
m.text = mask
register.remove(register.find('reset'))
ET.dump(root)
output
<?xml version="1.0" encoding="UTF-8"?>
<component xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.spiritconsortium.org/XMLSchema/SPIRIT/1685-2009 http://www.spiritconsortium.org/XMLSchema/SPIRIT/1685-2009/index.xsd">
<memoryMaps>
<memoryMap>
<name>name</name>
<description>description</description>
<peripheral>
<name>periph</name>
<description>description</description>
<baseAddress>0x0</baseAddress>
<range>0x8</range>
<width>32</width>
<register>
<name>reg1</name>
<displayName>reg1</displayName>
<addressOffset>0x0</addressOffset>
<size>32</size>
<access>read-write</access>
<field />
<resetValue>0x00000002</resetValue>
<resetMask>0xFFFFFFFF</resetMask>
</register>
</peripheral>
</memoryMap>
</memoryMaps>
</component>
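If the new elements should sit exactly where reset was (before field, as in the desired output in the question), one option is to insert them by index instead of appending. This is a sketch on a shortened register fragment; the helper name split_reset and the trimmed sample are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

XML_TEXT = '''<register>
<name>reg1</name>
<access>read-write</access>
<reset>
<value>0x00000002</value>
<mask>0xFFFFFFFF</mask>
</reset>
<field />
</register>'''


def split_reset(register):
    """Replace <reset> with <resetValue>/<resetMask> at the same position."""
    reset = register.find('reset')
    if reset is None:
        return
    index = list(register).index(reset)  # remember where <reset> sat
    register.remove(reset)
    pairs = (('value', 'resetValue'), ('mask', 'resetMask'))
    for offset, (old_tag, new_tag) in enumerate(pairs):
        elem = ET.Element(new_tag)
        elem.text = reset.findtext(old_tag)
        elem.tail = reset.tail  # keep the line break that followed <reset>
        register.insert(index + offset, elem)


root = ET.fromstring(XML_TEXT)
# list(...) takes a snapshot so the tree can be modified while looping.
for register in list(root.iter('register')):
    split_reset(register)

print([child.tag for child in root])
# ['name', 'access', 'resetValue', 'resetMask', 'field']
```

Because the loop uses iter('register'), the same code handles a file with many registers under several peripherals.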
I have the following XML:
<?xml version="1.0" encoding="UTF-8"?>
<data>
<columns>
<Leftover index="5">Leftover</Leftover>
<NODE5 index="6"></NODE5>
<NODE6 index="7"></NODE6>
<NODE8 index="9"></NODE8>
<Nomenk__Nr_ index="2">Nomenk.
Nr.</Nomenk__Nr_>
<Year index="8">2020</Year>
<Name index="1">Name</Name>
<Value_code index="3">Value code</Value_code>
</columns>
<records>
<record index="1">
<Leftover>Leftover</Leftover>
<NODE5>Test1</NODE5>
<NODE6>Test2</NODE6>
<NODE8>Test3</NODE8>
<Nomenk__Nr_></Nomenk__Nr_>
<Name></Name>
<Value_code></Value_code>
</record>
... (it repeats itself with different values and the index value increments)
My code is:
import lxml
import lxml.etree as et
xml = open('C:\outputfile.xml', 'rb')
xml_content = xml.read()
tree = et.fromstring(xml_content)
for bad in tree.xpath("//records[@index=\'*\']/NODE5"):
    bad.getparent().remove(bad) # here I grab the parent of the element to call the remove directly on it
result = (et.tostring(tree, pretty_print=True, xml_declaration=True))
f = open( 'outputxml.xml', 'w' )
f.write( str(result) )
f.close()
What I need to do is to remove the NODE5, NODE6, NODE8. I tried using a wildcard and then specifying one of the nodes (see line 6) but that seems to not have worked... I'm also getting a syntax error right after the loop on the first character but the code executes.
My problem is also that the encoding by lxml is set to ASCII afterwards when the file is "exported".
UPDATE
I am getting this error on line 8:
return = ...
^
SyntaxError: invalid syntax
I took some code from https://stackoverflow.com/a/7981894/1987598
What I need to do is to remove the NODE5, NODE6, NODE8.
See below:
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<data>
<columns>
<Leftover index="5">Leftover</Leftover>
<NODE5 index="6" />
<NODE6 index="7" />
<NODE8 index="9" />
<Nomenk__Nr_ index="2">Nomenk.
Nr.</Nomenk__Nr_>
<Year index="8">2020</Year>
<Name index="1">Name</Name>
<Value_code index="3">Value code</Value_code>
</columns>
<records>
<record index="1">
<Leftover>Leftover</Leftover>
<NODE5>Test1</NODE5>
<NODE6>Test2</NODE6>
<NODE8>Test3</NODE8>
<Nomenk__Nr_ />
<Name />
<Value_code />
</record>
<record index="21">
<Leftover>Leftover</Leftover>
<NODE5>Test11</NODE5>
<NODE6>Test21</NODE6>
<NODE8>Test39</NODE8>
<Nomenk__Nr_ />
<Name />
<Value_code />
</record>
</records>
</data>'''
root = ET.fromstring(xml)
col = root.find('./columns')
for x in ['5','6','8']:
    nodes_to_remove = col.findall('./NODE{}'.format(x))
    for node in nodes_to_remove:
        col.remove(node)
records = root.find('./records')
records_lst = records.findall('./record')
for r in records_lst:
    for x in ['5','6','8']:
        nodes_to_remove = r.findall('./NODE{}'.format(x))
        for node in nodes_to_remove:
            r.remove(node)
ET.dump(root)
output
<data>
<columns>
<Leftover index="5">Leftover</Leftover>
<Nomenk__Nr_ index="2">Nomenk.
Nr.</Nomenk__Nr_>
<Year index="8">2020</Year>
<Name index="1">Name</Name>
<Value_code index="3">Value code</Value_code>
</columns>
<records>
<record index="1">
<Leftover>Leftover</Leftover>
<Nomenk__Nr_ />
<Name />
<Value_code />
</record>
<record index="21">
<Leftover>Leftover</Leftover>
<Nomenk__Nr_ />
<Name />
<Value_code />
</record>
</records>
</data>
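The question also asks about the output encoding. A more generic variant, sketched here with the standard library only, removes the unwanted tags at any depth and then writes the tree as UTF-8 with an XML declaration. The helper name strip_tags, the shortened sample, and the in-memory buffer standing in for a real output file are all assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Shortened sample; \u00e9 is a non-ASCII character used to check the encoding.
XML_TEXT = '''<data>
<columns>
<Leftover index="5">Caf\u00e9</Leftover>
<NODE5 index="6" />
<Year index="8">2020</Year>
</columns>
<records>
<record index="1">
<Leftover>Leftover</Leftover>
<NODE5>Test1</NODE5>
<NODE6>Test2</NODE6>
<NODE8>Test3</NODE8>
</record>
</records>
</data>'''

UNWANTED = {'NODE5', 'NODE6', 'NODE8'}


def strip_tags(root, unwanted):
    """Remove every child element whose tag is in `unwanted`, at any depth."""
    for parent in list(root.iter()):   # snapshot: the tree changes below
        for child in list(parent):
            if child.tag in unwanted:
                parent.remove(child)


root = ET.fromstring(XML_TEXT)
strip_tags(root, UNWANTED)

# Writing with encoding='utf-8' produces bytes, so the target must be binary.
buffer = io.BytesIO()
ET.ElementTree(root).write(buffer, encoding='utf-8', xml_declaration=True)
output = buffer.getvalue().decode('utf-8')
print(output)
```

To write a real file, pass a file name (or a file opened in 'wb' mode) to write() instead of the buffer. In the question's lxml code, the analogous change would presumably be to pass encoding='UTF-8' to et.tostring and to open the output file in 'wb' mode, since tostring then returns bytes rather than text.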
XML = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Entities TotalResults="101" PageSize="100" PageNumber="1">
<Entity Type="run">
<Fields>
<Field Name="host">
<Value>osdc-vw64</Value>
</Field>
<Field Name="status">
<Value>Passed</Value>
</Field>
<Field Name="owner">
<Value>Aspeg</Value>
</Field>
<Field Name="user-template-01">
<Value>1941896</Value>
</Field>
<Field Name="test-id">
<Value>72769</Value>
</Field>
</Fields>
</Entity>
<Entity Type="run">
<Fields>
<Field Name="host">
<Value>osdc-57</Value>
</Field>
<Field Name="status">
<Value>Passed</Value>
</Field>
<Field Name="owner">
<Value>spana</Value>
</Field>
<Field Name="user-template-01">
<Value>1941896</Value>
</Field>
<Field Name="test-id">
<Value>72769</Value>
</Field>
</Fields>
</Entity>
</Entities>"""
I have used :
from xml.etree import ElementTree as ET
root = ET.fromstring(XML)
print(root.tag)
I do not know how to go ahead now ...
The easiest way would be to use PyQuery (if you understand jQuery selectors):
from pyquery import PyQuery
query = PyQuery(XML)
host = query("[Name='host'] value").text()
test_id = query("[Name='test-id'] value").text()
Since you have multiple elements with Name='host', you should iterate over Entities:
from pyquery import PyQuery
def process_Entity(entity):
    pass  # do something
query = PyQuery(XML)
query("Entity").each(process_Entity)
import xml.etree.ElementTree as ET
tree = ET.parse('hai.xml')
root = tree.getroot()
for child in root:
    print(child.tag, child.attrib)
    for a in child:
        print(a.tag)
        for b in a:
            print(b.attrib, b[0].text)
Using lxml.etree:
import lxml.etree as ET
XML = """ your string here """
# lxml raises ValueError for a str that carries an encoding declaration; if so,
# pass XML.encode("utf-8") or strip it: re.sub(r'\bencoding="[^"]+"', '', XML)
tree = ET.fromstring(XML)
info_you_need = {
    entity: {field.get("Name"): field.find("Value").text
             for field in entity.findall("Fields/Field")}
    for entity in tree.findall("Entity")
}
N.B. I'm pretty awful with the lxml module, someone may come up with a much better solution than this :) My output was:
{<Element Entity at 0x2af4e18>: {'user-template-01': '1941896', 'owner': 'spana', 'test-id': '72769', 'status': 'Passed', 'host': 'osdc-57'},
<Element Entity at 0x2af4e40>: {'user-template-01': '1941896', 'owner': 'Aspeg', 'test-id': '72769', 'status': 'Passed', 'host': 'osdc-vw64'}}
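Keying the result by Element objects makes it awkward to use afterwards, so a list with one plain dict per Entity may be easier to work with. This is a sketch using only the standard library on a shortened copy of the sample; the helper name entities_to_dicts is made up for illustration.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """<Entities TotalResults="101" PageSize="100" PageNumber="1">
<Entity Type="run">
<Fields>
<Field Name="host">
<Value>osdc-vw64</Value>
</Field>
<Field Name="status">
<Value>Passed</Value>
</Field>
<Field Name="owner">
<Value>Aspeg</Value>
</Field>
</Fields>
</Entity>
<Entity Type="run">
<Fields>
<Field Name="host">
<Value>osdc-57</Value>
</Field>
<Field Name="status">
<Value>Passed</Value>
</Field>
<Field Name="owner">
<Value>spana</Value>
</Field>
</Fields>
</Entity>
</Entities>"""


def entities_to_dicts(xml_text):
    """Return one dict per <Entity>, mapping each Field's Name to its Value text."""
    root = ET.fromstring(xml_text)
    return [
        {
            field.get('Name'): field.findtext('Value')
            for field in entity.findall('Fields/Field')
        }
        for entity in root.findall('Entity')
    ]


records = entities_to_dicts(XML_TEXT)
for record in records:
    print(record['host'], record['status'])
```

Unlike the Element-keyed dict, the list keeps the entities in document order.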
Let's suppose we have the XML file with the structure as follows.
<?xml version="1.0" ?>
<searchRetrieveResponse xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/zing/srw/ http://www.loc.gov/standards/sru/sru1-1archive/xml-files/srw-types.xsd" xmlns="http://www.loc.gov/zing/srw/">
<records xmlns:ns1="http://www.loc.gov/zing/srw/">
<record>
<recordData>
<record xmlns="">
<datafield tag="000">
<subfield code="a">123</subfield>
<subfield code="b">456</subfield>
</datafield>
<datafield tag="001">
<subfield code="a">789</subfield>
<subfield code="b">987</subfield>
</datafield>
</record>
</recordData>
</record>
<record>
<recordData>
<record xmlns="">
<datafield tag="000">
<subfield code="a">123</subfield>
<subfield code="b">456</subfield>
</datafield>
<datafield tag="001">
<subfield code="a">789</subfield>
<subfield code="b">987</subfield>
</datafield>
</record>
</recordData>
</record>
</records>
</searchRetrieveResponse>
I need to parse out:
The content of the "subfield" (e.g. 123 in the example above) and
Attribute values (e.g. 000 or 001)
I wonder how to do that using lxml and XPath. My initial code is pasted below; could someone explain how to parse out these values?
from urllib.request import urlopen
from lxml import etree
url = "https://dl.dropbox.com/u/540963/short_test.xml"
fp = urlopen(url)
doc = etree.parse(fp)
fp.close()
ns = {'xsi':'http://www.loc.gov/zing/srw/'}
for record in doc.xpath('//xsi:record', namespaces=ns):
    print(record.xpath("xsi:recordData/record/datafield[@tag='000']", namespaces=ns))
I would be more direct in your XPath: go straight for the elements you want, in this case datafield.
for df in doc.xpath('//datafield'):
    # Iterate over the attributes of datafield
    for attrib_name in df.attrib:
        print('@' + attrib_name + '=' + df.attrib[attrib_name])
    # subfield is a child of datafield; iterate over the children
    for subfield in df:
        print('subfield=' + subfield.text)
Also, //datafield works without a namespace prefix here because the inner <record xmlns=""> resets the default namespace, so datafield and subfield are in no namespace.
I would just go with
for df in doc.xpath('//datafield'):
    print(df.attrib)
    for sf in df:
        print(sf.text)
Also, you don't need urllib: lxml can parse XML directly from an HTTP URL
url = "http://dl.dropbox.com/u/540963/short_test.xml" # doesn't work with https though
doc = etree.parse(url)
Try the following working code:
from urllib.request import urlopen
from lxml import etree
url = "https://dl.dropbox.com/u/540963/short_test.xml"
fp = urlopen(url)
doc = etree.parse(fp)
fp.close()
for record in doc.xpath('//datafield'):
    print(record.xpath("./@tag")[0])
    for x in record.xpath("./subfield/text()"):
        print("\t", x)
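For comparison, the same values can be collected with the standard library's ElementTree, which makes the namespace handling explicit. This is a sketch on a shortened inline copy of the sample, not the downloaded file; the prefix srw in the NS mapping is an arbitrary label chosen for illustration. The outer elements live in the http://www.loc.gov/zing/srw/ namespace, while the inner record resets it with xmlns="", so datafield and subfield are matched without a prefix.

```python
import xml.etree.ElementTree as ET

XML_TEXT = '''<searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/">
<records>
<record>
<recordData>
<record xmlns="">
<datafield tag="000">
<subfield code="a">123</subfield>
<subfield code="b">456</subfield>
</datafield>
<datafield tag="001">
<subfield code="a">789</subfield>
<subfield code="b">987</subfield>
</datafield>
</record>
</recordData>
</record>
</records>
</searchRetrieveResponse>'''

# The prefix name is arbitrary; only the URI has to match the document.
NS = {'srw': 'http://www.loc.gov/zing/srw/'}

root = ET.fromstring(XML_TEXT)
results = []
for record_data in root.findall('.//srw:recordData', NS):
    # Below recordData the elements are in no namespace.
    for datafield in record_data.iter('datafield'):
        subfields = {sf.get('code'): sf.text for sf in datafield.findall('subfield')}
        results.append((datafield.get('tag'), subfields))

print(results)
# [('000', {'a': '123', 'b': '456'}), ('001', {'a': '789', 'b': '987'})]
```

Each tuple pairs a datafield's tag attribute with a dict of its subfield codes and contents.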