Parse xml with lxml - extract element value - python

Let's suppose we have the XML file with the structure as follows.
<?xml version="1.0" ?>
<searchRetrieveResponse xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/zing/srw/ http://www.loc.gov/standards/sru/sru1-1archive/xml-files/srw-types.xsd" xmlns="http://www.loc.gov/zing/srw/">
<records xmlns:ns1="http://www.loc.gov/zing/srw/">
<record>
<recordData>
<record xmlns="">
<datafield tag="000">
<subfield code="a">123</subfield>
<subfield code="b">456</subfield>
</datafield>
<datafield tag="001">
<subfield code="a">789</subfield>
<subfield code="b">987</subfield>
</datafield>
</record>
</recordData>
</record>
<record>
<recordData>
<record xmlns="">
<datafield tag="000">
<subfield code="a">123</subfield>
<subfield code="b">456</subfield>
</datafield>
<datafield tag="001">
<subfield code="a">789</subfield>
<subfield code="b">987</subfield>
</datafield>
</record>
</recordData>
</record>
</records>
</searchRetrieveResponse>
I need to parse out:
The content of the "subfield" (e.g. 123 in the example above) and
Attribute values (e.g. 000 or 001)
How can I do that using lxml and XPath? My initial code is pasted below; could someone explain how to extract these values?
import urllib, urllib2
from lxml import etree
url = "https://dl.dropbox.com/u/540963/short_test.xml"
fp = urllib2.urlopen(url)
doc = etree.parse(fp)
fp.close()
ns = {'xsi':'http://www.loc.gov/zing/srw/'}
for record in doc.xpath('//xsi:record', namespaces=ns):
    print record.xpath("xsi:recordData/record/datafield[@tag='000']", namespaces=ns)

I would be more direct in your XPath: go straight for the elements you want, in this case datafield.
>>> for df in doc.xpath('//datafield'):
...     # Iterate over attributes of datafield
...     for attrib_name in df.attrib:
...         print '@' + attrib_name + '=' + df.attrib[attrib_name]
...     # subfield is a child of datafield, and iterate
...     subfields = df.getchildren()
...     for subfield in subfields:
...         print 'subfield=' + subfield.text
Also, no namespace prefix is needed for these elements: the inner <record xmlns=""> resets the default namespace, so datafield and subfield are in no namespace at all.

I would just go with
for df in doc.xpath('//datafield'):
    print df.attrib
    for sf in df.getchildren():
        print sf.text
Also, you don't need urllib: etree.parse() can read directly from an HTTP URL.
url = "http://dl.dropbox.com/u/540963/short_test.xml" #doesn't work with https though
doc = etree.parse(url)

Try the following working code:
import urllib2
from lxml import etree
url = "https://dl.dropbox.com/u/540963/short_test.xml"
fp = urllib2.urlopen(url)
doc = etree.parse(fp)
fp.close()
for record in doc.xpath('//datafield'):
    print record.xpath("./@tag")[0]
    for x in record.xpath("./subfield/text()"):
        print "\t", x
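The answers above work because the inner record element resets the default namespace. As a minimal sketch of walking the full path explicitly, the example below uses the standard library's xml.etree.ElementTree rather than lxml (a deliberate swap so that it runs without extra dependencies). The prefix name "srw" and the trimmed XML sample are assumptions made for illustration; only the namespace URI comes from the question.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response from the question (one record only).
XML = """<searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/">
<records>
<record>
<recordData>
<record xmlns="">
<datafield tag="000">
<subfield code="a">123</subfield>
<subfield code="b">456</subfield>
</datafield>
<datafield tag="001">
<subfield code="a">789</subfield>
<subfield code="b">987</subfield>
</datafield>
</record>
</recordData>
</record>
</records>
</searchRetrieveResponse>"""

# The prefix "srw" is an arbitrary choice; only the URI has to match.
ns = {"srw": "http://www.loc.gov/zing/srw/"}

root = ET.fromstring(XML)
results = []
# Outer elements need the prefix; everything below <record xmlns=""> does not.
path = "srw:records/srw:record/srw:recordData/record/datafield"
for df in root.findall(path, ns):
    for sf in df.findall("subfield"):
        results.append((df.get("tag"), sf.get("code"), sf.text))

for row in results:
    print(row)
```

Each tuple holds the datafield tag, the subfield code, and the subfield text, so attribute values and element content come out together.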

Related

how to parse xml output of mysql in python

I know there are several XML parsers for Python, but I don't know which one would be good for parsing MySQL's XML output, and I haven't been successful yet. The output looks like:
<resultset statement="select * from table where id > 5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<row>
<field name="name">first</field>
<field name="login">2021-08-16 13:44:35</field>
</row>
<row>
<field name="name">second</field>
<field name="login">2021-08-18 13:44:35</field>
</row>
</resultset>
Because the structure is quite simple, I am about to write my own parser, but I would guess something already exists to cover this case?
The output should be a list of dicts, with the column names as keys and the row's values for those columns as values.
See below:
import xml.etree.ElementTree as ET
xml = '''<resultset statement="select * from table where id > 5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<row>
<field name="name">first</field>
<field name="login">2021-08-16 13:44:35</field>
</row>
<row>
<field name="name">second</field>
<field name="login">2021-08-18 13:44:35</field>
</row>
</resultset>'''
data = []
root = ET.fromstring(xml)
for row in root.findall('.//row'):
    fields = []
    for field in row.findall('field'):
        fields.append((field.attrib['name'], field.text))
    data.append(fields)
print(data)
output
[[('name', 'first'), ('login', '2021-08-16 13:44:35')], [('name', 'second'), ('login', '2021-08-18 13:44:35')]]
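The question asked for a list of dicts, while the answer above yields lists of (name, value) tuples. A minimal sketch of the dict variant, using the same sample data, might look like this; the variable names XML and rows are chosen here for illustration only.

```python
import xml.etree.ElementTree as ET

XML = """<resultset statement="select * from table where id > 5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<row>
<field name="name">first</field>
<field name="login">2021-08-16 13:44:35</field>
</row>
<row>
<field name="name">second</field>
<field name="login">2021-08-18 13:44:35</field>
</row>
</resultset>"""

root = ET.fromstring(XML)
# One dict per <row>: the "name" attribute is the key, the element text the value.
rows = [
    {field.get("name"): field.text for field in row.findall("field")}
    for row in root.findall("row")
]
print(rows)
```

Note that a NULL column would need extra handling, since this sketch only reads the element text.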

Extracting first element in xml - using xxx.find() leads to nonetype error?

The structure of the xml-file basically looks like this, it's bibliographic data in the format MARC21-xml (used by libraries all over the place):
<?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim">
<record type="Bibliographic">
<leader> ... </leader>
<controlfield> ... </controlfield>
...
<controlfield> ... </controlfield>
<datafield tag="123" ... >
<subfield code="x"> ... </subfield>
...
<subfield code="x"> ... </subfield>
</datafield>
<datafield tag="456" ...>
There's a proper example file here: https://www.loc.gov/standards/marcxml/Sandburg/sandburg.xml - however, this only represents one item (e.g. a specific book). Usually these files contain hundreds to thousands of records, so the record tag with all its content is repeatable.
The file I work with has over 10,000 record tags in it (all representing different items), all of which have a datafield with the tag "082" and then several subfields.
I am now trying to extract the text in the subfield with code="a". However, since this field is also repeatable and some records have two of those, I only ever want the first one. My current code, which extracts the text for ALL subfields with code="a" in these datafields, looks like this:
for child in record.findall("{http://www.loc.gov/MARC21/slim}datafield[@tag='082']"):
    for subelement in child:
        if subelement.attrib['code'] == "a":
            ddc = subelement.text
            ddccoll.append(ddc)
This works but, as I said, returns too many elements: the list ends up with 10277 entries although the file contains only 10123 records, probably because the subfield is repeatable.
I tried using find instead of findall, but then I get a TypeError:
TypeError                                 Traceback (most recent call last)
<ipython-input-23-fae786776bcf> in <module>
     18         idcoll.append("nicht vorhanden")
     19
---> 20     for child in record.find("{http://www.loc.gov/MARC21/slim}datafield[@tag='082']"):
     21         for subelement in child:
     22             if subelement.attrib['code'] == "a":
TypeError: 'NoneType' object is not iterable
I am not exactly sure why, since the field 082 should be present in every single record - but since I am actually really after the subfield, this is probably not the right approach anyway. Now I have tried to go one layer deeper and simply look for the first subelement with the code a with the following code:
for child in record.findall("{http://www.loc.gov/MARC21/slim}datafield[@tag='082']"):
    for subelement in child.find("{http://www.loc.gov/MARC21/slim}subfield[@code='a']"):
        if subelement:
            ddc = subelement.text
            ddccoll.append(ddc)
However, this doesn't append anything: if I print the length of the list afterwards, it says "0". I have done the same for authors and ids, and it works for those. I am trying to get this right so that afterwards I can create a DataFrame with authors, ids, titles etc.
I am currently completely stuck at this: Is the path wrong? Is there another, simpler, better way of doing this?
I assume that you have read your XML with the following code:
import xml.etree.ElementTree as et
tree = et.parse('Input.xml')
root = tree.getroot()
To reach your wanted elements you can use the following code:
# Namespace dictionary
ns = {'slim': 'http://www.loc.gov/MARC21/slim'}
# Process "datafield" elements with the required "tag" attribute
for it in root.findall('.//slim:datafield[@tag="082"]', ns):
    print(f'{it.tag:10}, {it.attrib}')
    # Find the first child with "code" == "a"
    child = it.find('slim:*[@code="a"]', ns)
    if isinstance(child, et.Element): # Something found
        print(f' {child.tag:10}, {child.attrib}, {child.text}')
    else:
        print(' Nothing found')
In the above sample I included only print statements for the elements found, but you can do anything you wish with them.
Using the following source XML:
<?xml version="1.0" encoding="UTF-8"?>
<collection xmlns="http://www.loc.gov/MARC21/slim">
<record type="Bibliographic">
<leader>...</leader>
<controlfield>...</controlfield>
<datafield tag="082" id="1">
<subfield code="a">a1</subfield>
<subfield code="x">x1</subfield>
<subfield code="a">a2</subfield>
</datafield>
<datafield tag="456" id="2">
<subfield code="a">a3</subfield>
</datafield>
<datafield tag="082" id="3">
<subfield code="a">a4</subfield>
<subfield code="x">x2</subfield>
<subfield code="a">a5</subfield>
</datafield>
</record>
<record type="Bibliographic">
<leader>...</leader>
<controlfield>...</controlfield>
<datafield tag="082" id="4">
<subfield code="a">a6</subfield>
<subfield code="x">x3</subfield>
<subfield code="a">a7</subfield>
</datafield>
<datafield tag="456" id="5">
<subfield code="a">a8</subfield>
</datafield>
</record>
</collection>
I got the following result:
{http://www.loc.gov/MARC21/slim}datafield, {'tag': '082', 'id': '1'}
{http://www.loc.gov/MARC21/slim}subfield, {'code': 'a'}, a1
{http://www.loc.gov/MARC21/slim}datafield, {'tag': '082', 'id': '3'}
{http://www.loc.gov/MARC21/slim}subfield, {'code': 'a'}, a4
{http://www.loc.gov/MARC21/slim}datafield, {'tag': '082', 'id': '4'}
{http://www.loc.gov/MARC21/slim}subfield, {'code': 'a'}, a6
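The question asks for exactly one value per record, so a possible variation is to call find() once per record with a two-step path. find() returns a single element or None rather than a list. That explains both failures in the question: iterating over None raises the TypeError, and iterating over a found subfield walks its (non-existent) children. The sketch below assumes a shortened sample file and reuses the list name ddccoll from the question; records without an 082 field get None.

```python
import xml.etree.ElementTree as ET

# Shortened sample: the second record deliberately has no 082 field.
XML = """<collection xmlns="http://www.loc.gov/MARC21/slim">
<record type="Bibliographic">
<datafield tag="082">
<subfield code="a">a1</subfield>
<subfield code="a">a2</subfield>
</datafield>
<datafield tag="082">
<subfield code="a">a4</subfield>
</datafield>
</record>
<record type="Bibliographic">
<datafield tag="456">
<subfield code="a">a8</subfield>
</datafield>
</record>
<record type="Bibliographic">
<datafield tag="082">
<subfield code="x">x3</subfield>
<subfield code="a">a6</subfield>
</datafield>
</record>
</collection>"""

ns = {"slim": "http://www.loc.gov/MARC21/slim"}
root = ET.fromstring(XML)

ddccoll = []
for record in root.findall("slim:record", ns):
    # First subfield "a" of the first 082 datafield that has one, or None.
    first = record.find("slim:datafield[@tag='082']/slim:subfield[@code='a']", ns)
    # Compare with None explicitly: an element without children is falsy.
    ddccoll.append(first.text if first is not None else None)

print(ddccoll)
```

Appending a placeholder for missing fields keeps the list aligned with the records, which matters when the lists are later combined into a DataFrame.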

modify node and extract data from xml file in python

I am new to Python and I am looking for advice on the best approach to the following task:
I have an XML file that looks like this:
<component xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.spiritconsortium.org/XMLSchema/SPIRIT/1685-2009 http://www.spiritconsortium.org/XMLSchema/SPIRIT/1685-2009/index.xsd">
<memoryMaps>
<memoryMap>
<name>name</name>
<description>description</description>
<peripheral>
<name>periph</name>
<description>description</description>
<baseAddress>0x0</baseAddress>
<range>0x8</range>
<width>32</width>
<register>
<name>reg1</name>
<displayName>reg1</displayName>
<addressOffset>0x0</addressOffset>
<size>32</size>
<access>read-write</access>
<reset>
<value>0x00000002</value>
<mask>0xFFFFFFFF</mask>
</reset>
<field>
</field>
</register>
</peripheral>
</memoryMap>
</memoryMaps>
</component>
I want to replace the "reset" node with two separate nodes, "resetValue" and "resetMask", which keep the data currently held in "value" and "mask", as follows:
........
<access>read-write</access>
<resetValue>0x00000002</resetValue>
<resetMask>0xFFFFFFFF</resetMask>
<field>
.............
I managed to parse my XML file successfully, but I don't know how to start on this first modification.
Thank you for any guidance.
The code below creates two sub-elements under 'register' and removes the now-unneeded 'reset' element:
import xml.etree.ElementTree as ET
xml = '''<component xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.spiritconsortium.org/XMLSchema/SPIRIT/1685-2009 http://www.spiritconsortium.org/XMLSchema/SPIRIT/1685-2009/index.xsd">
<memoryMaps>
<memoryMap>
<name>name</name>
<description>description</description>
<peripheral>
<name>periph</name>
<description>description</description>
<baseAddress>0x0</baseAddress>
<range>0x8</range>
<width>32</width>
<register>
<name>reg1</name>
<displayName>reg1</displayName>
<addressOffset>0x0</addressOffset>
<size>32</size>
<access>read-write</access>
<reset>
<value>0x00000002</value>
<mask>0xFFFFFFFF</mask>
</reset>
<field>
</field>
</register>
</peripheral>
</memoryMap>
</memoryMaps>
</component>'''
root = ET.fromstring(xml)
register = root.find('.//register')
value = register.find('.//reset/value').text
mask = register.find('.//reset/mask').text
v = ET.SubElement(register, 'resetValue')
v.text = value
m = ET.SubElement(register, 'resetMask')
m.text = mask
register.remove(register.find('reset'))
ET.dump(root)
output
<?xml version="1.0" encoding="UTF-8"?>
<component xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.spiritconsortium.org/XMLSchema/SPIRIT/1685-2009 http://www.spiritconsortium.org/XMLSchema/SPIRIT/1685-2009/index.xsd">
<memoryMaps>
<memoryMap>
<name>name</name>
<description>description</description>
<peripheral>
<name>periph</name>
<description>description</description>
<baseAddress>0x0</baseAddress>
<range>0x8</range>
<width>32</width>
<register>
<name>reg1</name>
<displayName>reg1</displayName>
<addressOffset>0x0</addressOffset>
<size>32</size>
<access>read-write</access>
<field />
<resetValue>0x00000002</resetValue>
<resetMask>0xFFFFFFFF</resetMask>
</register>
</peripheral>
</memoryMap>
</memoryMaps>
</component>
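In the output above the new elements end up after field, because SubElement appends to the end of the parent, whereas the question shows them in the position of the old reset node. One way to keep that position is Element.insert() at the index of reset. The following is a sketch on a trimmed register element only; the names position and order are illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed sample: just one <register> from the question's file.
XML = """<register>
<access>read-write</access>
<reset>
<value>0x00000002</value>
<mask>0xFFFFFFFF</mask>
</reset>
<field>
</field>
</register>"""

register = ET.fromstring(XML)
reset = register.find("reset")
# Index of <reset> among the direct children of <register>.
position = list(register).index(reset)

renames = [("value", "resetValue"), ("mask", "resetMask")]
for offset, (old_tag, new_tag) in enumerate(renames):
    new_elem = ET.Element(new_tag)
    new_elem.text = reset.findtext(old_tag)
    new_elem.tail = reset.tail  # keep the line break that followed <reset>
    register.insert(position + offset, new_elem)

register.remove(reset)

order = [child.tag for child in register]
print(order)
print(ET.tostring(register, encoding="unicode"))
```

For a whole file, the same steps could be applied to every element returned by root.iter('register').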

How to read the following XML and get values for "host","status","owner","user-template-01" and "test-id"?

XML = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Entities TotalResults="101" PageSize="100" PageNumber="1">
<Entity Type="run">
<Fields>
<Field Name="host">
<Value>osdc-vw64</Value>
</Field>
<Field Name="status">
<Value>Passed</Value>
</Field>
<Field Name="owner">
<Value>Aspeg</Value>
</Field>
<Field Name="user-template-01">
<Value>1941896</Value>
</Field>
<Field Name="test-id">
<Value>72769</Value>
</Field>
</Fields>
</Entity>
<Entity Type="run">
<Fields>
<Field Name="host">
<Value>osdc-57</Value>
</Field>
<Field Name="status">
<Value>Passed</Value>
</Field>
<Field Name="owner">
<Value>spana</Value>
</Field>
<Field Name="user-template-01">
<Value>1941896</Value>
</Field>
<Field Name="test-id">
<Value>72769</Value>
</Field>
</Fields>
</Entity>
</Entities>"""
I have used :
from xml.etree import ElementTree as ET
root = ET.fromstring(XML)
print root.tag
I do not know how to go ahead now ...
The easiest way would be to use PyQuery (if you understand jQuery selectors):
from pyquery import PyQuery
query = PyQuery(XML)
host = query("[Name='host'] value").text()
test_id = query("[Name='test-id'] value").text()
Since you have multiple elements with Name='host', you should iterate over Entities:
from pyquery import PyQuery
def process_Entity(entity):
    pass #do something
query = PyQuery(XML)
query("Entity").each(process_Entity)
import xml.etree.ElementTree as ET
tree = ET.parse('hai.xml')
root = tree.getroot()
for child in root:
    print child.tag, child.attrib
    for a in child:
        print a.tag
        for b in a:
            print b.attrib , b[0].text
Using lxml.etree:
import lxml.etree as ET
XML = """ your string here """
tree = ET.fromstring(XML) # you may get errors here because of the encoding declaration
# if so, re.sub(r'\bencoding="[^"]+"', '', XML) works
info_you_need = {
    entity: {field.get("Name"): field.find("Value").text
             for field in entity.findall("Fields/Field")}
    for entity in tree.findall("Entity")
}
N.B. I'm pretty awful with the lxml module, someone may come up with a much better solution than this :) My output was:
{<Element Entity at 0x2af4e18>: {'user-template-01': '1941896', 'owner': 'spana', 'test-id': '72769', 'status': 'Passed', 'host': 'osdc-57'},
<Element Entity at 0x2af4e40>: {'user-template-01': '1941896', 'owner': 'Aspeg', 'test-id': '72769', 'status': 'Passed', 'host': 'osdc-vw64'}}
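Because the dict above is keyed by Element objects, the entries are awkward to look up and print. A possible alternative, sketched here with the standard library's ElementTree and a shortened copy of the data, is a plain list with one dict per Entity. The names wanted and runs are illustrative, and the XML declaration is left out of the sample string to sidestep the encoding issue mentioned above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's data (three fields per run).
XML = """<Entities TotalResults="101" PageSize="100" PageNumber="1">
<Entity Type="run">
<Fields>
<Field Name="host"><Value>osdc-vw64</Value></Field>
<Field Name="status"><Value>Passed</Value></Field>
<Field Name="owner"><Value>Aspeg</Value></Field>
</Fields>
</Entity>
<Entity Type="run">
<Fields>
<Field Name="host"><Value>osdc-57</Value></Field>
<Field Name="status"><Value>Passed</Value></Field>
<Field Name="owner"><Value>spana</Value></Field>
</Fields>
</Entity>
</Entities>"""

wanted = ("host", "status", "owner")

root = ET.fromstring(XML)
runs = []
for entity in root.findall("Entity"):
    # Map each Field's Name attribute to the text of its <Value> child.
    values = {
        field.get("Name"): field.findtext("Value")
        for field in entity.findall("Fields/Field")
    }
    # Keep only the requested names; missing ones become None.
    runs.append({name: values.get(name) for name in wanted})

for run in runs:
    print(run)
```

Adding "user-template-01" and "test-id" to wanted would pick up the remaining fields from the full document.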

Parsing XML file etree module

I'm reading an XML file using the etree module. I'm using the following code to print the values of the <page> and <title> tags, and it works fine. But I want one small change: print the value of the tag only if the id attribute of <page id='...'> exists. Is that possible? Thanks.
import xml.etree.cElementTree as etree
from pprint import pprint
tree = etree.parse('find_title.xml')
for value in tree.getiterator(tag='title'):
    print value.text
for value in tree.getiterator(tag='page'):
    pprint(value.attrib)
Here is my xml File.
<mediawiki>
<siteinfo>
<sitename>Wiki</sitename>
<namespaces>
<namespace key="-2" case="first-letter">Media</namespace>
</namespaces>
</siteinfo>
<page id="31239628" orglength="6822" newlength="4524" stub="0" categories="0" outlinks="1" urls="10">
<title>Title</title>
<categories></categories>
<links>15099779</links>
<urls>
</urls>
<text>
Books
</text>
</page>
</mediawiki>
for el in tree.getiterator(tag='page'):
    page_id = el.get('id', None) # returns second arg if id not exists
    if page_id:
        print page_id, el.find('title').text
    else:
        pprint(el.attrib)
Edit: Updated for comment: "Thanks can i print page_id and title at same time? Means 31239628 - Title"
The element.get() method is used to retrieve optional attribute values in a tag:
>>> page_id = tree.find('page').get('id')
>>> if page_id:
...     print page_id
...
31239628
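The snippets above are Python 2 code; xml.etree.cElementTree and getiterator() were both removed in Python 3.9. A rough Python 3 equivalent, using iter() and producing the "31239628 - Title" form requested in the comment, could look like the sketch below. The second page without an id is an assumption added to show the filter at work.

```python
import xml.etree.ElementTree as ET

XML = """<mediawiki>
<page id="31239628" orglength="6822" newlength="4524">
<title>Title</title>
</page>
<page>
<title>No id here</title>
</page>
</mediawiki>"""

root = ET.fromstring(XML)
lines = []
for page in root.iter("page"):
    page_id = page.get("id")  # None when the attribute is missing
    if page_id is not None:
        lines.append(f"{page_id} - {page.findtext('title')}")

print("\n".join(lines))
```

Only pages that carry an id attribute contribute a line; the others are skipped silently.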
