Trouble getting XML elements using ElementTree - python

I'm trying to parse an XML document in Python so that I can manipulate the data and write out a new file. The full file that I'm working with is here, but here is an excerpt:
<?xml version="1.0" encoding="UTF-8"?>
<FMPXMLRESULT xmlns="http://www.filemaker.com/fmpxmlresult">
<ERRORCODE>0</ERRORCODE>
<PRODUCT BUILD="09-11-2013" NAME="FileMaker" VERSION="ProAdvanced 12.0v5"/>
<DATABASE DATEFORMAT="M/d/yyyy" LAYOUT="" NAME="All gigs 88-07.fmp12" RECORDS="746" TIMEFORMAT="h:mm:ss a"/>
<METADATA>
<FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="Country" TYPE="TEXT"/>
<FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="Year" TYPE="TEXT"/>
<FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="City" TYPE="TEXT"/>
<FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="State" TYPE="TEXT"/>
<FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="Theater" TYPE="TEXT"/>
</METADATA>
<RESULTSET FOUND="746">
<ROW MODID="3" RECORDID="32">
<COL>
<DATA/>
</COL>
<COL>
<DATA>1996</DATA>
</COL>
<COL>
<DATA>Pompano Beach</DATA>
</COL>
<COL>
<DATA>FL</DATA>
</COL>
<COL>
<DATA>First Presbyterian Church</DATA>
</COL>
</ROW>
<ROW MODID="3" RECORDID="33">
<COL>
<DATA/>
</COL>
<COL>
<DATA>1996</DATA>
</COL>
<COL>
<DATA>Hilton Head</DATA>
</COL>
<COL>
<DATA>SC</DATA>
</COL>
<COL>
<DATA>Self Family Arts Center</DATA>
</COL>
</ROW>
<!-- snip many more ROW elements -->
</RESULTSET>
</FMPXMLRESULT>
Eventually, I want to use the information from the METADATA field to parse the columns in the RESULTSET, but for now I’m having trouble just getting a handle on the data. Here is what I’ve tried to get the contents of the METADATA element:
import xml.etree.ElementTree as ET
tree = ET.parse('giglist.xml')
root = tree.getroot()
print root
metadata = tree.find("METADATA")
print metadata
This prints out:
<Element '{http://www.filemaker.com/fmpxmlresult}FMPXMLRESULT' at 0x10f982cd0>
None
Why is metadata empty? Am I misusing the find() method?

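You need to handle namespaces: the root element declares a default namespace, so every tag in the tree is qualified with it and a bare "METADATA" matches nothing.
Since only a default namespace is declared, you can find the element by putting the namespace URI in braces before the tag name: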
import xml.etree.ElementTree as ET
ns = 'http://www.filemaker.com/fmpxmlresult'
tree = ET.parse('giglist.xml')
root = tree.getroot()
metadata = root.find("{%s}METADATA" % ns)
print metadata # prints <Element '{http://www.filemaker.com/fmpxmlresult}METADATA' at 0x103ccbe90>
Here are the relevant threads you may want to see:
Is there a key for the default namespace when creating dictionary for use with xml.etree.ElementTree.findall() in Python?
Parsing XML with namespace in Python via 'ElementTree'
UPD (getting the list of results):
import xml.etree.ElementTree as ET
ns = 'http://www.filemaker.com/fmpxmlresult'
tree = ET.parse('giglist.xml')
root = tree.getroot()
keys = [field.attrib['NAME'] for field in root.findall(".//{%(ns)s}METADATA/{%(ns)s}FIELD" % {'ns': ns})]
results = [dict(zip(keys, [col.text for col in row.findall(".//{%(ns)s}COL/{%(ns)s}DATA" % {'ns': ns})]))
for row in root.findall(".//{%(ns)s}RESULTSET/{%(ns)s}ROW" % {'ns': ns})]
print results
Prints:
[
{'City': 'Pompano Beach', 'Country': None, 'State': 'FL', 'Theater': 'First Presbyterian Church', 'Year': '1996'},
{'City': 'Hilton Head', 'Country': None, 'State': 'SC', 'Theater': 'Self Family Arts Center', 'Year': '1996'}
]
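As a hedged alternative to formatting the URI into every path, find() and findall() also accept a prefix-to-URI mapping. The sketch below assumes Python 3 and uses a trimmed inline excerpt of the question's file; the prefix fm and the names XML, NS and rows are arbitrary choices made for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed inline excerpt of the question's file (two fields, one row).
XML = """<?xml version="1.0" encoding="UTF-8"?>
<FMPXMLRESULT xmlns="http://www.filemaker.com/fmpxmlresult">
  <METADATA>
    <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="Year" TYPE="TEXT"/>
    <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="City" TYPE="TEXT"/>
  </METADATA>
  <RESULTSET FOUND="1">
    <ROW MODID="3" RECORDID="32">
      <COL><DATA>1996</DATA></COL>
      <COL><DATA>Pompano Beach</DATA></COL>
    </ROW>
  </RESULTSET>
</FMPXMLRESULT>"""

# "fm" is an arbitrary prefix chosen here; only the URI has to match the file.
NS = {"fm": "http://www.filemaker.com/fmpxmlresult"}

root = ET.fromstring(XML.encode("utf-8"))

# Field names come from METADATA, values from each ROW's COL/DATA children.
keys = [field.get("NAME") for field in root.findall("fm:METADATA/fm:FIELD", NS)]
rows = [
    dict(zip(keys, (data.text for data in row.findall("fm:COL/fm:DATA", NS))))
    for row in root.findall("fm:RESULTSET/fm:ROW", NS)
]
print(rows)
```

The mapping keeps the paths readable when several levels are involved, at the cost of having to pass NS to every call.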

Related

modify node and extract data from xml file in python

I am new to Python and am looking for advice on the best approach to the following task:
I have an XML file that looks like this:
<component xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.spiritconsortium.org/XMLSchema/SPIRIT/1685-2009 http://www.spiritconsortium.org/XMLSchema/SPIRIT/1685-2009/index.xsd">
<memoryMaps>
<memoryMap>
<name>name</name>
<description>description</description>
<peripheral>
<name>periph</name>
<description>description</description>
<baseAddress>0x0</baseAddress>
<range>0x8</range>
<width>32</width>
<register>
<name>reg1</name>
<displayName>reg1</displayName>
<addressOffset>0x0</addressOffset>
<size>32</size>
<access>read-write</access>
<reset>
<value>0x00000002</value>
<mask>0xFFFFFFFF</mask>
</reset>
<field>
</field>
</register>
</peripheral>
</memoryMap>
</memoryMaps>
</component>
I want to replace the "reset" node with two separate nodes, "resetValue" and "resetMask", which keep the data from "value" and "mask" respectively, as follows:
........
<access>read-write</access>
<resetValue>0x00000002</resetValue>
<resetMask>0xFFFFFFFF</resetMask>
<field>
.............
I have managed to parse my XML file, but I don't know how to start on this modification.
Thanks for any guidance.
The code below creates two sub-elements under 'register' and removes the now unneeded 'reset' element:
import xml.etree.ElementTree as ET
xml = '''<component xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.spiritconsortium.org/XMLSchema/SPIRIT/1685-2009 http://www.spiritconsortium.org/XMLSchema/SPIRIT/1685-2009/index.xsd">
<memoryMaps>
<memoryMap>
<name>name</name>
<description>description</description>
<peripheral>
<name>periph</name>
<description>description</description>
<baseAddress>0x0</baseAddress>
<range>0x8</range>
<width>32</width>
<register>
<name>reg1</name>
<displayName>reg1</displayName>
<addressOffset>0x0</addressOffset>
<size>32</size>
<access>read-write</access>
<reset>
<value>0x00000002</value>
<mask>0xFFFFFFFF</mask>
</reset>
<field>
</field>
</register>
</peripheral>
</memoryMap>
</memoryMaps>
</component>'''
root = ET.fromstring(xml)
register = root.find('.//register')
value = register.find('.//reset/value').text
mask = register.find('.//reset/mask').text
v = ET.SubElement(register, 'resetValue')
v.text = value
m = ET.SubElement(register, 'resetMask')
m.text = mask
register.remove(register.find('reset'))
ET.dump(root)
Output (the new elements are appended at the end of 'register', after 'field'; ET.dump does not print an XML declaration):
<component xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.spiritconsortium.org/XMLSchema/SPIRIT/1685-2009 http://www.spiritconsortium.org/XMLSchema/SPIRIT/1685-2009/index.xsd">
<memoryMaps>
<memoryMap>
<name>name</name>
<description>description</description>
<peripheral>
<name>periph</name>
<description>description</description>
<baseAddress>0x0</baseAddress>
<range>0x8</range>
<width>32</width>
<register>
<name>reg1</name>
<displayName>reg1</displayName>
<addressOffset>0x0</addressOffset>
<size>32</size>
<access>read-write</access>
<field />
<resetValue>0x00000002</resetValue>
<resetMask>0xFFFFFFFF</resetMask>
</register>
</peripheral>
</memoryMap>
</memoryMaps>
</component>
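The answer above appends the new elements after field, while the question asks for them in the place where reset used to be. A minimal sketch of one way to keep the position, assuming Python 3 and a trimmed register element; the helper name flatten_reset is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed register element from the question.
XML = """<register>
  <access>read-write</access>
  <reset>
    <value>0x00000002</value>
    <mask>0xFFFFFFFF</mask>
  </reset>
  <field></field>
</register>"""


def flatten_reset(register):
    """Replace <reset> with <resetValue>/<resetMask> at the same position."""
    reset = register.find("reset")
    if reset is None:
        return
    index = list(register).index(reset)
    pairs = [("value", "resetValue"), ("mask", "resetMask")]
    for offset, (child_tag, new_tag) in enumerate(pairs):
        new = ET.Element(new_tag)
        new.text = reset.findtext(child_tag)
        new.tail = reset.tail  # reuse the whitespace that followed <reset>
        register.insert(index + offset, new)
    register.remove(reset)


root = ET.fromstring(XML)
for register in root.iter("register"):
    flatten_reset(register)
print([child.tag for child in root])
```

Element.insert() takes a child index, so looking up the index of reset first is what keeps the document order intact.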

How do you properly fetch from this nested XML?

I have the following XML:
<?xml version="1.0" encoding="UTF-8"?>
<data>
<columns>
<Leftover index="5">Leftover</Leftover>
<NODE5 index="6"></NODE5>
<NODE6 index="7"></NODE6>
<NODE8 index="9"></NODE8>
<Nomenk__Nr_ index="2">Nomenk.
Nr.</Nomenk__Nr_>
<Year index="8">2020</Year>
<Name index="1">Name</Name>
<Value_code index="3">Value code</Value_code>
</columns>
<records>
<record index="1">
<Leftover>Leftover</Leftover>
<NODE5>Test1</NODE5>
<NODE6>Test2</NODE6>
<NODE8>Test3</NODE8>
<Nomenk__Nr_></Nomenk__Nr_>
<Name></Name>
<Value_code></Value_code>
</record>
... (it repeats itself with different values and the index value increments)
My code is:
import lxml
import lxml.etree as et
xml = open('C:\outputfile.xml', 'rb')
xml_content = xml.read()
tree = et.fromstring(xml_content)
for bad in tree.xpath("//records[@index=\'*\']/NODE5"):
    bad.getparent().remove(bad) # here I grab the parent of the element to call the remove directly on it
result = (et.tostring(tree, pretty_print=True, xml_declaration=True))
f = open( 'outputxml.xml', 'w' )
f.write( str(result) )
f.close()
What I need to do is to remove the NODE5, NODE6, NODE8. I tried using a wildcard and then specifying one of the nodes (see line 6) but that seems to not have worked... I'm also getting a syntax error right after the loop on the first character but the code executes.
My problem is also that the encoding by lxml is set to ASCII afterwards when the file is "exported".
UPDATE
I am getting this error on line 8:
return = ...
^
SyntaxError: invalid syntax
I took some code from https://stackoverflow.com/a/7981894/1987598
What I need to do is to remove the NODE5, NODE6, NODE8.
See the code below:
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<data>
<columns>
<Leftover index="5">Leftover</Leftover>
<NODE5 index="6" />
<NODE6 index="7" />
<NODE8 index="9" />
<Nomenk__Nr_ index="2">Nomenk.
Nr.</Nomenk__Nr_>
<Year index="8">2020</Year>
<Name index="1">Name</Name>
<Value_code index="3">Value code</Value_code>
</columns>
<records>
<record index="1">
<Leftover>Leftover</Leftover>
<NODE5>Test1</NODE5>
<NODE6>Test2</NODE6>
<NODE8>Test3</NODE8>
<Nomenk__Nr_ />
<Name />
<Value_code />
</record>
<record index="21">
<Leftover>Leftover</Leftover>
<NODE5>Test11</NODE5>
<NODE6>Test21</NODE6>
<NODE8>Test39</NODE8>
<Nomenk__Nr_ />
<Name />
<Value_code />
</record>
</records>
</data>'''
root = ET.fromstring(xml)
# remove the unwanted column definitions
col = root.find('./columns')
for x in ['5', '6', '8']:
    for node in col.findall('./NODE{}'.format(x)):
        col.remove(node)
# remove the same nodes from every record
records = root.find('./records')
for r in records.findall('./record'):
    for x in ['5', '6', '8']:
        for node in r.findall('./NODE{}'.format(x)):
            r.remove(node)
ET.dump(root)
output
<data>
<columns>
<Leftover index="5">Leftover</Leftover>
<Nomenk__Nr_ index="2">Nomenk.
Nr.</Nomenk__Nr_>
<Year index="8">2020</Year>
<Name index="1">Name</Name>
<Value_code index="3">Value code</Value_code>
</columns>
<records>
<record index="1">
<Leftover>Leftover</Leftover>
<Nomenk__Nr_ />
<Name />
<Value_code />
</record>
<record index="21">
<Leftover>Leftover</Leftover>
<Nomenk__Nr_ />
<Name />
<Value_code />
</record>
</records>
</data>
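The question also mentions that the exported file ends up ASCII-encoded. The following is only a sketch, assuming Python 3 and the standard library instead of lxml: it removes the unwanted tags wherever they occur and serializes to UTF-8 bytes with an explicit declaration. The names UNWANTED, buffer and xml_bytes are illustrative.

```python
import io
import xml.etree.ElementTree as ET

# Shortened version of the question's document.
XML = """<data>
  <columns>
    <Leftover index="5">Leftover</Leftover>
    <NODE5 index="6" />
    <Year index="8">2020</Year>
  </columns>
  <records>
    <record index="1">
      <Leftover>Leftover</Leftover>
      <NODE5>Test1</NODE5>
      <NODE6>Test2</NODE6>
      <NODE8>Test3</NODE8>
      <Name />
    </record>
  </records>
</data>"""

UNWANTED = {"NODE5", "NODE6", "NODE8"}

root = ET.fromstring(XML)
# Snapshot parents and children first, because removing while iterating skips items.
for parent in list(root.iter()):
    for child in list(parent):
        if child.tag in UNWANTED:
            parent.remove(child)

# Serialize as UTF-8 bytes with a declaration; a real file would be opened in "wb" mode.
buffer = io.BytesIO()
ET.ElementTree(root).write(buffer, encoding="utf-8", xml_declaration=True)
xml_bytes = buffer.getvalue()
print(xml_bytes.decode("utf-8"))
```

Writing bytes rather than str(result) avoids the b'...' wrapper that str() puts around a bytes object in Python 3.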

How can I preserve whitespaces with python 2.7 lxml?

I have a huge xml file (thousands of lines) and I need to change some attribute parameters.
Xml looks like this:
<person id="name" name="pers_name">
<group id="Common">
<emotion id="smile">
<texture texture="smile" x="-131" y="-17" />
<effect name="name1" x="51" y="438" />
<effect name="name2" x="61" y="419" />
<effect name="name3" x="55" y="312" />
</emotion>
</group>
</person>
After I made the changes and wrote the file with tree.write(path, encoding='utf-8', xml_declaration=True), I lost the whitespace before the closing "/>" of each self-closing tag.
How can I preserve it?
<person id="name" name="pers_name">
<group id="Common">
<emotion id="smile">
<texture texture="smile" x="-131" y="-17"/>
<effect name="name1" x="51" y="438"/>
<effect name="name2" x="61" y="419"/>
<effect name="name3" x="55" y="312"/>
</emotion>
</group>
</person>
Code
from lxml import etree
# Offsets
x_offset = -10
y_offset = -20
tree = etree.parse(path)
XML = tree.getroot()
for effect in XML.iter('effect'):
    texture_offset_x = int(effect.get('texture_offset_x')) + x_offset
    texture_offset_y = int(effect.get('texture_offset_y')) + y_offset
    effect.set('texture_offset_x', str(texture_offset_x))
    effect.set('texture_offset_y', str(texture_offset_y))
tree.write(path, encoding='utf-8', xml_declaration=True)
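One possible workaround, if the standard library is an acceptable substitute for lxml: xml.etree.ElementTree writes empty elements with a space before "/>". The sketch below assumes Python 3 and shifts the x and y attributes, because those are the only coordinates shown in the sample XML; this is an assumption, since the question's code reads texture_offset_x and texture_offset_y.

```python
import xml.etree.ElementTree as ET

# Shortened version of the question's document.
XML = """<emotion id="smile">
  <texture texture="smile" x="-131" y="-17" />
  <effect name="name1" x="51" y="438" />
</emotion>"""

X_OFFSET = -10
Y_OFFSET = -20

root = ET.fromstring(XML)
for effect in root.iter("effect"):
    # The sample XML only shows x and y, so those are shifted here (an assumption).
    effect.set("x", str(int(effect.get("x")) + X_OFFSET))
    effect.set("y", str(int(effect.get("y")) + Y_OFFSET))

# ElementTree serializes empty elements as "<tag ... />", with the space.
result = ET.tostring(root, encoding="unicode")
print(result)
```

Note that this is a property of the serializer, not true whitespace preservation: an element written as an explicit open/close pair with no content would also come out in the short form.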

python element tree iterparse filter nodes and children

I am trying to use ElementTree's iterparse function to filter nodes based on their text and write them to a new file. I am using iterparse because the input file is large (100+ MB).
input.xml
<xmllist>
<page id="1">
<title>movie title 1</title>
<text>this is a movie in theatres</text>
</page>
<page id="2">
<title>movie title 2</title>
<text>this is a horror film</text>
</page>
<page id="3">
<title></title>
<text>actor in film</text>
</page>
<page id="4">
<title>some other topic</title>
<text>nothing related</text>
</page>
</xmllist>
Expected output (all pages where the text has "movie" or "film" in them)
<xmllist>
<page id="1">
<title>movie title 1</title>
<text>this is a movie in theatres</text>
</page>
<page id="2">
<title>movie title 2</title>
<text>this is a horror film</text>
</page>
<page id="3">
<title></title>
<text>actor in film</text>
</page>
</xmllist>
Current code
import xml.etree.cElementTree as etree
from xml.etree.cElementTree import dump
output_file=open('/tmp/outfile.xml','w')
for event, elem in iter(etree.iterparse("/tmp/test.xml", events=('start','end'))):
    if event == "end" and elem.tag == "page": #need to add condition to search for strings
        output_file.write(elem)
        elem.clear()
How do I add the regular expression to filter based on page's text attribute?
You're looking for a child, not an attribute, so it's simplest to analyze the title as it "passes by" in the iteration and remember the result until you get the end of the resulting page:
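import re
good_page = False
for event, elem in etree.iterparse("/tmp/test.xml", events=('start', 'end')):
    if event == 'end':
        if elem.tag == 'title':
            # elem.text is None for an empty <title>, so fall back to ''
            good_page = re.search(r'film|movie', elem.text or '')
        elif elem.tag == 'page':
            if good_page:
                # a file cannot write an Element, so serialize it first
                output_file.write(etree.tostring(elem))
            good_page = False
            # clear only finished pages, so the title is still there when the page is written
            elem.clear()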
The re.search will return None if not found, and the if treats that as false, so we're avoiding the writing of pages without a title as well as ones whose title's text does not match your desired RE.
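The answer above tests only the title, while the expected output in the question also keeps page 3, whose title is empty and whose text child contains "film". A hedged sketch of a variant that checks both children, assuming Python 3 and a shortened in-memory copy of the input; the names PATTERN, matching_pages and kept are hypothetical.

```python
import io
import re
import xml.etree.ElementTree as ET

# Shortened in-memory stand-in for the large input file.
XML = b"""<xmllist>
<page id="1"><title>movie title 1</title><text>this is a movie in theatres</text></page>
<page id="3"><title></title><text>actor in film</text></page>
<page id="4"><title>some other topic</title><text>nothing related</text></page>
</xmllist>"""

PATTERN = re.compile(r"film|movie")


def matching_pages(source):
    """Yield serialized <page> elements whose title or text matches PATTERN."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "page":
            continue
        # findtext returns None for a missing child and '' for an empty one.
        fields = (elem.findtext("title") or "", elem.findtext("text") or "")
        if any(PATTERN.search(field) for field in fields):
            yield ET.tostring(elem, encoding="unicode")
        elem.clear()  # drop the page's children once it has been handled


kept = list(matching_pages(io.BytesIO(XML)))
print(len(kept))
```

Waiting for the end event of page means both children are complete when they are inspected, so no flag has to be carried between events. To build the output file, the yielded strings would be written between an opening and a closing xmllist tag.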

Parse xml with lxml - extract element value

Let's suppose we have the XML file with the structure as follows.
<?xml version="1.0" ?>
<searchRetrieveResponse xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/zing/srw/ http://www.loc.gov/standards/sru/sru1-1archive/xml-files/srw-types.xsd" xmlns="http://www.loc.gov/zing/srw/">
<records xmlns:ns1="http://www.loc.gov/zing/srw/">
<record>
<recordData>
<record xmlns="">
<datafield tag="000">
<subfield code="a">123</subfield>
<subfield code="b">456</subfield>
</datafield>
<datafield tag="001">
<subfield code="a">789</subfield>
<subfield code="b">987</subfield>
</datafield>
</record>
</recordData>
</record>
<record>
<recordData>
<record xmlns="">
<datafield tag="000">
<subfield code="a">123</subfield>
<subfield code="b">456</subfield>
</datafield>
<datafield tag="001">
<subfield code="a">789</subfield>
<subfield code="b">987</subfield>
</datafield>
</record>
</recordData>
</record>
</records>
</searchRetrieveResponse>
I need to parse out:
The content of the "subfield" (e.g. 123 in the example above) and
Attribute values (e.g. 000 or 001)
I wonder how to do that using lxml and XPath. My initial code is pasted below; could someone explain how to parse out these values?
import urllib, urllib2
from lxml import etree
url = "https://dl.dropbox.com/u/540963/short_test.xml"
fp = urllib2.urlopen(url)
doc = etree.parse(fp)
fp.close()
ns = {'xsi':'http://www.loc.gov/zing/srw/'}
for record in doc.xpath('//xsi:record', namespaces=ns):
    print record.xpath("xsi:recordData/record/datafield[@tag='000']", namespaces=ns)
I would be more direct in your XPath: go straight for the elements you want, in this case datafield.
>>> for df in doc.xpath('//datafield'):
...     # Iterate over attributes of datafield
...     for attrib_name in df.attrib:
...         print '@' + attrib_name + '=' + df.attrib[attrib_name]
...     # subfield is a child of datafield, and iterate
...     subfields = df.getchildren()
...     for subfield in subfields:
...         print 'subfield=' + subfield.text
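Also, no namespace prefix is needed here. lxml is not ignoring the namespace: the inner record element declares xmlns="", which resets the default namespace, so datafield and subfield are in no namespace at all.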
I would just go with
for df in doc.xpath('//datafield'):
    print df.attrib
    for sf in df.getchildren():
        print sf.text
Also, you don't need urllib: etree.parse() can fetch the XML directly over HTTP.
url = "http://dl.dropbox.com/u/540963/short_test.xml" #doesn't work with https though
doc = etree.parse(url)
Try the following working code:
import urllib2
from lxml import etree
url = "https://dl.dropbox.com/u/540963/short_test.xml"
fp = urllib2.urlopen(url)
doc = etree.parse(fp)
fp.close()
for record in doc.xpath('//datafield'):
    print record.xpath("./@tag")[0]
    for x in record.xpath("./subfield/text()"):
        print "\t", x
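For comparison, here is a minimal sketch of the same extraction with the standard library, assuming Python 3 and a shortened inline copy of the document. It spells out which steps of the path are namespaced and which are not; the prefix srw and the names SRW and fields are arbitrary choices for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened inline copy of the question's document (one outer record).
XML = """<searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/">
  <records>
    <record>
      <recordData>
        <record xmlns="">
          <datafield tag="000">
            <subfield code="a">123</subfield>
            <subfield code="b">456</subfield>
          </datafield>
          <datafield tag="001">
            <subfield code="a">789</subfield>
            <subfield code="b">987</subfield>
          </datafield>
        </record>
      </recordData>
    </record>
  </records>
</searchRetrieveResponse>"""

SRW = {"srw": "http://www.loc.gov/zing/srw/"}  # "srw" is an arbitrary prefix

root = ET.fromstring(XML)
fields = {}
# Outer elements are namespaced; the inner record and its children are not.
path = "srw:records/srw:record/srw:recordData/record/datafield"
for datafield in root.findall(path, SRW):
    fields[datafield.get("tag")] = {
        sub.get("code"): sub.text for sub in datafield.findall("subfield")
    }
print(fields)
```

Keying the result by the tag attribute and then by code gives direct access to a value, for example fields["000"]["a"].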
