Python ElementTree - Search children/grandchildren in poorly written XML

I'm trying to parse a poorly structured XML file and output the node name and the content of a tag (only if it exists), and only if the String element with name="content" reports more than 30 day(s).
So far I can search the child elements using ElementTree, but I need help with the deeply nested information. I can't change the XML because it's a vendor-provided report. I'm a complete newbie, so please coach me on what I need to do, or on what I should provide to get better help. Thanks in advance.
Example File:
<?xml version="1.0" encoding="UTF-8"?>
<ReportSection>
<ReportHead>
<Criteria>
<HeadStuff value="Dont Care">
</HeadStuff>
</Criteria>
</ReportHead>
<ReportBody>
<ReportSection name="UpTime" category="rule">
<ReportSection name="NodeName.domain.net" category="node">
<String name="node">NodeName.domain.net</String>
<String name="typeName">Windows Server</String>
<OID>-1y2p0ij32e8c8:-1y2p0idhghwg6</OID>
<ReportSection name="UpTime" category="element">
<ReportSection name="2015-09-20 18:50:10.0" category="version">
<String name="version">UpTime</String>
<OID>-1y2p0ij32e8cj:-1y2p0ibspofhp</OID>
<Integer name="changeType">2</Integer>
<String name="changeTypeName">Modified</String>
<Timestamp name="changeTime" displayvalue="9/20/15 6:50 PM">1442793010000</Timestamp>
<ReportSection name="versionContent" category="versionContent">
<String name="content">12 day(s), 7 hour(s), 33 minute(s), 8 second(s)</String>
<String name="content"></String>
</ReportSection>
</ReportSection>
</ReportSection>
</ReportSection>
</ReportSection>
</ReportBody>
</ReportSection>

The idea is to locate the content node, extract the number of days, check the value if needed, and then locate the node name. Example (using lxml.etree):
import re
from lxml import etree
pattern = re.compile(r"^(\d+) day\(s\)")
# pass bytes: lxml rejects str input that carries an encoding declaration
data = b"""your XML here"""
tree = etree.fromstring(data)
# XPath attribute predicates use "@", e.g. [@name='content']
content = tree.findtext(".//String[@name='content']")
if content:
    match = pattern.search(content)
    if match:
        days = int(match.group(1))
        # TODO: check the days if needed
        node = tree.findtext(".//String[@name='node']")
        print(node, days)
Prints:
NodeName.domain.net 12
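
If the report holds several nodes and only those above the 30-day threshold matter, the same idea can be applied per node section. The following is a minimal sketch using the standard-library ElementTree; the trimmed two-node sample, the host names, and the helper name nodes_up_longer_than are assumptions made for illustration, not part of the vendor report.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed-down, hypothetical sample with two nodes; the nesting mirrors the report above.
SAMPLE = """<ReportSection>
  <ReportBody>
    <ReportSection name="UpTime" category="rule">
      <ReportSection name="a.domain.net" category="node">
        <String name="node">a.domain.net</String>
        <ReportSection name="versionContent" category="versionContent">
          <String name="content">12 day(s), 7 hour(s), 33 minute(s), 8 second(s)</String>
          <String name="content"></String>
        </ReportSection>
      </ReportSection>
      <ReportSection name="b.domain.net" category="node">
        <String name="node">b.domain.net</String>
        <ReportSection name="versionContent" category="versionContent">
          <String name="content">45 day(s), 1 hour(s), 2 minute(s), 3 second(s)</String>
        </ReportSection>
      </ReportSection>
    </ReportSection>
  </ReportBody>
</ReportSection>"""

DAYS = re.compile(r"^(\d+) day\(s\)")


def nodes_up_longer_than(root, threshold=30):
    """Return (node name, days) pairs whose uptime exceeds `threshold` days."""
    results = []
    # Visit every node-level section, however deeply it is nested.
    for section in root.iterfind(".//ReportSection[@category='node']"):
        name = section.findtext("String[@name='node']")
        # Check each content string below this node; empty ones never match.
        for content in section.iterfind(".//String[@name='content']"):
            match = DAYS.match(content.text or "")
            if match and int(match.group(1)) > threshold:
                results.append((name, int(match.group(1))))
                break
    return results


root = ET.fromstring(SAMPLE)
for name, days in nodes_up_longer_than(root):
    print(name, days)
```

Treating a missing or empty content string as "no match" keeps the loop from failing on the empty String elements the report contains.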

Related

Python XML Element Tree finding the value of an XML tag

I'm trying to retrieve the value of a particular xml tag in an XML file. The problem is that it returns a memory address instead of the actual value.
I have already tried multiple approaches, including other libraries, but none of them produced the expected result.
from xml.etree import ElementTree
tree = ElementTree.parse('C:\\Users\\Sid\\Desktop\\Test.xml')
root = tree.getroot()
items = root.find("items")
item = items.find("item")
print(item)
The expected output was 1 2 3 4, but the actual output is a memory address (the repr of the Element object).
XML File is :
<data>
<items>
<item>1</item>
</items>
<items>
<item>2</item>
</items>
<items>
<item>3</item>
</items>
<items>
<item>4</item>
</items>
</data>
Using BeautifulSoup:
from bs4 import BeautifulSoup
test = '''<data>
<items>
<item>1</item>
</items>
<items>
<item>2</item>
</items>
<items>
<item>3</item>
</items>
<items>
<item>4</item>
</items>
</data>'''
soup = BeautifulSoup(test, 'html.parser')
data = soup.find_all("item")
for d in data:
    print(d.text)
OUTPUT:
1
2
3
4
Using XML Element Tree:
from xml.etree import ElementTree
tree = ElementTree.parse('list.txt')
root = tree.getroot()
items = root.findall("items")
for elem in items:
    desired_tag = elem.find("item")
    print(desired_tag.text)
OUTPUT:
1
2
3
4
EDIT:
If you want them printed on one line separated by spaces, replace the print call inside the loop with:
print(desired_tag.text, end=" ")
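
The "memory address" in the question comes from printing the Element object itself rather than its .text, and from find() returning only the first match. A minimal sketch of the difference, using an inline copy of the XML from the question so that no file is needed:

```python
import xml.etree.ElementTree as ET

XML = """<data>
  <items><item>1</item></items>
  <items><item>2</item></items>
  <items><item>3</item></items>
  <items><item>4</item></items>
</data>"""

root = ET.fromstring(XML)

# find() returns the first matching Element object, not its text.
first = root.find("items/item")
print(first)        # something like <Element 'item' at 0x...>
print(first.text)   # 1

# iter() walks the whole tree, so every <item> is visited.
values = [item.text for item in root.iter("item")]
print(" ".join(values))  # 1 2 3 4
```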

parse a section of an XML file with python

I'm new to both Python and XML. I have looked at the previous posts on the topic, but I can't figure out how to do exactly what I need, although it seems simple enough in principle.
<Project>
<Items>
<Item>
<Code>A456B</Code>
<Database>
<Data>
<Id>mountain</Id>
<Value>12000</Value>
</Data>
<Data>
<Id>UTEM</Id>
<Value>53.2</Value>
</Data>
</Database>
</Item>
<Item>
<Code>A786C</Code>
<Database>
<Data>
<Id>mountain</Id>
<Value>5000</Value>
</Data>
<Data>
<Id>UTEM</Id>
<Value></Value>
</Data>
</Database>
</Item>
</Items>
</Project>
All I want to do is extract all of the Codes, Values and Ids, which is no problem.
import xml.etree.ElementTree as ET
name = 'example tree.xml'
tree = ET.parse(name)
root = tree.getroot()
codes = []
ids = []
val = []
for db in root.iter('Code'):
    codes.append(db.text)
for ID in root.iter('Id'):
    ids.append(ID.text)
for VALUE in root.iter('Value'):
    val.append(VALUE.text)
print(codes)
print(ids)
print(val)
['A456B', 'A786C']
['mountain', 'UTEM', 'mountain', 'UTEM']
['12000', '53.2', '5000', None]
I want to know which Ids and Values go with which Code. Something like a dictionary of dictionaries maybe OR perhaps a list of DataFrames with the row index being the Id, and the column header being Code.
for example
A456B = {mountain:12000, UTEM:53.2}
A786C = {mountain:5000, UTEM: None}
Eventually I want to use the Values to feed an equation.
Note that the real xml file might not contain the same number of Ids and Values in each Code. Also, Id and Value might be different from one Code section to another.
Sorry if this question is elementary, or unclear...I've only been doing python for a month :/
BeautifulSoup is a very useful module for parsing HTML and XML.
from bs4 import BeautifulSoup
import os
# read the file into a BeautifulSoup object
# (html.parser lowercases tag names, hence 'item', 'code' and 'data' below)
with open(os.path.join(os.getcwd(), "input.txt")) as f:
    soup = BeautifulSoup(f, "html.parser")
results = {}
# parse the data, and put it into a dict, where the values are dicts
for item in soup.find_all('item'):
    # assemble dicts on the fly using a dict comprehension:
    # http://stackoverflow.com/a/14507637/4400277
    results[item.code.text] = {data.id.text: data.value.text for data in item.find_all('data')}
>>> results
{u'A786C': {u'mountain': u'5000', u'UTEM': u''},
u'A456B': {u'mountain': u'12000', u'UTEM': u'53.2'}}
This might be what you want:
import xml.etree.ElementTree as ET
name = 'test.xml'
tree = ET.parse(name)
root = tree.getroot()
codes = {}
for item in root.iter('Item'):
    code = item.find('Code').text
    codes[code] = {}
    for datum in item.iter('Data'):
        value_elem = datum.find('Value')
        # a missing <Value> element and an empty one both yield None
        value = value_elem.text if value_elem is not None else None
        id_elem = datum.find('Id')
        # only record the pair when an <Id> element is present
        if id_elem is not None:
            codes[code][id_elem.text] = value
print(codes)
This produces:
{'A456B' : {'mountain' : '12000', 'UTEM' : '53.2'}, 'A786C' : {'mountain' : '5000', 'UTEM' : None}}
This iterates over all Item tags and, for each one, creates a dict key pointing to a dict of Id/Value pairs. An Id/Value pair is only created if the Id tag is present.
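
Since the Values are eventually meant to feed an equation, it may help to convert them to numbers while building the nested dict. The following is a sketch only: the helper to_number and the choice of float are assumptions, and the XML is an inline copy of the sample from the question.

```python
import xml.etree.ElementTree as ET

XML = """<Project><Items>
  <Item><Code>A456B</Code><Database>
    <Data><Id>mountain</Id><Value>12000</Value></Data>
    <Data><Id>UTEM</Id><Value>53.2</Value></Data>
  </Database></Item>
  <Item><Code>A786C</Code><Database>
    <Data><Id>mountain</Id><Value>5000</Value></Data>
    <Data><Id>UTEM</Id><Value></Value></Data>
  </Database></Item>
</Items></Project>"""


def to_number(text):
    """Convert element text to float; missing or empty text becomes None."""
    if text is None or not text.strip():
        return None
    return float(text)


root = ET.fromstring(XML)

# Outer key: Code; inner dict: Id -> numeric Value (or None).
table = {
    item.findtext("Code"): {
        data.findtext("Id"): to_number(data.findtext("Value"))
        for data in item.iter("Data")
    }
    for item in root.iter("Item")
}
print(table)
```

Because each inner dict is built from the Data elements of its own Item, Codes with different numbers of Ids, or different Ids altogether, are handled without special cases.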

How to copy certain information from a text file to XML using Python?

We get order e-mails whenever a buyer makes a purchase; these e-mails are sent in a text format with some relevant and some irrelevant information. I am trying to write a Python program which will read the text and then build an XML file (using ElementTree) which we can import into other software.
Unfortunately I do not quite know the proper terms for some of this, so please bear with the overlong explanations.
The problem is that I cannot figure out how to make it work with more than one product on the order. The program currently goes through each order and puts the data in a dictionary.
while file_length_dic != 0:
    # goes line by line and adds each value (and its name) to a dictionary
    # keys are the first half of a sentence followed by a distinguishing number
    for line in raw_email:
        colon_loc = line.index(':')
        end_loc = len(line)
        data_type = line[0:colon_loc] + "_" + file_length
        data_variable = line[colon_loc+2:end_loc].lstrip(' ')
        xml_dic[data_type] = data_variable
        # str.find() returns -1 (truthy) when nothing is found, so test membership instead
        if "URL" in line:
            break
    file_length_dic -= 1
How can I get these dictionary values into XML? For example, under the main "JOB" element there will be a sub-element ITEMNUMBER and then SALESMANN and QUANTITY. How can I fill out multiple sets?
<JOB>
<ITEM>
<ITEMNUMBER>36322</ITEMNUMBER>
<SALESMANN>17</SALESMANN>
<QUANTITY>2</QUANTITY>
</ITEM>
<ITEM>
<ITEMNUMBER>22388</ITEMNUMBER>
<SALESMANN>5</SALESMANN>
<QUANTITY>8</QUANTITY>
</ITEM>
</JOB>
As far as I can tell, ElementTree will only let me put the data into the first set of children, but I can't imagine that is really the case. I also do not know in advance how many items are in each order; it can be anywhere from 1 to 150, and the program needs to scale easily.
Should I be using a different library? lxml looks powerful but again, I do not know what it is exactly I am looking for.
Here's a simple example. Note that the basic ElementTree doesn't pretty print, so I included a pretty print function from the ElementTree author.
If you provide an actual example of the input file and dictionary, it would be easier to target your specific case. I just put some data in a dictionary to show how to iterate over it and generate some XML.
from xml.etree import ElementTree as et
def indent(elem, level=0):
    # pretty-print helper: rewrites text/tail whitespace in place
    i = "\n" + level*"  "
    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + "  "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for elem in elem:
            indent(elem, level+1)
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i
D = {36322:(17,2),22388:(5,8)}
job = et.Element('JOB')
for itemnumber,(salesman,quantity) in D.items():
    et.SubElement(job,'ITEMNUMBER').text = str(itemnumber)
    et.SubElement(job,'SALESMAN').text = str(salesman)
    et.SubElement(job,'QUANTITY').text = str(quantity)
indent(job)
et.dump(job)
Output:
<JOB>
<ITEMNUMBER>36322</ITEMNUMBER>
<SALESMAN>17</SALESMAN>
<QUANTITY>2</QUANTITY>
<ITEMNUMBER>22388</ITEMNUMBER>
<SALESMAN>5</SALESMAN>
<QUANTITY>8</QUANTITY>
</JOB>
Although, as @alko mentioned, a more structured XML might be:
job = et.Element('JOB')
for itemnumber,(salesman,quantity) in D.items():
    item = et.SubElement(job,'ITEM')
    et.SubElement(item,'NUMBER').text = str(itemnumber)
    et.SubElement(item,'SALESMAN').text = str(salesman)
    et.SubElement(item,'QUANTITY').text = str(quantity)
Output:
<JOB>
<ITEM>
<NUMBER>36322</NUMBER>
<SALESMAN>17</SALESMAN>
<QUANTITY>2</QUANTITY>
</ITEM>
<ITEM>
<NUMBER>22388</NUMBER>
<SALESMAN>5</SALESMAN>
<QUANTITY>8</QUANTITY>
</ITEM>
</JOB>
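
To connect this back to the e-mail text, here is a rough sketch that goes from "Key: value" lines to one ITEM element per product, however many there are. The e-mail layout, the field names in FIELD_TAGS, and the rule that each "Item number" line starts a new item are all assumptions, since the question does not show a real e-mail.

```python
import xml.etree.ElementTree as ET

# Hypothetical e-mail body: one "Key: value" line per field.
RAW_EMAIL = """\
Item number: 36322
Salesman: 17
Quantity: 2
Item number: 22388
Salesman: 5
Quantity: 8
URL: http://example.com/order/1
"""

# Maps the (assumed) e-mail field names to XML tag names.
FIELD_TAGS = {"Item number": "NUMBER", "Salesman": "SALESMAN", "Quantity": "QUANTITY"}


def parse_items(text):
    """Collect one dict per item; a new item starts at each 'Item number' line."""
    items = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if not sep or key not in FIELD_TAGS:
            continue  # skip irrelevant lines such as the URL
        if key == "Item number":
            items.append({})
        if items:
            items[-1][FIELD_TAGS[key]] = value.strip()
    return items


def build_job(items):
    """Create <JOB> with one <ITEM> child per parsed item."""
    job = ET.Element("JOB")
    for fields in items:
        item = ET.SubElement(job, "ITEM")
        for tag, value in fields.items():
            ET.SubElement(item, tag).text = value
    return job


job = build_job(parse_items(RAW_EMAIL))
print(ET.tostring(job, encoding="unicode"))
```

Keeping a list of per-item dicts, rather than one flat dictionary with numbered keys, means the XML step is the same loop whether the order has 1 item or 150.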
Your XML structure does not seem valid to me. How can one tell which salesman refers to which item number?
Probably, you need something like
<JOB>
<ITEM>
<NUMBER>36322</NUMBER>
<SALESMANN>17</SALESMANN>
<QUANTITY>2</QUANTITY>
</ITEM>
<ITEM>
<NUMBER>22388</NUMBER>
<SALESMANN>5</SALESMANN>
<QUANTITY>8</QUANTITY>
</ITEM>
</JOB>
For a list of serialization techniques, refer to Serialize Python dictionary to XML
Sample with dicttoxml:
import dicttoxml
from xml.dom.minidom import parseString
xml = dicttoxml.dicttoxml({'JOB':[{'NUMBER':36322,
'QUANTITY': 2,
'SALESMANN': 17}
]}, root=False)
dom = parseString(xml)
and output
>>> print(dom.toprettyxml())
<?xml version="1.0" ?>
<JOB type="list">
<item type="dict">
<SALESMANN type="int">
17
</SALESMANN>
<NUMBER type="int">
36322
</NUMBER>
<QUANTITY type="int">
2
</QUANTITY>
</item>
</JOB>

Getting unique value when the same tag is in children's tree in XML with Python

I have a getElementText function, shown below, which works well with [0] because the XML I have been working on doesn't contain duplicate tags.
from xml.dom import minidom
def getElementText(element, tagName):
    return str(element.getElementsByTagName(tagName)[0].firstChild.data)
doc = minidom.parse("/Users/smcho/Desktop/hello.xml")
outputTree = doc.getElementsByTagName("Output")[0]
print(getElementText(outputTree, "Number"))
However, when I parse the following XML, getElementText(outputTree, "Number") returns the value from <ConnectedTerminal><Number>1</Number></ConnectedTerminal> rather than from <Number>0</Number>, because the function returns the first of the two elements with the tag "Number".
<Output>
<ConnectedTerminal>
<Node>5</Node>
<Number>1</Number>
</ConnectedTerminal>
<Type>int8</Type>
<Number>0</Number>
</Output>
Is there any solution to this problem? Is there any way to get only <Number>0</Number> or only <ConnectedTerminal><Number>1</Number></ConnectedTerminal>?
If lxml is an option (it's much nicer than minidom), you can do:
from lxml import etree
doc = etree.fromstring(xml)
node = doc.find('Number')
print(node.text)  # 0
node = doc.xpath('//ConnectedTerminal/Number')[0]
print(node.text)  # 1
Also see the xpath tutorial.
There's not a direct DOM method to do this, no. But it's fairly easy to write one:
def getChildElementsByTagName(element, tag):
    children = []
    for child in element.childNodes:
        if child.nodeType == child.ELEMENT_NODE and tag in (child.tagName, '*'):
            children.append(child)
    return children
Plus here's a safer text-getting function, so you don't have to worry about multiple nodes, missing nodes due to blank strings, or CDATA sections.
def getTextContent(element):
    texts = []
    for child in element.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            texts.append(getTextContent(child))
        elif child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE):
            texts.append(child.data)
    return u''.join(texts)
then just:
>>> getTextContent(getChildElementsByTagName(outputTree, u'Number')[0])
u'0'
>>> getTextContent(getChildElementsByTagName(doc, u'Output')[0].getElementsByTagName(u'Number')[0])
u'1'
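
For comparison, the standard-library ElementTree makes the same distinction without a helper, because a bare tag name in find() or findtext() matches direct children only. A minimal sketch, with the XML from the question inlined:

```python
import xml.etree.ElementTree as ET

XML = """<Output>
  <ConnectedTerminal>
    <Node>5</Node>
    <Number>1</Number>
  </ConnectedTerminal>
  <Type>int8</Type>
  <Number>0</Number>
</Output>"""

output = ET.fromstring(XML)

# A bare tag name only looks at the direct children of <Output>.
direct = output.findtext("Number")
# A path selects the <Number> nested under <ConnectedTerminal>.
nested = output.findtext("ConnectedTerminal/Number")
# iter() walks all descendants in document order.
every = [n.text for n in output.iter("Number")]

print(direct, nested, every)  # 0 1 ['1', '0']
```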

Using XPath in ElementTree

My XML file looks like the following:
<?xml version="1.0"?>
<ItemSearchResponse xmlns="http://webservices.amazon.com/AWSECommerceService/2008-08-19">
<Items>
<Item>
<ItemAttributes>
<ListPrice>
<Amount>2260</Amount>
</ListPrice>
</ItemAttributes>
<Offers>
<Offer>
<OfferListing>
<Price>
<Amount>1853</Amount>
</Price>
</OfferListing>
</Offer>
</Offers>
</Item>
</Items>
</ItemSearchResponse>
All I want to do is extract the ListPrice.
This is the code I am using:
>> from xml.etree import ElementTree as ET
>> fp = open("output.xml","r")
>> element = ET.parse(fp).getroot()
>> e = element.findall('ItemSearchResponse/Items/Item/ItemAttributes/ListPrice/Amount')
>> for i in e:
>>     print(i.text)
>>
>> e
>>
Absolutely no output. I also tried
>> e = element.findall('Items/Item/ItemAttributes/ListPrice/Amount')
No difference.
What am I doing wrong?
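There are 2 problems that you have.
1) element is the root element, ItemSearchResponse itself (it is of type Element, not ElementTree), so search paths are relative to it and must not start with ItemSearchResponse.
2) Your search string needs to use namespaces if you keep the namespace in the XML.
To fix problem #1:
Drop the root tag from the search path, as in your second attempt. Optionally, you can also search from the tree rather than from the root element (ElementTree.findall searches relative to the root as well), by changing:
element = ET.parse(fp).getroot()
to:
element = ET.parse(fp)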
To fix problem #2:
You can take off the xmlns from the XML document so it looks like this:
<?xml version="1.0"?>
<ItemSearchResponse>
<Items>
<Item>
<ItemAttributes>
<ListPrice>
<Amount>2260</Amount>
</ListPrice>
</ItemAttributes>
<Offers>
<Offer>
<OfferListing>
<Price>
<Amount>1853</Amount>
</Price>
</OfferListing>
</Offer>
</Offers>
</Item>
</Items>
</ItemSearchResponse>
With this document you can use the following search string:
e = element.findall('Items/Item/ItemAttributes/ListPrice/Amount')
The full code:
from xml.etree import ElementTree as ET
fp = open("output.xml","r")
element = ET.parse(fp)
e = element.findall('Items/Item/ItemAttributes/ListPrice/Amount')
for i in e:
    print(i.text)
Alternate fix to problem #2:
Otherwise you need to specify the xmlns inside the search string for each element.
The full code:
from xml.etree import ElementTree as ET
fp = open("output.xml","r")
element = ET.parse(fp)
namespace = "{http://webservices.amazon.com/AWSECommerceService/2008-08-19}"
e = element.findall('{0}Items/{0}Item/{0}ItemAttributes/{0}ListPrice/{0}Amount'.format(namespace))
for i in e:
    print(i.text)
Both print:
2260
from xml.etree import ElementTree as ET
tree = ET.parse("output.xml")
namespace = tree.getroot().tag[1:].split("}")[0]
amount = tree.find(".//{%s}Amount" % namespace).text
Also, consider using lxml. It's way faster.
from lxml import etree as ET
ElementTree is namespace-aware, so all the elements in your XML have names like
{http://webservices.amazon.com/AWSECommerceService/2008-08-19}Items
So make the search include the namespace
e.g.
search = '{http://webservices.amazon.com/AWSECommerceService/2008-08-19}Items/{http://webservices.amazon.com/AWSECommerceService/2008-08-19}Item/{http://webservices.amazon.com/AWSECommerceService/2008-08-19}ItemAttributes/{http://webservices.amazon.com/AWSECommerceService/2008-08-19}ListPrice/{http://webservices.amazon.com/AWSECommerceService/2008-08-19}Amount'
element.findall( search )
gives the element corresponding to 2260
I ended up stripping out the xmlns from the raw XML like this:
import re
def strip_ns(xml_string):
    return re.sub('xmlns="[^"]+"', '', xml_string)
Obviously be very careful with this, but it worked well for me.
One of the most straightforward approaches, which works with Python 3 as well as other versions, is shown below.
It starts at the root and searches down through the tree until it finds the first
"Amount" tag:
from xml.etree import ElementTree as ET
tree = ET.parse('output.xml')
root = tree.getroot()
#print(root)
e = root.find(".//{http://webservices.amazon.com/AWSECommerceService/2008-08-19}Amount")
print(e.text)
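
The repeated {namespace} prefixes in the answers above can also be avoided by passing a prefix-to-URI mapping as the second argument of find()/findall(). A minimal sketch with the document from the question inlined; the prefix name "aws" is an arbitrary choice made here, not something defined by the XML.

```python
import xml.etree.ElementTree as ET

XML = """<?xml version="1.0"?>
<ItemSearchResponse xmlns="http://webservices.amazon.com/AWSECommerceService/2008-08-19">
  <Items>
    <Item>
      <ItemAttributes>
        <ListPrice><Amount>2260</Amount></ListPrice>
      </ItemAttributes>
      <Offers>
        <Offer>
          <OfferListing>
            <Price><Amount>1853</Amount></Price>
          </OfferListing>
        </Offer>
      </Offers>
    </Item>
  </Items>
</ItemSearchResponse>"""

# "aws" is a local alias for the default namespace of the document.
NS = {"aws": "http://webservices.amazon.com/AWSECommerceService/2008-08-19"}

root = ET.fromstring(XML)

# The path is relative to the root element, so it starts at Items.
list_prices = [
    a.text
    for a in root.findall(
        "aws:Items/aws:Item/aws:ItemAttributes/aws:ListPrice/aws:Amount", NS
    )
]
# ".//" matches at any depth, so this also picks up the offer price.
all_amounts = [a.text for a in root.iterfind(".//aws:Amount", NS)]

print(list_prices)  # ['2260']
print(all_amounts)  # ['2260', '1853']
```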
