Search for specific XML element Attribute values - python

I am using Python's ElementTree to construct and edit test messages.
Part of the XML is as follows:
<FIXML>
<TrdMtchRpt TrdID="$$+TrdID#" RptTyp="0" TrdDt="20120201" MtchTyp="4" LastMkt="ABCD" LastPx="104.11">
The TrdID attribute contains a value beginning with $$ to mark it as variable data that needs to be amended once the message is constructed from a template, in this case to the next sequential number. The numbers are stored in a dictionary: the overall idea is to load a dictionary from a file that lists each attribute key with its associated value, such as the next sequential number (e.g. the dictionary file contains $$+TrdID# 12345, using a space as the delimiter).
So far my script iterates over the parsed XML and examines each indexed element in turn. Several fields in the XML file will require updating, so I need to avoid hard-coded references to element tags.
How can I search the elements and their attributes to identify whether an attribute has a value that starts with or contains the specific string $$?
And for reasons unknown to me we cannot use lxml!

You can use XPath.
import lxml.etree as etree
from io import StringIO  # on Python 2: from StringIO import StringIO
xml = """<FIXML>
<TrdMtchRpt TrdID="$$+TrdID#"
RptTyp="0"
TrdDt="20120201"
MtchTyp="4"
LastMkt="ABCD"
LastPx="104.11"/>
</FIXML>"""
tree = etree.parse(StringIO(xml))
To find elements TrdMtchRpt where the attribute TrdID starts with $$:
r = tree.xpath("//TrdMtchRpt[starts-with(@TrdID, '$$')]")
r[0].tag == 'TrdMtchRpt'
r[0].get("TrdID") == '$$+TrdID#'
If you want to find any element where at least one attribute starts with $$, you can do this (the inner predicate tests every attribute, not only the first one):
r = tree.xpath("//*[@*[starts-with(., '$$')]]")
r[0].tag == 'TrdMtchRpt'
r[0].get("TrdID") == '$$+TrdID#'
Look at the documentation:
http://lxml.de/xpathxslt.html#the-xpath-method
http://www.w3schools.com/xpath/xpath_functions.asp#string
http://www.w3schools.com/xpath/xpath_syntax.asp

You can use the ElementTree package. It gives you an object with a hierarchical data structure built from the XML document.
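Since lxml is ruled out in the question, here is a minimal sketch of how this might look with the standard library's xml.etree.ElementTree. The function name fill_placeholders and the in-memory replacements dictionary are assumptions made for illustration; in practice the dictionary would be loaded from the space-delimited file the question describes.

```python
import xml.etree.ElementTree as ET

XML = """<FIXML>
<TrdMtchRpt TrdID="$$+TrdID#" RptTyp="0" TrdDt="20120201" MtchTyp="4" LastMkt="ABCD" LastPx="104.11"/>
</FIXML>"""

# Hypothetical lookup table standing in for the dictionary file.
replacements = {"$$+TrdID#": "12345"}


def fill_placeholders(root, values, marker="$$"):
    """Replace every attribute value that starts with `marker`.

    Returns a list of (tag, attribute, old_value) tuples for the changes made.
    """
    changed = []
    for elem in root.iter():  # visits every element, whatever its tag
        for name, value in list(elem.attrib.items()):
            if value.startswith(marker) and value in values:
                elem.set(name, values[value])
                changed.append((elem.tag, name, value))
    return changed


root = ET.fromstring(XML)
changes = fill_placeholders(root, replacements)
print(changes)                                 # [('TrdMtchRpt', 'TrdID', '$$+TrdID#')]
print(root.find("TrdMtchRpt").get("TrdID"))    # 12345
```

Because the loop checks every attribute of every element, no element tag or attribute name is hard-coded. Use `marker in value` instead of `value.startswith(marker)` if the marker may appear in the middle of a value.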

Related

Getting values from an XML file that has deep keys and values

I have a very large XML file produced by an application; part of its tree is as below:
There are several entries under 'item', from 0 to 7. They are always named with numbers, and the range can run from 0 to any number.
Each of these entries contains multiple elements, all with the same structure as in the tree above. Only the number of entries (0 to 7 here) varies; the rest of the structure stays the same.
Under each entry I have a <bbmds_questiontype> value, which can be Multiple Choice, Matching or Essay.
What I need is a list of the values of <mat_formattedtext>, i.e. the output is supposed to be:
<0>
<bbmds_questiontype>Multiple Choice</bbmds_questiontype>
<mat_formattedtext>This is first question </mat_formattedtext></0>
<1>
<bbmds_questiontype>Multiple Choice</bbmds_questiontype>
<mat_formattedtext>This is second question </mat_formattedtext> </1>
<2>
<bbmds_questiontype>Essay</bbmds_questiontype>
<mat_formattedtext>This is first question </mat_formattedtext> </2>
....
I have tried several solutions, including the XML tree and xmltodict, but they all get complicated because filters have to be applied across different branches of children.
import xmltodict
with open("C:/Users/SS/Desktop/moodlexml/00001_questions.dat") as fd:
    doc = xmltodict.parse(fd.read())
shortened = doc['questestinterop']['assessment']['section']['item']  # == u'an attribute'
Any advice will be appreciated to proceed further.
Have you tried parsing it with bs4? It's simple.
Check it out
https://linuxhint.com/parse_xml_python_beautifulsoup/
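As an alternative that needs no third-party package, here is a rough sketch with the standard library's ElementTree. The sample document is hypothetical: the question does not show the real nesting, so the wrapper elements itemmetadata and presentation are assumptions, and the code only relies on each item having a bbmds_questiontype and a mat_formattedtext somewhere below it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample; the real file's nesting is not shown.
SAMPLE = """<questestinterop><assessment><section>
  <item>
    <itemmetadata><bbmds_questiontype>Multiple Choice</bbmds_questiontype></itemmetadata>
    <presentation><mat_formattedtext>This is first question </mat_formattedtext></presentation>
  </item>
  <item>
    <itemmetadata><bbmds_questiontype>Essay</bbmds_questiontype></itemmetadata>
    <presentation><mat_formattedtext>This is second question </mat_formattedtext></presentation>
  </item>
</section></assessment></questestinterop>"""


def list_questions(root):
    """Return one (question_type, question_text) pair per <item>."""
    rows = []
    for item in root.iter("item"):
        # ".//" searches at any depth below the item; only the first match is used.
        qtype = item.find(".//bbmds_questiontype")
        text = item.find(".//mat_formattedtext")
        rows.append((
            qtype.text.strip() if qtype is not None and qtype.text else None,
            text.text.strip() if text is not None and text.text else None,
        ))
    return rows


questions = list_questions(ET.fromstring(SAMPLE))
for index, (qtype, text) in enumerate(questions):
    print(index, qtype, text)
```

For a very large file, ET.iterparse could replace ET.fromstring so that the whole tree does not have to be held in memory at once.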

Find for multiple tags' values with lxml

I am using lxml to parse an XML like this sample one:
<compounddef xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" id="d2/db7/class_foo" kind="class">
<compoundname>FooClass</compoundname>
<sectiondef kind="public-type">
<memberdef kind="typedef" id="d2/db7/class_bar">
<type><ref refid="d3/d73/struct_foo" kindref="compound">StructFoo</ref></type>
<definition>StructFooDefinition</definition>
</memberdef>
</sectiondef>
</compounddef>
I'm trying to get the element whose refid is "d3/d73/struct_foo" and whose <definition> contains the text "Foo".
There could be many refids with that value and many definitions containing Foo, but only one element has this combination.
I am able to first find all the elements with that refid and then filter this list by checking which of them contains "Foo" in the <definition>, but since I'm working with a really big XML file (~1GB) and the application is time sensitive, I wanted to avoid this.
I tried combining the various etree paths using the keyword 'and' or '//precede:...', but without success.
My last try was:
self.dox_tree_root_.xpath(".//compounddef[@kind = 'class']//memberdef[@kind='typedef'][/type/ref[@refid='%s'] and contains(definition, 'name')]" % (independent_type_refid, name)))
but it is giving me an error.
Is there a way to combine the two filters inside one command?
You can use XPath:
//a[.//ref[@refid="12345"] and contains(c, "Good")]
If I understand you correctly, this should get you close enough:
.//compounddef[@kind = 'class']//memberdef[@kind='typedef'][./type/ref[@refid='d3/d73/struct_foo']][contains(.//definition, 'Foo')]//definition
Output:
StructFooDefinition
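For comparison, the same combined filter can be written as a single pass in plain Python over the standard library's ElementTree, which has no contains() in its limited XPath subset. This is only a sketch of the logic, not a replacement for the lxml expression above; the function name find_definitions is made up for the example, and the sample is the fragment from the question.

```python
import xml.etree.ElementTree as ET

XML = """<compounddef xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" id="d2/db7/class_foo" kind="class">
<compoundname>FooClass</compoundname>
<sectiondef kind="public-type">
<memberdef kind="typedef" id="d2/db7/class_bar">
<type><ref refid="d3/d73/struct_foo" kindref="compound">StructFoo</ref></type>
<definition>StructFooDefinition</definition>
</memberdef>
</sectiondef>
</compounddef>"""


def find_definitions(root, refid, name):
    """Return <definition> texts of typedef members matching both conditions."""
    matches = []
    for compound in root.iter("compounddef"):  # includes the root element itself
        if compound.get("kind") != "class":
            continue
        for member in compound.iter("memberdef"):
            if member.get("kind") != "typedef":
                continue
            # Condition 1: a <type>/<ref> child carrying the wanted refid.
            has_ref = any(ref.get("refid") == refid
                          for ref in member.iterfind("type/ref"))
            # Condition 2: the <definition> text contains the wanted name.
            definition = member.findtext("definition", default="")
            if has_ref and name in definition:
                matches.append(definition)
    return matches


root = ET.fromstring(XML)
result = find_definitions(root, "d3/d73/struct_foo", "Foo")
print(result)  # ['StructFooDefinition']
```

Both conditions are tested on the same memberdef in one traversal, so no intermediate list of refid matches is built.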

Extracting XML Element and Attribute Data with Python 3

I'm looking to extract the values of a particular attribute from a particular element, using Python 3.
An example of the element in question (Atom3d):
<Atom3d ID="18" Mapping="43" Parent="2" Name="C7"
XYZ="0.0148299997672439,0.283699989318848,1.0291999578476" Connections="33,39"
TemperatureType="Isotropic" IsotropicTemperature="0.0677"
AnisotropicTemperature="0,0,0,0,0,0,0,0,0" Occupancy="0.708" Components="C"/>
I need to extract the XYZ value, and further need to take this value and separate the comma-separated numbers within it. I need to use these numbers in another input file of a different format, so I was thinking to assign them to three separate variables and take it from there.
I'm very inexperienced with Python, and completely so when it comes to XML. I'm not sure of which libraries I would need to use, if such libraries even exist and how to use them if they do.
http://docs.python.org/3/library/xml.etree.elementtree.html
>>> from xml.etree import ElementTree as ET
>>> elem = ET.fromstring('''<Atom3d ID="18" Mapping="43" Parent="2" Name="C7"
... XYZ="0.0148299997672439,0.283699989318848,1.0291999578476" Connections="33,39"
... TemperatureType="Isotropic" IsotropicTemperature="0.0677"
... AnisotropicTemperature="0,0,0,0,0,0,0,0,0" Occupancy="0.708" Components="C"/>
... ''')
get attribute using get('attribute-name'):
>>> elem.get('XYZ')
'0.0148299997672439,0.283699989318848,1.0291999578476'
split string by ',':
>>> elem.get('XYZ').split(',')
['0.0148299997672439', '0.283699989318848', '1.0291999578476']
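To get the three separate variables the question asks for, the split strings can be converted to floats and unpacked. A small sketch, assuming the attribute always holds exactly three comma-separated numbers (the element is shortened here to the attributes that matter):

```python
from xml.etree import ElementTree as ET

elem = ET.fromstring(
    '<Atom3d ID="18" Name="C7" '
    'XYZ="0.0148299997672439,0.283699989318848,1.0291999578476"/>'
)

# Unpacking raises ValueError if there are not exactly three values.
x, y, z = (float(part) for part in elem.get("XYZ").split(","))
print(x, y, z)
```

In a full document, root.iter("Atom3d") would yield every such element in turn.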

xml missing element in python

The system uses the DOM parser in Python 2.7.2. The goal is to extract the .db file and use it on SQL Server. I currently have no problem with the sqlite3 library. I have read similar questions and answers about how to handle a missing element while parsing XML files, but I still couldn't figure out a solution. The XML has 15000+ elements. Here is a basic sample of the XML:
<topo>
<vlancard>
<id>4545</id>
<nodeValue>21</nodeValue>
<vlanName>voice</vlanName>
</vlancard>
<vlancard>
<id>1234</id>
<nodeValue>42</nodeValue>
<vlanName>camera</vlanName>
</vlancard>
<vlancard>
<id>9876</id>
<nodeValue>84</nodeValue>
</vlancard>
</topo>
Like the third element, several elements do not have the <vlanName> node. That makes the element counts inconsistent, i.e.
from xml.dom import minidom
xmldoc = minidom.parse(r'c:\vlan.xml')  # raw string, otherwise \v is read as a vertical tab
vlId = xmldoc.getElementsByTagName('id')
vlValue = xmldoc.getElementsByTagName('nodeValue')
vlName = xmldoc.getElementsByTagName('vlanName')
After running the module:
IndexError: list index out of range
>>> len(vlId)
16163
>>> len(vlName)
16155
Because of this, the elements end up in the wrong order. When filling the table, the parser skips the missing elements and the values get mixed up between rows. I use a simple while loop to insert the values into the table.
x = 0
while x < len(vlId):
    c.execute('''insert into vlan ('id','nodeValue','vlanName') values ('%s','%s','%s') ''' % (vlId[x].firstChild.nodeValue, vlValue[x].firstChild.nodeValue, vlName[x].firstChild.nodeValue))
    x = x + 1
How else can I do this? Any help will be appreciated.
Yusuf
Instead of parsing the entire XML and then inserting, parse each vlancard, then retrieve its id/value/name and insert them into the DB.
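A minimal sketch of that per-vlancard approach, using minidom and sqlite3 as in the question. The helper child_text, the in-memory database and the table's column types are assumptions made for illustration; a missing <vlanName> is stored as NULL so the rows stay aligned.

```python
import sqlite3
from xml.dom import minidom

XML = """<topo>
<vlancard><id>4545</id><nodeValue>21</nodeValue><vlanName>voice</vlanName></vlancard>
<vlancard><id>1234</id><nodeValue>42</nodeValue><vlanName>camera</vlanName></vlancard>
<vlancard><id>9876</id><nodeValue>84</nodeValue></vlancard>
</topo>"""


def child_text(parent, tag):
    """Return the text of the first <tag> under parent, or None if missing."""
    nodes = parent.getElementsByTagName(tag)
    if nodes and nodes[0].firstChild is not None:
        return nodes[0].firstChild.nodeValue
    return None


doc = minidom.parseString(XML)
# One tuple per vlancard, so values from different cards can never be mixed up.
rows = [
    (child_text(card, "id"), child_text(card, "nodeValue"), child_text(card, "vlanName"))
    for card in doc.getElementsByTagName("vlancard")
]

conn = sqlite3.connect(":memory:")  # hypothetical in-memory database
conn.execute("create table vlan (id text, nodeValue text, vlanName text)")
# Placeholders (?) let sqlite3 do the quoting and turn None into NULL.
conn.executemany("insert into vlan (id, nodeValue, vlanName) values (?, ?, ?)", rows)
conn.commit()

stored = conn.execute("select id, nodeValue, vlanName from vlan order by rowid").fetchall()
print(stored)
# [('4545', '21', 'voice'), ('1234', '42', 'camera'), ('9876', '84', None)]
```

Using ? placeholders instead of % string formatting also avoids quoting problems when a value contains an apostrophe.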

Python development - elementtree XML and string operations

I am using ElementTree to load up a series of XML files and parse them. As a file is parsed, I am grabbing a few bits of data from it ( a headline and a paragraph of text). I then need to grab some file names that are stored in the XML. They are contained in an element called ContentItem.
My code looks a bit like this:
for item in dirlist:
    newsML = ET.parse(item)
    NewsLines = newsML.getroot()
    HeadLine = NewsLines.getiterator("HeadLine")
    result.append(HeadLine)
    p = NewsLines.getiterator("p")
    result.append(p)
    ci = NewsLines.getiterator("ContentItem")
    for i in ci:
        result.append(i.attrib)
Now, if there was only one type of file, this would have been fine, but it contains 3 types (jpg, flv and a mp4). So as I loop through them in the view, it spits them out, but how do I just grab the flv if I only want that one? or just the mp4? They don't always appear in the same order in the list either.
Is there a way to say if it ends in .mp4 then do this action, or is there a way to do that in the template even?
If I try to do this:
url = i.attrib
if url.get("Href", () ).endswith('jpg'):
    result.append(i.attrib)
I get an error: tuple object has no attribute endswith. Why is this a tuple? I thought it was a dict?
You get a tuple because you supply a tuple (the parentheses) as the default return value for url.get(). Supply an empty string, and you can use its .endswith() method. Also note that the element itself has a get() method to retrieve attribute values (you do not have to go via .attrib). Example:
if i.get('Href', '').endswith('.jpg'):
    result.append(i.attrib)
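To pick out only the flv or only the mp4, the same test can be wrapped in a small helper. This is a sketch: the NewsML-like fragment and the function name hrefs_by_extension are made up for the example, and it relies on str.endswith accepting a tuple of suffixes.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the real NewsML files are not shown in the question.
XML = """<NewsItem>
  <ContentItem Href="clip.flv"/>
  <ContentItem Href="photo.jpg"/>
  <ContentItem Href="clip.mp4"/>
  <ContentItem/>
</NewsItem>"""


def hrefs_by_extension(root, extensions):
    """Return the Href of every ContentItem ending in one of `extensions`."""
    wanted = tuple(ext.lower() for ext in extensions)
    return [
        item.get("Href")
        for item in root.iter("ContentItem")
        # The '' default keeps .endswith() valid when Href is missing.
        if item.get("Href", "").lower().endswith(wanted)
    ]


root = ET.fromstring(XML)
print(hrefs_by_extension(root, [".mp4"]))          # ['clip.mp4']
print(hrefs_by_extension(root, [".flv", ".mp4"]))  # ['clip.flv', 'clip.mp4']
```

The order of the items in the file does not matter, because each Href is tested on its own.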
