Returning value after recursively iterating through XML

I'm working with a very nested XML file and the path is critical for understanding. This answer enables me to print both the path and value: Python xml absolute path
What I can't figure out is how to output the result in a more usable way (trying to construct a dataframe listing Path and Value).
For example, from the linked example:
<A>
    <B>foo</B>
    <C>
        <D>On</D>
    </C>
    <E>Auto</E>
    <F>
        <G>
            <H>shoo</H>
            <I>Off</I>
        </G>
    </F>
</A>
from lxml import etree
root = etree.XML(your_xml_string)
def print_path_of_elems(elem, elem_path=""):
    for child in elem:
        if not child.getchildren() and child.text:
            # leaf node with text => print
            print "%s/%s, %s" % (elem_path, child.tag, child.text)
        else:
            # node with child elements => recurse
            print_path_of_elems(child, "%s/%s" % (elem_path, child.tag))
print_path_of_elems(root, root.tag)
Results in the following printout:
/A/B, foo
/A/C/D, On
/A/E, Auto
/A/F/G/H, shoo
/A/F/G/I, Off
I believe yield is the correct technique, but I'm getting nowhere. My current attempt returns nothing:
from lxml import etree
root = etree.XML(your_xml_string)
def yield_path_of_elems(elem, elem_path=""):
    for child in elem:
        if not child.getchildren() and child.text:
            ylddict = {'Path':elem_path, 'Value':child.text}
            yield(ylddict)
        else:
            # node with child elements => recurse
            yield_path_of_elems(child, "%s/%s" % (elem_path, child.tag))
for i in yield_path_of_elems(root):
    #print for simplicity in example, otherwise turn into DF and concat
    print(i)
From experimenting I believe when I use yield or return the recursion doesn't function correctly.

You need to pass the values yielded by the recursive call back to the original caller. So change:
yield_path_of_elems(child, "%s/%s" % (elem_path, child.tag))
to
yield from yield_path_of_elems(child, "%s/%s" % (elem_path, child.tag))
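This is analogous to the way you have to use return recursive_call(...) in a normal recursive function. Note that yield from requires Python 3.3 or later; on older versions, loop over the recursive call and yield each item yourself.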
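A minimal sketch of the whole generator with that change applied, for illustration only. It assumes the standard library's xml.etree.ElementTree instead of lxml so it runs without extra packages, uses len(child) == 0 in place of getchildren(), and includes child.tag in the yielded path; the name iter_leaf_paths is hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<A>
  <B>foo</B>
  <C><D>On</D></C>
  <E>Auto</E>
  <F><G><H>shoo</H><I>Off</I></G></F>
</A>"""


def iter_leaf_paths(elem, elem_path):
    """Yield {'Path': ..., 'Value': ...} for every leaf element that has text."""
    for child in elem:
        child_path = "%s/%s" % (elem_path, child.tag)
        if len(child) == 0:
            # Leaf node: report it only if it carries non-blank text.
            if child.text and child.text.strip():
                yield {"Path": child_path, "Value": child.text}
        else:
            # Node with children: hand the sub-generator's items to our caller.
            yield from iter_leaf_paths(child, child_path)


root = ET.fromstring(SAMPLE)
rows = list(iter_leaf_paths(root, "/" + root.tag))
for row in rows:
    print(row)
```

Because rows is a plain list of dicts, it can be passed to pandas.DataFrame(rows) to get the Path and Value columns the question asks for.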

Related

python error AttributeError: 'Element' object has no attribute 'childNodes' while parsing xml

I am very new to Python. I have a requirement to print each node's value along with its name, but I'm getting an error.
Can anyone help, please?
This is my sample XML:
<task chapnbr="05" sectnbr="41" subjnbr="05" func="210" seq="8" pgblknbr="2" chg="N" key="4B5014E906" revdate="20051230">
    <effect effrg="801899" efftext="ZAP ALL" />
    <title>WING</title>
    <refblock>(<grphcref refid="NAX00000">Figure 208</grphcref>)</refblock>
    <tfmatr>
        <pretopic>
            <title>General</title>
            <list1>
                <l1item>
                    <para>This procedure is a scheduled maintenance task.</para>
                </l1item>
            </list1>
        </pretopic>
    </tfmatr>
    <topic>
        <title>Zonal Inspection</title>
        <subtask chapnbr="05" sectnbr="41" subjnbr="05" func="210" seq="008" pgblknbr="2" chg="N" key="B7A276D9D" revdate="20051230">
            <effect effrg="8018" efftext="ZAP ALL" />
            <list1>
                <l1item>
                    <para>Do the zonal inspection.</para>
                </l1item>
            </list1>
        </subtask>
    </topic>
    <graphic chapnbr="05" sectnbr="41" subjnbr="05" func="990" seq="808" pgblknbr="2" chg="N" key="NAX00000" revdate="20051230">
        <effect effrg="801899" efftext="ZAP ALL" />
        <title>Figure 208. Leading Edge to Front Spar (Outboard of Nacelle Strut) Left Wing - General Visual (External)</title>
        <sheet gnbr="811DE0C" sheetnbr="1" chg="N" key="5C152" revdate="20051230">
            <effect effrg="801899" efftext="ZAP ALL" />
        </sheet>
    </graphic>
</task>
My sample Python code:
import os
import sys
import xml.etree.ElementTree as ET
directory=raw_input("Enter the folderPath : ")
files=raw_input("Enter the File type : ")
def select_files_in_folder(dir, ext):
    for file in os.listdir(dir):
        if file.endswith('.%s' % ext):
            yield os.path.join(dir, file)
def print_node(root):
    if root.childNodes[0]:
        for node in root.childNodes:
            if node.nodeType == node.ELEMENT_NODE:
                print node.tagName,"has value:", node.nodeValue, "and is child of:", node.parentNode.tagName
                print_node(node)
for file in select_files_in_folder(directory, files):
    tree = ET.parse(file)
    root = tree.getroot()
    print_node(root)
I'm getting the AttributeError shown in the title.

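You are getting AttributeError because an xml.etree.ElementTree.Element object has no attribute childNodes. childNodes, nodeType, tagName, nodeValue and parentNode belong to the DOM API (for example xml.dom.minidom), not to ElementTree.
To iterate over child elements, just do for child in elem. Use tag and text instead of tagName and nodeValue; the element you are iterating over is the parent:
def print_node(root):
    for node in root:
        print("%s has value: %s and is child of: %s" % (node.tag, node.text, root.tag))
        print_node(node)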
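ElementTree elements do not keep a reference to their parent, so code that wants a parentNode equivalent has to track it. One common approach, sketched here under the assumption that a flat list of rows is acceptable, is to build a child-to-parent dictionary once and then walk the tree with iter(). The function name describe_nodes and the shortened sample are hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<task>
  <title>WING</title>
  <topic>
    <title>Zonal Inspection</title>
    <para>Do the zonal inspection.</para>
  </topic>
</task>"""


def describe_nodes(root):
    """Return (tag, text, parent_tag) for every element below root."""
    # Map each child element to its parent, since Element has no parent link.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    rows = []
    for node in root.iter():  # document order, root first
        if node is root:
            continue
        text = (node.text or "").strip()
        rows.append((node.tag, text, parent_of[node].tag))
    return rows


for tag, text, parent_tag in describe_nodes(ET.fromstring(SAMPLE)):
    print("%s has value: %r and is child of: %s" % (tag, text, parent_tag))
```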

how to get file names and paths based on a given attribute in parent tag

I want to change the code below so that it returns file_names and file_paths only when the fastboot="true" attribute is present on the parent tag. I have provided the current output and the expected output. Can anyone provide guidance on how to do it?
import sys
import os
import string
from xml.dom import minidom
if __name__ == '__main__':
    meta_contents = minidom.parse("fast.xml")
    builds_flat = meta_contents.getElementsByTagName("builds_flat")[0]
    build_nodes = builds_flat.getElementsByTagName("build")
    for build in build_nodes:
        bid_name = build.getElementsByTagName("name")[0]
        print "Checking if this is cnss related image... : \n"+bid_name.firstChild.data
        if (bid_name.firstChild.data == 'apps'):
            file_names = build.getElementsByTagName("file_name")
            file_paths = build.getElementsByTagName("file_path")
            print "now files paths...\n"
            for fn,fp in zip(file_names,file_paths):
                if (not fp.firstChild.nodeValue.endswith('/')):
                    fp.firstChild.nodeValue = fp.firstChild.nodeValue + '/'
                full_path = fp.firstChild.nodeValue+fn.firstChild.nodeValue
                print "file-to-copy: "+full_path
            break
INPUT XML:-
<builds_flat>
    <build>
        <name>apps</name>
        <file_ref ignore="true" minimized="true">
            <file_name>adb.exe</file_name>
            <file_path>LINUX/android/vendor/qcom/proprietary/usb/host/windows/prebuilt/</file_path>
        </file_ref>
        <file_ref ignore="true" minimized="true">
            <file_name>system.img</file_name>
            <file_path>LINUX/android/out/target/product/msmcobalt/secondary-boot/</file_path>
        </file_ref>
        <download_file cmm_file_var="APPS_BINARY" fastboot_rumi="boot" fastboot="true" minimized="true">
            <file_name>boot.img</file_name>
            <file_path>LINUX/android/out/target/product/msmcobalt/</file_path>
        </download_file>
        <download_file sparse_image_path="true" fastboot_rumi="abl" fastboot="true" minimized="true">
            <file_name>abl.elf</file_name>
            <file_path>LINUX/android/out/target/product/msmcobalt/</file_path>
        </download_file>
    </build>
</builds_flat>
OUTPUT:-
...............
now files paths...
file-to-copy: LINUX/android/vendor/qcom/proprietary/usb/host/windows/prebuilt/adb.exe
file-to-copy: LINUX/android/out/target/product/msmcobalt/secondary-boot/system.img
file-to-copy: LINUX/android/out/target/product/msmcobalt/boot.img
file-to-copy: LINUX/android/out/target/product/msmcobalt/abl.elf
EXPECTED OUT:-
now files paths...
........
file-to-copy: LINUX/android/out/target/product/msmcobalt/boot.img
file-to-copy: LINUX/android/out/target/product/msmcobalt/abl.elf

Something rather quick and dirty that comes to mind is using the fact that only the download_file elements have the fastboot attribute, right? If that's the case, you could get all the download_file elements and keep only the ones whose fastboot attribute is "true":
import os
from xml.dom import minidom
if __name__ == '__main__':
    meta_contents = minidom.parse("fast.xml")
    for elem in meta_contents.getElementsByTagName('download_file'):
        if elem.getAttribute('fastboot') == "true":
            path = elem.getElementsByTagName('file_path')[0].firstChild.nodeValue
            file_name = elem.getElementsByTagName('file_name')[0].firstChild.nodeValue
            print(os.path.join(path, file_name))
With the sample you provided that outputs:
$ python ./stack_034.py
LINUX/android/out/target/product/msmcobalt/boot.img
LINUX/android/out/target/product/msmcobalt/abl.elf
Needless to say, since there's no .xsd file (not that it would matter with minidom), you only get strings (no type safety), and this only works for the structure shown in the example; you would probably want to add some extra checks.
EDIT:
As per the comment in this answer:
To get the elements within the <build> that contains a <name> element with the value apps, find that <name> tag (the one whose text is the string apps), move to its parent node (which puts you at the build element), and then proceed as above:
import os
from xml.dom import minidom
if __name__ == '__main__':
    meta_contents = minidom.parse("fast.xml")
    for elem in meta_contents.getElementsByTagName('name'):
        if elem.firstChild.nodeValue == "apps":
            apps_build = elem.parentNode
    for elem in apps_build.getElementsByTagName('download_file'):
        if elem.getAttribute('fastboot') == "true":
            path = elem.getElementsByTagName('file_path')[0].firstChild.nodeValue
            file_name = elem.getElementsByTagName('file_name')[0].firstChild.nodeValue
            print(os.path.join(path, file_name))
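The "extra checks" mentioned above could look roughly like the following sketch. It parses an inline string with minidom.parseString so it is self-contained, skips elements whose file_name or file_path is missing or empty, and returns a list instead of printing. The helper names text_of and fastboot_files, and the shortened sample paths, are hypothetical.

```python
import os
from xml.dom import minidom

# Shortened, made-up sample with the same shape as the question's input.
SAMPLE = """<builds_flat>
  <build>
    <name>apps</name>
    <file_ref ignore="true" minimized="true">
      <file_name>adb.exe</file_name>
      <file_path>LINUX/prebuilt/</file_path>
    </file_ref>
    <download_file fastboot="true" minimized="true">
      <file_name>boot.img</file_name>
      <file_path>LINUX/out/</file_path>
    </download_file>
    <download_file minimized="true">
      <file_name>skip.img</file_name>
      <file_path>LINUX/out/</file_path>
    </download_file>
  </build>
</builds_flat>"""


def text_of(parent, tag):
    """Return the text of the first <tag> under parent, or None if absent/empty."""
    nodes = parent.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return None
    return nodes[0].firstChild.nodeValue


def fastboot_files(xml_text, build_name):
    """List path+name for download_file elements with fastboot="true"."""
    doc = minidom.parseString(xml_text)
    found = []
    for build in doc.getElementsByTagName("build"):
        if text_of(build, "name") != build_name:
            continue
        for elem in build.getElementsByTagName("download_file"):
            # getAttribute returns "" when the attribute is missing.
            if elem.getAttribute("fastboot") != "true":
                continue
            path, name = text_of(elem, "file_path"), text_of(elem, "file_name")
            if path and name:
                found.append(os.path.join(path, name))
    return found


print(fastboot_files(SAMPLE, "apps"))
```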

parsing parent/children relationship of XML elements

Given the following XML (ant build xml):
<project name="pj1">
    <target name="t1">
        ...
        <antcall target="t2"/>
        <a>
            <antcall target="t4"/>
        </a>
        ...
    </target>
    <target name="t2">
        ...
        <antcall target="t3"/>
        ...
    </target>
    <target name="t3">
        ...
        ...
    </target>
    <target name="t4">
        ...
        <antcall target="t2"/>
        ...
    </target>
    <target name="t5">
        ...
        ...
    </target>
</project>
I'd like to display parent/children relationship of the target elements as follows (without displaying a target as first level element if it is nested in another target)
t1
    t2
        t3
    t4
        t2
            t3
t5
could anyone please help?
Thanks in advance.

When I need to manipulate an XML tree to some other representation, I find it useful to first convert it to an abstract representation and then convert to the final concrete representation.
In this case, first we create a dictionary of lists which represent the target dependency structure, then we pretty-print that dictionary.
#!/usr/bin/python
import xml.etree.ElementTree as ET
from itertools import chain
def parse(filename):
    tree = ET.parse(filename)
    root = tree.getroot()
    result = {}
    for target in root.findall('target'):
        target_name = target.get('name')
        result[target_name] = []
        for antcall in target.findall('.//antcall'):
            result[target_name].append(antcall.get('target'))
    return result
def display(tree):
    def recurse(node, indent):
        print("%*s%s" % (indent*4, "", node))
        for node in sorted(tree[node]):
            recurse(node, indent+1)
    for item in sorted(tree):
        if item in chain(*tree.values()): continue
        recurse(item,0)
if __name__=="__main__":
    import argparse
    parser = argparse.ArgumentParser(description='Dump ANT files')
    parser.add_argument('antfile',
                        nargs='+',
                        type=argparse.FileType('r'),
                        help='ANT build file')
    args = parser.parse_args()
    for antfile in args.antfile:
        display(parse(antfile))
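One caveat with a recursive display like this is that two targets calling each other would recurse forever. The sketch below is a variant, not the code above: it keeps document order instead of sorting, returns the lines rather than printing them, and marks a target that is already on the current call chain instead of descending into it again. The names call_graph and render and the "(cycle)" marker are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

ANT_XML = """<project name="pj1">
  <target name="t1"><antcall target="t2"/><a><antcall target="t4"/></a></target>
  <target name="t2"><antcall target="t3"/></target>
  <target name="t3"/>
  <target name="t4"><antcall target="t2"/></target>
  <target name="t5"/>
</project>"""


def call_graph(root):
    """Map each target name to the targets it calls via <antcall>."""
    return {t.get("name"): [c.get("target") for c in t.findall(".//antcall")]
            for t in root.findall("target")}


def render(graph, indent=4):
    """Return indented lines, starting from targets nobody else calls."""
    called = {name for callees in graph.values() for name in callees}
    lines = []

    def walk(name, depth, stack):
        marker = " (cycle)" if name in stack else ""
        lines.append(" " * (indent * depth) + name + marker)
        if marker:
            return  # already on the current call chain: stop here
        for callee in graph.get(name, []):
            walk(callee, depth + 1, stack | {name})

    for name in graph:
        if name not in called:
            walk(name, 0, frozenset())
    return lines


print("\n".join(render(call_graph(ET.fromstring(ANT_XML)))))
```

Note that if every target is called by some other target, there is no top-level entry to start from and nothing is rendered; the original display function behaves the same way.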

using libclang, callback function does not traverse recursively and does not visit all functions

I am following an example which shows limitation of python bindings, from site http://eli.thegreenplace.net/2011/07/03/parsing-c-in-python-with-clang/
It uses "libclang visitation API directly".
import sys
import clang.cindex
def callexpr_visitor(node, parent, userdata):
    if node.kind == clang.cindex.CursorKind.CALL_EXPR:
        print 'Found %s [line=%s, col=%s]' % (
            node.spelling, node.location.line, node.location.column)
    return 2 # means continue visiting recursively
index = clang.cindex.Index.create()
tu = index.parse(sys.argv[1])
clang.cindex.Cursor_visit(
    tu.cursor,
    clang.cindex.Cursor_visit_callback(callexpr_visitor),
    None)
The output shows all functions called along with their line numbers.
Found foo [line=8, col=5]
Found foo [line=10, col=9]
Found bar [line=15, col=5]
Found foo [line=16, col=9]
Found bar [line=17, col=9]
When I run the same code, I only get output
Found bar [line=15, col=5]
The version I use is llvm3.1 with windows (with the changes suggested in the link).
I feel, returning 2 is not calling the callback function again.
I have even tried using 'get_children' on node and traversing without callback, I get the same result.
import sys
import clang.cindex
#def callexpr_visitor(node, parent, userdata):
def callexpr_visitor(node):
    if node.kind == clang.cindex.CursorKind.CALL_EXPR:
        print 'Found %s [line=%s, col=%s]' % (
            clang.cindex.Cursor_displayname(node), node.location.line, node.location.column)
    for c in node.get_children():
        callexpr_visitor(c)
    #return 2 # means continue visiting recursively
index = clang.cindex.Index.create()
tu = index.parse(sys.argv[1])
#clang.cindex.Cursor_visit(
#    tu.cursor,
#    clang.cindex.Cursor_visit_callback(callexpr_visitor),
#    None)
callexpr_visitor(tu.cursor)
I could not get the reason for this behavior after much search and trials.
Can anyone explain this please ?
Regards.

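I think I found the reason for this.
If I change the return type of the function 'foo' from 'bool' to 'int', I get the expected result.
This is probably because 'bool' is not a keyword in C (it is only available there through <stdbool.h>), so the declaration of 'foo' likely fails to parse when the file is treated as C.
That was simple.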

Preserving special characters in text nodes using Python lxml module

I am editing an XML file that is provided by a third party. The XML is used to recreate an entire environment, and one can edit the XML to propagate changes. I was able to look up the element I wanted to change through command-line options and save the XML, but special characters are being escaped and I need to retain them. For example, it is changing > to &gt; in the file during the .write operation. This affects all occurrences in the XML document, not just the element I modified. Below is my code:
import sys
from lxml import etree
from optparse import OptionParser
def parseCommandLine ():
    usage = "usage: %prog [options] arg"
    parser = OptionParser(usage)
    parser.add_option("-f","--file",dest="filename",
                      help="Context File name including full path", metavar="CONTEXT_FILE")
    parser.add_option("-k","--key",dest="key",
                      help="Key you are looking for in Context File i.e s_isAdmin", metavar="s_someKey")
    parser.add_option("-v","--value",dest="value",
                      help="The replacement value for the key")
    if len(sys.argv[1:]) < 3:
        print len(sys.argv[1:])
        parser.print_help()
        sys.exit(2)
    (options, args) = parser.parse_args()
    return options.filename, options.key, options.value
Filename, Key, Value=parseCommandLine()
parser_options=etree.XMLParser(attribute_defaults=True, dtd_validation=False, strip_cdata=False)
doc = etree.parse(Filename, parser_options ) #Open and parse the file
print doc.findall("//*[@oa_var=%r]" % Key)[0].text
oldval = doc.findall("//*[@oa_var=%r]" % Key)[0].text
val = doc.findall("//*[@oa_var=%r]" % Key)[0]
val.text = Value
print 'old value is %s' % oldval
print 'new value is %s' % val.text
root = doc.getroot()
doc.write(Filename,method='xml',with_tail=True,pretty_print=False)
Original file has this:
tf.fm.FulfillmentServer >> /s_u01/app/applmgr/f
Saved version is being replaced with this:
tf.fm.FulfillmentServer &gt;&gt; /s_u01/app/applmgr/f
I have been experimenting with pretty_print on the output side and DTD validation on the parsing side, and I am stumped.
Below is a diff of the changed file and the original file:
I updated the s_cookie_domain only.
diff finprod_acfpdb10.xml_original finprod_acfpdb10.xml
Warning: missing newline at end of file finprod_acfpdb10.xml
1,3c1
< <?xml version = '1.0'?>
< <!-- $Header: adxmlctx.tmp 115.426 2009/05/08 08:46:29 rdamodar ship $ -->
< <!--
---
> <!-- $Header: adxmlctx.tmp 115.426 2009/05/08 08:46:29 rdamodar ship $ --><!--
13,14c11
< -->
< <oa_context version="$Revision: 115.426 $">
---
> --><oa_context version="$Revision: 115.426 $">
242c239
< <cookiedomain oa_var="s_cookie_domain">.apollogrp.edu</cookiedomain>
---
> <cookiedomain oa_var="s_cookie_domain">.qadoamin.edu</cookiedomain>
526c523
< <FORMS60_BLOCK_URL_CHARACTERS oa_var="s_f60blockurlchar">%0a,%0d,!,%21,",%22,%28,%29,;,[,%5b,],%5d,{,%7b,|,%7c,},%7d,%7f,>,%3c,<,%3e</FORMS60_BLOCK_URL_CHARACTERS>
---
> <FORMS60_BLOCK_URL_CHARACTERS oa_var="s_f60blockurlchar">%0a,%0d,!,%21,",%22,%28,%29,;,[,%5b,],%5d,{,%7b,|,%7c,},%7d,%7f,>,%3c,<,%3e</FORMS60_BLOCK_URL_CHARACTERS>
940c937
< <start_cmd oa_var="s_jtffstart">/s_u01/app/applmgr/jdk1.5.0_11/bin/java -Xmx512M -classpath .:/s_u01/app/applmgr/finprod/comn/java/jdbc111.zip:/s_u01/app/applmgr/finprod/comn/java/xmlparserv2.zip:/s_u01/app/applmgr/finprod/comn/java:/s_u01/app/applmgr/finprod/comn/java/apps.zip:/s_u01/app/applmgr/jdk1.5.0_11/classes:/s_u01/app/applmgr/jdk1.5.0_11/lib:/s_u01/app/applmgr/jdk1.5.0_11/lib/classes.zip:/s_u01/app/applmgr/jdk1.5.0_11/lib/classes.jar:/s_u01/app/applmgr/jdk1.5.0_11/lib/rt.jar:/s_u01/app/applmgr/jdk1.5.0_11/lib/i18n.jar:/s_u01/app/applmgr/finprod/comn/java/3rdparty/RFJavaInt.zip: -Dengine.LogPath=/s_u01/app/applmgr/finprod/comn/admin/log/finprod_acfpdb10 -Dengine.TempDir=/s_u01/app/applmgr/finprod/comn/temp -Dengine.CommandPromptEnabled=false -Dengine.CommandPort=11000 -Dengine.AOLJ.config=/s_u01/app/applmgr/finprod/appl/fnd/11.5.0/secure/acfpdb10_finprod.dbc -Dengine.ServerID=5000 -Ddebug=off -Dengine.LogLevel=1 -Dlog.ShowWarnings=false -Dengine.FaxEnabler=oracle.apps.jtf.fm.engine.rightfax.RfFaxEnablerImpl -Dengine.PrintEnabler=oracle.apps.jtf.fm.engine.rightfax.RfPrintEnablerImpl -Dfax.TempDir=/s_u01/app/applmgr/finprod/comn/admin/log/finprod_acfpdb10 -Dprint.TempDir=/s_u01/app/applmgr/finprod/comn/admin/log/finprod_acfpdb10 oracle.apps.jtf.fm.FulfillmentServer >> /s_u01/app/applmgr/finprod/comn/admin/log/finprod_acfpdb10/jtffmctl.txt</start_cmd>
---
> <start_cmd oa_var="s_jtffstart">/s_u01/app/applmgr/jdk1.5.0_11/bin/java -Xmx512M -classpath .:/s_u01/app/applmgr/finprod/comn/java/jdbc111.zip:/s_u01/app/applmgr/finprod/comn/java/xmlparserv2.zip:/s_u01/app/applmgr/finprod/comn/java:/s_u01/app/applmgr/finprod/comn/java/apps.zip:/s_u01/app/applmgr/jdk1.5.0_11/classes:/s_u01/app/applmgr/jdk1.5.0_11/lib:/s_u01/app/applmgr/jdk1.5.0_11/lib/classes.zip:/s_u01/app/applmgr/jdk1.5.0_11/lib/classes.jar:/s_u01/app/applmgr/jdk1.5.0_11/lib/rt.jar:/s_u01/app/applmgr/jdk1.5.0_11/lib/i18n.jar:/s_u01/app/applmgr/finprod/comn/java/3rdparty/RFJavaInt.zip: -Dengine.LogPath=/s_u01/app/applmgr/finprod/comn/admin/log/finprod_acfpdb10 -Dengine.TempDir=/s_u01/app/applmgr/finprod/comn/temp -Dengine.CommandPromptEnabled=false -Dengine.CommandPort=11000 -Dengine.AOLJ.config=/s_u01/app/applmgr/finprod/appl/fnd/11.5.0/secure/acfpdb10_finprod.dbc -Dengine.ServerID=5000 -Ddebug=off -Dengine.LogLevel=1 -Dlog.ShowWarnings=false -Dengine.FaxEnabler=oracle.apps.jtf.fm.engine.rightfax.RfFaxEnablerImpl -Dengine.PrintEnabler=oracle.apps.jtf.fm.engine.rightfax.RfPrintEnablerImpl -Dfax.TempDir=/s_u01/app/applmgr/finprod/comn/admin/log/finprod_acfpdb10 -Dprint.TempDir=/s_u01/app/applmgr/finprod/comn/admin/log/finprod_acfpdb10 oracle.apps.jtf.fm.FulfillmentServer &gt;&gt; /s_u01/app/applmgr/finprod/comn/admin/log/finprod_acfpdb10/jtffmctl.txt</start_cmd>
983c980
< </oa_context>
---
> </oa_context>

Terminology: Parsers don't write XML; they read XML. Serialisers write XML.
In normal element content, < and & are illegal and must be escaped. > is legal except where it follows ]] and is NOT the end of a CDATA section. Most serialisers take the easy way out and write &gt; because a parser will handle both that and >.
I suggest that you submit both your output and input files to an XML validation service and also test whether the consumer will actually parse your output file.
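A quick way to convince yourself that the two spellings are equivalent to a parser is to parse both and compare the resulting text. The sketch below uses the standard library's xml.etree.ElementTree rather than lxml, purely so it runs without extra packages, and the sample command string is made up.

```python
import xml.etree.ElementTree as ET

# The same element content, once with a literal ">" and once escaped.
raw = ET.fromstring("<cmd>server >> /tmp/out.txt</cmd>")
escaped = ET.fromstring("<cmd>server &gt;&gt; /tmp/out.txt</cmd>")

# A parser hands back identical text for both spellings.
same_text = raw.text == escaped.text

# The serialiser writes the escaped form, and parsing it again is lossless.
serialised = ET.tostring(raw, encoding="unicode")
round_trip = ET.fromstring(serialised).text

print(same_text)
print(serialised)
print(round_trip)
```

If the consumer of the file uses a real XML parser, it should see the same text either way; the escaped form only matters to tools that read the file as plain text.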

The only thing I can think of is forcing the parser to treat the nodes you modify as cdata blocks (as the parser is clearly changing the xml tag closing brackets). Try val.text = etree.CDATA(Value) instead of val.text = Value.
http://lxml.de/api.html#cdata
