parsing parent/children relationship of XML elements


Given the following XML (ant build xml):
<project name="pj1">
    <target name="t1">
        ...
        <antcall target="t2"/>
        <a>
            <antcall target="t4"/>
        </a>
        ...
    </target>
    <target name="t2">
        ...
        <antcall target="t3"/>
        ...
    </target>
    <target name="t3">
        ...
        ...
    </target>
    <target name="t4">
        ...
        <antcall target="t2"/>
        ...
    </target>
    <target name="t5">
        ...
        ...
    </target>
</project>
I'd like to display the parent/children relationship of the target elements as follows (without displaying a target as a first-level element if it is nested in another target):
t1
    t2
        t3
    t4
        t2
            t3
t5
Could anyone please help?
Thanks in advance.

When I need to manipulate an XML tree to some other representation, I find it useful to first convert it to an abstract representation and then convert to the final concrete representation.
In this case, we first create a dictionary of lists that represents the target dependency structure, then we pretty-print that dictionary.
#!/usr/bin/python
import xml.etree.ElementTree as ET
from itertools import chain

def parse(filename):
    # Map each top-level target name to the list of targets it calls.
    tree = ET.parse(filename)
    root = tree.getroot()
    result = {}
    for target in root.findall('target'):
        target_name = target.get('name')
        result[target_name] = []
        # './/antcall' also matches antcalls nested inside other elements.
        for antcall in target.findall('.//antcall'):
            result[target_name].append(antcall.get('target'))
    return result

def display(tree):
    def recurse(node, indent):
        print("%*s%s" % (indent*4, "", node))
        for child in sorted(tree[node]):
            recurse(child, indent+1)
    # Targets called by some other target are not shown at the first level.
    called = set(chain(*tree.values()))
    for item in sorted(tree):
        if item in called: continue
        recurse(item, 0)

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description='Dump ANT files')
    parser.add_argument('antfile',
                        nargs='+',
                        type=argparse.FileType('r'),
                        help='ANT build file')
    args = parser.parse_args()
    for antfile in args.antfile:
        display(parse(antfile))
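As a self-contained illustration of the same two-step idea (build a dictionary, then print it), here is a minimal sketch that works on an in-memory copy of the sample build file. The names SAMPLE, build_graph and render are made up for this example, and the cycle guard is an addition that the answer above does not have: it assumes that antcall targets might reference each other or a target that is not defined.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the build file from the question (the "..." parts are left out).
SAMPLE = """
<project name="pj1">
    <target name="t1">
        <antcall target="t2"/>
        <a>
            <antcall target="t4"/>
        </a>
    </target>
    <target name="t2"><antcall target="t3"/></target>
    <target name="t3"/>
    <target name="t4"><antcall target="t2"/></target>
    <target name="t5"/>
</project>
"""

def build_graph(xml_text):
    """Return {target name: [names of targets it calls via antcall]}."""
    root = ET.fromstring(xml_text)
    graph = {}
    for target in root.findall("target"):
        graph[target.get("name")] = [
            call.get("target") for call in target.findall(".//antcall")
        ]
    return graph

def render(graph):
    """Return the indented tree as a list of lines instead of printing it."""
    called = {name for callees in graph.values() for name in callees}
    lines = []

    def walk(name, depth, path):
        lines.append("    " * depth + name)
        if name in path:
            return  # already on the current branch: stop instead of looping forever
        # graph.get() tolerates antcalls to targets that are not defined in this file
        for callee in sorted(graph.get(name, [])):
            walk(callee, depth + 1, path | {name})

    for name in sorted(graph):
        if name not in called:  # only uncalled targets start a first-level entry
            walk(name, 0, frozenset())
    return lines

print("\n".join(render(build_graph(SAMPLE))))
```

Returning the lines rather than printing inside the recursion makes the result easy to check or to write somewhere else.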

Related

Returning value after recursively iterating through XML

I'm working with a very nested XML file and the path is critical for understanding. This answer enables me to print both the path and value: Python xml absolute path
What I can't figure out is how to output the result in a more usable way (trying to construct a dataframe listing Path and Value).
For example, from the linked example:
<A>
    <B>foo</B>
    <C>
        <D>On</D>
    </C>
    <E>Auto</E>
    <F>
        <G>
            <H>shoo</H>
            <I>Off</I>
        </G>
    </F>
</A>
from lxml import etree
root = etree.XML(your_xml_string)

def print_path_of_elems(elem, elem_path=""):
    for child in elem:
        if not child.getchildren() and child.text:
            # leaf node with text => print
            print("%s/%s, %s" % (elem_path, child.tag, child.text))
        else:
            # node with child elements => recurse
            print_path_of_elems(child, "%s/%s" % (elem_path, child.tag))

print_path_of_elems(root, root.tag)
Results in the following printout:
/A/B, foo
/A/C/D, On
/A/E, Auto
/A/F/G/H, shoo
/A/F/G/I, Off
I believe yield is the correct technique, but I'm getting nowhere. My current attempt returns nothing:
from lxml import etree
root = etree.XML(your_xml_string)

def yield_path_of_elems(elem, elem_path=""):
    for child in elem:
        if not child.getchildren() and child.text:
            ylddict = {'Path':elem_path, 'Value':child.text}
            yield(ylddict)
        else:
            # node with child elements => recurse
            yield_path_of_elems(child, "%s/%s" % (elem_path, child.tag))

for i in yield_path_of_elems(root):
    #print for simplicity in example, otherwise turn into DF and concat
    print(i)
From experimenting I believe when I use yield or return the recursion doesn't function correctly.

You need to pass the values yielded by the recursive call back to the original caller. So change:
yield_path_of_elems(child, "%s/%s" % (elem_path, child.tag))
to
yield from yield_path_of_elems(child, "%s/%s" % (elem_path, child.tag))
This is analogous to the way you have to use return recursive_call(...) in a normal recursive function.
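To see the effect of yield from end to end, here is a minimal sketch using the sample XML from the question. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages, and the function name iter_leaf_paths is made up for this example. It also puts the child's own tag into the stored path, which is an assumption about what the Path column should contain.

```python
import xml.etree.ElementTree as ET

XML = (
    "<A><B>foo</B><C><D>On</D></C><E>Auto</E>"
    "<F><G><H>shoo</H><I>Off</I></G></F></A>"
)

def iter_leaf_paths(elem, elem_path=""):
    """Yield one {'Path': ..., 'Value': ...} dict per leaf element that has text."""
    for child in elem:
        child_path = "%s/%s" % (elem_path, child.tag)
        if len(child) == 0 and child.text:
            # leaf node with text => hand the row to the caller
            yield {"Path": child_path, "Value": child.text}
        else:
            # node with child elements => forward everything the recursion yields
            yield from iter_leaf_paths(child, child_path)

root = ET.fromstring(XML)
rows = list(iter_leaf_paths(root, "/" + root.tag))
for row in rows:
    print(row["Path"], row["Value"])
```

Because rows is a plain list of dictionaries, it can be handed to pandas.DataFrame(rows) in one step if a dataframe is the goal, rather than concatenating one frame per row.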

Python XML findall does not work

I am trying to use findall to select some XML elements, but I can't get any results.
import xml.etree.ElementTree as ET
import sys
storefront = sys.argv[1]
xmlFileName = 'promotions{0}.xml'
xmlFile = xmlFileName.format(storefront)
csvFileName = 'hrz{0}.csv'
csvFile = csvFileName.format(storefront)
ET.register_namespace('', "http://www.demandware.com/xml/impex/promotion/2008-01-31")
tree = ET.parse(xmlFile)
root = tree.getroot()
print('------------------Generate test-------------\n')
csv = open(csvFile,'w')
n = 0
for child in root.findall('campaign'):
    print(child.attrib['campaign-id'])
    print(n)
    n+=1
The XML looks something like this:
<?xml version="1.0" encoding="UTF-8"?>
<promotions xmlns="http://www.demandware.com/xml/impex/promotion/2008-01-31">
    <campaign campaign-id="10off-310781">
        <enabled-flag>true</enabled-flag>
        <campaign-scope>
            <applicable-online/>
        </campaign-scope>
        <customer-groups match-mode="any">
            <customer-group group-id="Everyone"/>
        </customer-groups>
    </campaign>
    <campaign campaign-id="MNT-deals">
        <enabled-flag>true</enabled-flag>
        <campaign-scope>
            <applicable-online/>
        </campaign-scope>
        <start-date>2017-07-03T22:00:00.000Z</start-date>
        <end-date>2017-07-31T22:00:00.000Z</end-date>
        <customer-groups match-mode="any">
            <customer-group group-id="Everyone"/>
        </customer-groups>
    </campaign>
    <campaign campaign-id="black-friday">
        <enabled-flag>true</enabled-flag>
        <campaign-scope>
            <applicable-online/>
        </campaign-scope>
        <start-date>2017-11-23T23:00:00.000Z</start-date>
        <end-date>2017-11-24T23:00:00.000Z</end-date>
        <customer-groups match-mode="any">
            <customer-group group-id="Everyone"/>
        </customer-groups>
        <custom-attributes>
            <custom-attribute attribute-id="expires_date">2017-11-29</custom-attribute>
        </custom-attributes>
    </campaign>
    <promotion-campaign-assignment promotion-id="winter17-new-bubble" campaign-id="winter17-new-bubble">
        <qualifiers match-mode="any">
            <customer-groups/>
            <source-codes/>
            <coupons/>
        </qualifiers>
        <rank>100</rank>
    </promotion-campaign-assignment>
    <promotion-campaign-assignment promotion-id="xmas" campaign-id="xmas">
        <qualifiers match-mode="any">
            <customer-groups/>
            <source-codes/>
            <coupons/>
        </qualifiers>
    </promotion-campaign-assignment>
</promotions>
Any ideas what I am doing wrong?
I have tried different solutions that I found on Stack Overflow, but none of the ones I tried worked for me.
The list is empty.
Sorry if it is something very obvious; I am new to Python.

As mentioned here by @MartijnPieters, etree's .findall uses the namespaces argument, while .register_namespace() only affects the XML output of the tree. Therefore, consider mapping the default namespace to an explicit prefix. The example below uses doc, but any prefix works (it could even be cosmin).
Additionally, consider using with, enumerate(), and even the csv module as better handlers for your print and CSV outputs.
import csv
...
root = tree.getroot()
print('------------------Generate test-------------\n')
with open(csvFile, 'w') as f:
    c = csv.writer(f, lineterminator='\n')
    for n, child in enumerate(root.findall('doc:campaign', namespaces={'doc':'http://www.demandware.com/xml/impex/promotion/2008-01-31'})):
        print(child.attrib['campaign-id'])
        print(n)
        c.writerow([child.attrib['campaign-id']])
# ------------------Generate test-------------
# 10off-310781
# 0
# MNT-deals
# 1
# black-friday
# 2
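The difference between the unqualified and the namespace-qualified lookup can be checked on a small in-memory document. This is only a sketch: the XML is a shortened copy of the question's file, and the variable names (NS, unqualified, prefixed, clark) are made up for this example.

```python
import xml.etree.ElementTree as ET

# Prefix chosen for lookups only; it does not have to appear in the document.
NS = {"doc": "http://www.demandware.com/xml/impex/promotion/2008-01-31"}

XML = """<promotions xmlns="http://www.demandware.com/xml/impex/promotion/2008-01-31">
    <campaign campaign-id="10off-310781"/>
    <campaign campaign-id="MNT-deals"/>
    <promotion-campaign-assignment promotion-id="xmas" campaign-id="xmas"/>
</promotions>"""

root = ET.fromstring(XML)

# Without a namespace the tag name does not match, so the list is empty.
unqualified = root.findall("campaign")

# With a prefix mapping passed as the namespaces argument.
prefixed = [c.get("campaign-id") for c in root.findall("doc:campaign", NS)]

# The same lookup written with the {uri}tag form, which needs no mapping.
clark = [c.get("campaign-id") for c in root.findall("{%s}campaign" % NS["doc"])]

print(len(unqualified), prefixed, clark)
```

Both qualified forms select the same elements; the prefix mapping tends to be easier to read when several lookups share one namespace.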

Multithreading/Multiprocessing to parse single XML file? [duplicate]

This question already has answers here:
Parsing Very Large XML Files Using Multiprocessing
(2 answers)
Closed 5 years ago.
Can someone tell me how to assign jobs to multiple threads to speed up parsing time? For example, I have an XML file with 200k lines; I would like to assign 50k lines to each of 4 threads and parse them using a SAX parser. What I have so far is 4 threads each parsing all 200k lines, which means 200k*4 = 800k duplicated results.
Any help is appreciated.
test.xml:
<?xml version="1.0" encoding="utf-8"?>
<votes>
<row Id="1" PostId="1" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
<row Id="2" PostId="1" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
<row Id="3" PostId="3" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
<row Id="5" PostId="3" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
</votes>
My source code:
import json
import xmltodict
from lxml import etree
import xml.etree.ElementTree as ElementTree
import threading
import time

def sax_parsing():
    t = threading.currentThread()
    for event, element in etree.iterparse("/home/xiang/Downloads/FYP/parallel-python/test.xml"):
        #below codes read the attributes in an element specified
        if element.tag == 'row':
            print("Thread: %s" % t.getName())
            row_id = element.attrib.get('Id')
            row_post_id = element.attrib.get('PostId')
            row_vote_type_id = element.attrib.get('VoteTypeId')
            row_user_id = element.attrib.get('UserId')
            row_creation_date = element.attrib.get('CreationDate')
            print('ID: %s, PostId: %s, VoteTypeID: %s, UserId: %s, CreationDate: %s'% (row_id,row_post_id,row_vote_type_id,row_user_id,row_creation_date))
        element.clear()
    return

if __name__ == "__main__":
    start = time.time() #calculate execution time
    main_thread = threading.currentThread()
    no_threads = 4
    for i in range(no_threads):
        t = threading.Thread(target=sax_parsing)
        t.start()
    for t in threading.enumerate():
        if t is main_thread:
            continue
        t.join()
    end = time.time() #calculate execution time
    exec_time = end - start
    print('Execution time: %fs' % (exec_time))

The simplest way is to extend your parse function to receive a start row and an end row, like so:
def sax_parsing(start, end):
and then, when creating each thread:
t = threading.Thread(target=sax_parsing, args=(i*50, (i+1)*50))
and change if element.tag == 'row': to if element.tag == 'row' and start <= int(element.attrib.get('Id')) < end:
so each thread handles only the rows in the range it was given. Note that the Id attribute is read as a string, so it needs int() before the comparison, and that every thread still reads the whole file.
(didn't actually check this, so play around)
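The range idea can be tried on the four-row sample from the question. The sketch below is not a tuned solution: it swaps lxml for the standard library's xml.etree.ElementTree, reads the XML from memory through io.BytesIO instead of the file path in the question, and the names parse_range, results and chunk are made up. Each thread still walks the whole document, so this removes the duplicated output but should not be expected to shorten the parse itself.

```python
import io
import threading
import xml.etree.ElementTree as ET

XML = b"""<?xml version="1.0" encoding="utf-8"?>
<votes>
  <row Id="1" PostId="1" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="2" PostId="1" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="3" PostId="3" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="5" PostId="3" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
</votes>"""

def parse_range(start, end, out, lock):
    """Collect the Ids of rows with start <= Id < end (every row is still read)."""
    for _event, element in ET.iterparse(io.BytesIO(XML)):
        if element.tag == "row":
            row_id = int(element.get("Id"))  # attribute values are strings
            if start <= row_id < end:
                with lock:
                    out.append(row_id)
        element.clear()  # free the element once it has been handled

results = []
lock = threading.Lock()
chunk = 2  # hypothetical chunk size; the question would use 50k
threads = [
    threading.Thread(
        target=parse_range,
        args=(i * chunk + 1, (i + 1) * chunk + 1, results, lock),
    )
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))
```

The ranges here are [1, 3), [3, 5) and [5, 7), so each row is claimed by exactly one thread even though Id 4 is missing from the data.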

how to get file names and paths based on a given attribute in parent tag

I want to change the code below to get file_names and file_paths only when the fastboot="true" attribute is present in the parent tag. I have provided the current output and the expected output. Can anyone provide guidance on how to do it?
import sys
import os
import string
from xml.dom import minidom

if __name__ == '__main__':
    meta_contents = minidom.parse("fast.xml")
    builds_flat = meta_contents.getElementsByTagName("builds_flat")[0]
    build_nodes = builds_flat.getElementsByTagName("build")
    for build in build_nodes:
        bid_name = build.getElementsByTagName("name")[0]
        print("Checking if this is cnss related image... : \n"+bid_name.firstChild.data)
        if (bid_name.firstChild.data == 'apps'):
            file_names = build.getElementsByTagName("file_name")
            file_paths = build.getElementsByTagName("file_path")
            print("now files paths...\n")
            for fn,fp in zip(file_names,file_paths):
                if (not fp.firstChild.nodeValue.endswith('/')):
                    fp.firstChild.nodeValue = fp.firstChild.nodeValue + '/'
                full_path = fp.firstChild.nodeValue+fn.firstChild.nodeValue
                print("file-to-copy: "+full_path)
            break
INPUT XML:-
<builds_flat>
    <build>
        <name>apps</name>
        <file_ref ignore="true" minimized="true">
            <file_name>adb.exe</file_name>
            <file_path>LINUX/android/vendor/qcom/proprietary/usb/host/windows/prebuilt/</file_path>
        </file_ref>
        <file_ref ignore="true" minimized="true">
            <file_name>system.img</file_name>
            <file_path>LINUX/android/out/target/product/msmcobalt/secondary-boot/</file_path>
        </file_ref>
        <download_file cmm_file_var="APPS_BINARY" fastboot_rumi="boot" fastboot="true" minimized="true">
            <file_name>boot.img</file_name>
            <file_path>LINUX/android/out/target/product/msmcobalt/</file_path>
        </download_file>
        <download_file sparse_image_path="true" fastboot_rumi="abl" fastboot="true" minimized="true">
            <file_name>abl.elf</file_name>
            <file_path>LINUX/android/out/target/product/msmcobalt/</file_path>
        </download_file>
    </build>
</builds_flat>
OUTPUT:-
...............
now files paths...
file-to-copy: LINUX/android/vendor/qcom/proprietary/usb/host/windows/prebuilt/adb.exe
file-to-copy: LINUX/android/out/target/product/msmcobalt/secondary-boot/system.img
file-to-copy: LINUX/android/out/target/product/msmcobalt/boot.img
file-to-copy: LINUX/android/out/target/product/msmcobalt/abl.elf
EXPECTED OUT:-
now files paths...
........
file-to-copy: LINUX/android/out/target/product/msmcobalt/boot.img
file-to-copy: LINUX/android/out/target/product/msmcobalt/abl.elf

Something rather quick and dirty that comes to mind is using the fact that only the download_file elements have the fastboot attribute, right? If that's the case, you could always get the children of type download_file and filter the ones whose fastboot attribute is not "true":
import os
from xml.dom import minidom

if __name__ == '__main__':
    meta_contents = minidom.parse("fast.xml")
    for elem in meta_contents.getElementsByTagName('download_file'):
        # keep only the download_file elements flagged fastboot="true"
        if elem.getAttribute('fastboot') == "true":
            path = elem.getElementsByTagName('file_path')[0].firstChild.nodeValue
            file_name = elem.getElementsByTagName('file_name')[0].firstChild.nodeValue
            print(os.path.join(path, file_name))
With the sample you provided that outputs:
$ python ./stack_034.py
LINUX/android/out/target/product/msmcobalt/boot.img
LINUX/android/out/target/product/msmcobalt/abl.elf
Needless to say, since there's no .xsd file (not that it would matter with minidom), you only get strings (no type safety), and this only applies to the structure shown in the example, so you would probably want to add some extra checks.
EDIT:
As per the comment in this answer:
To get the elements within the <build> that contains a <name> element with value apps, you can find that <name> tag (the one whose value is the string apps), move to its parent node (which puts you in the build element), and then proceed as described above:
import os
from xml.dom import minidom

if __name__ == '__main__':
    meta_contents = minidom.parse("fast.xml")
    for elem in meta_contents.getElementsByTagName('name'):
        if elem.firstChild.nodeValue == "apps":
            # the parent of the matching <name> is the <build> we want
            apps_build = elem.parentNode
            for dl in apps_build.getElementsByTagName('download_file'):
                if dl.getAttribute('fastboot') == "true":
                    path = dl.getElementsByTagName('file_path')[0].firstChild.nodeValue
                    file_name = dl.getElementsByTagName('file_name')[0].firstChild.nodeValue
                    print(os.path.join(path, file_name))
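For a version that can be run without fast.xml on disk, the sketch below parses a shortened copy of the sample with minidom.parseString and returns the paths as a list. The function name fastboot_files and its build_name parameter are made up for this example, and posixpath.join is used instead of os.path.join only so that the result looks the same on every platform.

```python
import posixpath
from xml.dom import minidom

XML = """<builds_flat>
  <build>
    <name>apps</name>
    <file_ref ignore="true" minimized="true">
      <file_name>adb.exe</file_name>
      <file_path>LINUX/android/vendor/qcom/proprietary/usb/host/windows/prebuilt/</file_path>
    </file_ref>
    <download_file cmm_file_var="APPS_BINARY" fastboot_rumi="boot" fastboot="true" minimized="true">
      <file_name>boot.img</file_name>
      <file_path>LINUX/android/out/target/product/msmcobalt/</file_path>
    </download_file>
    <download_file sparse_image_path="true" fastboot_rumi="abl" fastboot="true" minimized="true">
      <file_name>abl.elf</file_name>
      <file_path>LINUX/android/out/target/product/msmcobalt/</file_path>
    </download_file>
  </build>
</builds_flat>"""

def text_of(parent, tag):
    """Return the stripped text of the first <tag> below parent."""
    return parent.getElementsByTagName(tag)[0].firstChild.nodeValue.strip()

def fastboot_files(xml_text, build_name="apps"):
    """List path/name for every fastboot="true" download_file in the named build."""
    doc = minidom.parseString(xml_text)
    found = []
    for build in doc.getElementsByTagName("build"):
        if text_of(build, "name") != build_name:
            continue  # not the build we are looking for
        for dl in build.getElementsByTagName("download_file"):
            if dl.getAttribute("fastboot") != "true":
                continue  # attribute missing or set to something else
            found.append(posixpath.join(text_of(dl, "file_path"), text_of(dl, "file_name")))
    return found

for path in fastboot_files(XML):
    print("file-to-copy: " + path)
```

This assumes every build has a non-empty name element and every download_file has both file_name and file_path; real input may need the extra checks mentioned above.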

Preserving special characters in text nodes using Python lxml module

I am editing an XML file that is provided by a third party. The XML is used to recreate an entire environment, and one can edit the XML to propagate the changes. I was able to look up the element I wanted to change through command-line options and save the XML, but special characters are being escaped and I need to retain them. For example, it changes > to &gt; in the file during the .write operation. This affects all occurrences in the XML document, not just the element I modified. Below is my code:
import sys
from lxml import etree
from optparse import OptionParser

def parseCommandLine():
    usage = "usage: %prog [options] arg"
    parser = OptionParser(usage)
    parser.add_option("-f", "--file", dest="filename",
                      help="Context File name including full path", metavar="CONTEXT_FILE")
    parser.add_option("-k", "--key", dest="key",
                      help="Key you are looking for in Context File i.e s_isAdmin", metavar="s_someKey")
    parser.add_option("-v", "--value", dest="value",
                      help="The replacement value for the key")
    if len(sys.argv[1:]) < 3:
        print(len(sys.argv[1:]))
        parser.print_help()
        sys.exit(2)
    (options, args) = parser.parse_args()
    return options.filename, options.key, options.value

Filename, Key, Value = parseCommandLine()
parser_options = etree.XMLParser(attribute_defaults=True, dtd_validation=False, strip_cdata=False)
doc = etree.parse(Filename, parser_options)  # Open and parse the file
print(doc.findall("//*[@oa_var=%r]" % Key)[0].text)
oldval = doc.findall("//*[@oa_var=%r]" % Key)[0].text
val = doc.findall("//*[@oa_var=%r]" % Key)[0]
val.text = Value
print('old value is %s' % oldval)
print('new value is %s' % val.text)
root = doc.getroot()
doc.write(Filename, method='xml', with_tail=True, pretty_print=False)
Original file has this:
tf.fm.FulfillmentServer >> /s_u01/app/applmgr/f
Saved version is being replaced with this:
tf.fm.FulfillmentServer &gt;&gt; /s_u01/app/applmgr/f
I have been experimenting with pretty_print on the output side and DTD validation on the parsing side, and I am stumped.
Below is a diff between the changed file and the original file:
I updated the s_cookie_domain only.
diff finprod_acfpdb10.xml_original finprod_acfpdb10.xml
Warning: missing newline at end of file finprod_acfpdb10.xml
1,3c1
< <?xml version = '1.0'?>
< <!-- $Header: adxmlctx.tmp 115.426 2009/05/08 08:46:29 rdamodar ship $ -->
< <!--
---
> <!-- $Header: adxmlctx.tmp 115.426 2009/05/08 08:46:29 rdamodar ship $ --><!--
13,14c11
< -->
< <oa_context version="$Revision: 115.426 $">
---
> --><oa_context version="$Revision: 115.426 $">
242c239
< <cookiedomain oa_var="s_cookie_domain">.apollogrp.edu</cookiedomain>
---
> <cookiedomain oa_var="s_cookie_domain">.qadoamin.edu</cookiedomain>
526c523
< <FORMS60_BLOCK_URL_CHARACTERS oa_var="s_f60blockurlchar">%0a,%0d,!,%21,",%22,%28,%29,;,[,%5b,],%5d,{,%7b,|,%7c,},%7d,%7f,>,%3c,<,%3e</FORMS60_BLOCK_URL_CHARACTERS>
---
> <FORMS60_BLOCK_URL_CHARACTERS oa_var="s_f60blockurlchar">%0a,%0d,!,%21,",%22,%28,%29,;,[,%5b,],%5d,{,%7b,|,%7c,},%7d,%7f,&gt;,%3c,<,%3e</FORMS60_BLOCK_URL_CHARACTERS>
940c937
< <start_cmd oa_var="s_jtffstart">/s_u01/app/applmgr/jdk1.5.0_11/bin/java -Xmx512M -classpath .:/s_u01/app/applmgr/finprod/comn/java/jdbc111.zip:/s_u01/app/applmgr/finprod/comn/java/xmlparserv2.zip:/s_u01/app/applmgr/finprod/comn/java:/s_u01/app/applmgr/finprod/comn/java/apps.zip:/s_u01/app/applmgr/jdk1.5.0_11/classes:/s_u01/app/applmgr/jdk1.5.0_11/lib:/s_u01/app/applmgr/jdk1.5.0_11/lib/classes.zip:/s_u01/app/applmgr/jdk1.5.0_11/lib/classes.jar:/s_u01/app/applmgr/jdk1.5.0_11/lib/rt.jar:/s_u01/app/applmgr/jdk1.5.0_11/lib/i18n.jar:/s_u01/app/applmgr/finprod/comn/java/3rdparty/RFJavaInt.zip: -Dengine.LogPath=/s_u01/app/applmgr/finprod/comn/admin/log/finprod_acfpdb10 -Dengine.TempDir=/s_u01/app/applmgr/finprod/comn/temp -Dengine.CommandPromptEnabled=false -Dengine.CommandPort=11000 -Dengine.AOLJ.config=/s_u01/app/applmgr/finprod/appl/fnd/11.5.0/secure/acfpdb10_finprod.dbc -Dengine.ServerID=5000 -Ddebug=off -Dengine.LogLevel=1 -Dlog.ShowWarnings=false -Dengine.FaxEnabler=oracle.apps.jtf.fm.engine.rightfax.RfFaxEnablerImpl -Dengine.PrintEnabler=oracle.apps.jtf.fm.engine.rightfax.RfPrintEnablerImpl -Dfax.TempDir=/s_u01/app/applmgr/finprod/comn/admin/log/finprod_acfpdb10 -Dprint.TempDir=/s_u01/app/applmgr/finprod/comn/admin/log/finprod_acfpdb10 oracle.apps.jtf.fm.FulfillmentServer >> /s_u01/app/applmgr/finprod/comn/admin/log/finprod_acfpdb10/jtffmctl.txt</start_cmd>
---
> <start_cmd oa_var="s_jtffstart">/s_u01/app/applmgr/jdk1.5.0_11/bin/java -Xmx512M -classpath .:/s_u01/app/applmgr/finprod/comn/java/jdbc111.zip:/s_u01/app/applmgr/finprod/comn/java/xmlparserv2.zip:/s_u01/app/applmgr/finprod/comn/java:/s_u01/app/applmgr/finprod/comn/java/apps.zip:/s_u01/app/applmgr/jdk1.5.0_11/classes:/s_u01/app/applmgr/jdk1.5.0_11/lib:/s_u01/app/applmgr/jdk1.5.0_11/lib/classes.zip:/s_u01/app/applmgr/jdk1.5.0_11/lib/classes.jar:/s_u01/app/applmgr/jdk1.5.0_11/lib/rt.jar:/s_u01/app/applmgr/jdk1.5.0_11/lib/i18n.jar:/s_u01/app/applmgr/finprod/comn/java/3rdparty/RFJavaInt.zip: -Dengine.LogPath=/s_u01/app/applmgr/finprod/comn/admin/log/finprod_acfpdb10 -Dengine.TempDir=/s_u01/app/applmgr/finprod/comn/temp -Dengine.CommandPromptEnabled=false -Dengine.CommandPort=11000 -Dengine.AOLJ.config=/s_u01/app/applmgr/finprod/appl/fnd/11.5.0/secure/acfpdb10_finprod.dbc -Dengine.ServerID=5000 -Ddebug=off -Dengine.LogLevel=1 -Dlog.ShowWarnings=false -Dengine.FaxEnabler=oracle.apps.jtf.fm.engine.rightfax.RfFaxEnablerImpl -Dengine.PrintEnabler=oracle.apps.jtf.fm.engine.rightfax.RfPrintEnablerImpl -Dfax.TempDir=/s_u01/app/applmgr/finprod/comn/admin/log/finprod_acfpdb10 -Dprint.TempDir=/s_u01/app/applmgr/finprod/comn/admin/log/finprod_acfpdb10 oracle.apps.jtf.fm.FulfillmentServer &gt;&gt; /s_u01/app/applmgr/finprod/comn/admin/log/finprod_acfpdb10/jtffmctl.txt</start_cmd>
983c980
< </oa_context>
---
> </oa_context>

Terminology: Parsers don't write XML; they read XML. Serialisers write XML.
In normal element content, < and & are illegal and must be escaped. > is legal except where it follows ]] and is NOT the end of a CDATA section. Most serialisers take the easy way out and write &gt; because a parser will handle both that and >.
I suggest that you submit both your output and input files to an XML validation service and also test whether the consumer will actually parse your output file.
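That a parser treats > and &gt; as the same text can be checked with a short round trip. This sketch uses the standard library's xml.etree.ElementTree rather than lxml, and the element content (including the /tmp/jtffmctl.txt path) is a shortened, hypothetical stand-in for the start_cmd line in the question.

```python
import xml.etree.ElementTree as ET

# Input written the way the original file has it: a literal ">>" in the text.
original = "<start_cmd>FulfillmentServer >> /tmp/jtffmctl.txt</start_cmd>"

elem = ET.fromstring(original)

# Serialising escapes ">" even though it was legal unescaped.
serialized = ET.tostring(elem, encoding="unicode")

# Parsing the serialised form gives back exactly the same text.
reparsed = ET.fromstring(serialized)

print(serialized)
print(reparsed.text == elem.text)
```

If the consumer of the file uses a real XML parser, the two spellings are indistinguishable to it; the escaping only matters to tools that read the file as plain text.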

The only thing I can think of is forcing the parser to treat the nodes you modify as cdata blocks (as the parser is clearly changing the xml tag closing brackets). Try val.text = etree.CDATA(Value) instead of val.text = Value.
http://lxml.de/api.html#cdata
