Multithreading/Multiprocessing to parse single XML file? [duplicate] - python

This question already has answers here:
Parsing Very Large XML Files Using Multiprocessing
(2 answers)
Closed 5 years ago.
Can someone tell me how to assign jobs to multiple threads to speed up parsing time? For example, I have an XML file with 200k lines, and I would like to assign 50k lines to each of 4 threads and parse them using a SAX parser. What I have done so far has all 4 threads parsing the full 200k lines, which means 200k*4 = 800k duplicated results.
Any help is appreciated.
test.xml:
<?xml version="1.0" encoding="utf-8"?>
<votes>
<row Id="1" PostId="1" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
<row Id="2" PostId="1" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
<row Id="3" PostId="3" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
<row Id="5" PostId="3" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
</votes>
My source code:
import json
import xmltodict
from lxml import etree
import xml.etree.ElementTree as ElementTree
import threading
import time
def sax_parsing():
    t = threading.currentThread()
    for event, element in etree.iterparse("/home/xiang/Downloads/FYP/parallel-python/test.xml"):
        # the code below reads the attributes of the specified element
        if element.tag == 'row':
            print("Thread: %s" % t.getName())
            row_id = element.attrib.get('Id')
            row_post_id = element.attrib.get('PostId')
            row_vote_type_id = element.attrib.get('VoteTypeId')
            row_user_id = element.attrib.get('UserId')
            row_creation_date = element.attrib.get('CreationDate')
            print('ID: %s, PostId: %s, VoteTypeID: %s, UserId: %s, CreationDate: %s' % (row_id, row_post_id, row_vote_type_id, row_user_id, row_creation_date))
            element.clear()
    return
if __name__ == "__main__":
    start = time.time()  # calculate execution time
    main_thread = threading.currentThread()
    no_threads = 4
    for i in range(no_threads):
        t = threading.Thread(target=sax_parsing)
        t.start()
    for t in threading.enumerate():
        if t is main_thread:
            continue
        t.join()
    end = time.time()  # calculate execution time
    exec_time = end - start
    print('Execution time: %fs' % (exec_time))

The simplest way: expand your parse function to receive a start row and an end row, like so:
def sax_parsing(start, end):
and then when creating the threads:
t = threading.Thread(target=sax_parsing, args=(i*50, (i+1)*50))
and change if element.tag == 'row': to if element.tag == 'row' and start <= int(element.attrib.get('Id')) < end:
so each thread checks just the rows it was given in the range
(didn't actually check this, so play around)
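A minimal sketch putting those pieces together (the 50-row chunk size is just illustrative, and the int() cast matters because XML attribute values are parsed as strings). Note that CPython threads won't actually parse in parallel because of the GIL, which is why the duplicate linked above reaches for multiprocessing instead:
import threading
from lxml import etree

def sax_parsing(start, end):
    # every thread still scans the whole file, but only handles its slice of rows
    for event, element in etree.iterparse("test.xml"):
        if element.tag == 'row':
            row_id = int(element.attrib.get('Id'))
            if start <= row_id < end:
                print('Thread %s handles ID %d' % (threading.current_thread().name, row_id))
        element.clear()

no_threads = 4
rows_per_thread = 50  # adjust to the file, e.g. 50000 for 200k rows
threads = [threading.Thread(target=sax_parsing,
                            args=(i * rows_per_thread, (i + 1) * rows_per_thread))
           for i in range(no_threads)]
for t in threads:
    t.start()
for t in threads:
    t.join()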

Related

Speedup extracting data from larger xml files using python

Hello, I am not a strong Python user, but I need to extract values from an XML file.
I am using a for loop to get attribute values from an 'xml.dom.minidom.document'.
Both the xyz and temp extractions use for loops; since the file has half a million values, it takes time.
I tried using lxml, but it had an error:
module 'lxml' has no attribute 'parse' or 'Xpath'
The xml file has following format
<?xml version="1.0" encoding="utf-8"?>
<variable_output>
<!--version : 1-->
<!--object title : Volume (1)-->
<!--scalar variable : Temperature (TEMP)-->
<POINT>
<Vertex>
<Position x="-0.176300004" y="-0.103100002" z="-0.153699994"/>
<Scalar TEMP="84.192421"/>
</Vertex>
</POINT>
<POINT>
<Vertex>
<Position x="-0.173557162" y="-0.103100002" z="-0.153699994"/>
<Scalar TEMP="83.9050522"/>
</Vertex>
</POINT>
<POINT>
<Vertex>
<Position x="-0.170814306" y="-0.103100002" z="-0.153699994"/>
<Scalar TEMP="83.7506332"/>
</Vertex>
</POINT>
</variable_output>
The following code takes a long time for bigger files.
from xml.dom.minidom import parse
import xml.dom.minidom
import csv
import pandas as pd
import numpy as np
import os
import glob
import time
from lxml import etree

v = []
doc = parse("document.xml")
Val = doc.getElementsByTagName("Scalar")
t0 = time.time()
for s in Val:
    v = np.append(v, float(s.attributes['TEMP'].value))
res = np.array([v])
t1 = time.time()
total = (t1-t0)
print('Time for Value', str(total))

# Using lxml
doc2 = etree.parse("document.xml")
# try using Xpath
t0 = time.time()
temp = doc2.Xpath("/POINT/Vertex/Scaler/@TEMP")
t1 = time.time()
total2 = t1-t0
print('Time for Value', str(total2))

# save data as csv from xml
pd.DataFrame(res.T).to_csv(('Data.csv'), index=False, header=False)  # write timestep as csv
The error while using XPath to get the values of TEMP, or x, y, z:
In [12]: temp=doc2.Xpath("/POINT/Vertex/Scaler/@TEMP")
Traceback (most recent call last):
  File "<ipython-input-12-bbd832a3074e>", line 1, in <module>
    temp=doc2.Xpath("/POINT/Vertex/Scaler/@TEMP")
AttributeError: 'lxml.etree._ElementTree' object has no attribute 'Xpath'
I recommend iterparse() for large xml files:
import timeit
import os, psutil
import datetime
import pandas as pd
import xml.etree.ElementTree as ET

class parse_xml:
    def __init__(self, path):
        self.xml = os.path.split(path)[1]
        print(self.xml)
        columns = ["Pos_x", "Pos_y", "Pos_z", "Scalar_Temp"]
        data = []
        for event, elem in ET.iterparse(self.xml, events=("end",)):
            if elem.tag == "Position":
                x = elem.get("x")
                y = elem.get("y")
                z = elem.get("z")
            if elem.tag == "Scalar":
                row = (x, y, z, elem.get("TEMP"))
                data.append(row)
            elem.clear()
        df = pd.DataFrame(data, columns=columns)
        print(df)

def main():
    xml_file = r"D:\Daten\Programmieren\stackoverflow\document.xml"
    parse_xml(xml_file)

if __name__ == "__main__":
    now = datetime.datetime.now()
    starttime = timeit.default_timer()
    main()
    process = psutil.Process(os.getpid())
    print('\nFinished')
    print(f"{now:%Y-%m-%d %H:%M}")
    print('Runtime:', timeit.default_timer() - starttime)
    print(f'RAM: {process.memory_info().rss/1000**2} MB')
Output:
document.xml
Pos_x Pos_y Pos_z Scalar_Temp
0 -0.176300004 -0.103100002 -0.153699994 84.192421
1 -0.173557162 -0.103100002 -0.153699994 83.9050522
2 -0.170814306 -0.103100002 -0.153699994 83.7506332
Finished
2022-11-29 23:51
Runtime: 0.007375300000000029
RAM: 55.619584 MB
If the output will be too large you can write it to a sqlite3 database with df.to_sql().
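As an aside, the lxml error in the question is just a naming issue: the method is lowercase .xpath(), XPath selects attributes with @, the element is spelled Scalar (not Scaler), and an absolute path has to start at the root element. So something like this should return the temperatures directly:
from lxml import etree

doc2 = etree.parse("document.xml")
# returns the attribute values as a list of strings, e.g. ['84.192421', '83.9050522', ...]
temp = doc2.xpath("/variable_output/POINT/Vertex/Scalar/@TEMP")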

Python XML findall does not work

I am trying to use findall to select some XML elements, but I can't get any results.
import xml.etree.ElementTree as ET
import sys

storefront = sys.argv[1]
xmlFileName = 'promotions{0}.xml'
xmlFile = xmlFileName.format(storefront)
csvFileName = 'hrz{0}.csv'
csvFile = csvFileName.format(storefront)
ET.register_namespace('', "http://www.demandware.com/xml/impex/promotion/2008-01-31")
tree = ET.parse(xmlFile)
root = tree.getroot()
print('------------------Generate test-------------\n')
csv = open(csvFile, 'w')
n = 0
for child in root.findall('campaign'):
    print(child.attrib['campaign-id'])
    print(n)
    n += 1
The XML looks something like this:
<?xml version="1.0" encoding="UTF-8"?>
<promotions xmlns="http://www.demandware.com/xml/impex/promotion/2008-01-31">
<campaign campaign-id="10off-310781">
<enabled-flag>true</enabled-flag>
<campaign-scope>
<applicable-online/>
</campaign-scope>
<customer-groups match-mode="any">
<customer-group group-id="Everyone"/>
</customer-groups>
</campaign>
<campaign campaign-id="MNT-deals">
<enabled-flag>true</enabled-flag>
<campaign-scope>
<applicable-online/>
</campaign-scope>
<start-date>2017-07-03T22:00:00.000Z</start-date>
<end-date>2017-07-31T22:00:00.000Z</end-date>
<customer-groups match-mode="any">
<customer-group group-id="Everyone"/>
</customer-groups>
</campaign>
<campaign campaign-id="black-friday">
<enabled-flag>true</enabled-flag>
<campaign-scope>
<applicable-online/>
</campaign-scope>
<start-date>2017-11-23T23:00:00.000Z</start-date>
<end-date>2017-11-24T23:00:00.000Z</end-date>
<customer-groups match-mode="any">
<customer-group group-id="Everyone"/>
</customer-groups>
<custom-attributes>
<custom-attribute attribute-id="expires_date">2017-11-29</custom-attribute>
</custom-attributes>
</campaign>
<promotion-campaign-assignment promotion-id="winter17-new-bubble" campaign-id="winter17-new-bubble">
<qualifiers match-mode="any">
<customer-groups/>
<source-codes/>
<coupons/>
</qualifiers>
<rank>100</rank>
</promotion-campaign-assignment>
<promotion-campaign-assignment promotion-id="xmas" campaign-id="xmas">
<qualifiers match-mode="any">
<customer-groups/>
<source-codes/>
<coupons/>
</qualifiers>
</promotion-campaign-assignment>
</promotions>
Any ideas what I am doing wrong?
I have tried different solutions that I found on Stack Overflow, but nothing seems to work for me.
The list is empty.
Sorry if it is something very obvious; I am new to Python.
As mentioned here by @MartijnPieters, etree's .findall uses the namespaces argument, while .register_namespace() is used for XML output of the tree. Therefore, consider mapping the default namespace with an explicit prefix. Below uses doc, but it could even be cosmin.
Additionally, consider with, enumerate(), and even the csv module as better handlers for your print and CSV outputs.
import csv
...
root = tree.getroot()
print('------------------Generate test-------------\n')

with open(csvFile, 'w') as f:
    c = csv.writer(f, lineterminator='\n')
    for n, child in enumerate(root.findall('doc:campaign',
                                           namespaces={'doc': 'http://www.demandware.com/xml/impex/promotion/2008-01-31'})):
        print(child.attrib['campaign-id'])
        print(n)
        c.writerow([child.attrib['campaign-id']])
# ------------------Generate test-------------
# 10off-310781
# 0
# MNT-deals
# 1
# black-friday
# 2
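Equivalently, you can skip the namespaces dict and inline the URI in Clark notation, which ElementTree also accepts:
uri = '{http://www.demandware.com/xml/impex/promotion/2008-01-31}'
for child in root.findall(uri + 'campaign'):
    print(child.attrib['campaign-id'])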

Splitting file based on data comparison

I've recently been using a Garmin GPS path tracker, which produces files like this:
<?xml version="1.0" encoding="UTF-8"?>
<gpx version="1.1" creator="GPS Track Editor" xmlns="http://www.topografix.com/GPX/1/1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:gte="http://www.gpstrackeditor.com/xmlschemas/General/1" xmlns:gpxtpx="http://www.garmin.com/xmlschemas/TrackPointExtension/v1" xmlns:gpxx="http://www.garmin.com/xmlschemas/GpxExtensions/v3" targetNamespace="http://www.topografix.com/GPX/1/1" elementFormDefault="qualified" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd">
<metadata>
<name>Ślad_16-SIE-15 190121.gpx</name>
<link href="http://www.garmin.com">
<text>Garmin International</text>
</link>
</metadata>
<trk>
<name>16-SIE-15 19:01:21</name>
<trkseg>
<trkpt lat="55.856890" lon="-4.250866">
<ele>9.27</ele>
<time>2015-08-16T08:32:13Z</time>
</trkpt>
<trkpt lat="55.856904" lon="-4.250904">
<ele>6.39</ele>
<time>2015-08-16T08:32:15Z</time>
</trkpt>
...
<trkpt lat="55.876979" lon="-4.286995">
<ele>46.28</ele>
<time>2015-08-16T17:22:14Z</time>
</trkpt>
<extensions>
<gte:name>#1</gte:name>
<gte:color>#fbaf00</gte:color>
</extensions>
</trkseg>
</trk>
</gpx>
The thing is that sometimes the device loses signal (in an inner city, for example), which causes the footpath to be interpolated in an unpleasant manner:
[image: footpath with long straight interpolated segments]
I would like to split the footpath file into three separate files (to avoid these long arrows; see the picture).
I ended up with the following decomposition of the problem:
1. Read the latitude (lat) and longitude (lon) values from the original file.
2. Compare 2 consecutive lat and lon values until an assumed difference is met, while saving them to file one.
3. Add the ending to file one, add the pre-data tags to file two, and continue comparing.
Since I'm trying to learn Python 2.X, I'm stuck with this:
gpxFile = open('track.gpx', 'r')
with open("track.gpx", "r") as gpxFile:
    data = gpxFile.read()
    print data
    for subString in data:
        subString = data[data.find("<trkpt")+12:data.find("lon")-2] + " " + data[data.find("lon")+5:data.find("<ele>")-6]
Can anybody help me with that, or at least give me a heads-up on what to look for in the documentation or tutorials?
Thanks.
Cheers!
This isn't perfect, but it should do what you want. If not, it should serve as a good starting point. It works by reading in the XML file, extracting all of the track points, and then finding the gaps based on the timestamps. For each group of points, it outputs a new file named original_N.gpx (N = 0,1,2,...) where the input file is original.gpx. It could be modified to use distance between points, but time seemed a little easier. Look at delta_too_large(pt1, pt2) to change the gap detection, currently two seconds.
GitHub (Public Domain)
#!/usr/bin/env python
# Copyright (C) 2015 Harvey Chapman <hchapman@3gfp.com>
# Public Domain
# Use at your own risk.
"""
Splits a gpx file with breaks in the track into separate files.

Based on: http://stackoverflow.com/q/33803614/47078
"""
import sys
import re
import os
from datetime import datetime, timedelta
from itertools import izip
from xml.etree import ElementTree

ns = {'gpx': 'http://www.topografix.com/GPX/1/1'}

def iso8601_to_datetime(datestring):
    d = datetime(*map(int, re.split(r'\D', datestring)[:-1]))
    # intentionally ignoring timezone info (for now)
    # d = d.replace(tzinfo=UTC)
    return d

def datetime_from_trkpt(trkpt):
    datestring = trkpt.find('gpx:time', ns).text
    return iso8601_to_datetime(datestring)

def delta_too_large(trkpt1, trkpt2):
    delta = datetime_from_trkpt(trkpt2) - datetime_from_trkpt(trkpt1)
    return delta > timedelta(seconds=2)

def trkpt_groups(trkpts):
    last_index = 0
    for n, (a, b) in enumerate(izip(trkpts[:-1], trkpts[1:]), start=1):
        if delta_too_large(a, b):
            yield last_index, n
            last_index = n
    yield last_index, len(trkpts)

def remove_all_trkpts_from_trkseg(trkseg):
    trkpts = trkseg.findall('gpx:trkpt', ns)
    for trkpt in trkpts:
        trkseg.remove(trkpt)
    return trkpts

def add_trkpts_to_trkseg(trkseg, trkpts):
    # not sure if this will be slow or not...
    for trkpt in reversed(trkpts):
        trkseg.insert(0, trkpt)

def save_xml(filename, index, tree):
    filename_parts = os.path.splitext(filename)
    new_filename = '{1}_{0}{2}'.format(index, *filename_parts)
    with open(new_filename, 'wb') as f:
        tree.write(f,
                   xml_declaration=True,
                   encoding='utf-8',
                   method='xml')

def get_trkseg(tree):
    trk = tree.getroot().findall('gpx:trk', ns)
    if len(trk) > 1:
        raise Exception("Don't know how to parse multiple tracks!")
    trkseg = trk[0].findall('gpx:trkseg', ns)
    if len(trkseg) > 1:
        raise Exception("Don't know how to parse multiple track segment lists!")
    return trkseg[0]

def split_gpx_file(filename):
    ElementTree.register_namespace('', ns['gpx'])
    tree = ElementTree.parse(filename)
    trkseg = get_trkseg(tree)
    trkpts = remove_all_trkpts_from_trkseg(trkseg)
    for n, (start, end) in enumerate(trkpt_groups(trkpts)):
        # Remove all points and insert only the ones for this group
        remove_all_trkpts_from_trkseg(trkseg)
        add_trkpts_to_trkseg(trkseg, trkpts[start:end])
        save_xml(filename, n, tree)

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print >> sys.stderr, "Usage: {} file.gpx".format(sys.argv[0])
        sys.exit(-1)
    split_gpx_file(sys.argv[1])
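The answer notes the gap test could use distance instead of time; a possible drop-in replacement for delta_too_large would be a haversine great-circle check (the 100 m threshold here is an assumption to tune):
from math import radians, sin, cos, asin, sqrt

def delta_too_large(trkpt1, trkpt2, max_meters=100.0):
    # GPX stores lat/lon as string attributes on <trkpt>
    lat1, lon1 = radians(float(trkpt1.get('lat'))), radians(float(trkpt1.get('lon')))
    lat2, lon2 = radians(float(trkpt2.get('lat'))), radians(float(trkpt2.get('lon')))
    # haversine formula, mean earth radius ~6371 km
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 6371000.0 * 2 * asin(sqrt(a)) > max_meters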

parsing parent/children relationship of XML elements

Given the following XML (ant build xml):
<project name="pj1">
<target name="t1">
...
<antcall target="t2"/>
<a>
<antcall target="t4"/>
</a>
...
</target>
<target name="t2">
...
<antcall target="t3"/>
...
</target>
<target name="t3">
...
...
</target>
<target name="t4">
...
<antcall target="t2"/>
...
</target>
<target name="t5">
...
...
</target>
</project>
I'd like to display the parent/children relationships of the target elements as follows (without displaying a target as a first-level element if it is nested in another target):
t1
t2
t3
t4
t2
t3
t5
Could anyone please help?
Thanks in advance.
When I need to transform an XML tree into some other representation, I find it useful to first convert it to an abstract representation and then convert that to the final concrete representation.
In this case, first we create a dictionary of lists which represent the target dependency structure, then we pretty-print that dictionary.
#!/usr/bin/python
import xml.etree.ElementTree as ET
from itertools import chain

def parse(filename):
    tree = ET.parse(filename)
    root = tree.getroot()
    result = {}
    for target in root.findall('target'):
        target_name = target.get('name')
        result[target_name] = []
        for antcall in target.findall('.//antcall'):
            result[target_name].append(antcall.get('target'))
    return result

def display(tree):
    def recurse(node, indent):
        print "%*s%s" % (indent*4, "", node)
        for node in sorted(tree[node]):
            recurse(node, indent+1)
    for item in sorted(tree):
        if item in chain(*tree.values()):
            continue
        recurse(item, 0)

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description='Dump ANT files')
    parser.add_argument('antfile',
                        nargs='+',
                        type=argparse.FileType('r'),
                        help='ANT build file')
    args = parser.parse_args()
    for antfile in args.antfile:
        display(parse(antfile))
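Run against the build file from the question (script and file names here are just assumptions), this prints the nesting the asker wanted:
$ python dump_ant.py build.xml
t1
    t2
        t3
    t4
        t2
            t3
t5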

Preserving special characters in text nodes using Python lxml module

I am editing an XML file that is provided by a third party. The XML is used to recreate an entire environment, and one is able to edit the XML to propagate changes. I was able to look up the element I wanted to change through command-line options and save the XML, but special characters are being escaped and I need to retain them. For example, it is changing > to &gt; in the file during the .write operation. This affects all occurrences in the XML document, not just the node element (I think that is what it is called) I edited. Below is my code:
import sys
from lxml import etree
from optparse import OptionParser

def parseCommandLine():
    usage = "usage: %prog [options] arg"
    parser = OptionParser(usage)
    parser.add_option("-f", "--file", dest="filename",
                      help="Context File name including full path", metavar="CONTEXT_FILE")
    parser.add_option("-k", "--key", dest="key",
                      help="Key you are looking for in Context File i.e s_isAdmin", metavar="s_someKey")
    parser.add_option("-v", "--value", dest="value",
                      help="The replacement value for the key")
    if len(sys.argv[1:]) < 3:
        print len(sys.argv[1:])
        parser.print_help()
        sys.exit(2)
    (options, args) = parser.parse_args()
    return options.filename, options.key, options.value

Filename, Key, Value = parseCommandLine()
parser_options = etree.XMLParser(attribute_defaults=True, dtd_validation=False, strip_cdata=False)
doc = etree.parse(Filename, parser_options)  # Open and parse the file
print doc.findall("//*[@oa_var=%r]" % Key)[0].text
oldval = doc.findall("//*[@oa_var=%r]" % Key)[0].text
val = doc.findall("//*[@oa_var=%r]" % Key)[0]
val.text = Value
print 'old value is %s' % oldval
print 'new value is %s' % val.text
root = doc.getroot()
doc.write(Filename, method='xml', with_tail=True, pretty_print=False)
Original file has this:
tf.fm.FulfillmentServer >> /s_u01/app/applmgr/f
Saved version is being replaced with this:
tf.fm.FulfillmentServer &gt;&gt; /s_u01/app/applmgr/f
I have been trying to mess with pretty_print on the output side and DTD validation on the parsing side, and I am stumped.
Below is a diff between the original file and the changed file:
I updated the s_cookie_domain only.
diff finprod_acfpdb10.xml_original finprod_acfpdb10.xml
Warning: missing newline at end of file finprod_acfpdb10.xml
1,3c1
< <?xml version = '1.0'?>
< <!-- $Header: adxmlctx.tmp 115.426 2009/05/08 08:46:29 rdamodar ship $ -->
< <!--
---
> <!-- $Header: adxmlctx.tmp 115.426 2009/05/08 08:46:29 rdamodar ship $ --><!--
13,14c11
< -->
< <oa_context version="$Revision: 115.426 $">
---
> --><oa_context version="$Revision: 115.426 $">
242c239
< <cookiedomain oa_var="s_cookie_domain">.apollogrp.edu</cookiedomain>
---
> <cookiedomain oa_var="s_cookie_domain">.qadoamin.edu</cookiedomain>
526c523
< <FORMS60_BLOCK_URL_CHARACTERS oa_var="s_f60blockurlchar">%0a,%0d,!,%21,",%22,%28,%29,;,[,%5b,],%5d,{,%7b,|,%7c,},%7d,%7f,>,%3c,&lt;,%3e</FORMS60_BLOCK_URL_CHARACTERS>
---
> <FORMS60_BLOCK_URL_CHARACTERS oa_var="s_f60blockurlchar">%0a,%0d,!,%21,",%22,%28,%29,;,[,%5b,],%5d,{,%7b,|,%7c,},%7d,%7f,&gt;,%3c,&lt;,%3e</FORMS60_BLOCK_URL_CHARACTERS>
940c937
< <start_cmd oa_var="s_jtffstart">/s_u01/app/applmgr/jdk1.5.0_11/bin/java -Xmx512M -classpath .:/s_u01/app/applmgr/finprod/comn/java/jdbc111.zip:/s_u01/app/applmgr/finprod/comn/java/xmlparserv2.zip:/s_u01/app/applmgr/finprod/comn/java:/s_u01/app/applmgr/finprod/comn/java/apps.zip:/s_u01/app/applmgr/jdk1.5.0_11/classes:/s_u01/app/applmgr/jdk1.5.0_11/lib:/s_u01/app/applmgr/jdk1.5.0_11/lib/classes.zip:/s_u01/app/applmgr/jdk1.5.0_11/lib/classes.jar:/s_u01/app/applmgr/jdk1.5.0_11/lib/rt.jar:/s_u01/app/applmgr/jdk1.5.0_11/lib/i18n.jar:/s_u01/app/applmgr/finprod/comn/java/3rdparty/RFJavaInt.zip: -Dengine.LogPath=/s_u01/app/applmgr/finprod/comn/admin/log/finprod_acfpdb10 -Dengine.TempDir=/s_u01/app/applmgr/finprod/comn/temp -Dengine.CommandPromptEnabled=false -Dengine.CommandPort=11000 -Dengine.AOLJ.config=/s_u01/app/applmgr/finprod/appl/fnd/11.5.0/secure/acfpdb10_finprod.dbc -Dengine.ServerID=5000 -Ddebug=off -Dengine.LogLevel=1 -Dlog.ShowWarnings=false -Dengine.FaxEnabler=oracle.apps.jtf.fm.engine.rightfax.RfFaxEnablerImpl -Dengine.PrintEnabler=oracle.apps.jtf.fm.engine.rightfax.RfPrintEnablerImpl -Dfax.TempDir=/s_u01/app/applmgr/finprod/comn/admin/log/finprod_acfpdb10 -Dprint.TempDir=/s_u01/app/applmgr/finprod/comn/admin/log/finprod_acfpdb10 oracle.apps.jtf.fm.FulfillmentServer >> /s_u01/app/applmgr/finprod/comn/admin/log/finprod_acfpdb10/jtffmctl.txt</start_cmd>
---
> <start_cmd oa_var="s_jtffstart">/s_u01/app/applmgr/jdk1.5.0_11/bin/java -Xmx512M -classpath .:/s_u01/app/applmgr/finprod/comn/java/jdbc111.zip:/s_u01/app/applmgr/finprod/comn/java/xmlparserv2.zip:/s_u01/app/applmgr/finprod/comn/java:/s_u01/app/applmgr/finprod/comn/java/apps.zip:/s_u01/app/applmgr/jdk1.5.0_11/classes:/s_u01/app/applmgr/jdk1.5.0_11/lib:/s_u01/app/applmgr/jdk1.5.0_11/lib/classes.zip:/s_u01/app/applmgr/jdk1.5.0_11/lib/classes.jar:/s_u01/app/applmgr/jdk1.5.0_11/lib/rt.jar:/s_u01/app/applmgr/jdk1.5.0_11/lib/i18n.jar:/s_u01/app/applmgr/finprod/comn/java/3rdparty/RFJavaInt.zip: -Dengine.LogPath=/s_u01/app/applmgr/finprod/comn/admin/log/finprod_acfpdb10 -Dengine.TempDir=/s_u01/app/applmgr/finprod/comn/temp -Dengine.CommandPromptEnabled=false -Dengine.CommandPort=11000 -Dengine.AOLJ.config=/s_u01/app/applmgr/finprod/appl/fnd/11.5.0/secure/acfpdb10_finprod.dbc -Dengine.ServerID=5000 -Ddebug=off -Dengine.LogLevel=1 -Dlog.ShowWarnings=false -Dengine.FaxEnabler=oracle.apps.jtf.fm.engine.rightfax.RfFaxEnablerImpl -Dengine.PrintEnabler=oracle.apps.jtf.fm.engine.rightfax.RfPrintEnablerImpl -Dfax.TempDir=/s_u01/app/applmgr/finprod/comn/admin/log/finprod_acfpdb10 -Dprint.TempDir=/s_u01/app/applmgr/finprod/comn/admin/log/finprod_acfpdb10 oracle.apps.jtf.fm.FulfillmentServer &gt;&gt; /s_u01/app/applmgr/finprod/comn/admin/log/finprod_acfpdb10/jtffmctl.txt</start_cmd>
983c980
< </oa_context>
---
> </oa_context>
Terminology: Parsers don't write XML; they read XML. Serialisers write XML.
In normal element content, < and & are illegal and must be escaped. > is legal except where it follows ]] and is NOT the end of a CDATA section. Most serialisers take the easy way out and write &gt; because a parser will handle both that and >.
I suggest that you submit both your output and input files to an XML validation service like this or this and also test whether the consumer will actually parse your output file.
The only thing I can think of is forcing the serialiser to treat the nodes you modify as CDATA blocks (as it is clearly escaping the closing brackets). Try val.text = etree.CDATA(Value) instead of val.text = Value.
http://lxml.de/api.html#cdata
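For illustration, a minimal self-contained sketch of the CDATA approach; lxml writes CDATA sections out verbatim, so the > characters survive serialisation:
from lxml import etree

root = etree.fromstring('<start_cmd/>')
root.text = etree.CDATA('java ... FulfillmentServer >> /logs/jtffmctl.txt')
# prints b'<start_cmd><![CDATA[java ... FulfillmentServer >> /logs/jtffmctl.txt]]></start_cmd>'
print(etree.tostring(root))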
