how to edit XML files in batch / python

I'm trying to edit XML files with a batch or Python script.
This is my XML file:
<?xml version="1.0" encoding="UTF-8"?>
<task name="analyse">
<taskInfo taskId="21a09311-ade3-4e9a-af21-d13be8b7ba45" runAt="2015-05-20 13:48:50" runTime="5 minutes, 53 seconds">
<project name="13955 - HMI Volvo Truck PA15" number="e20d51c0-71dc-4572-8f9b-4c150bf35222" />
<language lcid="1031" name="German (Germany)" />
<tm name="ENG-DEU_en-GB_de-DE.sdltm" />
<settings reportInternalFuzzyLeverage="yes" reportLockedSegments="no" reportCrossFileRepetitions="yes" minimumMatchScore="70" searchMode="bestWins" missingFormattingPenalty="1" differentFormattingPenalty="1" multipleTranslationsPenalty="1" autoLocalizationPenalty="0" textReplacementPenalty="0" />
</taskInfo>
<file name="VT MAIN TRACK_PA15_Default_DE-DE_20150520_102527.xlf.sdlxliff" guid="111f9ba6-82f6-45fb-ac49-8bf6cf57c169">
<analyse>
<perfect segments="0" words="0" characters="0" placeables="0" tags="0" />
<inContextExact segments="60" words="55" characters="755" placeables="3" tags="0" />
' Replace the value words="55" with "0"
<exact segments="114" words="334" characters="1687" placeables="14" tags="3" />
<locked segments="0" words="0" characters="0" placeables="0" tags="0" />
<crossFileRepeated segments="2" words="20" characters="0" placeables="0" tags="0" />
' Cut the value words="20" and replace it with 0
<repeated segments="17" words="34" characters="293" placeables="2" tags="0" />
' Add the cut value (20) to the current value (34), so the new value is words="54"
<total segments="449" words="1462" characters="7630" placeables="66" tags="24" />
<new segments="126" words="434" characters="2384" placeables="18" tags="5" />
<fuzzy min="75" max="84" segments="25" words="108" characters="528" placeables="6" tags="3" />
<fuzzy min="85" max="94" segments="23" words="92" characters="454" placeables="7" tags="4" />
<fuzzy min="95" max="99" segments="77" words="260" characters="1318" placeables="13" tags="6" />
<internalFuzzy min="75" max="84" segments="3" words="16" characters="100" placeables="2" tags="2" />
<internalFuzzy min="85" max="94" segments="4" words="25" characters="111" placeables="1" tags="1" />
<internalFuzzy min="95" max="99" segments="0" words="0" characters="0" placeables="0" tags="0" />
</analyse>
</file>
<file name="VT MAIN TRACK_PA15_Default_DE-DE_20150523_254796.xlf.sdlxliff" guid="111f9ba6-82f6-45fb-ac49-8bf6cf57c169">
<analyse>
<perfect segments="0" words="0" characters="0" placeables="0" tags="0" />
<inContextExact segments="60" words="67" characters="755" placeables="3" tags="0" />
' Replace the value words="67" with "0"
<exact segments="114" words="334" characters="1687" placeables="14" tags="3" />
<locked segments="0" words="0" characters="0" placeables="0" tags="0" />
<crossFileRepeated segments="2" words="35" characters="0" placeables="0" tags="0" />
' Cut the value words="35" and replace it with 0
<repeated segments="17" words="54" characters="293" placeables="2" tags="0" />
' Add the cut value (35) to the current value (54), so the new value is words="89"
<total segments="449" words="1462" characters="7630" placeables="66" tags="24" />
<new segments="126" words="434" characters="2384" placeables="18" tags="5" />
<fuzzy min="75" max="84" segments="25" words="108" characters="528" placeables="6" tags="3" />
<fuzzy min="85" max="94" segments="23" words="92" characters="454" placeables="7" tags="4" />
<fuzzy min="95" max="99" segments="77" words="260" characters="1318" placeables="13" tags="6" />
<internalFuzzy min="75" max="84" segments="3" words="16" characters="100" placeables="2" tags="2" />
<internalFuzzy min="85" max="94" segments="4" words="25" characters="111" placeables="1" tags="1" />
<internalFuzzy min="95" max="99" segments="0" words="0" characters="0" placeables="0" tags="0" />
</analyse>
</file>
<batchTotal>
<analyse>
<perfect segments="0" words="0" characters="0" placeables="0" tags="0" />
<inContextExact segments="60" words="139" characters="755" placeables="3" tags="0" />
<exact segments="114" words="334" characters="1687" placeables="14" tags="3" />
<locked segments="0" words="0" characters="0" placeables="0" tags="0" />
<crossFileRepeated segments="0" words="0" characters="0" placeables="0" tags="0" />
<repeated segments="17" words="54" characters="293" placeables="2" tags="0" />
<total segments="449" words="1462" characters="7630" placeables="66" tags="24" />
<new segments="126" words="434" characters="2384" placeables="18" tags="5" />
<fuzzy min="75" max="84" segments="25" words="108" characters="528" placeables="6" tags="3" />
<fuzzy min="85" max="94" segments="23" words="92" characters="454" placeables="7" tags="4" />
<fuzzy min="95" max="99" segments="77" words="260" characters="1318" placeables="13" tags="6" />
<internalFuzzy min="75" max="84" segments="3" words="16" characters="100" placeables="2" tags="2" />
<internalFuzzy min="85" max="94" segments="4" words="25" characters="111" placeables="1" tags="1" />
<internalFuzzy min="95" max="99" segments="0" words="0" characters="0" placeables="0" tags="0" />
</analyse>
</batchTotal>
</task>
General notes:
<task> is the root element (closed by </task>).
The important part is to modify a few elements inside each <file>...</file> section.
There can be any number of <file>...</file> occurrences.
What I need:
For each <file> element, I would like to:
In <inContextExact>, set the value of the words attribute to 0:
<inContextExact ... words="55" ... /> => <inContextExact ... words="0" ... />
In <crossFileRepeated>, set the value of the words attribute to 0:
<crossFileRepeated ... words="20" ... /> => <crossFileRepeated ... words="0" ... />
In <total>, set the value of the words attribute to a number calculated by my own logic:
<total ... words="1462" ... /> => <total ... words="??" ... />
I would really appreciate an example of processing XML files in batch or Python.

Let's use Python!
This is easy to do in Python, and since you said a Python solution is acceptable, see the script below.
It iterates over a directory of XML files, processes each one as requested, and saves the changes back to the same file.
from xml.etree import ElementTree
import os
def edit_xml_file(data):
    # Parse the raw bytes and update every <file>/<analyse> section.
    e = ElementTree.fromstring(data)
    for file_element in e.findall('file'):
        analyse_element = file_element.find('analyse')
        # Remember the old counts, then zero the attributes.
        in_context_exact_element = analyse_element.find('inContextExact')
        in_context_exact_words = int(in_context_exact_element.get('words'))
        in_context_exact_element.set('words', '0')
        cross_file_repeated_element = analyse_element.find('crossFileRepeated')
        cross_file_repeated_words = int(cross_file_repeated_element.get('words'))
        cross_file_repeated_element.set('words', '0')
        # Placeholder logic for <total>: replace it with your own calculation.
        total_element = analyse_element.find('total')
        total_element.set('words', str(in_context_exact_words + cross_file_repeated_words))
    xmlstr = ElementTree.tostring(e)
    return xmlstr
def main():
    source_directory = 'xmlfiles'
    for filename in os.listdir(source_directory):
        if not filename.endswith('.xml'):
            continue
        xml_file_path = os.path.join(source_directory, filename)
        # Open for binary read/write so the file can be rewritten in place.
        with open(xml_file_path, 'r+b') as f:
            data = f.read()
            fixed_data = edit_xml_file(data)
            f.seek(0)
            f.write(fixed_data)
            f.truncate()
if __name__ == '__main__':
    main()
This solution uses the built-in ElementTree module.
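The inline notes in the question's XML also ask for the crossFileRepeated word count to be added to repeated before it is zeroed, which the script above does not do. A minimal sketch of that variant follows, assuming the same element layout as the question; the function name merge_cross_file_repeated and the shortened sample are made up for illustration.

```python
from xml.etree import ElementTree


def merge_cross_file_repeated(data):
    """Zero two counters per <file> and fold crossFileRepeated into repeated."""
    root = ElementTree.fromstring(data)
    for analyse in root.findall('file/analyse'):
        in_context = analyse.find('inContextExact')
        cross = analyse.find('crossFileRepeated')
        repeated = analyse.find('repeated')
        if in_context is None or cross is None or repeated is None:
            continue  # skip sections that lack one of the expected elements
        moved = int(cross.get('words', '0'))
        in_context.set('words', '0')
        cross.set('words', '0')
        repeated.set('words', str(int(repeated.get('words', '0')) + moved))
    return ElementTree.tostring(root)


# Shortened, hypothetical sample based on the first <file> in the question.
sample = b'''<task name="analyse">
<file name="a.sdlxliff">
<analyse>
<inContextExact segments="60" words="55" />
<crossFileRepeated segments="2" words="20" />
<repeated segments="17" words="34" />
</analyse>
</file>
</task>'''

result = ElementTree.fromstring(merge_cross_file_repeated(sample))
print(result.find('file/analyse/repeated').get('words'))  # 20 + 34
```

Only elements under <file> are touched, so the <batchTotal> section is left unchanged.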

Necessary tools
Here are the necessary tools you will need to create a script in Excel VBA or VBscript:
Looping text files in a directory: link
Reading text files: link
Writing text files: link
Replacing using RegExp: link
Example Regex to get you going:
<exact segments="114" words="334" characters="1687" placeables="14" tags="3" />
->
<exact segments="114" words="0" characters="1687" placeables="14" tags="3" />
Use this regex:
words="[0-9]+" (or words="([0-9]+)" if you also need to capture the number)
Below is an example of processing a single row:
Dim re As RegExp
Set re = New RegExp
re.Pattern = "words=""[0-9]+"""
newTextRow = re.Replace(textRow, "words=""0""") 'Replace the words value with 0
The approach
Loop through your XML files using the Dir function
Read the contents of the file using the link above on how to read text files in VBA
Loop through all rows and use the RegExp function to replace the necessary word params
Save the output back to the XML file using the link above on how to write text files in VBA
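For comparison, the same row-by-row regex idea can be sketched in Python rather than VBA. Because every element in the file carries a words attribute, this sketch limits the replacement to one element name; the helper name zero_words is made up for illustration.

```python
import re


def zero_words(line, tag):
    """Set words="..." to 0, but only inside the start tag of the given element."""
    pattern = re.compile(r'(<' + re.escape(tag) + r'\b[^>]*\bwords=")\d+(")')
    return pattern.sub(r'\g<1>0\g<2>', line)


row = '<exact segments="114" words="334" characters="1687" placeables="14" tags="3" />'
new_row = zero_words(row, 'exact')
print(new_row)
```

A plain text replacement like this cannot tell a <file> section from <batchTotal>, which is one reason a real XML parser is usually the safer choice.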

Related

get default value if don't find elements with iterfind

I have this code to extract some elements from an XML file:
for general in tree.iter('FOLDER'):
    nameFolder = general.attrib.get('FOLDER_NAME')
    for job_nodeOS in tablaGeneral.iterfind(".//JOB[@APPL_TYPE='OS']"):
        listaOS.clear()
        listaOS.append(job_name)
        listaOS.append(nameFolder)
        listaOS.append(daily)
        for job_nodeOS3 in job_nodeOS.iterfind("ON"):
            listaOS.append(job_nodeOS3.get('STMT',"NO APLICA"))
            listaOS.append(job_nodeOS3.get('CODE',"NO APLICA"))
            for job_nodeOS4 in job_nodeOS3.iterfind("DOMAIL"):
                listaOS.append(job_nodeOS4.get('SUBJECT',"NO APLICA"))
                listaOS.append(job_nodeOS4.get('MESSAGE',"NO APLICA"))
        for variable_name in variablesOS:
            variable_node = job_nodeOS.find(f"./VARIABLE[@NAME='{variable_name}']")
            variable_value = variable_node.get("VALUE", default_value) if variable_node is not None else default_value
            #print(job_name, variable_name.lstrip("%"), "=", variable_value)
            listaOS.append(variable_value)
My problem is that when one of the for loops finds no matching elements, I still need listaOS to receive the default value ('NO APLICA').
A sample of the XML:
<?xml version="1.0" encoding="utf-8"?>
<!--Exported at 11-06-2022 17:14:50-->
<DEFTABLE xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="Folder.xsd">
<FOLDER SERVER="PROD" VERSION="800" PLATFORM="UNIX" FOLDER_NAME="PALNF" >
<JOB ID="256" APPLICATION="HOUSE" SUBAPP="SERVER" JOBNAME="JOBA" APPL_TYPE="OS">
<SHOUT WHEN="LATESUB" TIME="0825" URGENCY="R" DEST="DESTINATION" MESSAGE="HI" DAYSOFFSET="0" />
<ON STMT="*" CODE="*NETWORK*">
<DOACTION ACTION="NOTOK" />
<DOMAIL URGENCY="U" DEST="EXMAPLE@EXMAPLE.COM" SUBJECT="SUBJECT" MESSAGE="HI" />
</ON>
</JOB>
<JOB ID="1" APPLICATION="OFFICE" SUBAPP="Google" JOBNAME="Google_Update_Task_Machine_UA" APPL_TYPE="OS">
<VARIABLE NAME="%%PARM1" VALUE="GoogleUpdate.exe" />
<VARIABLE NAME="%%PARM2" VALUE="/ua /installsource scheduler" />
</JOB>
</FOLDER>
<FOLDER SERVER="PROD" VERSION="800" PLATFORM="UNIX" FOLDER_NAME="PALNF_CALENDARIO">
<JOB ID="2" APPLICATION="APP" SUBAPP="SUB" JOBNAME="NOSCHEDULER" APPL_TYPE="OS" />
<JOB ID="3" APPLICATION="APP" SUBAPP="SUB" JOBNAME="NOSCHEDULER_CONMONHDAYS" APPL_TYPE="OS" />
</FOLDER>
</DEFTABLE>
Do you know how I could do that?
Thanks, and sorry for my English!
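One possible approach is to replace the inner for loops with find(), which returns None when nothing matches, and fall back to the default in that case. The sketch below assumes that only the first ON element and its first DOMAIL child are needed per job; the helper names attr_or_default and job_row, and the shortened sample, are made up for illustration.

```python
import xml.etree.ElementTree as ET

DEFAULT = "NO APLICA"


def attr_or_default(node, name):
    """Return the node's attribute, or DEFAULT when the node or attribute is missing."""
    return node.get(name, DEFAULT) if node is not None else DEFAULT


def job_row(job, folder_name, variable_names):
    """Build one fixed-width row per JOB, padding missing data with DEFAULT."""
    on = job.find("ON")  # None when the job has no ON element
    mail = on.find("DOMAIL") if on is not None else None
    row = [job.get("JOBNAME", DEFAULT), folder_name,
           attr_or_default(on, "STMT"), attr_or_default(on, "CODE"),
           attr_or_default(mail, "SUBJECT"), attr_or_default(mail, "MESSAGE")]
    for name in variable_names:
        node = job.find(f"./VARIABLE[@NAME='{name}']")
        row.append(attr_or_default(node, "VALUE"))
    return row


# Shortened, hypothetical sample based on the XML in the question.
xml_text = """<DEFTABLE>
<FOLDER FOLDER_NAME="PALNF">
<JOB JOBNAME="JOBA" APPL_TYPE="OS">
<ON STMT="*" CODE="*NETWORK*">
<DOMAIL SUBJECT="SUBJECT" MESSAGE="HI" />
</ON>
</JOB>
<JOB JOBNAME="Google_Update_Task_Machine_UA" APPL_TYPE="OS">
<VARIABLE NAME="%%PARM1" VALUE="GoogleUpdate.exe" />
</JOB>
</FOLDER>
</DEFTABLE>"""

root = ET.fromstring(xml_text)
rows = []
for folder in root.iter("FOLDER"):
    for job in folder.iterfind(".//JOB[@APPL_TYPE='OS']"):
        rows.append(job_row(job, folder.get("FOLDER_NAME"), ["%%PARM1"]))

for row in rows:
    print(row)
```

Every row then has the same number of columns, whether or not the job has ON, DOMAIL, or VARIABLE children.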

Increasing Version in xml file using python script

/* Python Script */
import xml.etree.ElementTree as ET
tree = ET.parse('config.xml')
root = tree.getroot()
updateData = open('config.xml','w+')
print('Root Data is ',root.tag)
print('Root Attribute ',root.attrib)
old_version = root.attrib.values()[0]
print('Old_Version is ',old_version)
def increment_ver(old_version):
    old_version = old_version.split('.')
    old_version[2] = str(int(old_version[2]) + 1)
    print('Old_Version 2 ',old_version[2])
    return '.'.join(old_version)
new_Version = increment_ver(old_version)
print('New_version :',new_Version,root.attrib['version'])
root.attrib['version'] = new_Version
print(root.attrib)
tree.write(updateData)
updateData.close()
/* Original Config xml file */
<?xml version='1.0' encoding='utf-8'?>
<widget id="io.ionic.starter" version="0.0.1" xmlns="http://www.w3.org/ns/widgets" xmlns:cdv="http://cordova.apache.org/ns/1.0">
<name>aman</name>
<description>An awesome Ionic/Cordova app.</description>
<author email="hi@ionicframework.com" href="http://ionicframework.com/">Ionic Framework Team</author>
<content src="index.html" />
<access origin="*" />
<allow-intent href="http://*/*" />
<allow-intent href="https://*/*" />
<allow-intent href="tel:*" />
<allow-intent href="sms:*" />
<allow-intent href="mailto:*" />
<allow-intent href="geo:*" />
<preference name="ScrollEnabled" value="false" />
/* New Config.xml file */
<ns0:widget xmlns:ns0="http://www.w3.org/ns/widgets" xmlns:ns1="http://schemas.android.com/apk/res/android" id="io.ionic.starter" version="0.0.2">
<ns0:name>aman</ns0:name>
<ns0:description>An awesome Ionic/Cordova app.</ns0:description>
<ns0:author email="hi@ionicframework.com" href="http://ionicframework.com/">Ionic Framework Team</ns0:author>
<ns0:content src="index.html" />
<ns0:access origin="*" />
<ns0:allow-intent href="http://*/*" />
<ns0:allow-intent href="https://*/*" />
<ns0:allow-intent href="tel:*" />
<ns0:allow-intent href="sms:*" />
<ns0:allow-intent href="mailto:*" />
Once the script runs, the version number is increased by 1, which is what I was trying to achieve. However, an ns0 prefix is added throughout the file and the XML declaration header is removed.
Please let me know what I have done wrong.

Your script, slightly modified:
import xml.etree.ElementTree as ET
ET.register_namespace('', 'http://www.w3.org/ns/widgets')
tree = ET.parse('config.xml')
# (...) no changes in this part of code.
# In Python 3, open updateData in binary mode ('wb'): with an explicit encoding, write() emits bytes.
tree.write(updateData, xml_declaration=True, encoding="utf-8")
updateData.close()
The result:
<?xml version='1.0' encoding='utf-8'?>
<widget xmlns="http://www.w3.org/ns/widgets" id="io.ionic.starter" version="0.0.2">
<name>aman</name>
<description>An awesome Ionic/Cordova app.</description>
<author email="hi@ionicframework.com" href="http://ionicframework.com/">Ionic Framework Team</author>
<content src="index.html" />
<access origin="*" />
<allow-intent href="http://*/*" />
<allow-intent href="https://*/*" />
<allow-intent href="tel:*" />
<allow-intent href="sms:*" />
<allow-intent href="mailto:*" />
<allow-intent href="geo:*" />
<preference name="ScrollEnabled" value="false" />
</widget>
One of the namespace declarations has been dropped because it was not used in the XML body.
If you want to preserve the namespace declarations, use the lxml library. In that case your code would look like this (note that ET.register_namespace is not needed):
import lxml.etree as ET
tree = ET.parse('config.xml')
root = tree.getroot()
updateData = open('config.xml','w+')
# (...) no changes in this part of code.
# In Python 3, open updateData in binary mode ('wb'): with an explicit encoding, write() emits bytes.
tree.write(updateData, xml_declaration=True, encoding="utf-8")
updateData.close()
In this case the output:
<?xml version='1.0' encoding='UTF-8'?>
<widget xmlns="http://www.w3.org/ns/widgets" xmlns:cdv="http://cordova.apache.org/ns/1.0" id="io.ionic.starter" version="0.0.2">
<name>aman</name>
<description>An awesome Ionic/Cordova app.</description>
<author email="hi@ionicframework.com" href="http://ionicframework.com/">Ionic Framework Team</author>
<content src="index.html"/>
<access origin="*"/>
<allow-intent href="http://*/*"/>
<allow-intent href="https://*/*"/>
<allow-intent href="tel:*"/>
<allow-intent href="sms:*"/>
<allow-intent href="mailto:*"/>
<allow-intent href="geo:*"/>
<preference name="ScrollEnabled" value="false"/>
</widget>
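Putting the pieces together, here is a minimal Python 3 sketch of the whole version bump with the standard library. It passes the file path to tree.write() instead of an open file object, which sidesteps the text-versus-binary mode question. The function name bump_version and the throwaway demo file are assumptions for illustration, not part of the original script.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

WIDGET_NS = "http://www.w3.org/ns/widgets"


def increment_ver(version):
    """Increase the third dot-separated component by one."""
    parts = version.split(".")
    parts[2] = str(int(parts[2]) + 1)
    return ".".join(parts)


def bump_version(path):
    # Serialize the widget namespace as the default one (no ns0 prefix).
    ET.register_namespace("", WIDGET_NS)
    tree = ET.parse(path)
    root = tree.getroot()
    root.set("version", increment_ver(root.get("version")))
    # Writing by path lets ElementTree open the file in the right mode itself.
    tree.write(path, xml_declaration=True, encoding="utf-8")
    return root.get("version")


# Demo on a throwaway file rather than a real config.xml.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "config.xml")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write('<widget id="io.ionic.starter" version="0.0.1" '
                'xmlns="http://www.w3.org/ns/widgets"><name>aman</name></widget>')
    new_version = bump_version(demo_path)
    with open(demo_path, encoding="utf-8") as f:
        written = f.read()

print(new_version)
```

Reading the version with root.get("version") also avoids relying on the order of root.attrib.values(), which is not subscriptable in Python 3.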

using python to convert XML to Json and trigger a post API request

I'm looking for a solution to convert XML to JSON and use the JSON as the payload for a POST request.
I'm aiming for the following logic:
1. Search for all root.listings.schedules.s elements and parse @s, @d, @p and @c.
2. In root.listings.programs, parse @t [p.id = @p (from schedules)] -> "Prime Discussion"
3. In root.channels, parse @c [c.id = @c (from schedules)] -> "mychannel"
4. Once I have all the info parsed, I want to build a JSON object containing all the params and send it using a POST request.
I'm also looking for a solution that triggers one POST request per root.listings.schedules.s element.
{
"time":"{@s}",
"duration":"{@d}",
"programID":"{@p}",
"title":"{@t}",
"channelName":"{@c}"
}
<?xml version="1.0" encoding="UTF-8"?>
<root>
<listings>
<schedules>
<s s="2019-09-26T00:00:00" d="1800" p="1569735" c="100007">
<f id="3" />
</s>
</schedules>
<programs>
<p id="1569735" t="Prime Discussion" d="Discussion on Current Affairs." rd="Discussion on Current Affairs." l="en">
<f id="2" />
<f id="21" />
<k id="6" v="20160614" />
<k id="1" v="2450548" />
<k id="18" v="12983658" />
<k id="21" v="12983658" />
<k id="10" v="Program" />
<k id="19" v="SH024505480000" />
<k id="20" v="http://tmsimg.com/assets/p12983658_b_h5_aa.jpg" />
<c id="607" />
<r o="1" r="1" n="100" />
<r o="2" r="1" n="1000" />
<r o="3" r="1" n="10000" />
</p>
</programs>
</listings>
<channels>
<c id="100007" c="mychannel" l="Prime Asia TV SD" d="Prime Asia TV SD" t="Digital" iso639="hi" />
<c id="10035" c="AETV" l="A&amp;E Canada" d="A&amp;E Canada" t="Digital" u="WWW.AETV.COM" iso639="en" />
</channels>
</root>
Currently, I use this code to parse the schedules.s elements (part 1), and I need some help with parts 2, 3 and 4:
import xml.etree.ElementTree as ET
tree = ET.parse('ChannelsProgramsTest.xml')
root = tree.getroot()
for sched in root[0][0].findall('s'):
    new = sched.get('s'),sched.get('p'),sched.get('d'),sched.get('c')
    print(new)

See below; I think this is the core of the solution you were looking for.
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<root>
<listings>
<schedules>
<s s="2019-09-26T00:00:00" d="1800" p="1569735" c="100007">
<f id="3" />
</s>
</schedules>
<programs>
<p id="1569735" t="Prime Discussion" d="Discussion on Current Affairs." rd="Discussion on Current Affairs." l="en">
<f id="2" />
<f id="21" />
<k id="6" v="20160614" />
<k id="1" v="2450548" />
<k id="18" v="12983658" />
<k id="21" v="12983658" />
<k id="10" v="Program" />
<k id="19" v="SH024505480000" />
<k id="20" v="http://tmsimg.com/assets/p12983658_b_h5_aa.jpg" />
<c id="607" />
<r o="1" r="1" n="100" />
<r o="2" r="1" n="1000" />
<r o="3" r="1" n="10000" />
</p>
</programs>
</listings>
<channels>
<c id="100007" c="mychannel" l="Prime Asia TV SD" d="Prime Asia TV SD" t="Digital" iso639="hi" />
<c id="10035" c="AETV" l="A&amp;E Canada" d="A&amp;E Canada" t="Digital" u="WWW.AETV.COM" iso639="en" />
</channels>
</root>'''
tree = ET.fromstring(xml)
listings = tree.findall('.//listings')
for entry in listings:
    # This is the first requirement: find s,d,p,c under 's' element
    s = entry.find('./schedules/s')
    print(s.attrib)
    # now that we have s,d,p,c we can move on and look for the program with a specific id
    program = entry.find("./programs/p[@id='{}']".format(s.attrib['p']))
    print(program.attrib['t'])
    # find the channel
    channel = tree.find(".//channels/c[@id='{}']".format(s.attrib['c']))
    print(channel.attrib['c'])
Output:
{'s': '2019-09-26T00:00:00', 'd': '1800', 'p': '1569735', 'c': '100007'}
Prime Discussion
mychannel
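The remaining steps, building one JSON payload per s element and posting it, could be sketched as follows. This is only an illustration under assumptions: the helper names build_payloads and post_json are made up, the target URL is a placeholder supplied by the caller, and the standard library's urllib is used so the sketch has no third-party dependency. The duration key is spelled conventionally here.

```python
import json
import urllib.request
import xml.etree.ElementTree as ET


def build_payloads(root):
    """Return one dict per <s> element, joined with its program and channel."""
    payloads = []
    for s in root.iterfind("./listings/schedules/s"):
        program = root.find("./listings/programs/p[@id='{}']".format(s.get("p")))
        channel = root.find("./channels/c[@id='{}']".format(s.get("c")))
        payloads.append({
            "time": s.get("s"),
            "duration": s.get("d"),
            "programID": s.get("p"),
            "title": program.get("t") if program is not None else None,
            "channelName": channel.get("c") if channel is not None else None,
        })
    return payloads


def post_json(url, payload):
    """Send one payload as a JSON POST request; url is supplied by the caller."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Shortened sample based on the XML in the question.
doc = ET.fromstring("""<root>
<listings>
<schedules><s s="2019-09-26T00:00:00" d="1800" p="1569735" c="100007" /></schedules>
<programs><p id="1569735" t="Prime Discussion" /></programs>
</listings>
<channels><c id="100007" c="mychannel" /></channels>
</root>""")

payloads = build_payloads(doc)
print(json.dumps(payloads, indent=2))
# To send them, call post_json(your_url, payload) once per payload.
```

Calling post_json in a loop over payloads gives one request per schedule entry, as asked in the question.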

I'm still fairly new to actively using Stack Overflow, though I've been around for years. I think this is something of a duplicate question, but I do not yet know how to flag it as one.
A very good explanation of converting XML to JSON in Python is given in the following post by the author of the suggested library:
Converting XML to JSON using Python?
If you don't use a library, you will need to handle unexpected characters in the data source yourself,
e.g. newlines, Unicode characters and other stray characters. Libraries have often done this for you already, so you don't have to reinvent the wheel.

Parsing XML with Python - Accessing Values

I have recently got a RaspberryPi and have started to learn Python. To begin with I want to parse an XML file and I am doing this via the untangle library.
My XML looks like:
<?xml version="1.0" encoding="utf-8"?>
<weatherdata>
<location>
<name>Katherine</name>
<type>Administrative division</type>
<country>Australia</country>
<timezone id="Australia/Darwin" utcoffsetMinutes="570" />
<location altitude="176" latitude="-14.65012" longitude="132.17414" geobase="geonames" geobaseid="7839404" />
</location>
<sun rise="2019-02-04T06:33:52" set="2019-02-04T19:16:15" />
<forecast>
<tabular>
<time from="2019-02-04T06:30:00" to="2019-02-04T12:30:00" period="1">
<!-- Valid from 2019-02-04T06:30:00 to 2019-02-04T12:30:00 -->
<symbol number="9" numberEx="9" name="Rain" var="09" />
<precipitation value="1.8" />
<!-- Valid at 2019-02-04T06:30:00 -->
<windDirection deg="314.8" code="NW" name="Northwest" />
<windSpeed mps="3.3" name="Light breeze" />
<temperature unit="celsius" value="26" />
<pressure unit="hPa" value="1005.0" />
</time>
<time from="2019-02-04T12:30:00" to="2019-02-04T18:30:00" period="2">
<!-- Valid from 2019-02-04T12:30:00 to 2019-02-04T18:30:00 -->
<symbol number="9" numberEx="9" name="Rain" var="09" />
<precipitation value="2.3" />
<!-- Valid at 2019-02-04T12:30:00 -->
<windDirection deg="253.3" code="WSW" name="West-southwest" />
<windSpeed mps="3.0" name="Light breeze" />
<temperature unit="celsius" value="29" />
<pressure unit="hPa" value="1005.0" />
</time>
</tabular>
</forecast>
</weatherdata>
From this I would like to be able to print out the from and to attributes of the <time> element as well as the value attribute in its child node <temperature>
I can correctly print out the temperature values if I run the Python script below:
for forecast in data.weatherdata.forecast.tabular.time:
    print (forecast.temperature['value'])
but if I run
for forecast in data.weatherdata.forecast.tabular:
    print ("time is " + forecast.time['from'] + "and temperature is " + forecast.time.temperature['value'])
I get an error:
print (forecast.time['from'] + forecast.time.temperature['value'])
TypeError: list indices must be integers, not str
Can anyone advise how I can correctly access these values?

forecast.time should be a list, as it does have multiple values, one for each <time> node.
Did you expect forecast.time['from'] to automatically aggregate that data?
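In other words, loop over the individual <time> elements and read both the attributes and the child element from each one. With untangle that means iterating over tabular.time, as in the working loop from the question. The sketch below shows the same pattern with the standard library's ElementTree, swapped in here only so that it runs without third-party packages; the XML is a shortened version of the question's sample.

```python
import xml.etree.ElementTree as ET

xml_text = """<weatherdata><forecast><tabular>
<time from="2019-02-04T06:30:00" to="2019-02-04T12:30:00" period="1">
<temperature unit="celsius" value="26" />
</time>
<time from="2019-02-04T12:30:00" to="2019-02-04T18:30:00" period="2">
<temperature unit="celsius" value="29" />
</time>
</tabular></forecast></weatherdata>"""

root = ET.fromstring(xml_text)
readings = []
# Each iteration handles exactly one <time> element, never the whole list.
for time_el in root.iterfind("./forecast/tabular/time"):
    temperature = time_el.find("temperature")
    readings.append((time_el.get("from"), time_el.get("to"), temperature.get("value")))

for start, end, value in readings:
    print("from", start, "to", end, "temperature is", value)
```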

How to put the end event check in parsing a 1 GB XML file using lxml iterparse

I'm trying to parse a very large XML file of about 1 GB, its format being:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE candidates SYSTEM "dtd/mwetoolkit-candidates.dtd">
<!-- MWETOOLKIT: filetype="XML" -->
<candidates>
<meta>
<corpussize name="ukwac-01" value="38224449" />
<corpussize name="sum" value="38224449" />
</meta>
<cand candid="2">
<ngram><w lemma="executive" pos="JJ" ><freq name="ukwac-01" value="600" /><freq name="sum" value="600" /></w> <w lemma="box" pos="NNS" ><freq name="ukwac-01" value="1006" /><freq name="sum" value="1006" /></w> <freq name="ukwac-01" value="9" /><freq name="sum" value="9" /></ngram>
<occurs>
<ngram><w surface="Executive" lemma="executive" pos="JJ" /> <w surface="boxes" lemma="box" pos="NNS" /> <freq name="ukwac-01" value="1" /></ngram>
<ngram><w surface="executive" lemma="executive" pos="JJ" /> <w surface="boxes" lemma="box" pos="NNS" /> <freq name="ukwac-01" value="8" /></ngram>
</occurs>
</cand>
<cand candid="5">
<ngram><w lemma="bad" pos="JJ" ><freq name="ukwac-01" value="4094" /><freq name="sum" value="4094" /></w> <w lemma="thing" pos="NN" ><freq name="ukwac-01" value="6609" /><freq name="sum" value="6609" /></w> <freq name="ukwac-01" value="119" /><freq name="sum" value="119" /></ngram>
<occurs>
<ngram><w surface="bad" lemma="bad" pos="JJ" /> <w surface="thing" lemma="thing" pos="NN" /> <freq name="ukwac-01" value="115" /></ngram>
<ngram><w surface="Bad" lemma="bad" pos="JJ" /> <w surface="thing" lemma="thing" pos="NN" /> <freq name="ukwac-01" value="4" /></ngram>
</occurs>
</cand>
</candidates>
So far, I have this code:
from lxml import etree
import sys
def fast_iter(context, func):
    #http://www.ibm.com.br/developerworks/xml/library/x-hiperfparse/
    #Author = Liza Daly
    for event, elem in context:
        func(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context
def print_csv(element):
    if element.tag == 'cand':
        lemmas = []
        compound_freqs = []
        mweval = 0
        for f in c.xpath('ngram/freq'):
            if f.attrib['name'] == 'ukwac':
                mweval = int(f.attrib['value'])
        for w in element.xpath('ngram/w'):
            lemmas.append(w.attrib['lemma'])
        for freq in element.xpath('ngram/w/freq'):
            if freq.attrib['name'] == 'ukwac':
                compound_freqs.append(int(freq.attrib['value']))
        print(' '.join(lemmas),mweval,sep='\t',end='\t')
        [print(l,f,sep=":",end='') for l,f in zip(lemmas,compound_freqs)]
        print()
if __name__ == '__main__':
    args = sys.argv
    context = etree.iterparse(args[1], events=("start", "end"))
    print("mwe","mwe_freq","compounds",sep='\t')
    for event, element in context:
        if element.tag == "candidates":
            fast_iter(context, print_csv)
The desired output is a CSV file in the format:
mwe mwe_freq compounds
executive box 9 executive:600,box:1006
The exact print format may (and will) change, but for some reason, once I get to the print function and past the element.tag check, the freq elements are empty and all I get printed are their addresses. I know I'm supposed to put an end event check somewhere, as per iterparse's documentation, but I tried putting one inside fast_iter and that surely didn't work.
My current output:
mwe mwe_freq compounds
<Element freq at 0x7f8735342c48>
<Element freq at 0x7f8735342c88>
executive box 0
0
<Element freq at 0x7f8735346708>
<Element freq at 0x7f87353467c8>
bad thing 0
0
Any help is very much appreciated.
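One way to place the end event check is to skip every event that is not the end of a <cand> element, because only at an end event are the element's children guaranteed to be parsed. The sketch below uses the standard library's xml.etree.ElementTree.iterparse so that it runs without lxml; lxml's iterparse accepts the same events argument. The function name iter_candidates is made up, and the corpus name "ukwac-01" is taken from the sample XML (the code in the question compares against 'ukwac', which never matches).

```python
import io
import xml.etree.ElementTree as ET


def iter_candidates(source, corpus="ukwac-01"):
    """Yield (lemmas, mwe_freq, word_freqs) for each <cand>, read incrementally."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first "start" event delivers the root element
    for event, elem in context:
        # Only an "end" event guarantees that the element's children are parsed.
        if event != "end" or elem.tag != "cand":
            continue
        ngram = elem.find("ngram")  # direct child only, not the ones under <occurs>
        lemmas = [w.get("lemma") for w in ngram.findall("w")]
        word_freqs = [int(f.get("value")) for f in ngram.findall("w/freq")
                      if f.get("name") == corpus]
        mwe_freq = next((int(f.get("value")) for f in ngram.findall("freq")
                         if f.get("name") == corpus), 0)
        yield lemmas, mwe_freq, word_freqs
        root.clear()  # drop processed children so memory stays bounded


# Shortened sample based on the XML in the question.
sample = b"""<candidates>
<meta><corpussize name="ukwac-01" value="38224449" /></meta>
<cand candid="2">
<ngram><w lemma="executive" pos="JJ"><freq name="ukwac-01" value="600" /><freq name="sum" value="600" /></w> <w lemma="box" pos="NNS"><freq name="ukwac-01" value="1006" /><freq name="sum" value="1006" /></w> <freq name="ukwac-01" value="9" /><freq name="sum" value="9" /></ngram>
<occurs>
<ngram><w surface="Executive" lemma="executive" pos="JJ" /> <w surface="boxes" lemma="box" pos="NNS" /> <freq name="ukwac-01" value="1" /></ngram>
</occurs>
</cand>
</candidates>"""

lines = []
for lemmas, mwe_freq, word_freqs in iter_candidates(io.BytesIO(sample)):
    compounds = ",".join("{}:{}".format(l, f) for l, f in zip(lemmas, word_freqs))
    lines.append("{}\t{}\t{}".format(" ".join(lemmas), mwe_freq, compounds))

print("mwe", "mwe_freq", "compounds", sep="\t")
print("\n".join(lines))
```

For the real file, pass the file path instead of the BytesIO object; iterparse accepts either.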
