Opening file by file in a folder - python

I'm new to programming with Python, but I've been given the task of writing a script that lists all IDs where type=0 or type=1 occurs. The input is an XML file that looks like this example:
<root>
<bla1 type="0" id = "1001" pvalue:="djdjd"/>
<bla2 type="0" id = "1002" pvalue:="djdjd" />
<bla3 type="0" id = "1003" pvalue:="djdjd"/>
<bla4 type="0" id = "1004" pvalue:="djdjd"/>
<bla5 type="0" id = "1005" pvalue:="djdjd"/>
<bla6 type="1" id = "1006" pvalue:="djdjd"/>
<bla7 type="0" id = "1007" pvalue:="djdjd"/>
<bla8 type="0" id = "1008" pvalue:="djdjd"/>
<bla9 type="1" id = "1009" pvalue:="djdjd"/>
<bla10 type="0" id = "1010" pvalue:="djdjd"/>
<bla11 type="0" id = "1011" pvalue:="djdjd"/>
<bla12 type="0" id = "1009" pvalue:="djdjd"/>
</root>
The first thing the code does is replace ':=' with '=', because ':=' makes the XML fail to parse. Then it writes down the IDs where the type is 0 and the IDs where the type is 1. This works perfectly for one XML file. Unfortunately I have more than one file, so I need something like a loop that opens each XML file in the folder in turn (they have different names) and adds the newly found IDs to the IDs collected from the previous files.
import xml.etree.ElementTree as ET  # required import (cElementTree was removed in Python 3.9)
XmlFile = 'ID3.xml'  # insert here the name of the XML-file, which needs to be inside the same folder as the .py file
my_file = open('%s' % XmlFile, "r+")  # open the XML-file
Xml2String = my_file.readlines()  # convert the file into a list of strings
XmlFile_new = []  # new list, which is filled with the modified strings
L = len(Xml2String)  # length of the string-list
for i in range(1, L):  # the index starts at 1, therefore the first line is ignored
    if ':=' in Xml2String[i]:
        XmlFile_new.append(Xml2String[i].replace(':=', '='))  # get rid of colon
    else:
        XmlFile_new.append(Xml2String[i])
tree = ET.ElementTree(XmlFile_new)
root = tree.getroot()
id_0 = []  # list for type="0"
id_1 = []  # list for type="1"
id_one2zero = []  # list for ids, that occur twice
for i in range(len(root)):
    if 'type="0"' in root[i]:  # check for type
        a = root[i].index("id") + 5  # search index of id
        b = a+6
        id_0.append((root[i][a:b]))  # the id is set via index slicing
    elif 'type="1"' in root[i]:  # check for type
        a = root[i].index("id") + 5
        b = a+6
        id_1.append((root[i][a:b]))
    else:
        print("Unknown type occurred")  # If there's a line without type="0" or type="1", this message gets printed
        # (Remember: first line of the xml-file is ignored)
for i in range(len(id_0)):  # check for ids, that occur twice
    for j in range(len(id_1)):
        if id_0[i] == id_1[j]:
            id_one2zero.append(id_0[i])
print(id_0)
print(id_1)
f = open('write.xml','w')
f.write('whatever\n')
print('<end>')

An easy way to solve this is the os.walk() function. It walks a directory tree, so you can process every file in a directory and in its subdirectories.
Here is an example of how to use it:
import os
for root, dirs, files in os.walk("your/path"):
    for file in files:
        # process your file
        pass
If the directory also contains files other than XML files, you can make sure you only process XML files by checking file.endswith(".xml").
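To connect this back to the question, here is a minimal sketch of collecting the IDs from every XML file in a folder. It is only one possible approach: it assumes each element sits on its own line as in the question's sample, and it reads the attributes with a regular expression instead of an XML parser because the ':=' makes the files invalid XML. The function name collect_ids and the demo file names are made up for illustration.

```python
import os
import re
import tempfile

# Matches e.g. type="0" id = "1001"; tolerant of spaces around '='.
LINE_RE = re.compile(r'type\s*=\s*"(?P<type>[01])"\s+id\s*=\s*"(?P<id>[^"]+)"')

def collect_ids(folder):
    """Return {'0': [...], '1': [...]} with the IDs from every .xml file under folder."""
    ids = {"0": [], "1": []}
    for root, dirs, files in os.walk(folder):
        for name in sorted(files):
            if not name.endswith(".xml"):
                continue  # skip anything that is not an XML file
            with open(os.path.join(root, name)) as handle:
                for line in handle:
                    match = LINE_RE.search(line)
                    if match:
                        ids[match.group("type")].append(match.group("id"))
    return ids

# Demo on a throwaway folder with two small files (hypothetical names and content).
with tempfile.TemporaryDirectory() as demo_dir:
    samples = {
        "a.xml": '<root>\n<bla1 type="0" id = "1001" pvalue:="x"/>\n'
                 '<bla2 type="1" id = "1002" pvalue:="x"/>\n</root>\n',
        "b.xml": '<root>\n<bla3 type="0" id = "1003" pvalue:="x"/>\n</root>\n',
    }
    for name, text in samples.items():
        with open(os.path.join(demo_dir, name), "w") as handle:
            handle.write(text)
    found = collect_ids(demo_dir)

print(found["0"])
print(found["1"])
```

Because the lists live outside the file loop, IDs from each new file are simply appended to those already found.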

Related

How to get the content of specific grandchild from xml file through python

Hi I am very new to python programming. I have an xml file of structure:
<?xml version="1.0" encoding="UTF-8"?>
<LidcReadMessage xsi:schemaLocation="http://www.nih.gov http://troll.rad.med.umich.edu/lidc/LidcReadMessage.xsd"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.nih.gov" uid="1.3.6.1.4.1.14519.5.2.1.6279.6001.1307390687803.0">
<ResponseHeader>
<Version>1.8.1</Version>
<MessageId>-421198203</MessageId>
<DateRequest>2007-11-01</DateRequest>
<TimeRequest>12:30:44</TimeRequest>
<RequestingSite>removed</RequestingSite>
<ServicingSite>removed</ServicingSite>
<TaskDescription>Second unblinded read</TaskDescription>
<CtImageFile>removed</CtImageFile>
<SeriesInstanceUid>1.3.6.1.4.1.14519.5.2.1.6279.6001.179049373636438705059720603192</SeriesInstanceUid>
<DateService>2008-08-18</DateService>
<TimeService>02:05:51</TimeService>
<ResponseDescription>1 - Reading complete</ResponseDescription>
<StudyInstanceUID>1.3.6.1.4.1.14519.5.2.1.6279.6001.298806137288633453246975630178</StudyInstanceUID>
</ResponseHeader>
<readingSession>
<annotationVersion>3.12</annotationVersion>
<servicingRadiologistID>540461523</servicingRadiologistID>
<unblindedReadNodule>
<noduleID>Nodule 001</noduleID>
<characteristics>
<subtlety>5</subtlety>
<internalStructure>1</internalStructure>
<calcification>6</calcification>
<sphericity>3</sphericity>
<margin>3</margin>
<lobulation>3</lobulation>
<spiculation>4</spiculation>
<texture>5</texture>
<malignancy>5</malignancy>
</characteristics>
<roi>
<imageZposition>-125.000000 </imageZposition>
<imageSOP_UID>1.3.6.1.4.1.14519.5.2.1.6279.6001.110383487652933113465768208719</imageSOP_UID>
......
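There are four readingSession elements, each of which contains multiple unblindedReadNodule elements. Each of those contains roi elements with an imageSOP_UID. I need to extract the information in imageSOP_UID from all of them.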
Right now I am doing this:
import xml.etree.ElementTree as ET
tree = ET.parse('069.xml')
root = tree.getroot()
#lst = []
for readingsession in root.iter('readingSession'):
    for roi in readingsession.findall('roi'):
        id = roi.findtext('imageSOP_UID')
        print(id)
but the only output is:
Process finished with exit code 0.
Can anyone help?
The real problem was the namespace. I had tried both with and without it, but neither worked with the code above. The following does:
import xml.etree.ElementTree as ET
import pydicom
ds = pydicom.dcmread("000071.dcm")
uid = ds.SOPInstanceUID
tree = ET.parse("069.xml")
root = tree.getroot()
for child in root:
    print(child.tag)
    if child.tag == '{http://www.nih.gov}readingSession':
        read = child.find('{http://www.nih.gov}unblindedReadNodule')
        if read is not None:
            nodule_id = read.find('{http://www.nih.gov}noduleID').text
            xml_uid = read.find('{http://www.nih.gov}roi').find('{http://www.nih.gov}imageSOP_UID').text
            if xml_uid == uid:
                print(xml_uid, "=", uid)
                roi = read.find('{http://www.nih.gov}roi')
                print(roi)
This works fine for getting a UID from a DICOM image in the LIDC/IDRI dataset and then finding the same UID in the XML file to extract its region of interest.
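As a side note, ElementTree also accepts a prefix-to-URI mapping, which avoids repeating the namespace URI in every tag. The sketch below assumes a trimmed, made-up sample in the same http://www.nih.gov namespace; the prefix "nih" and the UID values are arbitrary choices for illustration, not part of the real file.

```python
import xml.etree.ElementTree as ET

# Trimmed, made-up sample in the same default namespace as the LIDC file.
SAMPLE = """<LidcReadMessage xmlns="http://www.nih.gov">
  <readingSession>
    <unblindedReadNodule>
      <noduleID>Nodule 001</noduleID>
      <roi><imageSOP_UID>1.2.3</imageSOP_UID></roi>
      <roi><imageSOP_UID>1.2.4</imageSOP_UID></roi>
    </unblindedReadNodule>
  </readingSession>
</LidcReadMessage>"""

NS = {"nih": "http://www.nih.gov"}  # the prefix "nih" is an arbitrary local alias

root = ET.fromstring(SAMPLE)
# ".//nih:roi" finds every roi at any depth, so all nodules and sessions are covered.
uids = [
    roi.findtext("nih:imageSOP_UID", namespaces=NS)
    for roi in root.iterfind(".//nih:roi", NS)
]
print(uids)
```

Unlike child.find(...), which returns only the first match, iterating over all roi elements picks up every imageSOP_UID.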

Using a .py script that cleans then splits a large MODS XML record to do the same for a Dublin Core XML record and I'm getting no output

I took an OpenRefine template for translating a csv to a giant MODS XML record, then a .py script for cleaning it and turning it into several smaller xml files, named using one of the tags. It works perfectly. However, when I tried altering it to fit my needs for Dublin Core xml records... not so much.
I've got an OpenRefine template that gives me this from my csv:
<collection xmlns:xsi="http:www.w3.org/2001/XMLSchema-instance">
<record xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dctype="http://purl.org/dc/dcmitype/" xmlns:xsi="http://www.w3.org/2001XMLSchema-instance" xsi:schemaLocation="http://purl.org/dc/elements/1.1/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcterms.xsd http://purl.org/dc/dcmitype/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcmitype.xsd http://dublincore.org/schemas/xmls/qdc/2008/02/11/qualifieddc.xsd">
<dc:title>[Mary Adams at the organ]</dc:title>
<dc:creator>MacAfee, Don</dc:creator>
<dc:date>4/14/1964</dc:date>
<dc:subject>organs</dc:subject><dc:subject>musical instruments</dc:subject><dc:subject>musicians</dc:subject><dc:subject>Adams, Mary</dc:subject>
<dc:description>Music instructor Mary C. Adams playing the organ.</dc:description>
<dc:format>1 print : b&w ; 6.5 x 6.5 in.</dc:format>
<dcterms:spatial>Alexandria, Virginia</dcterms:spatial>
<dc:type>Photograph</dc:type>
<dc:format>Image</dc:format>
<dc:identifier>MS332-01-01-001</dc:identifier>
<dc:rights>Copyright has not been assigned to the Alexandria Library. All requests for permission to publish or quote from manuscripts must be submitted in writing to the Alexandria Library. Permission for publication is given on behalf of the Alexandria Library as the owner of the physical items and is not intended to include or imply permission of the copyright holder, which must also be obtained by the researcher.</dc:rights>
</record>
<record xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dctype="http://purl.org/dc/dcmitype/" xmlns:xsi="http://www.w3.org/2001XMLSchema-instance" xsi:schemaLocation="http://purl.org/dc/elements/1.1/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcterms.xsd http://purl.org/dc/dcmitype/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcmitype.xsd http://dublincore.org/schemas/xmls/qdc/2008/02/11/qualifieddc.xsd">
<dc:title>[Portrait of Dr. Robert Adeson]</dc:title>
<dc:date>1980</dc:date>
<dc:subject>physicians</dc:subject><dc:subject>doctors</dc:subject><dc:subject>Adeson, Robert, M.D.</dc:subject>
<dc:description>Dr. Robert L. Adeson, Alexandria Hospital.</dc:description>
<dc:format>1 print : b&w ; 5 x 7 in.</dc:format>
<dcterms:spatial>Alexandria, Virginia</dcterms:spatial>
<dc:type>Photograph</dc:type>
<dc:format>Image</dc:format>
<dc:identifier>MS332-01-01-002</dc:identifier>
<dc:rights>Copyright has not been assigned to the Alexandria Library. All requests for permission to publish or quote from manuscripts must be submitted in writing to the Alexandria Library. Permission for publication is given on behalf of the Alexandria Library as the owner of the physical items and is not intended to include or imply permission of the copyright holder, which must also be obtained by the researcher.</dc:rights>
</record>
</collection>
I've got a Python program that cleans and separates a MODS record, which I've modified to look like this:
import os, lxml.etree as ET
output_path = 'C:\\Users\\Staff\\Desktop\\Metadata\\SplitXML\\'
# parse source.xml with lxml
tree = ET.parse('source.xml')
# start cleanup
# remove any element tails
for element in tree.iter():
    element.tail = None
    # remove any line breaks or tabs in element text
    if element.text:
        if '\n' in element.text:
            element.text = element.text.replace('\n', '')
        if '\t' in element.text:
            element.text = element.text.replace('\t', '')
# remove any remaining whitespace
parser = ET.XMLParser(remove_blank_text=True, remove_comments=True, recover=True)
treestring = ET.tostring(tree)
clean = ET.XML(treestring, parser)
# remove recursively empty nodes
def recursively_empty(e):
    if e.text:
        return False
    return all((recursively_empty(c) for c in e.iterchildren()))
context = ET.iterwalk(clean)
for action, elem in context:
    parent = elem.getparent()
    if recursively_empty(elem):
        parent.remove(elem)
# remove nodes with blank attribute
for element in clean.xpath(".//*[@*='']"):
    element.getparent().remove(element)
# remove nodes with attribute "null"
for element in clean.xpath(".//*[@*='null']"):
    element.getparent().remove(element)
# finished cleanup
# write out to intermediate file
with open('clean.xml', 'wb') as f:
    f.write(ET.tostring(clean))
print("XML is now clean")
# parse the clean xml
cleanxml = ET.iterparse('clean.xml', events=('end', ))
# find the <dc> nodes
for event, elem in cleanxml:
    if elem.tag == '{http://purl.org/dc/elements/1.1/}record':
        # name new files using the <dc:identifier> tag
        identifier = elem.find('{http://purl.org/dc/elements/1.1/}dc:identifier').text
        filename = format(identifier + "_DC.xml")
        # write out to new file
        with open(output_path+filename, 'wb') as f:
            f.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")
            f.write(ET.tostring(elem, pretty_print = True))
        print("Writing", filename)
# remove the intermediate file
os.remove('clean.xml')
print("All done!")
The cmd prints the "XML is now clean" and "All done!" statements, HOWEVER, there are no files in the SplitXML directory (or anywhere). My attempt at de-bugging was to comment out the os.remove('clean.xml') line so I could look at the cleaned xml. I've done this with the MODS .py script, and the xml file looks like what you'd expect. However, the clean.xml file on the DC one is clean, but just one long string of code, rather than using different lines and tabs, like this:
<collection xmlns:xsi="http:www.w3.org/2001/XMLSchema-instance"><record xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dctype="http://purl.org/dc/dcmitype/" xmlns:xsi="http://www.w3.org/2001XMLSchema-instance" xsi:schemaLocation="http://purl.org/dc/elements/1.1/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcterms.xsd http://purl.org/dc/dcmitype/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcmitype.xsd http://dublincore.org/schemas/xmls/qdc/2008/02/11/qualifieddc.xsd"><dc:title>[Mary Adams at the organ]</dc:title><dc:creator>MacAfee, Don</dc:creator><dc:date>4/14/1964</dc:date><dc:subject>organs</dc:subject><dc:subject>musical instruments</dc:subject><dc:subject>musicians</dc:subject><dc:subject>Adams, Mary</dc:subject><dc:description>Music instructor Mary C. Adams playing the organ.</dc:description><dc:format>1 print : b&w ; 6.5 x 6.5 in.</dc:format><dcterms:spatial>Alexandria, Virginia</dcterms:spatial><dc:type>Photograph</dc:type><dc:format>Image</dc:format><dc:identifier>MS332-01-01-001</dc:identifier><dc:rights>Copyright has not been assigned to the Alexandria Library. All requests for permission to publish or quote from manuscripts must be submitted in writing to the Alexandria Library. 
Permission for publication is given on behalf of the Alexandria Library as the owner of the physical items and is not intended to include or imply permission of the copyright holder, which must also be obtained by the researcher.</dc:rights></record><record xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dctype="http://purl.org/dc/dcmitype/" xmlns:xsi="http://www.w3.org/2001XMLSchema-instance" xsi:schemaLocation="http://purl.org/dc/elements/1.1/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcterms.xsd http://purl.org/dc/dcmitype/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcmitype.xsd http://dublincore.org/schemas/xmls/qdc/2008/02/11/qualifieddc.xsd"><dc:title>[Portrait of Dr. Robert Adeson]</dc:title><dc:date>1980</dc:date><dc:subject>physicians</dc:subject><dc:subject>doctors</dc:subject><dc:subject>Adeson, Robert, M.D.</dc:subject><dc:description>Dr. Robert L. Adeson, Alexandria Hospital.</dc:description><dc:format>1 print : b&w ; 5 x 7 in.</dc:format><dcterms:spatial>Alexandria, Virginia</dcterms:spatial><dc:type>Photograph</dc:type><dc:format>Image</dc:format><dc:identifier>MS332-01-01-002</dc:identifier><dc:rights>Copyright has not been assigned to the Alexandria Library. All requests for permission to publish or quote from manuscripts must be submitted in writing to the Alexandria Library. Permission for publication is given on behalf of the Alexandria Library as the owner of the physical items and is not intended to include or imply permission of the copyright holder, which must also be obtained by the researcher.</dc:rights></record></collection>
If it helps, here's the original Python code for cleaning and splitting MODS. I got it from calhist on github.
# Split XML containing many <mods> elements into individual files
# Modified from script found here: http://stackoverflow.com/questions/36155049/splitting-xml-file-into-multiple-at-given-tags
# by Bill Levay for California Historical Society
import os, lxml.etree as ET
# uncomment below modules if doing MODS cleanup on existing Islandora objects
import codecs, json
output_path = 'C:\\Users\\Staff\\Desktop\\Metadata\\SplitXML\\'
# parse source.xml with lxml
tree = ET.parse('source.xml')
# start cleanup
# remove any element tails
for element in tree.iter():
    element.tail = None
    # remove any line breaks or tabs in element text
    if element.text:
        if '\n' in element.text:
            element.text = element.text.replace('\n', '')
        if '\t' in element.text:
            element.text = element.text.replace('\t', '')
# remove any remaining whitespace
parser = ET.XMLParser(remove_blank_text=True, remove_comments=True, recover=True)
treestring = ET.tostring(tree)
clean = ET.XML(treestring, parser)
# remove recursively empty nodes
# found here: https://stackoverflow.com/questions/12694091/python-lxml-how-to-remove-empty-repeated-tags
def recursively_empty(e):
    if e.text:
        return False
    return all((recursively_empty(c) for c in e.iterchildren()))
context = ET.iterwalk(clean)
for action, elem in context:
    parent = elem.getparent()
    if recursively_empty(elem):
        parent.remove(elem)
# remove nodes with blank attribute
# for element in clean.xpath(".//*[@*='']"):
#     element.getparent().remove(element)
# remove nodes with attribute "null"
for element in clean.xpath(".//*[@*='null']"):
    element.getparent().remove(element)
# finished cleanup
# write out to intermediate file
with open('clean.xml', 'wb') as f:
    f.write(ET.tostring(clean))
print("XML is now clean")
# parse the clean xml
cleanxml = ET.iterparse('clean.xml', events=('end', ))
###
# uncomment this section if doing MODS cleanup on existing Islandora objects
# getting islandora IDs for existing collections
###
# item_list = []
# json_path = 'C:\\mods\\data.json'
# with codecs.open(json_path, encoding='utf-8') as filename:
#     item_list = json.load(filename)
# filename.close
###
# find the <mods> nodes
for event, elem in cleanxml:
    if elem.tag == '{http://www.loc.gov/mods/v3}mods':
        # the filenames of the resulting xml files will be based on the <identifier> element
        # edit the specific element or attribute if necessary
        identifier = elem.find('{http://www.loc.gov/mods/v3}identifier[@type="local"]').text
        filename = format(identifier + "_MODS.xml")
        ###
        # uncomment this section if doing MODS cleanup on existing Islandora objects
        # look through the list of object metadata and get the islandora ID by matching the digital object ID
        ###
        # for item in item_list:
        #     local_ID = item["identifier-type:local"]
        #     islandora_ID = item["PID"]
        #     if identifier == local_ID:
        #         filename = format(islandora_ID + "_MODS.xml")
        ###
        # write out to new file
        with open(output_path+filename, 'wb') as f:
            f.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")
            f.write(ET.tostring(elem, pretty_print = True))
        print("Writing", filename)
# remove the intermediate file
os.remove('clean.xml')
print("All done!")
I found two namespace-related problems:
1. The record element is in no namespace. Therefore, you need to change
if elem.tag == '{http://purl.org/dc/elements/1.1/}record':
to
if elem.tag == 'record':
2. elem.find('{http://purl.org/dc/elements/1.1/}dc:identifier') is not correct. The dc: bit must be removed, giving elem.find('{http://purl.org/dc/elements/1.1/}identifier').
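Both points can be checked on a cut-down record. The sketch below uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra packages; the {namespace-URI}localname notation for tags is the same in both. The sample string is a shortened stand-in for the cleaned file, not the real data.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"

# Cut-down stand-in for the cleaned file in the question.
SAMPLE = (
    '<collection>'
    '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:title>[Mary Adams at the organ]</dc:title>'
    '<dc:identifier>MS332-01-01-001</dc:identifier>'
    '</record>'
    '</collection>'
)

root = ET.fromstring(SAMPLE)

# <record> only declares the dc prefix; it is not itself in that namespace,
# so its tag is the plain name.
record = root.find("record")
print(record.tag)

# <dc:identifier> is in the dc namespace: URI in braces, no "dc:" prefix.
identifier = record.find("{%s}identifier" % DC)
print(identifier.tag)

filename = identifier.text + "_DC.xml"
print(filename)
```

Printing elem.tag for every element in the loop is a quick way to see the exact names the parser reports when a comparison never matches.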

How convert xml to csv file using python (in row)?

I want to convert this XML document to CSV. I tried, but it does not work and I don't know what I'm doing wrong. I'm new to this.
Here is the XML that I want to convert:
<?xml version="1.0" encoding="UTF-8"?>
<Shot
Shotcode = "30AA"
ShotDate = "4/2/2000">
<Images>
<Image
ImageNumber="103"
RawFileName="18_Shot_30AA.jpg" />
<Image
ImageNumber="104"
RawFileName="17_Shot_30AA.jpg" />
<Image
ImageNumber="105"
RawFileName="14_Shot_30AA" />
</Images>
<Metrics>
<Metric
Name = "30AA"
TypeId = "163"
Value = "0" />
<Metric
Name = "Area"
TypeId = "10"
Value = "63" />
</Metrics>
</Shot>
I wrote the code below. It is a reduced example rather than the complete program, but it shows what I'm doing.
import xml.etree.ElementTree as ET
import csv
tree = ET.parse("30AA.xml")
root = tree.getroot()
30AA = open('30AA.csv', 'w+')
csvwriter = csv.writer(30AA)
head = []
count = 0 #loops
for member in root.findall('Shot'):
    Shot = []
    if count == 0:
        ShotCode = member.find('ShotCode').tag
        head.append(ShotCode)
        ShotDate = member.find('ShotDate').tag
        head.append(ShotDate)
        csvwriter.writerow(head)
        count = count + 1
    ShotCode = member.find('ShotCode').txt
    Shot.append(ShotCode)
    ShotDate = member.find('ShotDate').txt
    Shot.append(ShotDate)
30AA.close()
The result that I expect is:
Shotcode 30AA
ShotDate 4/2/2000
Imagen 103
Imagen 104
Imagen 105
Name TypeId Value
30AA 163 0
area 10 63
Okay, I think I see what's going wrong. It looks like a CSV problem, but the major problem is in how the XML is read.
The root of your XML is the Shot tag, so you can't use root.findall('Shot') to get the tags: root already is the Shot element, and it doesn't have any Shot elements inside it.
That's why you're not getting anything in your output.
Also, Shotcode and ShotDate are attributes, not child elements, and you read an attribute with .attrib['name_of_attribute']. So instead of member.find('ShotCode').tag you would use root.attrib['Shotcode'] (note the lowercase c in the file).
Note too that 30AA is not a valid Python variable name because it starts with a digit, hence _30AA below.
That changes the rest of the script quite a bit; you then need to do something like this:
import csv
import xml.etree.ElementTree as ET
tree = ET.parse("30AA.xml")
root = tree.getroot()
_30AA = open('30AA.csv', 'w+')
csvwriter = csv.writer(_30AA)
ShotCode = root.attrib['Shotcode']
csvwriter.writerow(['ShotCode', ShotCode])
ShotDate = root.attrib['ShotDate']
csvwriter.writerow(['ShotDate', ShotDate])
# member is going to be the <Images> and <Metrics>
for member in root:
    submembers = list(member)  # getchildren() was removed in Python 3.9
    # Write the names of the attributes as headings
    keys = list(submembers[0].attrib.keys())
    csvwriter.writerow(keys)
    for submember in submembers:
        row_data = [submember.attrib[k] for k in keys]
        csvwriter.writerow(row_data)
_30AA.close()
This will give you what you want.

element attributes missing when parsing XML with iterparse/lxml/python 2

Here's my use case:
I have a potentially large XML file, and I want to output the frequency of all the unique structural variations of a given element type. Element attributes should be included as part of the uniqueness test. The output should sort the variations by frequency.
Here's a trivial input example, with 4 entries for automobile:
<automobile>
<mileage>20192</mileage>
<year>2005</year>
<user_defined name="color">red</user_defined>
</automobile>
<automobile>
<mileage>1098</mileage>
<year>2018</year>
<user_defined name="color">blue</user_defined>
</automobile>
<automobile>
<mileage>17964</mileage>
<year>2012</year>
<user_defined name="title_status">salvage</user_defined>
</automobile>
<automobile>
<mileage>198026</mileage>
<year>1990</year>
</automobile>
The output I expect would look like this:
<automobile automobile_frequency="2">
<mileage />
<year />
<user_defined name="color" />
</automobile>
<automobile automobile_frequency="1">
<mileage />
<year />
<user_defined name="title_status" />
</automobile>
<automobile automobile_frequency="1">
<mileage />
<year />
</automobile>
I've implemented the code using iterparse, but when it's processing the elements, the attributes do not exist in the element. The code logic appears to be correct, but attributes simply don't exist; they are not written in the output, and they are not present for the uniqueness test. Per the above input example, this is what I get on output:
<root>
<automobile automobile_frequency="3">
<mileage/>
<year/>
<user_defined/>
</automobile>
<automobile automobile_frequency="1">
<mileage/>
<year/>
</automobile>
</root>
The usage is:
xplore.py input.xml node_to_explore
In the above example, I used:
xplore.py trivial.xml automobile
Here's the source:
from lxml import etree
import sys
import re
from datetime import datetime
# global node signature map
structure_map = {}
# global code frequency map
frequency_map = {}
# output tree
tmp_root = etree.Element("tmp_root")
def process_element(el):
    global target
    if el.tag != target:
        return
    # get the structure of the element
    structure = get_structure(el)
    global structure_map
    structure_key = etree.tostring(structure, pretty_print=True)
    if structure_key not in structure_map.keys():
        # add signature to structure map
        structure_map[structure_key] = structure
        # add node to output
        global tmp_root
        tmp_root.append(structure)
        # add signature to frequency map
        frequency_map[structure_key] = 1
    else:
        # increment frequency map
        frequency_map[structure_key] += 1
# returns a unique string representing the structure of the node
# including attributes
def get_structure(el):
    # create new element for the return value
    ret = etree.Element(el.tag)
    # get attributes
    attribute_keys = el.attrib.keys()
    for attribute_key in attribute_keys:
        ret.set(attribute_key, el.get(attribute_key))
    # check for children
    children = list(el)
    for child in children:
        ret.append(get_structure(child))
    return ret
if len(sys.argv) < 3:
    print "Must specify an XML file for processing, as well as an element type!"
    exit(0)
# Get XML file
xml = sys.argv[1]
# Get output file name
output_file = xml[0:xml.rindex(".")]+".txt"
# get target element type to evaluate
target = sys.argv[2]
# mark start
startTime = datetime.now()
# Parse XML
print '==========================='
print 'Parsing XML'
print '==========================='
context = etree.iterparse(xml, events=('end',))
for event, element in context:
    process_element(element)
    element.clear()
# create tree sorted by frequency
ranked = sorted(frequency_map.items(), key=lambda x: x[1], reverse=True)
root = etree.Element("root")
for item in ranked:
    structure = structure_map[item[0]]
    structure.set(target+"_frequency", str(item[1]))
    root.append(structure)
# pretty print root
out = open(output_file, 'w')
out.write(etree.tostring(root, pretty_print=True))
# output run time
time = datetime.now() - startTime
reg3 = re.compile("\\d+:\\d(\\d:\\d+\\.\\d{4})")
time = re.search(reg3, unicode(time))
time = "Runtime: %ss" % (time.group(1).encode("utf-8"))
print(time)
In the debugger, I can clearly see that the attributes are missing from elements in the calls to get_structure. Can anyone tell me why this is the case?
The data:
<root>
<automobile>
<mileage>20192</mileage>
<year>2005</year>
<user_defined name="color">red</user_defined>
</automobile>
<automobile>
<mileage>1098</mileage>
<year>2018</year>
<user_defined name="color">blue</user_defined>
</automobile>
<automobile>
<mileage>17964</mileage>
<year>2012</year>
<user_defined name="title_status">salvage</user_defined>
</automobile>
<automobile>
<mileage>198026</mileage>
<year>1990</year>
</automobile>
</root>
The code:
from lxml import etree
import sys
import re
from datetime import datetime
# global node signature map
structure_map = {}
# global code frequency map
frequency_map = {}
# output tree
tmp_root = etree.Element("tmp_root")
def process_element(el):
    # get the structure of the element
    structure = get_structure(el)
    global structure_map
    structure_key = etree.tostring(structure, pretty_print=True)
    if structure_key not in structure_map.keys():
        # add signature to structure map
        structure_map[structure_key] = structure
        # add node to output
        global tmp_root
        tmp_root.append(structure)
        # add signature to frequency map
        frequency_map[structure_key] = 1
    else:
        # increment frequency map
        frequency_map[structure_key] += 1
# returns a unique string representing the structure of the node
# including attributes
def get_structure(el):
    # create new element for the return value
    ret = etree.Element(el.tag)
    # get attributes
    attribute_keys = el.attrib.keys()
    for attribute_key in attribute_keys:
        ret.set(attribute_key, el.get(attribute_key))
    # check for children
    children = list(el)
    for child in children:
        ret.append(get_structure(child))
    return ret
if len(sys.argv) < 3:
    print "Must specify an XML file for processing, as well as an element type!"
    exit(0)
# Get XML file
xml = sys.argv[1]
# Get output file name
output_file = xml[0:xml.rindex(".")]+".txt"
# get target element type to evaluate
target = sys.argv[2]
# mark start
startTime = datetime.now()
# Parse XML
print '==========================='
print 'Parsing XML'
print '==========================='
context = etree.iterparse(xml, events=('end',))
element_to_clear = []
for event, element in context:
    element_to_clear.append(element)
    global target
    if element.tag == target:
        process_element(element)
        for ele in element_to_clear:
            ele.clear()
        element_to_clear = []
# create tree sorted by frequency
ranked = sorted(frequency_map.items(), key=lambda x: x[1], reverse=True)
root = etree.Element("root")
for item in ranked:
    structure = structure_map[item[0]]
    structure.set(target+"_frequency", str(item[1]))
    root.append(structure)
# pretty print root
out = open(output_file, 'w')
out.write(etree.tostring(root, pretty_print=True))
# output run time
time = datetime.now() - startTime
reg3 = re.compile("\\d+:\\d(\\d:\\d+\\.\\d{4})")
time = re.search(reg3, unicode(time))
time = "Runtime: %ss" % (time.group(1).encode("utf-8"))
print(time)
The command: xplore.py trivial.xml automobile
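The difference between the two versions appears to be when clear() is called. With 'end' events, a child element is reported before its parent, and clear() removes an element's attributes as well as its children and text. Clearing every element as soon as it is reported therefore empties user_defined before automobile is processed. The sketch below reproduces this under two assumptions: it uses Python 3 and the standard library's ElementTree instead of lxml on Python 2, and the function name child_attributes and the shortened data are made up for the demonstration.

```python
import io
import xml.etree.ElementTree as ET

DATA = b"""<root>
<automobile><mileage>20192</mileage><user_defined name="color">red</user_defined></automobile>
<automobile><mileage>17964</mileage><user_defined name="title_status">salvage</user_defined></automobile>
</root>"""

def child_attributes(clear_every_element):
    """Return the attributes seen on <user_defined> at the moment each <automobile> ends."""
    seen = []
    for event, element in ET.iterparse(io.BytesIO(DATA), events=("end",)):
        if element.tag == "automobile":
            seen.append(dict(element.find("user_defined").attrib))
            element.clear()  # safe here: the whole record has been processed
        elif clear_every_element:
            element.clear()  # wipes a child before its parent has been processed
    return seen

eager = child_attributes(clear_every_element=True)
deferred = child_attributes(clear_every_element=False)
print(eager)      # attributes already gone
print(deferred)   # attributes still present
```

Deferring the clear until the target element has been handled keeps memory use bounded per record while leaving the children intact for the uniqueness test.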

Create XML file by iterating over lists in python

I have checked this link, but it didn't solve my problem.
I have 2 lists:
a = [['txt','stxt','pi','min','max'],['txt1','stxt1','pi1','min1','max1']]
b = [[0.45,1.23],[0.75,1.53]]
for l1 in a:
    for l2 in b:
        root = ET.Element("Class ",name = l1[0])
        doc = ET.SubElement(root, "subclass" , name = l1[1])
        ET.SubElement(doc, l1[4], min = str(l2 [0]),max = str(l2 [1]))
        tree = ET.ElementTree(root)
        tree.write(FilePath)
The last record overwrites all the previous records. How can I write all the records to the XML file using Python? I also want each record saved on a new line in the XML file, but without pretty printing.
The output I need in the XML file:
<Class name="txt"><subclass name="stxt"><pi max="1.23" min="0.45" /></subclass></Class >
<Class name="txt1"><subclass name="stxt1"><pi1 max1="1.53" min1="0.75" /></subclass></Class >
But I am getting only one record in the XML file:
<Class name="txt1"><subclass name="stxt1"><pi1 max1="0.1077" min1="-0.0785" /></subclass></Class >
You are writing to the same file every time, so each write replaces the previous one. With that approach you would need a new file for every input, and the two nested for loops would produce 4 files with undesired combinations. Instead, zip is what you need:
a = [['txt','stxt','pi','min','max'],['txt1','stxt1','pi1','min1','max1']]
b = [[0.45,1.23],[0.75,1.53]]
from xml.etree import ElementTree as ET
root = ET.Element("xml")
for l1 in zip(a,b):
    sroot_root = ET.Element("Class ",name = l1[0][0])
    doc = ET.SubElement(sroot_root, "subclass" , name = l1[0][1])
    ET.SubElement(doc, l1[0][4], min = str(l1[1][0]),max = str(l1[1][1]))
    root.append(sroot_root)
tree = ET.ElementTree(root)
tree.write("test.xml")
Output :
Filename: test.xml
<xml><Class name="txt"><subclass name="stxt"><max max="1.23" min="0.45" /></subclass></Class ><Class name="txt1"><subclass name="stxt1"><max1 max="1.53" min="0.75" /></subclass></Class ></xml>
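The output above puts everything on one line and uses the fifth list entry as the tag name, which differs from what the question asked for. A possible variation, sketched below, serializes each record separately and joins them with newlines. It assumes the third list entry is the element name and the fourth and fifth are the attribute names, which is my reading of the desired output; it also drops the trailing space in "Class ", since a tag name cannot contain a space.

```python
from xml.etree import ElementTree as ET

a = [['txt', 'stxt', 'pi', 'min', 'max'], ['txt1', 'stxt1', 'pi1', 'min1', 'max1']]
b = [[0.45, 1.23], [0.75, 1.53]]

lines = []
for (cls, sub, tag, min_name, max_name), (low, high) in zip(a, b):
    record = ET.Element("Class", name=cls)
    doc = ET.SubElement(record, "subclass", name=sub)
    # The attribute names come from the list, so pass them as a dict
    # instead of keyword arguments.
    ET.SubElement(doc, tag, {min_name: str(low), max_name: str(high)})
    # encoding="unicode" returns a str with no XML declaration and no line breaks.
    lines.append(ET.tostring(record, encoding="unicode"))

output = "\n".join(lines) + "\n"
print(output, end="")
```

Writing output to a file with a single open(..., "w") call after the loop avoids the overwriting problem. Note that a file with several top-level elements is not one well-formed XML document, so a parser will need to read it line by line or have the records wrapped in a root element.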
