I'd like to parse and compare 2 XML files with the Python Etree parser as follows:
I have 2 XML files with loads of data. One is in English (the source file), the other one the corresponding French translation (the target file).
E.g.:
source file:
<AB>
<CD/>
<EF>
<GH>
<id>123</id>
<IJ>xyz</IJ>
<KL>DOG</KL>
<MN>dogs/dog</MN>
some more tags and info on same level
<metadata>
<entry>
<cl>Translation</cl>
<cl>English:dog/dogs</cl>
</entry>
<entry>
<string>blabla</string>
<string>blabla</string>
</entry>
some more strings and entries
</metadata>
</GH>
</EF>
<stuff/>
<morestuff/>
<otherstuff/>
<stuffstuff/>
<blubb/>
<bla/>
<blubbla>8</blubbla>
</AB>
The target file looks exactly the same, but has no text at some places:
<MN>chiens/chien</MN>
some more tags and info on same level
<metadata>
<entry>
<cl>Translation</cl>
<cl></cl>
</entry>
The French target file has an empty cross-language reference where I'd like to put in the information from the English source file whenever the 2 macros have the same ID.
I already wrote some code in which I replaced the string tag name with a unique tag name in order to identify the cross-language reference. Now I want to compare the 2 files and if 2 macros have the same ID, exchange the empty reference in the French file with the info from the English file. I was trying out the minidom parser before but got stuck and would like to try Etree now. I have hardly any knowledge about programming and find this very hard.
Here is the code I have so far:
import xml.etree.ElementTree as ElementTree
# english and french hold the paths of the source and target files
id_dict = {}
macros = ElementTree.parse(english)
for tag in macros.iter('macro'):
    id_ = tag.find('id')
    data = tag.find('cl')
    id_dict[id_.text] = data.text
macros = ElementTree.parse(french)
for tag in macros.iter('macro'):
    id_ = tag.find('id')
    target = tag.find('cl')
    if (target.text or '').strip() == '':
        target.text = id_dict[id_.text]
print(ElementTree.tostring(macros.getroot()))
I am more than clueless and reading other posts on this confuses me even more. I'd appreciate it very much if someone could enlighten me :-)
There are probably more details to be clarified. Here is a sample with some debug prints that shows the idea. It assumes that both files have exactly the same structure, and that you only want to go one level below the root:
import xml.etree.ElementTree as etree
english_tree = etree.parse('en.xml')
french_tree = etree.parse('fr.xml')
# Get the root elements, as they support iteration
# through their children (direct descendants)
english_root = english_tree.getroot()
french_root = french_tree.getroot()
# Iterate through the direct descendants of the root
# elements in both trees in parallel.
for en, fr in zip(english_root, french_root):
    assert en.tag == fr.tag  # check for the same structure
    if en.tag == 'id':
        assert en.text == fr.text  # check for the same id
    elif en.tag == 'string':
        if fr.text is None:
            fr.text = en.text
            print(en.text)  # displaying what was replaced
etree.dump(french_tree)
For more complex file structures, the loop through the direct children of the node can be replaced by iteration through all the elements of the tree. If the structures of the files are exactly the same, the following code will work:
import xml.etree.ElementTree as etree
english_tree = etree.parse('en.xml')
french_tree = etree.parse('fr.xml')
for en, fr in zip(english_tree.iter(), french_tree.iter()):
    assert en.tag == fr.tag  # check if the structure is the same
    if en.tag == 'id':
        assert en.text == fr.text  # identification must be the same
    elif en.tag == 'string':
        if fr.text is None:
            fr.text = en.text
            print(en.text)  # display the inserted text
# Write the result to the output file.
# tostring() returns bytes in Python 3, so decode before writing text.
with open('fr2.xml', 'w') as fout:
    fout.write(etree.tostring(french_tree.getroot()).decode('utf-8'))
However, this works only when both files have exactly the same structure. Let's follow the algorithm that would be used if the task were done manually. First, we need to find a French translation that is empty. Then it should be replaced by the English translation from the GH element with the same identification. A subset of XPath expressions is used to search for the elements:
import xml.etree.ElementTree as etree
def find_translation(tree, id_):
    # Search for the GH element with the given identification, and return
    # its translation if found. Otherwise None is returned implicitly.
    for gh in tree.iter('GH'):
        id_elem = gh.find('./id')
        if id_ == id_elem.text:
            # The related GH element found.
            # Find metadata entry, extract the translation.
            # Warning! This is a simplification for the fixed position
            # of the Translation entry.
            me = gh.find('./metadata/entry')
            assert len(me) == 2  # metadata/entry has two elements
            cl1 = me[0]
            assert cl1.text == 'Translation'
            cl2 = me[1]
            return cl2.text
# Body of the program. --------------------------------------------------
english_tree = etree.parse('en.xml')
french_tree = etree.parse('fr.xml')
for gh in french_tree.iter('GH'):  # iterate through the GH elements only
    # Get the identification of the GH section
    id_elem = gh.find('./id')
    id_ = id_elem.text
    # Find and check the metadata entry, extract the French translation.
    # Warning! This is a simplification for the fixed position of the Translation
    # entry.
    me = gh.find('./metadata/entry')
    assert len(me) == 2  # metadata/entry has two elements
    cl1 = me[0]
    assert cl1.text == 'Translation'
    cl2 = me[1]
    fr_translation = cl2.text
    # If the French translation is empty, put there the English translation
    # from the related element.
    if fr_translation is None:
        cl2.text = find_translation(english_tree, id_)
with open('fr2.xml', 'w') as fout:
    fout.write(etree.tostring(french_tree.getroot()).decode('utf-8'))
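The lookup by identification can also be done with a dictionary that is built once from the English tree, so the English file is not rescanned for every empty French entry. The following is only a sketch under assumptions: the inline EN and FR strings are trimmed-down stand-ins for en.xml and fr.xml, the second GH element (id 456) is invented for illustration, and the helper names translation_cl and fill_missing are hypothetical.

```python
import xml.etree.ElementTree as etree

# Hypothetical, trimmed-down stand-ins for en.xml and fr.xml.
# The GH element with id 456 is invented for illustration.
EN = """<AB><EF>
<GH><id>123</id><metadata><entry><cl>Translation</cl><cl>English:dog/dogs</cl></entry></metadata></GH>
<GH><id>456</id><metadata><entry><cl>Translation</cl><cl>English:cat/cats</cl></entry></metadata></GH>
</EF></AB>"""
FR = """<AB><EF>
<GH><id>456</id><metadata><entry><cl>Translation</cl><cl></cl></entry></metadata></GH>
<GH><id>123</id><metadata><entry><cl>Translation</cl><cl></cl></entry></metadata></GH>
</EF></AB>"""


def translation_cl(gh):
    """Return the second <cl> of the first metadata/entry, or None.

    Same simplification as above: the Translation entry is assumed
    to sit at a fixed position.
    """
    entry = gh.find('./metadata/entry')
    if entry is None or len(entry) != 2 or entry[0].text != 'Translation':
        return None
    return entry[1]


def fill_missing(english_root, french_root):
    """Copy English translations into empty French ones; return the count."""
    # Build the id -> translation mapping once.
    by_id = {}
    for gh in english_root.iter('GH'):
        cl = translation_cl(gh)
        if cl is not None:
            by_id[gh.findtext('id')] = cl.text
    filled = 0
    for gh in french_root.iter('GH'):
        cl = translation_cl(gh)
        # An empty element has text None; also treat whitespace as empty.
        if cl is not None and not (cl.text or '').strip():
            english = by_id.get(gh.findtext('id'))
            if english is not None:
                cl.text = english
                filled += 1
    return filled


english_root = etree.fromstring(EN)
french_root = etree.fromstring(FR)
print(fill_missing(english_root, french_root))
print(etree.tostring(french_root).decode('utf-8'))
```

Because the matching goes through the id, the GH elements do not have to appear in the same order in both files.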
I am trying to parse NPORT-P XML SEC submission. My code (Python 3.6.8) with a sample XML record:
import xml.etree.ElementTree as ET
content_xml = '<?xml version="1.0" encoding="UTF-8"?><edgarSubmission xmlns="http://www.sec.gov/edgar/nport" xmlns:com="http://www.sec.gov/edgar/common" xmlns:ncom="http://www.sec.gov/edgar/nportcommon" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><headerData></headerData><formData><genInfo></genInfo><fundInfo></fundInfo><invstOrSecs><invstOrSec><name>N/A</name><lei>N/A</lei><title>US 10YR NOTE (CBT)Sep20</title><cusip>N/A</cusip> <identifiers> <ticker value="TYU0"/> </identifiers> <derivativeInfo> <futrDeriv derivCat="FUT"> <counterparties> <counterpartyName>Chicago Board of Trade</counterpartyName> <counterpartyLei>549300EX04Q2QBFQTQ27</counterpartyLei> </counterparties><payOffProf>Short</payOffProf> <descRefInstrmnt> <otherRefInst> <issuerName>U.S. Treasury 10 Year Notes</issuerName> <issueTitle>U.S. Treasury 10 Year Notes</issueTitle> <identifiers> <cusip value="N/A"/><other otherDesc="USER DEFINED" value="TY_Comdty"/> </identifiers> </otherRefInst> </descRefInstrmnt> <expDate>2020-09-21</expDate> <notionalAmt>-2770555</notionalAmt> <curCd>USD</curCd> <unrealizedAppr>-12882.5</unrealizedAppr></futrDeriv> </derivativeInfo> </invstOrSec> </invstOrSecs> <signature> </signature> </formData></edgarSubmission>'
content_tree = ET.ElementTree(ET.fromstring(bytes(content_xml, encoding='utf-8')))
content_root = content_tree.getroot()
for edgar_submission in content_root.iter('{http://www.sec.gov/edgar/nport}edgarSubmission'):
    for form_data in edgar_submission.iter('{http://www.sec.gov/edgar/nport}formData'):
        for genInfo in form_data.iter('{http://www.sec.gov/edgar/nport}genInfo'):
            pass
        for fundInfo in form_data.iter('{http://www.sec.gov/edgar/nport}fundInfo'):
            pass
        for invstOrSecs in form_data.iter('{http://www.sec.gov/edgar/nport}invstOrSecs'):
            for invstOrSec in invstOrSecs.iter('{http://www.sec.gov/edgar/nport}invstOrSec'):
                myrow = []
                myrow.append(getattr(invstOrSec.find('{http://www.sec.gov/edgar/nport}name'), 'text', ''))
                myrow.append(getattr(invstOrSec.find('{http://www.sec.gov/edgar/nport}lei'), 'text', ''))
                security_title = getattr(invstOrSec.find('{http://www.sec.gov/edgar/nport}title'), 'text', '')
                myrow.append(security_title)
                myrow.append(getattr(invstOrSec.find('{http://www.sec.gov/edgar/nport}cusip'), 'text', ''))
                for identifiers in invstOrSec.iter('{http://www.sec.gov/edgar/nport}identifiers'):
                    if identifiers.find('{http://www.sec.gov/edgar/nport}isin') is not None:
                        myrow.append(identifiers.find('{http://www.sec.gov/edgar/nport}isin').attrib['value'])
                    else:
                        myrow.append('')
                        if security_title == "US 10YR NOTE (CBT)Sep20":
                            print("No ISIN")
                    if identifiers.find('{http://www.sec.gov/edgar/nport}ticker') is not None:
                        myrow.append(identifiers.find('{http://www.sec.gov/edgar/nport}ticker').attrib['value'])
                    else:
                        myrow.append('')
                        if security_title == "US 10YR NOTE (CBT)Sep20":
                            print("No Ticker")
                    if identifiers.find('{http://www.sec.gov/edgar/nport}other') is not None:
                        myrow.append(identifiers.find('{http://www.sec.gov/edgar/nport}other').attrib['value'])
                    else:
                        myrow.append('')
                        if security_title == "US 10YR NOTE (CBT)Sep20":
                            print("No Other")
The output from this code is:
No ISIN
No Other
No ISIN
No Ticker
This works fine, aside from the fact that the identifiers iteration invstOrSec.iter('{http://www.sec.gov/edgar/nport}identifiers') finds the identifiers under formData>invstOrSecs>invstOrSec but also other identifiers under a nested tag, formData>invstOrSecs>invstOrSec>derivativeInfo>futrDeriv>descRefInstrmnt>otherRefInst. How can I restrict my iter or find to the right level? I have tried unsuccessfully to get the parent, but I cannot find how to do this using the {namespace}tag notation. Any ideas?
So I switched from ElementTree to lxml using an import like this to avoid code changes:
from lxml import etree as ET
Make sure you check https://lxml.de/1.3/compatibility.html to avoid compatibility issues. In my case lxml worked without issues.
I was then able to use the getparent() method to read only the identifiers from the right part of the XML file:
if identifiers.getparent().tag == '{http://www.sec.gov/edgar/nport}invstOrSec':
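If staying with the standard library is preferred, one possible alternative is to ask only for direct children: Element.findall() with a plain tag matches direct children only, whereas Element.iter() walks the whole subtree. This is a sketch, not the approach used above; the SAMPLE string is a shortened, hypothetical version of the record in the question, and direct_identifiers is an invented helper name.

```python
import xml.etree.ElementTree as ET

NS = '{http://www.sec.gov/edgar/nport}'

# Shortened, hypothetical sample with the same nesting as the question:
# one <identifiers> directly under <invstOrSec>, one nested deeper.
SAMPLE = (
    '<edgarSubmission xmlns="http://www.sec.gov/edgar/nport"><formData>'
    '<invstOrSecs><invstOrSec><title>US 10YR NOTE (CBT)Sep20</title>'
    '<identifiers><ticker value="TYU0"/></identifiers>'
    '<derivativeInfo><futrDeriv><descRefInstrmnt><otherRefInst>'
    '<identifiers><other otherDesc="USER DEFINED" value="TY_Comdty"/></identifiers>'
    '</otherRefInst></descRefInstrmnt></futrDeriv></derivativeInfo>'
    '</invstOrSec></invstOrSecs></formData></edgarSubmission>'
)


def direct_identifiers(invst_or_sec):
    """Return only the <identifiers> elements that are direct children."""
    # findall() with a bare tag does not descend below the direct children.
    return invst_or_sec.findall(NS + 'identifiers')


root = ET.fromstring(SAMPLE)
for invst_or_sec in root.iter(NS + 'invstOrSec'):
    for identifiers in direct_identifiers(invst_or_sec):
        ticker = identifiers.find(NS + 'ticker')
        print(ticker.get('value') if ticker is not None else '')
```

With this, the nested identifiers under otherRefInst are never visited, so no parent check is needed.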
I want to copy some data from an .xml file based on a search value.
In the XML file below, I want to search for 0xCCB7B836 and copy the data inside that block:
4e564d2d52656648
6173685374617274
1782af065966579e
899885d440d3ad67
d04b41b15e2b13c2
One more example:
search for the value 0xECFBBA1A and return 0000
or
search for the value 0xA54E2B5A and return 30d4
<MEM_DATA>
<MEM_SECTOR>
<MEM_SECTOR_NUMBER>0</MEM_SECTOR_NUMBER>
<MEM_SECTOR_STATUS>ACTIVE</MEM_SECTOR_STATUS>
<MEM_SECTOR_STARTADR>0x800000</MEM_SECTOR_STARTADR>
<MEM_SECTOR_ENDADR>0x0</MEM_SECTOR_ENDADR>
<MEM_SECTOR_COUNTER>0x1</MEM_SECTOR_COUNTER>
<MEM_ERASED_MARKER>SET</MEM_ERASED_MARKER>
<MEM_USED_MARKER>SET</MEM_USED_MARKER>
<MEM_FULL_MARKER>NOT_SET</MEM_FULL_MARKER>
<MEM_ERASE_MARKER>NOT_SET</MEM_ERASE_MARKER>
<MEM_START_MARKER>SET</MEM_START_MARKER>
<MEM_START_OFFSET>0x1</MEM_START_OFFSET>
<MEM_CLONE_MARKER>NOT_SET</MEM_CLONE_MARKER>
<MEM_BLOCK>
<MEM_BLOCK_ID>0x101</MEM_BLOCK_ID>
<MEM_BLOCK_NAME>UNKNOWN</MEM_BLOCK_NAME>
<MEM_BLOCK_STATUS>VALID</MEM_BLOCK_STATUS>
<MEM_BLOCK_FLAGS>0x0</MEM_BLOCK_FLAGS>
<MEM_BLOCK_STORAGE>Emulation</MEM_BLOCK_STORAGE>
<MEM_BLOCK_LEN>0x28</MEM_BLOCK_LEN>
<MEM_BLOCK_VERSION>0x0</MEM_BLOCK_VERSION>
<MEM_BLOCK_HEADER_CRC>0xE527</MEM_BLOCK_HEADER_CRC>
<MEM_BLOCK_CRC>0xCCB7B836</MEM_BLOCK_CRC>
<MEM_BLOCK_CRC2>None</MEM_BLOCK_CRC2>
<MEM_BLOCK_DATA>
<MEM_PAGE_DATA>4e564d2d52656648</MEM_PAGE_DATA>
<MEM_PAGE_DATA>6173685374617274</MEM_PAGE_DATA>
<MEM_PAGE_DATA>1782af065966579e</MEM_PAGE_DATA>
<MEM_PAGE_DATA>899885d440d3ad67</MEM_PAGE_DATA>
<MEM_PAGE_DATA>d04b41b15e2b13c2</MEM_PAGE_DATA>
</MEM_BLOCK_DATA>
</MEM_BLOCK>
<MEM_BLOCK>
<MEM_BLOCK_ID>0x20F</MEM_BLOCK_ID>
<MEM_BLOCK_NAME>UNKNOWN</MEM_BLOCK_NAME>
<MEM_BLOCK_STATUS>VALID</MEM_BLOCK_STATUS>
<MEM_BLOCK_FLAGS>0x0</MEM_BLOCK_FLAGS>
<MEM_BLOCK_STORAGE>Emulation</MEM_BLOCK_STORAGE>
<MEM_BLOCK_LEN>0x2</MEM_BLOCK_LEN>
<MEM_BLOCK_VERSION>0x0</MEM_BLOCK_VERSION>
<MEM_BLOCK_HEADER_CRC>0xE0D2</MEM_BLOCK_HEADER_CRC>
<MEM_BLOCK_CRC>0xECFBBA1A</MEM_BLOCK_CRC>
<MEM_BLOCK_CRC2>None</MEM_BLOCK_CRC2>
<MEM_BLOCK_DATA>
<MEM_PAGE_DATA>0000</MEM_PAGE_DATA>
</MEM_BLOCK_DATA>
</MEM_BLOCK>
<MEM_BLOCK>
<MEM_BLOCK_ID>0x1F8</MEM_BLOCK_ID>
<MEM_BLOCK_NAME>UNKNOWN</MEM_BLOCK_NAME>
<MEM_BLOCK_STATUS>VALID</MEM_BLOCK_STATUS>
<MEM_BLOCK_FLAGS>0x0</MEM_BLOCK_FLAGS>
<MEM_BLOCK_STORAGE>Emulation</MEM_BLOCK_STORAGE>
<MEM_BLOCK_LEN>0x2</MEM_BLOCK_LEN>
<MEM_BLOCK_VERSION>0x0</MEM_BLOCK_VERSION>
<MEM_BLOCK_HEADER_CRC>0x1DCC</MEM_BLOCK_HEADER_CRC>
<MEM_BLOCK_CRC>0xA54E2B5A</MEM_BLOCK_CRC>
<MEM_BLOCK_CRC2>None</MEM_BLOCK_CRC2>
<MEM_BLOCK_DATA>
<MEM_PAGE_DATA>30d4</MEM_PAGE_DATA>
</MEM_BLOCK_DATA>
</MEM_BLOCK>
</MEM_SECTOR>
</MEM_DATA>
Assuming that this XML data is in a file named test.xml, you can do something like this:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
def search_and_copy(query):
    for child in root.findall("MEM_SECTOR/MEM_BLOCK"):
        if child.find("MEM_BLOCK_CRC").text == query:
            return [item.text for item in child.findall("MEM_BLOCK_DATA/*")]
Let's try this search_and_copy() function out:
>>> search_and_copy("0xCCB7B836")
['4e564d2d52656648', '6173685374617274', '1782af065966579e', '899885d440d3ad67', 'd04b41b15e2b13c2']
>>> search_and_copy("0xA54E2B5A")
['30d4']
We can use XPath, with Python's xml.etree and the elementpath package, to write a function that retrieves the data.
Breakdown of the code below (within the elementpath.Selector):
1. The first line looks for elements whose text equals our search string.
2. The second line, .., goes back one step to get the parent element.
3. Proceeding from the parent element, the third line searches for MEM_PAGE_DATA within it. This element holds the data we are actually interested in.
4. The rest of the code simply pulls the text from the matches.
import xml.etree.ElementTree as ET
import elementpath
#wrapped the shared data into a test.xml file
root = ET.parse('test.xml').getroot()
def find_data(search_string):
    selector = elementpath.Selector(f""".//*[text()='{search_string}']
                                        //..
                                        //MEM_PAGE_DATA""")
    #pull text from the match
    result = [entry.text for entry in selector.select(root)]
    return result
Test on the strings provided:
find_data("0xCCB7B836")
['4e564d2d52656648',
'6173685374617274',
'1782af065966579e',
'899885d440d3ad67',
'd04b41b15e2b13c2']
find_data("0xECFBBA1A")
['0000']
find_data("0xA54E2B5A")
['30d4']
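For comparison, the limited XPath subset built into xml.etree.ElementTree may be enough here, without an extra package: it supports a [tag='text'] predicate that selects elements having a child with the given text. The sketch below assumes the CRC text has no surrounding whitespace, as in the sample; SAMPLE is an abbreviated, hypothetical copy of the data, and find_pages is an invented name.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hypothetical copy of the data from the question.
SAMPLE = """<MEM_DATA><MEM_SECTOR>
<MEM_BLOCK><MEM_BLOCK_CRC>0xCCB7B836</MEM_BLOCK_CRC><MEM_BLOCK_DATA>
<MEM_PAGE_DATA>4e564d2d52656648</MEM_PAGE_DATA>
<MEM_PAGE_DATA>6173685374617274</MEM_PAGE_DATA>
</MEM_BLOCK_DATA></MEM_BLOCK>
<MEM_BLOCK><MEM_BLOCK_CRC>0xECFBBA1A</MEM_BLOCK_CRC><MEM_BLOCK_DATA>
<MEM_PAGE_DATA>0000</MEM_PAGE_DATA>
</MEM_BLOCK_DATA></MEM_BLOCK>
</MEM_SECTOR></MEM_DATA>"""


def find_pages(root, crc):
    """Return the MEM_PAGE_DATA texts of the block whose CRC equals crc."""
    # [MEM_BLOCK_CRC='...'] keeps only MEM_BLOCK elements that have a
    # MEM_BLOCK_CRC child with exactly this text.
    path = f".//MEM_BLOCK[MEM_BLOCK_CRC='{crc}']/MEM_BLOCK_DATA/MEM_PAGE_DATA"
    return [page.text for page in root.findall(path)]


root = ET.fromstring(SAMPLE)
print(find_pages(root, "0xCCB7B836"))
print(find_pages(root, "0xECFBBA1A"))
```

An unknown CRC simply yields an empty list, which avoids a separate None check.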
Hi, I am very new to Python programming. I have an XML file with this structure:
<?xml version="1.0" encoding="UTF-8"?>
<LidcReadMessage xsi:schemaLocation="http://www.nih.gov http://troll.rad.med.umich.edu/lidc/LidcReadMessage.xsd"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.nih.gov" uid="1.3.6.1.4.1.14519.5.2.1.6279.6001.1307390687803.0">
<ResponseHeader>
<Version>1.8.1</Version>
<MessageId>-421198203</MessageId>
<DateRequest>2007-11-01</DateRequest>
<TimeRequest>12:30:44</TimeRequest>
<RequestingSite>removed</RequestingSite>
<ServicingSite>removed</ServicingSite>
<TaskDescription>Second unblinded read</TaskDescription>
<CtImageFile>removed</CtImageFile>
<SeriesInstanceUid>1.3.6.1.4.1.14519.5.2.1.6279.6001.179049373636438705059720603192</SeriesInstanceUid>
<DateService>2008-08-18</DateService>
<TimeService>02:05:51</TimeService>
<ResponseDescription>1 - Reading complete</ResponseDescription>
<StudyInstanceUID>1.3.6.1.4.1.14519.5.2.1.6279.6001.298806137288633453246975630178</StudyInstanceUID>
</ResponseHeader>
<readingSession>
<annotationVersion>3.12</annotationVersion>
<servicingRadiologistID>540461523</servicingRadiologistID>
<unblindedReadNodule>
<noduleID>Nodule 001</noduleID>
<characteristics>
<subtlety>5</subtlety>
<internalStructure>1</internalStructure>
<calcification>6</calcification>
<sphericity>3</sphericity>
<margin>3</margin>
<lobulation>3</lobulation>
<spiculation>4</spiculation>
<texture>5</texture>
<malignancy>5</malignancy>
</characteristics>
<roi>
<imageZposition>-125.000000 </imageZposition>
<imageSOP_UID>1.3.6.1.4.1.14519.5.2.1.6279.6001.110383487652933113465768208719</imageSOP_UID>
......
There are four readingSession elements, each of which contains multiple roi elements. Each roi contains an imageSOP_UID. I need to extract the imageSOP_UID values from all of them.
Right now I am doing this:
import xml.etree.ElementTree as ET
tree = ET.parse('069.xml')
root = tree.getroot()
#lst = []
for readingsession in root.iter('readingSession'):
    for roi in readingsession.findall('roi'):
        id = roi.findtext('imageSOP_UID')
        print(id)
but the only output is:
Process finished with exit code 0.
I would appreciate any help.
The real problem has been with the namespace. I tried with and without it, but it didn't work with this code.
import xml.etree.ElementTree as ET
import pydicom
ds = pydicom.dcmread("000071.dcm")
uid = ds.SOPInstanceUID
tree = ET.parse("069.xml")
root = tree.getroot()
for child in root:
    print(child.tag)
    if child.tag == '{http://www.nih.gov}readingSession':
        read = child.find('{http://www.nih.gov}unblindedReadNodule')
        if read is not None:
            nodule_id = read.find('{http://www.nih.gov}noduleID').text
            xml_uid = read.find('{http://www.nih.gov}roi').find('{http://www.nih.gov}imageSOP_UID').text
            if xml_uid == uid:
                print(xml_uid, "=", uid)
                roi = read.find('{http://www.nih.gov}roi')
                print(roi)
This works completely fine to get a UID from a DICOM image of the LIDC/IDRI dataset and then find the same UID in the XML file for its region of interest.
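To collect every imageSOP_UID rather than only the first one per reading session, a namespace mapping can be passed to findall(). This is a minimal sketch: the prefix lidc is an arbitrary choice, the SAMPLE document is a tiny hypothetical stand-in for 069.xml with made-up UIDs, and roi_uids is an invented name.

```python
import xml.etree.ElementTree as ET

# The prefix "lidc" is arbitrary; only the URI has to match the file's xmlns.
NS = {'lidc': 'http://www.nih.gov'}

# Tiny hypothetical stand-in for 069.xml; the UIDs are made up.
SAMPLE = """<LidcReadMessage xmlns="http://www.nih.gov">
  <readingSession>
    <unblindedReadNodule>
      <noduleID>Nodule 001</noduleID>
      <roi><imageSOP_UID>1.2.3</imageSOP_UID></roi>
      <roi><imageSOP_UID>1.2.4</imageSOP_UID></roi>
    </unblindedReadNodule>
  </readingSession>
</LidcReadMessage>"""


def roi_uids(root):
    """Return the text of every imageSOP_UID found under any roi element."""
    return [uid.text for uid in root.findall('.//lidc:roi/lidc:imageSOP_UID', NS)]


root = ET.fromstring(SAMPLE)
print(roi_uids(root))
# Without the namespace nothing matches, which explains the empty output
# of the first attempt:
print(root.findall('.//roi'))
```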
I took an OpenRefine template for translating a csv to a giant MODS XML record, then a .py script for cleaning it and turning it into several smaller xml files, named using one of the tags. It works perfectly. However, when I tried altering it to fit my needs for Dublin Core xml records... not so much.
I've got an OpenRefine template that gives me this from my csv:
<collection xmlns:xsi="http:www.w3.org/2001/XMLSchema-instance">
<record xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dctype="http://purl.org/dc/dcmitype/" xmlns:xsi="http://www.w3.org/2001XMLSchema-instance" xsi:schemaLocation="http://purl.org/dc/elements/1.1/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcterms.xsd http://purl.org/dc/dcmitype/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcmitype.xsd http://dublincore.org/schemas/xmls/qdc/2008/02/11/qualifieddc.xsd">
<dc:title>[Mary Adams at the organ]</dc:title>
<dc:creator>MacAfee, Don</dc:creator>
<dc:date>4/14/1964</dc:date>
<dc:subject>organs</dc:subject><dc:subject>musical instruments</dc:subject><dc:subject>musicians</dc:subject><dc:subject>Adams, Mary</dc:subject>
<dc:description>Music instructor Mary C. Adams playing the organ.</dc:description>
<dc:format>1 print : b&w ; 6.5 x 6.5 in.</dc:format>
<dcterms:spatial>Alexandria, Virginia</dcterms:spatial>
<dc:type>Photograph</dc:type>
<dc:format>Image</dc:format>
<dc:identifier>MS332-01-01-001</dc:identifier>
<dc:rights>Copyright has not been assigned to the Alexandria Library. All requests for permission to publish or quote from manuscripts must be submitted in writing to the Alexandria Library. Permission for publication is given on behalf of the Alexandria Library as the owner of the physical items and is not intended to include or imply permission of the copyright holder, which must also be obtained by the researcher.</dc:rights>
</record>
<record xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dctype="http://purl.org/dc/dcmitype/" xmlns:xsi="http://www.w3.org/2001XMLSchema-instance" xsi:schemaLocation="http://purl.org/dc/elements/1.1/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcterms.xsd http://purl.org/dc/dcmitype/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcmitype.xsd http://dublincore.org/schemas/xmls/qdc/2008/02/11/qualifieddc.xsd">
<dc:title>[Portrait of Dr. Robert Adeson]</dc:title>
<dc:date>1980</dc:date>
<dc:subject>physicians</dc:subject><dc:subject>doctors</dc:subject><dc:subject>Adeson, Robert, M.D.</dc:subject>
<dc:description>Dr. Robert L. Adeson, Alexandria Hospital.</dc:description>
<dc:format>1 print : b&w ; 5 x 7 in.</dc:format>
<dcterms:spatial>Alexandria, Virginia</dcterms:spatial>
<dc:type>Photograph</dc:type>
<dc:format>Image</dc:format>
<dc:identifier>MS332-01-01-002</dc:identifier>
<dc:rights>Copyright has not been assigned to the Alexandria Library. All requests for permission to publish or quote from manuscripts must be submitted in writing to the Alexandria Library. Permission for publication is given on behalf of the Alexandria Library as the owner of the physical items and is not intended to include or imply permission of the copyright holder, which must also be obtained by the researcher.</dc:rights>
</record>
</collection>
I've got a Python program that cleans and separates a MODS record, which I've modified to look like this:
import os, lxml.etree as ET
output_path = 'C:\\Users\\Staff\\Desktop\\Metadata\\SplitXML\\'
# parse source.xml with lxml
tree = ET.parse('source.xml')
# start cleanup
# remove any element tails
for element in tree.iter():
    element.tail = None
    # remove any line breaks or tabs in element text
    if element.text:
        if '\n' in element.text:
            element.text = element.text.replace('\n', '')
        if '\t' in element.text:
            element.text = element.text.replace('\t', '')
# remove any remaining whitespace
parser = ET.XMLParser(remove_blank_text=True, remove_comments=True, recover=True)
treestring = ET.tostring(tree)
clean = ET.XML(treestring, parser)
# remove recursively empty nodes
def recursively_empty(e):
    if e.text:
        return False
    return all((recursively_empty(c) for c in e.iterchildren()))
context = ET.iterwalk(clean)
for action, elem in context:
    parent = elem.getparent()
    if recursively_empty(elem):
        parent.remove(elem)
# remove nodes with blank attribute
for element in clean.xpath(".//*[@*='']"):
    element.getparent().remove(element)
# remove nodes with attribute "null"
for element in clean.xpath(".//*[@*='null']"):
    element.getparent().remove(element)
# finished cleanup
# write out to intermediate file
with open('clean.xml', 'wb') as f:
    f.write(ET.tostring(clean))
print("XML is now clean")
# parse the clean xml
cleanxml = ET.iterparse('clean.xml', events=('end', ))
# find the <dc> nodes
for event, elem in cleanxml:
    if elem.tag == '{http://purl.org/dc/elements/1.1/}record':
        # name new files using the <dc:identifier> tag
        identifier = elem.find('{http://purl.org/dc/elements/1.1/}dc:identifier').text
        filename = format(identifier + "_DC.xml")
        # write out to new file
        with open(output_path+filename, 'wb') as f:
            f.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")
            f.write(ET.tostring(elem, pretty_print = True))
        print("Writing", filename)
# remove the intermediate file
os.remove('clean.xml')
print("All done!")
The cmd prints the "XML is now clean" and "All done!" statements, HOWEVER, there are no files in the SplitXML directory (or anywhere). My attempt at de-bugging was to comment out the os.remove('clean.xml') line so I could look at the cleaned xml. I've done this with the MODS .py script, and the xml file looks like what you'd expect. However, the clean.xml file on the DC one is clean, but just one long string of code, rather than using different lines and tabs, like this:
<collection xmlns:xsi="http:www.w3.org/2001/XMLSchema-instance"><record xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dctype="http://purl.org/dc/dcmitype/" xmlns:xsi="http://www.w3.org/2001XMLSchema-instance" xsi:schemaLocation="http://purl.org/dc/elements/1.1/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcterms.xsd http://purl.org/dc/dcmitype/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcmitype.xsd http://dublincore.org/schemas/xmls/qdc/2008/02/11/qualifieddc.xsd"><dc:title>[Mary Adams at the organ]</dc:title><dc:creator>MacAfee, Don</dc:creator><dc:date>4/14/1964</dc:date><dc:subject>organs</dc:subject><dc:subject>musical instruments</dc:subject><dc:subject>musicians</dc:subject><dc:subject>Adams, Mary</dc:subject><dc:description>Music instructor Mary C. Adams playing the organ.</dc:description><dc:format>1 print : b&w ; 6.5 x 6.5 in.</dc:format><dcterms:spatial>Alexandria, Virginia</dcterms:spatial><dc:type>Photograph</dc:type><dc:format>Image</dc:format><dc:identifier>MS332-01-01-001</dc:identifier><dc:rights>Copyright has not been assigned to the Alexandria Library. All requests for permission to publish or quote from manuscripts must be submitted in writing to the Alexandria Library. 
Permission for publication is given on behalf of the Alexandria Library as the owner of the physical items and is not intended to include or imply permission of the copyright holder, which must also be obtained by the researcher.</dc:rights></record><record xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dctype="http://purl.org/dc/dcmitype/" xmlns:xsi="http://www.w3.org/2001XMLSchema-instance" xsi:schemaLocation="http://purl.org/dc/elements/1.1/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcterms.xsd http://purl.org/dc/dcmitype/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcmitype.xsd http://dublincore.org/schemas/xmls/qdc/2008/02/11/qualifieddc.xsd"><dc:title>[Portrait of Dr. Robert Adeson]</dc:title><dc:date>1980</dc:date><dc:subject>physicians</dc:subject><dc:subject>doctors</dc:subject><dc:subject>Adeson, Robert, M.D.</dc:subject><dc:description>Dr. Robert L. Adeson, Alexandria Hospital.</dc:description><dc:format>1 print : b&w ; 5 x 7 in.</dc:format><dcterms:spatial>Alexandria, Virginia</dcterms:spatial><dc:type>Photograph</dc:type><dc:format>Image</dc:format><dc:identifier>MS332-01-01-002</dc:identifier><dc:rights>Copyright has not been assigned to the Alexandria Library. All requests for permission to publish or quote from manuscripts must be submitted in writing to the Alexandria Library. Permission for publication is given on behalf of the Alexandria Library as the owner of the physical items and is not intended to include or imply permission of the copyright holder, which must also be obtained by the researcher.</dc:rights></record></collection>
If it helps, here's the original Python code for cleaning and splitting MODS. I got it from calhist on github.
# Split XML containing many <mods> elements into individual files
# Modified from script found here: http://stackoverflow.com/questions/36155049/splitting-xml-file-into-multiple-at-given-tags
# by Bill Levay for California Historical Society
import os, lxml.etree as ET
# uncomment below modules if doing MODS cleanup on existing Islandora objects
import codecs, json
output_path = 'C:\\Users\\Staff\\Desktop\\Metadata\\SplitXML\\'
# parse source.xml with lxml
tree = ET.parse('source.xml')
# start cleanup
# remove any element tails
for element in tree.iter():
    element.tail = None
    # remove any line breaks or tabs in element text
    if element.text:
        if '\n' in element.text:
            element.text = element.text.replace('\n', '')
        if '\t' in element.text:
            element.text = element.text.replace('\t', '')
# remove any remaining whitespace
parser = ET.XMLParser(remove_blank_text=True, remove_comments=True, recover=True)
treestring = ET.tostring(tree)
clean = ET.XML(treestring, parser)
# remove recursively empty nodes
# found here: https://stackoverflow.com/questions/12694091/python-lxml-how-to-remove-empty-repeated-tags
def recursively_empty(e):
    if e.text:
        return False
    return all((recursively_empty(c) for c in e.iterchildren()))
context = ET.iterwalk(clean)
for action, elem in context:
    parent = elem.getparent()
    if recursively_empty(elem):
        parent.remove(elem)
# remove nodes with blank attribute
# for element in clean.xpath(".//*[@*='']"):
#     element.getparent().remove(element)
# remove nodes with attribute "null"
for element in clean.xpath(".//*[@*='null']"):
    element.getparent().remove(element)
# finished cleanup
# write out to intermediate file
with open('clean.xml', 'wb') as f:
    f.write(ET.tostring(clean))
print("XML is now clean")
# parse the clean xml
cleanxml = ET.iterparse('clean.xml', events=('end', ))
###
# uncomment this section if doing MODS cleanup on existing Islandora objects
# getting islandora IDs for existing collections
###
# item_list = []
# json_path = 'C:\\mods\\data.json'
# with codecs.open(json_path, encoding='utf-8') as filename:
#     item_list = json.load(filename)
# filename.close
###
# find the <mods> nodes
for event, elem in cleanxml:
    if elem.tag == '{http://www.loc.gov/mods/v3}mods':
        # the filenames of the resulting xml files will be based on the <identifier> element
        # edit the specific element or attribute if necessary
        identifier = elem.find('{http://www.loc.gov/mods/v3}identifier[@type="local"]').text
        filename = format(identifier + "_MODS.xml")
        ###
        # uncomment this section if doing MODS cleanup on existing Islandora objects
        # look through the list of object metadata and get the islandora ID by matching the digital object ID
        ###
        # for item in item_list:
        #     local_ID = item["identifier-type:local"]
        #     islandora_ID = item["PID"]
        #     if identifier == local_ID:
        #         filename = format(islandora_ID + "_MODS.xml")
        ###
        # write out to new file
        with open(output_path+filename, 'wb') as f:
            f.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")
            f.write(ET.tostring(elem, pretty_print = True))
        print("Writing", filename)
# remove the intermediate file
os.remove('clean.xml')
print("All done!")
I found two namespace-related problems:
The record element is in no namespace. Therefore, you need to change
if elem.tag == '{http://purl.org/dc/elements/1.1/}record':
to
if elem.tag == 'record':
elem.find('{http://purl.org/dc/elements/1.1/}dc:identifier') is not correct. The dc: bit must be removed.
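The two corrections can be checked in isolation with a small sketch. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages (the tag and find calls shown behave the same way in both); SAMPLE is a cut-down, hypothetical version of the collection above, and output_filenames is an invented helper that only computes the file names rather than writing files.

```python
import xml.etree.ElementTree as ET

# In {namespace}tag notation the prefix "dc:" is replaced by the URI in braces.
DC = '{http://purl.org/dc/elements/1.1/}'

# Cut-down, hypothetical version of the collection from the question.
SAMPLE = """<collection>
  <record xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>[Mary Adams at the organ]</dc:title>
    <dc:identifier>MS332-01-01-001</dc:identifier>
  </record>
  <record xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>[Portrait of Dr. Robert Adeson]</dc:title>
    <dc:identifier>MS332-01-01-002</dc:identifier>
  </record>
</collection>"""


def output_filenames(root):
    """Return the file name that would be used for each record."""
    names = []
    # <record> has no prefix and no default namespace, so its tag is plain.
    for elem in root.iter('record'):
        # <dc:identifier> is in the dc namespace: URI in braces, no "dc:".
        identifier = elem.find(DC + 'identifier').text
        names.append(identifier + '_DC.xml')
    return names


root = ET.fromstring(SAMPLE)
print(output_filenames(root))
```

Declaring xmlns:dc on record only makes the prefix available to its descendants; it does not put record itself into that namespace.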
So, I'm parsing this xml file of moderate size (about 27K lines). Not far into it, I'm seeing unexpected behavior from ElementTree.Element where I get Element.text for one entry but not the next, yet it's there in the source XML as you can see:
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:enumeration value="24">
<xs:annotation>
<xs:documentation>UPC12 (item-specific) on cover 2</xs:documentation>
<xs:documentation>AKA item/price; ‘cover 2’ is defined as the inside front cover of a book</xs:documentation>
</xs:annotation>
</xs:enumeration>
<xs:enumeration value="25">
<xs:annotation>
<xs:documentation>UPC12+5 (item-specific) on cover 2</xs:documentation>
<xs:documentation>AKA item/price; ‘cover 2’ is defined as the inside front cover of a book</xs:documentation>
</xs:annotation>
</xs:enumeration>
When I encounter an enumeration tag I call this function:
import xml.etree.cElementTree as ElementTree
...
def _parse_list_item(xmlns: str, list_id: int, itemElement: ElementTree.Element) -> ListItem:
    if isinstance(itemElement, ElementTree.Element):
        if itemElement.attrib['value'] is not None:
            item_id = itemElement.attrib['value']  # string
            if list_id == 6 and (item_id == '25' or item_id=='24'):
                print(list_id, item_id)  # <== debug break point here
            desc = None
            notes = ""
            for child in itemElement:
                if child.tag == (xmlns + 'annotation'):
                    for grandchild in child:
                        if grandchild.tag == (xmlns + 'documentation'):
                            if desc is None:
                                desc = grandchild.text
                            else:
                                if len(notes)>0:
                                    notes += " "  # add a space
                                notes += grandchild.text or ""
            if item_id is not None and desc is not None:
                return Codex.ListItem({'itemId': item_id, 'listId': list_id, 'description': desc, 'notes': notes})
If I place a breakpoint at the print statement, when I get to the enumeration node for "24" I can look at the text for the grandchild nodes and they are as shown in the XML, i.e. "UPC12..." or "AKA item...", but when I get to the enumeration node for "25", and look at the grandchild text, it's None.
When I remove the xs: namespace by pre-filtering the XML file, the grandchild text comes through fine.
Is it possible I'm over some size limit or is there some syntax problem?
Sorry for less-than-pythonic code but I wanted to be able to examine all the intermediate values in pycharm. It's python 3.6.
Thanks for any insights you may have!
In the for loop, this condition is never met: if child.tag == (xmlns + 'annotation'):.
Why?
Try to output the child's tag. If we suppose your namespace (xmlns) is 'Steve' then:
print(child.tag) will output: {Steve}annotation, not Steveannotation.
So given this fact, if child.tag == (xmlns + 'annotation'): is always False.
You should change it to: if child.tag == ('{'+xmlns+'}annotation'):
With the same logic, you will find out you will also have to change this condition:
if grandchild.tag == (xmlns + 'documentation'):
to:
if grandchild.tag == ('{'+xmlns+'}documentation'):
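The tag format can be verified on a small fragment. This sketch assumes that xmlns holds the bare namespace URI (if it already includes the braces, the original comparison would match); SAMPLE is a shortened, hypothetical piece of the schema, and qname and documentation_texts are invented helper names.

```python
import xml.etree.ElementTree as ET

# Assumption: the namespace is available as a bare URI, without braces.
XS = 'http://www.w3.org/2001/XMLSchema'

# Shortened, hypothetical piece of the schema from the question.
SAMPLE = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:enumeration value="25">
    <xs:annotation>
      <xs:documentation>UPC12+5 (item-specific) on cover 2</xs:documentation>
      <xs:documentation>AKA item/price</xs:documentation>
    </xs:annotation>
  </xs:enumeration>
</xs:schema>"""


def qname(uri, local):
    """Build the {uri}local form that ElementTree stores in Element.tag."""
    return '{' + uri + '}' + local


def documentation_texts(enumeration):
    """Collect the text of every documentation grandchild, in order."""
    texts = []
    for child in enumeration:
        if child.tag == qname(XS, 'annotation'):
            for grandchild in child:
                if grandchild.tag == qname(XS, 'documentation'):
                    texts.append(grandchild.text or '')
    return texts


root = ET.fromstring(SAMPLE)
enum = root.find(qname(XS, 'enumeration'))
print(enum[0].tag)
print(documentation_texts(enum))
```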
So, ultimately, I solved my problem by running a pre-process on the XML file to remove the xs: namespace from all of the open/close XML tags and then I was able to successfully process the file using the function as defined above. Not sure why namespaces are causing problems, but perhaps there is a bug in cElementTree for namespace prefixes in large XML files. To @mzjn - I expect that it would be difficult to construct a minimal example as it does process hundreds of items correctly before it fails, so I would at least have to provide a fairly large XML file. Nevertheless, thanks for being a sounding board.