Having challenge with xml file - python

I want to read this XML file so that I can loop through its records. My aim is to combine it with a CSV file that has the same column names, and then create a database from the combined data. I'm not allowed to use non-standard libraries.
Code:
import xml.etree.ElementTree as ET
xml_file = ET.parse("E:/Research work/My connect/Sam/CETM50 - 2022_3 - Assignment Data/user_data.xml")
# get the root element
root = xml_file.getroot()
# print the whole tree as a string
e = ET.tostring(root, encoding='unicode', method='xml')
print(e)
Output:
<user firstName="Jayne" lastName="Wilson" age="69" sex="Female" retired="False" dependants="1" marital_status="divorced" salary="36872" pension="0" company="Wall, Reed and Whitehouse" commute_distance="10.47" address_postcode="TD95 7FL"
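Since each record is a user element whose data lives in attributes, one standard-library approach is to turn every element's attributes into a dictionary and treat the CSV rows the same way. The sketch below is only an illustration: the users root element, the second user, and the CSV row are made-up sample data, and the real file may be laid out differently.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up sample data; the real file's root element name is an assumption.
XML_TEXT = """<users>
  <user firstName="Jayne" lastName="Wilson" age="69" salary="36872"/>
  <user firstName="Sam" lastName="Hill" age="41" salary="28000"/>
</users>"""

CSV_TEXT = """firstName,lastName,age,salary
Ada,Stone,33,41000
"""

def rows_from_xml(xml_text):
    """Return one dict per <user> element, built from its attributes."""
    root = ET.fromstring(xml_text)
    return [dict(user.attrib) for user in root.iter("user")]

def rows_from_csv(csv_text):
    """Return one dict per CSV row, keyed by the header line."""
    return list(csv.DictReader(io.StringIO(csv_text)))

# Both sources now share the same shape, so they can simply be concatenated.
combined = rows_from_xml(XML_TEXT) + rows_from_csv(CSV_TEXT)
for row in combined:
    print(row["firstName"], row["lastName"], row["age"])
```

With real files, ET.parse(path).getroot() and csv.DictReader(open(path, newline="")) would take the place of the in-memory strings, and the combined rows could then be inserted into a database with the standard sqlite3 module.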

Related

Parse XML file and output JSON with Python

I am quite new to Python. I'm currently trying to parse XML files, extract their information, and print it as JSON.
I have managed to parse the XML files, but I cannot print the results as JSON. In addition, my printjson function does not run through all the results and prints only once. The parse function works and runs through all the input files, while printjson doesn't.
My code is as follows.
from xml.dom import minidom
import os
import json
# input multiple files
def get_files(d):
    return [os.path.join(d, f) for f in os.listdir(d) if os.path.isfile(os.path.join(d, f))]
# parse xml
def parse(files):
    for xml_file in files:
        # identify all xml files
        tree = minidom.parse(xml_file)
        # get some details
        NCT_ID = ("NCT ID : %s" % tree.getElementsByTagName("nct_id")[0].firstChild.data)
        brief_title = ("brief title : %s" % tree.getElementsByTagName("brief_title")[0].firstChild.data)
        official_title = ("official title : %s" % tree.getElementsByTagName("official_title")[0].firstChild.data)
        return NCT_ID, brief_title, official_title
# print result in json
def printjson(results):
    for result in results:
        output_json = json.dumps(result)
        print(output_json)
printjson(parse(get_files('my files path')))
Output when running the file
"NCT ID : NCT00571389"
"brief title : Isolation and Culture of Immune Cells and Circulating Tumor Cells From Peripheral Blood and Leukapheresis Products"
"official title : A Study to Facilitate Development of an Ex-Vivo Device Platform for Circulating Tumor Cell and Immune Cell Harvesting, Banking, and Apoptosis-Viability Assay"
Expected output
{
"NCT ID" : "NCT00571389",
"brief title" : "Isolation and Culture of Immune Cells and Circulating Tumor Cells From Peripheral Blood and Leukapheresis Products",
"official title" : "A Study to Facilitate Development of an Ex-Vivo Device Platform for Circulating Tumor Cell and Immune Cell Harvesting, Banking, and Apoptosis-Viability Assay"
}
The sample XML files I used come from the COVID-19 Clinical Trials dataset, which can be found on Kaggle.
The issue is that your parse function returns too early: it returns after getting the details from the first XML file. Instead, you should return a list of dictionaries, so that each item in the list represents a different file and each dictionary holds the information extracted from the corresponding XML file.
Here's the updated code:
def parse(files):
    xml_information = []
    for xml_file in files:
        # parse each XML file
        tree = minidom.parse(xml_file)
        # get some details (store the raw values; the dictionary keys carry the labels)
        nct_id = tree.getElementsByTagName("nct_id")[0].firstChild.data
        brief_title = tree.getElementsByTagName("brief_title")[0].firstChild.data
        official_title = tree.getElementsByTagName("official_title")[0].firstChild.data
        xml_information.append({"NCT ID": nct_id, "brief title": brief_title, "official title": official_title})
    # return only after every file has been processed
    return xml_information
def printresults(results):
    for result in results:
        print(result)
printresults(parse(get_files('my files path')))
If you want the output formatted as JSON, you can call json.dumps on each dictionary in the same way.
Note: If you have a lot of XML files, I would recommend using yield in the function instead of returning a whole list of dictionaries, so that the results are produced one at a time rather than all being held in memory at once.
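A generator version of the note above might look like the following sketch. The name parse_iter and the sample record are made up for illustration; minidom.parse accepts file-like objects as well as file names, which is used here only so the demo needs no files on disk.

```python
import io
import json
from xml.dom import minidom

def parse_iter(files):
    """Yield one dictionary per XML file instead of building a full list."""
    for xml_file in files:
        tree = minidom.parse(xml_file)
        yield {
            "NCT ID": tree.getElementsByTagName("nct_id")[0].firstChild.data,
            "brief title": tree.getElementsByTagName("brief_title")[0].firstChild.data,
            "official title": tree.getElementsByTagName("official_title")[0].firstChild.data,
        }

# Made-up sample record standing in for one downloaded XML file.
SAMPLE = b"""<clinical_study>
  <nct_id>NCT00000000</nct_id>
  <brief_title>Example brief title</brief_title>
  <official_title>Example official title</official_title>
</clinical_study>"""

# Each record is converted to JSON as soon as it is produced.
for record in parse_iter([io.BytesIO(SAMPLE)]):
    print(json.dumps(record, indent=4))
```

Because the dictionaries are produced lazily, only one parsed document needs to be alive at a time, which matters more for memory use than for raw speed.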
I don't know much about the xml.dom library, but you can generate the JSON from a dictionary, because the dumps function only converts a Python object into a JSON string.
Something like this:
def parse(files):
    results = []
    for xml_file in files:
        # parse each XML file
        tree = minidom.parse(xml_file)
        dicJson = {}
        dicJson["NCT ID"] = tree.getElementsByTagName("nct_id")[0].firstChild.data
        dicJson["brief title"] = tree.getElementsByTagName("brief_title")[0].firstChild.data
        dicJson["official title"] = tree.getElementsByTagName("official_title")[0].firstChild.data
        results.append(dicJson)
    # return after the loop so that every file is included
    return results
and in the printJson function:
def printJson(results):
    # json.dumps returns the data as a string; use json.dump to write it to a JSON file instead.
    print(json.dumps(results))

"not well-formed (invalid token): " error for trying to parse an XML file

I am getting this error when I try to access an XML file called "people-kb.xml".
The error occurs on this line: xmldoc = minidom.parse(xmlfile) # Accesses file.
xmlfile is "people-kb.xml", which is passed into a method like this:
parseXML('people-kb.xml')
The problem turned out to come from the save file I had created, where I was trying to store multiple trials containing information on two people. For now I have only one trial included, because I am starting with creating the file; later I will edit it if it already exists.
The code for making the file is:
import xml.etree.cElementTree as ET
def saveXML(xmlfile):
    root = ET.Element("Simulation")
    ET.SubElement(root, "chaserStartingCoords").text = "1,1"
    ET.SubElement(root, "runnerStartingCoords").text = "9,9"
    doc = ET.SubElement(root, "trail")
    ET.SubElement(doc, "number").text = "1"
    doc1 = ET.SubElement(doc, "number", name="number").text = "1" #Trying to make multiple trials
    ET.SubElement(doc1, "chaserEndCoords").text = "10,10"
    ET.SubElement(doc1, "runnerInitialCoords").text = "10,10"
    tree = ET.ElementTree(root)
    tree.write(xmlfile)
if __name__ == '__main__':
    saveXML('output.xml')
Where it says "number" I am trying to record the trial number. The output I expect is something like this:
<simulation>
    <chaserStartingCoords>1,1</chaserStartingCoords>
    <runnerStartingCoords>9,9</runnerStartingCoords>
    <trial>
        <number>1</number>
        <move number="1">
            <chaserEndCoords>10,10</chaserEndCoords>
            <runnerInitialCoords>10,10</runnerInitialCoords>
        </move>
    </trial>
</simulation>
I'm having a problem producing the <move number="1"> part, because later I expect to go into the file and iterate through each node called "move" to check positions.
You say "when trying to name a node of the file, it shows a red highlight on "1" "
That suggests you're trying to use "1", or something beginning with "1", as an element or attribute name, which would be invalid: XML names cannot start with a digit.
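For the expected output in the question, the number can be stored as an attribute value on an element named move, which keeps every name valid. The following is a minimal sketch using the standard xml.etree.ElementTree module; the helper name build_simulation and its moves parameter are made up for illustration. Note that it keeps the Element returned by SubElement in its own variable before setting any text, because a chained assignment such as doc1 = ET.SubElement(...).text = "1" binds doc1 to the string "1" rather than to the element.

```python
import xml.etree.ElementTree as ET

def build_simulation(moves):
    """Build the <simulation> tree; `moves` is a list of (chaser_end, runner_initial) pairs."""
    root = ET.Element("simulation")
    ET.SubElement(root, "chaserStartingCoords").text = "1,1"
    ET.SubElement(root, "runnerStartingCoords").text = "9,9"
    trial = ET.SubElement(root, "trial")
    ET.SubElement(trial, "number").text = "1"
    for index, (chaser_end, runner_initial) in enumerate(moves, start=1):
        # The move number is an attribute value, so the element name stays valid.
        move = ET.SubElement(trial, "move", number=str(index))
        ET.SubElement(move, "chaserEndCoords").text = chaser_end
        ET.SubElement(move, "runnerInitialCoords").text = runner_initial
    return root

root = build_simulation([("10,10", "10,10")])
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)

# Later, every move can be visited to check positions.
for move in root.iter("move"):
    print(move.get("number"), move.findtext("chaserEndCoords"))
```

Writing the tree to disk would then be ET.ElementTree(root).write('output.xml'), as in the question.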

parsing .xml file using python :search and copy related data

I want to copy some data from an .xml file based on a search value.
In the XML file below, I want to search for 0xCCB7B836 and copy the data inside that block:
4e564d2d52656648
6173685374617274
1782af065966579e
899885d440d3ad67
d04b41b15e2b13c2
One more example:
search for 0xECFBBA1A and return 0000
or
search for 0xA54E2B5A and return 30d4
<MEM_DATA>
<MEM_SECTOR>
<MEM_SECTOR_NUMBER>0</MEM_SECTOR_NUMBER>
<MEM_SECTOR_STATUS>ACTIVE</MEM_SECTOR_STATUS>
<MEM_SECTOR_STARTADR>0x800000</MEM_SECTOR_STARTADR>
<MEM_SECTOR_ENDADR>0x0</MEM_SECTOR_ENDADR>
<MEM_SECTOR_COUNTER>0x1</MEM_SECTOR_COUNTER>
<MEM_ERASED_MARKER>SET</MEM_ERASED_MARKER>
<MEM_USED_MARKER>SET</MEM_USED_MARKER>
<MEM_FULL_MARKER>NOT_SET</MEM_FULL_MARKER>
<MEM_ERASE_MARKER>NOT_SET</MEM_ERASE_MARKER>
<MEM_START_MARKER>SET</MEM_START_MARKER>
<MEM_START_OFFSET>0x1</MEM_START_OFFSET>
<MEM_CLONE_MARKER>NOT_SET</MEM_CLONE_MARKER>
<MEM_BLOCK>
<MEM_BLOCK_ID>0x101</MEM_BLOCK_ID>
<MEM_BLOCK_NAME>UNKNOWN</MEM_BLOCK_NAME>
<MEM_BLOCK_STATUS>VALID</MEM_BLOCK_STATUS>
<MEM_BLOCK_FLAGS>0x0</MEM_BLOCK_FLAGS>
<MEM_BLOCK_STORAGE>Emulation</MEM_BLOCK_STORAGE>
<MEM_BLOCK_LEN>0x28</MEM_BLOCK_LEN>
<MEM_BLOCK_VERSION>0x0</MEM_BLOCK_VERSION>
<MEM_BLOCK_HEADER_CRC>0xE527</MEM_BLOCK_HEADER_CRC>
<MEM_BLOCK_CRC>0xCCB7B836</MEM_BLOCK_CRC>
<MEM_BLOCK_CRC2>None</MEM_BLOCK_CRC2>
<MEM_BLOCK_DATA>
<MEM_PAGE_DATA>4e564d2d52656648</MEM_PAGE_DATA>
<MEM_PAGE_DATA>6173685374617274</MEM_PAGE_DATA>
<MEM_PAGE_DATA>1782af065966579e</MEM_PAGE_DATA>
<MEM_PAGE_DATA>899885d440d3ad67</MEM_PAGE_DATA>
<MEM_PAGE_DATA>d04b41b15e2b13c2</MEM_PAGE_DATA>
</MEM_BLOCK_DATA>
</MEM_BLOCK>
<MEM_BLOCK>
<MEM_BLOCK_ID>0x20F</MEM_BLOCK_ID>
<MEM_BLOCK_NAME>UNKNOWN</MEM_BLOCK_NAME>
<MEM_BLOCK_STATUS>VALID</MEM_BLOCK_STATUS>
<MEM_BLOCK_FLAGS>0x0</MEM_BLOCK_FLAGS>
<MEM_BLOCK_STORAGE>Emulation</MEM_BLOCK_STORAGE>
<MEM_BLOCK_LEN>0x2</MEM_BLOCK_LEN>
<MEM_BLOCK_VERSION>0x0</MEM_BLOCK_VERSION>
<MEM_BLOCK_HEADER_CRC>0xE0D2</MEM_BLOCK_HEADER_CRC>
<MEM_BLOCK_CRC>0xECFBBA1A</MEM_BLOCK_CRC>
<MEM_BLOCK_CRC2>None</MEM_BLOCK_CRC2>
<MEM_BLOCK_DATA>
<MEM_PAGE_DATA>0000</MEM_PAGE_DATA>
</MEM_BLOCK_DATA>
</MEM_BLOCK>
<MEM_BLOCK>
<MEM_BLOCK_ID>0x1F8</MEM_BLOCK_ID>
<MEM_BLOCK_NAME>UNKNOWN</MEM_BLOCK_NAME>
<MEM_BLOCK_STATUS>VALID</MEM_BLOCK_STATUS>
<MEM_BLOCK_FLAGS>0x0</MEM_BLOCK_FLAGS>
<MEM_BLOCK_STORAGE>Emulation</MEM_BLOCK_STORAGE>
<MEM_BLOCK_LEN>0x2</MEM_BLOCK_LEN>
<MEM_BLOCK_VERSION>0x0</MEM_BLOCK_VERSION>
<MEM_BLOCK_HEADER_CRC>0x1DCC</MEM_BLOCK_HEADER_CRC>
<MEM_BLOCK_CRC>0xA54E2B5A</MEM_BLOCK_CRC>
<MEM_BLOCK_CRC2>None</MEM_BLOCK_CRC2>
<MEM_BLOCK_DATA>
<MEM_PAGE_DATA>30d4</MEM_PAGE_DATA>
</MEM_BLOCK_DATA>
</MEM_BLOCK>
</MEM_SECTOR>
</MEM_DATA>
Assuming that we have this XML data in a file named test.xml, you can do something like this:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
def search_and_copy(query):
    for child in root.findall("MEM_SECTOR/MEM_BLOCK"):
        if child.find("MEM_BLOCK_CRC").text == query:
            return [item.text for item in child.findall("MEM_BLOCK_DATA/*")]
Let's try this search_and_copy() function out:
>>> search_and_copy("0xCCB7B836")
['4e564d2d52656648', '6173685374617274', '1782af065966579e', '899885d440d3ad67', 'd04b41b15e2b13c2']
>>> search_and_copy("0xA54E2B5A")
['30d4']
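If many values have to be looked up, the same idea can be turned into a dictionary that is built once and then queried repeatedly. This is only a sketch: build_index is a made-up helper name, and the inline sample is a shortened copy of the data from the question so that the example runs without test.xml.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's data, inlined to keep the example self-contained.
SAMPLE = """<MEM_DATA><MEM_SECTOR>
  <MEM_BLOCK>
    <MEM_BLOCK_CRC>0xCCB7B836</MEM_BLOCK_CRC>
    <MEM_BLOCK_DATA>
      <MEM_PAGE_DATA>4e564d2d52656648</MEM_PAGE_DATA>
      <MEM_PAGE_DATA>6173685374617274</MEM_PAGE_DATA>
    </MEM_BLOCK_DATA>
  </MEM_BLOCK>
  <MEM_BLOCK>
    <MEM_BLOCK_CRC>0xECFBBA1A</MEM_BLOCK_CRC>
    <MEM_BLOCK_DATA>
      <MEM_PAGE_DATA>0000</MEM_PAGE_DATA>
    </MEM_BLOCK_DATA>
  </MEM_BLOCK>
</MEM_SECTOR></MEM_DATA>"""

def build_index(root):
    """Map each MEM_BLOCK_CRC value to the list of its MEM_PAGE_DATA texts."""
    index = {}
    for block in root.iter("MEM_BLOCK"):
        crc = block.findtext("MEM_BLOCK_CRC")
        pages = [page.text for page in block.iter("MEM_PAGE_DATA")]
        # If a CRC value repeats, the first block wins.
        index.setdefault(crc, pages)
    return index

index = build_index(ET.fromstring(SAMPLE))
print(index.get("0xECFBBA1A"))
# A value that is not present falls back to an empty list instead of None.
print(index.get("0xDEADBEEF", []))
```

The trade-off is one full pass over the file up front in exchange for constant-time lookups afterwards.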
We can also use XPath, with Python's xml.etree and the third-party elementpath package, to write a function that retrieves the data.
Breakdown of the code below (within the elementpath.Selector):
1. The first line looks for elements whose text equals our search string.
2. The second line, //.., steps back up to the parent element.
3. Proceeding from the parent element, the third line searches for MEM_PAGE_DATA within it. This element holds the data we are actually interested in.
4. The rest of the code simply pulls the text from the matches.
import xml.etree.ElementTree as ET
import elementpath
# wrapped the shared data into a test.xml file
root = ET.parse('test.xml').getroot()
def find_data(search_string):
    selector = elementpath.Selector(f""".//*[text()='{search_string}']
                                        //..
                                        //MEM_PAGE_DATA""")
    # pull text from the match
    result = [entry.text for entry in selector.select(root)]
    return result
Test on the strings provided:
find_data("0xCCB7B836")
['4e564d2d52656648',
'6173685374617274',
'1782af065966579e',
'899885d440d3ad67',
'd04b41b15e2b13c2']
find_data("0xECFBBA1A")
['0000']
find_data("0xA54E2B5A")
['30d4']
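When installing elementpath is not an option, the limited XPath subset built into xml.etree.ElementTree can express a similar query through its [tag='text'] predicate, which selects elements that have a child with the given text. The sketch below assumes the CRC always sits in a MEM_BLOCK_CRC child of MEM_BLOCK, as in the question; find_data_stdlib is a made-up name, and the sample is a shortened inline copy of the data.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's data, inlined to keep the example self-contained.
SAMPLE = """<MEM_DATA><MEM_SECTOR>
  <MEM_BLOCK>
    <MEM_BLOCK_CRC>0xECFBBA1A</MEM_BLOCK_CRC>
    <MEM_BLOCK_DATA>
      <MEM_PAGE_DATA>0000</MEM_PAGE_DATA>
    </MEM_BLOCK_DATA>
  </MEM_BLOCK>
  <MEM_BLOCK>
    <MEM_BLOCK_CRC>0xA54E2B5A</MEM_BLOCK_CRC>
    <MEM_BLOCK_DATA>
      <MEM_PAGE_DATA>30d4</MEM_PAGE_DATA>
    </MEM_BLOCK_DATA>
  </MEM_BLOCK>
</MEM_SECTOR></MEM_DATA>"""

def find_data_stdlib(root, search_string):
    """Return the MEM_PAGE_DATA texts of the block whose CRC equals search_string."""
    # The predicate keeps only MEM_BLOCK elements with a matching MEM_BLOCK_CRC child.
    # search_string is pasted into the path, so it must not contain a single quote.
    path = f".//MEM_BLOCK[MEM_BLOCK_CRC='{search_string}']/MEM_BLOCK_DATA/MEM_PAGE_DATA"
    return [page.text for page in root.findall(path)]

root = ET.fromstring(SAMPLE)
print(find_data_stdlib(root, "0xA54E2B5A"))
print(find_data_stdlib(root, "0xECFBBA1A"))
```

Unlike the elementpath version, this only matches the CRC element rather than any element with that text, which is narrower but avoids accidental matches elsewhere in the file.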

Using a .py script that cleans then splits a large MODS XML record to do the same for a Dublin Core XML record and I'm getting no output

I took an OpenRefine template for translating a CSV into one giant MODS XML record, then a .py script for cleaning it and splitting it into several smaller XML files, each named using one of the tags. It works perfectly. However, when I tried altering it to fit my needs for Dublin Core XML records, it stopped producing output.
I've got an OpenRefine template that gives me this from my CSV:
<collection xmlns:xsi="http:www.w3.org/2001/XMLSchema-instance">
<record xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dctype="http://purl.org/dc/dcmitype/" xmlns:xsi="http://www.w3.org/2001XMLSchema-instance" xsi:schemaLocation="http://purl.org/dc/elements/1.1/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcterms.xsd http://purl.org/dc/dcmitype/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcmitype.xsd http://dublincore.org/schemas/xmls/qdc/2008/02/11/qualifieddc.xsd">
<dc:title>[Mary Adams at the organ]</dc:title>
<dc:creator>MacAfee, Don</dc:creator>
<dc:date>4/14/1964</dc:date>
<dc:subject>organs</dc:subject><dc:subject>musical instruments</dc:subject><dc:subject>musicians</dc:subject><dc:subject>Adams, Mary</dc:subject>
<dc:description>Music instructor Mary C. Adams playing the organ.</dc:description>
<dc:format>1 print : b&w ; 6.5 x 6.5 in.</dc:format>
<dcterms:spatial>Alexandria, Virginia</dcterms:spatial>
<dc:type>Photograph</dc:type>
<dc:format>Image</dc:format>
<dc:identifier>MS332-01-01-001</dc:identifier>
<dc:rights>Copyright has not been assigned to the Alexandria Library. All requests for permission to publish or quote from manuscripts must be submitted in writing to the Alexandria Library. Permission for publication is given on behalf of the Alexandria Library as the owner of the physical items and is not intended to include or imply permission of the copyright holder, which must also be obtained by the researcher.</dc:rights>
</record>
<record xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dctype="http://purl.org/dc/dcmitype/" xmlns:xsi="http://www.w3.org/2001XMLSchema-instance" xsi:schemaLocation="http://purl.org/dc/elements/1.1/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcterms.xsd http://purl.org/dc/dcmitype/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcmitype.xsd http://dublincore.org/schemas/xmls/qdc/2008/02/11/qualifieddc.xsd">
<dc:title>[Portrait of Dr. Robert Adeson]</dc:title>
<dc:date>1980</dc:date>
<dc:subject>physicians</dc:subject><dc:subject>doctors</dc:subject><dc:subject>Adeson, Robert, M.D.</dc:subject>
<dc:description>Dr. Robert L. Adeson, Alexandria Hospital.</dc:description>
<dc:format>1 print : b&w ; 5 x 7 in.</dc:format>
<dcterms:spatial>Alexandria, Virginia</dcterms:spatial>
<dc:type>Photograph</dc:type>
<dc:format>Image</dc:format>
<dc:identifier>MS332-01-01-002</dc:identifier>
<dc:rights>Copyright has not been assigned to the Alexandria Library. All requests for permission to publish or quote from manuscripts must be submitted in writing to the Alexandria Library. Permission for publication is given on behalf of the Alexandria Library as the owner of the physical items and is not intended to include or imply permission of the copyright holder, which must also be obtained by the researcher.</dc:rights>
</record>
</collection>
I've got a Python program that cleans and separates a MODS record, which I've modified to look like this:
import os, lxml.etree as ET
output_path = 'C:\\Users\\Staff\\Desktop\\Metadata\\SplitXML\\'
# parse source.xml with lxml
tree = ET.parse('source.xml')
# start cleanup
# remove any element tails
for element in tree.iter():
    element.tail = None
    # remove any line breaks or tabs in element text
    if element.text:
        if '\n' in element.text:
            element.text = element.text.replace('\n', '')
        if '\t' in element.text:
            element.text = element.text.replace('\t', '')
# remove any remaining whitespace
parser = ET.XMLParser(remove_blank_text=True, remove_comments=True, recover=True)
treestring = ET.tostring(tree)
clean = ET.XML(treestring, parser)
# remove recursively empty nodes
def recursively_empty(e):
    if e.text:
        return False
    return all((recursively_empty(c) for c in e.iterchildren()))
context = ET.iterwalk(clean)
for action, elem in context:
    parent = elem.getparent()
    if recursively_empty(elem):
        parent.remove(elem)
# remove nodes with blank attribute
for element in clean.xpath(".//*[@*='']"):
    element.getparent().remove(element)
# remove nodes with attribute "null"
for element in clean.xpath(".//*[@*='null']"):
    element.getparent().remove(element)
# finished cleanup
# write out to intermediate file
with open('clean.xml', 'wb') as f:
    f.write(ET.tostring(clean))
print("XML is now clean")
# parse the clean xml
cleanxml = ET.iterparse('clean.xml', events=('end', ))
# find the <dc> nodes
for event, elem in cleanxml:
    if elem.tag == '{http://purl.org/dc/elements/1.1/}record':
        # name new files using the <dc:identifier> tag
        identifier = elem.find('{http://purl.org/dc/elements/1.1/}dc:identifier').text
        filename = format(identifier + "_DC.xml")
        # write out to new file
        with open(output_path+filename, 'wb') as f:
            f.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")
            f.write(ET.tostring(elem, pretty_print = True))
        print("Writing", filename)
# remove the intermediate file
os.remove('clean.xml')
print("All done!")
The command prompt prints the "XML is now clean" and "All done!" statements. However, there are no files in the SplitXML directory (or anywhere else). My attempt at debugging was to comment out the os.remove('clean.xml') line so I could look at the cleaned XML. I've done this with the MODS .py script, and the XML file looks like what you'd expect. The clean.xml file from the DC script is clean too, but it is one long string rather than being spread over separate lines with tabs, like this:
<collection xmlns:xsi="http:www.w3.org/2001/XMLSchema-instance"><record xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dctype="http://purl.org/dc/dcmitype/" xmlns:xsi="http://www.w3.org/2001XMLSchema-instance" xsi:schemaLocation="http://purl.org/dc/elements/1.1/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcterms.xsd http://purl.org/dc/dcmitype/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcmitype.xsd http://dublincore.org/schemas/xmls/qdc/2008/02/11/qualifieddc.xsd"><dc:title>[Mary Adams at the organ]</dc:title><dc:creator>MacAfee, Don</dc:creator><dc:date>4/14/1964</dc:date><dc:subject>organs</dc:subject><dc:subject>musical instruments</dc:subject><dc:subject>musicians</dc:subject><dc:subject>Adams, Mary</dc:subject><dc:description>Music instructor Mary C. Adams playing the organ.</dc:description><dc:format>1 print : b&w ; 6.5 x 6.5 in.</dc:format><dcterms:spatial>Alexandria, Virginia</dcterms:spatial><dc:type>Photograph</dc:type><dc:format>Image</dc:format><dc:identifier>MS332-01-01-001</dc:identifier><dc:rights>Copyright has not been assigned to the Alexandria Library. All requests for permission to publish or quote from manuscripts must be submitted in writing to the Alexandria Library. 
Permission for publication is given on behalf of the Alexandria Library as the owner of the physical items and is not intended to include or imply permission of the copyright holder, which must also be obtained by the researcher.</dc:rights></record><record xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dctype="http://purl.org/dc/dcmitype/" xmlns:xsi="http://www.w3.org/2001XMLSchema-instance" xsi:schemaLocation="http://purl.org/dc/elements/1.1/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcterms.xsd http://purl.org/dc/dcmitype/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcmitype.xsd http://dublincore.org/schemas/xmls/qdc/2008/02/11/qualifieddc.xsd"><dc:title>[Portrait of Dr. Robert Adeson]</dc:title><dc:date>1980</dc:date><dc:subject>physicians</dc:subject><dc:subject>doctors</dc:subject><dc:subject>Adeson, Robert, M.D.</dc:subject><dc:description>Dr. Robert L. Adeson, Alexandria Hospital.</dc:description><dc:format>1 print : b&w ; 5 x 7 in.</dc:format><dcterms:spatial>Alexandria, Virginia</dcterms:spatial><dc:type>Photograph</dc:type><dc:format>Image</dc:format><dc:identifier>MS332-01-01-002</dc:identifier><dc:rights>Copyright has not been assigned to the Alexandria Library. All requests for permission to publish or quote from manuscripts must be submitted in writing to the Alexandria Library. Permission for publication is given on behalf of the Alexandria Library as the owner of the physical items and is not intended to include or imply permission of the copyright holder, which must also be obtained by the researcher.</dc:rights></record></collection>
If it helps, here's the original Python code for cleaning and splitting MODS. I got it from calhist on github.
# Split XML containing many <mods> elements into individual files
# Modified from script found here: http://stackoverflow.com/questions/36155049/splitting-xml-file-into-multiple-at-given-tags
# by Bill Levay for California Historical Society
import os, lxml.etree as ET
# uncomment below modules if doing MODS cleanup on existing Islandora objects
import codecs, json
output_path = 'C:\\Users\\Staff\\Desktop\\Metadata\\SplitXML\\'
# parse source.xml with lxml
tree = ET.parse('source.xml')
# start cleanup
# remove any element tails
for element in tree.iter():
    element.tail = None
    # remove any line breaks or tabs in element text
    if element.text:
        if '\n' in element.text:
            element.text = element.text.replace('\n', '')
        if '\t' in element.text:
            element.text = element.text.replace('\t', '')
# remove any remaining whitespace
parser = ET.XMLParser(remove_blank_text=True, remove_comments=True, recover=True)
treestring = ET.tostring(tree)
clean = ET.XML(treestring, parser)
# remove recursively empty nodes
# found here: https://stackoverflow.com/questions/12694091/python-lxml-how-to-remove-empty-repeated-tags
def recursively_empty(e):
    if e.text:
        return False
    return all((recursively_empty(c) for c in e.iterchildren()))
context = ET.iterwalk(clean)
for action, elem in context:
    parent = elem.getparent()
    if recursively_empty(elem):
        parent.remove(elem)
# remove nodes with blank attribute
# for element in clean.xpath(".//*[@*='']"):
#     element.getparent().remove(element)
# remove nodes with attribute "null"
for element in clean.xpath(".//*[@*='null']"):
    element.getparent().remove(element)
# finished cleanup
# write out to intermediate file
with open('clean.xml', 'wb') as f:
    f.write(ET.tostring(clean))
print("XML is now clean")
# parse the clean xml
cleanxml = ET.iterparse('clean.xml', events=('end', ))
###
# uncomment this section if doing MODS cleanup on existing Islandora objects
# getting islandora IDs for existing collections
###
# item_list = []
# json_path = 'C:\\mods\\data.json'
# with codecs.open(json_path, encoding='utf-8') as filename:
#     item_list = json.load(filename)
# filename.close
###
# find the <mods> nodes
for event, elem in cleanxml:
    if elem.tag == '{http://www.loc.gov/mods/v3}mods':
        # the filenames of the resulting xml files will be based on the <identifier> element
        # edit the specific element or attribute if necessary
        identifier = elem.find('{http://www.loc.gov/mods/v3}identifier[@type="local"]').text
        filename = format(identifier + "_MODS.xml")
        ###
        # uncomment this section if doing MODS cleanup on existing Islandora objects
        # look through the list of object metadata and get the islandora ID by matching the digital object ID
        ###
        # for item in item_list:
        #     local_ID = item["identifier-type:local"]
        #     islandora_ID = item["PID"]
        #     if identifier == local_ID:
        #         filename = format(islandora_ID + "_MODS.xml")
        ###
        # write out to new file
        with open(output_path+filename, 'wb') as f:
            f.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")
            f.write(ET.tostring(elem, pretty_print = True))
        print("Writing", filename)
# remove the intermediate file
os.remove('clean.xml')
print("All done!")
I found two namespace-related problems:
1. The record element is in no namespace. Therefore, you need to change
if elem.tag == '{http://purl.org/dc/elements/1.1/}record':
to
if elem.tag == 'record':
2. elem.find('{http://purl.org/dc/elements/1.1/}dc:identifier') is not correct. The dc: prefix must be removed, because the namespace is already given by the URI in braces.
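The two fixes can be checked in isolation with a small, self-contained sketch. It deliberately swaps lxml for the standard-library xml.etree.ElementTree so that it runs anywhere, and it returns the would-be file contents in a dictionary instead of writing files; split_records and the shortened sample are made up for illustration. It also concatenates the XML declaration as bytes, since the files in the script are opened in 'wb' mode and writing a str to such a file fails under Python 3.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"

# Shortened version of the collection from the question.
SAMPLE = f"""<collection>
  <record xmlns:dc="{DC}">
    <dc:title>[Mary Adams at the organ]</dc:title>
    <dc:identifier>MS332-01-01-001</dc:identifier>
  </record>
  <record xmlns:dc="{DC}">
    <dc:title>[Portrait of Dr. Robert Adeson]</dc:title>
    <dc:identifier>MS332-01-01-002</dc:identifier>
  </record>
</collection>"""

def split_records(xml_text):
    """Return {filename: serialized record as bytes} for every <record> element."""
    ET.register_namespace("dc", DC)  # keep the dc: prefix when serializing
    root = ET.fromstring(xml_text)
    files = {}
    # Fix 1: record is in no namespace, so its tag is just 'record'.
    for elem in root.iter("record"):
        # Fix 2: the namespace URI goes in braces, with no dc: prefix after it.
        identifier = elem.find(f"{{{DC}}}identifier").text
        declaration = b'<?xml version="1.0" encoding="UTF-8"?>\n'
        files[identifier + "_DC.xml"] = declaration + ET.tostring(elem, encoding="utf-8")
    return files

files = split_records(SAMPLE)
for name in files:
    print("Writing", name)
```

In the lxml script the same two changes apply to the elem.tag comparison and the elem.find call; the rest of the cleanup code can stay as it is.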

Is there a way to parse a XML according to its attributes?

I'm trying to parse my XML using minidom.parse, but the program crashes when the debugger reaches the line xmldoc = minidom.parse(self).
Here is what I have tried:
attribValList = list()
xmldoc = minidom.parse(path)
equipments = xmldoc.getElementsByTagName(xmldoc , elementName)
equipNames = equipments.getElementsByTagName(xmldoc , attributeKey)
for item in equipNames:
attribValList.append(item.value)
return attribValList
Maybe my XML is too specific for minidom. Here is what it looks like:
<TestSystem id="...">
<Port>58</Port>
<TestSystemEquipment>
<Equipment type="BCAFC">
<Name>System1</Name>
<DU-Junctions>
...
</DU-Junctions>
</Equipment>
Basically, I need to get the name of each Equipment element and write the names into a list.
Can anybody tell me what I'm doing wrong?
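One likely source of trouble in the attempt above is the way getElementsByTagName is called: in minidom it is a method of a document or element, it takes only the tag name as its argument, and the node list it returns has no getElementsByTagName of its own, so each Equipment node has to be visited in a loop. The sketch below shows that pattern on a completed, made-up version of the sample (the closing tags and the second Equipment are assumptions, since the posted XML is truncated).

```python
from xml.dom import minidom

# Completed, made-up version of the truncated sample from the question.
SAMPLE = """<TestSystem id="demo">
  <Port>58</Port>
  <TestSystemEquipment>
    <Equipment type="BCAFC">
      <Name>System1</Name>
    </Equipment>
    <Equipment type="BCAFC">
      <Name>System2</Name>
    </Equipment>
  </TestSystemEquipment>
</TestSystem>"""

def equipment_names(xmldoc):
    """Collect the text of the first <Name> found under each <Equipment>."""
    names = []
    for equipment in xmldoc.getElementsByTagName("Equipment"):
        name_nodes = equipment.getElementsByTagName("Name")
        # Skip equipment without a Name, or with an empty one.
        if name_nodes and name_nodes[0].firstChild is not None:
            names.append(name_nodes[0].firstChild.data)
    return names

xmldoc = minidom.parseString(SAMPLE)
print(equipment_names(xmldoc))
# Attributes such as type are read with getAttribute on the element.
print([e.getAttribute("type") for e in xmldoc.getElementsByTagName("Equipment")])
```

For a file on disk, minidom.parse(path) replaces parseString. If the parse call itself raises an error, the file is probably not well-formed, for example because of unclosed elements like those in the truncated sample.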