Parse XML file and output JSON with Python - python

I am quite new to Python. I'm trying to parse XML files, extract some of their information, and print it as JSON.
I have managed to parse the XML files, but I cannot print the results as JSON. In addition, my printjson function does not run through all the results; it only prints once. The parse function works and runs through all input files, while printjson does not.
My code is as follows.
from xml.dom import minidom
import os
import json
#input multiple files
def get_files(d):
    return [os.path.join(d, f) for f in os.listdir(d) if os.path.isfile(os.path.join(d,f))]
#parse xml
def parse(files):
    for xml_file in files:
        #identify all xml files
        tree = minidom.parse(xml_file)
        #Get some details
        NCT_ID = ("NCT ID : %s" % tree.getElementsByTagName("nct_id")[0].firstChild.data)
        brief_title = ("brief title : %s" % tree.getElementsByTagName("brief_title")[0].firstChild.data)
        official_title = ("official title : %s" % tree.getElementsByTagName("official_title")[0].firstChild.data)
        return NCT_ID,brief_title,official_title
#print result in json
def printjson(results):
    for result in results:
        output_json = json.dumps(result)
        print(output_json)
printjson(parse(get_files('my files path')))
Output when running the file
"NCT ID : NCT00571389"
"brief title : Isolation and Culture of Immune Cells and Circulating Tumor Cells From Peripheral Blood and Leukapheresis Products"
"official title : A Study to Facilitate Development of an Ex-Vivo Device Platform for Circulating Tumor Cell and Immune Cell Harvesting, Banking, and Apoptosis-Viability Assay"
Expected output
{
"NCT ID" : "NCT00571389",
"brief title" : "Isolation and Culture of Immune Cells and Circulating Tumor Cells From Peripheral Blood and Leukapheresis Products",
"official title" : "A Study to Facilitate Development of an Ex-Vivo Device Platform for Circulating Tumor Cell and Immune Cell Harvesting, Banking, and Apoptosis-Viability Assay"
}
The sample XML files I used come from the COVID-19 Clinical Trials dataset, which can be found on Kaggle.

The issue is that your parse function returns too early: it returns after getting the details from the first XML file. Instead, you should return a list of dictionaries, so that each item in the list represents a different file and each dictionary contains the information extracted from the corresponding XML file.
Here's the updated code:
def parse(files):
    xml_information = []
    for xml_file in files:
        # parse each XML file
        tree = minidom.parse(xml_file)
        # get some details (store the raw values; the dictionary keys act as labels)
        NCT_ID = tree.getElementsByTagName("nct_id")[0].firstChild.data
        brief_title = tree.getElementsByTagName("brief_title")[0].firstChild.data
        official_title = tree.getElementsByTagName("official_title")[0].firstChild.data
        xml_information.append({"NCT ID": NCT_ID, "brief title": brief_title, "official title": official_title})
    return xml_information
def printresults(results):
    for result in results:
        print(result)
printresults(parse(get_files('my files path')))
If you need the output to be JSON, you can call json.dumps on each dictionary in the same way.
Note: If you have a lot of XML files, I would recommend using yield in the function instead of returning a whole list of dictionaries. A generator produces one result at a time, so the results for all files do not have to be held in memory at once.
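The yield suggestion could look roughly like the sketch below. The names first_text and iter_records are made up for illustration, and the tiny sample file is invented so the example runs on its own; only the tag names come from the question.

```python
import json
import os
import tempfile
from xml.dom import minidom


def first_text(tree, tag):
    """Return the text of the first <tag> element, or None if it is missing or empty."""
    nodes = tree.getElementsByTagName(tag)
    if nodes and nodes[0].firstChild is not None:
        return nodes[0].firstChild.data
    return None


def iter_records(files):
    """Yield one dictionary per XML file instead of building a full list."""
    for xml_file in files:
        tree = minidom.parse(xml_file)
        yield {
            "NCT ID": first_text(tree, "nct_id"),
            "brief title": first_text(tree, "brief_title"),
            "official title": first_text(tree, "official_title"),
        }


# Demo with an invented one-file "dataset" (no official_title on purpose).
sample = (
    "<clinical_study><nct_id>NCT00571389</nct_id>"
    "<brief_title>Short</brief_title></clinical_study>"
)
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "sample.xml")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(sample)
    records = list(iter_records([path]))

for record in records:
    print(json.dumps(record, indent=4))
```

In real use you would loop over iter_records(get_files(...)) directly rather than wrapping it in list(), so that only one record exists at a time.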

I don't know much about the xml.dom library, but you can generate the JSON from a dictionary, because json.dumps only serializes a Python object (such as a dict) to a JSON string.
Something like this:
def parse(files):
    results = []
    for xml_file in files:
        # parse each XML file
        tree = minidom.parse(xml_file)
        dicJson = {}
        dicJson["NCT ID"] = tree.getElementsByTagName("nct_id")[0].firstChild.data
        dicJson["brief title"] = tree.getElementsByTagName("brief_title")[0].firstChild.data
        dicJson["official title"] = tree.getElementsByTagName("official_title")[0].firstChild.data
        results.append(dicJson)
    return results
and in the function printJson:
def printJson(results):
    # json.dumps returns the data as a JSON string, which can be printed or written to a file.
    print(json.dumps(results, indent=4))
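If the goal is a JSON file rather than printed text, json.dump writes straight to an open file. This is only a sketch: the records are invented, and trials.json is an assumed output name.

```python
import json
import os
import tempfile

# Invented records, shaped like the dictionaries built by parse().
records = [
    {"NCT ID": "NCT00571389", "brief title": "Example title"},
]

# "trials.json" is an assumed name; a temporary directory keeps the demo tidy.
with tempfile.TemporaryDirectory() as out_dir:
    out_path = os.path.join(out_dir, "trials.json")
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, indent=4, ensure_ascii=False)

    # Read the file back to confirm it holds valid JSON.
    with open(out_path, encoding="utf-8") as fh:
        loaded = json.load(fh)

print(loaded)
```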

Related

Having challenge with xml file

I want to print this XML file in a way that lets me loop through it. My aim is to combine it with a CSV file that has the same column names, before creating a database from the combined data. I'm not allowed to use non-standard libraries.
Code:
import xml.etree.ElementTree as ET
xml_file = ET.parse("E:/Research work/My connect/Sam/CETM50 - 2022_3 - Assignment Data/user_data.xml")
# get the parent tag
root = xml_file.getroot()
# print the attributes of the first tag
e = ET.tostring(xml_file.getroot(), encoding='unicode', method='xml')
print(e)
Output:
<user firstName="Jayne" lastName="Wilson" age="69" sex="Female" retired="False" dependants="1" marital_status="divorced" salary="36872" pension="0" company="Wall, Reed and Whitehouse" commute_distance="10.47" address_postcode="TD95 7FL"
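One standard-library way to make the data loopable is to turn each user element's attributes into a dictionary and read the CSV with csv.DictReader, which gives dictionaries with the same keys. This is a sketch under assumptions: the root element name users, the second user, and the CSV row are invented, and the real file has more attributes than shown here.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Invented sample data; the real root element name is an assumption.
xml_text = """<users>
  <user firstName="Jayne" lastName="Wilson" age="69" sex="Female"/>
  <user firstName="Sam" lastName="Lee" age="41" sex="Male"/>
</users>"""

csv_text = "firstName,lastName,age,sex\nAlex,Kim,35,Female\n"

root = ET.fromstring(xml_text)

# Each element exposes its attributes as a dict through .attrib.
xml_rows = [dict(user.attrib) for user in root.iter("user")]

# csv.DictReader yields one dict per row, keyed by the header line.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# Both sources now have the same shape, so they can simply be concatenated.
combined = xml_rows + csv_rows
for row in combined:
    print(row["firstName"], row["age"])
```

With real files, ET.parse(path).getroot() and open(csv_path, newline="") would replace the in-memory strings. All values are strings at this point, so numeric columns still need converting before they go into a database.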

"not well-formed (invalid token): " error for trying to parse an XML file

I am getting this error while trying to access an XML file called "people-kb.xml".
The problem occurs on this line: xmldoc = minidom.parse(xmlfile) #Accesses file.
xmlfile is "people-kb.xml", which is passed into a method like this:
parseXML('people-kb.xml')
It turned out that the problem came from the file I had saved myself, as I was trying to store multiple trials, each containing information on two people. For now I only have one trial, since I am starting by creating the file; later I would edit it if it already exists.
The code for making the file is:
import xml.etree.cElementTree as ET
def saveXML(xmlfile):
    root = ET.Element("Simulation")
    ET.SubElement(root, "chaserStartingCoords").text = "1,1"
    ET.SubElement(root, "runnerStartingCoords").text = "9,9"
    doc = ET.SubElement(root, "trail")
    ET.SubElement(doc, "number").text = "1"
    doc1 = ET.SubElement(doc, "number", name="number").text = "1" #Trying to make multiple trials
    ET.SubElement(doc1, "chaserEndCoords").text = "10,10"
    ET.SubElement(doc1, "runnerInitialCoords").text = "10,10"
    tree = ET.ElementTree(root)
    tree.write(xmlfile)
if __name__ == '__main__':
    saveXML('output.xml')
Where it says "number", I want it to hold the trial number. The output I am trying to produce looks like this:
<simulation>
    <chaserStartingCoords>1,1</chaserStartingCoords>
    <runnerStartingCoords>9,9</runnerStartingCoords>
    <trial>
        <number>1</number>
        <move number="1">
            <chaserEndCoords>10,10</chaserEndCoords>
            <runnerInitialCoords>10,10</runnerInitialCoords>
        </move>
    </trial>
</simulation>
I've been having trouble producing the <move number="1"> part, since later I expect to go through the file and iterate over each node called "move" to check positions.
You say "when trying to name a node of the file, it shows a red highlight on "1" "
That suggests you're trying to use "1", or something beginning with "1", as an element or attribute name, which would be invalid.
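For the <move number="1"> structure itself, a sketch along these lines may help. One likely problem in the question's code is the chained assignment doc1 = ET.SubElement(...).text = "1", which leaves doc1 holding the string "1" rather than an element. The function name build_simulation and the n_moves parameter are made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_simulation(n_moves=1):
    """Build the <simulation> tree with one <trial> holding n_moves <move> nodes."""
    root = ET.Element("simulation")
    ET.SubElement(root, "chaserStartingCoords").text = "1,1"
    ET.SubElement(root, "runnerStartingCoords").text = "9,9"

    trial = ET.SubElement(root, "trial")
    ET.SubElement(trial, "number").text = "1"

    for i in range(1, n_moves + 1):
        # Keep the element itself in a variable; chaining ".text = ..." onto
        # this line would bind the variable to the string instead.
        move = ET.SubElement(trial, "move", number=str(i))
        ET.SubElement(move, "chaserEndCoords").text = "10,10"
        ET.SubElement(move, "runnerInitialCoords").text = "10,10"
    return root


root = build_simulation(n_moves=2)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)

# Later on, every <move> can be visited regardless of nesting depth.
move_numbers = [move.get("number") for move in root.iter("move")]
print(move_numbers)
```

Note that "number" is used as an attribute name and "1" only as its value, which is valid XML; a name that starts with a digit would not be.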

Is there a way to parse a XML according to its attributes?

I'm trying to parse my XML using minidom.parse, but the program crashes when the debugger reaches the line xmldoc = minidom.parse(self)
Here is what I have tried:
attribValList = list()
xmldoc = minidom.parse(path)
equipments = xmldoc.getElementsByTagName(xmldoc , elementName)
equipNames = equipments.getElementsByTagName(xmldoc , attributeKey)
for item in equipNames:
    attribValList.append(item.value)
return attribValList
Maybe my XML is too specific for minidom. Here is what it looks like:
<TestSystem id="...">
    <Port>58</Port>
    <TestSystemEquipment>
        <Equipment type="BCAFC">
            <Name>System1</Name>
            <DU-Junctions>
                ...
            </DU-Junctions>
        </Equipment>
Basically I need to get for each Equipment its name and write the names into a list.
Can anybody tell what I'm doing wrong?
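A minimal minidom sketch for collecting the names, under some assumptions: the sample document below is a completed, invented version of the fragment above, and equipment_names is a hypothetical helper. Two things differ from the attempt in the question: getElementsByTagName takes only the tag name, and it returns a NodeList, so the second lookup has to be done on each Equipment element rather than on the list.

```python
from xml.dom import minidom

# Invented, well-formed stand-in for the fragment shown in the question.
xml_text = """<TestSystem id="ts1">
  <Port>58</Port>
  <TestSystemEquipment>
    <Equipment type="BCAFC">
      <Name>System1</Name>
    </Equipment>
    <Equipment type="OTHER">
      <Name>System2</Name>
    </Equipment>
  </TestSystemEquipment>
</TestSystem>"""


def equipment_names(xmldoc, equipment_type=None):
    """Return the <Name> text of each <Equipment>, optionally filtered by its type attribute."""
    names = []
    for equipment in xmldoc.getElementsByTagName("Equipment"):
        if equipment_type is not None and equipment.getAttribute("type") != equipment_type:
            continue
        name_nodes = equipment.getElementsByTagName("Name")
        if name_nodes and name_nodes[0].firstChild is not None:
            names.append(name_nodes[0].firstChild.data)
    return names


xmldoc = minidom.parseString(xml_text)
all_names = equipment_names(xmldoc)
bcafc_names = equipment_names(xmldoc, "BCAFC")
print(all_names)
print(bcafc_names)
```

For a file on disk, minidom.parse(path) would replace parseString. The optional filter shows how an attribute such as type can be used to select elements.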

Correcting to the correct URL

I have written a simple script to access JSON to get the keywords needed to be used for the URL.
Below is the script that I have written:
import urllib2
import json
f1 = open('CatList.text', 'r')
f2 = open('SubList.text', 'w')
lines = f1.read().splitlines()
for line in lines:
    url ='https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle='+line+'&cmlimit=100'
    json_obj = urllib2.urlopen(url)
    data = json.load(json_obj)
    for item in data['query']:
        for i in data['query']['categorymembers']:
            print i['title']
            print '-----------------------------------------'
            f2.write((i['title']).encode('utf8')+"\n")
In this script, the program will first read CatList which provides a list of keywords used for the URL.
Here is a sample of what the CatList.text contains.
Category:Branches of geography
Category:Geography by place
Category:Geography awards and competitions
Category:Geography conferences
Category:Geography education
Category:Environmental studies
Category:Exploration
Category:Geocodes
Category:Geographers
Category:Geographical zones
Category:Geopolitical corridors
Category:History of geography
Category:Land systems
Category:Landscape
Category:Geography-related lists
Category:Lists of countries by geography
Category:Navigation
Category:Geography organizations
Category:Places
Category:Geographical regions
Category:Surveying
Category:Geographical technology
Category:Geography terminology
Category:Works about geography
Category:Geographic images
Category:Geography stubs
My program gets the keywords and places them in the URL.
However, I am not able to get the result. I have checked the code by printing the URL:
import urllib2
import json
f1 = open('CatList.text', 'r')
f2 = open('SubList2.text', 'w')
lines = f1.read().splitlines()
for line in lines:
    url ='https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle='+line+'&cmlimit=100'
    json_obj = urllib2.urlopen(url)
    data = json.load(json_obj)
    f2.write(url+'\n')
The result I get is as follows in sublist2:
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Branches of geography&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography by place&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography awards and competitions&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography conferences&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography education&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Environmental studies&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Exploration&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geocodes&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geographers&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geographical zones&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geopolitical corridors&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:History of geography&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Land systems&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Landscape&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography-related lists&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Lists of countries by geography&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Navigation&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography organizations&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Places&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geographical regions&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Surveying&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geographical technology&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography terminology&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Works about geography&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geographic images&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography stubs&cmlimit=100
It shows that the URL is placed correctly.
But when I run the full code it was not able to get the correct result.
One thing I notice is when I place in the link to the address bar for example:
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Branches of geography&cmlimit=100
It gives the correct result, because the address bar automatically corrects it to:
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Branches%20of%20geography&cmlimit=100
I believe that if %20 is put in place of each space in "Category:Branches of geography", my script will be able to get the correct JSON items.
Problem:
But I am not sure how to modify the url statement in the above code so that the spaces contained in CatList are replaced with %20.
Please forgive me for the bad formatting and the long post; I am still learning Python.
Thank you for helping me.
Edit:
Thank you Tim. Your solution works:
url ='https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle='+urllib2.quote(line)+'&cmlimit=100'
It was able to print the correct result:
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3ABranches%20of%20geography&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeography%20by%20place&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeography%20awards%20and%20competitions&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeography%20conferences&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeography%20education&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AEnvironmental%20studies&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AExploration&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeocodes&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeographers&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeographical%20zones&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeopolitical%20corridors&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AHistory%20of%20geography&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3ALand%20systems&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3ALandscape&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeography-related%20lists&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3ALists%20of%20countries%20by%20geography&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3ANavigation&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeography%20organizations&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3APlaces&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeographical%20regions&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3ASurveying&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeographical%20technology&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeography%20terminology&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWorks%20about%20geography&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeographic%20images&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeography%20stubs&cmlimit=100
Use urllib.quote() to escape special characters in a URL:
Python 2:
import urllib
line = 'Category:Branches of geography'
url ='https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=' + urllib.quote(line) + '&cmlimit=100'
https://docs.python.org/2/library/urllib.html#urllib.quote
Python 3:
import urllib.parse
line = 'Category:Branches of geography'
url ='https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=' + urllib.parse.quote(line) + '&cmlimit=100'
https://docs.python.org/3.5/library/urllib.parse.html#urllib.parse.quote
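As an alternative to quoting one value by hand, the whole query string can be built from a dictionary with urllib.parse.urlencode. This is a Python 3 sketch; category_members_url is a made-up helper name, and the parameters are the ones from the question.

```python
from urllib.parse import quote, urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def category_members_url(category, limit=100):
    """Build the categorymembers query URL with every parameter percent-encoded."""
    params = {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": limit,
    }
    # quote_via=quote encodes spaces as %20 instead of urlencode's default "+".
    return API_URL + "?" + urlencode(params, quote_via=quote)


url = category_members_url("Category:Branches of geography")
print(url)
```

This way each value is escaped in one place, so a category containing characters such as & or # would not break the query string either.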

Python read a xml 1 -M and create json

I have developed a Python script that reads tags from an XML file and then converts the result into JSON.
The problem is that each element in the XML contains a repeated tag (a 1-to-M relation):
<idpoint>1021</idpoint>
<tipopoint>1</tipopoint>
<latitude>45.188377380</latitude>
<longitude>8.612257004</longitude>
<previsione time="2015-07-11T12:00:00">
    <id_tempo>1</id_tempo>
    <desc_tempo>sereno</desc_tempo>
    <symbol_day>1</symbol_day>
    <temp>33</temp>
</previsione>
<previsione time="2015-07-11T18:00:00">
    <id_tempo>1</id_tempo>
    <desc_tempo>sereno</desc_tempo>
    <symbol_day>1</symbol_day>
    <temp>29</temp>
</previsione>
My Python code reads the first tags, but when it reaches the previsione tag, which is repeated twice for the same point, it takes the values from the first previsione tag and not from the second.
I would like to create a second record for the same point, this time with the values from the second previsione tag.
This is a snippet of my Python code:
json_array = []
for path in files:
    with open(path, 'r') as fr:
        print "Parsing xmldoc %s" % path
        xmldoc = minidom.parse(fr)
        if tipo == "allerte":
            items = xmldoc.getElementsByTagName("point")
        else:
            items = xmldoc.getElementsByTagName("localita")
        for item in items:
            obj = dict()
            if tipo == "allerte":
                obj['id'] = item.getElementsByTagName("idpoint")[0].firstChild.nodeValue
            else:
                obj['id'] = item.getElementsByTagName("idpoint")[0].firstChild.nodeValue
            obj['latitude'] = float(item.getElementsByTagName("latitude")[0].firstChild.nodeValue)
            obj['longitude'] = float(item.getElementsByTagName("longitude")[0].firstChild.nodeValue)
            #TODO: the symbol code must be taken from the first forecast (previsione)
            obj['symbolcode'] = int(item.getElementsByTagName("id_tempo")[0].firstChild.nodeValue)
            json_array.append(obj)
return json.dumps(json_array)
Can anyone help me extend this code so that the JSON contains two elements, one for each of the two previsione tags?
Thanks
There is a quick way to get JSON from XML, using xmltodict. This module creates a nice dict from your XML, and you can then manipulate your data as easily as if it were plain JSON.
Let's assume that your XML sample is saved as the file t2.xml, wrapped in <xml>...</xml> tags.
Then this script
#!/usr/bin/env python
# coding: utf-8
import json
import xmltodict
with open('t2.xml', 'r') as data:
    print("Parsing xmldoc t2.xml")
    # parse the file contents into a nested dict (avoid naming it "dict", which shadows the builtin)
    doc = xmltodict.parse(data.read())
print(json.dumps(doc, indent=4, sort_keys=True))
will produce JSON like the following (xmltodict prefixes attribute names with "@" by default):
{
    "xml": {
        "idpoint": "1021",
        "latitude": "45.188377380",
        "longitude": "8.612257004",
        "previsione": [
            {
                "@time": "2015-07-11T12:00:00",
                "desc_tempo": "sereno",
                "id_tempo": "1",
                "symbol_day": "1",
                "temp": "33"
            },
            {
                "@time": "2015-07-11T18:00:00",
                "desc_tempo": "sereno",
                "id_tempo": "1",
                "symbol_day": "1",
                "temp": "29"
            }
        ],
        "tipopoint": "1"
    }
}
In particular, you get both of your previsione elements properly in an array and can use them as you need.
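If a third-party module is not an option, the same one-record-per-previsione result can be sketched with the standard library's ElementTree. The wrapping localita element is taken from the question's code, records_for is a made-up helper name, and the extra time and temp keys are illustrative additions.

```python
import json
import xml.etree.ElementTree as ET

# The sample from the question, wrapped in a <localita> element.
xml_text = """<localita>
  <idpoint>1021</idpoint>
  <tipopoint>1</tipopoint>
  <latitude>45.188377380</latitude>
  <longitude>8.612257004</longitude>
  <previsione time="2015-07-11T12:00:00">
    <id_tempo>1</id_tempo>
    <desc_tempo>sereno</desc_tempo>
    <symbol_day>1</symbol_day>
    <temp>33</temp>
  </previsione>
  <previsione time="2015-07-11T18:00:00">
    <id_tempo>1</id_tempo>
    <desc_tempo>sereno</desc_tempo>
    <symbol_day>1</symbol_day>
    <temp>29</temp>
  </previsione>
</localita>"""


def records_for(item):
    """Return one dict per <previsione> child, each repeating the point's own fields."""
    base = {
        "id": item.findtext("idpoint"),
        "latitude": float(item.findtext("latitude")),
        "longitude": float(item.findtext("longitude")),
    }
    records = []
    for forecast in item.findall("previsione"):
        record = dict(base)  # copy, so the records do not share one dict
        record["time"] = forecast.get("time")
        record["symbolcode"] = int(forecast.findtext("id_tempo"))
        record["temp"] = int(forecast.findtext("temp"))
        records.append(record)
    return records


item = ET.fromstring(xml_text)
json_array = records_for(item)
print(json.dumps(json_array, indent=4))
```

In the original loop, json_array.extend(records_for(item)) would take the place of the single append, giving two JSON elements for a point with two forecasts.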
