Create XML file by iterating over lists in python - python

I have checked this link, but it didn't solve my problem.
I have 2 lists:
a = [['txt','stxt','pi','min','max'],['txt1','stxt1','pi1','min1','max1']]
b = [[0.45,1.23],[0.75,1.53]]
for l1 in a:
    for l2 in b:
        root = ET.Element("Class ",name = l1[0])
        doc = ET.SubElement(root, "subclass" , name = l1[1])
        ET.SubElement(doc, l1[4], min = str(l2 [0]),max = str(l2 [1]))
        tree = ET.ElementTree(root)
        tree.write(FilePath)
The last record overwrites all the previous records. How can I write all the records to the XML file using Python? I also want each record saved on a new line in the XML file, but without pretty-printing.
Output I need in the XML file:
<Class name="txt"><subclass name="stxt"><pi max="1.23" min="0.45" /></subclass></Class >
<Class name="txt1"><subclass name="stxt1"><pi1 max1="1.53" min1="0.75" /></subclass></Class >
But what I am getting is only one record in the XML file:
<Class name="txt1"><subclass name="stxt1"><pi1 max1="0.1077" min1="-0.0785" /></subclass></Class >

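You are writing to the same file on every iteration, so each write replaces the previous one. The two nested for loops would also produce 4 records with undesired combinations. Instead, pair the two lists with zip, append each record to a single root element, and write the file once: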
a = [['txt','stxt','pi','min','max'],['txt1','stxt1','pi1','min1','max1']]
b = [[0.45,1.23],[0.75,1.53]]
from xml.etree import ElementTree as ET
root = ET.Element("xml")
for l1 in zip(a,b):
    sroot_root = ET.Element("Class ",name = l1[0][0])
    doc = ET.SubElement(sroot_root, "subclass" , name = l1[0][1])
    ET.SubElement(doc, l1[0][4], min = str(l1[1][0]),max = str(l1[1][1]))
    root.append(sroot_root)
tree = ET.ElementTree(root)
tree.write("test.xml")
Output :
Filename: test.xml
<xml><Class name="txt"><subclass name="stxt"><max max="1.23" min="0.45" /></subclass></Class ><Class name="txt1"><subclass name="stxt1"><max1 max="1.53" min="0.75" /></subclass></Class ></xml>
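The question also asked for one record per line without pretty-printing. A minimal sketch of that, assuming the third item of each name row (`pi`, `pi1`) is the leaf tag and the fourth and fifth items are the attribute names, as in the desired output. The helper `build_record` and the file name `records.xml` are made up for illustration, and the tag is written as `Class` without the trailing space. Note that a file holding several top-level elements is not a single well-formed XML document.

```python
import xml.etree.ElementTree as ET

a = [['txt', 'stxt', 'pi', 'min', 'max'], ['txt1', 'stxt1', 'pi1', 'min1', 'max1']]
b = [[0.45, 1.23], [0.75, 1.53]]


def build_record(names, values):
    """Build one <Class> element from a name row and a [min, max] pair (hypothetical helper)."""
    cls = ET.Element("Class", name=names[0])
    sub = ET.SubElement(cls, "subclass", name=names[1])
    # names[2] is the leaf tag; names[3] and names[4] are the attribute names.
    ET.SubElement(sub, names[2], {names[3]: str(values[0]), names[4]: str(values[1])})
    return cls


# Serialize each record separately so that every record ends up on its own line.
lines = [ET.tostring(build_record(n, v), encoding="unicode") for n, v in zip(a, b)]

with open("records.xml", "w", encoding="utf-8") as fh:
    fh.write("\n".join(lines) + "\n")
```

Serializing record by record keeps each one compact while still allowing a line break between them, which `tree.write` on a single root does not do.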

Related

Listing path and data from a xml file to store in a dataframe

Here is an XML file:
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
<SOAP-ENV:Header />
<SOAP-ENV:Body>
<ADD_LandIndex_001>
<CNTROLAREA>
<BSR>
<status>ADD</status>
<NOUN>LandIndex</NOUN>
<REVISION>001</REVISION>
</BSR>
</CNTROLAREA>
<DATAAREA>
<LandIndex>
<reportId>AMI100031</reportId>
<requestKey>R3278458</requestKey>
<SubmittedBy>EN4871</SubmittedBy>
<submittedOn>2015/01/06 4:20:11 PM</submittedOn>
<LandIndex>
<agreementdetail>
<agreementid>001 4860</agreementid>
<agreementtype>NATURAL GAS</agreementtype>
<currentstatus>
<status>ACTIVE</status>
<statuseffectivedate>1965/02/18</statuseffectivedate>
<termdate>1965/02/18</termdate>
</currentstatus>
<designatedrepresentative></designatedrepresentative>
</agreementdetail>
</LandIndex>
</LandIndex>
</DATAAREA>
</ADD_LandIndex_001>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>
I want to save in a dataframe: 1) the path and 2) the text of the element at that path. To build this dataframe, I am thinking of using a dictionary to store both. So first I would like to get a dictionary like this (where each value is associated with its corresponding path):
{'/Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/status': 'ADD', '/Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/NOUN': 'LandIndex',...}
That way I just have to call df=pd.DataFrame() to create a dataframe that I can export to an Excel sheet. I already have the part that lists the paths, but I cannot get the text from those paths. I do not understand how the lxml library works. I tried .text() and text_content(), but I get an error.
Here is my code :
from lxml import etree
import xml.etree.ElementTree as et
from bs4 import BeautifulSoup
import pandas as pd
filename = 'file_try.xml'
with open(filename, 'r') as f:
    soap = f.read()
root = etree.XML(soap.encode())
tree = etree.ElementTree(root)
mylist_path = []
mylist_data = []
mydico = {}
mylist = []
for target in root.xpath('//text()'):
    if len(target.strip())>0:
        path = tree.getpath(target.getparent()).replace('SOAP-ENV:','')
        mydico[path] = target.text()
        mylist_path.append(path)
        mylist_data.append(target.text())
mylist.append(mydico)
df=pd.DataFrame(mylist)
df.to_excel("data_xml.xlsx")
print(mylist_path)
print(mylist_data)
Thank you for the help !
Here is an example of traversing the XML tree. This needs a recursive function. Fortunately, lxml provides all the functionality required.
from lxml import etree as et
from collections import defaultdict
import pandas as pd
d = defaultdict(list)
root = et.fromstring(xml)  # xml is assumed to hold the document above as a string
tree = et.ElementTree(root)
def traverse(el, d):
    if len(list(el)) > 0:
        for child in el:
            traverse(child, d)
    else:
        if el.text is not None:
            d[tree.getelementpath(el)].append(el.text)
traverse(root, d)
df = pd.DataFrame(d)
df.head()
Output:
{
'{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/CNTROLAREA/BSR/status': ['ADD'],
'{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/CNTROLAREA/BSR/NOUN': ['LandIndex'],
'{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/CNTROLAREA/BSR/REVISION': ['001'],
'{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/reportId': ['AMI100031'],
'{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/requestKey': ['R3278458'],
'{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/SubmittedBy': ['EN4871'],
'{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/submittedOn': ['2015/01/06 4:20:11 PM'],
'{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/agreementid': ['001 4860'],
'{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/agreementtype': ['NATURAL GAS'],
'{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/status': ['ACTIVE'],
'{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/statuseffectivedate': ['1965/02/18'],
'{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/termdate': ['1965/02/18']
}
Please note that the dictionary d contains lists as values. That's because elements can be repeated in XML, and otherwise the last value would overwrite the previous ones. If that's not the case for your particular XML, use a regular dict instead of defaultdict (d = {}) and assign instead of appending: d[tree.getelementpath(el)] = el.text.
The same when reading from file:
d = defaultdict(list)
with open('output.xml', 'rb') as file:
    root = et.parse(file).getroot()
tree = et.ElementTree(root)
def traverse(el, d):
    if len(list(el)) > 0:
        for child in el:
            traverse(child, d)
    else:
        if el.text is not None:
            d[tree.getelementpath(el)].append(el.text)
traverse(root, d)
df = pd.DataFrame(d)
print(d)
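For comparison, here is a rough standard-library sketch of the same idea that produces the namespace-free '/Envelope/Body/...' keys asked for in the question. It uses `xml.etree.ElementTree` rather than lxml, and `SAMPLE`, `local_name` and `collect_paths` are names made up for this illustration; the sample is a shortened stand-in for the real file.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the SOAP document in the question.
SAMPLE = """<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Body>
    <BSR>
      <status>ADD</status>
      <NOUN>LandIndex</NOUN>
    </BSR>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on qualified tags."""
    return tag.rsplit('}', 1)[-1]


def collect_paths(el, prefix="", out=None):
    """Map '/a/b/c' style paths to lists of leaf texts (lists keep repeated elements)."""
    if out is None:
        out = {}
    path = prefix + "/" + local_name(el.tag)
    children = list(el)
    if children:
        for child in children:
            collect_paths(child, path, out)
    elif el.text and el.text.strip():
        out.setdefault(path, []).append(el.text.strip())
    return out


paths = collect_paths(ET.fromstring(SAMPLE))
print(paths)
```

The resulting dictionary has the same shape as `d` above, so it can be passed to `pd.DataFrame` in the same way as long as all lists have equal length.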

creating dynamic nested xml from Excel

I'm trying to convert an Excel sheet to nested XML, but the result is not what I expect.
Here is my code.
import openpyxl
import xml.etree.ElementTree as etree
# reading data from the source, xls
wb1 = openpyxl.load_workbook(filename=r'C:\GSH\parent_child.xlsx')
ws1 = wb1.get_sheet_by_name('Sheet1')
row_max = ws1.max_row
# creating xml tree structure
root = etree.Element('Hierarchy')
# iterating through the xls and creating children based on the condition
for row_values in range(2, row_max+1):
    parent = etree.SubElement(root, 'parent')
    parent.text = ws1.cell(column=1, row=row_values).value
    root.append(parent)
    if (ws1.cell(column=1, row = row_values).value == ws1.cell(column=2, row = row_values-1).value):
        print("------Inside if condition")
        print(ws1.cell(column=2, row=row_values).value)
        child = etree.SubElement(parent, 'child')
        child.text = ws1.cell(column=2, row=row_values).value
        parent.append(child)
        print("-------Inside if condition")
tree = etree.ElementTree(root)
tree.write(r'C:\GSH\gsh.xml')
I am getting XML like this..
However, my XML should look like this.
Any suggestions, please.
The above is the source XLS I am working from.
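Instead of fixed parent and child variables, you can keep every element in a dictionary keyed by its tag name and look up the parent there. This code covers only part of your list and may look tricky, but it works. The line d[child[i]].text = " " is only there so that both the opening and closing tags are written. For creating variables in a loop with a dictionary, please refer to this.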
import xml.etree.ElementTree as ET
India = ET.Element('India') # set root
parent = ['India', 'Telangana', 'Telangana', 'Telangana','Nalgonda'] # parent list
child = ['Telangana', 'Cyberabad', 'Warangal','Nalgonda','BusStation'] # child list
d = {} # use dictionary to define var in loop
d['India'] = India
for i in range(len(child)):
    for k, v in d.items():
        if k == parent[i]:
            pa = v
            break
    d[child[i]] = ET.SubElement(pa, child[i])
    d[child[i]].text = " " # to get both side of tags
tree = ET.ElementTree(India)
tree.write('gsh.xml')
# <India>
#   <Telangana>
#     <Cyberabad> </Cyberabad>
#     <Warangal> </Warangal>
#     <Nalgonda>
#       <BusStation> </BusStation>
#     </Nalgonda>
#   </Telangana>
# </India>
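The dictionary lookup above can be written more directly, since a dictionary already finds the parent by key. A small sketch, assuming the rows are available as (parent, child) pairs and that every parent appears as a child (or as the root) before it is used; `pairs` and `build_hierarchy` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# (parent, child) pairs in the same order as the two lists above.
pairs = [
    ('India', 'Telangana'),
    ('Telangana', 'Cyberabad'),
    ('Telangana', 'Warangal'),
    ('Telangana', 'Nalgonda'),
    ('Nalgonda', 'BusStation'),
]


def build_hierarchy(pairs):
    """Build a nested tree; the first parent is taken as the root (an assumption)."""
    root_name = pairs[0][0]
    nodes = {root_name: ET.Element(root_name)}
    for parent, child in pairs:
        # Raises KeyError if a parent has not been created yet.
        nodes[child] = ET.SubElement(nodes[parent], child)
    return nodes[root_name]


root = build_hierarchy(pairs)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Because tag names are used as dictionary keys, this only works when every name in the sheet is unique.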

How to get the content of specific grandchild from xml file through python

Hi I am very new to python programming. I have an xml file of structure:
<?xml version="1.0" encoding="UTF-8"?>
<LidcReadMessage xsi:schemaLocation="http://www.nih.gov http://troll.rad.med.umich.edu/lidc/LidcReadMessage.xsd"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.nih.gov" uid="1.3.6.1.4.1.14519.5.2.1.6279.6001.1307390687803.0">
<ResponseHeader>
<Version>1.8.1</Version>
<MessageId>-421198203</MessageId>
<DateRequest>2007-11-01</DateRequest>
<TimeRequest>12:30:44</TimeRequest>
<RequestingSite>removed</RequestingSite>
<ServicingSite>removed</ServicingSite>
<TaskDescription>Second unblinded read</TaskDescription>
<CtImageFile>removed</CtImageFile>
<SeriesInstanceUid>1.3.6.1.4.1.14519.5.2.1.6279.6001.179049373636438705059720603192</SeriesInstanceUid>
<DateService>2008-08-18</DateService>
<TimeService>02:05:51</TimeService>
<ResponseDescription>1 - Reading complete</ResponseDescription>
<StudyInstanceUID>1.3.6.1.4.1.14519.5.2.1.6279.6001.298806137288633453246975630178</StudyInstanceUID>
</ResponseHeader>
<readingSession>
<annotationVersion>3.12</annotationVersion>
<servicingRadiologistID>540461523</servicingRadiologistID>
<unblindedReadNodule>
<noduleID>Nodule 001</noduleID>
<characteristics>
<subtlety>5</subtlety>
<internalStructure>1</internalStructure>
<calcification>6</calcification>
<sphericity>3</sphericity>
<margin>3</margin>
<lobulation>3</lobulation>
<spiculation>4</spiculation>
<texture>5</texture>
<malignancy>5</malignancy>
</characteristics>
<roi>
<imageZposition>-125.000000 </imageZposition>
<imageSOP_UID>1.3.6.1.4.1.14519.5.2.1.6279.6001.110383487652933113465768208719</imageSOP_UID>
......
There are four <readingSession> elements, each of which contains multiple <roi> elements. Each <roi> contains an <imageSOP_UID>. I need to extract the information in <imageSOP_UID> from all of these headers.
Right now I am doing this:
import xml.etree.ElementTree as ET
tree = ET.parse('069.xml')
root = tree.getroot()
#lst = []
for readingsession in root.iter('readingSession'):
    for roi in readingsession.findall('roi'):
        id = roi.findtext('imageSOP_UID')
        print(id)
but it outputs only this:
Process finished with exit code 0.
If anyone can help.
The real problem was the namespace. I tried the code in the question with and without it, but it didn't work.
import xml.etree.ElementTree as ET
import pydicom
ds = pydicom.dcmread("000071.dcm")
uid = ds.SOPInstanceUID
tree = ET.parse("069.xml")
root = tree.getroot()
for child in root:
    print(child.tag)
    if child.tag == '{http://www.nih.gov}readingSession':
        read = child.find('{http://www.nih.gov}unblindedReadNodule')
        if read is not None:
            nodule_id = read.find('{http://www.nih.gov}noduleID').text
            xml_uid = read.find('{http://www.nih.gov}roi').find('{http://www.nih.gov}imageSOP_UID').text
            if xml_uid == uid:
                print(xml_uid, "=", uid)
                roi= read.find('{http://www.nih.gov}roi')
                print(roi)
This works completely fine for getting a UID from a DICOM image of the LIDC/IDRI dataset and then finding the same UID in the XML file to get its region of interest.
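To collect every imageSOP_UID without spelling out the full namespace in each call, ElementTree also accepts a prefix-to-URI mapping. The following is a sketch on a cut-down document; the prefix `lidc`, the `SAMPLE` string and the short UID values are invented for illustration, and only the namespace URI comes from the file above.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the LIDC file; the UID values are placeholders.
SAMPLE = """<LidcReadMessage xmlns="http://www.nih.gov">
  <readingSession>
    <unblindedReadNodule>
      <noduleID>Nodule 001</noduleID>
      <roi><imageSOP_UID>1.2.3</imageSOP_UID></roi>
      <roi><imageSOP_UID>1.2.4</imageSOP_UID></roi>
    </unblindedReadNodule>
  </readingSession>
</LidcReadMessage>"""

# The prefix is arbitrary; it only has to map to the document's default namespace.
NS = {"lidc": "http://www.nih.gov"}

root = ET.fromstring(SAMPLE)
# './/' searches at any depth, so nested <roi> elements are found as well.
uids = [el.text for el in root.findall(".//lidc:roi/lidc:imageSOP_UID", NS)]
print(uids)
```

The un-prefixed search in the question finds nothing because the default namespace makes every tag name `{http://www.nih.gov}...`, and because `roi` is not a direct child of `readingSession`.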

How to create multiple subelements (having children) from a dataframe under a single root in XML using python?

I am trying to create a tree whose sub-element has multiple categories, each of which has further children. Sub-elements of the same type but different categories should lie under the same root.
Expected:
I know there should be a loop so that the same lines of code run for each category and append it under the same root, but I am having a hard time figuring it out.
My code is below. It only generates a separate root for each row instead.
for i in range(len(df["animalId"])):
    category = df[i][0]
    name = df[i][1]
    legs = df[i][2]
    type = df[i][3]
    animals = etree.Element("animals")
    etree.SubElement(animals, "category").text = str(category)
    etree.SubElement(category, "version").text = str(version)
    etree.SubElement(version, "name").text = str(name)
    etree.SubElement(version, "legs").text = str(legs)
    etree.SubElement(version, "type").text = str(type)
    xmlstr = minidom.parseString(etree.toString(animals)).toprettyxml(indent = " ")
    print (xmlstr)
Resulting:
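The expected layout is not shown above, so the following is only a sketch of one possible fix: create the root once, outside the loop, and reuse one element per category. The `rows` list stands in for the dataframe, and the column names, the `animal` tag and the `name` attribute on `category` are all assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows standing in for the dataframe; column names are assumptions.
rows = [
    {"category": "mammal", "name": "dog", "legs": 4, "type": "pet"},
    {"category": "mammal", "name": "cow", "legs": 4, "type": "farm"},
    {"category": "bird", "name": "hen", "legs": 2, "type": "farm"},
]

animals = ET.Element("animals")  # created once, outside the loop
categories = {}                  # category value -> its <category> element

for row in rows:
    cat = categories.get(row["category"])
    if cat is None:
        # First time this category is seen: attach it to the single root.
        cat = ET.SubElement(animals, "category", name=str(row["category"]))
        categories[row["category"]] = cat
    animal = ET.SubElement(cat, "animal")
    for field in ("name", "legs", "type"):
        ET.SubElement(animal, field).text = str(row[field])

xml_text = ET.tostring(animals, encoding="unicode")
print(xml_text)
```

With pandas, rows of this shape can be obtained from `df.to_dict("records")`.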

XML reading the last entry from the file in python

I want to read the last entry of the xml file and get its value. Here is my xml file
<TestSuite>
<TestCase>
<name>tcname1</name>
<total>1</total>
<totalpass>0</totalpass>
<totalfail>0</totalfail>
<totalerror>1</totalerror>
</TestCase>
<TestCase>
<name>tcname2</name>
<total>1</total>
<totalpass>0</totalpass>
<totalfail>0</totalfail>
<totalerror>1</totalerror>
</TestCase>
</TestSuite>
I want to get the <total>, <totalpass>, <totalfail> and <totalerror> values from the last <TestCase> tag of the file. I have tried this code to do that.
import xmltodict
with open(filename) as fd:
    doc = xmltodict.parse(fd.read())
length=len(doc['TestSuite']['TestCase'])
tp=doc['TestSuite']['TestCase'][length-1]['totalpass']
tf=doc['TestSuite']['TestCase'][length-1]['totalfail']
te=doc['TestSuite']['TestCase'][length-1]['totalerror']
total=doc['TestSuite']['TestCase'][length-1]['total']
This works for XML files with 2 or more TestCase tags, but fails with this error for a file with only one TestCase tag:
Traceback (most recent call last):
  File "HTMLReportGenerationFromXML.py", line 52, in <module>
    tp=doc['TestSuite']['TestCase'][length-1]['totalpass']
KeyError: 4
Because instead of the count , it is taking the subtag ( etc value as length). Please help me resolve this issue.
Since you only want the last one, you can use negative indices to retrieve it:
import xml.etree.ElementTree as et
tree = et.parse('test.xml')
# collect all the test cases
test_cases = tree.findall('TestCase')
# Pull data from the last one
last = test_cases[-1]
total = last.find('total').text
totalpass = last.find('totalpass').text
totalfail = last.find('totalfail').text
totalerror = last.find('totalerror').text
print(total, totalpass, totalfail, totalerror)
Why didn't I do this in the first place! Use XPath.
The two examples process one file with a single TestCase element and one with two of them. The key point is to use the XPath last() function.
>>> from lxml import etree
>>> tree = etree.parse('temp.xml')
>>> last_TestCase = tree.xpath('.//TestCase[last()]')[0]
>>> for child in last_TestCase.iterchildren():
... child.tag, child.text
...
('name', 'tcname2')
('total', '1')
('totalpass', '0')
('totalfail', '0')
('totalerror', '1')
>>>
>>> tree = etree.parse('temp_2.xml')
>>> last_TestCase = tree.xpath('.//TestCase[last()]')[0]
>>> for child in last_TestCase.iterchildren():
... child.tag, child.text
...
('name', 'tcname1')
('reason', 'reason')
('total', '2')
('totalpass', '0')
('totalfail', '0')
('totalerror', '2')
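The same `last()` predicate is also supported by the limited XPath subset in the standard library's `xml.etree.ElementTree`, so lxml is not strictly required for this. A short sketch; `TWO_CASES`, `ONE_CASE` and `last_test_case` are names made up for the example.

```python
import xml.etree.ElementTree as ET

TWO_CASES = """<TestSuite>
  <TestCase><name>tcname1</name><total>1</total><totalpass>0</totalpass><totalfail>0</totalfail><totalerror>1</totalerror></TestCase>
  <TestCase><name>tcname2</name><total>1</total><totalpass>0</totalpass><totalfail>0</totalfail><totalerror>1</totalerror></TestCase>
</TestSuite>"""

ONE_CASE = """<TestSuite>
  <TestCase><name>tcname1</name><total>2</total><totalpass>0</totalpass><totalfail>0</totalfail><totalerror>2</totalerror></TestCase>
</TestSuite>"""


def last_test_case(xml_text):
    """Return the child tags and texts of the last <TestCase> as a dict."""
    root = ET.fromstring(xml_text)
    # [last()] selects the final matching child, whether there is one or many.
    last = root.find("TestCase[last()]")
    return {child.tag: child.text for child in last}


print(last_test_case(TWO_CASES))
print(last_test_case(ONE_CASE))
```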
I tried this, and it works for me:
import xml.etree.ElementTree as ET
import sys
tree = ET.parse('temp.xml')
root = tree.getroot()
print(root)
total=[]
totalpass=[]
totalfail=[]
totalerror=[]
for test in root.findall('TestCase'):
    total.append(test.find('total').text)
    totalpass.append(test.find('totalpass').text)
    totalfail.append(test.find('totalfail').text)
    totalerror.append(test.find('totalerror').text)
length=len(total)
print(total[length-1],totalpass[length-1],totalfail[length-1],totalerror[length-1])
This one works for me
The reason for your error is that with xmltodict, doc['TestSuite']['TestCase'] is a list only when the XML has more than one TestCase entry:
>>> type(doc2['TestSuite']['TestCase']) # here doc2 is more than one-entry long XML file
list
but it is just a kind of dictionary for a one-entry long file:
>>> type(doc['TestSuite']['TestCase']) # doc is one-entry long
collections.OrderedDict
That's the reason. Calling len() on that dictionary counts its keys (5 here), so length-1 is 4, which is not a key. You could handle both cases in the following way:
import xmltodict
with open(filename) as fd:
    doc = xmltodict.parse(fd.read())
if type(doc['TestSuite']['TestCase']) == list:
    # several entries: index -1 picks the last one
    tp=doc['TestSuite']['TestCase'][-1]['totalpass']
    tf=doc['TestSuite']['TestCase'][-1]['totalfail']
    te=doc['TestSuite']['TestCase'][-1]['totalerror']
    total=doc['TestSuite']['TestCase'][-1]['total']
else: # you have just a dict here
    tp=doc['TestSuite']['TestCase']['totalpass']
    tf=doc['TestSuite']['TestCase']['totalfail']
    te=doc['TestSuite']['TestCase']['totalerror']
    total=doc['TestSuite']['TestCase']['total']
Otherwise, you can use another library for the XML parsing.
...let me know if it helps!
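The branching can be reduced by normalising the value to a list first. The sketch below does not call xmltodict; the `one` and `many` dictionaries are hand-written stand-ins for what the answer above describes it returning, and `as_list` and `last_case` are hypothetical helpers.

```python
def as_list(value):
    """Wrap a single item in a list; leave lists as they are."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def last_case(doc):
    """Return the last TestCase mapping, whether one or many were parsed."""
    return as_list(doc['TestSuite']['TestCase'])[-1]


# Hand-written stand-ins for the parsed documents (not real xmltodict output).
one = {'TestSuite': {'TestCase': {
    'name': 'tcname1', 'total': '1', 'totalpass': '0', 'totalfail': '0', 'totalerror': '1'}}}
many = {'TestSuite': {'TestCase': [
    {'name': 'tcname1', 'total': '1', 'totalpass': '0', 'totalfail': '0', 'totalerror': '1'},
    {'name': 'tcname2', 'total': '1', 'totalpass': '0', 'totalfail': '0', 'totalerror': '1'},
]}}

for doc in (one, many):
    case = last_case(doc)
    print(case['name'], case['total'], case['totalpass'], case['totalfail'], case['totalerror'])
```

xmltodict's `parse` also has a `force_list` argument intended for this situation; check its documentation before relying on it.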
