Transform XML to Pandas dataframes by detecting automatically the column names - python

I am trying to write a parse function that converts a specific XML file into a pandas data frame in Python.
The XML data has the following structure:
<?xml version='1.0' encoding='UTF-8'?><package_D15D.HISTORY xmlns="http://xml.mscibarra.com/ns/msci/deal/D15D.HISTORY" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://xml.mscibarra.com/ns/msci/deal/D15D.HISTORY 20140602_20140630_CORE_DM_SEC_MAIN_DAILY.xsd">
<dataset_D15D>
<entry calc_date="2014-06-02" ... >
</entry>
<entry ...
A sample of the data can be found here.
Each row sits in its own <entry> element. I wrote the following function:
def parse_XML(xml_file, df_cols):
    xtree = et.parse(xml_file)
    xroot = xtree.getroot()
    rows = []
    for node in xroot:
        print(xroot)
        res = []
        res.append(node.attrib.get(df_cols[0]))
        for el in df_cols[1:]:
            if node is not None and node.find(el) is not None:
                res.append(node.find(el).text)
            else:
                res.append(None)
        rows.append({df_cols[i]: res[i]
                     for i, _ in enumerate(df_cols)})
    out_df = pd.DataFrame(rows, columns=df_cols)
    out_df.to_csv('C:/Users/dataa', sep=';', encoding='utf-8')
    return out_df
However, the data frame returned by the function is empty.
Any ideas?
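One likely cause, assuming the structure shown above: the root element's direct children are <dataset_D15D> elements rather than <entry> elements, and every tag lives in a default namespace, so un-namespaced find() calls match nothing. The sketch below walks all descendants, keeps the ones whose local name is entry, and derives the column names from the attributes it finds. The sample document and the attribute names price and volume are made up for illustration; only calc_date comes from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: only 'calc_date' appears in the question,
# 'price' and 'volume' are invented attribute names.
SAMPLE = """<package_D15D.HISTORY xmlns="http://xml.mscibarra.com/ns/msci/deal/D15D.HISTORY">
  <dataset_D15D>
    <entry calc_date="2014-06-02" price="10.5"/>
    <entry calc_date="2014-06-03" price="10.7" volume="300"/>
  </dataset_D15D>
</package_D15D.HISTORY>"""


def entries_to_rows(root):
    """Return one dict of attributes per <entry> element, at any depth."""
    rows = []
    for elem in root.iter():
        # Namespaced tags look like '{uri}entry'; keep only the local part.
        local_name = elem.tag.rsplit('}', 1)[-1]
        if local_name == 'entry':
            rows.append(dict(elem.attrib))
    return rows


def column_names(rows):
    """Collect attribute names in first-seen order across all rows."""
    cols = []
    for row in rows:
        for key in row:
            if key not in cols:
                cols.append(key)
    return cols


rows = entries_to_rows(ET.fromstring(SAMPLE))
print(column_names(rows))
```

With pandas available, pd.DataFrame(rows) should build the frame directly from this list of dicts and infer the columns from the keys, leaving missing attributes as NaN.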

Related

Group branches in an XML tree with Python on a common field

I have a list of order details in a CSV file and want to group all line items that belong to the same order.
Example data:
Order|Line|Item|Price
123456789|1|IK123456|199.99
654987321|1|MASGE12385|29.95
654987321|2|KLEAN458792|9.99
654987321|3|LP12489|1959.95
I want everything written to an XML file with the order number as the root, the line number as a child, and Item and Price as sub-children.
I want the output to look like:
<Order number="123456789">
    <Line number="1">
        <Item>IK123456</Item>
        <Price>199.99</Price>
    </Line>
</Order>
<Order number="654987321">
    <Line number="1">
        <Item>MASGE12385</Item>
        <Price>29.95</Price>
    </Line>
    <Line number="2">
        <Item>KLEAN458792</Item>
        <Price>9.99</Price>
    </Line>
    <Line number="3">
        <Item>LP12489</Item>
        <Price>1959.95</Price>
    </Line>
</Order>
Here is my code:
import csv
import xml.etree.ElementTree as ET
file = 'C:/github.txt'
with open (file, 'r') as f:
    reader = csv.reader(f, delimiter = '|')
    header = next(reader)
    order_num = reader[0]
    root = ET.Element("Order") #BUILD A ROOT FOR THE XML TREE
    root.set('number', order_num) #ADD ATTRIBUTE
    for row in reader: #ITERATE THROUGH EACH ROW AND POPULATE DATA IN BRANCHES OF XML TREE
        line = ET.SubElement(root, 'line', number= reader[1])
        item = ET.SubElement(line, 'item code')
        item.text = reader[2]
        price = ET.SubElement(line, 'price')
        price.text = reader[3]
    tree = ET.ElementTree(root)
    tree.write('C:/github.xml', encoding = 'utf-8', xml_declaration = True)
(NOTE: I moved something and got an error, but not sure what happened)
During the loop, consider tracking the current order number so that a new <Order> element is created only when the number changes, which keeps each order's lines together. Additionally, consider csv.DictReader, which iterates over the CSV rows as dictionaries keyed by the first-row headers. Finally, use the built-in minidom to pretty-print the output. The code below places all orders under a single <Orders> root:
import csv
import xml.etree.ElementTree as ET
import xml.dom.minidom as mn
file = 'C:/github.txt'
curr_order = None
with open (file, 'r') as f:
    reader = csv.DictReader(f, delimiter = '|')
    # BUILD A ROOT FOR THE XML TREE
    root = ET.Element("Orders")
    # ITERATE THROUGH EACH ROW AS DICTIONARY
    for d in reader:
        # CONDITIONALLY BUILD ORDER ELEMENT
        if curr_order != str(d['Order']):
            orderElem = ET.SubElement(root, "Order")
            curr_order = str(d['Order'])
        # CREATE DESCENDANTS OF ORDER
        orderElem.set('number', str(d['Order']))
        line = ET.SubElement(orderElem, 'line', number = str(d['Line']))
        ET.SubElement(line, 'item_code').text = str(d['Item'])
        ET.SubElement(line, 'price').text = str(d['Price'])
# PRETTY PRINT OUTPUT
dom = mn.parseString(ET.tostring(root, encoding = 'utf-8'))
with open('C:/github.xml', 'wb') as f:
    f.write(dom.toprettyxml(indent=" ", encoding = 'utf-8'))
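As a variation on the same idea, the grouping could also be sketched with itertools.groupby instead of a manual tracker variable. This is only an illustration: it reads the question's sample rows from an in-memory string rather than from a file, uses the element names from the desired output, and assumes that rows of the same order are adjacent in the CSV (groupby only groups consecutive rows).

```python
import csv
import io
import itertools
import xml.etree.ElementTree as ET

# Sample rows copied from the question; in-memory so the sketch is self-contained.
CSV_TEXT = """Order|Line|Item|Price
123456789|1|IK123456|199.99
654987321|1|MASGE12385|29.95
654987321|2|KLEAN458792|9.99
654987321|3|LP12489|1959.95
"""


def build_orders(csv_text):
    """Build an <Orders> tree with one <Order> per run of equal order numbers."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter='|')
    root = ET.Element('Orders')
    # groupby yields (order number, iterator over that order's consecutive rows)
    for order_number, lines in itertools.groupby(reader, key=lambda r: r['Order']):
        order = ET.SubElement(root, 'Order', number=order_number)
        for row in lines:
            line = ET.SubElement(order, 'Line', number=row['Line'])
            ET.SubElement(line, 'Item').text = row['Item']
            ET.SubElement(line, 'Price').text = row['Price']
    return root


root = build_orders(CSV_TEXT)
print(ET.tostring(root, encoding='unicode'))
```

If the rows are not already ordered by order number, sorting them on that column first would be needed for groupby to produce one group per order.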

How do I sort XML alphabetically using python?

I have some XML files that I want to sort by element name. These XML files are used as Profiles in my Salesforce sandbox/org. I've built some code that takes an XML file and appends it to the bottom of each profile XML file, which lets me add code to multiple files at once rather than copying and pasting into each file. The issue is that the XML needs to be sorted alphabetically by element name, e.g. (classAccesses, fieldPermissions, layoutAssignments, recordTypeVisibilities, objectPermissions). I have pasted an example of the XML below. The format of the file needs to stay consistent and can't change, as Salesforce might not accept it.
<?xml version="1.0" encoding="UTF-8"?>
<Profile xmlns="http://soap.sforce.com/2006/04/metadata">
<fieldPermissions>
<editable>false</editable>
<field>Branch_Queue__c.Cell_Phone_Number__c</field>
<readable>true</readable>
</fieldPermissions>
<fieldPermissions>
<editable>false</editable>
<field>Branch_Queue__c.Branch__c</field>
<readable>true</readable>
</fieldPermissions>
<fieldPermissions>
<editable>false</editable>
<field>Branch_Queue__c.Source__c</field>
<readable>true</readable>
</fieldPermissions>
<fieldPermissions>
<editable>false</editable>
<field>Branch_Queue__c.Served_By__c</field>
<readable>true</readable>
</fieldPermissions>
<fieldPermissions>
<editable>false</editable>
<field>Branch_Queue__c.Update__c</field>
<readable>true</readable>
</fieldPermissions>
<recordTypeVisibilities>
<default>false</default>
<recordType>Knowledge__kav.RealEstate</recordType>
<visible>true</visible>
</recordTypeVisibilities>
<recordTypeVisibilities>
<default>false</default>
<recordType>Knowledge__kav.RealEstate_Community_Connection</recordType>
<visible>true</visible>
</recordTypeVisibilities>
<objectPermissions>
<allowCreate>false</allowCreate>
<allowDelete>false</allowDelete>
<allowEdit>false</allowEdit>
<allowRead>true</allowRead>
<modifyAllRecords>false</modifyAllRecords>
<object>Branch_Queue__c</object>
<viewAllRecords>true</viewAllRecords>
</objectPermissions>
<classAccesses>
<apexClass>BranchQueueDisplayList</apexClass>
<enabled>true</enabled>
</classAccesses>
<classAccesses>
<apexClass>BranchQueueDisplayList_Test</apexClass>
<enabled>true</enabled>
</classAccesses>
<classAccesses>
<apexClass>BranchQueueService</apexClass>
<enabled>true</enabled>
</classAccesses>
</Profile>
If it helps, here is the Python script I have built. If you have any questions, please feel free to ask. Thanks!
import os
import json
directory = 'C:/Users/HB35401/MAXDev/force-app/main/default/profiles' #folder containing profiles to be modified
os.chdir(directory)
newData = 'C:/testXMLBatch/additionalXML/addXML.xml' #xml file to append to profile-xml files.
for nameOfFile in os.listdir(directory): #for each profile in the directory
    if nameOfFile.endswith(".xml"):
        g = open(newData)
        data = g.read() #set the value of the newXML to the data variable
        f = open(nameOfFile)
        fileContent = f.read() #save the content of the profile to fileContent
        if data in fileContent:
            print('ERROR: XML is already inside the Profile.' + nameOfFile)
        else:
            EndLine = fileContent[-11:] #save the </Profile> tag from the bottom of the file to EndLine variable.
            #print(EndLine) # theEndLine will be appended back after we add our new XML.
            test = fileContent[:-11] #remove the </Profile> tag and write back to the profile the removal of the </Profile> tag
            with open(nameOfFile, "w") as w:
                w.write(test)
            with open(nameOfFile) as t:
                fileContent2 = t.read()
                #print(fileContent2)
            h = open(nameOfFile, "a") #add the new data to the profile along with the </Profile> tag
            h.write(data + "\n"+ EndLine)
            h.close()
Try this:
import operator  # needed for operator.itemgetter below
from simplified_scrapy import SimplifiedDoc, utils
xml = utils.getFileContent('your xml file.xml')
doc = SimplifiedDoc(xml)
root = doc.Profile
nodes = root.children # Get all nodes
count = len(nodes)
if count:
    sorted_nodes = sorted(nodes, key=operator.itemgetter('tag')) # Sort by tag
    sorted_htmls = []
    for node in sorted_nodes:
        sorted_htmls.append(node.outerHtml) # Get the string of sorted nodes
    for i in range(0, count):
        nodes[i].repleaceSelf(sorted_htmls[i]) # Replace the nodes in the original text with the sorted nodes
print(doc.html)
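If adding a third-party package is not an option, a similar result could probably be reached with the standard library alone. The sketch below is an assumption-laden illustration, not a tested Salesforce workflow: it uses a shortened version of the profile from the question, registers the metadata namespace as the default so the output keeps plain tag names, and reorders the root's children by tag. Because sorted() is stable, elements with the same name keep their original relative order.

```python
import xml.etree.ElementTree as ET

NS = 'http://soap.sforce.com/2006/04/metadata'

# Shortened version of the profile shown in the question.
PROFILE_XML = """<Profile xmlns="http://soap.sforce.com/2006/04/metadata">
    <fieldPermissions>
        <editable>false</editable>
        <field>Branch_Queue__c.Cell_Phone_Number__c</field>
        <readable>true</readable>
    </fieldPermissions>
    <recordTypeVisibilities>
        <default>false</default>
        <recordType>Knowledge__kav.RealEstate</recordType>
        <visible>true</visible>
    </recordTypeVisibilities>
    <objectPermissions>
        <allowRead>true</allowRead>
        <object>Branch_Queue__c</object>
    </objectPermissions>
    <classAccesses>
        <apexClass>BranchQueueDisplayList</apexClass>
        <enabled>true</enabled>
    </classAccesses>
</Profile>"""

# Serialize the metadata namespace as the default one (no 'ns0:' prefixes).
ET.register_namespace('', NS)


def sort_children_by_tag(root):
    """Reorder the direct children of root alphabetically by tag name."""
    root[:] = sorted(root, key=lambda child: child.tag)


root = ET.fromstring(PROFILE_XML)
sort_children_by_tag(root)
output = ET.tostring(root, encoding='unicode')
print(output)
```

Whitespace between elements travels with each moved element, so the indentation around the closing tag may shift slightly; on Python 3.9 or later, ET.indent(root) can be called before serializing to normalize it.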

Python - Error when trying to convert xml to csv

I have the code below, which reads an XML file and tries to convert it to CSV. It works fine, but when the data has one additional sub-level it throws the error "child index out of range".
Below is the data set I am trying to work with:
<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<Document>
<Customer>
<CustomerCode>ABC</CustomerCode>
<CustomerName>ABC Co</CustomerName>
<CustomerBusinessHours>
<CustomerBusinessHoursTimeZoneOffset>1.000000</CustomerBusinessHoursTimeZoneOffset>
</CustomerBusinessHours>
</Customer>
</Document>
Code that I have tried building:
import xml.etree.ElementTree as ET
import csv
tree = ET.parse("/users/desktop/sample.xml")
root = tree.getroot()
# open a file for writing
Resident_data = open('/users/desktop/file.csv', 'w')
# create the csv writer object
csvwriter = csv.writer(Resident_data)
resident_head = []
count = 0
for member in root.findall('Customer'):
    resident = []
    address_list = []
    if count == 0:
        CustomerCode = member.find('CustomerCode').tag
        resident_head.append(CustomerCode)
        CustomerName = member.find('CustomerName').tag
        resident_head.append(CustomerName)
        CustomerBusinessHours = member[3].tag
        resident_head.append(CustomerBusinessHours)
        csvwriter.writerow(resident_head)
        count = count + 1
    CustomerCode = member.find('CustomerCode').text
    resident.append(CustomerCode)
    CustomerName = member.find('CustomerName').text
    resident.append(CustomerName)
    CustomerBusinessHours = member[3][1].text
    address_list.append(CustomerBusinessHours)
    CustomerBusinessHoursTimeZoneOffset = member[3][2].text
    address_list.append(CustomerBusinessHoursTimeZoneOffset)
    csvwriter.writerow(resident)
Resident_data.close()
I get the below error:
CustomerBusinessHours = member[3][1].text
IndexError: child index out of range
Expected output:
CustomerCode,CustomerName,CustomerBusinessHoursTimeZoneOffset
ABC,ABC Co,1.000000
The code below collects the data you are looking for by looking up each element by name instead of by position:
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<Document>
    <Customer>
        <CustomerCode>ABC</CustomerCode>
        <CustomerName>ABC Co</CustomerName>
        <CustomerBusinessHours>
            <CustomerBusinessHoursTimeZoneOffset>1.000000</CustomerBusinessHoursTimeZoneOffset>
        </CustomerBusinessHours>
    </Customer>
</Document>'''
tree = ET.fromstring(xml)
for customer in tree.findall('Customer'):
    print(customer.find('CustomerCode').text)
    print(customer.find('CustomerName').text)
    print(customer.find('CustomerBusinessHours').find('CustomerBusinessHoursTimeZoneOffset').text)
Output
ABC
ABC Co
1.000000
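To get from those printed values to the CSV layout the question asks for, the same name-based lookups can feed a csv.writer. The following is a minimal sketch under a few assumptions: it writes to an in-memory buffer instead of a file, the FIELDS mapping and the function name are made up for illustration, and missing elements are written as empty strings.

```python
import csv
import io
import xml.etree.ElementTree as ET

XML_TEXT = """<Document>
    <Customer>
        <CustomerCode>ABC</CustomerCode>
        <CustomerName>ABC Co</CustomerName>
        <CustomerBusinessHours>
            <CustomerBusinessHoursTimeZoneOffset>1.000000</CustomerBusinessHoursTimeZoneOffset>
        </CustomerBusinessHours>
    </Customer>
</Document>"""

# CSV column name -> path relative to each <Customer> element (illustrative mapping).
FIELDS = {
    'CustomerCode': 'CustomerCode',
    'CustomerName': 'CustomerName',
    'CustomerBusinessHoursTimeZoneOffset':
        'CustomerBusinessHours/CustomerBusinessHoursTimeZoneOffset',
}


def customers_to_csv(xml_text):
    """Return CSV text with a header row and one row per <Customer>."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(list(FIELDS.keys()))
    for customer in root.findall('Customer'):
        # findtext returns the default instead of raising when a path is missing
        writer.writerow([customer.findtext(path, default='')
                         for path in FIELDS.values()])
    return out.getvalue()


csv_text = customers_to_csv(XML_TEXT)
print(csv_text)
```

Looking elements up by path rather than by index (member[3][1]) avoids the IndexError, since the lookup no longer depends on how many children precede the target.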

Include file name to be part of xml to csv conversion in Python

I am trying to convert an XML file into CSV, and the code below does just that. However, I also want to include the file name as part of the extract, and I have not been able to add that to this code.
df = pd.DataFrame()
for file in allFiles:
    def iter_docs(cis):
        for docall in cis:
            doc_dict = {}
            for doc in docall:
                tag = [elem.tag for elem in doc]
                txt = [elem.text for elem in doc]
                if len(tag) > 0:
                    doc_dict.update(dict(zip(tag, txt)))
                else:
                    doc_dict[doc.tag] = doc.text
            yield doc_dict
    etree = ET.parse(file)
    df = df.append(pd.DataFrame(list(iter_docs(etree.getroot()))))
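Try
    df = df.append(pd.DataFrame(list(iter_docs(etree.getroot()))).assign(filename=file))
to get a column with the file name added (the column name filename is arbitrary).
By the way, appending inside the loop performs badly, because every call copies the whole frame. DataFrame.append was also deprecated in pandas 1.4 and removed in pandas 2.0.
A better approach is to collect the per-file frames in a list and concatenate them once at the end.
list_of_df = []
for file in allFiles:
    # iter_docs defined as in your code
    etree = ET.parse(file)
    frame = pd.DataFrame(list(iter_docs(etree.getroot())))
    frame['filename'] = file  # one value per file, repeated on every row
    list_of_df.append(frame)
# at the end
df = pd.concat(list_of_df, ignore_index=True)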

python XML to CSV Parse result non

I have this XML but am having trouble parsing it into CSV. I tried a simple print statement but still get no values:
<?xml version="1.0" encoding="UTF-8"?>
<Document xmlns="urn:iso:std:iso:20022:tech:xsd:pain.008.001.02" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CstmrDrctDbtInitn>
<GrpHdr>
<MsgId>1820</MsgId>
<CreDtTm>2016-05-17T11:56:12</CreDtTm>
<NbOfTxs>197</NbOfTxs>
<CtrlSum>136661.81</CtrlSum>
<InitgPty>
<Nm>GS Netherlands CDZ C.V.</Nm>
</InitgPty>
</GrpHdr>
</CstmrDrctDbtInitn>
<CstmrDrctDbtInitn>
<GrpHdr>
<CreDtTm>2016-05-18T10:34:51</CreDtTm>
<NbOfTxs>1</NbOfTxs>
<CtrlSum>758.99</CtrlSum>
<InitgPty>
<Nm>GS Netherlands CDZ C.V.</Nm>
</InitgPty></GrpHdr></CstmrDrctDbtInitn>
</Document>
I want to iterate over the values of each node.
So far I have written the code below:
import xml.etree.ElementTree as ET
import csv
with open("D:\Python\Dave\\17_05_16_1820_DD201606B10_Base.xml") as myFile:
    tree = ET.parse(myFile)
ns = {'d': 'urn:iso:std:iso:20022:tech:xsd:pain.008.001.02'}
# open a file for writing
Resident_data = open('Bank.csv', 'w')
# create the csv writer object
csvwriter = csv.writer(Resident_data)
resident_head = []
#write Header
MsgId = 'MsgId'
resident_head.append(MsgId)
CreDtTm = 'CreDtTm'
resident_head.append(CreDtTm)
NbOfTxs = 'NbOfTxs'
resident_head.append(NbOfTxs)
CtrlSum = 'CtrlSum'
resident_head.append(CtrlSum)
csvwriter.writerow(resident_head)
for member in tree.findall('.//d:Document/d:CstmrDrctDbtInitn/d:GrpHdr/d:MsgId', ns):
    resident = []
    #write values
    MsgId = member.find('MsgId').text
    resident.append(MsgId)
    CreDtTm = member.find('CreDtTm').text
    resident.append(CreDtTm)
    NbOfTxs = member.find('NbOfTxs').text
    resident.append(NbOfTxs)
    CtrlSum = member.find('CtrlSum').text
    resident.append(CtrlSum)
    csvwriter.writerow(resident)
Resident_data.close()
I get no error, and my Bank.csv contains only the header but no data. Please help.
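Two things in the loop above look like probable causes, judging from the XML shown. First, Document is the root element itself, so a path starting with .//d:Document searches for a Document below the root and matches nothing. Second, the path ends at d:MsgId, while the fields to read are siblings of MsgId inside GrpHdr, and the inner find() calls omit the namespace prefix. The sketch below is one possible way to restructure the lookup: it iterates over the GrpHdr elements and reads each field with the namespace map. It parses a shortened copy of the XML from a string, and the function name and FIELDS list are illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the document from the question.
XML_TEXT = """<Document xmlns="urn:iso:std:iso:20022:tech:xsd:pain.008.001.02">
  <CstmrDrctDbtInitn>
    <GrpHdr>
      <MsgId>1820</MsgId>
      <CreDtTm>2016-05-17T11:56:12</CreDtTm>
      <NbOfTxs>197</NbOfTxs>
      <CtrlSum>136661.81</CtrlSum>
    </GrpHdr>
  </CstmrDrctDbtInitn>
  <CstmrDrctDbtInitn>
    <GrpHdr>
      <CreDtTm>2016-05-18T10:34:51</CreDtTm>
      <NbOfTxs>1</NbOfTxs>
      <CtrlSum>758.99</CtrlSum>
    </GrpHdr>
  </CstmrDrctDbtInitn>
</Document>"""

NS = {'d': 'urn:iso:std:iso:20022:tech:xsd:pain.008.001.02'}
FIELDS = ['MsgId', 'CreDtTm', 'NbOfTxs', 'CtrlSum']


def group_headers_to_rows(xml_text):
    """Return one list of field values per <GrpHdr>, '' where a field is absent."""
    root = ET.fromstring(xml_text)  # root is the <Document> element itself
    rows = []
    # The path is relative to <Document>, so it starts at its children.
    for header in root.findall('d:CstmrDrctDbtInitn/d:GrpHdr', NS):
        rows.append([header.findtext(f'd:{name}', default='', namespaces=NS)
                     for name in FIELDS])
    return rows


rows = group_headers_to_rows(XML_TEXT)
for row in rows:
    print(row)
```

Each row can then be passed to csvwriter.writerow(row). Using findtext with a default also covers the second GrpHdr in the sample, which has no MsgId element.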
