I have an XML file that I want to parse. In the file I have 3 unique moid values:
3
2
1
Each of these has one value for metricX. I want to extract these values as a dict in Python.
Something like
Desired Output
{ 3 : {"metricX":100}, 2 : {"metricX":11}, 1 : {"metricX":44}}
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="MeasDataCollection.xsl"?>
<!DOCTYPE mdc SYSTEM "MeasDataCollection.dtd">
<mdc xmlns:HTML="http://www.w3.org/TR/REC-xml">
<md>
<neid>
<neun>936001_STURGEON_BAY_MEYER</neun>
</neid>
<mi>
<mi>
<mts>20170924161500Z</mts>
<gp>900</gp>
<mt>metricX</mt>
<mv>
<moid>3</moid>
<r>100</r>
</mv>
<mv>
<moid>2</moid>
<r>11</r>
</mv>
<mv>
<moid>1</moid>
<r>44</r>
</mv>
</mi>
</mi>
</md>
</mdc>
So far I have tried using Element Tree.
import os
import xml.etree.ElementTree as ET
fullpath = os.getcwd()
os.chdir(r"C:\Users\sss\Documents\Zabbix_work\xml_parsing")
tree = ET.ElementTree(file='smaple.xml')
for elem in tree.iter():
    print(elem.tag, elem.text)
Output so far is -
mdc
md
neid
neun 936001_STURGEON_BAY_MEYER
mi
mi
mts 20170924161500Z
gp 900
mt metricX
mv
moid 3
r 100
mv
moid 2
r 11
mv
moid 1
r 44
I am not sure how to organize this further into a dict.
This should do the trick:
import xml.etree.ElementTree as ET
import os
file_path = os.path.expanduser('~/Desktop/input123.xml') # filepath here
tree = ET.ElementTree(file=file_path)
my_dict = {}
for node in tree.getroot().find('md').find('mi').find('mi').findall('mv'):
    my_dict[int(node.find('moid').text)] = { 'metricX': int(node.find('r').text) }
print(my_dict)
...output:
{3: {'metricX': 100}, 2: {'metricX': 11}, 1: {'metricX': 44}}
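If the file may hold more than one metric, a slightly more general sketch could read the metric name from the <mt> element instead of hard-coding it. SAMPLE is a trimmed copy of the XML from the question, and metrics_by_moid is a hypothetical helper name, not part of any library. It assumes that every <mi> carrying an <mt> child lists its values in sibling <mv> elements, as in the sample.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's XML (assumption: same nesting as the real file)
SAMPLE = """<mdc>
  <md>
    <mi>
      <mi>
        <mt>metricX</mt>
        <mv><moid>3</moid><r>100</r></mv>
        <mv><moid>2</moid><r>11</r></mv>
        <mv><moid>1</moid><r>44</r></mv>
      </mi>
    </mi>
  </md>
</mdc>"""


def metrics_by_moid(xml_text):
    """Map each <moid> to {metric name: value} for every <mi> that has an <mt>."""
    root = ET.fromstring(xml_text)
    result = {}
    for mi in root.iter('mi'):
        metric = mi.findtext('mt')
        if metric is None:
            continue  # the outer <mi> wrapper has no direct <mt> child
        for mv in mi.findall('mv'):
            moid = int(mv.findtext('moid'))
            result.setdefault(moid, {})[metric] = int(mv.findtext('r'))
    return result


print(metrics_by_moid(SAMPLE))
```

Using iter('mi') avoids spelling out the md/mi/mi path, so the sketch should keep working if the wrapper depth changes.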
I am trying to create a dataframe out the XML code as shown below
<Structure>
<Field>
<Field_Name>GAMEREF</Field_Name>
<Field_Type>Numeric</Field_Type>
<Field_Len>4</Field_Len>
<Field_Dec>0</Field_Dec>
</Field>
...
<Field>
<Field_Name>WINLOSS</Field_Name>
<Field_Type>Character</Field_Type>
<Field_Len>1</Field_Len>
<Field_Dec>0</Field_Dec>
</Field>
</Structure>
<Records>
<Record>
<GAMEREF>1217</GAMEREF>
<YEAR>2021</YEAR>
(MORE ELEMENTS I DO NOT CARE ABOUT)
<GAMENO>1</GAMENO>
<WINLOSS>W</WINLOSS>
</Record>
...
<Record>
<GAMEREF>1220</GAMEREF>
<YEAR>2021</YEAR>
(MORE ELEMENTS I DO NOT CARE ABOUT)
<GAMENO>4</GAMENO>
<WINLOSS>L</WINLOSS>
</Record>
</Records>
The Structure section of the XML is irrelevant to the dataframe I am trying to create.
I only want to use the GAMEREF, YEAR, GAMENO, and WINLOSS elements; each Record element contains more elements than these.
I have tried using code as shown below to get this to work, but when I run the code I get the error of "AttributeError: 'NoneType' object has no attribute 'text'"
Code is below.
import pandas as pd
import xml.etree.ElementTree as et
xtree = et.parse("gamedata.xml")
xroot = xtree.getroot()
df_cols = ["GAME REF","YEAR", "GAME NO", "WIN LOSS"]
rows = []
for child in xroot.iter():
    s_gameref = child.find('GAMEREF').text,
    s_year = child.find('YEAR').text,
    s_game_no = child.find('GAMENO').text,
    s_winloss = child.find('WINLOSS').text
    rows.append({"GAME REF": s_gameref,"YEAR": s_year,
                 "GAME NO": s_game_no, "WIN LOSS": s_winloss})
df = pd.DataFrame(rows, columns = df_cols)
The code is based off other stuff I have seen on the Stack and other sites, but nothing is working yet.
Ideal dataframe output is below
GAME REF  YEAR  GAME NO  WIN LOSS
1217      2021  1        W
1218      2021  2        W
1219      2021  3        L
1220      2021  4        L
Thanks
EDIT - NOT SURE WHAT IS GOING ON WITH MY TABLE, BUT IT SHOULD LOOK LIKE THIS
I think the code below is what you are looking for: just loop over the "interesting" sub-elements of each Record. The logic is in the line that starts with data = [...; both loops are there.
import pandas as pd
import xml.etree.ElementTree as ET
xml = '''<r><Structure>
<Field>
<Field_Name>GAMEREF</Field_Name>
<Field_Type>Numeric</Field_Type>
<Field_Len>4</Field_Len>
<Field_Dec>0</Field_Dec>
</Field>
...
<Field>
<Field_Name>WINLOSS</Field_Name>
<Field_Type>Character</Field_Type>
<Field_Len>1</Field_Len>
<Field_Dec>0</Field_Dec>
</Field>
</Structure>
<Records>
<Record>
<GAMEREF>1217</GAMEREF>
<YEAR>2021</YEAR>
<GAMENO>1</GAMENO>
<WINLOSS>W</WINLOSS>
</Record>
<Record>
<GAMEREF>1220</GAMEREF>
<YEAR>2021</YEAR>
<GAMENO>4</GAMENO>
<WINLOSS>L</WINLOSS>
</Record>
</Records></r>'''
fields = {'GAMEREF':'GAME REF', 'YEAR':'YEAR', 'GAMENO':'GAME NO','WINLOSS':'WIN LOSS'}
root = ET.fromstring(xml)
data = [{display_name: rec.find(element_name).text for element_name,display_name in fields.items()} for rec in root.findall('.//Record')]
df = pd.DataFrame(data)
print(df)
output
GAME REF YEAR GAME NO WIN LOSS
0 1217 2021 1 W
1 1220 2021 4 L
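import pandas as pd
import xml.etree.ElementTree as et
xtree = et.parse("gamedata.xml")
xroot = xtree.getroot()
df_cols = ["GAME REF","YEAR", "GAME NO", "WIN LOSS"]
rows = []
# Visit only the Record elements; a bare .iter() also yields elements without these children
for record in xroot.iter('Record'):
    s_gameref = record.find('GAMEREF').text
    s_year = record.find('YEAR').text
    s_game_no = record.find('GAMENO').text
    s_winloss = record.find('WINLOSS').text
    rows.append({"GAME REF": s_gameref,"YEAR": s_year,
                 "GAME NO": s_game_no, "WIN LOSS": s_winloss})
df = pd.DataFrame(rows, columns = df_cols)
Replace the bare .iter() with .iter('Record') so that only Record elements are visited, and remove the trailing commas, which turned each value into a one-element tuple.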
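The AttributeError comes from calling .text on the None that find() returns when an element lacks the requested child. As a minimal sketch of a more tolerant variant, findtext() returns None instead of raising. The <r> wrapper element, the second record with no WINLOSS, and the name records_to_rows are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second Record deliberately has no WINLOSS element
XML = """<r>
  <Structure><Field><Field_Name>GAMEREF</Field_Name></Field></Structure>
  <Records>
    <Record><GAMEREF>1217</GAMEREF><YEAR>2021</YEAR><GAMENO>1</GAMENO><WINLOSS>W</WINLOSS></Record>
    <Record><GAMEREF>1218</GAMEREF><YEAR>2021</YEAR><GAMENO>2</GAMENO></Record>
  </Records>
</r>"""

# XML element name -> column label
FIELDS = {'GAMEREF': 'GAME REF', 'YEAR': 'YEAR', 'GAMENO': 'GAME NO', 'WINLOSS': 'WIN LOSS'}


def records_to_rows(xml_text, fields=FIELDS):
    """Return one dict per Record; a missing child becomes None instead of raising."""
    root = ET.fromstring(xml_text)
    return [
        {label: rec.findtext(tag) for tag, label in fields.items()}
        for rec in root.iter('Record')
    ]


rows = records_to_rows(XML)
print(rows)
```

The resulting list of dicts can be passed to pd.DataFrame(rows), where the missing value should show up as None.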
I am working on XML in which I have to loop through the fields and compare each value with those in the rows before or after it.
<TRANS DESCRIPTION ="" NAME ="EXPRR" >
<FIELD EXPR ="A1" NAME ="SD" PORTTYPE ="INPUT/OUTPUT"/>
<FIELD EXPR ="V" NAME ="DDS" PORTTYPE ="VARIABLE"/>
<FIELD EXPR ="C" NAME ="SSS" PORTTYPE ="OUTPUT"/>
<FIELD EXPR ="SD" NAME ="SS" PORTTYPE ="VARIABLE"/>
<FIELD EXPR ="XX" NAME ="EEEE" PORTTYPE ="INPUT/OUTPUT"/>
</TRANS>
I would like to hold this in memory so that I can look through the values and add a sequence number.
For example:
seq key value
1 A1 SD
2 V DDS
3 C SSS
4 SD SS
5 XX EEEE
Once I have this, I need to check whether a value also appears in the rows below it.
For example, SD (the value in row 1) appears again as the key in row 4, and so on.
Is there any data structure I can use to perform this operation in Python 3?
ONE WAY:
import xml.etree.ElementTree as ET
import xmltodict
import pandas as pd
tree = ET.parse('<your xml file path here>')
xml_data = tree.getroot()
# here you can change the encoding type to be able to set it to the one you need
xmlstr = ET.tostring(xml_data, encoding='utf-8', method='xml')
data_dict = dict(xmltodict.parse(xmlstr))
df = pd.DataFrame(data_dict['TRANS']['FIELD']).drop(columns='@PORTTYPE')
print(df)
OUTPUT:
  @EXPR @NAME
0 A1 SD
1 V DDS
2 C SSS
3 SD SS
4 XX EEEE
You could use collections.defaultdict to collate your data before creating a dataframe:
data = """<TRANS DESCRIPTION ="" NAME ="EXPRR" >
<FIELD EXPR ="A1" NAME ="SD" PORTTYPE ="INPUT/OUTPUT"/>
<FIELD EXPR ="V" NAME ="DDS" PORTTYPE ="VARIABLE"/>
<FIELD EXPR ="C" NAME ="SSS" PORTTYPE ="OUTPUT"/>
<FIELD EXPR ="SD" NAME ="SS" PORTTYPE ="VARIABLE"/>
<FIELD EXPR ="XX" NAME ="EEEE" PORTTYPE ="INPUT/OUTPUT"/>
</TRANS>
"""
import xml.etree.ElementTree as ET
from collections import defaultdict
import pandas as pd
root = ET.fromstring(data)
collection = defaultdict(list)
for child in root:
    collection['key'].append(child.attrib['EXPR'])
    collection['value'].append(child.attrib['NAME'])
pd.DataFrame(collection).rename_axis('seq')
key value
seq
0 A1 SD
1 V DDS
2 C SSS
3 SD SS
4 XX EEEE
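Neither answer covers the second half of the question, checking whether a value shows up again in a later row. A rough sketch with plain lists follows. It assumes that "exists" means an exact match between a row's NAME and a later row's EXPR; fields_with_later_use and the used_in key are hypothetical names.

```python
import xml.etree.ElementTree as ET

DATA = """<TRANS DESCRIPTION="" NAME="EXPRR">
<FIELD EXPR="A1" NAME="SD" PORTTYPE="INPUT/OUTPUT"/>
<FIELD EXPR="V" NAME="DDS" PORTTYPE="VARIABLE"/>
<FIELD EXPR="C" NAME="SSS" PORTTYPE="OUTPUT"/>
<FIELD EXPR="SD" NAME="SS" PORTTYPE="VARIABLE"/>
<FIELD EXPR="XX" NAME="EEEE" PORTTYPE="INPUT/OUTPUT"/>
</TRANS>"""


def fields_with_later_use(xml_text):
    """Number the FIELD rows and record which later rows use each NAME as their EXPR."""
    root = ET.fromstring(xml_text)
    rows = [(seq, field.get('EXPR'), field.get('NAME'))
            for seq, field in enumerate(root.iter('FIELD'), start=1)]
    result = []
    for seq, expr, name in rows:
        # Exact-match comparison against the EXPR of every row below this one
        used_in = [later_seq for later_seq, later_expr, _ in rows
                   if later_seq > seq and later_expr == name]
        result.append({'seq': seq, 'key': expr, 'value': name, 'used_in': used_in})
    return result


for row in fields_with_later_use(DATA):
    print(row)
```

For large inputs a dict from EXPR to sequence numbers would avoid the nested scan; the quadratic loop is kept here for readability.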
I am trying to build an XML file from an Excel spreadsheet using Python but am having trouble building the structure. The XML schema is specific to one piece of software, so it would be easier to write the opening few tags and the closing few to the XML file as variables, as shown below. They are constant so are pulled from the "
I believe the script needs to loop through another sheet, the ".XML Framework" sheet, to build the .xml structure, as these are the values that will ultimately change. The structure of this sheet is provided below.
Here is the .xml structure. The Python outputs it correctly up to the unique values; the changing values are shown in bold. This shows just one row of data from the workbook. When the workbook has a second row, the .xml structure repeats again where it starts with .
The data structure in the excel sheet ".XML Framework" is:
col 1 = **equals**
col 2 = **74**
col 3 = **Data**
col 4 = col 3
col 5 = **Name 07**
col 6 = col 5
col 7 = **wstring**
col 8 = /**SM15-HVAC-SUPP-TM-37250-ST**
THIS IS THE DESIRED XML STRUCTURE
<?xml version="1.0" encoding="UTF-8" ?>
<exchange xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://download.autodesk.com/us/navisworks/schemas/nw-exchange-12.0.xsd" units="m" filename="" filepath="">
<selectionsets>
<selectionset name="Dev_1">
<findspec mode="all" disjoint="0">
<conditions>
<condition test="**equals**" flags="**74**">
<category>
<name internal="**Data**">**Data**</name>
</category>
<property>
<name internal="**Name 07**">**Name 07**</name>
</property>
<value>
<data type="**wstring**">/**SM15-HVAC-SUPP-TM-37250-ST**</data>
</value>
</condition>
</conditions>
<locator>/</locator>
</findspec>
</selectionset>
</selectionsets>
</exchange>
Here is my attempt in Python:
path = (r"C:\\Users\\ciara\\desktop\\")
book = os.path.join(path + "Search_Set.xlsm")
wb = openpyxl.load_workbook(book)
sh = wb.get_sheet_by_name('.XML Framework')
df1 = pd.read_excel(book, "<CLEAN>", header=None)
#opening 5 lines of .xml search
print(df1)
cV1 = df1.iloc[0,0] #xml header
print (cV1)
cV2 = df1.iloc[1,0] #<exchange>
print (cV2)
cV3 = df1.iloc[2,0] #<selectionsets>
print (cV3)
cV4 = df1.iloc[3,0] #<selection set name>
print (cV4)
cV5 = df1.iloc[4,0] #<findspec mode>
print (cV5)
cV6 = df1.iloc[5,0] #<findspec mode>
print (cV6)
E = lxml.builder.ElementMaker()
root = ET.Element(cV1)
doc0 = ET.SubElement(root, cV2)
doc1 = ET.SubElement(doc0, cV3)
doc2 = ET.SubElement(doc1, cV4)
doc3 = ET.SubElement(doc2, cV5)
doc4 = ET.SubElement(doc3, cV6)
the_doc = root(
    doc0(
        doc1(
            doc2(
                doc3(
                    FIELD1('condition test=', name='blah'),
                    FIELD2('some value2', name='asdfasd'),
                )
            )
        )
    )
)
print (lxml.etree.tostring(the_doc, pretty_print=True))
tree = ET.ElementTree(root)
tree.write("filename.xml")
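One possible approach is to build the tree element by element with xml.etree.ElementTree rather than treating the header lines as tag names. The sketch below makes several assumptions: ROWS is a hypothetical stand-in for the rows read from the ".XML Framework" sheet, build_exchange is an invented helper name, and each spreadsheet row is assumed to become one <condition> inside a single selection set (the question does not say where the structure repeats).

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
SCHEMA = "http://download.autodesk.com/us/navisworks/schemas/nw-exchange-12.0.xsd"

# Hypothetical stand-in for the sheet rows:
# (test, flags, category, property, data type, value)
ROWS = [
    ("equals", "74", "Data", "Name 07", "wstring", "/SM15-HVAC-SUPP-TM-37250-ST"),
]


def build_exchange(rows, set_name="Dev_1"):
    """Build the <exchange> tree; assumes one <condition> per spreadsheet row."""
    ET.register_namespace("xsi", XSI)  # serialize the namespace with the xsi prefix
    exchange = ET.Element("exchange", {
        f"{{{XSI}}}noNamespaceSchemaLocation": SCHEMA,
        "units": "m", "filename": "", "filepath": "",
    })
    sets = ET.SubElement(exchange, "selectionsets")
    selection = ET.SubElement(sets, "selectionset", name=set_name)
    findspec = ET.SubElement(selection, "findspec", mode="all", disjoint="0")
    conditions = ET.SubElement(findspec, "conditions")
    for test, flags, category, prop, dtype, value in rows:
        cond = ET.SubElement(conditions, "condition", test=test, flags=flags)
        cat_name = ET.SubElement(ET.SubElement(cond, "category"), "name", internal=category)
        cat_name.text = category
        prop_name = ET.SubElement(ET.SubElement(cond, "property"), "name", internal=prop)
        prop_name.text = prop
        data = ET.SubElement(ET.SubElement(cond, "value"), "data", type=dtype)
        data.text = value
    ET.SubElement(findspec, "locator").text = "/"
    return exchange


root = build_exchange(ROWS)
print(ET.tostring(root, encoding="unicode"))
```

To write the file, ET.ElementTree(root).write("filename.xml", encoding="UTF-8", xml_declaration=True) should add the XML declaration. The rows themselves could come from openpyxl or pandas, whichever already reads the workbook.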
I am trying to find all ancestors of a node.
My XML:
xmldata="""
<OrganizationTreeInfo xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/YSM.PMS.Web.Service.DataTransfer.Models">
<Name>Parent</Name>
<OrganizationId>4345</OrganizationId>
<Children>
<OrganizationTreeInfo>
<Name>A</Name>
<OrganizationId>123</OrganizationId>
<Children>
<OrganizationTreeInfo>
<Name>B</Name>
<OrganizationId>54</OrganizationId>
<Children/>
</OrganizationTreeInfo>
</Children>
</OrganizationTreeInfo>
<OrganizationTreeInfo>
<Name>C</Name>
<OrganizationId>34</OrganizationId>
<Children>
<OrganizationTreeInfo>
<Name>D</Name>
<OrganizationId>32323</OrganizationId>
<Children>
<OrganizationTreeInfo>
<Name>E</Name>
<OrganizationId>3234</OrganizationId>
<Children/>
</OrganizationTreeInfo>
</Children>
</OrganizationTreeInfo>
</Children>
</OrganizationTreeInfo>
</Children>
</OrganizationTreeInfo>
"""
For example, if I input an OrganizationId of 3234, the output should be:
{'parent':4345,'C':34,'D':32323,'E':3234 }
Here is my try,
root = ET.fromstring(xmldata)
for target in root.xpath('.//OrganizationTreeInfo/OrganizationId[text()="3234"]'):
    d = {
        dept.find('Name').text: int(dept.find('OrganizationId').text)
        for dept in target.xpath('ancestor-or-self::OrganizationTreeInfo')
    }
    print(d)
But it does not give any output, and I cannot work out what is wrong with it.
You are not getting any output because of the default namespace declared on the root element:
xmlns="http://schemas.datacontract.org/2004/07/YSM.PMS.Web.Service.DataTransfer.Models"
The following code passes the namespace to xpath() and find():
code:
import lxml.etree as ET
root = ET.fromstring(xmldata)
result = {}
count = 1
namespaces1={'xmlns':'http://schemas.datacontract.org/2004/07/YSM.PMS.Web.Service.DataTransfer.Models',}
for target in root.xpath('.//xmlns:OrganizationTreeInfo/xmlns:OrganizationId[text()="3234"]',\
                         namespaces=namespaces1):
    result[count] = {}
    for dept in target.xpath('ancestor-or-self::xmlns:OrganizationTreeInfo', namespaces=namespaces1):
        result[count][dept.find('xmlns:Name', namespaces=namespaces1).text] = int(dept.find('xmlns:OrganizationId', namespaces=namespaces1).text)
    count += 1
import pprint
pprint.pprint(result)
Output:
:~/workspace/vtestproject/study$ python test1.py
{1: {'C': 34, 'D': 32323, 'E': 3234, 'Parent': 4345}}
Alternatively, replace the xmlns= string with a temporary string so that the elements are no longer in a namespace.
code:
import lxml.etree as ET
new_xmldata = xmldata.replace("xmlns=", "xmlnamespace=")
root = ET.fromstring(new_xmldata)#, namespace="{http://schemas.datacontract.org/2004/07/YSM.PMS.Web.Service.DataTransfer.Models}")
result = {}
count = 1
for target in root.xpath('.//OrganizationTreeInfo/OrganizationId[text()="3234"]'):
    result[count] = {}
    for dept in target.xpath('ancestor-or-self::OrganizationTreeInfo'):
        result[count][dept.find('Name').text] = int(dept.find('OrganizationId').text)
    count += 1
import pprint
pprint.pprint(result)
Output:
:~/workspace/vtestproject/study$ python test1.py
{1: {'C': 34, 'D': 32323, 'E': 3234, 'Parent': 4345}}
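If lxml is not available, the standard library's ElementTree has no ancestor axis, but a child-to-parent map gives a similar result. The following is only a sketch under that assumption: XML is a shortened copy of the question's document, and ancestors_by_id and the "o" prefix are hypothetical names.

```python
import xml.etree.ElementTree as ET

NS = {'o': 'http://schemas.datacontract.org/2004/07/YSM.PMS.Web.Service.DataTransfer.Models'}

# Shortened copy of the question's XML, keeping the default namespace
XML = """<OrganizationTreeInfo xmlns="http://schemas.datacontract.org/2004/07/YSM.PMS.Web.Service.DataTransfer.Models">
  <Name>Parent</Name><OrganizationId>4345</OrganizationId>
  <Children>
    <OrganizationTreeInfo>
      <Name>C</Name><OrganizationId>34</OrganizationId>
      <Children>
        <OrganizationTreeInfo>
          <Name>E</Name><OrganizationId>3234</OrganizationId>
          <Children/>
        </OrganizationTreeInfo>
      </Children>
    </OrganizationTreeInfo>
  </Children>
</OrganizationTreeInfo>"""


def ancestors_by_id(xml_text, org_id):
    """Return {Name: OrganizationId} from the root down to the matching node."""
    root = ET.fromstring(xml_text)
    # ElementTree elements do not know their parent, so build the map once
    parent_of = {child: parent for parent in root.iter() for child in parent}
    tag = f"{{{NS['o']}}}OrganizationTreeInfo"
    for node in root.iter(tag):
        if node.findtext('o:OrganizationId', namespaces=NS) == str(org_id):
            chain = []
            while node is not None:
                if node.tag == tag:  # skip the intermediate <Children> wrappers
                    chain.append((node.findtext('o:Name', namespaces=NS),
                                  int(node.findtext('o:OrganizationId', namespaces=NS))))
                node = parent_of.get(node)
            return dict(reversed(chain))
    return {}


print(ancestors_by_id(XML, 3234))
```

The function returns an empty dict when no OrganizationId matches, so callers can tell "not found" apart from a one-entry result for the root.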
I am trying to parse an XML document that contains repeating child elements using Python. When I attempt to parse the data, it creates an empty file. If I comment out the repeating child elements code (see bolded section in python script below), the document generates correctly. Can someone help?
XML:
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<FRPerformance xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<FRPerformanceShareClassCurrency>
<FundCode>00190</FundCode>
<CurrencyID>USD</CurrencyID>
<FundShareClassCode>A</FundShareClassCode>
<ReportPeriodFrequency>Quarterly</ReportPeriodFrequency>
<ReportPeriodEndDate>06/30/2012</ReportPeriodEndDate>
<Net>
<Annualized>
<Year1>-4.909000000</Year1>
<Year3>10.140000000</Year3>
<Year5>-22.250000000</Year5>
<Year10>-7.570000000</Year10>
<Year15>-4.730000000</Year15>
<Year20>-0.900000000</Year20>
<SI>1.900000000</SI>
</Annualized>
</Net>
<Gross>
<Annualized>
<Month3>1.279000000</Month3>
<YTD>7.294000000</YTD>
<Year1>-0.167000000</Year1>
<Year3>11.940000000</Year3>
<Year5>-21.490000000</Year5>
<Year10>-7.120000000</Year10>
<Year15>-4.420000000</Year15>
<Year20>-0.660000000</Year20>
<SI>2.110000000</SI>
</Annualized>
<Cumulative>
<Month1Back>2.288000000</Month1Back>
<Month2Back>-1.587000000</Month2Back>
<Month3Back>0.610000000</Month3Back>
<CurrentYear>7.294000000</CurrentYear>
<Year1Back>-2.409000000</Year1Back>
<Year2Back>13.804000000</Year2Back>
<Year3Back>20.287000000</Year3Back>
<Year4Back>-78.528000000</Year4Back>
<Year5Back>-0.101000000</Year5Back>
<Year6Back>9.193000000</Year6Back>
<Year7Back>2.659000000</Year7Back>
<Year8Back>9.208000000</Year8Back>
<Year9Back>25.916000000</Year9Back>
<Year10Back>-3.612000000</Year10Back>
</Cumulative>
<HistoricReturns>
<HistoricReturns_Item>
<Date>Fri, 28 Feb 1997 00:00:00 -0600</Date>
<Return>32058.090000000</Return>
</HistoricReturns_Item>
<HistoricReturns_Item>
<Date>Fri, 28 Feb 2003 00:00:00 -0600</Date>
<Return>36415.110000000</Return>
</HistoricReturns_Item>
<HistoricReturns_Item>
<Date>Fri, 29 Feb 2008 00:00:00 -0600</Date>
<Return>49529.290000000</Return>
</HistoricReturns_Item>
<HistoricReturns_Item>
<Date>Fri, 30 Apr 1993 00:00:00 -0600</Date>
<Return>21621.500000000</Return>
</HistoricReturns_Item>
</HistoricReturns>
Python script
import sys
import xml.etree.ElementTree as ET
## Create command line arguments for XML file and tagName
xmlFile = sys.argv[1]
tagName = sys.argv[2]
tree = ET.parse(xmlFile)
root = tree.getroot()
## Setup the file for output
saveout = sys.stdout
output_file = open('parsedXML.csv', 'w')
sys.stdout = output_file
## Parse XML
for node in root.findall(tagName):
    fundCode = node.find('FundCode').text
    curr = node.find('CurrencyID').text
    shareClass = node.find('FundShareClassCode').text
    for node2 in node.findall('./Net/Annualized'):
        year1 = node2.findtext('Year1')
        year3 = node2.findtext('Year3')
        year5 = node2.findtext('Year5')
        year10 = node2.findtext('Year10')
        year15 = node2.findtext('Year15')
        year20 = node2.findtext('Year20')
        SI = node2.findtext('SI')
    for node3 in node.findall('./Gross'):
        for node4 in node3.findall('./Annualized'):
            month3 = node4.findtext('Month3')
            ytd = node4.findtext('YTD')
            year1g = node4.findtext('Year1')
            year3g = node4.findtext('Year3')
            year5g = node4.findtext('Year5')
            year10g = node4.findtext('Year10')
            year15g = node4.findtext('Year15')
            year20g = node4.findtext('Year20')
            SIg = node4.findtext('SI')
        for node5 in node3.findall('./Cumulative'):
            month1b = node5.findtext('Month1Back')
            month2b = node5.findtext('Month2Back')
            month3b = node5.findtext('Month3Back')
            curYear = node5.findtext('CurrentYear')
            year1b = node5.findtext('Year1Back')
            year2b = node5.findtext('Year2Back')
            year3b = node5.findtext('Year3Back')
            year4b = node5.findtext('Year4Back')
            year5b = node5.findtext('Year5Back')
            year6b = node5.findtext('Year6Back')
            year7b = node5.findtext('Year7Back')
            year8b = node5.findtext('Year8Back')
            year9b = node5.findtext('Year9Back')
            year10b = node5.findtext('Year10Back')
    **for node6 in node.findall('./HistoricReturns'):
        for node7 in node6.findall('./HistoricReturns_Item'):
            hDate = node7.findall('Date')
            hReturn = node7.findall('Return')**
            print(fundCode, curr, shareClass,year1, year3, year5, year10, year15, year15, year20, SI,month3, ytd, year1g, year3g, year5g, year10g, year15g, year20g, SIg, month1b, month2b, month3b, curYear, year1b, year2b, year3b, year4b, year5b, year6b, year7b, year8b,year9b,year10b, hDate, hReturn)
The sample XML and the Python code don't match up in terms of structure. Either:
- you're missing a closing </Gross> tag from the XML (which should come before the <HistoricReturns> section starts), in which case the code is correct, or
- the code should be for node6 in node3.findall('./HistoricReturns'): i.e. node3 instead of node.
N.B. The XML sample isn't complete (it isn't well-formed XML) because it's missing closing tags for Gross, FRPerformanceShareClassCurrency and FRPerformance, so this makes it impossible to answer the question definitively. Hope this helps though.
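To illustrate the second possibility, here is a small sketch that assumes <HistoricReturns> really is nested inside <Gross>. The XML string is a cut-down, well-formed stand-in for the real file, historic_rows is a hypothetical helper, and the csv module replaces the redirected print so that each repeated item becomes its own row.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Cut-down stand-in for the real file (assumption: HistoricReturns sits under Gross)
XML = """<FRPerformance>
  <FRPerformanceShareClassCurrency>
    <FundCode>00190</FundCode>
    <CurrencyID>USD</CurrencyID>
    <Gross>
      <HistoricReturns>
        <HistoricReturns_Item>
          <Date>Fri, 28 Feb 1997 00:00:00 -0600</Date>
          <Return>32058.090000000</Return>
        </HistoricReturns_Item>
        <HistoricReturns_Item>
          <Date>Fri, 28 Feb 2003 00:00:00 -0600</Date>
          <Return>36415.110000000</Return>
        </HistoricReturns_Item>
      </HistoricReturns>
    </Gross>
  </FRPerformanceShareClassCurrency>
</FRPerformance>"""


def historic_rows(xml_text, tag_name='FRPerformanceShareClassCurrency'):
    """Return one [fund, currency, date, return] row per HistoricReturns_Item."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.findall(tag_name):
        fund = node.findtext('FundCode')
        curr = node.findtext('CurrencyID')
        # findtext gives the element text; findall would give a list of elements
        for item in node.findall('./Gross/HistoricReturns/HistoricReturns_Item'):
            rows.append([fund, curr, item.findtext('Date'), item.findtext('Return')])
    return rows


buffer = io.StringIO()
csv.writer(buffer).writerows(historic_rows(XML))
print(buffer.getvalue())
```

If the real file has <HistoricReturns> as a sibling of <Gross> instead, the path would become './HistoricReturns/HistoricReturns_Item'.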