How to generate in memory data/table in python? - python

working on XML, for which I will have to loop through and compare the values before or afterwords.
<TRANS DESCRIPTION ="" NAME ="EXPRR" >
<FIELD EXPR ="A1" NAME ="SD" PORTTYPE ="INPUT/OUTPUT"/>
<FIELD EXPR ="V" NAME ="DDS" PORTTYPE ="VARIABLE"/>
<FIELD EXPR ="C" NAME ="SSS" PORTTYPE ="OUTPUT"/>
<FIELD EXPR ="SD" NAME ="SS" PORTTYPE ="VARIABLE"/>
<FIELD EXPR ="XX" NAME ="EEEE" PORTTYPE ="INPUT/OUTPUT"/>
</TRANS>
I would like to put this in the temp memory where I can look through the values and add a sequence.
for ex.
seq key value
1 A1 SD
2 V DDS
3 C SSS
4 SD SSS
5 XX EEEE
Once I have this I will have to compare if value exists in the below rows.
For example SD exists in below row. so on.
Is there any data structure I can use to perform this operation in Python 3 ?.

ONE WAY:
import xml.etree.ElementTree as ET
import xmltodict
import pandas as pd
tree = ET.parse('<your xml file path here>')
xml_data = tree.getroot()
# here you can change the encoding type to be able to set it to the one you need
xmlstr = ET.tostring(xml_data, encoding='utf-8', method='xml')
data_dict = dict(xmltodict.parse(xmlstr))
df = pd.DataFrame(data_dict['TRANS']['FIELD']).drop('#PORTTYPE', 1)
print(df)
OUTPUT:
#EXPR #NAME
0 A1 SD
1 V DDS
2 C SSS
3 SD SS
4 XX EEEE

You could use collections.defaultdict to collate your data before creating a dataframe :
data = """<TRANS DESCRIPTION ="" NAME ="EXPRR" >
<FIELD EXPR ="A1" NAME ="SD" PORTTYPE ="INPUT/OUTPUT"/>
<FIELD EXPR ="V" NAME ="DDS" PORTTYPE ="VARIABLE"/>
<FIELD EXPR ="C" NAME ="SSS" PORTTYPE ="OUTPUT"/>
<FIELD EXPR ="SD" NAME ="SS" PORTTYPE ="VARIABLE"/>
<FIELD EXPR ="XX" NAME ="EEEE" PORTTYPE ="INPUT/OUTPUT"/>
</TRANS>
"""
import xml.etree.ElementTree as ET
root = ET.fromstring(data)
from collections import defaultdict
collection = defaultdict(list)
for child in root:
collection['key'].append(child.attrib['EXPR'])
collection['value'].append(child.attrib['NAME'])
pd.DataFrame(collection).rename_axis('seq')
key value
seq
0 A1 SD
1 V DDS
2 C SSS
3 SD SS
4 XX EEEE

Related

Create dataframe of certain XML element's text python pandas

I am trying to create a dataframe out the XML code as shown below
<Structure>
<Field>
<Field_Name>GAMEREF</Field_Name>
<Field_Type>Numeric</Field_Type>
<Field_Len>4</Field_Len>
<Field_Dec>0</Field_Dec>
</Field>
...
<Field>
<Field_Name>WINLOSS</Field_Name>
<Field_Type>Character</Field_Type>
<Field_Len>1</Field_Len>
<Field_Dec>0</Field_Dec>
</Field>
</Structure>
<Records>
<Record>
<GAMEREF>1217</GAMEREF>
<YEAR>2021</YEAR>
(MORE ELEMENTS I DO NOT CARE ABOUT)
<GAMENO>1</GAMENO>
<WINLOSS>W</WINLOSS>
</Record>
...
<Record>
<GAMEREF>1220</GAMEREF>
<YEAR>2021</YEAR>
(MORE ELEMENTS I DO NOT CARE ABOUT)
<GAMENO>4</GAMENO>
<WINLOSS>L</WINLOSS>
</Record>
</Records>
The structure section of the XML code that is irrelevant to the dataframe I am trying to create.
I am trying to only use the XML elements of GAMEREF, YEAR, GAMENO, and WINLOSS as there are more in the XML for the Record elements.
I have tried using code as shown below to get this to work, but when I run the code I get the error of "AttributeError: 'NoneType' object has no attribute 'text'"
Code is below.
import pandas as pd
import xml.etree.ElementTree as et
xtree = et.parse("gamedata.xml")
xroot = xtree.getroot()
df_cols = ["GAME REF","YEAR", "GAME NO", "WIN LOSS"]
rows = []
for child in xroot.iter():
s_gameref = child.find('GAMEREF').text,
s_year = child.find('YEAR').text,
s_game_no = child.find('GAMENO').text,
s_winloss = child.find('WINLOSS').text
rows.append({"GAME REF": s_gameref,"YEAR": s_year,
"GAME NO": s_game_no, "WIN LOSS": s_winloss})
df = pd.DataFrame(rows, columns = df_cols)
The code is based off other stuff I have seen on the Stack and other sites, but nothing is working yet.
Ideal dataframe output is below
GAME REF
YEAR
GAME NO
WIN LOSS
1217
2021
1
W
1218
2021
2
W
1219
2021
3
L
1220
2021
4
L
Thanks
EDIT - NOT SURE WHAT IS GOING ON WITH MY TABLE, BUT IT SHOULD LOOK LIKE THIS
I think the below is what you are looking for. (Just loop over the "interesting" sub elements of Record). The logic of the code is in the line that starts with data = [.... The 2 loops can be found there.
import pandas as pd
import xml.etree.ElementTree as ET
xml = '''<r><Structure>
<Field>
<Field_Name>GAMEREF</Field_Name>
<Field_Type>Numeric</Field_Type>
<Field_Len>4</Field_Len>
<Field_Dec>0</Field_Dec>
</Field>
...
<Field>
<Field_Name>WINLOSS</Field_Name>
<Field_Type>Character</Field_Type>
<Field_Len>1</Field_Len>
<Field_Dec>0</Field_Dec>
</Field>
</Structure>
<Records>
<Record>
<GAMEREF>1217</GAMEREF>
<YEAR>2021</YEAR>
<GAMENO>1</GAMENO>
<WINLOSS>W</WINLOSS>
</Record>
<Record>
<GAMEREF>1220</GAMEREF>
<YEAR>2021</YEAR>
<GAMENO>4</GAMENO>
<WINLOSS>L</WINLOSS>
</Record>
</Records></r>'''
fields = {'GAMEREF':'GAME REF', 'YEAR':'YEAR', 'GAMENO':'GAME NO','WINLOSS':'WIN LOSS'}
root = ET.fromstring(xml)
data = [{display_name: rec.find(element_name).text for element_name,display_name in fields.items()} for rec in root.findall('.//Record')]
df = pd.DataFrame(data)
print(df)
output
GAME REF YEAR GAME NO WIN LOSS
0 1217 2021 1 W
1 1220 2021 4 L
import pandas as pd
import xml.etree.ElementTree as et
xtree = et.parse("gamedata.xml")
xroot = xtree.getroot()
df_cols = ["GAME REF","YEAR", "GAME NO", "WIN LOSS"]
rows = []
for record in xroot:
s_gameref = record.find('GAMEREF').text
s_year = record.find('YEAR').text
s_game_no = record.find('GAMENO').text
s_winloss = record.find('WINLOSS').text
rows.append({"GAME REF": s_gameref,"YEAR": s_year,
"GAME NO": s_game_no, "WIN LOSS": s_winloss})
df = pd.DataFrame(rows, columns = df_cols)
Remove .iter()

Building XML from excel data with Python

I am trying to build an xml file from an excel spreadsheet using python but am having trouble building the structure. The xml schema is unique to a software so the opening few tags and ending few would be easier to be written to the xml file just as variables, shown below. They are constant so are pulled from the "
I believe the script neeeds to loop through another sheet, being the ".XML Framework" sheet to build the .xml structure as these are the values which will be ultimately changing. The structure of this sheet is provided below.
here is the .xml structure, from which the python is outputting well up to the unique values, and the changing values are shown in bold. This shows just one row of the data from the workbook. When the workbook has a second row, the .xml structure repeats again where it starts with .
The data structure in the excel sheet ".XML Framework" is:
col 1 = **equals**
col 2 = **74**
col 3 = **Data**"
col 4 = col 3
col 5 = **Name 07**
col 6 = col 5
col 7 = **wstring**
col 8 = /**SM15-HVAC-SUPP-TM-37250-ST**
THIS IS THE DESIRED XML STRUCTURE
<?xml version="1.0" encoding="UTF-8" ?>
<exchange xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://download.autodesk.com/us/navisworks/schemas/nw-exchange-12.0.xsd" units="m" filename="" filepath="">
<selectionsets>
<selectionset name="Dev_1">
<findspec mode="all" disjoint="0">
<conditions>
<condition test="**equals**" flags="**74**">
<category>
<name internal="**Data**">**Data**</name>
</category>
<property>
<name internal="**Name 07**">**Name 07**</name>
</property>
<value>
<data type="**wstring**">/**SM15-HVAC-SUPP-TM-37250-ST**</data>
</value>
</condition>
</conditions>
<locator>/</locator>
</findspec>
</exchange>
Here is my attempt from the python:
path = (r"C:\\Users\\ciara\\desktop\\")
book = os.path.join(path + "Search_Set.xlsm")
wb = openpyxl.load_workbook(book)
sh = wb.get_sheet_by_name('.XML Framework')
df1 = pd.read_excel(book, "<CLEAN>", header=None)
#opening 5 lines of .xml search
print(df1)
cV1 = df1.iloc[0,0] #xml header
print (cV1)
cV2 = df1.iloc[1,0] #<exchange>
print (cV2)
cV3 = df1.iloc[2,0] #<selectionsets>
print (cV3)
cV4 = df1.iloc[3,0] #<selection set name>
print (cV4)
cV5 = df1.iloc[4,0] #<findspec mode>
print (cV5)
cV6 = df1.iloc[5,0] #<findspec mode>
print (cV6)
E = lxml.builder.ElementMaker()
root = ET.Element(cV1)
doc0 = ET.SubElement(root, cV2)
doc1 = ET.SubElement(doc0, cV3)
doc2 = ET.SubElement(doc1, cV4)
doc3 = ET.SubElement(doc2, cV5)
doc4 = ET.SubElement(doc3, cV6)
the_doc = root(
doc0(
doc1(
doc2(
doc3(
FIELD1('condition test=', name='blah'),
FIELD2('some value2', name='asdfasd'),
)
)
)
)
)
print (lxml.etree.tostring(the_doc, pretty_print=True))
tree = ET.ElementTree(root)
tree.write("filename.xml")

XML to CSV in PYTHON: Extract series of subnodes for every node

My goal is to convert an .XML file into a .CSV file.
This part of the code is already functional.
However, I also want to extract the sub-sub-nodes of one of the "father" nodes.
Maybe an example would be more self explanatory;
Here is the structure of my XML:
<nedisCatalogue>
<headerInfo>
<feedVersion>1-0</feedVersion>
<dateCreated>2018-01-22T23:37:01+0100</dateCreated>
<supplier>Nedis_BENED</supplier>
<locale>nl_BE</locale>
</headerInfo>
<productList>
<product>
<nedisPartnr><![CDATA[VS-150/63BA]]></nedisPartnr>
<nedisArtlid>17005</nedisArtlid>
<vendorPartnr><![CDATA[TONFREQ-ELKOS / BIPOL 150, 5390]]></vendorPartnr>
<brand><![CDATA[Visaton]]></brand>
<EAN>4007540053905</EAN>
<intrastatCode>8532220000</intrastatCode>
<UNSPSC>52161514</UNSPSC>
<headerText><![CDATA[Crossover Foil capacitor]]></headerText>
<internetText><![CDATA[Bipolaire elco met een ruwe folie en een zeer goede prijs/kwaliteits-verhouding voor de bouw van cross-overs. 63 Vdc, 10% tolerantie.]]></internetText>
<generalText><![CDATA[Dimensions 16 x 35 mm
]]></generalText>
<images>
<image type="2" order="15">767736.JPG</image>
</images>
<attachments>
</attachments>
<categories>
<tree name="Internet_Tree_ISHP">
<entry depth="001" id="1067858"><![CDATA[Audio]]></entry>
<entry depth="002" id="1067945"><![CDATA[Speakers]]></entry>
<entry depth="003" id="1068470"><![CDATA[Accessoires]]></entry>
</tree>
</categories>
<properties>
<property id="360" multiplierID="" unitID="" valueID=""><![CDATA[...]]></property>
</properties>
<status>
<code status="NORMAL"></code>
</status>
<packaging quantity="1" weight="8"></packaging>
<introductionDate>2015-10-26</introductionDate>
<serialnumberKeeping>N</serialnumberKeeping>
<priceLevels>
<normalPricing from="2017-02-13" to="2018-01-23">
<price level="1" moq="1" currency="EUR">2.48</price>
</normalPricing>
<specialOfferPricing></specialOfferPricing>
<goingPriceInclVAT currency="EUR" quantity="1">3.99</goingPriceInclVAT>
</priceLevels>
<tax>
</tax>
<stock>
<inStockLocal>25</inStockLocal>
<inStockCentral>25</inStockCentral>
<ATP>
<nextExpectedStockDateLocal></nextExpectedStockDateLocal>
<nextExpectedStockDateCentral></nextExpectedStockDateCentral>
</ATP>
</stock>
</product>
....
</nedisCatalogue>
And here is the code that I have now:
import xml.etree.ElementTree as ET
import csv
tree = ET.parse("/Users/BE07861/Documents/nedis_catalog_2018-01-23_nl_BE_32191_v1-0_xml")
root = tree.getroot()
f = open('/Users/BE07861/Documents/test2.csv', 'w')
csvwriter = csv.writer(f, delimiter='ç')
count = 0
head = ['Nedis Part Number', 'Nedis Article ID', 'Vendor Part Number', 'Brand', 'EAN', 'Header text', 'Internet Text', 'General Text', 'categories']
prdlist = root[1]
prdct = prdlist[5]
cat = prdct[12]
tree1=cat[0]
csvwriter.writerow(head)
for time in prdlist.findall('product'):
row = []
nedis_number = time.find('nedisPartnr').text
row.append(nedis_number)
nedis_art_id = time.find('nedisArtlid').text
row.append(nedis_art_id)
vendor_part_nbr = time.find('vendorPartnr').text
row.append(vendor_part_nbr)
Brand = time.find('brand').text
row.append(Brand)
ean = time.find('EAN').text
row.append(ean)
header_text = time.find('headerText').text
row.append(header_text)
internet_text = time.find('internetText').text
row.append(internet_text)
general_text = time.find('generalText').text
row.append(general_text)
categ = time.find('categories').find('tree').find('entry').text
row.append(categ)
csvwriter.writerow(row)
f.close()
If you run the code, you'll see that I only retrieve the first "entry" of the categories/tree; which is normal. However, I don't know how to create a loop that, for every node "categories", creates new columns such as categories1, categories2 & categories3 with the values: "entry".
My result should look like this
Nedis Part Number Nedis Article ID Vendor Part Number
VS-150/63BA 17005 TONFREQ-ELKOS / BIPOL 150, 5390
Brand EAN Header text Internet Text
Visaton 4,00754E+12 Crossover Foil capacitor Bipolaire elco …
General Text Category1 Categroy2 Category3
Dimensions 16 x 35 mm Audio Speakers Accessoires
I've really tried my best but didn't manage to find the solution.
Any help would be very much appreciated!!! :)
Thanks a lot,
Allan
I think this is what you're looking for:
for child in time.find('categories').find('tree'):
categ = child.text
row.append(categ)
Here's a solution that loops through the xml once to figure out how many headers to add, adds the headers, and then loops through each product's category list:
**Updated to iterate through images in addition to categories. This is the biggest difference:
for child in time.find('categories').find('tree'):
categ = child.text
row.append(categ)
curcat += 1
while curcat < maxcat:
row.append('')
curcat += 1
It's going to figure out the maximum number of categories on a single record and then and that many columns. If a particular record has less categories, this code will stick blank values in as placeholders so the column headers always line up with the data.
For instance:
Cat1 Cat2 Cat3 Img1 Img2 Img3
A B C 1 2 3
D E <blank> 4 <blank> <blank>
Here's the full solution:
import xml.etree.ElementTree as ET
import csv
tree = ET.parse("c:\\python\\xml.xml")
root = tree.getroot()
f = open('c:\\python\\xml.csv', 'w')
csvwriter = csv.writer(f, delimiter=',')
count = 0
head = ['Nedis Part Number', 'Nedis Article ID', 'Vendor Part Number', 'Brand', 'EAN', 'Header text', 'Internet Text', 'General Text']
prdlist = root[1]
maxcat = 0
for time in prdlist.findall('product'):
cur = 0
for child in time.find('categories').find('tree'):
cur += 1
if cur > maxcat:
maxcat = cur
for cnt in range (0, maxcat):
head.append('Category ' + str(cnt + 1))
maximg = 0
for time in prdlist.findall('product'):
cur = 0
for child in time.find('images'):
cur += 1
if cur > maximg:
maximg = cur
for cnt in range(0, maximg):
head.append('Image ' + str(cnt + 1))
csvwriter.writerow(head)
for time in prdlist.findall('product'):
row = []
nedis_number = time.find('nedisPartnr').text
row.append(nedis_number)
nedis_art_id = time.find('nedisArtlid').text
row.append(nedis_art_id)
vendor_part_nbr = time.find('vendorPartnr').text
row.append(vendor_part_nbr)
Brand = time.find('brand').text
row.append(Brand)
ean = time.find('EAN').text
row.append(ean)
header_text = time.find('headerText').text
row.append(header_text)
internet_text = time.find('internetText').text
row.append(internet_text)
general_text = time.find('generalText').text
row.append(general_text)
curcat = 0
for child in time.find('categories').find('tree'):
categ = child.text
row.append(categ)
curcat += 1
while curcat < maxcat:
row.append('')
curcat += 1
curimg = 0
for img in time.find('images'):
image = img.text
row.append(image)
curimg += 1
while curimg < maximg:
row.append('')
curimg += 1
csvwriter.writerow(row)
f.close()

getting value from xml in a dict

I have a XML file that I want to parse. In the file I have 3 unique tags -
3
2
1
Each of these have 1 unique value for a metricX. I want to be extract these values in form a dict in python.
Something like
Desired Output
{ 3 : {"metricX":100}, 2 : {"metricX":11}, 1 : {"metricX":44}}
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="MeasDataCollection.xsl"?>
<!DOCTYPE mdc SYSTEM "MeasDataCollection.dtd">
<mdc xmlns:HTML="http://www.w3.org/TR/REC-xml">
<md>
<neid>
<neun>936001_STURGEON_BAY_MEYER</neun>
</neid>
<mi>
<mi>
<mts>20170924161500Z</mts>
<gp>900</gp>
<mt>metricX</mt>
<mv>
<moid>3</moid>
<r>100</r>
</mv>
<mv>
<moid>2</moid>
<r>11</r>
</mv>
<mv>
<moid>1</moid>
<r>44</r>
</mv>
</mi>
</mi>
</md>
</mdc>
So far I have tried using Element Tree.
import os
import xml.etree.ElementTree as ET
fullpath = os.getcwd()
os.chdir(r"C:\Users\sss\Documents\Zabbix_work\xml_parsing")
tree = ET.ElementTree(file='smaple.xml')
for elem in tree.iter():
print (elem.tag, elem.text)
Output so far is -
mdc
md
neid
neun 936001_STURGEON_BAY_MEYER
mi
mi
mts 20170924161500Z
gp 900
mt metricX
mv
moid 3
r 100
mv
moid 2
r 11
mv
moid 1
r 44
Not so sure now how to organize it further in form of a dict.
This should do the trick:
import xml.etree.ElementTree as ET
import os
file_path = os.path.expanduser('~/Desktop/input123.xml') # filepath here
tree = ET.ElementTree(file=file_path)
my_dict = {}
for node in tree.getroot().find('md').find('mi').find('mi').findall('mv'):
my_dict[int(node.find('moid').text)] = { 'metricX': int(node.find('r').text) }
print(my_dict)
...output:
{3: {'metricX': 100}, 2: {'metricX': 11}, 1: {'metricX': 44}}

Python 3: XML Tag Value not being written to csv file

My python 3 script takes an xml file and creates a csv file.
Small excerpt of xml file:
<?xml version="1.0" encoding="UTF-8" ?>
<metadata>
<dc>
<title>Golden days for boys and girls, 1895-03-16, v. XVI #17</title>
<subject>Children's literature--Children's periodicals</subject>
<description>Archives & Special Collections at the Thomas J. Dodd Research Center, University of Connecticut Libraries</description>
<publisher>James Elverson, 1880-</publisher>
<date>1895-06-15</date>
<type>Text | periodicals</type>
<format>image/jp2</format>
<handle>http://hdl.handle.net/11134/20002:860074494</handle>
<accessionNumber/>
<barcode/>
<identifier>20002:860074494 | local: 868010272 | local: 997186613502432 | local: 39153019382870 | hdl:  | http://hdl.handle.net/11134/20002:860074494</identifier>
<rights>These Materials are provided for educational and research purposes only. The University of Connecticut Libraries hold the copyright except where noted. Permission must be obtained in writing from the University of Connecticut Libraries and/or theowner(s) of the copyright to publish reproductions or quotations beyond "fair use." | The collection is open and available for research.</rights>
<creator/>
<relation/>
<coverage/>
<language/>
</dc>
</metadata>
Python3 code:
import csv
import xml.etree.ElementTree as ET
tree = ET.ElementTree(file='ctda_set1_uniqueTags.xml')
doc = ET.parse("ctda_set1_uniqueTags.xml")
root = tree.getroot()
oaidc_data = open('ctda_set1_uniqueTags.csv', 'w', encoding='utf-8')
titles = 'dc/title'
subjects = 'dc/subject'
csvwriter = csv.writer(oaidc_data)
oaidc_head = ['Title', 'Subject', 'Description', 'Publisher', 'Date', 'Type', 'Format', 'Handle', 'Accession Number', 'Barcode', 'Identifiers', 'Rights', 'Creator', 'Relation', 'Coverage', 'Language']
count = 0
for member in root.findall('dc'):
if count == 0:
csvwriter.writerow(oaidc_head)
count = count + 1
dcdata = []
titles = member.find('title').text
dcdata.append(titles)
subjects = member.find('subject').text
dcdata.append(subjects)
descriptions = member.find('description').text
dcdata.append(descriptions)
publishers = member.find('publisher').text
dcdata.append(publishers)
dates = member.find('date').text
dcdata.append(dates)
types = member.find('type').text
dcdata.append(types)
formats = member.find('format').text
dcdata.append(formats)
handle = member.find('handle').text
dcdata.append(handle)
accessionNo = member.find('accessionNumber').text
dcdata.append(accessionNo)
barcodes = member.find('barcode').text
dcdata.append(barcodes)
identifiers = member.find('identifier').text
dcdata.append(identifiers)
rt = member.find('rights').text
print(member.find('rights').text)
dcdata.append('rt')
ct = member.find('creator').text
dcdata.append('ct')
rt = member.find('relation').text
dcdata.append('rt')
ce = member.find('coverage').text
dcdata.append('ce')
lang = member.find('language').text
dcdata.append('lang')
csvwriter.writerow(dcdata)
oaidc_data.close()
Everything works as expected except for rt, ce, and lang. What happens is that in the csv, all the data is written with the comma delimiter. For rt, the value is always rt, for ce, ce, lang, lang, etc.
Here's a snippet of the output:
Title,Subject,Description,Publisher,Date,Type,Format,Handle,Accession Number,Barcode,Identifiers,Rights,Creator,Relation,Coverage,Language
"Golden days for boys and girls, 1895-03-16, v. XVI #17",Children's literature--Children's periodicals,"Archives & Special Collections at the Thomas J. Dodd Research Center, University of Connecticut Libraries","James Elverson, 1880-",1895-06-15,Text | periodicals,image/jp2,hdl.handle.net/11134/20002:860074494,,,20002:860074494 | local: 868010272 | local: 997186613502432 | local: 39153019382870,**rt,ct,rt,ce,lang**
Some of the rights statements get very long - perhaps that's the issue. That's why I added the print(member.find('rights')) to see the output. The text is printed just fine. The text just isn't written to the csv. What I'd like is to have the value or text written for these xml tags. Any help would be appreciated.
Thanks.
Jennifer
In the line dcdata.append('rt') there is no need for the quotes. Try dcdata.append(rt). Similarly, there are unnecessary quotes in the ce and lang lines.

Categories