I started playing with Python and ElementTree very recently to achieve something quite specific. I think I am very nearly there, but there is one thing I can't quite work out. I am querying an XML file, pulling back the relevant data, and then putting that data into a CSV file. It all works, but the issue is that elem.attrib["text"] actually returns multiple lines: when I put it into a variable and export it to a CSV, only the first line is exported. Below is the code I am using:
import os
import csv
import xml.etree.cElementTree as ET
path = "/share/new"
c = csv.writer(open("/share/redacted.csv", "wb"))
c.writerow(["S","R","T","R2","R3"])
for filename in os.listdir(path):
    if filename.endswith('.xml'):
        fullname = os.path.join(path, filename)
        tree = ET.ElementTree(file=(fullname))
        for elem in tree.iterfind('PropertyList/Property[@name="Sender"]'):
            c1 = elem.attrib["value"]
        for elem in tree.iterfind('PropertyList/Property[@name="Recipient"]'):
            c2 = elem.attrib["value"]
        for elem in tree.iterfind('PropertyList/Property[@name="Date"]'):
            c3 = elem.attrib["value"]
        for elem in tree.iterfind('ChildContext/ResponseList/Response/TextualAnalysis/ExpressionList/Expression/Match'):
            c4 = elem.attrib["textView"]
        for elem in tree.iterfind('ChildContext/ResponseList/Response/TextualAnalysis/ExpressionList/Expression/Match/Matched'):
            c5 = elem.attrib["text"]
            print elem.attrib["text"]
            print c5
        c.writerow([(c1),(c2),(c3),(c4),(c5)])
The most important part is right near the bottom. The output of print elem.attrib["text"] is:
Apples
Bananas
The output of 'print c5' is the same (just to be clear, that is Apples and Bananas on separate lines).
But writing c5 to the CSV only outputs the first line, so only Apples appears in the CSV.
I hope this makes sense. What I need to do is output both Apples and Bananas to the CSV (in the same cell, preferably). The code above is Python 2.7 in development, but ideally I need it to work in 2.6 (I realise iterfind is not in 2.6; I have 2 versions of the code already).
I would post the XML but it is a bit of a beast. As per the suggestion in the comments, here is a cleaned-up XML.
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<Context>
<PropertyList duplicates="true">
<Property name="Sender" type="string" value="S:demo1@no-one.local"/>
<Property name="Recipient" type="string" value="RPFD:no-one.local"/>
<Property name="Date" type="string" value="Tue, 4 Aug 2015 13:24:16 +0100"/>
</PropertyList>
<ChildContext>
<ResponseList>
<Response>
<Description>
<Arg />
<Arg />
</Description>
<TextualAnalysis version="2.0">
<ExpressionList>
<Expression specified=".CLEAN.(Apples)" total="1" >
<Match textView="Body" truncated="false">
<Surrounding text="..."/>
<Surrounding text="How do you like them "/>
<Matched cleaned="true" text="Apples " type="expression"/>
<Surrounding text="???????? "/>
<Surrounding text="..."/>
</Match>
</Expression>
</ExpressionList>
</TextualAnalysis>
</Response>
</ResponseList>
</ChildContext>
<ChildContext>
<ResponseList>
<Response>
<Description>
<Arg />
<Arg />
</Description>
<TextualAnalysis version="2.0">
<ExpressionList>
<Expression specified=".CLEAN.(Bananas)" total="1" >
<Match textView="Attach" truncated="false">
<Surrounding text="..."/>
<Surrounding text="Also I don't like... "/>
<Matched cleaned="true" text="Bananas " type="expression"/>
<Surrounding text="!!!!!!! "/>
<Surrounding text="..."/>
</Match>
</Expression>
</ExpressionList>
</TextualAnalysis>
</Response>
</ResponseList>
</ChildContext>
</Context>
The following will join together all the text elements, and put them on separate lines in the same cell inside your CSV. You can change the '\n' separator to ' ' or ',' to put them on the same line. However, you might still have issues with some of your other stuff -- you don't have nested loops there and I don't really understand what you are trying to accomplish, so maybe you have more than one of each of those other things too. Anyway:
c5 = []
for elem in tree.iterfind('ChildContext/ResponseList/Response/TextualAnalysis/ExpressionList/Expression/Match/Matched'):
    c5.append(elem.attrib["text"])
c.writerow([c1, c2, c3, c4, '\n'.join(c5)])
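As a rough, self-contained illustration of the same join idea, here is a minimal sketch. It assumes Python 3 (the question uses 2.7), writes to an in-memory buffer instead of a file, and uses a hypothetical, heavily trimmed XML sample; the helper name matched_cell is made up for this example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample with two Matched elements.
SAMPLE = """<Context>
  <ChildContext><Match textView="Body"><Matched text="Apples "/></Match></ChildContext>
  <ChildContext><Match textView="Attach"><Matched text="Bananas "/></Match></ChildContext>
</Context>"""


def matched_cell(root, separator="\n"):
    """Collect every Matched 'text' attribute and join the values into one string."""
    texts = [m.attrib["text"].strip()
             for m in root.iterfind("ChildContext/Match/Matched")]
    return separator.join(texts)


root = ET.fromstring(SAMPLE)

# Write a header and one data row; the csv module quotes the embedded newline.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["R3"])
writer.writerow([matched_cell(root)])
csv_text = buffer.getvalue()
print(csv_text)
```

Passing separator=", " instead would put both values on a single line inside the cell.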
I'm a bit new to XML and python. Below is a cut down version of a large XML file I'm trying to bring into python to eventually write into SQL Server db.
<?xml version="1.0" encoding="utf-8"?>
<MyOrgRefData:OrgRefData xmlns:MyOrgRefData="http://refdata.org/org/v2-0-0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://refdata.org/org/v2-0-0/MyOrgRefData.xsd">
<Manifest>
<Version value="2-0-0" />
<PublicationType value="Full" />
<PublicationSource value="TEST123" />
<PublicationDate value="2022-05-23" />
<PublicationSeqNum value="1659" />
<FileCreationDateTime value="2022-05-23T22:14:47" />
<RecordCount value="287654" />
<ContentDescription value="FullFile_20220523" />
<PrimaryRoleScope>
<PrimaryRole id="123" displayName="Free beer for me" />
<PrimaryRole id="456" displayName="Free air for you" />
</PrimaryRoleScope>
</Manifest>
<CodeSystems>
<CodeSystem name="OrganisationRecordClass" oid="1.2.3.4.5">
<concept id="RC2" code="2" displayName="World1" />
<concept id="RC1" code="1" displayName="World2" />
</CodeSystem>
<CodeSystem name="OrganisationRole" oid="5.4.7.8">
<concept id="B1ng0" code="179" displayName="BoomBastic" />
<concept id="R2D2a" code="180" displayName="Fantastic" />
</CodeSystem>
</CodeSystems>
</MyOrgRefData:OrgRefData>
I've tried with lxml, pandas.read_xml, xml.etree and I'm not able to understand how to get what I want.
Ideally I'd like to pull Manifest into a dataframe ready to send to SQL (df.to_sql()). I would do the same with CodeSystems as well, but separately. (There are other sections, but I cut them out to keep this short.)
For example, using pandas to read in, I can only get a column with the values in. But I would like to either have the tag (Version, PublicationType, PublicationSource etc) in a column by the side of the value, or have them as the column headers and the values pivoted across the row instead.
import pandas as pd
# Trailing slash so the concatenation below yields a valid path
dataFolder = '/Some/directory/'
df_bulk = pd.read_xml(
    dataFolder + 'Data_Full_20220523.xml',
    xpath='//Manifest/*',
    attrs_only=True,
)
df_bulk.head()
This is the output I get:
inx  value
0    2-0-0
1    Full
2    TEST123
3    2022-05-23
4    1659
5    2022-05-23T22:14:47
6    287654
7    FullFile_20220523
Ideally I would like:
inx                   value
Version               2-0-0
PublicationType       Full
PublicationSource     TEST123
PublicationDate       2022-05-23
PublicationSeqNum     1659
FileCreationDateTime  2022-05-23T22:14:47
RecordCount           287654
ContentDescription    FullFile_20220523
The eagle-eyed among you will notice I've left out PrimaryRoleScope. I would ideally like to treat this separately in its own dataframe as well, but I am unsure how to exclude it when pulling in the rest of the Manifest section.
Many thanks if you've read this far, even more thanks for any help.
One possibility is using the stylesheet parameter to transform the XML data internally with XSLT before processing it.
So your code could look like this:
import pandas as pd
# Trailing slash so the concatenation below yields a valid path
dataFolder = '/Some/directory/'
df_bulk = pd.read_xml(
    dataFolder + 'Data_Full_20220523.xml',
    stylesheet='transform.xslt',
    xpath='/Root/Item',
    attrs_only=True,
)
print(df_bulk.head(10))
The stylesheet (transform.xslt) to be passed to read_xml could be the following (lxml is required):
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes"/>
  <xsl:template match="/">
    <Root><xsl:apply-templates /></Root>
  </xsl:template>
  <xsl:template match="//Manifest/*[not(self::PrimaryRoleScope)]">
    <Item name="{name()}" value="{@value}" />
  </xsl:template>
</xsl:stylesheet>
In this example a new XML like the following is created. It is intermediate XML and not shown, but the xpath= parameter above has to be set accordingly.
<Root>
<Item name="Version" value="2-0-0"/>
<Item name="PublicationType" value="Full"/>
<Item name="PublicationSource" value="TEST123"/>
<Item name="PublicationDate" value="2022-05-23"/>
<Item name="PublicationSeqNum" value="1659"/>
<Item name="FileCreationDateTime" value="2022-05-23T22:14:47"/>
<Item name="RecordCount" value="287654"/>
<Item name="ContentDescription" value="FullFile_20220523"/>
</Root>
And the final output is
name value
0 Version 2-0-0
1 PublicationType Full
2 PublicationSource TEST123
3 PublicationDate 2022-05-23
4 PublicationSeqNum 1659
5 FileCreationDateTime 2022-05-23T22:14:47
6 RecordCount 287654
7 ContentDescription FullFile_20220523
The above approach uses only attributes, but you could also create an element structure with the XSLT if you prefer that. In this case change one template to
<xsl:template match="//Manifest/*[not(self::PrimaryRoleScope)]">
  <Item>
    <name><xsl:value-of select="name()" /></name>
    <value><xsl:value-of select="@value" /></value>
  </Item>
</xsl:template>
and your python code to
import pandas as pd
# Trailing slash so the concatenation below yields a valid path
dataFolder = '/Some/directory/'
df_bulk = pd.read_xml(
    dataFolder + 'Data_Full_20220523.xml',
    stylesheet='transform.xslt',
    xpath='/Root/Item',
)
print(df_bulk.head(10))
The output is the same.
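If XSLT is not an option, the same name/value rows can probably be built with the standard library alone. The following is only a sketch under that assumption: it swaps the stylesheet for xml.etree.ElementTree, uses a trimmed copy of the Manifest section from the question, and the helper name manifest_rows is made up here. The resulting list of dicts could then be handed to pd.DataFrame(rows).

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the Manifest section from the question.
SAMPLE = """<MyOrgRefData:OrgRefData xmlns:MyOrgRefData="http://refdata.org/org/v2-0-0">
  <Manifest>
    <Version value="2-0-0" />
    <PublicationType value="Full" />
    <RecordCount value="287654" />
    <PrimaryRoleScope>
      <PrimaryRole id="123" displayName="Free beer for me" />
    </PrimaryRoleScope>
  </Manifest>
</MyOrgRefData:OrgRefData>"""


def manifest_rows(root, skip=("PrimaryRoleScope",)):
    """Return one {'name', 'value'} dict per Manifest child, skipping the given tags."""
    manifest = root.find("Manifest")
    return [{"name": child.tag, "value": child.get("value")}
            for child in manifest
            if child.tag not in skip]


root = ET.fromstring(SAMPLE)
rows = manifest_rows(root)

# PrimaryRoleScope handled separately: one dict of attributes per PrimaryRole.
roles = [dict(role.attrib)
         for role in root.iterfind("Manifest/PrimaryRoleScope/PrimaryRole")]

for row in rows:
    print(row["name"], row["value"])
```

Only the root element carries a namespace prefix in this file, so the child tags can be addressed by their plain names.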
There is an option when opening an XML file using Excel. You get prompted with it when opening the file, as seen in the linked picture.
It basically opens the XML file as a table, and based on the analysis I have done, it seems to do a pretty good job.
The second linked picture shows how it looks after I opened an XML file in Excel as a table.
My question: I want to convert an XML file into a table form the way that Excel feature does. Is that possible?
The reason I want this result is that working with tables is really easy using libraries like pandas. However, I don't want to open every XML file with Excel, show the table, and then save it again. That is not very time-efficient.
This is my XML file
<?xml version="1.0" encoding="utf-8"?>
<ProjectData>
<FINAL>
<START id="ID0001" service_code="0x5196">
<Docs Docs_type="START">
<Rational>225196</Rational>
<Qualify>6251960000A0DE</Qualify>
</Docs>
<Description num="1213f2312">The parameter</Description>
<SetFile dg="" dg_id="">
<SetData value="32" />
</SetFile>
</START>
<START id="DG0003" service_code="0x517B">
<Docs Docs_type="START">
<Rational>23423</Rational>
<Qualify>342342</Qualify>
</Docs>
<Description num="3423423f3423">The third</Description>
<SetFile dg="" dg_id="">
<FileX dg="" axis_pts="2" name="" num="" dg_id="" />
<FileY unit="" axis_pts="20" name="TOOLS" text_id="23423" unit_id="" />
<SetData x="E1" value="21259" />
<SetData x="E2" value="0" />
</SetFile>
</START>
<START id="ID0048" service_code="0x5198">
<RawData rawdata_type="OPDATA">
<Request>225198</Request>
<Response>343243324234234</Response>
</RawData>
<Meaning text_id="434234234">The forth</Meaning>
<ValueDataset unit="m" unit_id="FEDS">
<FileX dg="kg" discrete="false" axis_pts="19" name="weight" text_id="SDF3" unit_id="SDGFDS" />
<SetData xin="sdf" xax="233" value="323" />
<SetData xin="123" xax="213" value="232" />
<SetData xin="2321" xax="232" value="23" />
</ValueDataset>
</START>
</FINAL>
</ProjectData>
So let's say I have the following input.xml file:
<main>
<item name="item1" image="a"></item>
<item name="item2" image="b"></item>
<item name="item3" image="c"></item>
<item name="item4" image="d"></item>
</main>
You can use the following code:
import xml.etree.ElementTree as ET
import pandas as pd
tree = ET.parse('input.xml')
tags = [ e.attrib for e in tree.getroot() ]
df = pd.DataFrame(tags)
# df:
# name image
# 0 item1 a
# 1 item2 b
# 2 item3 c
# 3 item4 d
And this should be independent of the number of attributes in a given file.
To write to a simple CSV file from pandas, you can use the to_csv method. If it needs to be an Excel sheet, you can use to_excel.
# Write to csv without the row names
df.to_csv('file_name.csv', index = False)
# Write to xlsx sheet without the row names
df.to_excel('file_name.xlsx', index=False)
UPDATE:
For your XML file, and based on your clarification in the comments, I suggest the following, where every element in the first level of the tree becomes a row, and every attribute or node text becomes a column:
def has_children(e):
    ''' Check if element, e, has children'''
    return len(list(e)) > 0
def has_attrib(e):
    ''' Check if element, e, has attributes'''
    return len(e.attrib) > 0
def get_uniqe_key(mydict, key):
    ''' Generate unique key if already exists in mydict'''
    while key in mydict:
        key = key + '*'
    return key
tree = ET.parse('input2.xml')
root = tree.getroot()
# Get first level:
lvl_one = list(root)
myList = []
for e in lvl_one:
    mydict = {}
    # Iterate over each node in level one element
    for node in e.iter():
        # Leaf nodes with text become a column named after the tag
        if (not has_children(node)) and (node.text is not None):
            uniqe_key = get_uniqe_key(mydict, node.tag)
            mydict[uniqe_key] = node.text
        # Every attribute becomes a column named after the attribute
        if has_attrib(node):
            for key in node.attrib:
                uniqe_key = get_uniqe_key(mydict, key)
                mydict[uniqe_key] = node.attrib[key]
    myList.append(mydict)
print(pd.DataFrame(myList))
Notice in this code, I check if the column name exists for each key, and if it exists, I create a new column name by suffixing with '*'.
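To make the '*' suffixing concrete, here is a small standalone sketch of the same flattening idea applied to a single row element. It uses only the standard library, a trimmed sample loosely based on the question's XML (the second SetData value is made up), and the hypothetical helper names unique_key and flatten.

```python
import xml.etree.ElementTree as ET

# Trimmed sample; the second SetData element is invented to force a name clash.
SAMPLE = """<FINAL>
  <START id="ID0001" service_code="0x5196">
    <Docs Docs_type="START">
      <Rational>225196</Rational>
    </Docs>
    <SetFile dg="" dg_id="">
      <SetData value="32" />
      <SetData value="33" />
    </SetFile>
  </START>
</FINAL>"""


def unique_key(record, key):
    """Append '*' until the key no longer clashes with an existing column."""
    while key in record:
        key += "*"
    return key


def flatten(element):
    """Flatten one element and all its descendants into a single dict (one table row)."""
    record = {}
    for node in element.iter():
        is_leaf = len(node) == 0
        if is_leaf and node.text is not None:
            record[unique_key(record, node.tag)] = node.text
        for name, value in node.attrib.items():
            record[unique_key(record, name)] = value
    return record


record = flatten(ET.fromstring(SAMPLE).find("START"))
print(record)
```

The two SetData elements both carry a value attribute, so the second one lands in a column named value*.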
I have multiple xml files that look something like this:
<?xml version="1.0" encoding="UTF-8"?>
<parent>
<row AcceptedAnswerId="15" AnswerCount="5" Body="<p>How should
I elicit prior distributions from experts when fitting a Bayesian
model?</p>
" CommentCount="1" CreationDate="2010-07-
19T19:12:12.510" FavoriteCount="17" Id="1" LastActivityDate="2010-09-
15T21:08:26.077" OwnerUserId="8" PostTypeId="1" Score="26"
Tags="<bayesian><prior><elicitation>"
Title="Eliciting priors from experts" ViewCount="1457" />
I would like to be able to use PySpark to count the lines that DO NOT contain the string: <row
My current thought:
def startWithRow(line):
    # True when the line, ignoring surrounding whitespace, starts with "<row"
    return line.strip().startswith("<row")
sc.textFile(localpath("folder_containing_xml.gz_files")) \
    .filter(lambda x: not startWithRow(x)) \
    .count()
I have tried validating this, but even a simple line count gives results that don't make sense (I downloaded the XML file and ran wc on it, and the result did not match the count from PySpark).
Does anything about my approach above stand out as wrong/weird?
I would just use the lxml library combined with Spark to count the row elements or filter something out.
from lxml import etree
def find_number_of_rows(path):
    # Try the input as an XML string first, then fall back to a file path
    try:
        tree = etree.fromstring(path)
    except Exception:
        tree = etree.parse(path)
    return len(tree.findall('row'))
rdd = spark.sparkContext.parallelize(paths) # paths is a list to all your paths
rdd.map(lambda x: find_number_of_rows(x)).collect()
For example, if you have a list of XML strings (just a toy example), you can do the following:
text = [
"""
<parent>
<row ViewCount="1457" />
<row ViewCount="1457" />
</parent>
""",
"""
<parent>
<row ViewCount="1457" />
<row ViewCount="1457" />
<row ViewCount="1457" />
</parent>
"""
]
rdd = spark.sparkContext.parallelize(text)
rdd.map(lambda x: find_number_of_rows(x)).collect()
In your case, your function has to take in the path to a file instead. Then you can count or filter those rows. I don't have a full file to test on. Let me know if you need extra help!
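For the line-based count from the question, the predicate can be checked without Spark first. This is just a sketch with made-up sample lines and hypothetical helper names (is_row_line, count_non_row_lines); it also shows one possible reason for surprising counts: a row element that wraps across several lines contributes continuation lines that do not start with <row.

```python
def is_row_line(line):
    """True when the line, ignoring surrounding whitespace, starts with '<row'."""
    return line.strip().startswith("<row")


def count_non_row_lines(lines):
    """Count the lines that do not start with '<row'."""
    return sum(1 for line in lines if not is_row_line(line))


# Made-up sample: the second row element wraps onto a continuation line.
SAMPLE_LINES = [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<parent>',
    '  <row Id="1" Score="26" />',
    '  <row Id="2"',
    '       Score="3" />',
    '</parent>',
]

non_rows = count_non_row_lines(SAMPLE_LINES)
print(non_rows)
```

With Spark, the same predicate could be passed as rdd.filter(lambda l: not is_row_line(l)).count(), since textFile yields one record per line.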
import xml.etree.ElementTree as ET
from collections import Counter
def badRowParser(x):
    # Returns True when the line parses as XML, False otherwise
    try:
        ET.fromstring(x.strip().encode('utf-8'))
        return True
    except Exception:
        return False
posts = sc.textFile(localpath('folder_containing_xml.gz_files'))
# Keep lines containing "<row", then flag the ones that fail to parse
rejected = posts.filter(lambda l: "<row" in l).map(lambda x: not badRowParser(x))
ans = rejected.collect()
Counter(ans)
What am I screwing up here?
I can't get this to return any results. I'm sure I'm doing something stupid. I'm not a programmer and this is driving me crazy. Trying to learn but after about 8 hours I'm frazzled.
Here is a sample of my XML:
<?xml version="1.0"?>
-<MyObjectBuilder_Sector xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<!-- Saved '2014-08-23T15:28:07.8585220-05:00' with SEToolbox version '1.44.14.2' -->
-<Position>
<X>0</X>
<Y>0</Y>
<Z>0</Z>
</Position>
-<SectorEvents>
-<Events>
-<MyObjectBuilder_GlobalEventBase>
-<DefinitionId>
<TypeId>MyObjectBuilder_GlobalEventDefinition</TypeId>
<SubtypeId>SpawnCargoShip</SubtypeId>
</DefinitionId>
<Enabled>false</Enabled>
<ActivationTimeMs>401522</ActivationTimeMs>
</MyObjectBuilder_GlobalEventBase>
</Events>
</SectorEvents>
<AppVersion>1044014</AppVersion>
-<SectorObjects>
-<MyObjectBuilder_EntityBase xsi:type="MyObjectBuilder_VoxelMap">
<EntityId>72248529206701361</EntityId>
<PersistentFlags>CastShadows InScene</PersistentFlags>
-<PositionAndOrientation>
<Position z="-466" y="-8987" x="-95"/>
<Forward z="-1" y="0" x="0"/>
<Up z="0" y="1" x="0"/>
</PositionAndOrientation>
<Filename>BaseAsteroid.vox</Filename>
</MyObjectBuilder_EntityBase>
-<MyObjectBuilder_EntityBase xsi:type="MyObjectBuilder_VoxelMap">
<EntityId>72151252176979970</EntityId>
<PersistentFlags>CastShadows InScene</PersistentFlags>
-<PositionAndOrientation>
<Position z="-11301.9033" y="-1183.70569" x="-2126.84"/>
<Forward z="-1" y="0" x="0"/>
<Up z="0" y="1" x="0"/>
</PositionAndOrientation>
<Filename>asteroid0.vox</Filename>
</MyObjectBuilder_EntityBase>
-<MyObjectBuilder_EntityBase xsi:type="MyObjectBuilder_VoxelMap">
<EntityId>72108197145016458</EntityId>
<PersistentFlags>CastShadows InScene</PersistentFlags>
-<PositionAndOrientation>
<Position z="355.7873" y="18738.05" x="1064.912"/>
<Forward z="-1" y="0" x="0"/>
<Up z="0" y="1" x="0"/>
</PositionAndOrientation>
<Filename>asteroid1.vox</Filename>
</MyObjectBuilder_EntityBase>
Here is my code, it just never finds anything...:(
from xml.etree import cElementTree as ElementTree
ElementTree.register_namespace('xsi', 'http://www.w3.org/2001/XMLScheme-instance')
namespace = {'xsi': 'http://www.w3.org/2001/XMLScheme-instance'}
xmlPath = 'e:\\test.xml'
xmlRoot = ElementTree.parse(xmlPath).getroot()
#why this no return anything
results = xmlRoot.findall(".//SectorObjects/MyObjectBuilder_EntityBase[@xsi:type='MyObjectBuilder_VoxelMap']", namespaces=namespace)
print(results)
Your question is 'What am I screwing up here?' First of all, the XML itself has issues; it seems it did not paste here correctly. I did a few things to make it workable.
1) Added the lines below, since they were missing from the XML:
</SectorObjects>
</MyObjectBuilder_Sector>
2) In the version I ran, the findall function did not accept a named argument 'namespaces', and the xsi part also gave an error (SyntaxError: prefix 'xsi' not found in prefix map). So I changed the call to:
results = xmlRoot.findall(".//SectorObjects/MyObjectBuilder_EntityBase")
When I ran the code with above changes, I got this output below:
[<Element 'MyObjectBuilder_EntityBase' at 0x025028A8>, <Element 'MyObjectBuilder_EntityBase' at 0x02502CC8>, <Element 'MyObjectBuilder_EntityBase' at 0x02502E18>]
If you want to do more with these like getting the value of EntityId, you can do this:
results = xmlRoot.findall(".//SectorObjects/MyObjectBuilder_EntityBase")
try:
    for result in results:
        print(result.find('EntityId').text)
except AttributeError as aE:
    # find() returns None when an element has no EntityId child
    print(str(aE))
Output:
72248529206701361
72151252176979970
72108197145016458
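If you still want to keep only the elements whose xsi:type is MyObjectBuilder_VoxelMap, one way that avoids prefix handling altogether is to compare the attribute in Python using its fully qualified name. This is a sketch on a trimmed sample; the second element's type, MyObjectBuilder_CubeGrid, is made up for contrast. Note that the code in the question spells the URI 'XMLScheme-instance' while the document declares 'XMLSchema-instance'; namespaces are matched by exact URI, so that mismatch alone would prevent a match.

```python
import xml.etree.ElementTree as ET

# Namespace URI exactly as declared in the XML document.
XSI = "http://www.w3.org/2001/XMLSchema-instance"
XSI_TYPE = "{%s}type" % XSI  # ElementTree stores xsi:type under this qualified name

SAMPLE = """<MyObjectBuilder_Sector xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <SectorObjects>
    <MyObjectBuilder_EntityBase xsi:type="MyObjectBuilder_VoxelMap">
      <EntityId>72248529206701361</EntityId>
    </MyObjectBuilder_EntityBase>
    <MyObjectBuilder_EntityBase xsi:type="MyObjectBuilder_CubeGrid">
      <EntityId>1</EntityId>
    </MyObjectBuilder_EntityBase>
  </SectorObjects>
</MyObjectBuilder_Sector>"""

root = ET.fromstring(SAMPLE)

# Find all candidates by tag, then filter on the namespaced attribute.
voxel_maps = [
    entity
    for entity in root.findall(".//SectorObjects/MyObjectBuilder_EntityBase")
    if entity.get(XSI_TYPE) == "MyObjectBuilder_VoxelMap"
]
entity_ids = [entity.findtext("EntityId") for entity in voxel_maps]
print(entity_ids)
```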
I have the following code that tries to get values from an XML document:
from xml.dom import minidom
xml = """<SoccerFeed TimeStamp="20130328T152947+0000">
<SoccerDocument uID="f131897" Type="Result">
<Competition uID="c87">
<MatchData>
<MatchInfo TimeStamp="20070812T144737+0100" Weather="Windy" Period="FullTime" MatchType="Regular" />
<MatchOfficial uID="o11068"/>
<Stat Type="match_time">91</Stat>
<TeamData TeamRef="t810" Side="Home" Score="4" />
<TeamData TeamRef="t2012" Side="Away" Score="1" />
</MatchData>
<Team uID="t810" />
<Team uID="t2012" />
<Venue uID="v2158" />
</SoccerDocument>
</SoccerFeed>"""
xmldoc = minidom.parseString(xml)
soccerfeed = xmldoc.getElementsByTagName("SoccerFeed")[0]
soccerdocument = soccerfeed.getElementsByTagName("SoccerDocument")[0]
#Match Data
MatchData = soccerdocument.getElementsByTagName("MatchData")[0]
MatchInfo = MatchData.getElementsByTagName("MatchInfo")[0]
Goal = MatchData.getElementsByTagNameNS("Side", "Score")
The Goal is being set to [], but I would like to get the score value, which is 4.
It looks like you are searching for the wrong XML node. Check the following line:
Goal = MatchData.getElementsByTagNameNS("Side", "Score")
You are probably looking for the following:
Goal = MatchData.getElementsByTagName("TeamData")[0].getAttribute("Score")
NOTE: Document.getElementsByTagName, Document.getElementsByTagNameNS, Element.getElementsByTagName, Element.getElementsByTagNameNS return a list of nodes, not just a scalar value.
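Because a list of nodes comes back, it can be convenient to read the score of every TeamData element in one pass. A minimal sketch, using only the two TeamData lines from the question wrapped in a MatchData element:

```python
from xml.dom import minidom

# Only the relevant part of the question's XML.
SAMPLE = """<MatchData>
  <TeamData TeamRef="t810" Side="Home" Score="4" />
  <TeamData TeamRef="t2012" Side="Away" Score="1" />
</MatchData>"""

doc = minidom.parseString(SAMPLE)

# Map each team's Side attribute to its Score, converted to an integer.
scores = {
    team.getAttribute("Side"): int(team.getAttribute("Score"))
    for team in doc.getElementsByTagName("TeamData")
}
home_score = scores["Home"]
print(scores)
```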