Python, XML parsing, and Elementtree - python

What am I screwing up here?
I can't get this to return any results. I'm sure I'm doing something stupid. I'm not a programmer and this is driving me crazy. Trying to learn but after about 8 hours I'm frazzled.
Here is a sample of my XML:
<?xml version="1.0"?>
-<MyObjectBuilder_Sector xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<!-- Saved '2014-08-23T15:28:07.8585220-05:00' with SEToolbox version '1.44.14.2' -->
-<Position>
<X>0</X>
<Y>0</Y>
<Z>0</Z>
</Position>
-<SectorEvents>
-<Events>
-<MyObjectBuilder_GlobalEventBase>
-<DefinitionId>
<TypeId>MyObjectBuilder_GlobalEventDefinition</TypeId>
<SubtypeId>SpawnCargoShip</SubtypeId>
</DefinitionId>
<Enabled>false</Enabled>
<ActivationTimeMs>401522</ActivationTimeMs>
</MyObjectBuilder_GlobalEventBase>
</Events>
</SectorEvents>
<AppVersion>1044014</AppVersion>
-<SectorObjects>
-<MyObjectBuilder_EntityBase xsi:type="MyObjectBuilder_VoxelMap">
<EntityId>72248529206701361</EntityId>
<PersistentFlags>CastShadows InScene</PersistentFlags>
-<PositionAndOrientation>
<Position z="-466" y="-8987" x="-95"/>
<Forward z="-1" y="0" x="0"/>
<Up z="0" y="1" x="0"/>
</PositionAndOrientation>
<Filename>BaseAsteroid.vox</Filename>
</MyObjectBuilder_EntityBase>
-<MyObjectBuilder_EntityBase xsi:type="MyObjectBuilder_VoxelMap">
<EntityId>72151252176979970</EntityId>
<PersistentFlags>CastShadows InScene</PersistentFlags>
-<PositionAndOrientation>
<Position z="-11301.9033" y="-1183.70569" x="-2126.84"/>
<Forward z="-1" y="0" x="0"/>
<Up z="0" y="1" x="0"/>
</PositionAndOrientation>
<Filename>asteroid0.vox</Filename>
</MyObjectBuilder_EntityBase>
-<MyObjectBuilder_EntityBase xsi:type="MyObjectBuilder_VoxelMap">
<EntityId>72108197145016458</EntityId>
<PersistentFlags>CastShadows InScene</PersistentFlags>
-<PositionAndOrientation>
<Position z="355.7873" y="18738.05" x="1064.912"/>
<Forward z="-1" y="0" x="0"/>
<Up z="0" y="1" x="0"/>
</PositionAndOrientation>
<Filename>asteroid1.vox</Filename>
</MyObjectBuilder_EntityBase>
Here is my code, it just never finds anything...:(
from xml.etree import cElementTree as ElementTree
ElementTree.register_namespace('xsi', 'http://www.w3.org/2001/XMLScheme-instance')
namespace = {'xsi': 'http://www.w3.org/2001/XMLScheme-instance'}
xmlPath = 'e:\\test.xml'
xmlRoot = ElementTree.parse(xmlPath).getroot()
#why this no return anything
results = xmlRoot.findall(".//SectorObjects/MyObjectBuilder_EntityBase[@xsi:type='MyObjectBuilder_VoxelMap']", namespaces=namespace)
print(results)

Your question is 'What am I screwing up here?' First of all, your XML itself has issues; it seems it did not paste here correctly. I did a few things to make it workable.
1) Added lines below since they were not there in the XML:
</SectorObjects>
</MyObjectBuilder_Sector>
2) In the Python version I ran this on, findall did not accept a named argument 'namespaces', and the xsi part also gave an error (SyntaxError: prefix 'xsi' not found in prefix map). So I dropped the attribute filter and changed the call to:
results = xmlRoot.findall(".//SectorObjects/MyObjectBuilder_EntityBase")
When I ran the code with above changes, I got this output below:
[<Element 'MyObjectBuilder_EntityBase' at 0x025028A8>, <Element 'MyObjectBuilder_EntityBase' at 0x02502CC8>, <Element 'MyObjectBuilder_EntityBase' at 0x02502E18>]
If you want to do more with these, such as getting the value of EntityId, you can do this:
results = xmlRoot.findall(".//SectorObjects/MyObjectBuilder_EntityBase")
try:
    for result in results:
        print(result.find('EntityId').text)
except AttributeError as aE:
    print(str(aE))
Output:
72248529206701361
72151252176979970
72108197145016458
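For what it's worth, on current Python 3 versions findall does accept a namespaces mapping, and the attribute filter from the question can work once the namespace URI matches the one declared in the file (the question's code spells it XMLScheme-instance, while the document declares XMLSchema-instance). A minimal sketch, assuming a trimmed-down copy of the sample XML held in a string; SAMPLE, NS and the MyObjectBuilder_CubeGrid entry are made up for this illustration:

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the question's file (assumption: same structure).
SAMPLE = """<?xml version="1.0"?>
<MyObjectBuilder_Sector xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <SectorObjects>
    <MyObjectBuilder_EntityBase xsi:type="MyObjectBuilder_VoxelMap">
      <EntityId>72248529206701361</EntityId>
    </MyObjectBuilder_EntityBase>
    <MyObjectBuilder_EntityBase xsi:type="MyObjectBuilder_CubeGrid">
      <EntityId>1</EntityId>
    </MyObjectBuilder_EntityBase>
  </SectorObjects>
</MyObjectBuilder_Sector>"""

# The URI must match the xmlns:xsi declaration character for character.
NS = {"xsi": "http://www.w3.org/2001/XMLSchema-instance"}

root = ET.fromstring(SAMPLE)
voxel_maps = root.findall(
    ".//SectorObjects/MyObjectBuilder_EntityBase[@xsi:type='MyObjectBuilder_VoxelMap']",
    namespaces=NS,
)
# Collect the EntityId text of every element that passed the filter.
entity_ids = [vm.findtext("EntityId") for vm in voxel_maps]
print(entity_ids)
```

Only the element whose xsi:type is MyObjectBuilder_VoxelMap should come back; the invented second entry is there to show that the filter actually filters.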

Related

How to remove a sub child of an xml file using python?

I have an xml file, which has a particular set of child lines which should be deleted when the python code is run.
Below shown lines are my xml code.
<?xml version="1.0" encoding="utf-8" ?>
<visualization protocolVersion="10.4.0.0">
<globalSection/>
<coreObjectDefinition type="displayDefinition">
<version type="version" value="10.4.0.0"/>
<width>1920</width>
<height>810</height>
<referenceCheck>2</referenceCheck>
<defaultBgColor type="colorSet" r="255" g="255" b="255"/>
<defaultFgColor type="colorSet" r="0" g="0" b="0"/>
<defaultFont type="font" name="Tahoma" size="16" underline="false" strikethrough="false"/>
<defaultStroke type="stroke" width="1.0"/>
<grid type="grid" gridVisible="true" snappingActive="true" verticalSnapInterval="8" horizontalSnapInterval="8" onTop="false">
<color type="colorSet" r="0" g="0" b="0"/>
</grid>
<revisionHistory type="revisionHistory">
<revision type="revision" who="ADMIN" when="2020.05.03 09:46:15.566 CEST" what="Created" where="CPC-A0668-4138"/>
</revisionHistory>
<blinkDelay>500</blinkDelay>
<mousePassThrough>false</mousePassThrough>
<visibilityGroup type="componentData">
<htmlId>2</htmlId>
<name>Overview</name>
<description>Always shown</description>
<minimumZoomEnabled>true</minimumZoomEnabled>
<minimumZoomFactor>10.0</minimumZoomFactor>
</visibilityGroup>
<visibilityGroup type="componentData">
<htmlId>3</htmlId>
<name>Rough</name>
<description>Shown when viewing viewing a large area</description>
<minimumZoomEnabled>true</minimumZoomEnabled>
<minimumZoomFactor>25.0</minimumZoomFactor>
</visibilityGroup>
<visibilityGroup type="componentData">
<htmlId>4</htmlId>
<name>Standard</name>
<description>Shown when using the default view setting</description>
<minimumZoomEnabled>true</minimumZoomEnabled>
<minimumZoomFactor>100.0</minimumZoomFactor>
</visibilityGroup>
<visibilityGroup type="componentData">
<htmlId>5</htmlId>
<name>Detail</name>
<description>Shown only when viewing a small area</description>
<minimumZoomEnabled>true</minimumZoomEnabled>
<minimumZoomFactor>400.0</minimumZoomFactor>
</visibilityGroup>
<visibilityGroup type="componentData">
<htmlId>6</htmlId>
<name>Intricacies</name>
<description>Shown only when viewing a very small area</description>
<minimumZoomEnabled>true</minimumZoomEnabled>
<minimumZoomFactor>1000.0</minimumZoomFactor>
</visibilityGroup>
<visualizationLayer type="componentData">
<htmlId>1</htmlId>
<name>Layer1</name>
</visualizationLayer>
<componentCountHint>1</componentCountHint>
<ellipse type="componentData" x="851.99896" y="300.00006" top="92.000046" bottom="91.99985" left="99.99896" right="100.001526">
<htmlId>7</htmlId>
<stroke type="stroke" width="1.0"/>
<fillPaint type="paint">
<paint type="colorSet" r="255" g="255" b="255"/>
</fillPaint>
<data type="data">
<action type="actionConnectTo">
<property type="property" name="ellipse.visible"/>
<filter type="filter">
<value>0.0</value>
</filter>
<connection type="connection">
<direction>1</direction>
<itemName>AOG.Templates.Alarm</itemName>
<itemId>2.1.3.0.0.2.1.8</itemId>
</connection>
</action>
</data>
</ellipse>
</coreObjectDefinition>
</visualization>
I want only the below part to be deleted from the entire xml file.
<data type="data">
<action type="actionConnectTo">
<property type="property" name="ellipse.visible"/>
<filter type="filter">
<value>0.0</value>
</filter>
<connection type="connection">
<direction>1</direction>
<itemName>AOG.Templates.Alarm</itemName>
<itemId>2.1.3.0.0.2.1.8</itemId>
</connection>
</action>
</data>
The Python code below only removes a direct child of the root and not the sub-child. Kindly help me out with this.
from xml.etree import ElementTree
root = ElementTree.parse("test1.xml").getroot()
b = root[0]  # first direct child of the root (getchildren() was removed in Python 3.9)
root.remove(b)
ElementTree.dump(root)
Try this.
from simplified_scrapy import SimplifiedDoc,utils,req
html = '''Your xml'''
doc = SimplifiedDoc(html)
data = doc.select('data#type=data') # Get the element
data.repleaceSelf("") # Remove it
print(doc.html) # This is what you want
Unfortunately, you can't remove a sub-child directly from the root with ElementTree. Each node only has "pointers" to its direct children, and remove() only works on a direct child. So, in order to remove the <data/> node, you have to call remove() on its direct parent node.
I'd do it in this way:
for d in root.findall('coreObjectDefinition'):
    for e in d.findall('ellipse'):
        for f in e.findall('data'):
            e.remove(f)
ElementTree also has syntax that allows you to search a tree recursively, so you're able to find the element with root.findall('.//data'). So a shorter version of the above code would be:
for d in root.findall('.//ellipse'):
    for e in d.findall('data'):
        d.remove(e)
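If the parent tag names are not known in advance, one common workaround is to build a child-to-parent lookup first, since ElementTree elements do not keep a reference to their parent. A minimal sketch, assuming a shortened stand-in for the file above (XML_TEXT and parent_of are illustrative names, not part of the original code):

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the question's file (assumption: same nesting).
XML_TEXT = """<visualization>
  <coreObjectDefinition type="displayDefinition">
    <ellipse type="componentData">
      <htmlId>7</htmlId>
      <data type="data">
        <action type="actionConnectTo"/>
      </data>
    </ellipse>
  </coreObjectDefinition>
</visualization>"""

root = ET.fromstring(XML_TEXT)

# Map every element to its direct parent.
parent_of = {child: parent for parent in root.iter() for child in parent}

# Materialise the matches first, then remove each one via its parent.
for data in list(root.iter("data")):
    parent_of[data].remove(data)

remaining = [el.tag for el in root.iter()]
print(remaining)
```

Removing a data element also drops everything nested inside it, so the action child disappears along with it.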

Extracting comments from XML file in Python

I would like to extract the comment section of the XML file. The information I want is inside the Tag element, within its Text element, which here contains "EXAMPLE".
The structure of the XML file looks below.
<Boxes>
<Box Id="3" ZIndex="13">
<Shape>Rectangle</Shape>
<Brush Id="0" />
<Pen>
<Color>#FF000000</Color>
</Pen>
<Tag>&lt;?xml version="1.0"?&gt;
&lt;PFDComment xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"&gt;
&lt;Text&gt;**EXAMPLE** &lt;/Text&gt;
&lt;/PFDComment&gt;</Tag>
</Box>
</Boxes>
I tried something like the code below but couldn't get the information that I want.
def read_comments(xml):
    tree = lxml.etree.parse(xml)
    Comments = {}
    for comment in tree.xpath("//Boxes/Box"):
        get_id = comment.attrib['Id']
        Comments[get_id] = []
        for group in comment.xpath(".//Tag"):
            Comments[get_id].append(group.text)
    df_name1 = pd.DataFrame(dict([(k,pd.Series(v)) for k,v in Comments.items()]))
Can anyone help to extract comments from XML file shown above? Any help is appreciated!
Use the code given below:
def read_comments(xml):
    tree = etree.parse(xml)
    rows = []
    for box in tree.xpath('Box'):
        id = box.attrib['Id']
        tagTxt = box.findtext('Tag')
        if tagTxt is None:
            continue
        txtNode = etree.XML(tagTxt).find('Text')
        if txtNode is None:
            continue
        rows.append([id, txtNode.text.strip()])
    return pd.DataFrame(rows, columns=['id', 'Comment'])
Note that if you create a DataFrame within a function, it is a local variable of that function and is not visible from outside. A better and more readable approach (the one I took) is for the function to return the DataFrame.
The function also contains continue in 2 places, to guard against possible "error cases": when a Box element does not contain a Tag child, or when the Tag content does not contain any Text child element.
I also noticed that there is no need to replace &lt; or &gt; with < or > in my own code, as lxml does this on its own.
Edit
My test is as follows. Start with the imports:
import pandas as pd
from lxml import etree
I used a file containing:
<Boxes>
<Box Id="3" ZIndex="13">
<Shape>Rectangle</Shape>
<Brush Id="0" />
<Pen>
<Color>#FF000000</Color>
</Pen>
<Tag>&lt;?xml version="1.0"?&gt;
&lt;PFDComment xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"&gt;
&lt;Text&gt;**EXAMPLE** &lt;/Text&gt;
&lt;/PFDComment&gt;</Tag>
</Box>
</Boxes>
I called the above function:
df_name1 = read_comments('Boxes.xml')
and when I printed df_name1, I got:
  id      Comment
0  3  **EXAMPLE**
If something goes wrong, use the "extended" version of the above function,
with test printouts:
def read_comments(xml):
    tree = etree.parse(xml)
    rows = []
    for box in tree.xpath('Box'):
        id = box.attrib['Id']
        tagTxt = box.findtext('Tag')
        if tagTxt is None:
            print('No Tag element')
            continue
        txtNode = etree.XML(tagTxt).find('Text')
        if txtNode is None:
            print('No Text element')
            continue
        txt = txtNode.text.strip()
        print(f'{id}: {txt}')
        rows.append([id, txt])
    return pd.DataFrame(rows, columns=['id', 'Comment'])
and take a look at printouts.
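The key step in the function is that the text of Tag is itself an XML document, so it has to be parsed a second time. A minimal sketch of just that step, using the standard library's xml.etree.ElementTree instead of lxml and an inline string instead of a file (both are assumptions made to keep the example self-contained; OUTER and comments are illustrative names):

```python
import xml.etree.ElementTree as ET

# Outer document; the Tag content is an escaped, embedded XML document.
OUTER = """<Boxes>
  <Box Id="3" ZIndex="13">
    <Tag>&lt;?xml version="1.0"?&gt;
&lt;PFDComment&gt;
  &lt;Text&gt;EXAMPLE &lt;/Text&gt;
&lt;/PFDComment&gt;</Tag>
  </Box>
</Boxes>"""

comments = {}
for box in ET.fromstring(OUTER).findall("Box"):
    # The parser has already turned &lt; and &gt; back into < and >.
    inner_xml = box.findtext("Tag")
    if inner_xml is None:
        continue
    # Second parse: the embedded document stored as text.
    inner_root = ET.fromstring(inner_xml)
    text = inner_root.findtext("Text")
    if text is not None:
        comments[box.get("Id")] = text.strip()

print(comments)
```

A plain dict keyed by the Box Id is used here instead of a DataFrame so the sketch has no third-party dependencies.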

Parsing XML with namespace into dictionary

I'm having a hard time following the xml.etree.ElementTree documentation with regard to parsing an XML document with a namespace and nested tags.
To begin, the xml tree I am trying to parse looks like:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ROOT-MAIN xmlns="http://fakeurl.com/page">
<Alarm> <--- I dont care about these types of objects
<Node>
<location>Texas></location>
<name>John</name>
</Node>
</Alarm>
<Alarm> <--- I care about these types of objects
<CreateTime>01/01/2011</CreateTime>
<Story>
<Node>
<Name>Ethan</Name>
<Address category="residential">
<address>1421 Morning SE</address>
</Address>
</Node>
</Story>
<Build>
<Action category="build_value_1">Build was successful</Action>
</Build>
<OtherData type="string" meaning="favoriteTVShow">Purple</OtherData>
<OtherData type="string" meaning="favoriteColor">Seinfeld</OtherData>
</Alarm>
</ROOT-MAIN>
I am trying to build an array of dictionaries that have a similar structure to the second <Alarm> object. When parsing this XML file, I do the following:
import xml.etree.ElementTree as ET
tree = ET.parse('data/'+filename)
root = tree.getroot()
namespace= '{http://fakeurl.com/page}'
for alarm in tree.findall(namespace+'Alarm'):
    for elem in alarm.iter():
        try:
            creation_time = elem.find(namespace+'CreateTime')
            for story in elem.findall(namespace+'Story'):
                for node in story.findall(namespace+'Node'):
                    for Address in node.findall(namespace+'Address'):
                        address = Address.find(namespace+'address').text
            for build in elem.findall(namespace+'Build'):
                category= build.find(namespace+'Action').attrib
                action = build.find(namespace+'Action').text
            for otherdata in elem.findall(namespace+'OtherData'):
                #not sure how to get the 'meaning' attribute value as well as the text value for these <OtherData> tags
                pass
        except:
            pass
Right now I'm just trying to get values for:
<address>
<Action> (attribute value and text value)
<OtherData> (attribute value and text value)
I'm sort of able to do this with for-loops within for-loops, but I was hoping for a cleaner XPath solution, which I haven't figured out how to do with a namespace.
Any suggestions would be much appreciated.
Here is one way. It collects only a subset of the elements you mentioned; add more code to collect the rest.
import xml.etree.ElementTree as ET
import re
xmlstring = '''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root xmlns="http://fakeurl.com/page">
<Alarm>
<Node>
<location>Texas></location>
<name>John</name>
</Node>
</Alarm>
<Alarm>
<CreateTime>01/01/2011</CreateTime>
<Story>
<Node>
<Name>Ethan</Name>
<Address category="residential">
<address>1421 Morning SE</address>
</Address>
</Node>
</Story>
<Build>
<Action category="build_value_1">Build was successful</Action>
</Build>
<OtherData type="string" meaning="favoriteTVShow">Purple</OtherData>
<OtherData type="string" meaning="favoriteColor">Seinfeld</OtherData>
</Alarm>
</root>'''
# Strip the default namespace declaration so plain tag names can be used below
xmlstring = re.sub(' xmlns="[^"]+"', '', xmlstring, count=1)
root = ET.fromstring(xmlstring)
alarms = root.findall('Alarm')
alarms_list = []
for alarm in alarms:
    create_time = alarm.find('CreateTime')
    if create_time is not None:
        entry = {'create_time': create_time.text}
        alarms_list.append(entry)
        actions = alarm.findall('Build/Action')
        if actions:
            entry['builds'] = []
            for action in actions:
                entry['builds'].append({'category': action.attrib['category'], 'status': action.text})
print(alarms_list)
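An alternative to stripping the namespace with a regex is to hand ElementTree a prefix-to-URI mapping and use prefixed paths. A minimal sketch, assuming a cut-down version of the XML above with the declaration line omitted; the prefix "p" and the names XML_TEXT, NS and alarms are choices made for this illustration. It also picks up the meaning attribute and text of each OtherData element, which the question asked about:

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the question's document (assumption: same structure).
XML_TEXT = """<root xmlns="http://fakeurl.com/page">
  <Alarm>
    <Node><name>John</name></Node>
  </Alarm>
  <Alarm>
    <CreateTime>01/01/2011</CreateTime>
    <Story>
      <Node>
        <Address category="residential">
          <address>1421 Morning SE</address>
        </Address>
      </Node>
    </Story>
    <Build>
      <Action category="build_value_1">Build was successful</Action>
    </Build>
    <OtherData type="string" meaning="favoriteTVShow">Purple</OtherData>
    <OtherData type="string" meaning="favoriteColor">Seinfeld</OtherData>
  </Alarm>
</root>"""

NS = {"p": "http://fakeurl.com/page"}  # "p" is an arbitrary prefix chosen here

root = ET.fromstring(XML_TEXT)
alarms = []
for alarm in root.findall("p:Alarm", NS):
    create_time = alarm.findtext("p:CreateTime", namespaces=NS)
    if create_time is None:
        continue  # skip the Alarm variants that have no CreateTime
    action = alarm.find("p:Build/p:Action", NS)
    alarms.append({
        "create_time": create_time,
        "address": alarm.findtext("p:Story/p:Node/p:Address/p:address", namespaces=NS),
        "action": None if action is None else (action.get("category"), action.text),
        # meaning attribute -> text value for every OtherData child
        "other_data": {od.get("meaning"): od.text for od in alarm.findall("p:OtherData", NS)},
    })

print(alarms)
```

Unprefixed attributes such as category and meaning are not in the default namespace, so they are read with plain names.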

python ElementTree remove issue

I have xml file as following:
<plugin-config>
<properties>
<property name="AZSRVC_CONNECTION" value="diamond_plugins@AZSRVC" />
<property name="DIAMOND_HOST" value="10.0.230.1" />
<property name="DIAMOND_PORT" value="3333" />
</properties>
<pack-list>
<vsme-pack id="monthly_50MB">
<campaign-list>
<campaign id="2759" type="SOB" />
<campaign id="2723" type="SUBSCRIBE" />
</campaign-list>
</vsme-pack>
<vsme-pack id="monthly_500MB">
<campaign-list>
<campaign id="3879" type="SOB" />
<campaign id="3885" type="SOB" />
<campaign id="2724" type="SUBSCRIBE" />
<campaign id="1111" type="COB" /></campaign-list>
</vsme-pack>
</pack-list>
</plugin-config>
I am trying to run this Python script to remove a 'campaign' with a specific id.
import xml.etree.ElementTree as ET
tree = ET.parse('pack-assign-config.xml')
root = tree.getroot()
pack_list = root.find('pack-list')
camp_list = pack_list.find(".//vsme-pack[@id='{pack_id}']".format(pack_id=pack_id)).find('campaign-list').findall('campaign')
for camp in camp_list:
    if camp.get('id') == '2759':
        camp_list.remove(camp)
tree.write('out.xml')
The script runs, but the output is the same as the input file, so the element is not removed.
Issue:
The way the node is located and removed is the problem. findall('campaign') returns a plain Python list, so camp_list.remove(camp) only takes the element out of that list and leaves the XML tree untouched. The element has to be removed from its parent campaign-list element instead.
camp_list = pack_list.find(".//vsme-pack[@id='{pack_id}']".format(pack_id=pack_id)).find('campaign-list').findall('campaign')
Fixed Code Example
Here is working code that removes the nodes from the XML:
import xml.etree.ElementTree as ET
tree = ET.parse('pack-assign-config.xml')
# Walk down to each campaign-list so that remove() is called on the
# direct parent of the campaign element
for pack in tree.findall('.//pack-list/vsme-pack'):
    for camp_list in pack.findall('campaign-list'):
        for camp in camp_list.findall('campaign'):
            if camp.get('id') in ('2759', '3879'):
                camp_list.remove(camp)
tree.write("out.xml")
hope this helps
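To limit the removal to one pack, as the question intended, the campaign-list element of that pack can be looked up first and then used as the parent for remove(). A minimal sketch, assuming a shortened copy of the config in a string and an example pack_id value (the question does not show how pack_id is set):

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for pack-assign-config.xml (assumption: same structure).
CONFIG = """<plugin-config>
  <pack-list>
    <vsme-pack id="monthly_50MB">
      <campaign-list>
        <campaign id="2759" type="SOB" />
        <campaign id="2723" type="SUBSCRIBE" />
      </campaign-list>
    </vsme-pack>
  </pack-list>
</plugin-config>"""

pack_id = "monthly_50MB"  # assumed value for illustration
root = ET.fromstring(CONFIG)

# Keep a reference to the campaign-list element itself, not to the list
# returned by findall(), because only the element can remove its children.
campaign_list = root.find(f".//vsme-pack[@id='{pack_id}']/campaign-list")
for camp in campaign_list.findall("campaign"):
    if camp.get("id") == "2759":
        campaign_list.remove(camp)

remaining_ids = [c.get("id") for c in campaign_list]
print(remaining_ids)
```

In real use, find() returns None when no pack has the given id, so a check for None before the loop would be a sensible addition.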

adding multiple values of elem.attrib into a variable

I started playing with Python and ElementTree very recently to achieve something quite specific. I think I am very nearly there, but there is one thing that I can't quite work out. I am querying an XML file, pulling back the relevant data, and then putting that data into a CSV file. It all works, but the issue is that elem.attrib["text"] actually returns multiple lines. When I put it into a variable and export it to a CSV, it only exports the first line. Below is the code I am using:
import os
import csv
import xml.etree.cElementTree as ET
path = "/share/new"
c = csv.writer(open("/share/redacted.csv", "wb"))
c.writerow(["S","R","T","R2","R3"])
for filename in os.listdir(path):
    if filename.endswith('.xml'):
        fullname = os.path.join(path, filename)
        tree = ET.ElementTree(file=(fullname))
        for elem in tree.iterfind('PropertyList/Property[@name="Sender"]'):
            c1 = elem.attrib["value"]
        for elem in tree.iterfind('PropertyList/Property[@name="Recipient"]'):
            c2 = elem.attrib["value"]
        for elem in tree.iterfind('PropertyList/Property[@name="Date"]'):
            c3 = elem.attrib["value"]
        for elem in tree.iterfind('ChildContext/ResponseList/Response/TextualAnalysis/ExpressionList/Expression/Match'):
            c4 = elem.attrib["textView"]
        for elem in tree.iterfind('ChildContext/ResponseList/Response/TextualAnalysis/ExpressionList/Expression/Match/Matched'):
            c5 = elem.attrib["text"]
            print(elem.attrib["text"])
            print(c5)
        c.writerow([(c1),(c2),(c3),(c4),(c5)])
The most important part is right near the bottom. The output of print elem.attrib["text"] is:
Apples
Bananas
The output of 'print c5' is the same (just to be clear, that is Apples and Bananas on separate lines).
But outputting c5 to a CSV only outputs the first line, and therefore only Apples appears in the CSV.
I hope this makes sense. What I need to do is output both Apples and Bananas to the CSV (in the same cell, preferably). The code above is in Python 2.7 in development, but ideally I need it to work in 2.6 (I realise iterfind is not in 2.6; I have 2 versions of the code already).
I would post the XML but it is a bit of a beast. As per the suggestion in the comments, here is a cleaned-up XML.
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<Context>
<PropertyList duplicates="true">
<Property name="Sender" type="string" value="S:demo1@no-one.local"/>
<Property name="Recipient" type="string" value="RPFD:no-one.local"/>
<Property name="Date" type="string" value="Tue, 4 Aug 2015 13:24:16 +0100"/>
</PropertyList>
<ChildContext>
<ResponseList>
<Response>
<Description>
<Arg />
<Arg />
</Description>
<TextualAnalysis version="2.0">
<ExpressionList>
<Expression specified=".CLEAN.(Apples)" total="1" >
<Match textView="Body" truncated="false">
<Surrounding text="..."/>
<Surrounding text="How do you like them "/>
<Matched cleaned="true" text="Apples " type="expression"/>
<Surrounding text="???????? "/>
<Surrounding text="..."/>
</Match>
</Expression>
</ExpressionList>
</TextualAnalysis>
</Response>
</ResponseList>
</ChildContext>
<ChildContext>
<ResponseList>
<Response>
<Description>
<Arg />
<Arg />
</Description>
<TextualAnalysis version="2.0">
<ExpressionList>
<Expression specified=".CLEAN.(Bananas)" total="1" >
<Match textView="Attach" truncated="false">
<Surrounding text="..."/>
<Surrounding text="Also I don't like... "/>
<Matched cleaned="true" text="Bananas " type="expression"/>
<Surrounding text="!!!!!!! "/>
<Surrounding text="..."/>
</Match>
</Expression>
</ExpressionList>
</TextualAnalysis>
</Response>
</ResponseList>
</ChildContext>
</Context>
The following will join together all the text elements, and put them on separate lines in the same cell inside your CSV. You can change the '\n' separator to ' ' or ',' to put them on the same line. However, you might still have issues with some of your other stuff -- you don't have nested loops there and I don't really understand what you are trying to accomplish, so maybe you have more than one of each of those other things too. Anyway:
c5 = []
for elem in tree.iterfind('ChildContext/ResponseList/Response/TextualAnalysis/ExpressionList/Expression/Match/Matched'):
    c5.append(elem.attrib["text"])
c.writerow([c1, c2, c3, c4, '\n'.join(c5)])
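The same idea can be tried without any files. A minimal sketch, assuming Python 3 (the question targets Python 2), a heavily shortened version of the XML with a flatter path, and an in-memory buffer in place of the CSV file; XML_TEXT, matched and buffer are illustrative names:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Heavily shortened stand-in for the question's XML (assumption: flatter path).
XML_TEXT = """<Context>
  <ChildContext><Match textView="Body"><Matched text="Apples "/></Match></ChildContext>
  <ChildContext><Match textView="Attach"><Matched text="Bananas "/></Match></ChildContext>
</Context>"""

root = ET.fromstring(XML_TEXT)

# Collect every match instead of overwriting one variable per iteration.
matched = [m.get("text").strip() for m in root.iterfind("ChildContext/Match/Matched")]

# Write one header row and one data row; the joined values share a single cell.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["R3"])
writer.writerow(["\n".join(matched)])

# Read the CSV text back to confirm both values ended up in the same cell.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows)
```

The csv module quotes the cell because it contains a line break, which is what keeps both values together when the file is read back.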
