Python, XML parsing, and Elementtree


What am I screwing up here?
I can't get this to return any results. I'm sure I'm doing something stupid. I'm not a programmer and this is driving me crazy. Trying to learn but after about 8 hours I'm frazzled.
Here is a sample of my XML:
<?xml version="1.0"?>
<MyObjectBuilder_Sector xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<!-- Saved '2014-08-23T15:28:07.8585220-05:00' with SEToolbox version '1.44.14.2' -->
<Position>
<X>0</X>
<Y>0</Y>
<Z>0</Z>
</Position>
<SectorEvents>
<Events>
<MyObjectBuilder_GlobalEventBase>
<DefinitionId>
<TypeId>MyObjectBuilder_GlobalEventDefinition</TypeId>
<SubtypeId>SpawnCargoShip</SubtypeId>
</DefinitionId>
<Enabled>false</Enabled>
<ActivationTimeMs>401522</ActivationTimeMs>
</MyObjectBuilder_GlobalEventBase>
</Events>
</SectorEvents>
<AppVersion>1044014</AppVersion>
<SectorObjects>
<MyObjectBuilder_EntityBase xsi:type="MyObjectBuilder_VoxelMap">
<EntityId>72248529206701361</EntityId>
<PersistentFlags>CastShadows InScene</PersistentFlags>
<PositionAndOrientation>
<Position z="-466" y="-8987" x="-95"/>
<Forward z="-1" y="0" x="0"/>
<Up z="0" y="1" x="0"/>
</PositionAndOrientation>
<Filename>BaseAsteroid.vox</Filename>
</MyObjectBuilder_EntityBase>
<MyObjectBuilder_EntityBase xsi:type="MyObjectBuilder_VoxelMap">
<EntityId>72151252176979970</EntityId>
<PersistentFlags>CastShadows InScene</PersistentFlags>
<PositionAndOrientation>
<Position z="-11301.9033" y="-1183.70569" x="-2126.84"/>
<Forward z="-1" y="0" x="0"/>
<Up z="0" y="1" x="0"/>
</PositionAndOrientation>
<Filename>asteroid0.vox</Filename>
</MyObjectBuilder_EntityBase>
<MyObjectBuilder_EntityBase xsi:type="MyObjectBuilder_VoxelMap">
<EntityId>72108197145016458</EntityId>
<PersistentFlags>CastShadows InScene</PersistentFlags>
<PositionAndOrientation>
<Position z="355.7873" y="18738.05" x="1064.912"/>
<Forward z="-1" y="0" x="0"/>
<Up z="0" y="1" x="0"/>
</PositionAndOrientation>
<Filename>asteroid1.vox</Filename>
</MyObjectBuilder_EntityBase>
Here is my code, it just never finds anything...:(
from xml.etree import cElementTree as ElementTree
ElementTree.register_namespace('xsi', 'http://www.w3.org/2001/XMLScheme-instance')
namespace = {'xsi': 'http://www.w3.org/2001/XMLScheme-instance'}
xmlPath = 'e:\\test.xml'
xmlRoot = ElementTree.parse(xmlPath).getroot()
#why this no return anything
results = xmlRoot.findall(".//SectorObjects/MyObjectBuilder_EntityBase[@xsi:type='MyObjectBuilder_VoxelMap']", namespaces=namespace)
print(results)

Your question is 'What am I screwing up here?' First of all, the XML itself has issues; it seems it did not paste here correctly. I did a few things to make it workable.
1) Added lines below since they were not there in the XML:
</SectorObjects>
</MyObjectBuilder_Sector>
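2) In the interpreter I used, findall did not accept the named argument 'namespaces' (current Python 3 releases do accept it), and without a prefix map the xsi part gave an error (SyntaxError: prefix 'xsi' not found in prefix map). So I dropped the attribute test and changed the call to: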
results = xmlRoot.findall(".//SectorObjects/MyObjectBuilder_EntityBase")
When I ran the code with above changes, I got this output below:
[<Element 'MyObjectBuilder_EntityBase' at 0x025028A8>, <Element 'MyObjectBuilder_EntityBase' at 0x02502CC8>, <Element 'MyObjectBuilder_EntityBase' at 0x02502E18>]
If you want to do more with these elements, such as getting the value of EntityId, you can do this:
results = xmlRoot.findall(".//SectorObjects/MyObjectBuilder_EntityBase")
try:
    for result in results:
        # find() returns None when EntityId is missing, so .text raises AttributeError
        print(result.find('EntityId').text)
except AttributeError as aE:
    print(str(aE))
Output:
72248529206701361
72151252176979970
72108197145016458
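
For reference, here is a minimal sketch of the namespace-aware query the question was aiming for, assuming Python 3, where findall accepts a namespaces mapping. The inline SAMPLE string is a trimmed, hypothetical copy of the question's XML with the closing tags restored. The important details are that the URI in the mapping must match the document's xmlns:xsi value exactly (XMLSchema-instance, not XMLScheme-instance) and that the attribute test is written with @.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical copy of the question's XML with closing tags restored.
SAMPLE = """<?xml version="1.0"?>
<MyObjectBuilder_Sector xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <SectorObjects>
    <MyObjectBuilder_EntityBase xsi:type="MyObjectBuilder_VoxelMap">
      <EntityId>72248529206701361</EntityId>
      <Filename>BaseAsteroid.vox</Filename>
    </MyObjectBuilder_EntityBase>
    <MyObjectBuilder_EntityBase xsi:type="MyObjectBuilder_VoxelMap">
      <EntityId>72151252176979970</EntityId>
      <Filename>asteroid0.vox</Filename>
    </MyObjectBuilder_EntityBase>
  </SectorObjects>
</MyObjectBuilder_Sector>"""

# The URI must be identical to the xmlns:xsi value declared in the document.
NAMESPACES = {"xsi": "http://www.w3.org/2001/XMLSchema-instance"}

root = ET.fromstring(SAMPLE)

# The xsi: prefix inside the predicate is expanded through the NAMESPACES mapping.
voxel_maps = root.findall(
    ".//SectorObjects/MyObjectBuilder_EntityBase[@xsi:type='MyObjectBuilder_VoxelMap']",
    namespaces=NAMESPACES,
)
entity_ids = [vm.findtext("EntityId") for vm in voxel_maps]
print(entity_ids)
```

With the misspelled URI the same query runs without error but returns an empty list, which matches the symptom described in the question.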

Related

How to remove a sub child of an xml file using python?

I have an xml file, which has a particular set of child lines which should be deleted when the python code is run.
Below shown lines are my xml code.
<?xml version="1.0" encoding="utf-8" ?>
<visualization protocolVersion="10.4.0.0">
<globalSection/>
<coreObjectDefinition type="displayDefinition">
<version type="version" value="10.4.0.0"/>
<width>1920</width>
<height>810</height>
<referenceCheck>2</referenceCheck>
<defaultBgColor type="colorSet" r="255" g="255" b="255"/>
<defaultFgColor type="colorSet" r="0" g="0" b="0"/>
<defaultFont type="font" name="Tahoma" size="16" underline="false" strikethrough="false"/>
<defaultStroke type="stroke" width="1.0"/>
<grid type="grid" gridVisible="true" snappingActive="true" verticalSnapInterval="8" horizontalSnapInterval="8" onTop="false">
<color type="colorSet" r="0" g="0" b="0"/>
</grid>
<revisionHistory type="revisionHistory">
<revision type="revision" who="ADMIN" when="2020.05.03 09:46:15.566 CEST" what="Created" where="CPC-A0668-4138"/>
</revisionHistory>
<blinkDelay>500</blinkDelay>
<mousePassThrough>false</mousePassThrough>
<visibilityGroup type="componentData">
<htmlId>2</htmlId>
<name>Overview</name>
<description>Always shown</description>
<minimumZoomEnabled>true</minimumZoomEnabled>
<minimumZoomFactor>10.0</minimumZoomFactor>
</visibilityGroup>
<visibilityGroup type="componentData">
<htmlId>3</htmlId>
<name>Rough</name>
<description>Shown when viewing viewing a large area</description>
<minimumZoomEnabled>true</minimumZoomEnabled>
<minimumZoomFactor>25.0</minimumZoomFactor>
</visibilityGroup>
<visibilityGroup type="componentData">
<htmlId>4</htmlId>
<name>Standard</name>
<description>Shown when using the default view setting</description>
<minimumZoomEnabled>true</minimumZoomEnabled>
<minimumZoomFactor>100.0</minimumZoomFactor>
</visibilityGroup>
<visibilityGroup type="componentData">
<htmlId>5</htmlId>
<name>Detail</name>
<description>Shown only when viewing a small area</description>
<minimumZoomEnabled>true</minimumZoomEnabled>
<minimumZoomFactor>400.0</minimumZoomFactor>
</visibilityGroup>
<visibilityGroup type="componentData">
<htmlId>6</htmlId>
<name>Intricacies</name>
<description>Shown only when viewing a very small area</description>
<minimumZoomEnabled>true</minimumZoomEnabled>
<minimumZoomFactor>1000.0</minimumZoomFactor>
</visibilityGroup>
<visualizationLayer type="componentData">
<htmlId>1</htmlId>
<name>Layer1</name>
</visualizationLayer>
<componentCountHint>1</componentCountHint>
<ellipse type="componentData" x="851.99896" y="300.00006" top="92.000046" bottom="91.99985" left="99.99896" right="100.001526">
<htmlId>7</htmlId>
<stroke type="stroke" width="1.0"/>
<fillPaint type="paint">
<paint type="colorSet" r="255" g="255" b="255"/>
</fillPaint>
<data type="data">
<action type="actionConnectTo">
<property type="property" name="ellipse.visible"/>
<filter type="filter">
<value>0.0</value>
</filter>
<connection type="connection">
<direction>1</direction>
<itemName>AOG.Templates.Alarm</itemName>
<itemId>2.1.3.0.0.2.1.8</itemId>
</connection>
</action>
</data>
</ellipse>
</coreObjectDefinition>
</visualization>
I want only the below part to be deleted from the entire xml file.
<data type="data">
<action type="actionConnectTo">
<property type="property" name="ellipse.visible"/>
<filter type="filter">
<value>0.0</value>
</filter>
<connection type="connection">
<direction>1</direction>
<itemName>AOG.Templates.Alarm</itemName>
<itemId>2.1.3.0.0.2.1.8</itemId>
</connection>
</action>
</data>
The Python code below only removes a direct child of the root, not the nested sub-child. Kindly help me out with this.
from xml.etree import ElementTree
root = ElementTree.parse("test1.xml").getroot()
b = root.getchildren()[0]
root.remove(b)
ElementTree.dump(root)

Try this.
from simplified_scrapy import SimplifiedDoc,utils,req
html = '''Your xml'''
doc = SimplifiedDoc(html)
data = doc.select('data#type=data') # Get the element
data.repleaceSelf("") # Remove it
print(doc.html) # This is what you want

ElementTree elements do not keep a reference to their parent, and remove() only works on the direct children of the element you call it on. So, in order to remove the <data/> node, you have to call remove() on its direct parent node.
I'd do it this way:
for d in root.findall('coreObjectDefinition'):
    for e in d.findall('ellipse'):
        for f in e.findall('data'):
            # e is the direct parent of f, so this removal succeeds
            e.remove(f)
ElementTree's path syntax also lets you search a tree recursively, so you can find every data element with root.findall('.//data'). A shorter version of the above code would be:
for d in root.findall('.//ellipse'):
    for e in d.findall('data'):
        d.remove(e)
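
As a self-contained sketch of the same idea, the snippet below uses the './/data/..' path, which ElementTree supports for selecting the parent of each matched element. The SAMPLE string is a cut-down, hypothetical version of the file from the question, not the full document.

```python
import xml.etree.ElementTree as ET

# Cut-down, hypothetical version of the file from the question.
SAMPLE = """<visualization>
  <coreObjectDefinition type="displayDefinition">
    <ellipse type="componentData">
      <htmlId>7</htmlId>
      <data type="data">
        <action type="actionConnectTo"/>
      </data>
    </ellipse>
  </coreObjectDefinition>
</visualization>"""

root = ET.fromstring(SAMPLE)

# './/data/..' selects every element that has a direct <data> child.
for parent in root.findall(".//data/.."):
    # findall() returns a list, so removing while looping is safe.
    for data in parent.findall("data"):
        parent.remove(data)

remaining = root.findall(".//data")
print(ET.tostring(root, encoding="unicode"))
```

To apply this to a file, parse it with ET.parse(...), run the same loop on tree.getroot(), and save the result with tree.write(...).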

Extracting comments from XML file in Python

I would like to extract the comment section of the XML file. The information I want is inside the Tag element, within its nested Text element; in the sample below it is "EXAMPLE".
The structure of the XML file looks below.
<Boxes>
<Box Id="3" ZIndex="13">
<Shape>Rectangle</Shape>
<Brush Id="0" />
<Pen>
<Color>#FF000000</Color>
</Pen>
<Tag>&lt;?xml version="1.0"?&gt;
&lt;PFDComment xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"&gt;
&lt;Text&gt;**EXAMPLE** &lt;/Text&gt;
&lt;/PFDComment&gt;</Tag>
</Box>
</Boxes>
I tried something like the code below but couldn't get the information that I want.
def read_comments(xml):
    tree = lxml.etree.parse(xml)
    Comments = {}
    for comment in tree.xpath("//Boxes/Box"):
        get_id = comment.attrib['Id']
        Comments[get_id] = []
        for group in comment.xpath(".//Tag"):
            Comments[get_id].append(group.text)
    df_name1 = pd.DataFrame(dict([(k,pd.Series(v)) for k,v in Comments.items()]))
Can anyone help to extract comments from XML file shown above? Any help is appreciated!

Use the code given below:
def read_comments(xml):
    tree = etree.parse(xml)
    rows = []
    for box in tree.xpath('Box'):
        id = box.attrib['Id']
        tagTxt = box.findtext('Tag')
        if tagTxt is None:
            continue
        txtNode = etree.XML(tagTxt).find('Text')
        if txtNode is None:
            continue
        rows.append([id, txtNode.text.strip()])
    return pd.DataFrame(rows, columns=['id', 'Comment'])
Note that a DataFrame created inside a function is a local variable of that function and is not visible from outside. A better and more readable approach, which I used here, is to have the function return the DataFrame.
The function also contains continue in two places to guard against possible error cases: a Box element without a Tag child, or a Tag whose content has no Text child element.
There is also no need to replace &lt; or &gt; with < or > in your own code, as lxml does that on its own.
Edit
My test is as follows. Start from the imports:
import pandas as pd
from lxml import etree
I used a file containing:
<Boxes>
<Box Id="3" ZIndex="13">
<Shape>Rectangle</Shape>
<Brush Id="0" />
<Pen>
<Color>#FF000000</Color>
</Pen>
<Tag>&lt;?xml version="1.0"?&gt;
&lt;PFDComment xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"&gt;
&lt;Text&gt;**EXAMPLE** &lt;/Text&gt;
&lt;/PFDComment&gt;</Tag>
</Box>
</Boxes>
I called the above function:
df_name1 = read_comments('Boxes.xml')
and when I printed df_name1, I got:
id Comment
0 3 **EXAMPLE**
If something goes wrong, use the "extended" version of the above function,
with test printouts:
def read_comments(xml):
    tree = etree.parse(xml)
    rows = []
    for box in tree.xpath('Box'):
        id = box.attrib['Id']
        tagTxt = box.findtext('Tag')
        if tagTxt is None:
            print('No Tag element')
            continue
        txtNode = etree.XML(tagTxt).find('Text')
        if txtNode is None:
            print('No Text element')
            continue
        txt = txtNode.text.strip()
        print(f'{id}: {txt}')
        rows.append([id, txt])
    return pd.DataFrame(rows, columns=['id', 'Comment'])
and take a look at printouts.
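
The same two-stage parse can be sketched with only the standard library, in case lxml and pandas are not available. This is an illustration under the assumption that the Tag element holds an escaped XML document as its text. The SAMPLE string, the second Box, and the list-of-tuples return value are hypothetical choices made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the Tag text is an escaped XML document.
SAMPLE = """<Boxes>
  <Box Id="3" ZIndex="13">
    <Shape>Rectangle</Shape>
    <Tag>&lt;?xml version="1.0"?&gt;
&lt;PFDComment xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"&gt;
  &lt;Text&gt;**EXAMPLE** &lt;/Text&gt;
&lt;/PFDComment&gt;</Tag>
  </Box>
  <Box Id="4" ZIndex="14">
    <Shape>Ellipse</Shape>
  </Box>
</Boxes>"""


def read_comments(xml_text):
    """Return (box id, comment) pairs for every Box that carries a comment."""
    rows = []
    for box in ET.fromstring(xml_text).findall("Box"):
        # The outer parser has already unescaped the Tag text into XML source.
        tag_text = box.findtext("Tag")
        if tag_text is None:
            continue
        # Second parse: treat the Tag text as its own small document.
        text_node = ET.fromstring(tag_text).find("Text")
        if text_node is None or text_node.text is None:
            continue
        rows.append((box.get("Id"), text_node.text.strip()))
    return rows


comments = read_comments(SAMPLE)
print(comments)
```

The resulting list of pairs can be passed to pd.DataFrame(comments, columns=['id', 'Comment']) if a DataFrame is needed.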

Parsing XML with namespace into dictionary

I'm having a hard time following the xml.etree.ElementTree documentation with regard to parsing an XML document with a namespace and nested tags.
To begin, the xml tree I am trying to parse looks like:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ROOT-MAIN xmlns="http://fakeurl.com/page">
<Alarm> <--- I dont care about these types of objects
<Node>
<location>Texas></location>
<name>John</name>
</Node>
</Alarm>
<Alarm> <--- I care about these types of objects
<CreateTime>01/01/2011</CreateTime>
<Story>
<Node>
<Name>Ethan</Name>
<Address category="residential">
<address>1421 Morning SE</address>
</Address>
</Node>
</Story>
<Build>
<Action category="build_value_1">Build was successful</Action>
</Build>
<OtherData type="string" meaning="favoriteTVShow">Purple</OtherData>
<OtherData type="string" meaning="favoriteColor">Seinfeld</OtherData>
</Alarm>
</ROOT-MAIN>
I am trying to build an array of dictionaries that have a similar structure to the second < Alarm > object. When parsing this XML file, I do the following:
import xml.etree.ElementTree as ET
tree = ET.parse('data/'+filename)
root = tree.getroot()
namespace= '{http://fakeurl.com/page}'
for alarm in tree.findall(namespace+'Alarm'):
    for elem in alarm.iter():
        try:
            creation_time = elem.find(namespace+'CreateTime')
            for story in elem.findall(namespace+'Story'):
                for node in story.findall(namespace+'Node'):
                    for Address in node.findall(namespace+'Address'):
                        address = Address.find(namespace+'address').text
            for build in elem.findall(namespace+'Build'):
                category= build.find(namespace+'Action').attrib
                action = build.find(namespace+'Action').text
            for otherdata in elem.findall(namespace+'OtherData'):
                #not sure how to get the 'meaning' attribute value as well as the text value for these <OtherData> tags
                pass
        except:
            pass
Right now I'm just trying to get values for:
< address >
< Action > (attribute value and text value)
< OtherData > (attribute value and text value)
I'm sort of able to do this with for loops within for-loops but I was hoping for a cleaner, xpath solution which I haven't figured out how to do with a namespace.
Any suggestions would be much appreciated.

Here (collecting a subset of the elements you mentioned -- add more code to collect rest of elements)
import xml.etree.ElementTree as ET
import re
xmlstring = '''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root xmlns="http://fakeurl.com/page">
<Alarm>
<Node>
<location>Texas></location>
<name>John</name>
</Node>
</Alarm>
<Alarm>
<CreateTime>01/01/2011</CreateTime>
<Story>
<Node>
<Name>Ethan</Name>
<Address category="residential">
<address>1421 Morning SE</address>
</Address>
</Node>
</Story>
<Build>
<Action category="build_value_1">Build was successful</Action>
</Build>
<OtherData type="string" meaning="favoriteTVShow">Purple</OtherData>
<OtherData type="string" meaning="favoriteColor">Seinfeld</OtherData>
</Alarm>
</root>'''
xmlstring = re.sub(' xmlns="[^"]+"', '', xmlstring, count=1)
root = ET.fromstring(xmlstring)
alarms = root.findall('Alarm')
alarms_list = []
for alarm in alarms:
    create_time = alarm.find('CreateTime')
    # Only alarms with a CreateTime child are of interest
    if create_time is not None:
        entry = {'create_time': create_time.text}
        alarms_list.append(entry)
        actions = alarm.findall('Build/Action')
        if actions:
            entry['builds'] = []
            for action in actions:
                entry['builds'].append({'category': action.attrib['category'], 'status': action.text})
print(alarms_list)
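
If you would rather keep the namespace than strip it with a regular expression, a rough alternative is to pass a prefix mapping to find, findtext, and findall. In the sketch below the prefix "p", the dictionary keys, and the cleaned-up SAMPLE string are arbitrary choices for this example. It also collects the address and the OtherData values that the question asked about.

```python
import xml.etree.ElementTree as ET

# Cleaned-up, hypothetical copy of the question's document.
SAMPLE = """<ROOT-MAIN xmlns="http://fakeurl.com/page">
  <Alarm>
    <Node><location>Texas</location><name>John</name></Node>
  </Alarm>
  <Alarm>
    <CreateTime>01/01/2011</CreateTime>
    <Story>
      <Node>
        <Name>Ethan</Name>
        <Address category="residential">
          <address>1421 Morning SE</address>
        </Address>
      </Node>
    </Story>
    <Build>
      <Action category="build_value_1">Build was successful</Action>
    </Build>
    <OtherData type="string" meaning="favoriteTVShow">Purple</OtherData>
    <OtherData type="string" meaning="favoriteColor">Seinfeld</OtherData>
  </Alarm>
</ROOT-MAIN>"""

# "p" is an arbitrary prefix; only the URI has to match the document.
NS = {"p": "http://fakeurl.com/page"}

root = ET.fromstring(SAMPLE)
alarms = []
for alarm in root.findall("p:Alarm", NS):
    create_time = alarm.findtext("p:CreateTime", namespaces=NS)
    if create_time is None:
        continue  # skip alarms that lack the structure we care about
    alarms.append({
        "create_time": create_time,
        "address": alarm.findtext("p:Story/p:Node/p:Address/p:address", namespaces=NS),
        "builds": [
            {"category": action.get("category"), "status": action.text}
            for action in alarm.findall("p:Build/p:Action", NS)
        ],
        # Map each 'meaning' attribute to the element's text value.
        "other_data": {
            other.get("meaning"): other.text
            for other in alarm.findall("p:OtherData", NS)
        },
    })
print(alarms)
```

Every step of a path needs the prefix, because each child element inherits the default namespace declared on the root.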

python ElementTree remove issue

I have an XML file as follows:
<plugin-config>
<properties>
<property name="AZSRVC_CONNECTION" value="diamond_plugins#AZSRVC" />
<property name="DIAMOND_HOST" value="10.0.230.1" />
<property name="DIAMOND_PORT" value="3333" />
</properties>
<pack-list>
<vsme-pack id="monthly_50MB">
<campaign-list>
<campaign id="2759" type="SOB" />
<campaign id="2723" type="SUBSCRIBE" />
</campaign-list>
</vsme-pack>
<vsme-pack id="monthly_500MB">
<campaign-list>
<campaign id="3879" type="SOB" />
<campaign id="3885" type="SOB" />
<campaign id="2724" type="SUBSCRIBE" />
<campaign id="1111" type="COB" /></campaign-list>
</vsme-pack>
</pack-list>
</plugin-config>
I am trying to run this Python script to remove a 'campaign' with a specific id.
import xml.etree.ElementTree as ET
tree = ET.parse('pack-assign-config.xml')
root = tree.getroot()
pack_list = root.find('pack-list')
camp_list = pack_list.find(".//vsme-pack[@id='{pack_id}']".format(pack_id=pack_id)).find('campaign-list').findall('campaign')
for camp in camp_list:
    if camp.get('id') == '2759':
        camp_list.remove(camp)
tree.write('out.xml')
The script runs, but the output is the same as the input file, so the element is not removed.

Issue:
findall() returns a plain Python list, so camp_list.remove(camp) only removes the item from that list and leaves the XML tree untouched. To change the tree, call remove() on the campaign-list element, which is the direct parent of each campaign. This is the line that builds the list:
camp_list = pack_list.find(".//vsme-pack[@id='{pack_id}']".format(pack_id=pack_id)).find('campaign-list').findall('campaign')
Fixed Code Example
Here is working code that removes the node from the XML:
import xml.etree.ElementTree as ET
tree = ET.parse('pack-assign-config.xml')
root = tree.getroot()
# Walk down to each campaign-list, the direct parent of the campaign elements
for pack in root.findall('.//pack-list/vsme-pack'):
    for camp_list in pack.findall('campaign-list'):
        # findall() returns a list, so removing from the tree while looping is safe
        for camp in camp_list.findall('campaign'):
            if camp.get('id') in ('2759', '3879'):
                camp_list.remove(camp)
tree.write("out.xml")
hope this helps
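
To limit the removal to one pack, as the original script intended with pack_id, something along these lines should work. The helper name remove_campaign and the shortened SAMPLE string are made up for this sketch. The point is that remove() is called on the campaign-list element rather than on the list returned by findall().

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the config file from the question.
SAMPLE = """<plugin-config>
  <pack-list>
    <vsme-pack id="monthly_50MB">
      <campaign-list>
        <campaign id="2759" type="SOB" />
        <campaign id="2723" type="SUBSCRIBE" />
      </campaign-list>
    </vsme-pack>
    <vsme-pack id="monthly_500MB">
      <campaign-list>
        <campaign id="3879" type="SOB" />
      </campaign-list>
    </vsme-pack>
  </pack-list>
</plugin-config>"""


def remove_campaign(root, pack_id, campaign_id):
    """Remove matching campaigns from one pack; return how many were removed."""
    camp_list = root.find(f".//vsme-pack[@id='{pack_id}']/campaign-list")
    if camp_list is None:
        return 0
    removed = 0
    for camp in camp_list.findall(f"campaign[@id='{campaign_id}']"):
        # camp_list is the Element that owns camp, so the tree itself changes.
        camp_list.remove(camp)
        removed += 1
    return removed


root = ET.fromstring(SAMPLE)
removed = remove_campaign(root, "monthly_50MB", "2759")
remaining_ids = [camp.get("id") for camp in root.iter("campaign")]
print(removed, remaining_ids)
```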

adding multiple values of elem.attrib into a variable

I started playing with Python and ElementTree very recently to achieve something quite specific, and I think I am very nearly there, but there is one thing I can't quite work out. I am querying an XML file, pulling back the relevant data, and putting that data into a CSV file. It all works, but elem.attrib["text"] actually returns multiple lines; when I put it into a variable and export to CSV, only the first line is exported. Below is the code I am using:
import os
import csv
import xml.etree.cElementTree as ET
path = "/share/new"
c = csv.writer(open("/share/redacted.csv", "wb"))
c.writerow(["S","R","T","R2","R3"])
for filename in os.listdir(path):
    if filename.endswith('.xml'):
        fullname = os.path.join(path, filename)
        tree = ET.ElementTree(file=(fullname))
        for elem in tree.iterfind('PropertyList/Property[@name="Sender"]'):
            c1 = elem.attrib["value"]
        for elem in tree.iterfind('PropertyList/Property[@name="Recipient"]'):
            c2 = elem.attrib["value"]
        for elem in tree.iterfind('PropertyList/Property[@name="Date"]'):
            c3 = elem.attrib["value"]
        for elem in tree.iterfind('ChildContext/ResponseList/Response/TextualAnalysis/ExpressionList/Expression/Match'):
            c4 = elem.attrib["textView"]
        for elem in tree.iterfind('ChildContext/ResponseList/Response/TextualAnalysis/ExpressionList/Expression/Match/Matched'):
            c5 = elem.attrib["text"]
            print elem.attrib["text"]
            print c5
        c.writerow([(c1),(c2),(c3),(c4),(c5)])
The most important part is right near the bottom. The output of print elem.attrib["text"] is:
Apples
Bananas
The output of 'print c5' is the same (just to be clear, that is Apples and Bananas on separate lines).
But writing c5 to the CSV only outputs one of the lines, so only one value appears in the CSV.
I hope this makes sense. What I need to do is output both Apples and Bananas to the CSV (in the same cell, preferably). The code is Python 2.7 in development, but ideally I need it to work in 2.6 (I realise iterfind is not in 2.6; I have 2 versions of the code already).
I would post the XML but it is a bit of a beast. As per a suggestion in the comments, here is a cleaned-up XML.
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<Context>
<PropertyList duplicates="true">
<Property name="Sender" type="string" value="S:demo1#no-one.local"/>
<Property name="Recipient" type="string" value="RPFD:no-one.local"/>
<Property name="Date" type="string" value="Tue, 4 Aug 2015 13:24:16 +0100"/>
</PropertyList>
<ChildContext>
<ResponseList>
<Response>
<Description>
<Arg />
<Arg />
</Description>
<TextualAnalysis version="2.0">
<ExpressionList>
<Expression specified=".CLEAN.(Apples)" total="1" >
<Match textView="Body" truncated="false">
<Surrounding text="..."/>
<Surrounding text="How do you like them "/>
<Matched cleaned="true" text="Apples " type="expression"/>
<Surrounding text="???????? "/>
<Surrounding text="..."/>
</Match>
</Expression>
</ExpressionList>
</TextualAnalysis>
</Response>
</ResponseList>
</ChildContext>
<ChildContext>
<ResponseList>
<Response>
<Description>
<Arg />
<Arg />
</Description>
<TextualAnalysis version="2.0">
<ExpressionList>
<Expression specified=".CLEAN.(Bananas)" total="1" >
<Match textView="Attach" truncated="false">
<Surrounding text="..."/>
<Surrounding text="Also I don't like... "/>
<Matched cleaned="true" text="Bananas " type="expression"/>
<Surrounding text="!!!!!!! "/>
<Surrounding text="..."/>
</Match>
</Expression>
</ExpressionList>
</TextualAnalysis>
</Response>
</ResponseList>
</ChildContext>
</Context>

The following will join together all the text elements, and put them on separate lines in the same cell inside your CSV. You can change the '\n' separator to ' ' or ',' to put them on the same line. However, you might still have issues with some of your other stuff -- you don't have nested loops there and I don't really understand what you are trying to accomplish, so maybe you have more than one of each of those other things too. Anyway:
c5 = []
for elem in tree.iterfind('ChildContext/ResponseList/Response/TextualAnalysis/ExpressionList/Expression/Match/Matched'):
    c5.append(elem.attrib["text"])
c.writerow([c1, c2, c3, c4, '\n'.join(c5)])
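
For completeness, here is a small self-contained Python 3 sketch of the same join-before-writing approach. It writes to an in-memory buffer instead of /share/redacted.csv, and the SAMPLE string is a reduced, hypothetical version of the posted XML. The textView values are joined the same way, on the assumption that there can be more than one Match per file.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Reduced, hypothetical version of the posted XML.
SAMPLE = """<Context>
  <PropertyList duplicates="true">
    <Property name="Sender" type="string" value="sender.example"/>
  </PropertyList>
  <ChildContext><ResponseList><Response><TextualAnalysis version="2.0"><ExpressionList>
    <Expression><Match textView="Body">
      <Matched cleaned="true" text="Apples " type="expression"/>
    </Match></Expression>
  </ExpressionList></TextualAnalysis></Response></ResponseList></ChildContext>
  <ChildContext><ResponseList><Response><TextualAnalysis version="2.0"><ExpressionList>
    <Expression><Match textView="Attach">
      <Matched cleaned="true" text="Bananas " type="expression"/>
    </Match></Expression>
  </ExpressionList></TextualAnalysis></Response></ResponseList></ChildContext>
</Context>"""

MATCH_PATH = "ChildContext/ResponseList/Response/TextualAnalysis/ExpressionList/Expression/Match"

root = ET.fromstring(SAMPLE)
sender = root.find('PropertyList/Property[@name="Sender"]').get("value")

# Collect every match instead of overwriting one variable on each iteration.
views = [match.get("textView") for match in root.iterfind(MATCH_PATH)]
texts = [matched.get("text").strip() for matched in root.iterfind(MATCH_PATH + "/Matched")]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["S", "R2", "R3"])
# A newline inside a field is quoted by the csv module, so it stays in one cell.
writer.writerow([sender, "\n".join(views), "\n".join(texts)])

# Read the CSV back to confirm the multi-line cell survives a round trip.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows)
```

When writing to a real file in Python 3, open it with open(path, "w", newline="") so the csv module controls the line endings.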
