Replacing a xml element with lxml

I have a complex XML file in which we need to dynamically update some elements. I have successfully updated attribute values using lxml, but I'm unsure how to replace an entire element. Here's some pseudo-code showing what I'm trying to do.
import os
from lxml import etree
directory_name = "C:\\apps"
file_name = "web.config"
xpath_identifier = '/configuration/applicationSettings/Things/setting[@name="CorsTrustedOrigins"]'
#contents of the xml file for reference:
<configuration>
<configSections>
<section name="log4net" type="log4net.Config.Log4NetConfigurationSectionHandler, log4net"/>
<sectionGroup name="applicationSettings" type="System.Configuration.ApplicationSettingsGroup, System, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089">
<section name="Things" type="System.Configuration.ClientSettingsSection, System, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089" requirePermission="false"/>
</sectionGroup>
</configSections>
<appSettings/>
<applicationSettings>
<Things>
<setting name="CorsTrustedOrigins" serializeAs="Xml">
<value>
<ArrayOfString>
<string>http://localhost:51363</string>
<string>http://localhost:3333</string>
</ArrayOfString>
</value>
</setting>
</Things>
</applicationSettings>
</configuration>
file_full_path = os.path.join(directory_name, file_name)
tree = etree.parse(file_full_path)
root = tree.getroot()
etree.tostring(root)
xpath_identifier = str(xpath_identifier)
value = root.xpath(xpath_identifier)
#This successfully prints the element I'm after, so I'm sure my xpath is good:
etree.tostring(value[0])
#This is the new xml element I want to replace the current xpath'ed element with:
newxml = '''
<setting name="CorsTrustedOrigins" serializeAs="Xml">
<value>
<ArrayOfString>
<string>http://maddafakka</string>
</ArrayOfString>
</value>
</setting>
'''
newtree = etree.fromstring(newxml)
#I've tried this:
value[0].getparent().replace(value[0], newtree)
#and this
value[0] = newtree
#The value of value[0] gets updated, but the "root document" does not.
What I'm trying to do is update the "ArrayOfString" element to reflect the values in the "newxml" variable.
I'm struggling to navigate the lxml documentation on the web, and I can't find an example similar to what I'm trying to do.
Any pointers appreciated!

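You should just remove the indexed access on the node. Change:
value[0].getparent().replace(value[0], newtree)
to:
value.getparent().replace(value, newtree)
This assumes value refers to a single element. root.xpath() returns a list of matches, so with the question's code value must first be bound to one element, for example value = root.xpath(xpath_identifier)[0].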
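A minimal, self-contained sketch of the replace() call, assuming lxml is installed. It uses a trimmed copy of the question's document inlined as a string; the URL http://example.invalid and the variable names old_setting and new_setting are placeholders, not part of the original post.

```python
from lxml import etree

# Trimmed copy of the question's web.config, inlined so the sketch runs on its own.
CONFIG = b"""<configuration>
  <applicationSettings>
    <Things>
      <setting name="CorsTrustedOrigins" serializeAs="Xml">
        <value>
          <ArrayOfString>
            <string>http://localhost:51363</string>
            <string>http://localhost:3333</string>
          </ArrayOfString>
        </value>
      </setting>
    </Things>
  </applicationSettings>
</configuration>"""

# Placeholder replacement element; the URL is made up for illustration.
NEW_SETTING = b"""<setting name="CorsTrustedOrigins" serializeAs="Xml">
  <value>
    <ArrayOfString>
      <string>http://example.invalid</string>
    </ArrayOfString>
  </value>
</setting>"""

root = etree.fromstring(CONFIG)
xpath_identifier = '/configuration/applicationSettings/Things/setting[@name="CorsTrustedOrigins"]'

# xpath() returns a list of matches, so take the single element out of it.
old_setting = root.xpath(xpath_identifier)[0]
new_setting = etree.fromstring(NEW_SETTING)

# Copy the trailing whitespace so the surrounding indentation is kept.
new_setting.tail = old_setting.tail

# replace() is called on the parent and swaps the child inside the tree.
old_setting.getparent().replace(old_setting, new_setting)

result = etree.tostring(root, encoding="unicode")
print(result)
```

After the call, old_setting is detached from the document, so the change only shows up when root (or the tree) is serialized or queried again. Assigning into the list returned by xpath() changes that list only, not the document.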

Related

How do I remove a comment outside of the root element of an XML document using python lxml

How do you remove comments above or below the root node of an xml document using python's lxml module? I want to remove only one comment above the root node, NOT all comments in the entire document. For instance, given the following xml document
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- This comment needs to be removed -->
<root>
<!-- This comment needs to STAY -->
<a/>
</root>
I want to output
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root>
<!-- This comment needs to STAY -->
<a/>
</root>
The usual way to remove an element would be element.getparent().remove(element), but this doesn't work for the root element since getparent returns None. I also tried the suggestions from this Stack Overflow answer, but the first answer (using a parser that removes comments) removes all comments from the document, including the ones I want to keep, and the second answer (adding a dummy opening and closing tag around the document) doesn't work if the document has a directive above the root element.
I can get access to the comment above the root element using the following code, but how do I remove it from the document?
from lxml import etree as ET
tree = ET.parse("./sample_file.xml")
root = tree.getroot()
comment = root.getprevious()
# What do I do with comment now??
I've tried doing the following, but none of them worked:
comment.getparent().remove(comment) says AttributeError: 'NoneType' object has no attribute 'remove'
del comment does nothing
comment.clear() does nothing
comment.text = "" renders an empty comment <!---->
root.remove(comment) says ValueError: Element is not a child of this node.
tree.remove(comment) says AttributeError: 'lxml.etree._ElementTree' object has no attribute 'remove'
tree[:] = [root] says TypeError: 'lxml.etree._ElementTree' object does not support item assignment
Initialize a new tree with tree = ET.ElementTree(root). Serializing this new tree still has the comments somehow.

You could just build another tree by using fromstring() and passing in the root element.
from lxml import etree
tree = etree.parse("sample_file.xml")
new_tree = etree.fromstring(etree.tostring(tree.getroot()))
print(etree.tostring(new_tree, xml_declaration=True, encoding="UTF-8", standalone=True).decode())
printed output...
<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
<root>
<!-- This comment needs to STAY -->
<a/>
</root>
Note: This will also remove any processing instructions before root, so another option is to append the comment to root before removing...
from lxml import etree
tree = etree.parse("sample_file.xml")
root = tree.getroot()
for comment_to_delete in root.xpath("preceding::comment()"):
    root.append(comment_to_delete)
    root.remove(comment_to_delete)
print(etree.tostring(tree, xml_declaration=True, encoding="UTF-8", standalone=True).decode())
This produces the same output as above, but will retain any processing instructions that occur before root.

You can parse an XML file that contains comments with XMLPullParser:
If your input file looks like:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- This comment needs to be removed -->
<root>
<!-- This comment needs to STAY -->
<a/>
<b>Text</b>
</root>
Parse the file and write it to a new one:
import xml.etree.ElementTree as ET
import re
# Write the XML declaration line into the new file, without comment 1
def write_delte_xml(input):
    with open('Cleaned.xml', 'a') as my_file:
        my_file.write(f'{input}')

with open('Remove_Comment.xml', 'r', encoding='utf-8') as xml:
    feedstring = xml.readlines()

parser = ET.XMLPullParser(['start','end', 'comment'])
for line in enumerate(feedstring):
    if line[0] == 0 and line[1].startswith('<?'):
        write_delte_xml(line[1])
    parser.feed(line[1])
    for event, elem in parser.read_events():
        if event == "comment" and line[0] != 1:
            write_delte_xml(line[1])
            #print(line[1])
        if event == "start" and r'\>' not in line[1]:
            write_delte_xml(f"{line[1]}")
            #print("start",f"{line[1]},Element: {elem}")
        if event == "end":
            write_delte_xml(f"{line[1]}")
            #print(f"END: {line[1]}")

# Clean up duplicates
xml_list = []
with open('Cleaned.xml', 'rb') as xml:
    lines = xml.readlines()
    for line in lines:
        if line not in xml_list:
            xml_list.append(line)

with open('Cleaned_final.xml', 'wb') as my_file:
    for line in xml_list:
        my_file.write(line)
print('Cleaned.xml')
Output:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root>
<!-- This comment needs to STAY -->
<a/>
<b>Text</b>
</root>

Removing Elements from a KML (Python)

I generated a KML file using Python's SimpleKML library and the following script, the output of which is also shown below:
import simplekml
kml = simplekml.Kml()
ground = kml.newgroundoverlay(name='Aerial Extent')
ground.icon.href = 'C:\\Users\\mdl518\\Desktop\\aerial_image.png'
ground.latlonbox.north = 46.55537
ground.latlonbox.south = 46.53134
ground.latlonbox.east = 48.60005
ground.latlonbox.west = 48.57678
ground.latlonbox.rotation = 0.090320
kml.save(".//aerial_extent.kml")
The output KML:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2">
<Document id="1">
<GroundOverlay id="2">
<name>Aerial Extent</name>
<Icon id="3">
<href>C:\\Users\\mdl518\\Desktop\\aerial_image.png</href>
</Icon>
<LatLonBox>
<north>46.55537</north>
<south>46.53134</south>
<east>48.60005</east>
<west>48.57678</west>
<rotation>0.090320</rotation>
</LatLonBox>
</GroundOverlay>
</Document>
</kml>
However, I am trying to remove the "Document" tag from this KML since it is a default element generated with SimpleKML, while keeping the child elements (e.g. GroundOverlay). Additionally, is there a way to remove the "id" attributes associated with specific elements (i.e. for the GroundOverlay, Icon elements)? I am exploring the usage of ElementTree/lxml to enable this, but these seem to be more specific to XML files as opposed to KMLs. Here's what I'm trying to use to modify the KML, but it is unable to remove the Document element:
from lxml import etree
tree = etree.fromstring(open("C:\\Users\\mdl518\\Desktop\\aerial_extent.kml").read())
for item in tree.xpath("//Document[@id='1']"):
    item.getparent().remove(item)
print(etree.tostring(tree, pretty_print=True))
Here is the final desired output XML:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2">
<GroundOverlay>
<name>Aerial Extent</name>
<Icon>
<href>C:\\Users\\mdl518\\Desktop\\aerial_image.png</href>
</Icon>
<LatLonBox>
<north>46.55537</north>
<south>46.53134</south>
<east>48.60005</east>
<west>48.57678</west>
<rotation>0.090320</rotation>
</LatLonBox>
</GroundOverlay>
</kml>
Any insights are most appreciated!

You are getting tripped up on the dreaded namespaces...
Try using something like this:
ns = {'kml': 'http://www.opengis.net/kml/2.2'}
for item in tree.xpath("//kml:Document[@id='1']", namespaces=ns):
    item.getparent().remove(item)
Edit:
To remove just the parent and retain all its descendants, try the following:
retain = tree.xpath("//kml:Document[@id='1']/kml:GroundOverlay", namespaces=ns)[0]
for item in tree.xpath("//kml:Document[@id='1']", namespaces=ns):
    anchor = item.getparent()
    anchor.remove(item)
    anchor.insert(1, retain)
print(etree.tostring(tree, pretty_print=True).decode())
This should get you the desired output.
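The question also asks about dropping the "id" attributes. A minimal sketch of both steps follows. It deliberately swaps in the standard library's xml.etree.ElementTree instead of lxml and works on a trimmed copy of the KML inlined as a string. It strips every id attribute rather than selected ones, which is an assumption about what is wanted.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"
# Serialize KML elements with a default namespace instead of an ns0: prefix.
ET.register_namespace("", KML_NS)

# Trimmed copy of the generated KML; the href is shortened for illustration.
KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
<Document id="1">
<GroundOverlay id="2">
<name>Aerial Extent</name>
<Icon id="3">
<href>aerial_image.png</href>
</Icon>
</GroundOverlay>
</Document>
</kml>"""

root = ET.fromstring(KML)
ns = {"kml": KML_NS}

# Document is a direct child of <kml>; remember where it sits, then remove it.
document = root.find("kml:Document", ns)
position = list(root).index(document)
root.remove(document)

# Re-attach Document's children to <kml> at the same position, in order.
for offset, child in enumerate(list(document)):
    root.insert(position + offset, child)

# Drop the id attribute wherever it occurs.
for element in root.iter():
    element.attrib.pop("id", None)

result = ET.tostring(root, encoding="unicode")
print(result)
```

To keep some ids, replace the last loop with one that only visits the elements of interest, for example root.iter("{%s}Icon" % KML_NS).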

How to replace xml lines using 'if statements' in python?

Hi I'm new to xml files in general, but I am trying to replace specific lines in a xml file using 'if statements' in python 3.6. I've been looking at suggestions to use ElementTree, but none of the posts online quite fit the problem I have, so here I am.
My file is as follows:
<?xml version="1.0" encoding="UTF-8"?>
-<StructureDefinition xmlns="http://hl7.org/fhir">
<url value="http://example.org/fhir/StructureDefinition/MyObservation"/>
<name value="MyObservation"/>
<status value="draft"/>
<fhirVersion value="3.0.1"/>
<kind value="resource"/>
<abstract value="false"/>
<type value="Observation"/>
<baseDefinition value="http://hl7.org/fhir/StructureDefinition/Observation"/>
<derivation value="constraint"/>
</StructureDefinition>
I want to replace
url value="http://example.org/fhir/StructureDefinition/MyObservation"/
to something like
url value="http://example.org/fhir/StructureDefinition/NewObservation"/
by using conditional statements - because these are repeated multiple times in other files.
I have tried looping through the XML file to find the exact string match (which succeeded), but I wasn't able to delete or replace the line (probably because this isn't a .txt file).
Any help is greatly appreciated!

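Your sample file contains a stray "-" in front of the <StructureDefinition> start tag that is easy to overlook when copying and pasting. It has to be removed, otherwise the file is not well-formed and will not parse.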
Input File
<?xml version="1.0" encoding="UTF-8"?>
<StructureDefinition xmlns="http://hl7.org/fhir">
<url value="http://example.org/fhir/StructureDefinition/MyObservation"/>
<name value="MyObservation"/>
<status value="draft"/>
<fhirVersion value="3.0.1"/>
<kind value="resource"/>
<abstract value="false"/>
<type value="Observation"/>
<baseDefinition value="http://hl7.org/fhir/StructureDefinition/Observation"/>
<derivation value="constraint"/>
</StructureDefinition>
Script
from xml.dom.minidom import parse # use minidom for this task
dom = parse('june.xml') #read in your file
search = "http://example.org/fhir/StructureDefinition/MyObservation" #set search value
replace = "http://example.org/fhir/StructureDefinition/NewObservation" #set replace value
res = dom.getElementsByTagName('url') #iterate over url tags
for element in res:
    if element.getAttribute('value') == search: #in case of match
        element.setAttribute('value', replace) #replace
with open('june_updated.xml', 'w') as f:
    f.write(dom.toxml()) #update the dom, save as new xml file
Output file
<?xml version="1.0" ?><StructureDefinition xmlns="http://hl7.org/fhir">
<url value="http://example.org/fhir/StructureDefinition/NewObservation"/>
<name value="MyObservation"/>
<status value="draft"/>
<fhirVersion value="3.0.1"/>
<kind value="resource"/>
<abstract value="false"/>
<type value="Observation"/>
<baseDefinition value="http://hl7.org/fhir/StructureDefinition/Observation"/>
<derivation value="constraint"/>
</StructureDefinition>
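Since the question mentions ElementTree, here is a minimal sketch of the same conditional replacement with xml.etree.ElementTree instead of minidom. It works on a shortened copy of the input inlined as a string. The register_namespace call is there so the output keeps a default namespace rather than a generated prefix.

```python
import xml.etree.ElementTree as ET

FHIR_NS = "http://hl7.org/fhir"
# Keep xmlns="http://hl7.org/fhir" as the default namespace when serializing.
ET.register_namespace("", FHIR_NS)

# Shortened copy of the input file, inlined so the sketch is self-contained.
SOURCE = """<StructureDefinition xmlns="http://hl7.org/fhir">
<url value="http://example.org/fhir/StructureDefinition/MyObservation"/>
<name value="MyObservation"/>
<status value="draft"/>
</StructureDefinition>"""

search = "http://example.org/fhir/StructureDefinition/MyObservation"
replace = "http://example.org/fhir/StructureDefinition/NewObservation"

root = ET.fromstring(SOURCE)

# Elements live in the FHIR namespace, so the tag must be namespace-qualified.
for element in root.iter("{%s}url" % FHIR_NS):
    if element.get("value") == search:  # the "if statement" from the question
        element.set("value", replace)

result = ET.tostring(root, encoding="unicode")
print(result)
```

For files on disk, ET.parse(path) followed by tree.write(new_path, encoding="utf-8", xml_declaration=True) would take the place of fromstring and tostring.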

How to force ElementTree to keep xmlns attribute within its original element?

I have an input XML file:
<?xml version='1.0' encoding='utf-8'?>
<configuration>
<runtime name="test" version="1.2" xmlns:ns0="urn:schemas-microsoft-com:asm.v1">
<ns0:assemblyBinding>
<ns0:dependentAssembly />
</ns0:assemblyBinding>
</runtime>
</configuration>
...and Python script:
import xml.etree.ElementTree as ET
file_xml = 'test.xml'
tree = ET.parse(file_xml)
root = tree.getroot()
print (root.tag)
print (root.attrib)
element_runtime = root.find('.//runtime')
print (element_runtime.tag)
print (element_runtime.attrib)
tree.write(file_xml, xml_declaration=True, encoding='utf-8', method="xml")
...which gives the following output:
>test.py
configuration
{}
runtime
{'name': 'test', 'version': '1.2'}
...and has an undesirable side-effect of modifying XML into:
<?xml version='1.0' encoding='utf-8'?>
<configuration xmlns:ns0="urn:schemas-microsoft-com:asm.v1">
<runtime name="test" version="1.2">
<ns0:assemblyBinding>
<ns0:dependentAssembly />
</ns0:assemblyBinding>
</runtime>
</configuration>
My original script modifies the XML, so I do have to call tree.write and save the edited file. The problem is that the ElementTree parser moves the xmlns attribute from the runtime element up to the root element configuration, which is not desirable in my case.
I can't remove the xmlns attribute from the root element (by deleting it from its attribute dictionary) because it is not listed among its attributes (unlike the attributes listed for the runtime element).
Why does the xmlns attribute never get listed among the attributes of any element?
How can I force ElementTree to keep the xmlns attribute within its original element?
I am using Python 3.5.1 on Windows.

xml.etree.ElementTree pulls all namespaces into the first element as it internally doesn't track on which element the namespace was declared originally.
If you don't want that, you'll have to write your own serialisation logic.
The better alternative would be to use lxml instead of xml.etree, because it preserves the location where a namespace prefix is declared.
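A minimal sketch of the behaviour described above, using only the standard library and the question's XML inlined as a string. It shows that the parser folds the xmlns declaration into qualified tag names, which is why it never appears in attrib, and that the serializer then declares the namespace on the root element. The prefix chosen on output is left unchecked here because it depends on ElementTree's prefix registry.

```python
import xml.etree.ElementTree as ET

# The question's input, inlined so the sketch runs without a file.
SOURCE = """<configuration>
<runtime name="test" version="1.2" xmlns:ns0="urn:schemas-microsoft-com:asm.v1">
<ns0:assemblyBinding>
<ns0:dependentAssembly />
</ns0:assemblyBinding>
</runtime>
</configuration>"""

root = ET.fromstring(SOURCE)
runtime = root.find("runtime")

# The xmlns:ns0 declaration is consumed by the parser, so it is not an attribute.
print(runtime.attrib)

# Instead, the namespace URI becomes part of each qualified tag name.
binding = runtime[0]
print(binding.tag)

# On output, every namespace in use is declared once, on the first element.
result = ET.tostring(root, encoding="unicode")
print(result)
```

The first line of the printed XML carries the xmlns declaration on configuration, while runtime only keeps name and version.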

Following @mata's advice, here is an example with code and the XML file included.
The XML input is as shown in the picture (original and modified).
The Python code checks the NtnlCcy value and, if it is "EUR", converts the Price to USD (by multiplying by EURUSD = 1.2) and changes the NtnlCcy value to "USD".
The Python code is as follows:
from lxml import etree
pathToXMLfile = r"C:\Xiang\codes\Python\afmreports\test_original.xml"
tree = etree.parse(pathToXMLfile)
root = tree.getroot()
EURUSD = 1.2
for Rchild in root:
    print ("Root child: ", Rchild.tag, ". \n")
    if Rchild.tag.endswith("Pyld"):
        for PyldChild in Rchild:
            print ("Pyld Child: ", PyldChild.tag, ". \n")
        Doc = Rchild.find('{001.003}Document')
        FinInstrNodes = Doc.findall('{001.003}FinInstr')
        for FinInstrNode in FinInstrNodes:
            FinCcyNode = FinInstrNode.find('{001.003}NtnlCcy')
            FinPriceNode = FinInstrNode.find('{001.003}Price')
            FinCcyNodeText = ""
            if FinCcyNode is not None:
                CcyNodeText = FinCcyNode.text
                if CcyNodeText == "EUR":
                    PriceText = FinPriceNode.text
                    Price = float(PriceText)
                    FinPriceNode.text = str(Price * EURUSD)
                    FinCcyNode.text = "USD"
tree.write(r"C:\Xiang\codes\Python\afmreports\test_modified.xml", encoding="utf-8", xml_declaration=True)
print("\n the program runs to the end! \n")
As we compare the original and modified xml files, the namespace remains unchanged, the whole structure of the xml remains unchanged, only some NtnlCcy and Price Nodes have been changed, as desired.
The only minor difference we do not want is the first line. In the original xml file, it is <?xml version="1.0" encoding="UTF-8"?>, while in the modified xml file, it is <?xml version='1.0' encoding='UTF-8'?>. The quotation sign changes from double quotation to single quotation. But we think this minor difference should not matter.
The content of the original file is included below for easy testing:
<?xml version="1.0" encoding="UTF-8"?>
<BizData xmlns="001.001">
<Hdr>
<AppHdr xmlns="001.002">
<Fr>
<Id>XXX01</Id>
</Fr>
<To>
<Id>XXX02</Id>
</To>
<CreDt>2019-10-25T15:38:30</CreDt>
</AppHdr>
</Hdr>
<Pyld>
<Document xmlns="001.003">
<FinInstr>
<Id>NLENX240</Id>
<FullNm>AO.AAI</FullNm>
<NtnlCcy>EUR</NtnlCcy>
<Price>9</Price>
</FinInstr>
<FinInstr>
<Id>NLENX681</Id>
<FullNm>AO.ABN</FullNm>
<NtnlCcy>USD</NtnlCcy>
<Price>10</Price>
</FinInstr>
<FinInstr>
<Id>NLENX320</Id>
<FullNm>AO.ING</FullNm>
<NtnlCcy>EUR</NtnlCcy>
<Price>11</Price>
</FinInstr>
</Document>
</Pyld>
</BizData>

Element Tree doesn't load a Google Earth-exported KML

I have a problem related to a Google Earth exported KML, as it doesn't seem to work well with Element Tree. I don't have a clue where the problem might lie, so I will explain how I do everything.
Here is the relevant code:
kmlFile = open( filePath, 'r' ).read( -1 ) # read the whole file as text
kmlFile = kmlFile.replace( 'gx:', 'gx' ) # we need this as otherwise the Element Tree parser
# will give an error
kmlData = ET.fromstring( kmlFile )
document = kmlData.find( 'Document' )
With this code, ET (Element Tree object) creates an Element object accessible via variable kmlData. It points to the root element ('kml' tag). However, when I run a search for the sub-element 'Document', it returns None. Although the 'Document' tag is present in the KML file!
Are there any other discrepancies between KML and XML apart from the 'gx: smth' tags? I have searched through the KML files I am dealing with and found nothing suspicious. Here is a simplified structure of a KML file the program is supposed to deal with:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.2">
<Document>
<name>UK.kmz</name>
<Style id="sh_blu-blank">
<IconStyle>
<scale>1.3</scale>
<Icon>
<href>http://maps.google.com/mapfiles/kml/paddle/blu-blank.png</href>
</Icon>
<hotSpot x="32" y="1" xunits="pixels" yunits="pixels"/>
</IconStyle>
<ListStyle>
<ItemIcon>
<href>http://maps.google.com/mapfiles/kml/paddle/blu-blank-lv.png</href>
</ItemIcon>
</ListStyle>
</Style>
[other style tags...]
<Folder>
<name>UK</name>
<Placemark>
<name>1262 Crossness Pumping Station</name>
<LookAt>
<longitude>0.1329926667038817</longitude>
<latitude>51.50303535104574</latitude>
<altitude>0</altitude>
<range>4246.539753518848</range>
<tilt>0</tilt>
<heading>-4.295161152207489</heading>
<altitudeMode>relativeToGround</altitudeMode>
<gx:altitudeMode>relativeToSeaFloor</gx:altitudeMode>
</LookAt>
<styleUrl>#msn_blu-blank15000</styleUrl>
<Point>
<coordinates>0.1389579668507301,51.50888923518947,0</coordinates>
</Point>
</Placemark>
[other placemark tags...]
</Folder>
</Document>
</kml>
Do you have an idea why I can't access any sub-elements of 'kml'? By the way, Python version is 2.7.

The KML document is in the http://earth.google.com/kml/2.2 namespace, as indicated by
<kml xmlns="http://earth.google.com/kml/2.2">
This means that the name of the Document element is in fact {http://earth.google.com/kml/2.2}Document.
Instead of this:
document = kmlData.find('Document')
you need this:
document = kmlData.find('{http://earth.google.com/kml/2.2}Document')
However, there is a problem with the XML file. There is an element called gx:altitudeMode. The gx bit is a namespace prefix. Such a prefix needs to be declared, but the declaration is missing.
You have worked around the problem by simply replacing gx: with gx. But the proper way to do this would be to add the namespace declaration. Based on https://developers.google.com/kml/documentation/altitudemode, I take it that gx is associated with the http://www.google.com/kml/ext/2.2 namespace. So for the document to be well-formed, the root element start tag should read
<kml xmlns="http://earth.google.com/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2">
Now the document can be parsed:
In [1]: from xml.etree import ElementTree as ET
In [2]: kmlData = ET.parse("kml2.xml")
In [3]: document = kmlData.find('{http://earth.google.com/kml/2.2}Document')
In [4]: document
Out[4]: <Element '{http://earth.google.com/kml/2.2}Document' at 0x1895810>
In [5]:
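Writing the full {namespace}tag form everywhere gets verbose. A minimal sketch of the alternative, assuming Python 3: pass a prefix-to-URI mapping as the namespaces argument of find() and friends. The prefixes kml and gx in the mapping are local names chosen for this example, and the KML is a cut-down copy of the question's file with the gx declaration added.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the question's KML, with xmlns:gx declared on the root.
KML = """<kml xmlns="http://earth.google.com/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2">
<Document>
<name>UK.kmz</name>
<Folder>
<name>UK</name>
<Placemark>
<name>1262 Crossness Pumping Station</name>
<LookAt>
<altitudeMode>relativeToGround</altitudeMode>
<gx:altitudeMode>relativeToSeaFloor</gx:altitudeMode>
</LookAt>
</Placemark>
</Folder>
</Document>
</kml>"""

# Prefixes used in the search paths; they need not match those in the file.
ns = {
    "kml": "http://earth.google.com/kml/2.2",
    "gx": "http://www.google.com/kml/ext/2.2",
}

root = ET.fromstring(KML)

# Same lookup as the {uri}Document form, written with a prefix.
document = root.find("kml:Document", ns)
doc_name = document.find("kml:name", ns).text

# The two altitudeMode elements differ only by namespace.
gx_mode = root.find(".//gx:altitudeMode", ns).text
kml_mode = root.find(".//kml:altitudeMode", ns).text

placemark_names = [
    placemark.find("kml:name", ns).text
    for placemark in root.iterfind(".//kml:Placemark", ns)
]
print(doc_name, gx_mode, kml_mode, placemark_names)
```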
