Override text in XML with lxml - python

Say I have an XML file and I want to edit parts of it. The following does not work, probably because I am editing a copy of a child.
from lxml import etree as et
tree = et.parse(p_my_xml)
root = tree.getroot()
for child in root:
for entry in child.getchildren():
first_part = entry.getchildren()[1].text
second_part = entry.getchildren()[2].text
if first_part == 'some_condition'
second_part = 'something_else'
tree.write(p_my_xml, pretty_print=True)
How can I correctly modify parts of the XML so that the changes are done in the tree?

Save the reference to the element and reset the text:
second_elm = entry.getchildren()[2]
if first_part == 'some_condition'
second_elm.text = 'something_else'

For future readers, any XML transformation, styling, re-formatting, and re-structuring can be adequately and even efficiently handled with XSLT, the declarative programming language used for XML manipulation. And Python's lxml module maintains an XSLT processor.
See below generalized example using OP's needs:
Original XML
<?xml version="1.0" encoding="UTF-8"?>
<root>
<child>
<entry1>some text</entry1>
<entry2>other text</entry2>
</child>
<child>
<entry1>some text</entry1>
<entry2>other text</entry2>
</child>
</root>
XSLT Script
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="root">
<root>
<xsl:for-each select="//child">
<child>
<xsl:copy-of select="entry1"/>
<xsl:if test="entry1='some text'">
<entry2>some new text</entry2>
</xsl:if>
</child>
</xsl:for-each>
</root>
</xsl:template>
</xsl:transform>
Python Script
import os
import lxml.etree as ET
cd = os.path.dirname(os.path.abspath(__file__))
dom = ET.parse(os.path.join(cd, 'Original.xml'))
xslt = ET.parse(os.path.join(cd, 'XSLTScript.xsl'))
transform = ET.XSLT(xslt)
newdom = transform(dom)
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
xmlfile = open(os.path.join(cd, 'Final.xml'),'wb')
xmlfile.write(tree_out)
xmlfile.close()
Final XML
<?xml version='1.0' encoding='UTF-8'?>
<root>
<child>
<entry1>some text</entry1>
<entry2>some new text</entry2>
</child>
<child>
<entry1>some text</entry1>
<entry2>some new text</entry2>
</child>
</root>
While the above may seem too involved and not a Pythonic one-liner, please note there may be an occasion where you require a complex, intricate XML restructuring where you can leverage XSLT's recursive, template formatting language and not run complex iteration loops in object-oriented programming (Python, PHP, Java, C#, etc).

Related

How to remove all occurences of element in XML file?

I'd like to edit a KML file and remove all occurences of ExtendedData elements, wherever they are located in the file.
Here's the input XML file:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.2">
<Document>
<Style id="placemark-red">
<IconStyle>
<Icon>
<href>http://maps.me/placemarks/placemark-red.png</href>
</Icon>
</IconStyle>
</Style>
<name>My track</name>
<ExtendedData xmlns:mwm="https://maps.me">
<mwm:name>
<mwm:lang code="default">Blah</mwm:lang>
</mwm:name>
<mwm:lastModified>2020-04-05T14:17:18Z</mwm:lastModified>
</ExtendedData>
<Placemark>
<name></name>
…
<ExtendedData xmlns:mwm="https://maps.me">
<mwm:localId>0</mwm:localId>
<mwm:visibility>1</mwm:visibility>
</ExtendedData>
</Placemark>
</Document>
</kml>
And here's the code that 1) only removes the outermost occurence, and 2) requires adding the namespace to find it:
from lxml import etree
from pykml import parser
from pykml.factory import KML_ElementMaker as KML
with open("input.xml") as f:
doc = parser.parse(f)
root = doc.getroot()
ns = "{http://earth.google.com/kml/2.2}"
for pm in root.Document.getchildren():
#No way to get rid of namespace, for easier search?
if pm.tag==f"{ns}ExtendedData":
root.Document.remove(pm)
#How to remove innermost occurence of ExtendedData?
print(etree.tostring(doc, pretty_print=True))
Is there a way to remove all occurences in one go, or should I parse the whole tree?
Thank you.
Edit: The BeautifulSoup solution below requires adding an option "BeautifulSoup(my_xml,features="lxml")" to avoid the warning "No parser was explicitly specified".
Here's a solution using BeautifulSoup:
soup = BeautifulSoup(my_xml) # this is your xml
while True:
elem = soup.find("extendeddata")
if not elem:
break
elem.decompose()
Here's the output for your data:
<?xml version="1.0" encoding="UTF-8"?>
<html>
<body>
<kml xmlns="http://earth.google.com/kml/2.2">
<document>
<style id="placemark-red">
<IconStyle>
<Icon>
<href>http://maps.me/placemarks/placemark-red.png</href>
</Icon>
</IconStyle>
</style>
<name>
My track
</name>
<placemark>
<name>
</name>
</placemark>
</document>
</kml>
</body>
</html>
If you know the XML structure, try:
xml_root = ElementTree.parse(filename_path).getroot()
elem = xml_root.find('./ExtendedData')
xml_root.remove(elem)
or
xml_root = ElementTree.parse(filename_path).getroot()
p_elem = xml_root.find('/Placemark')
c_elem = xml_root.find('/Placemark/ExtendedData')
p_elem.remove(c_elem)
play with this ideas :)
if you don't know the xml structure, I think you need to parse the whole tree.
Simply run the empty template with Identity Transform using XSLT 1.0 which Python's lxml can run. No for/while loops or if logic needed. To handle the default namespace, define a prefix like doc:
XSLT (save a .xsl file, a special .xml file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:doc="http://earth.google.com/kml/2.2">
<xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<!-- IDENTITY TRANSFORM -->
<xsl:template match="node()|#*">
<xsl:copy>
<xsl:apply-templates select="node()|#*"/>
</xsl:copy>
</xsl:template>
<!-- REMOVE ALL OCCURRENCES OF NODE -->
<xsl:template match="doc:ExtendedData"/>
</xsl:stylesheet>
Python
import lxml.etree as et
# LOAD XML AND XSL SOURCES
xml = et.parse('Input.xml')
xsl = et.parse('XSLT_Script.xsl')
# TRANSFORM INPUT
transform = et.XSLT(xsl)
result = transform(xml)
# PRINT TO SCREEN
print(result)
# SAVE TO FILE
with open('Output.kml', 'wb') as f:
f.write(result)

Save (print) xml node with its parents but without children

From the XML document, I want to save one node to a file - with all parent nodes, but without any child nodes. For example, for the following XML:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.1">
<Document id="myid">
<name>ref.kml</name>
<Style id="normalState">
<IconStyle><scale>1.0</scale><Icon><href>yt.png</href></Icon></IconStyle>
</Style>
</Document>
</kml>
expected output for <Document> node will be like this:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.1">
<Document id="myid">
</Document>
</kml>
So far I only found a solution with iterated removal of all child elements before saving. But as I need to work with original XML after, I have to make a copy of the whole document:
#!/usr/bin/env python
import lxml.etree as ET # have to use [lxml] because [xml] doesn't support 'xml_declaration'
import copy
kml_file = ET.parse("myfile.kml")
kml_copied = copy.deepcopy(kml_file) # .copy() is not enough, need .deepcopy()
root = kml_copied.getroot()
my_node = root[0]
for child in my_node:
my_node.remove(child)
print ET.tostring(kml_copied, xml_declaration=True, encoding='utf-8')
Is there better way to do this? at least to avoid making a deepcopy of the whole document...
Consider XSLT, the special-purpose declarative language designed to transform XML documents. And Python's lxml module has a built-in XSLT 1.0 processor. Additionally XSLT (whose script is a well-formed xml document can also adequately handle the kml undeclared namespace):
XSLT Script (save as .xsl to be loaded in Python, also portable to other languages)
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
xmlns:doc="http://earth.google.com/kml/2.1">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>
<!-- Identity Transform to copy entire document -->
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
<!-- Empty Template to Remove Nodes -->
<xsl:template match="doc:Style|doc:name"/>
</xsl:transform>
Python Script
import lxml.etree as ET
# LOAD XML AND XSL
dom = ET.parse('Input.xml')
xslt = ET.parse('XSLTScript.xsl')
# TRANSFORM INPUT INTO DOM OBJECT
transform = ET.XSLT(xslt)
newdom = transform(dom)
# OUTPUT DOM TO STRING
tree_out = ET.tostring(newdom,
encoding='UTF-8',
pretty_print=True,
xml_declaration=True)
print(tree_out.decode("utf-8"))
# SAVE RESULTING XML
xmlfile = open('Output.xml','wb')
xmlfile.write(tree_out)
xmlfile.close()
Output
<?xml version='1.0' encoding='UTF-8'?>
<kml xmlns="http://earth.google.com/kml/2.1">
<Document id="myid"/>
</kml>

how to add xml child node with namespace in python?

i'm realy stuck in this, i got a file with an xml layout like this:
<rss xmlns:irc="SomeName" version="2.0">
<channel>
<item>
<irc:title>A title</irc:title>
<irc:poster>A poster</irc:poster>
<irc:url>An url</irc:url>
</item>
</channel>
</rss>
i need to add another 'item' in channel node, that's easy, but i can't find the way to add the item's child with the namespace.
i'm trying with lxml, but the documentation is not so clear for a newbie
please any help will be appreciated.
i find the way to doit with lxml
root = xml.getroot()
channel = root.find('channel')
item = et.Element('item')
title = et.SubElement(item,'{SomeName}title')
title.text = 'My new title'
poster = et.SubElement(item,'{SomeName}poster')
poster.text = 'My poster'
poster = et.SubElement(item,'{SomeName}url')
poster.text = 'http://My.url.com'
channel.append(item)
but still interested in a better solution
Alternatively, you can use XSLT, the declarative programming language, that transforms, styles, re-formats, and re-structures XML files in any way, shape, or form. Python's lxml module maintains an XSLT processor.
Simply, register the needed namespace in the XSLT's declaration line and use it in any new node. This might appear to be overkill for your current need but there could be a situation where a more complex transformation is needed with appropriate namespaces. Below adds a new title to the previous poster and URL.
XSLT (to be saved as .xsl)
<?xml version="1.0" ?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
xmlns:irc="SomeName">
<xsl:strip-space elements="*" />
<xsl:output method="xml" indent="yes"/>
<xsl:template match="rss">
<rss>
<channel>
<xsl:for-each select="//item">
<item>
<irc:title>My new title</irc:title>
<xsl:copy-of select="irc:poster"/>
<xsl:copy-of select="irc:url"/>
</item>
</xsl:for-each>
</channel>
</rss>
</xsl:template>
</xsl:transform>
Python
import os
import lxml.etree as ET
# SET CURRENT DIRECTORY
cd = os.path.dirname(os.path.abspath(__file__))
# LOAD IN XML AND XSL FILES
dom = ET.parse(os.path.join(cd, 'Original.xml'))
xslt = ET.parse(os.path.join(cd, 'XSLT_Script.xsl'))
# TRANSFORM
transform = ET.XSLT(xslt)
newdom = transform(dom)
# OUTPUT FINAL XML
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
xmlfile = open(os.path.join(cd, 'output.xml'),'wb')
xmlfile.write(tree_out)
xmlfile.close()
Output
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:irc="SomeName">
<channel>
<item>
<irc:title>My new title</irc:title>
<irc:poster>A poster</irc:poster>
<irc:url>An url</irc:url>
</item>
</channel>
</rss>

Efficient removal of XML elements using Python

I am trying to efficiently edit reasonably large XML files (usually 100-500MB but up to 1GB) in size to remove all occurrences of an element which do not contain an attribute with a given value. I am looking for the most efficient way of performing this in terms of speed, whilst also not loading large amounts of data into memory since this is an issue for larger files.
Using example XML, the structure is along the lines of the following, where the parent element may be nested within each other any number of times.
<root>
<parent>
<child id="c1">
<content />
</child>
<child id="c2">
<content />
</child>
</parent>
<parent>
<parent>
<child id="c3">
<content />
</child>
</parent>
</parent>
</root>
Using the above example XML, I am trying to remove all child elements where the ID doesn't equal "c1" to give an outcome of:
<root>
<parent>
<child id="c1">
<content />
</child>
</parent>
<parent>
<parent />
</parent>
</root>
The most efficient method I have come up with so far is using cElementTree iterparse:
import xml.etree.cElementTree as ET
xml_source = 'xml file location'
xml_output = 'xml output file location'
context = ET.iterparse(xml_source, events=("start", "end"))
context = iter(context)
event, root = context.next()
for event, elem in context:
if event == 'end' and elem.tag == 'child' and elem.attrib['id'] != 'c1':
elem.clear()
ET.ElementTree(root).write(xml_output)
The above will handle a test file 100MB in size in around 10 seconds, is there a more efficient way of achieving this?
Sorry, I have no huge equivalent xml file at hand, so you'll have to benchmark those suggestions yourself… :-/
the context has a root property, so you can iterparse on the (default) 'end' events only:
context = ET.iterparse(xml_source)
for event, elem in context:
if elem.tag == 'child' and elem.attrib['id'] != 'c1':
elem.clear()
ET.ElementTree(context.root).write(xml_output)
use lxml.etree instead of xml.etree:
import lxml.etree as ET
lxml.etree.iterparse has a tag argument to iterate only on specific elements:
context = ET.iterparse(xml_source, tag='child')
for event, elem in context:
if elem.attrib['id'] != 'c1':
elem.clear()
one last suggestion, but not about speed. elem.clear() does not remove the element itself but only clear its children, text and tail. So you end up with empty <child/> elements:
<root>
<parent>
<child id="c1">
<content />
</child>
<child />
</parent>
<parent>
<parent>
<child />
</parent>
</parent>
</root>
With lxml you can use this instead of elem.clear():
for event, elem in context:
if elem.attrib['id'] != 'c1':
elem.getparent().remove(elem)

how to parse the second xml tree in a file

Suppose I have a XML file like
<?xml version="1.0" encoding="utf-8"?>
<items>
<?xml version="1.0" encoding="utf-8"?>
<items>
<item>
<price>1500</price>
<info> asfgfdff</info>
</item>
</items>
How do I parse so that the parser selects the recently updated xml tree?
with open('file','r') as f:
newestXml = []
for line in f.readlines():
if re.search('^<\?xml',line):
newestXml = [line]
else:
newestXml.append(line)
At the end of the loop, newestXml will contain all the lines from the last occurrence of <?xml to the end of the file.
Now you can combine the lines and use the xml parser to parse the xml.
Note - I can't check this code now, so it may contain small mistakes, but I hope the idea will help you.

Categories