Efficient removal of XML elements using Python


I am trying to efficiently edit reasonably large XML files (usually 100-500 MB, but up to 1 GB) to remove all occurrences of an element that do not have an attribute with a given value. I am looking for the fastest way to do this without loading large amounts of data into memory, since memory becomes an issue for the larger files.
The example XML below shows the structure. The parent elements may be nested within each other any number of times.
<root>
  <parent>
    <child id="c1">
      <content />
    </child>
    <child id="c2">
      <content />
    </child>
  </parent>
  <parent>
    <parent>
      <child id="c3">
        <content />
      </child>
    </parent>
  </parent>
</root>
Using the above example XML, I am trying to remove all child elements whose id does not equal "c1", giving this outcome:
<root>
  <parent>
    <child id="c1">
      <content />
    </child>
  </parent>
  <parent>
    <parent />
  </parent>
</root>
The most efficient method I have come up with so far uses iterparse from ElementTree (cElementTree on older Pythons):
import xml.etree.ElementTree as ET  # xml.etree.cElementTree was removed in Python 3.9

xml_source = 'xml file location'
xml_output = 'xml output file location'

context = ET.iterparse(xml_source, events=("start", "end"))
# The first event is the "start" of the root element; keep a reference to it.
event, root = next(context)
for event, elem in context:
    if event == 'end' and elem.tag == 'child' and elem.attrib['id'] != 'c1':
        elem.clear()
ET.ElementTree(root).write(xml_output)
This handles a 100 MB test file in around 10 seconds. Is there a more efficient way of achieving this?

Sorry, I have no equivalently huge XML file at hand, so you'll have to benchmark these suggestions yourself.
The context object has a root attribute (available once the source has been fully read), so you can iterate over the (default) 'end' events only:
context = ET.iterparse(xml_source)
for event, elem in context:
    if elem.tag == 'child' and elem.attrib['id'] != 'c1':
        elem.clear()
ET.ElementTree(context.root).write(xml_output)
Use lxml.etree instead of xml.etree:
import lxml.etree as ET
lxml.etree.iterparse has a tag argument to iterate only over specific elements:
context = ET.iterparse(xml_source, tag='child')
for event, elem in context:
    if elem.attrib['id'] != 'c1':
        elem.clear()
One last suggestion, not about speed: elem.clear() does not remove the element itself. It only clears its children, attributes, text and tail, so you end up with empty <child/> elements:
<root>
  <parent>
    <child id="c1">
      <content />
    </child>
    <child />
  </parent>
  <parent>
    <parent>
      <child />
    </parent>
  </parent>
</root>
With lxml you can use this instead of elem.clear():
for event, elem in context:
    if elem.attrib['id'] != 'c1':
        elem.getparent().remove(elem)
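The standard library's ElementTree has no getparent(), so the removal above needs lxml. As a rough sketch of one way to get the same effect with xml.etree only, the code below tracks the ancestors on a stack while parsing. The function name remove_unwanted and its parameters are made up for this illustration, and it has not been benchmarked against the versions above.

```python
import io
import xml.etree.ElementTree as ET


def remove_unwanted(source, tag="child", attr="id", keep="c1"):
    """Parse source and drop every <tag> element whose attr differs from keep.

    Hypothetical helper for illustration; returns the root element.
    """
    stack = []   # ancestors of the element currently being parsed
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem
            stack.append(elem)
        else:
            stack.pop()  # pops elem itself; its parent (if any) is now on top
            if elem.tag == tag and elem.get(attr) != keep and stack:
                # Detach the whole element (and its tail) from its parent.
                stack[-1].remove(elem)
    return root


sample = (
    b'<root><parent><child id="c1"><content/></child>'
    b'<child id="c2"><content/></child></parent>'
    b'<parent><parent><child id="c3"><content/></child></parent></parent></root>'
)
root = remove_unwanted(io.BytesIO(sample))
print(ET.tostring(root).decode())
```

Detached elements can be garbage collected, but everything that is kept still accumulates under the root until it is written out, so memory use depends on how much of the file survives the filter.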

Related

Python : Remove an element but not its children from xml?

I want to remove an element but not its children. I tried the code below, but it removes the children as well.
code
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
for item in root.findall('item'):
    root.remove(item)
print(ET.tostring(root))
>>> <root>
</root>
test.xml
<?xml version="1.0" ?>
<root>
<item>
<data>
<number>01</number>
<step>one</step>
</data>
</item>
</root>
expected outcome
<?xml version="1.0" ?>
<root>
<data>
<number>01</number>
<step>one</step>
</data>
</root>

You should move all children of item to root before removing it:
for item in root.findall('item'):
    for child in item:
        root.append(child)
    root.remove(item)
print(ET.tostring(root))
the code results in
<root>
<data>
<number>01</number>
<step>one</step>
</data>
</root>
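Note that append() puts the moved children at the end of root, which changes document order when item has siblings. A minimal sketch of a variant that keeps the children where item used to be, using only xml.etree (the helper name unwrap is an assumption for this example, and it drops any text or tail that belonged to the removed wrapper):

```python
import xml.etree.ElementTree as ET


def unwrap(parent, tag):
    """Replace each direct child <tag> of parent with that child's own children."""
    # Walk backwards so earlier indices stay valid while we edit the list.
    for index, node in reversed(list(enumerate(parent))):
        if node.tag == tag:
            parent.remove(node)
            for offset, grandchild in enumerate(node):
                parent.insert(index + offset, grandchild)


root = ET.fromstring(
    "<root><a/><item><data><number>01</number><step>one</step></data></item><b/></root>"
)
unwrap(root, "item")
print([child.tag for child in root])
```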

Find the element with the data tag, remove it, and extend its parent with its children.
import xml.etree.ElementTree as etree
data = """
<root>
<item>
<data>
<number>01</number>
<step>one</step>
</data>
</item>
</root>
"""
def iterparent(tree):
    # Yield (parent, child) pairs for every element in the tree.
    for parent in tree.iter():
        for child in parent:
            yield parent, child
tree = etree.fromstring(data)
# Materialize the pairs first so the tree is not modified while it is being iterated.
for parent, child in list(iterparent(tree)):
    if child.tag == "data":
        parent.remove(child)
        parent.extend(child)
print(etree.tostring(tree).decode())
will output
<root>
<item>
<number>01</number>
<step>one</step>
</item>
</root>
Adapted from a similar answer for your particular use case.
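Because xml.etree elements do not know their parent, another common pattern is to build a child-to-parent map once and then edit the tree. The sketch below is only an illustration of that idea (the name parent_of is made up here); it also reinserts the children at the removed element's position rather than at the end.

```python
import xml.etree.ElementTree as ET

tree = ET.fromstring(
    "<root><item><data><number>01</number><step>one</step></data></item></root>"
)

# Map every element to its parent; built once, before any modification.
parent_of = {child: parent for parent in tree.iter() for child in parent}

for node in list(tree.iter("data")):
    parent = parent_of[node]
    position = list(parent).index(node)
    parent.remove(node)
    # Put the children back where the removed wrapper used to be.
    for offset, grandchild in enumerate(node):
        parent.insert(position + offset, grandchild)

print(ET.tostring(tree).decode())
```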

How to mapping XML tag by matching the tag with a string using Python?

I have an XML file.
Here is the content:
<Country>
<number no="2008" info="update">
<detail name="man1" class="A1">
<string name="ruth" />
<string name="amy" />
</detail>
<detail name="man2" class="A2">
<string name="lisa" />
<string name="graham" />
</detail>
</number>
<number no="2006" info="update">
<detail name="woman1" class="B1">
<string name="grace" />
<string name="chil" />
</detail>
<detail name="woman2" class="B2">
<string name="emy" />
<string name="toms" />
</detail>
</number>
</Country>
I need to get the value of the no attribute of number (here, <number no="2008" ...>) by matching on class="A1" in its detail child.
I tried it this way, but it prints None.
Here is the code:
import xml.etree.ElementTree as ET
ReadXML = ET.parse('data.xml')
stringno = 'A1'
for family in ReadXML.findall('./number/detail[@class="{}"]'.format(stringno)):
    name = family.get('no')
    print(name)
Can anyone help me, please? Thanks a lot.

You can use an XPath expression to select the number element by the class attribute of its detail child, and then read the no attribute of the selected number from Python: number[detail/@class="A1"]
However, findall() supports only a limited subset of XPath, which does not include the expression above. We need to resort to a simpler expression, for example your attempted XPath followed by .. to select the parent of each matched detail element:
stringno = 'A1'
for family in ReadXML.findall('number/detail[@class="{}"]/..'.format(stringno)):
    name = family.get('no')
    print(name)
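If the parent step feels obscure, the same lookup can be written as a filter over the number elements. This is a small sketch on a shortened copy of the XML from the question. The function name numbers_for_class is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

XML = """<Country>
  <number no="2008" info="update">
    <detail name="man1" class="A1"/>
    <detail name="man2" class="A2"/>
  </number>
  <number no="2006" info="update">
    <detail name="woman1" class="B1"/>
  </number>
</Country>"""


def numbers_for_class(root, wanted):
    """Return the 'no' attribute of each <number> that has a <detail class=wanted>."""
    return [
        number.get("no")
        for number in root.findall("number")
        # Compare against None: an element without children is falsy.
        if number.find('detail[@class="{}"]'.format(wanted)) is not None
    ]


root = ET.fromstring(XML)
print(numbers_for_class(root, "A1"))
```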

iterate through XML?

What is the easiest way to navigate through XML with python?
<html>
<body>
<soapenv:envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<soapenv:body>
<getservicebyidresponse xmlns="http://www.something.com/soa/2.0/SRIManagement">
<code xmlns="">
0
</code>
<client xmlns="">
<action xsi:nil="true">
</action>
<actionmode xsi:nil="true">
</actionmode>
<clientid>
405965216
</clientid>
<firstname xsi:nil="true">
</firstname>
<id xsi:nil="true">
</id>
<lastname>
Last Name
</lastname>
<role xsi:nil="true">
</role>
<state xsi:nil="true">
</state>
</client>
</getservicebyidresponse>
</soapenv:body>
</soapenv:envelope>
</body>
</html>
I could use regex to extract the values from the lines I need, but is there a Pythonic way, something like xml[0][1]?

As @deceze already pointed out, you can use xml.etree.ElementTree here.
import xml.etree.ElementTree as ET
tree = ET.parse("path_to_xml_file")
root = tree.getroot()
You can iterate over all descendant nodes of root:
for child in root.iter():
    if child.tag == 'clientid':
        print(child.tag, child.text.strip())
Children are nested, and specific child nodes can be accessed by index, so root[0][1] should work (as long as the indices are correct).
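Index access breaks as soon as the layout shifts, so a path-based lookup is often easier to maintain. The sketch below runs on a trimmed copy of the SOAP payload from the question (without the html/body wrapper). The prefix sri in the NS mapping is an arbitrary name chosen for this example and is not part of the document.

```python
import xml.etree.ElementTree as ET

XML = """<soapenv:envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:body>
    <getservicebyidresponse xmlns="http://www.something.com/soa/2.0/SRIManagement">
      <client xmlns="">
        <clientid> 405965216 </clientid>
        <lastname> Last Name </lastname>
      </client>
    </getservicebyidresponse>
  </soapenv:body>
</soapenv:envelope>"""

# Prefix-to-URI mapping used only inside the path expression below.
NS = {
    "soapenv": "http://schemas.xmlsoap.org/soap/envelope/",
    "sri": "http://www.something.com/soa/2.0/SRIManagement",
}

root = ET.fromstring(XML)
# <client xmlns=""> resets the default namespace, so it is matched without a prefix.
client = root.find("soapenv:body/sri:getservicebyidresponse/client", NS)
fields = {child.tag: (child.text or "").strip() for child in client}
print(fields)
```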

how to add xml child node with namespace in python?

I'm really stuck on this. I have a file with an XML layout like this:
<rss xmlns:irc="SomeName" version="2.0">
<channel>
<item>
<irc:title>A title</irc:title>
<irc:poster>A poster</irc:poster>
<irc:url>An url</irc:url>
</item>
</channel>
</rss>
I need to add another 'item' to the channel node. That part is easy, but I can't find a way to add the item's children with the namespace.
I'm trying with lxml, but the documentation is not very clear for a newbie.
Any help will be appreciated.
I found a way to do it with lxml:
root = xml.getroot()
channel = root.find('channel')
item = et.Element('item')
title = et.SubElement(item, '{SomeName}title')
title.text = 'My new title'
poster = et.SubElement(item, '{SomeName}poster')
poster.text = 'My poster'
url = et.SubElement(item, '{SomeName}url')
url.text = 'http://My.url.com'
channel.append(item)
But I am still interested in a better solution.
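For comparison, roughly the same thing can be done with the standard library alone. This is a sketch under a few assumptions: the helper add_item is invented for this example, the input is a cut-down version of the file above, and ET.register_namespace is a process-wide setting that only affects how the prefix is written on serialization.

```python
import xml.etree.ElementTree as ET

IRC = "SomeName"  # namespace URI taken from the xmlns:irc declaration
ET.register_namespace("irc", IRC)  # global: serialize this URI with the irc prefix

root = ET.fromstring('<rss xmlns:irc="SomeName" version="2.0"><channel/></rss>')
channel = root.find("channel")


def add_item(channel, **fields):
    """Append an <item> whose children live in the irc namespace."""
    item = ET.SubElement(channel, "item")
    for name, text in fields.items():
        # '{uri}local' is ElementTree's notation for a namespaced tag.
        ET.SubElement(item, "{%s}%s" % (IRC, name)).text = text
    return item


add_item(channel, title="My new title", poster="My poster", url="http://My.url.com")
print(ET.tostring(root).decode())
```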

Alternatively, you can use XSLT, a declarative language for transforming and restructuring XML documents. Python's lxml module provides an XSLT processor.
Simply declare the needed namespace in the XSLT's root element and use it in any new node. This might appear to be overkill for your current need, but a more complex transformation involving namespaces may call for it. The stylesheet below adds a new title alongside the existing poster and URL.
XSLT (to be saved as .xsl)
<?xml version="1.0" ?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
               xmlns:irc="SomeName">
  <xsl:strip-space elements="*" />
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="rss">
    <rss>
      <channel>
        <xsl:for-each select="//item">
          <item>
            <irc:title>My new title</irc:title>
            <xsl:copy-of select="irc:poster"/>
            <xsl:copy-of select="irc:url"/>
          </item>
        </xsl:for-each>
      </channel>
    </rss>
  </xsl:template>
</xsl:transform>
Python
import os
import lxml.etree as ET
# SET CURRENT DIRECTORY
cd = os.path.dirname(os.path.abspath(__file__))
# LOAD IN XML AND XSL FILES
dom = ET.parse(os.path.join(cd, 'Original.xml'))
xslt = ET.parse(os.path.join(cd, 'XSLT_Script.xsl'))
# TRANSFORM
transform = ET.XSLT(xslt)
newdom = transform(dom)
# OUTPUT FINAL XML
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
with open(os.path.join(cd, 'output.xml'), 'wb') as xmlfile:
    xmlfile.write(tree_out)
Output
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:irc="SomeName">
<channel>
<item>
<irc:title>My new title</irc:title>
<irc:poster>A poster</irc:poster>
<irc:url>An url</irc:url>
</item>
</channel>
</rss>

Override text in XML with lxml

Say I have an XML file and I want to edit parts of it. The following does not work, probably because I am editing a copy of a child.
from lxml import etree as et
tree = et.parse(p_my_xml)
root = tree.getroot()
for child in root:
    for entry in child.getchildren():
        first_part = entry.getchildren()[1].text
        second_part = entry.getchildren()[2].text
        if first_part == 'some_condition':
            second_part = 'something_else'
tree.write(p_my_xml, pretty_print=True)
How can I correctly modify parts of the XML so that the changes are done in the tree?

Keep a reference to the element and set its text attribute:
second_elm = entry.getchildren()[2]
if first_part == 'some_condition':
    second_elm.text = 'something_else'
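The reason the original attempt has no effect is that second_part is just a Python string: rebinding that name does not touch the element it came from. A small self-contained sketch of the difference, using xml.etree and a made-up three-field entry so it runs without lxml:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<root><child><entry><a>x</a><b>some_condition</b><c>old</c></entry></child></root>"
)

for child in root:
    for entry in child:
        first, second = entry[1], entry[2]   # references to the elements themselves
        text_copy = second.text
        text_copy = "ignored"                # rebinding a local name: tree unchanged
        if first.text == "some_condition":
            second.text = "something_else"   # assigning to .text: tree updated

print(ET.tostring(root).decode())
```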

For future readers: XML transformation, re-formatting, and re-structuring can be handled adequately, and often efficiently, with XSLT, the declarative language for XML manipulation. Python's lxml module provides an XSLT processor.
See the generalized example below, based on the OP's needs:
Original XML
<?xml version="1.0" encoding="UTF-8"?>
<root>
<child>
<entry1>some text</entry1>
<entry2>other text</entry2>
</child>
<child>
<entry1>some text</entry1>
<entry2>other text</entry2>
</child>
</root>
XSLT Script
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="root">
    <root>
      <xsl:for-each select="//child">
        <child>
          <xsl:copy-of select="entry1"/>
          <xsl:if test="entry1='some text'">
            <entry2>some new text</entry2>
          </xsl:if>
        </child>
      </xsl:for-each>
    </root>
  </xsl:template>
</xsl:transform>
Python Script
import os
import lxml.etree as ET
cd = os.path.dirname(os.path.abspath(__file__))
dom = ET.parse(os.path.join(cd, 'Original.xml'))
xslt = ET.parse(os.path.join(cd, 'XSLTScript.xsl'))
transform = ET.XSLT(xslt)
newdom = transform(dom)
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
with open(os.path.join(cd, 'Final.xml'), 'wb') as xmlfile:
    xmlfile.write(tree_out)
Final XML
<?xml version='1.0' encoding='UTF-8'?>
<root>
<child>
<entry1>some text</entry1>
<entry2>some new text</entry2>
</child>
<child>
<entry1>some text</entry1>
<entry2>some new text</entry2>
</child>
</root>
While the above may seem too involved compared with a Pythonic one-liner, there may be occasions where you need a complex, intricate XML restructuring. There you can leverage XSLT's recursive, template-based language instead of writing complex iteration loops in a general-purpose language (Python, PHP, Java, C#, etc.).
