how to add xml child node with namespace in python? - python

i'm realy stuck in this, i got a file with an xml layout like this:
<rss xmlns:irc="SomeName" version="2.0">
<channel>
<item>
<irc:title>A title</irc:title>
<irc:poster>A poster</irc:poster>
<irc:url>An url</irc:url>
</item>
</channel>
</rss>
i need to add another 'item' in channel node, that's easy, but i can't find the way to add the item's child with the namespace.
i'm trying with lxml, but the documentation is not so clear for a newbie
please any help will be appreciated.
i find the way to doit with lxml
root = xml.getroot()
channel = root.find('channel')
item = et.Element('item')
title = et.SubElement(item,'{SomeName}title')
title.text = 'My new title'
poster = et.SubElement(item,'{SomeName}poster')
poster.text = 'My poster'
poster = et.SubElement(item,'{SomeName}url')
poster.text = 'http://My.url.com'
channel.append(item)
but still interested in a better solution

Alternatively, you can use XSLT, the declarative programming language, that transforms, styles, re-formats, and re-structures XML files in any way, shape, or form. Python's lxml module maintains an XSLT processor.
Simply, register the needed namespace in the XSLT's declaration line and use it in any new node. This might appear to be overkill for your current need but there could be a situation where a more complex transformation is needed with appropriate namespaces. Below adds a new title to the previous poster and URL.
XSLT (to be saved as .xsl)
<?xml version="1.0" ?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
xmlns:irc="SomeName">
<xsl:strip-space elements="*" />
<xsl:output method="xml" indent="yes"/>
<xsl:template match="rss">
<rss>
<channel>
<xsl:for-each select="//item">
<item>
<irc:title>My new title</irc:title>
<xsl:copy-of select="irc:poster"/>
<xsl:copy-of select="irc:url"/>
</item>
</xsl:for-each>
</channel>
</rss>
</xsl:template>
</xsl:transform>
Python
import os
import lxml.etree as ET
# SET CURRENT DIRECTORY
cd = os.path.dirname(os.path.abspath(__file__))
# LOAD IN XML AND XSL FILES
dom = ET.parse(os.path.join(cd, 'Original.xml'))
xslt = ET.parse(os.path.join(cd, 'XSLT_Script.xsl'))
# TRANSFORM
transform = ET.XSLT(xslt)
newdom = transform(dom)
# OUTPUT FINAL XML
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
xmlfile = open(os.path.join(cd, 'output.xml'),'wb')
xmlfile.write(tree_out)
xmlfile.close()
Output
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:irc="SomeName">
<channel>
<item>
<irc:title>My new title</irc:title>
<irc:poster>A poster</irc:poster>
<irc:url>An url</irc:url>
</item>
</channel>
</rss>

Related

Python XML/Pandas: How to merge nested XML?

How can I join two different pieces of information together from this XML file?
# data
xml1 = ('''<?xml version="1.0" encoding="utf-8"?>
<TopologyDefinition xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RSkus>
<RSku ID="V1" Deprecated="true" Owner="Unknown" Generation="1">
<Devices>
<Device ID="1" SkuID="Switch" Role="xD" />
</Devices>
<Blades>
<Blade ID="{1-20}" SkuID="SBlade" />
</Blades>
<Interfaces>
<Interface ID="COM" HardwareID="NS1" SlotID="COM1" Type="serial" />
<Interface ID="LINK" HardwareID="TS1" SlotID="UPLINK_1" Type="serial" />
</Interfaces>
<Wires>
<WireGroup Type="network">
<Wire LocationA="NS1" SlotA="{1-20}" LocationB="{1-20}" SlotB="NIC1" />
</WireGroup>
<WireGroup Type="serial">
<Wire LocationA="TS1" SlotA="{7001-7020}" LocationB="{1-20}" SlotB="COM1" />
</WireGroup>
</Wires>
</RSku>
</RSkus>
</TopologyDefinition>
''')
While this is a single case and trivial in the instance below; if I run the below commands on the full file, I get shapes that do not match and therefore cannot be joined so easily.
How can I extract the XML information such that for every row, I get all the RSku information PLUS its Blade information. Each xpath contains no information that would let me join it to another xpath so that I may combine the information.
# how to have them joined?
pd.read_xml(xml1, xpath = ".//RSku")
pd.read_xml(xml1, xpath = ".//Blade")
# expected
pd.concat([pd.read_xml(xml1, xpath = ".//RSku"), pd.read_xml(xml1, xpath = ".//Blade")], axis=1)
Consider transforming the XML with XSLT by flattening the document with information you need. Specifically, retrieve only Blade attributes using descendant::* axis and corresponding RSku attributes using the ancestor::* axis. Python' lxml (default parser of pandas.read_xml) can run XSLT 1.0 scripts.
Below XSLT's <xsl:for-each> is used to prefix RSku_ and Blade_ to attribute names since they share same attribute such as ID. Otherwise template would be much less wordy.
import pandas as pd
xml1 = ...
xsl = ('''<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/TopologyDefinition">
<root>
<xsl:apply-templates select="descendant::Blade"/>
</root>
</xsl:template>
<xsl:template match="Blade">
<data>
<xsl:for-each select="ancestor::RSku/#*">
<xsl:attribute name="{concat('RSku_', name())}">
<xsl:value-of select="."/>
</xsl:attribute>
</xsl:for-each>
<xsl:for-each select="#*">
<xsl:attribute name="{concat('Blade_', name())}">
<xsl:value-of select="."/>
</xsl:attribute>
</xsl:for-each>
</data>
</xsl:template>
</xsl:stylesheet>''')
blades_df = pd.read xml(xml1, stylesheet=xsl)
Online XSLT Demo

How to remove all occurences of element in XML file?

I'd like to edit a KML file and remove all occurences of ExtendedData elements, wherever they are located in the file.
Here's the input XML file:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.2">
<Document>
<Style id="placemark-red">
<IconStyle>
<Icon>
<href>http://maps.me/placemarks/placemark-red.png</href>
</Icon>
</IconStyle>
</Style>
<name>My track</name>
<ExtendedData xmlns:mwm="https://maps.me">
<mwm:name>
<mwm:lang code="default">Blah</mwm:lang>
</mwm:name>
<mwm:lastModified>2020-04-05T14:17:18Z</mwm:lastModified>
</ExtendedData>
<Placemark>
<name></name>
…
<ExtendedData xmlns:mwm="https://maps.me">
<mwm:localId>0</mwm:localId>
<mwm:visibility>1</mwm:visibility>
</ExtendedData>
</Placemark>
</Document>
</kml>
And here's the code that 1) only removes the outermost occurence, and 2) requires adding the namespace to find it:
from lxml import etree
from pykml import parser
from pykml.factory import KML_ElementMaker as KML
with open("input.xml") as f:
doc = parser.parse(f)
root = doc.getroot()
ns = "{http://earth.google.com/kml/2.2}"
for pm in root.Document.getchildren():
#No way to get rid of namespace, for easier search?
if pm.tag==f"{ns}ExtendedData":
root.Document.remove(pm)
#How to remove innermost occurence of ExtendedData?
print(etree.tostring(doc, pretty_print=True))
Is there a way to remove all occurences in one go, or should I parse the whole tree?
Thank you.
Edit: The BeautifulSoup solution below requires adding an option "BeautifulSoup(my_xml,features="lxml")" to avoid the warning "No parser was explicitly specified".
Here's a solution using BeautifulSoup:
soup = BeautifulSoup(my_xml) # this is your xml
while True:
elem = soup.find("extendeddata")
if not elem:
break
elem.decompose()
Here's the output for your data:
<?xml version="1.0" encoding="UTF-8"?>
<html>
<body>
<kml xmlns="http://earth.google.com/kml/2.2">
<document>
<style id="placemark-red">
<IconStyle>
<Icon>
<href>http://maps.me/placemarks/placemark-red.png</href>
</Icon>
</IconStyle>
</style>
<name>
My track
</name>
<placemark>
<name>
</name>
</placemark>
</document>
</kml>
</body>
</html>
If you know the XML structure, try:
xml_root = ElementTree.parse(filename_path).getroot()
elem = xml_root.find('./ExtendedData')
xml_root.remove(elem)
or
xml_root = ElementTree.parse(filename_path).getroot()
p_elem = xml_root.find('/Placemark')
c_elem = xml_root.find('/Placemark/ExtendedData')
p_elem.remove(c_elem)
play with this ideas :)
if you don't know the xml structure, I think you need to parse the whole tree.
Simply run the empty template with Identity Transform using XSLT 1.0 which Python's lxml can run. No for/while loops or if logic needed. To handle the default namespace, define a prefix like doc:
XSLT (save a .xsl file, a special .xml file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:doc="http://earth.google.com/kml/2.2">
<xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<!-- IDENTITY TRANSFORM -->
<xsl:template match="node()|#*">
<xsl:copy>
<xsl:apply-templates select="node()|#*"/>
</xsl:copy>
</xsl:template>
<!-- REMOVE ALL OCCURRENCES OF NODE -->
<xsl:template match="doc:ExtendedData"/>
</xsl:stylesheet>
Python
import lxml.etree as et
# LOAD XML AND XSL SOURCES
xml = et.parse('Input.xml')
xsl = et.parse('XSLT_Script.xsl')
# TRANSFORM INPUT
transform = et.XSLT(xsl)
result = transform(xml)
# PRINT TO SCREEN
print(result)
# SAVE TO FILE
with open('Output.kml', 'wb') as f:
f.write(result)

How can I reference a parent and remove the parent element in an RSS XML through LXML in Python?

I've been having trouble cracking this one. I have an RSS feed in the form of an XML file. Simplified, it looks like this:
<rss version="2.0">
<channel>
<title>My RSS Feed</title>
<link href="https://www.examplefeedurl.com">Feed</link>
<description></description>
<item>...</item>
<item>...</item>
<item>...</item>
<item>
<guid></guid>
<pubDate></pubDate>
<author/>
<title>Title of the item</title>
<link href="https://example.com" rel="alternate" type="text/html"/>
<description>
<![CDATA[View Example]]>
</description>
<description>
<![CDATA[<p>This actually contains a bunch of text I want to work with. If this text contains certain strings, I want to get rid of the whole item.</p>]]>
</description>
</item>
<item>...</item>
</channel>
</rss>
My objective is to check if the second description tag contains certain strings. If it does contain that string, I'd like to completely remove it. Currently in my code I have this:
doc = lxml.etree.fromstring(testString)
found = doc.findall('channel/item/description')
for desc in found:
if "FORBIDDENSTRING" in desc.text:
desc.getparent().remove(desc)
And it removes just the second description tag which makes sense but I want the whole item gone.
I don't know how I can get a hold on the 'item' element if I only have the 'desc' reference.
I've tried googling aswell as searching on here but the situations I see just want to remove the tag like I'm doing now, weirdly I haven't stumbled upon sample code that wants to get rid of the entire parent object.
Any pointers towards documentation/tutorials or help is very welcome.
I'm a big fan of XSLT, but another option is to just select the item instead of the description (select the element you want to delete; not its child).
Also, if you use xpath(), you can put the check for the forbidden string directly in the xpath predicate.
Example...
from lxml import etree
testString = """
<rss version="2.0">
<channel>
<title>My RSS Feed</title>
<link href="https://www.examplefeedurl.com">Feed</link>
<description></description>
<item>...</item>
<item>...</item>
<item>...</item>
<item>
<guid></guid>
<pubDate></pubDate>
<author/>
<title>Title of the item</title>
<link href="https://example.com" rel="alternate" type="text/html"/>
<description>
<![CDATA[View Example]]>
</description>
<description>
<![CDATA[<p>This actually contains a bunch of text I want to work with. If this text contains certain strings, I want to get rid of the whole item.</p>]]>
</description>
</item>
<item>...</item>
</channel>
</rss>
"""
forbidden_string = "I want to get rid of the whole item"
parser = etree.XMLParser(strip_cdata=False)
doc = etree.fromstring(testString, parser=parser)
found = doc.xpath('.//channel/item[description[contains(.,"{}")]]'.format(forbidden_string))
for item in found:
item.getparent().remove(item)
print(etree.tostring(doc, encoding="unicode", pretty_print=True))
this prints...
<rss version="2.0">
<channel>
<title>My RSS Feed</title>
<link href="https://www.examplefeedurl.com">Feed</link>
<description/>
<item>...</item>
<item>...</item>
<item>...</item>
<item>...</item>
</channel>
</rss>
Consider XSLT, the special-purpose language designed to transform XML files such as removing nodes conditionally by value. Python's lxml can run XSLT 1.0 scripts and even pass a parameter from Python script to XSLT (not unlike passing parameters in SQL!). In this way, you avoid any for loops or if logic or rebuilding tree at application layer.
XSLT (save as .xsl file, a special .xml file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes" cdata-section-elements="description"/>
<xsl:strip-space elements="*"/>
<!-- VALUE TO BE PASSED INTO FROM PYTHON -->
<xsl:param name="search_string" />
<!-- IDENTITY TRANSFORM -->
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
<!-- KEEP ONLY item NODES THAT DO NOT CONTAIN $search_string -->
<xsl:template match="channel">
<xsl:copy>
<xsl:apply-templates select="item[not(contains(description[2], $search_string))]"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Python (for demo, below runs two searches using posted sample)
import lxml.etree as et
# LOAD XML AND XSL
doc = et.parse('Input.xml')
xsl = et.parse('XSLT_String.xsl')
# CONFIGURE TRANSFORMER
transform = et.XSLT(xsl)
# RUN TRANSFORMATION WITH PARAM
n = et.XSLT.strparam('FORBIDDENSTRING')
result = transform(doc, search_string=n)
print(result)
# <?xml version="1.0"?>
# <rss version="2.0">
# <channel>
# <item>...</item>
# <item>...</item>
# <item>...</item>
# <item>
# <guid/>
# <pubDate/>
# <author/>
# <title>Title of the item</title>
# <link href="https://example.com" rel="alternate" type="text/html"/>
# <description><![CDATA[View Example]]></description>
# <description><![CDATA[<p>This actually contains a bunch of text I want to work with. If this text contains certain strings, I want to get rid of the whole item.</p>]]></description>
# </item>
# <item>...</item>
# </channel>
# </rss>
# RUN TRANSFORMATION WITH PARAM
n = et.XSLT.strparam('bunch of text')
result = transform(doc, search_string=n)
print(result)
# <?xml version="1.0"?>
# <rss version="2.0">
# <channel>
# <item>...</item>
# <item>...</item>
# <item>...</item>
# <item>...</item>
# </channel>
# </rss>
# SAVE TO FILE
with open('Output.xml', 'wb') as f:
f.write(result)

Save (print) xml node with its parents but without children

From the XML document, I want to save one node to a file - with all parent nodes, but without any child nodes. For example, for the following XML:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.1">
<Document id="myid">
<name>ref.kml</name>
<Style id="normalState">
<IconStyle><scale>1.0</scale><Icon><href>yt.png</href></Icon></IconStyle>
</Style>
</Document>
</kml>
expected output for <Document> node will be like this:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.1">
<Document id="myid">
</Document>
</kml>
So far I only found a solution with iterated removal of all child elements before saving. But as I need to work with original XML after, I have to make a copy of the whole document:
#!/usr/bin/env python
import lxml.etree as ET # have to use [lxml] because [xml] doesn't support 'xml_declaration'
import copy
kml_file = ET.parse("myfile.kml")
kml_copied = copy.deepcopy(kml_file) # .copy() is not enough, need .deepcopy()
root = kml_copied.getroot()
my_node = root[0]
for child in my_node:
my_node.remove(child)
print ET.tostring(kml_copied, xml_declaration=True, encoding='utf-8')
Is there better way to do this? at least to avoid making a deepcopy of the whole document...
Consider XSLT, the special-purpose declarative language designed to transform XML documents. And Python's lxml module has a built-in XSLT 1.0 processor. Additionally XSLT (whose script is a well-formed xml document can also adequately handle the kml undeclared namespace):
XSLT Script (save as .xsl to be loaded in Python, also portable to other languages)
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
xmlns:doc="http://earth.google.com/kml/2.1">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>
<!-- Identity Transform to copy entire document -->
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
<!-- Empty Template to Remove Nodes -->
<xsl:template match="doc:Style|doc:name"/>
</xsl:transform>
Python Script
import lxml.etree as ET
# LOAD XML AND XSL
dom = ET.parse('Input.xml')
xslt = ET.parse('XSLTScript.xsl')
# TRANSFORM INPUT INTO DOM OBJECT
transform = ET.XSLT(xslt)
newdom = transform(dom)
# OUTPUT DOM TO STRING
tree_out = ET.tostring(newdom,
encoding='UTF-8',
pretty_print=True,
xml_declaration=True)
print(tree_out.decode("utf-8"))
# SAVE RESULTING XML
xmlfile = open('Output.xml','wb')
xmlfile.write(tree_out)
xmlfile.close()
Output
<?xml version='1.0' encoding='UTF-8'?>
<kml xmlns="http://earth.google.com/kml/2.1">
<Document id="myid"/>
</kml>

Override text in XML with lxml

Say I have an XML file and I want to edit parts of it. The following does not work, probably because I am editing a copy of a child.
from lxml import etree as et
tree = et.parse(p_my_xml)
root = tree.getroot()
for child in root:
for entry in child.getchildren():
first_part = entry.getchildren()[1].text
second_part = entry.getchildren()[2].text
if first_part == 'some_condition'
second_part = 'something_else'
tree.write(p_my_xml, pretty_print=True)
How can I correctly modify parts of the XML so that the changes are done in the tree?
Save the reference to the element and reset the text:
second_elm = entry.getchildren()[2]
if first_part == 'some_condition'
second_elm.text = 'something_else'
For future readers, any XML transformation, styling, re-formatting, and re-structuring can be adequately and even efficiently handled with XSLT, the declarative programming language used for XML manipulation. And Python's lxml module maintains an XSLT processor.
See below generalized example using OP's needs:
Original XML
<?xml version="1.0" encoding="UTF-8"?>
<root>
<child>
<entry1>some text</entry1>
<entry2>other text</entry2>
</child>
<child>
<entry1>some text</entry1>
<entry2>other text</entry2>
</child>
</root>
XSLT Script
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="root">
<root>
<xsl:for-each select="//child">
<child>
<xsl:copy-of select="entry1"/>
<xsl:if test="entry1='some text'">
<entry2>some new text</entry2>
</xsl:if>
</child>
</xsl:for-each>
</root>
</xsl:template>
</xsl:transform>
Python Script
import os
import lxml.etree as ET
cd = os.path.dirname(os.path.abspath(__file__))
dom = ET.parse(os.path.join(cd, 'Original.xml'))
xslt = ET.parse(os.path.join(cd, 'XSLTScript.xsl'))
transform = ET.XSLT(xslt)
newdom = transform(dom)
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
xmlfile = open(os.path.join(cd, 'Final.xml'),'wb')
xmlfile.write(tree_out)
xmlfile.close()
Final XML
<?xml version='1.0' encoding='UTF-8'?>
<root>
<child>
<entry1>some text</entry1>
<entry2>some new text</entry2>
</child>
<child>
<entry1>some text</entry1>
<entry2>some new text</entry2>
</child>
</root>
While the above may seem too involved and not a Pythonic one-liner, please note there may be an occasion where you require a complex, intricate XML restructuring where you can leverage XSLT's recursive, template formatting language and not run complex iteration loops in object-oriented programming (Python, PHP, Java, C#, etc).

Categories