What is the easiest way to navigate through XML with python?
<html>
<body>
<soapenv:envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<soapenv:body>
<getservicebyidresponse xmlns="http://www.something.com/soa/2.0/SRIManagement">
<code xmlns="">
0
</code>
<client xmlns="">
<action xsi:nil="true">
</action>
<actionmode xsi:nil="true">
</actionmode>
<clientid>
405965216
</clientid>
<firstname xsi:nil="true">
</firstname>
<id xsi:nil="true">
</id>
<lastname>
Last Name
</lastname>
<role xsi:nil="true">
</role>
<state xsi:nil="true">
</state>
</client>
</getservicebyidresponse>
</soapenv:body>
</soapenv:envelope>
</body>
</html>
I would go with regex and try to get the values of the lines I need but is there a pythonic way? something like xml[0][1] etc?
As #deceze already pointed out, you can use xml.etree.ElementTree here.
import xml.etree.ElementTree as ET
tree = ET.parse("path_to_xml_file")
root = tree.getroot()
You can iterate over all children nodes of root:
for child in root.iter():
if child.tag == 'clientid':
print(child.tag, child.text.strip())
Children are nested, and we can access specific child nodes by index, so root[0][1] should work (as long as the indices are correct).
Related
I'm completely new to xml parsing .I have some thousands of xml's and I want to find out all element DE , only when I have country tag
Here is my sample xml
<?xml version="1.0" encoding="UTF-8"?>
<DE>
<CT>
<IG>
<FS id="01">
<FE id="A" fId="B">
<title>Apple</title>
</FE>
</FS>
<country syse="21" subSys="2">
<FF FR="101" fe="01" />
<referTo refType="t06">
<CF Code="350" />
</referTo>
<place id="00A" placeValue="00AB">
<Q>001</Q>
<TQ>0001</TQ>
<PR Value="A" CodeValue="C" />
</place>
<place id="00E" placeValue="00EF">
<Q>001</Q>
<TQ>0001</TQ>
<PR Value="03" AValue="957" />
<Books>
<IA>
<Part />
</IA>
<PRGroup>
<country Code="5">
<PR Value="02" AValue="345" />
<constrain>Double condition.</constrain>
<constrain>Double condition.</constrain>
</country>
</PRGroup>
</Books>
</place>
</country>
</IG>
</CT>
</DE>
import xml.etree.ElementTree as ET
tree = ET.parse(content)
root = tree.getroot()
Num = root.findall("//DE[//place/Books/PRGroup/country]")
am getting predicate error or absolute path error when am trying different ways but am not able to figure this out.
How can I retrieve the results and access the attributes based on that
could you please help me on this.
With lxml it should be something along these lines:
from lxml import etree
content = """[your xml above]"""
root = etree.fromstring(content.encode())
Num = root.xpath("//DE[//place/Books/PRGroup/country]")
I want to remove element but not its children. I tried with this code, but my code remove its children also.
code
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
for item in root.findall('item'):
root.remove(item)
print(ET.tostring(root))
>>> <root>
</root>
test.xml
<?xml version="1.0" ?>
<root>
<item>
<data>
<number>01</number>
<step>one</step>
</data>
</item>
</root>
expected outcome
<?xml version="1.0" ?>
<root>
<data>
<number>01</number>
<step>one</step>
</data>
</root>
You should move all children of item to root before removing
for item in root.findall('item'):
for child in item:
root.append(child)
root.remove(item)
print(ET.tostring(root))
the code results in
<root>
<data>
<number>01</number>
<step>one</step>
</data>
</root>
Find the element with data tag, remove it and extend the element's parent with element's children.
import xml.etree.ElementTree as etree
data = """
<root>
<item>
<data>
<number>01</number>
<step>one</step>
</data>
</item>
</root>
"""
tree = etree.fromstring(data)
def iterparent(tree):
for parent in tree.iter():
for child in parent:
yield parent, child
tree = etree.fromstring(data)
for parent, child in iterparent(tree):
if child.tag == "data":
parent.remove(child)
parent.extend(child)
print((etree.tostring(tree)))
will output
<root>
<item>
<number>01</number>
<step>one</step>
</item>
</root>
Adapted from a similar answer for your particular use case.
I'd like to edit a KML file and remove all occurences of ExtendedData elements, wherever they are located in the file.
Here's the input XML file:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.2">
<Document>
<Style id="placemark-red">
<IconStyle>
<Icon>
<href>http://maps.me/placemarks/placemark-red.png</href>
</Icon>
</IconStyle>
</Style>
<name>My track</name>
<ExtendedData xmlns:mwm="https://maps.me">
<mwm:name>
<mwm:lang code="default">Blah</mwm:lang>
</mwm:name>
<mwm:lastModified>2020-04-05T14:17:18Z</mwm:lastModified>
</ExtendedData>
<Placemark>
<name></name>
…
<ExtendedData xmlns:mwm="https://maps.me">
<mwm:localId>0</mwm:localId>
<mwm:visibility>1</mwm:visibility>
</ExtendedData>
</Placemark>
</Document>
</kml>
And here's the code that 1) only removes the outermost occurence, and 2) requires adding the namespace to find it:
from lxml import etree
from pykml import parser
from pykml.factory import KML_ElementMaker as KML
with open("input.xml") as f:
doc = parser.parse(f)
root = doc.getroot()
ns = "{http://earth.google.com/kml/2.2}"
for pm in root.Document.getchildren():
#No way to get rid of namespace, for easier search?
if pm.tag==f"{ns}ExtendedData":
root.Document.remove(pm)
#How to remove innermost occurence of ExtendedData?
print(etree.tostring(doc, pretty_print=True))
Is there a way to remove all occurences in one go, or should I parse the whole tree?
Thank you.
Edit: The BeautifulSoup solution below requires adding an option "BeautifulSoup(my_xml,features="lxml")" to avoid the warning "No parser was explicitly specified".
Here's a solution using BeautifulSoup:
soup = BeautifulSoup(my_xml) # this is your xml
while True:
elem = soup.find("extendeddata")
if not elem:
break
elem.decompose()
Here's the output for your data:
<?xml version="1.0" encoding="UTF-8"?>
<html>
<body>
<kml xmlns="http://earth.google.com/kml/2.2">
<document>
<style id="placemark-red">
<IconStyle>
<Icon>
<href>http://maps.me/placemarks/placemark-red.png</href>
</Icon>
</IconStyle>
</style>
<name>
My track
</name>
<placemark>
<name>
</name>
</placemark>
</document>
</kml>
</body>
</html>
If you know the XML structure, try:
xml_root = ElementTree.parse(filename_path).getroot()
elem = xml_root.find('./ExtendedData')
xml_root.remove(elem)
or
xml_root = ElementTree.parse(filename_path).getroot()
p_elem = xml_root.find('/Placemark')
c_elem = xml_root.find('/Placemark/ExtendedData')
p_elem.remove(c_elem)
play with this ideas :)
if you don't know the xml structure, I think you need to parse the whole tree.
Simply run the empty template with Identity Transform using XSLT 1.0 which Python's lxml can run. No for/while loops or if logic needed. To handle the default namespace, define a prefix like doc:
XSLT (save a .xsl file, a special .xml file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:doc="http://earth.google.com/kml/2.2">
<xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<!-- IDENTITY TRANSFORM -->
<xsl:template match="node()|#*">
<xsl:copy>
<xsl:apply-templates select="node()|#*"/>
</xsl:copy>
</xsl:template>
<!-- REMOVE ALL OCCURRENCES OF NODE -->
<xsl:template match="doc:ExtendedData"/>
</xsl:stylesheet>
Python
import lxml.etree as et
# LOAD XML AND XSL SOURCES
xml = et.parse('Input.xml')
xsl = et.parse('XSLT_Script.xsl')
# TRANSFORM INPUT
transform = et.XSLT(xsl)
result = transform(xml)
# PRINT TO SCREEN
print(result)
# SAVE TO FILE
with open('Output.kml', 'wb') as f:
f.write(result)
i'm realy stuck in this, i got a file with an xml layout like this:
<rss xmlns:irc="SomeName" version="2.0">
<channel>
<item>
<irc:title>A title</irc:title>
<irc:poster>A poster</irc:poster>
<irc:url>An url</irc:url>
</item>
</channel>
</rss>
i need to add another 'item' in channel node, that's easy, but i can't find the way to add the item's child with the namespace.
i'm trying with lxml, but the documentation is not so clear for a newbie
please any help will be appreciated.
i find the way to doit with lxml
root = xml.getroot()
channel = root.find('channel')
item = et.Element('item')
title = et.SubElement(item,'{SomeName}title')
title.text = 'My new title'
poster = et.SubElement(item,'{SomeName}poster')
poster.text = 'My poster'
poster = et.SubElement(item,'{SomeName}url')
poster.text = 'http://My.url.com'
channel.append(item)
but still interested in a better solution
Alternatively, you can use XSLT, the declarative programming language, that transforms, styles, re-formats, and re-structures XML files in any way, shape, or form. Python's lxml module maintains an XSLT processor.
Simply, register the needed namespace in the XSLT's declaration line and use it in any new node. This might appear to be overkill for your current need but there could be a situation where a more complex transformation is needed with appropriate namespaces. Below adds a new title to the previous poster and URL.
XSLT (to be saved as .xsl)
<?xml version="1.0" ?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
xmlns:irc="SomeName">
<xsl:strip-space elements="*" />
<xsl:output method="xml" indent="yes"/>
<xsl:template match="rss">
<rss>
<channel>
<xsl:for-each select="//item">
<item>
<irc:title>My new title</irc:title>
<xsl:copy-of select="irc:poster"/>
<xsl:copy-of select="irc:url"/>
</item>
</xsl:for-each>
</channel>
</rss>
</xsl:template>
</xsl:transform>
Python
import os
import lxml.etree as ET
# SET CURRENT DIRECTORY
cd = os.path.dirname(os.path.abspath(__file__))
# LOAD IN XML AND XSL FILES
dom = ET.parse(os.path.join(cd, 'Original.xml'))
xslt = ET.parse(os.path.join(cd, 'XSLT_Script.xsl'))
# TRANSFORM
transform = ET.XSLT(xslt)
newdom = transform(dom)
# OUTPUT FINAL XML
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
xmlfile = open(os.path.join(cd, 'output.xml'),'wb')
xmlfile.write(tree_out)
xmlfile.close()
Output
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:irc="SomeName">
<channel>
<item>
<irc:title>My new title</irc:title>
<irc:poster>A poster</irc:poster>
<irc:url>An url</irc:url>
</item>
</channel>
</rss>
I am trying to remove an element in an xml which contains a namespace.
Here is my code:
templateXml = """<?xml version="1.0" encoding="UTF-8"?>
<Metadata xmlns="http://www.amazon.com/UnboxMetadata/v1">
<Movie>
<CountryOfOrigin>US</CountryOfOrigin>
<TitleInfo>
<Title locale="en-GB">The Title</Title>
<Actor>
<ActorName locale="en-GB">XXX</ActorName>
<Character locale="en-GB">XXX</Character>
</Actor>
</TitleInfo>
</Movie>
</Metadata>"""
from lxml import etree
tree = etree.fromstring(templateXml)
namespaces = {'ns':'http://www.amazon.com/UnboxMetadata/v1'}
for checkActor in tree.xpath('//ns:Actor', namespaces=namespaces):
etree.strip_elements(tree, 'ns:Actor')
In my actual XML I have lots of tags, So I am trying to search for the Actor tags which contain XXX and completely remove that whole tag and its contents. But it's not working.
Use remove() method:
for checkActor in tree.xpath('//ns:Actor', namespaces=namespaces):
checkActor.getparent().remove(checkActor)
print etree.tostring(tree, pretty_print=True, xml_declaration=True)
prints:
<?xml version='1.0' encoding='ASCII'?>
<Metadata xmlns="http://www.amazon.com/UnboxMetadata/v1">
<Movie>
<CountryOfOrigin>US</CountryOfOrigin>
<TitleInfo>
<Title locale="en-GB">The Title</Title>
</TitleInfo>
</Movie>
</Metadata>