I am trying to use this code
s = """<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<gdml xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://service-spi.web.cern.ch/service-spi/app/releases/GDML/schema/gdml.xsd">"""
doctype = """<!DOCTYPE doc [
<!ENTITY ent SYSTEM "another_doc.xml">
]>"""
gdml = ET.fromstring(s.encode("UTF-8"))
But get XMLSyntaxError: EndTag
I want eventually to add the doctype as well
The string you pass to fromstring() needs to be well-formed XML. Since you only have a start tag, it's not well-formed.
This is easily resolvable by self-closing it or adding an end tag.
Also, if you want the XML declaration (<?xml ...?>) included in your serialized XML, you can add it to doctype.
Example...
from lxml import etree
s = """<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<gdml xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="http://service-spi.web.cern.ch/service-spi/app/releases/GDML/schema/gdml.xsd"/>"""
doctype = """<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE gdml [
<!ENTITY ent SYSTEM "another_doc.xml">
]>"""
gdml = etree.fromstring(s.encode("UTF-8"))
print(etree.tostring(gdml, doctype=doctype).decode())
This prints...
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<!DOCTYPE gdml [
<!ENTITY ent SYSTEM "another_doc.xml">
]>
<gdml xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://service-spi.web.cern.ch/service-spi/app/releases/GDML/schema/gdml.xsd"/>
Related
My input file is:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE note SYSTEM "Note.dtd">
<book>
<name>ABC</name>
</book>
I want to change the DOCTYPE element (commented - see below):
<?xml version="1.0" encoding="utf-8"?>
<!-- <!DOCTYPE note SYSTEM "Note.dtd"> -->
<book>
<name>ABC</name>
</book>
I wonder if we can do xml manipulation. Here what you can do with string manipulation
a='''<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE note SYSTEM "Note.dtd">
<book>
<name>ABC</name>
</book>'''
lis=a.split('\n')
for i,j in zip(lis,range(len(lis))):
if i.startswith('<!DOCTYPE'):
i = '<!--'+i+'-->'
lis[j]=i
a='\n'.join(lis)
If it does not have anything to split on.
a='<?xml version="1.0" encoding="utf-8"?><!DOCTYPE note SYSTEM "Note.dtd"><book><name>ABC</name></book>'
str1=""
start=a.index('<!DOCTYPE')
end=a.index('>',start)
str1=a[:start]
str1+='<!--'+'<!DOCTYPE note SYSTEM "Note.dtd">'+'-->'
str1+=a[end+1:]
str1 #'<?xml version="1.0" encoding="utf-8"?><!--<!DOCTYPE note SYSTEM "Note.dtd">--><book><name>ABC</name></book>'
From the XML document, I want to save one node to a file - with all parent nodes, but without any child nodes. For example, for the following XML:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.1">
<Document id="myid">
<name>ref.kml</name>
<Style id="normalState">
<IconStyle><scale>1.0</scale><Icon><href>yt.png</href></Icon></IconStyle>
</Style>
</Document>
</kml>
expected output for <Document> node will be like this:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.1">
<Document id="myid">
</Document>
</kml>
So far I only found a solution with iterated removal of all child elements before saving. But as I need to work with original XML after, I have to make a copy of the whole document:
#!/usr/bin/env python
import lxml.etree as ET # have to use [lxml] because [xml] doesn't support 'xml_declaration'
import copy
kml_file = ET.parse("myfile.kml")
kml_copied = copy.deepcopy(kml_file) # .copy() is not enough, need .deepcopy()
root = kml_copied.getroot()
my_node = root[0]
for child in my_node:
my_node.remove(child)
print ET.tostring(kml_copied, xml_declaration=True, encoding='utf-8')
Is there better way to do this? at least to avoid making a deepcopy of the whole document...
Consider XSLT, the special-purpose declarative language designed to transform XML documents. And Python's lxml module has a built-in XSLT 1.0 processor. Additionally XSLT (whose script is a well-formed xml document can also adequately handle the kml undeclared namespace):
XSLT Script (save as .xsl to be loaded in Python, also portable to other languages)
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
xmlns:doc="http://earth.google.com/kml/2.1">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>
<!-- Identity Transform to copy entire document -->
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
<!-- Empty Template to Remove Nodes -->
<xsl:template match="doc:Style|doc:name"/>
</xsl:transform>
Python Script
import lxml.etree as ET
# LOAD XML AND XSL
dom = ET.parse('Input.xml')
xslt = ET.parse('XSLTScript.xsl')
# TRANSFORM INPUT INTO DOM OBJECT
transform = ET.XSLT(xslt)
newdom = transform(dom)
# OUTPUT DOM TO STRING
tree_out = ET.tostring(newdom,
encoding='UTF-8',
pretty_print=True,
xml_declaration=True)
print(tree_out.decode("utf-8"))
# SAVE RESULTING XML
xmlfile = open('Output.xml','wb')
xmlfile.write(tree_out)
xmlfile.close()
Output
<?xml version='1.0' encoding='UTF-8'?>
<kml xmlns="http://earth.google.com/kml/2.1">
<Document id="myid"/>
</kml>
i'm realy stuck in this, i got a file with an xml layout like this:
<rss xmlns:irc="SomeName" version="2.0">
<channel>
<item>
<irc:title>A title</irc:title>
<irc:poster>A poster</irc:poster>
<irc:url>An url</irc:url>
</item>
</channel>
</rss>
i need to add another 'item' in channel node, that's easy, but i can't find the way to add the item's child with the namespace.
i'm trying with lxml, but the documentation is not so clear for a newbie
please any help will be appreciated.
i find the way to doit with lxml
root = xml.getroot()
channel = root.find('channel')
item = et.Element('item')
title = et.SubElement(item,'{SomeName}title')
title.text = 'My new title'
poster = et.SubElement(item,'{SomeName}poster')
poster.text = 'My poster'
poster = et.SubElement(item,'{SomeName}url')
poster.text = 'http://My.url.com'
channel.append(item)
but still interested in a better solution
Alternatively, you can use XSLT, the declarative programming language, that transforms, styles, re-formats, and re-structures XML files in any way, shape, or form. Python's lxml module maintains an XSLT processor.
Simply, register the needed namespace in the XSLT's declaration line and use it in any new node. This might appear to be overkill for your current need but there could be a situation where a more complex transformation is needed with appropriate namespaces. Below adds a new title to the previous poster and URL.
XSLT (to be saved as .xsl)
<?xml version="1.0" ?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
xmlns:irc="SomeName">
<xsl:strip-space elements="*" />
<xsl:output method="xml" indent="yes"/>
<xsl:template match="rss">
<rss>
<channel>
<xsl:for-each select="//item">
<item>
<irc:title>My new title</irc:title>
<xsl:copy-of select="irc:poster"/>
<xsl:copy-of select="irc:url"/>
</item>
</xsl:for-each>
</channel>
</rss>
</xsl:template>
</xsl:transform>
Python
import os
import lxml.etree as ET
# SET CURRENT DIRECTORY
cd = os.path.dirname(os.path.abspath(__file__))
# LOAD IN XML AND XSL FILES
dom = ET.parse(os.path.join(cd, 'Original.xml'))
xslt = ET.parse(os.path.join(cd, 'XSLT_Script.xsl'))
# TRANSFORM
transform = ET.XSLT(xslt)
newdom = transform(dom)
# OUTPUT FINAL XML
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
xmlfile = open(os.path.join(cd, 'output.xml'),'wb')
xmlfile.write(tree_out)
xmlfile.close()
Output
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:irc="SomeName">
<channel>
<item>
<irc:title>My new title</irc:title>
<irc:poster>A poster</irc:poster>
<irc:url>An url</irc:url>
</item>
</channel>
</rss>
I am trying to remove an element in an xml which contains a namespace.
Here is my code:
templateXml = """<?xml version="1.0" encoding="UTF-8"?>
<Metadata xmlns="http://www.amazon.com/UnboxMetadata/v1">
<Movie>
<CountryOfOrigin>US</CountryOfOrigin>
<TitleInfo>
<Title locale="en-GB">The Title</Title>
<Actor>
<ActorName locale="en-GB">XXX</ActorName>
<Character locale="en-GB">XXX</Character>
</Actor>
</TitleInfo>
</Movie>
</Metadata>"""
from lxml import etree
tree = etree.fromstring(templateXml)
namespaces = {'ns':'http://www.amazon.com/UnboxMetadata/v1'}
for checkActor in tree.xpath('//ns:Actor', namespaces=namespaces):
etree.strip_elements(tree, 'ns:Actor')
In my actual XML I have lots of tags, So I am trying to search for the Actor tags which contain XXX and completely remove that whole tag and its contents. But it's not working.
Use remove() method:
for checkActor in tree.xpath('//ns:Actor', namespaces=namespaces):
checkActor.getparent().remove(checkActor)
print etree.tostring(tree, pretty_print=True, xml_declaration=True)
prints:
<?xml version='1.0' encoding='ASCII'?>
<Metadata xmlns="http://www.amazon.com/UnboxMetadata/v1">
<Movie>
<CountryOfOrigin>US</CountryOfOrigin>
<TitleInfo>
<Title locale="en-GB">The Title</Title>
</TitleInfo>
</Movie>
</Metadata>
I want parse this response from SOAP and extract text between <LoginResult> :
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<LoginResponse xmlns="http://tempuri.org/wsSalesQuotation/Service1">
<LoginResult>45eeadF43423KKmP33</LoginResult>
</LoginResponse>
</soap:Body>
</soap:Envelope>
How I can do it using XML Python Libs?
import xml.etree.ElementTree as ET
tree = ET.parse('soap.xml')
print tree.find('.//{http://tempuri.org/wsSalesQuotation/Service1}LoginResult').text
>>45eeadF43423KKmP33
instead of print, do something useful to it.