I have an XML with a number of strings:
<?xml version="1.0" encoding="UTF-8"?>
<Strings>
<String id="TEST_STRING_FROM_XML">
<en>Test string from XML</en>
<de>Testzeichenfolge aus XML</de>
<es>Cadena de prueba de XML</es>
<fr>Tester la chaîne à partir de XML</fr>
<it>Stringa di test da XML</it>
<ja>XMLからのテスト文字列</ja>
<ko>XML에서 테스트 문자열</ko>
<nl>Testreeks van XML</nl>
<pl>Łańcuch testowy z XML</pl>
<pt>Cadeia de teste de XML</pt>
<ru>Тестовая строка из XML</ru>
<sv>Teststräng från XML</sv>
<zh-CHS>从XML测试字符串</zh-CHS>
<zh-CHT>從XML測試字符串</zh-CHT>
<Comment>A test string that comes from a shared XML file.</Comment>
</String>
<String id="TEST_STRING_FROM_XML_2">
<en>Another test string from XML.</en>
<de></de>
<es></es>
<fr></fr>
<it></it>
<ja></ja>
<ko></ko>
<nl></nl>
<pl></pl>
<pt></pt>
<ru></ru>
<sv></sv>
<zh-CHS></zh-CHS>
<zh-CHT></zh-CHT>
<Comment>Another test string that comes from a shared XML file.</Comment>
</String>
</Strings>
And I would like to append these strings to a resx file with a long list of strings in the following format:
<?xml version="1.0" encoding="utf-8"?>
<root>
<!--
Microsoft ResX Schema
Version 2.0
**a bunch of schema and header stuff...**
-->
<data name="STRING_NAME_1" xml:space="preserve">
<value>This is a value 1</value>
<comment>This is a comment 1</comment>
</data>
<data name="STRING_NAME_2" xml:space="preserve">
<value>This is a value 2</value>
<comment>This is a comment 2</comment>
</data>
</root>
But using the following snippet of python code:
import sys, os, os.path, re
import xml.etree.ElementTree as ET
from xml.dom import minidom
existingStrings = []
newStrings = {}
languages = []
resx = '*path to resx file*'
def LoadAllNewStrings():
src_root = ET.parse('Strings.xml').getroot()
for src_string in src_root.findall('String'):
src_id = src_string.get('id')
src_value = src_string.findtext("en")
src_comment = src_string.findtext("Comment")
content = [src_value, src_comment]
newStrings[src_id] = content
def ExscludeExistingStrings():
dest_root = ET.parse(resx)
for stringName in dest_root.findall('Name'):
for stringId in newStrings:
if stringId == stringName:
newStrings.remove(stringId)
def PrettifyXML(element):
roughString = ET.tostring(element, 'utf-8')
reparsed = minidom.parseString(roughString)
return reparsed.toprettyxml(indent=" ")
def AddMissingStringsToLocalResource():
ExscludeExistingStrings()
with open(resx, "a") as output:
root = ET.parse(resx).getroot()
for newString in newStrings:
data = ET.Element("data", name=newString)
newStringContent = newStrings[newString]
newStringValue = newStringContent[0]
newStringComment = newStringContent[1]
ET.SubElement(data, "value").text = newStringValue
ET.SubElement(data, "comment").text = newStringComment
output.write(PrettifyXML(data))
if __name__ == "__main__":
LoadAllNewStrings()
AddMissingStringsToLocalResource()
I get the following XML appended to the end of the resx file:
<data name="STRING_NAME_2" xml:space="preserve">
<value>This is a value 1</value>
<comment>This is a comment 1</comment>
</data>
</root><?xml version="1.0" ?>
<data name="TEST_STRING_FROM_XML">
<value>Test string from XML</value>
<comment>A test string that comes from a shared XML file.</comment>
</data>
<?xml version="1.0" ?>
<data name="TEST_STRING_FROM_XML_2">
<value>Another test string from XML.</value>
<comment>Another test string that comes from a shared XML file.</comment>
</data>
I.e. the root ends and then my new strings are added after. Any ideas on how to add the data tags to the existing root properly?
with open(resx, "a") as output:
No. Don't open XML files as text files. Not for reading, not for writing, not for appending. Never.
The typical life cycle of an XML file is:
parsing (with an XML parser)
reading or Modification (with a DOM API)
if there were changes: Serializition (also with a DOM API)
At no point should you ever call open() on an XML file. XML files are not supposed to be treated as if they were plain text. They are not.
# parsing
resx = ET.parse(resx_path)
root = resx.getroot()
# modification
for newString in newStrings:
newStringContent = newStrings[newString]
# create node
data = ET.Element("data", name=newString)
ET.SubElement(data, "value").text = newStringContent[0]
ET.SubElement(data, "comment").text = newStringContent[1]
# append node, e.g. to the top level element
root.append(data)
# serialization
resx.write(resx_path, encoding='utf8')
I am trying to parse XML before converting it's content into lists and then into CSV. Unfortunately, I think my search terms for finding the initial element are failing, causing subsequent searches further down the hierarchy. I am new to XML, so I've tried variations on namespace dictionaries and including the namespace references... The simplified XML is given below:
<?xml version="1.0" encoding="utf-8"?>
<StationList xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:add="http://www.govtalk.gov.uk/people/AddressAndPersonalDetails"
xmlns:com="http://nationalrail.co.uk/xml/common" xsi:schemaLocation="http://internal.nationalrail.co.uk/xml/XsdSchemas/External/Version4.0/nre-station-v4-0.xsd"
xmlns="http://nationalrail.co.uk/xml/station">
<Station xsi:schemaLocation="http://internal.nationalrail.co.uk/xml/XsdSchemas/External/Version4.0/nre-station-v4-0.xsd">
<ChangeHistory>
<com:ChangedBy>spascos</com:ChangedBy>
<com:LastChangedDate>2018-11-07T00:00:00.000Z</com:LastChangedDate>
</ChangeHistory>
<Name>Aber</Name>
</Station>
The Code I am using to try to extract the com/...xml/station / ChangedBy element is below
tree = ET.parse(rootfilepath + "NRE_Station_Dataset_2019_raw.xml")
root = tree.getroot()
#get at the tags and their data
#for elem in tree.iter():
# print(f"this the tag {elem.tag} and this is the data: {elem.text}")
#open file for writing
station_data = open(rootfilepath + 'station_data.csv','w')
csvwriter = csv.writer(station_data)
station_head = []
count = 0
#inspiration for this code: http://blog.appliedinformaticsinc.com/how-to- parse-and-convert-xml-to-csv-using-python/
#this is where it goes wrong; some combination of the namespace and the tag can't find anything in line 27, 'StationList'
for member in root.findall('{http://nationalrail.co.uk/xml/station}Station'):
station = []
if count == 0:
changedby = member.find('{http://nationalrail.co.uk/xml/common}ChangedBy').tag
station_head.append(changedby)
name = member.find('{http://nationalrail.co.uk/xml/station}Name').tag
station_head.append(name)
count = count+1
changedby = member.find('{http://nationalrail.co.uk/xml/common}ChangedBy').text
station.append(changedby)
name = member.find('{http://nationalrail.co.uk/xml/station}Name').text
station.append(name)
csvwriter.writerow(station)
I have tried:
using dictionaries of namespaces but that results in nothing being found at all
using hard coded namespaces but that results in "Attribute Error: 'NoneType' object has no attribute 'tag'
Thanks in advance for all and any assistance.
First of all your XML is invalid (</StationList> is absent at the end of a file).
Assuming you have valid XML file:
<?xml version="1.0" encoding="utf-8"?>
<StationList xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:add="http://www.govtalk.gov.uk/people/AddressAndPersonalDetails"
xmlns:com="http://nationalrail.co.uk/xml/common" xsi:schemaLocation="http://internal.nationalrail.co.uk/xml/XsdSchemas/External/Version4.0/nre-station-v4-0.xsd"
xmlns="http://nationalrail.co.uk/xml/station">
<Station xsi:schemaLocation="http://internal.nationalrail.co.uk/xml/XsdSchemas/External/Version4.0/nre-station-v4-0.xsd">
<ChangeHistory>
<com:ChangedBy>spascos</com:ChangedBy>
<com:LastChangedDate>2018-11-07T00:00:00.000Z</com:LastChangedDate>
</ChangeHistory>
<Name>Aber</Name>
</Station>
</StationList>
Then you can convert your XML to JSON and simply address to the required value:
import xmltodict
with open('file.xml', 'r') as f:
data = xmltodict.parse(f.read())
changed_by = data['StationList']['Station']['ChangeHistory']['com:ChangedBy']
Output:
spascos
Try lxml:
#!/usr/bin/env python3
from lxml import etree
ns = {"com": "http://nationalrail.co.uk/xml/common"}
with open("so.xml") as f:
tree = etree.parse(f)
for t in tree.xpath("//com:ChangedBy/text()", namespaces=ns):
print(t)
Output:
spascos
You can use Beautifulsoup which is an html and xml parser
from bs4 import BeautifulSoup
fd = open(rootfilepath + "NRE_Station_Dataset_2019_raw.xml")
soup = BeautifulSoup(fd,'lxml-xml')
for i in soup.findAll('ChangeHistory'):
print(i.ChangedBy.text)
I have the following Input XML:
<?xml version="1.0" encoding="utf-8"?>
<Scenario xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="Scenario.xsd">
<TestCase>test_startup_0029</TestCase>
<ShortDescription>Restart of the EVC with missing ODO5 board.</ShortDescription>
<Events>
<Event Num="1">Switch on the EVC</Event>
</Events>
<HW-configuration>
<ELBE5A>true</ELBE5A>
<ELBE5K>false</ELBE5K>
</HW-configuration>
<SystemFailure>true</SystemFailure>
</Scenario>
My Program does add three Tags to the XML but they are formatted false.
The Output XML looks like the following:
<Scenario xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="Scenario.xsd">
<TestCase>test_startup_0029</TestCase>
<ShortDescription>Restart of the EVC with missing ODO5 board.</ShortDescription>
<Events>
<Event Num="1">Switch on the EVC</Event>
</Events>
<HW-configuration>
<ELBE5A>true</ELBE5A>
<ELBE5K>false</ELBE5K>
</HW-configuration>
<SystemFailure>true</SystemFailure>
<Duration>12</Duration><EVC-SW-Version>08.02.0001.0027</EVC-SW-Version><STAC-Release>08.02.0001.0027</STAC-Release></Scenario>
Thats my Source-Code:
class XmlManager:
#staticmethod
def write_xml(xml_path, duration, evc_sw_version):
xml_path = os.path.abspath(xml_path)
if os.path.isfile(xml_path) and xml_path.endswith(".xml"):
# parse XML into etree
root = etree.parse(xml_path).getroot()
# add tags
duration_tag = etree.SubElement(root, "Duration")
duration_tag.text = duration
sw_version_tag = etree.SubElement(root, "EVC-SW-Version")
sw_version_tag.text = evc_sw_version
stac_release = evc_sw_version
stac_release_tag = etree.SubElement(root, "STAC-Release")
stac_release_tag.text = stac_release
# write changes to the XML-file
tree = etree.ElementTree(root)
tree.write(xml_path, pretty_print=False)
else:
XmlManager.logger.log("Invalid path to XML-file")
def main():
xml = r".\Test_Input_Data_Base\blnmerf1_md1czjyc_REL_V_08.01.0001.000x\Test_startup_0029\Test_startup_0029.xml"
XmlManager.write_xml(xml, "12", "08.02.0001.0027")
My Question is how to add the new tags to the XML in the right format. I guess its working that way for parsing again the changed XML but its not nice formated. Any Ideas? Thanks in advance.
To ensure nice pretty-printed output, you need to do two things:
Parse the input file using an XMLParser object with remove_blank_text=True.
Write the output using pretty_print=True
Example:
from lxml import etree
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse("Test_startup_0029.xml", parser)
root = tree.getroot()
duration_tag = etree.SubElement(root, "Duration")
duration_tag.text = "12"
sw_version_tag = etree.SubElement(root, "EVC-SW-Version")
sw_version_tag.text = "08.02.0001.0027"
stac_release_tag = etree.SubElement(root, "STAC-Release")
stac_release_tag.text = "08.02.0001.0027"
tree.write("output.xml", pretty_print=True)
Contents of output.xml:
<Scenario xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="Scenario.xsd">
<TestCase>test_startup_0029</TestCase>
<ShortDescription>Restart of the EVC with missing ODO5 board.</ShortDescription>
<Events>
<Event Num="1">Switch on the EVC</Event>
</Events>
<HW-configuration>
<ELBE5A>true</ELBE5A>
<ELBE5K>false</ELBE5K>
</HW-configuration>
<SystemFailure>true</SystemFailure>
<Duration>12</Duration>
<EVC-SW-Version>08.02.0001.0027</EVC-SW-Version>
<STAC-Release>08.02.0001.0027</STAC-Release>
</Scenario>
See also http://lxml.de/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml-output.
I need to parse an xml file, lest say called example.xml, that looks like the following:
<?xml version="1.0" encoding="ISO-8859-1"?>
<nf:rpc-reply xmlns:nf="urn:ietf:params:xml:ns:netconf:base:1.0" xmlns="http://www.cisco.com/nxos:1.0:if_manager">
<nf:data>
<show>
<interface>
<__XML__OPT_Cmd_show_interface___readonly__>
<__readonly__>
<TABLE_interface>
<ROW_interface>
<interface>Ethernet1/1</interface>
<state>down</state>
<state_rsn_desc>Link not connected</state_rsn_desc>
<admin_state>up</admin_state>
I need to get "interface", and "state" elements as such: ['Ethernet1/1', 'down']
Here is my solution that doesnt work:
from lxml import etree
parser = etree.XMLParser()
tree = etree.parse('example.xml', parser)
print tree.xpath('//*/*/*/*/*/*/*/*/interface/text()')
print tree.xpath('//*/*/*/*/*/*/*/*/state/text()')
You need to handle namespaces here:
from lxml import etree
parser = etree.XMLParser()
tree = etree.parse('example.xml', parser)
ns = {'ns': 'http://www.cisco.com/nxos:1.0:if_manager'}
interface = tree.find('//ns:ROW_interface', namespaces=ns)
print [interface.find('.//ns:interface', namespaces=ns).text,
interface.find('.//ns:state', namespaces=ns).text]
Prints:
['Ethernet1/1', 'down']
Using collections.namedtuple():
interface_node = tree.find('//ns:ROW_interface', ns)
Interface = namedtuple('Interface', ['interface', 'state'])
interface = Interface(interface=interface_node.find('.//ns:interface', ns).text,
state=interface_node.find('.//ns:state', ns).text)
print interface
Prints:
Interface(interface='Ethernet1/1', state='down')
I've discovered that cElementTree is about 30 times faster than xml.dom.minidom and I'm rewriting my XML encoding/decoding code. However, I need to output XML that contains CDATA sections and there doesn't seem to be a way to do that with ElementTree.
Can it be done?
After a bit of work, I found the answer myself. Looking at the ElementTree.py source code, I found there was special handling of XML comments and preprocessing instructions. What they do is create a factory function for the special element type that uses a special (non-string) tag value to differentiate it from regular elements.
def Comment(text=None):
element = Element(Comment)
element.text = text
return element
Then in the _write function of ElementTree that actually outputs the XML, there's a special case handling for comments:
if tag is Comment:
file.write("<!-- %s -->" % _escape_cdata(node.text, encoding))
In order to support CDATA sections, I create a factory function called CDATA, extended the ElementTree class and changed the _write function to handle the CDATA elements.
This still doesn't help if you want to parse an XML with CDATA sections and then output it again with the CDATA sections, but it at least allows you to create XMLs with CDATA sections programmatically, which is what I needed to do.
The implementation seems to work with both ElementTree and cElementTree.
import elementtree.ElementTree as etree
#~ import cElementTree as etree
def CDATA(text=None):
element = etree.Element(CDATA)
element.text = text
return element
class ElementTreeCDATA(etree.ElementTree):
def _write(self, file, node, encoding, namespaces):
if node.tag is CDATA:
text = node.text.encode(encoding)
file.write("\n<![CDATA[%s]]>\n" % text)
else:
etree.ElementTree._write(self, file, node, encoding, namespaces)
if __name__ == "__main__":
import sys
text = """
<?xml version='1.0' encoding='utf-8'?>
<text>
This is just some sample text.
</text>
"""
e = etree.Element("data")
cdata = CDATA(text)
e.append(cdata)
et = ElementTreeCDATA(e)
et.write(sys.stdout, "utf-8")
lxml has support for CDATA and API like ElementTree.
Here is a variant of gooli's solution that works for python 3.2:
import xml.etree.ElementTree as etree
def CDATA(text=None):
element = etree.Element('![CDATA[')
element.text = text
return element
etree._original_serialize_xml = etree._serialize_xml
def _serialize_xml(write, elem, qnames, namespaces):
if elem.tag == '![CDATA[':
write("\n<%s%s]]>\n" % (
elem.tag, elem.text))
return
return etree._original_serialize_xml(
write, elem, qnames, namespaces)
etree._serialize_xml = etree._serialize['xml'] = _serialize_xml
if __name__ == "__main__":
import sys
text = """
<?xml version='1.0' encoding='utf-8'?>
<text>
This is just some sample text.
</text>
"""
e = etree.Element("data")
cdata = CDATA(text)
e.append(cdata)
et = etree.ElementTree(e)
et.write(sys.stdout.buffer.raw, "utf-8")
Solution:
import xml.etree.ElementTree as ElementTree
def CDATA(text=None):
element = ElementTree.Element('![CDATA[')
element.text = text
return element
ElementTree._original_serialize_xml = ElementTree._serialize_xml
def _serialize_xml(write, elem, qnames, namespaces,short_empty_elements, **kwargs):
if elem.tag == '![CDATA[':
write("\n<{}{}]]>\n".format(elem.tag, elem.text))
if elem.tail:
write(_escape_cdata(elem.tail))
else:
return ElementTree._original_serialize_xml(write, elem, qnames, namespaces,short_empty_elements, **kwargs)
ElementTree._serialize_xml = ElementTree._serialize['xml'] = _serialize_xml
if __name__ == "__main__":
import sys
text = """
<?xml version='1.0' encoding='utf-8'?>
<text>
This is just some sample text.
</text>
"""
e = ElementTree.Element("data")
cdata = CDATA(text)
root.append(cdata)
Background:
I don't know whether previous versions of proposed code worked very well and whether ElementTree module has been updated but I have faced problems with using this trick:
etree._original_serialize_xml = etree._serialize_xml
def _serialize_xml(write, elem, qnames, namespaces):
if elem.tag == '![CDATA[':
write("\n<%s%s]]>\n" % (
elem.tag, elem.text))
return
return etree._original_serialize_xml(
write, elem, qnames, namespaces)
etree._serialize_xml = etree._serialize['xml'] = _serialize_xml
The problem with this approach is that after passing this exception, serializer is again treating it as normal tag afterwards. I was getting something like:
<textContent>
<![CDATA[this was the code I wanted to put inside of CDATA]]>
<![CDATA[>this was the code I wanted to put inside of CDATA</![CDATA[>
</textContent>
And of course we know that will cause only plenty of errors.
Why that was happening though?
The answer is in this little guy:
return etree._original_serialize_xml(write, elem, qnames, namespaces)
We don't want to examine code once again through original serialise function if we have trapped our CDATA and successfully passed it through.
Therefore in the "if" block we have to return original serialize function only when CDATA was not there. We were missing "else" before returning original function.
Moreover in my version ElementTree module, serialize function was desperately asking for "short_empty_element" argument. So the most recent version I would recommend looks like this(also with "tail"):
from xml.etree import ElementTree
from xml import etree
#in order to test it you have to create testing.xml file in the folder with the script
xmlParsedWithET = ElementTree.parse("testing.xml")
root = xmlParsedWithET.getroot()
def CDATA(text=None):
element = ElementTree.Element('![CDATA[')
element.text = text
return element
ElementTree._original_serialize_xml = ElementTree._serialize_xml
def _serialize_xml(write, elem, qnames, namespaces,short_empty_elements, **kwargs):
if elem.tag == '![CDATA[':
write("\n<{}{}]]>\n".format(elem.tag, elem.text))
if elem.tail:
write(_escape_cdata(elem.tail))
else:
return ElementTree._original_serialize_xml(write, elem, qnames, namespaces,short_empty_elements, **kwargs)
ElementTree._serialize_xml = ElementTree._serialize['xml'] = _serialize_xml
text = """
<?xml version='1.0' encoding='utf-8'?>
<text>
This is just some sample text.
</text>
"""
e = ElementTree.Element("data")
cdata = CDATA(text)
root.append(cdata)
#tests
print(root)
print(root.getchildren()[0])
print(root.getchildren()[0].text + "\n\nyay!")
The output I got was:
<Element 'Database' at 0x10062e228>
<Element '![CDATA[' at 0x1021cc9a8>
<?xml version='1.0' encoding='utf-8'?>
<text>
This is just some sample text.
</text>
yay!
I wish you the same result!
It's not possible AFAIK... which is a pity. Basically, ElementTree modules assume that the reader is 100% XML compliant, so it shouldn't matter if they output a section as CDATA or some other format that generates the equivalent text.
See this thread on the Python mailing list for more info. Basically, they recommend some kind of DOM-based XML library instead.
Actually this code has a bug, since you don't catch ]]> appearing in the data you are inserting as CDATA
as per Is there a way to escape a CDATA end token in xml?
you should break it into two CDATA's in that case, splitting the ]]> between the two.
basically data = data.replace("]]>", "]]]]><![CDATA[>")
(not necessarily correct, please verify)
This ended up working for me in Python 2.7. Similar to Amaury's answer.
import xml.etree.ElementTree as ET
ET._original_serialize_xml = ET._serialize_xml
def _serialize_xml(write, elem, encoding, qnames, namespaces):
if elem.tag == '![CDATA[':
write("<%s%s]]>%s" % (elem.tag, elem.text, elem.tail))
return
return ET._original_serialize_xml(
write, elem, encoding, qnames, namespaces)
ET._serialize_xml = ET._serialize['xml'] = _serialize_xml
You can override ElementTree _escape_cdata function:
import xml.etree.ElementTree as ET
def _escape_cdata(text, encoding):
try:
if "&" in text:
text = text.replace("&", "&")
# if "<" in text:
# text = text.replace("<", "<")
# if ">" in text:
# text = text.replace(">", ">")
return text
except TypeError:
raise TypeError(
"cannot serialize %r (type %s)" % (text, type(text).__name__)
)
ET._escape_cdata = _escape_cdata
Note that you may not need pass extra encoding param, depending on your library/python version.
Now you can write CDATA into obj.text like:
root = ET.Element('root')
body = ET.SubElement(root, 'body')
body.text = '<![CDATA[perform extra angle brackets escape for this text]]>'
print(ET.tostring(root))
and get clear CDATA node:
<root>
<body>
<![CDATA[perform extra angle brackets escape for this text]]>
</body>
</root>
I've discovered a hack to get CDATA to work using comments:
node.append(etree.Comment(' --><![CDATA[' + data.replace(']]>', ']]]]><![CDATA[>') + ']]><!-- '))
for python3 and ElementTree you can use next reciept
import xml.etree.ElementTree as ET
ET._original_serialize_xml = ET._serialize_xml
def serialize_xml_with_CDATA(write, elem, qnames, namespaces, short_empty_elements, **kwargs):
if elem.tag == 'CDATA':
write("<![CDATA[{}]]>".format(elem.text))
return
return ET._original_serialize_xml(write, elem, qnames, namespaces, short_empty_elements, **kwargs)
ET._serialize_xml = ET._serialize['xml'] = serialize_xml_with_CDATA
def CDATA(text):
element = ET.Element("CDATA")
element.text = text
return element
my_xml = ET.Element("my_name")
my_xml.append(CDATA("<p>some text</p>")
tree = ElementTree(my_xml)
if you need xml as str, you can use
ET.tostring(tree)
or next hack (which almost same as code inside tostring())
fake_file = BytesIO()
tree.write(fake_file, encoding="utf-8", xml_declaration=True)
result_xml_text = str(fake_file.getvalue(), encoding="utf-8")
and get result
<?xml version='1.0' encoding='utf-8'?>
<my_name>
<![CDATA[<p>some text</p>]]>
</my_name>
The DOM has (atleast in level 2) an interface
DATASection, and an operation Document::createCDATASection. They are
extension interfaces, supported only if an implementation supports the
"xml" feature.
from xml.dom import minidom
my_xmldoc=minidom.parse(xmlfile)
my_xmldoc.createCDATASection(data)
now u have cadata node add it wherever u want....
The accepted solution cannot work with Python 2.7. However, there is another package called lxml which (though slightly slower) shared a largely identical syntax with the xml.etree.ElementTree. lxml is able to both write and parse CDATA. Documentation here
Here's my version which is based on both gooli's and amaury's answers above. It works for both ElementTree 1.2.6 and 1.3.0, which use very different methods of doing this.
Note that gooli's does not work with 1.3.0, which seems to be the current standard in Python 2.7.x.
Also note that this version does not use the CDATA() method gooli used either.
import xml.etree.cElementTree as ET
class ElementTreeCDATA(ET.ElementTree):
"""Subclass of ElementTree which handles CDATA blocks reasonably"""
def _write(self, file, node, encoding, namespaces):
"""This method is for ElementTree <= 1.2.6"""
if node.tag == '![CDATA[':
text = node.text.encode(encoding)
file.write("\n<![CDATA[%s]]>\n" % text)
else:
ET.ElementTree._write(self, file, node, encoding, namespaces)
def _serialize_xml(write, elem, qnames, namespaces):
"""This method is for ElementTree >= 1.3.0"""
if elem.tag == '![CDATA[':
write("\n<![CDATA[%s]]>\n" % elem.text)
else:
ET._serialize_xml(write, elem, qnames, namespaces)
I got here looking for a way to "parse an XML with CDATA sections and then output it again with the CDATA sections".
I was able to do this (maybe lxml has been updated since this post?) with the following: (it is a little rough - sorry ;-). Someone else may have a better way to find the CDATA sections programatically but I was too lazy.
parser = etree.XMLParser(encoding='utf-8') # my original xml was utf-8 and that was a lot of the problem
tree = etree.parse(ppath, parser)
for cdat in tree.findall('./ProjectXMPMetadata'): # the tag where my CDATA lives
cdat.text = etree.CDATA(cdat.text)
# other stuff here
tree.write(opath, encoding="UTF-8",)
Simple way of making .xml file with CDATA sections
The main idea is that we covert the element tree to a string and call unescape on it. Once we have the string we use standard python to write a string to a file.
Based on:
How to write unescaped string to a XML element with ElementTree?
Code that generates the XML file
import xml.etree.ElementTree as ET
from xml.sax.saxutils import unescape
# defining the tree structure
element1 = ET.Element('test1')
element1.text = '<![CDATA[Wired & Forbidden]]>'
# & and <> are in a weird format
string1 = ET.tostring(element1).decode()
print(string1)
# now they are not weird anymore
# more formally, we unescape '&', '<', and '>' in a string of data
# from https://docs.python.org/3.8/library/xml.sax.utils.html#xml.sax.saxutils.unescape
string1 = unescape(string1)
print(string1)
element2 = ET.Element('test2')
element2.text = '<![CDATA[Wired & Forbidden]]>'
string2 = unescape(ET.tostring(element2).decode())
print(string2)
# make the xml file and open in append mode
with open('foo.xml', 'a') as f:
f.write(string1 + '\n')
f.write(string2)
Output foo.xml
<test1><![CDATA[Wired & Forbidden]]></test1>
<test2><![CDATA[Wired & Forbidden]]></test2>