Encoding in XML declaration python - python

I have created an XML file using python. But the XML declaration has only version info. How can I include encoding with XML declaration like:
<?xml version="1.0" encoding="UTF-8"?>

>>> from xml.dom.minidom import Document
>>> a=Document()
>>> a.toprettyxml(encoding="utf-8")
'<?xml version="1.0" encoding="utf-8"?>\n'
or
>>> a.toxml(encoding="utf-8")
'<?xml version="1.0" encoding="utf-8"?>'
you can set the encoding for the document.writexml() function in the same way.

Related

I want to append an element to xml’s lead [duplicate]

I am generating an XML document in Python using an ElementTree, but the tostring function doesn't include an XML declaration when converting to plaintext.
from xml.etree.ElementTree import Element, tostring
document = Element('outer')
node = SubElement(document, 'inner')
node.NewValue = 1
print tostring(document) # Outputs "<outer><inner /></outer>"
I need my string to include the following XML declaration:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
However, there does not seem to be any documented way of doing this.
Is there a proper method for rendering the XML declaration in an ElementTree?
I am surprised to find that there doesn't seem to be a way with ElementTree.tostring(). You can however use ElementTree.ElementTree.write() to write your XML document to a fake file:
from io import BytesIO
from xml.etree import ElementTree as ET
document = ET.Element('outer')
node = ET.SubElement(document, 'inner')
et = ET.ElementTree(document)
f = BytesIO()
et.write(f, encoding='utf-8', xml_declaration=True)
print(f.getvalue()) # your XML file, encoded as UTF-8
See this question. Even then, I don't think you can get your 'standalone' attribute without writing prepending it yourself.
I would use lxml (see http://lxml.de/api.html).
Then you can:
from lxml import etree
document = etree.Element('outer')
node = etree.SubElement(document, 'inner')
print(etree.tostring(document, xml_declaration=True))
If you include the encoding='utf8', you will get an XML header:
xml.etree.ElementTree.tostring writes a XML encoding declaration with encoding='utf8'
Sample Python code (works with Python 2 and 3):
import xml.etree.ElementTree as ElementTree
tree = ElementTree.ElementTree(
ElementTree.fromstring('<xml><test>123</test></xml>')
)
root = tree.getroot()
print('without:')
print(ElementTree.tostring(root, method='xml'))
print('')
print('with:')
print(ElementTree.tostring(root, encoding='utf8', method='xml'))
Python 2 output:
$ python2 example.py
without:
<xml><test>123</test></xml>
with:
<?xml version='1.0' encoding='utf8'?>
<xml><test>123</test></xml>
With Python 3 you will note the b prefix indicating byte literals are returned (just like with Python 2):
$ python3 example.py
without:
b'<xml><test>123</test></xml>'
with:
b"<?xml version='1.0' encoding='utf8'?>\n<xml><test>123</test></xml>"
xml_declaration Argument
Is there a proper method for rendering the XML declaration in an ElementTree?
YES, and there is no need of using .tostring function. According to ElementTree Documentation, you should create an ElementTree object, create Element and SubElements, set the tree's root, and finally use xml_declaration argument in .write function, so the declaration line is included in output file.
You can do it this way:
import xml.etree.ElementTree as ET
tree = ET.ElementTree("tree")
document = ET.Element("outer")
node1 = ET.SubElement(document, "inner")
node1.text = "text"
tree._setroot(document)
tree.write("./output.xml", encoding = "UTF-8", xml_declaration = True)
And the output file is:
<?xml version='1.0' encoding='UTF-8'?>
<outer><inner>text</inner></outer>
I encounter this issue recently, after some digging of the code, I found the following code snippet is definition of function ElementTree.write
def write(self, file, encoding="us-ascii"):
assert self._root is not None
if not hasattr(file, "write"):
file = open(file, "wb")
if not encoding:
encoding = "us-ascii"
elif encoding != "utf-8" and encoding != "us-ascii":
file.write("<?xml version='1.0' encoding='%s'?>\n" %
encoding)
self._write(file, self._root, encoding, {})
So the answer is, if you need write the XML header to your file, set the encoding argument other than utf-8 or us-ascii, e.g. UTF-8
Easy
Sample for both Python 2 and 3 (encoding parameter must be utf8):
import xml.etree.ElementTree as ElementTree
tree = ElementTree.ElementTree(ElementTree.fromstring('<xml><test>123</test></xml>'))
root = tree.getroot()
print(ElementTree.tostring(root, encoding='utf8', method='xml'))
From Python 3.8 there is xml_declaration parameter for that stuff:
New in version 3.8: The xml_declaration and default_namespace
parameters.
xml.etree.ElementTree.tostring(element, encoding="us-ascii",
method="xml", *, xml_declaration=None, default_namespace=None,
short_empty_elements=True) Generates a string representation of an XML
element, including all subelements. element is an Element instance.
encoding 1 is the output encoding (default is US-ASCII). Use
encoding="unicode" to generate a Unicode string (otherwise, a
bytestring is generated). method is either "xml", "html" or "text"
(default is "xml"). xml_declaration, default_namespace and
short_empty_elements has the same meaning as in ElementTree.write().
Returns an (optionally) encoded string containing the XML data.
Sample for Python 3.8 and higher:
import xml.etree.ElementTree as ElementTree
tree = ElementTree.ElementTree(ElementTree.fromstring('<xml><test>123</test></xml>'))
root = tree.getroot()
print(ElementTree.tostring(root, encoding='unicode', method='xml', xml_declaration=True))
The minimal working example with ElementTree package usage:
import xml.etree.ElementTree as ET
document = ET.Element('outer')
node = ET.SubElement(document, 'inner')
node.text = '1'
res = ET.tostring(document, encoding='utf8', method='xml').decode()
print(res)
the output is:
<?xml version='1.0' encoding='utf8'?>
<outer><inner>1</inner></outer>
Another pretty simple option is to concatenate the desired header to the string of xml like this:
xml = (bytes('<?xml version="1.0" encoding="UTF-8"?>\n', encoding='utf-8') + ET.tostring(root))
xml = xml.decode('utf-8')
with open('invoice.xml', 'w+') as f:
f.write(xml)
I would use ET:
try:
from lxml import etree
print("running with lxml.etree")
except ImportError:
try:
# Python 2.5
import xml.etree.cElementTree as etree
print("running with cElementTree on Python 2.5+")
except ImportError:
try:
# Python 2.5
import xml.etree.ElementTree as etree
print("running with ElementTree on Python 2.5+")
except ImportError:
try:
# normal cElementTree install
import cElementTree as etree
print("running with cElementTree")
except ImportError:
try:
# normal ElementTree install
import elementtree.ElementTree as etree
print("running with ElementTree")
except ImportError:
print("Failed to import ElementTree from any known place")
document = etree.Element('outer')
node = etree.SubElement(document, 'inner')
print(etree.tostring(document, encoding='UTF-8', xml_declaration=True))
This works if you just want to print. Getting an error when I try to send it to a file...
import xml.dom.minidom as minidom
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import Element, SubElement, Comment, tostring
def prettify(elem):
rough_string = ET.tostring(elem, 'utf-8')
reparsed = minidom.parseString(rough_string)
return reparsed.toprettyxml(indent=" ")
Including 'standalone' in the declaration
I didn't found any alternative for adding the standalone argument in the documentation so I adapted the ET.tosting function to take it as an argument.
from xml.etree import ElementTree as ET
# Sample
document = ET.Element('outer')
node = ET.SubElement(document, 'inner')
et = ET.ElementTree(document)
# Function that you need
def tostring(element, declaration, encoding=None, method=None,):
class dummy:
pass
data = []
data.append(declaration+"\n")
file = dummy()
file.write = data.append
ET.ElementTree(element).write(file, encoding, method=method)
return "".join(data)
# Working example
xdec = """<?xml version="1.0" encoding="UTF-8" standalone="no" ?>"""
xml = tostring(document, encoding='utf-8', declaration=xdec)

Write KML file from another

I'm trying to:
- read a KML file
- remove the Placemark element if name = 'ZONE'
- write a new KML file without the element
This is my code:
from pykml import parser
kml_file_path = '../Source/Lombardia.kml'
removeList = list()
with open(kml_file_path) as f:
folder = parser.parse(f).getroot().Document.Folder
for pm in folder.Placemark:
if pm.name == 'ZONE':
removeList.append(pm)
print pm.name
for tag in removeList:
parent = tag.getparent()
parent.remove(tag)
#Write the new file
#I cannot reach the solution help me
and this is the KML:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.2">
<Document>
<name>Lombardia</name>
<Style>
...
</Style>
<Folder>
<Placemark>
<name>ZOGNO</name>
<styleUrl>#FEATURES_LABELS</styleUrl>
<Point>
<coordinates>9.680530595139061,45.7941656233647,0</coordinates>
</Point>
</Placemark>
<Placemark>
<name>ZONE</name>
<styleUrl>#FEATURES_LABELS</styleUrl>
<Point>
<coordinates>10.1315885854064,45.7592449779275,0</coordinates>
</Point>
</Placemark>
</Folder>
</Document>
</kml>
The problem is that when I write the new KML file this still has the element I want to delete.
In fact, with I want to delete the element that contains name = ZONE.
What i'm doing wrong?
Thank you.
--- Final Code
This is the working code thanks to #Dawid Ferenczy:
from lxml import etree
import pykml
from pykml import parser
kml_file_path = '../Source/Lombardia.kml'
# parse the input file into an object tree
with open(kml_file_path) as f:
tree = parser.parse(f)
# get a reference to the "Document.Folder" node
folder = tree.getroot().Document.Folder
# iterate through all "Document.Folder.Placemark" nodes and find and remove all nodes
# which contain child node "name" with content "ZONE"
for pm in folder.Placemark:
if pm.name == 'ZOGNO':
parent = pm.getparent()
parent.remove(pm)
# convert the object tree into a string and write it into an output file
with open('output.kml', 'w') as output:
output.write(etree.tostring(folder, pretty_print=True))
Consider XSLT, the special purpose language designed to transform XML files. And because KML files are XML files, this solution is viable. Python's third-party module, lxml can run XSLT 1.0 scripts and do so without a single loop.
Specifically, the XSLT script runs the Identity Transform to copy entire document as is. Then, script runs an empty template on the element (conditional to specific logic) to remove that element. To accommodate the default namespace, a prefix, doc, is used for XPath search.
XSLT (save as .xsl file, a special .xml file to be loaded in Python below)
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:doc="http://earth.google.com/kml/2.2">
<xsl:output method="xml" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="#* | node()">
<xsl:copy>
<xsl:apply-templates select="#* | node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="doc:Placemark[doc:name='ZONE']"/>
</xsl:stylesheet>
XSLT Fiddle Demo
Python
import lxml.etree as et
# LOAD XML AND XSL
doc = et.parse('/path/to/Input.xml')
xsl = et.parse('/path/to/XSLT_Script.xsl')
# CONFIGURE TRANSFORMER
transform = et.XSLT(xsl)
# RUN TRANSFORMATION
result = transform(doc)
# PRINT RESULT
print(result)
# SAVE TO FILE
with open('output.xml', 'wb') as f:
f.write(result)
Output
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.2">
<Document>
<name>Lombardia</name>
<Style>
...
</Style>
<Folder>
<Placemark>
<name>ZOGNO</name>
<styleUrl>#FEATURES_LABELS</styleUrl>
<Point>
<coordinates>9.680530595139061,45.7941656233647,0</coordinates>
</Point>
</Placemark>
</Folder>
</Document>
</kml>
You have the following issues in your code:
you're not storing the whole parsed object tree anywhere (you have just a reference to the node "Document.Folder": folder = parser.parse(f).getroot().Document.Folder) but you want to write it back into a file so you need to store it
I don't understand why you need two loops and the list removeList when you can delete elements directly in the first loop
you're not reading the documentation - it's well described how to write the object tree into a file under examples in pykml library's documentation
Try the following code:
from lxml import etree
from pykml import parser
kml_file_path = './input.kml'
# parse the input file into an object tree
with open(kml_file_path) as f:
tree = parser.parse(f)
# get a reference to the "Document.Folder" node
folder = tree.getroot().Document.Folder
# iterate through all "Document.Folder.Placemark" nodes and find and remove all nodes
# which contain child node "name" with content "ZONE"
for pm in folder.Placemark:
if pm.name == 'ZONE':
parent = pm.getparent()
parent.remove(pm)
# convert the object tree into a string and write it into an output file
with open('output.kml', 'w') as output:
output.write(etree.tostring(tree, pretty_print=True))
It's very simple:
KML file is parsed into an object tree and stored in variable tree
the same object tree is directly manipulated (removed element)
the same object tree is written back into a file

Character encoding compatibility between python2 and python3

I am writing plist files using ElementTree, and I need to prepend two lines of text before the tree starts, to match Apple's plist syntax. The following code works in python 2.7, but it fails in python 3.6 with TypeError: write() argument must be str, not bytes.
import xml.etree.ElementTree as ET
tree = ET.parse('com.input.plist')
with open('com.new.plist', 'w') as f:
f.write('<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">\n')
tree.write(f, encoding='utf-8')
To get this working on python3, I can change it like so:
tree = ET.parse('com.input.plist')
with open('com.new.plist', 'w') as f:
f.write('<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">\n')
tree.write(f, encoding='unicode')
But this fails in python2 with LookupError: unknown encoding: unicode. How can I make this compatible with both versions?
I found the solution. I open the file in binary mode, then use string.encode() before writing.
with open('com.new.plist', 'wb') as f:
xml_header = '<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">\n'
f.write(xml_header.encode())
tree.write(f)

remove <?xml version="1.0" ?> using xml.dom.minidom

I am generating XML files using xml.dom.minidom. Every time I generate a file on the very row there appears <?xml version="1.0" ?> and the generated file looks like this:
<?xml version="1.0" ?>
<Root>
data
</Root>
is not there anyway so have an output without and my output should look like
<Root>
data
</Root>
The best solution I found was to write out .childNodes[0], i.e. write out:
doc.childNodes[0].toprettyxml()
to the file, which will omit the xml version tag.
If you are happy just to trim the first line from the file, use this code;
f = open( 'file.txt', 'r' )
lines = f.readlines()
f.close()
f = open( 'file.txt'.'w' )
f.write( '\n'.join( lines[1:] ) )
f.close()
This does the job where old_data is the xml to strip
new_data = old_data[old_data.find("?>")+2:]

Generating xml in python and lxml

I have this xml from sql, and I want to do the same by python 2.7 and lxml
<?xml version="1.0" encoding="utf-16"?>
<results>
<Country name="Germany" Code="DE" Storage="Basic" Status="Fresh" Type="Photo" />
</results>
Now I have:
from lxml import etree
# create XML
results= etree.Element('results')
country= etree.Element('country')
country.text = 'Germany'
root.append(country)
filename = "xmltestthing.xml"
FILE = open(filename,"w")
FILE.writelines(etree.tostring(root, pretty_print=True))
FILE.close()
Do you know how to add rest of attributes?
Note this also prints the BOM
>>> from lxml.etree import tostring
>>> from lxml.builder import E
>>> print tostring(
E.results(
E.Country(name='Germany',
Code='DE',
Storage='Basic',
Status='Fresh',
Type='Photo')
), pretty_print=True, xml_declaration=True, encoding='UTF-16')
��<?xml version='1.0' encoding='UTF-16'?>
<results>
<Country Status="Fresh" Type="Photo" Code="DE" Storage="Basic" name="Germany"/>
</results>
from lxml import etree
# Create the root element
page = etree.Element('results')
# Make a new document tree
doc = etree.ElementTree(page)
# Add the subelements
pageElement = etree.SubElement(page, 'Country',
name='Germany',
Code='DE',
Storage='Basic')
# For multiple multiple attributes, use as shown above
# Save to XML file
outFile = open('output.xml', 'w')
doc.write(outFile, xml_declaration=True, encoding='utf-16')
Save to XML file
doc.write('output.xml', xml_declaration=True, encoding='utf-16')
instead of:
outFile = open('output.xml', 'w')
doc.write(outFile, xml_declaration=True, encoding='utf-16')
Promoting my comment to an answer:
#sukbir is probably not using Windows. What happens is that lxml writes a newline (0A 00 in UTF-16LE) between the XML header and the body. This is then molested by Win text mode to become 0D 0A 00 which makes everything after that look like UTF-16BE hence the Chinese etc characters when you display it. You can get around this in this instance by using "wb" instead of "w" when you open the file. However I'd strongly suggest that you use 'UTF-8' (spelled EXACTLY like that) as your encoding. Why are you using UTF-16? You like large files and/or weird problems?

Categories