Parse XML in Python with encoding other than utf-8

Parse XML in Python with encoding other than utf-8 - python

Any clue on how to parse xml in python that has: encoding='Windows-1255' in it?
At least the lxml.etree parser won't even look at the string when there's an "encoding" tag in the XML header which isn't "utf-8" or "ASCII".
Running the following code fails with:
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without
declaration.
from lxml import etree
parser = etree.XMLParser(encoding='utf-8')
def convert_xml_to_utf8(xml_str):
tree = etree.fromstring(xml_str, parser=parser)
if tree.docinfo.encoding == 'utf-8':
# already in correct encoding, abort
return xml_str
decoded_str = xml_str.decode(tree.docinfo.encoding)
utf8_encoded_str = decoded_str.encode('utf-8')
tree = etree.fromstring(utf8_encoded_str)
tree.docinfo.encoding = 'utf-8'
return etree.tostring(tree, pretty_print = True, xml_declaration = True, encoding='UTF-8', standalone="yes")
data = '''<?xml version='1.0' encoding='Windows-1255'?><rss version="2.0"><channel ><title ><![CDATA[ynet - חדשות]]></title></channel></rss>'''
print(convert_xml_to_utf8(data))

data is a unicode str. The error is saying that such a thing which also contains an encoding="..." declaration is not supported, because a str is supposedly already decoded from its encoding and hence it's ambiguous/nonsensical that it would also contain an encoding declaration. It's telling you to use a bytes instead, e.g. data = b'<...>'. Presumably you should be opening a file in binary mode, read the data from there and let etree handle the encoding="...", instead of using string literals in your code, which complicates the encoding situation even further.
It's as simple as:
from xml.etree import ElementTree
# open in binary mode ↓
with open('/tmp/test.xml', 'rb') as f:
e = ElementTree.fromstring(f.read())
Et voilà, e contains your parsed file with the encoding having been (presumably) correctly interpreted by etree based on the file's internal encoding="..." header.
ElementTree in fact has a shortcut method for this:
e = ElementTree.parse('/tmp/test.xml')

Related

I want to append an element to xml’s lead [duplicate]

I am generating an XML document in Python using an ElementTree, but the tostring function doesn't include an XML declaration when converting to plaintext.
from xml.etree.ElementTree import Element, tostring
document = Element('outer')
node = SubElement(document, 'inner')
node.NewValue = 1
print tostring(document) # Outputs "<outer><inner /></outer>"
I need my string to include the following XML declaration:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
However, there does not seem to be any documented way of doing this.
Is there a proper method for rendering the XML declaration in an ElementTree?

I am surprised to find that there doesn't seem to be a way with ElementTree.tostring(). You can however use ElementTree.ElementTree.write() to write your XML document to a fake file:
from io import BytesIO
from xml.etree import ElementTree as ET
document = ET.Element('outer')
node = ET.SubElement(document, 'inner')
et = ET.ElementTree(document)
f = BytesIO()
et.write(f, encoding='utf-8', xml_declaration=True)
print(f.getvalue()) # your XML file, encoded as UTF-8
See this question. Even then, I don't think you can get your 'standalone' attribute without writing prepending it yourself.

I would use lxml (see http://lxml.de/api.html).
Then you can:
from lxml import etree
document = etree.Element('outer')
node = etree.SubElement(document, 'inner')
print(etree.tostring(document, xml_declaration=True))

If you include the encoding='utf8', you will get an XML header:
xml.etree.ElementTree.tostring writes a XML encoding declaration with encoding='utf8'
Sample Python code (works with Python 2 and 3):
import xml.etree.ElementTree as ElementTree
tree = ElementTree.ElementTree(
ElementTree.fromstring('<xml><test>123</test></xml>')
)
root = tree.getroot()
print('without:')
print(ElementTree.tostring(root, method='xml'))
print('')
print('with:')
print(ElementTree.tostring(root, encoding='utf8', method='xml'))
Python 2 output:
$ python2 example.py
without:
<xml><test>123</test></xml>
with:
<?xml version='1.0' encoding='utf8'?>
<xml><test>123</test></xml>
With Python 3 you will note the b prefix indicating byte literals are returned (just like with Python 2):
$ python3 example.py
without:
b'<xml><test>123</test></xml>'
with:
b"<?xml version='1.0' encoding='utf8'?>\n<xml><test>123</test></xml>"

xml_declaration Argument
Is there a proper method for rendering the XML declaration in an ElementTree?
YES, and there is no need of using .tostring function. According to ElementTree Documentation, you should create an ElementTree object, create Element and SubElements, set the tree's root, and finally use xml_declaration argument in .write function, so the declaration line is included in output file.
You can do it this way:
import xml.etree.ElementTree as ET
tree = ET.ElementTree("tree")
document = ET.Element("outer")
node1 = ET.SubElement(document, "inner")
node1.text = "text"
tree._setroot(document)
tree.write("./output.xml", encoding = "UTF-8", xml_declaration = True)
And the output file is:
<?xml version='1.0' encoding='UTF-8'?>
<outer><inner>text</inner></outer>

I encounter this issue recently, after some digging of the code, I found the following code snippet is definition of function ElementTree.write
def write(self, file, encoding="us-ascii"):
assert self._root is not None
if not hasattr(file, "write"):
file = open(file, "wb")
if not encoding:
encoding = "us-ascii"
elif encoding != "utf-8" and encoding != "us-ascii":
file.write("<?xml version='1.0' encoding='%s'?>\n" %
encoding)
self._write(file, self._root, encoding, {})
So the answer is, if you need write the XML header to your file, set the encoding argument other than utf-8 or us-ascii, e.g. UTF-8

Easy
Sample for both Python 2 and 3 (encoding parameter must be utf8):
import xml.etree.ElementTree as ElementTree
tree = ElementTree.ElementTree(ElementTree.fromstring('<xml><test>123</test></xml>'))
root = tree.getroot()
print(ElementTree.tostring(root, encoding='utf8', method='xml'))
From Python 3.8 there is xml_declaration parameter for that stuff:
New in version 3.8: The xml_declaration and default_namespace
parameters.
xml.etree.ElementTree.tostring(element, encoding="us-ascii",
method="xml", *, xml_declaration=None, default_namespace=None,
short_empty_elements=True) Generates a string representation of an XML
element, including all subelements. element is an Element instance.
encoding 1 is the output encoding (default is US-ASCII). Use
encoding="unicode" to generate a Unicode string (otherwise, a
bytestring is generated). method is either "xml", "html" or "text"
(default is "xml"). xml_declaration, default_namespace and
short_empty_elements has the same meaning as in ElementTree.write().
Returns an (optionally) encoded string containing the XML data.
Sample for Python 3.8 and higher:
import xml.etree.ElementTree as ElementTree
tree = ElementTree.ElementTree(ElementTree.fromstring('<xml><test>123</test></xml>'))
root = tree.getroot()
print(ElementTree.tostring(root, encoding='unicode', method='xml', xml_declaration=True))

The minimal working example with ElementTree package usage:
import xml.etree.ElementTree as ET
document = ET.Element('outer')
node = ET.SubElement(document, 'inner')
node.text = '1'
res = ET.tostring(document, encoding='utf8', method='xml').decode()
print(res)
the output is:
<?xml version='1.0' encoding='utf8'?>
<outer><inner>1</inner></outer>

Another pretty simple option is to concatenate the desired header to the string of xml like this:
xml = (bytes('<?xml version="1.0" encoding="UTF-8"?>\n', encoding='utf-8') + ET.tostring(root))
xml = xml.decode('utf-8')
with open('invoice.xml', 'w+') as f:
f.write(xml)

I would use ET:
try:
from lxml import etree
print("running with lxml.etree")
except ImportError:
try:
# Python 2.5
import xml.etree.cElementTree as etree
print("running with cElementTree on Python 2.5+")
except ImportError:
try:
# Python 2.5
import xml.etree.ElementTree as etree
print("running with ElementTree on Python 2.5+")
except ImportError:
try:
# normal cElementTree install
import cElementTree as etree
print("running with cElementTree")
except ImportError:
try:
# normal ElementTree install
import elementtree.ElementTree as etree
print("running with ElementTree")
except ImportError:
print("Failed to import ElementTree from any known place")
document = etree.Element('outer')
node = etree.SubElement(document, 'inner')
print(etree.tostring(document, encoding='UTF-8', xml_declaration=True))

This works if you just want to print. Getting an error when I try to send it to a file...
import xml.dom.minidom as minidom
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import Element, SubElement, Comment, tostring
def prettify(elem):
rough_string = ET.tostring(elem, 'utf-8')
reparsed = minidom.parseString(rough_string)
return reparsed.toprettyxml(indent=" ")

Including 'standalone' in the declaration
I didn't found any alternative for adding the standalone argument in the documentation so I adapted the ET.tosting function to take it as an argument.
from xml.etree import ElementTree as ET
# Sample
document = ET.Element('outer')
node = ET.SubElement(document, 'inner')
et = ET.ElementTree(document)
# Function that you need
def tostring(element, declaration, encoding=None, method=None,):
class dummy:
pass
data = []
data.append(declaration+"\n")
file = dummy()
file.write = data.append
ET.ElementTree(element).write(file, encoding, method=method)
return "".join(data)
# Working example
xdec = """<?xml version="1.0" encoding="UTF-8" standalone="no" ?>"""
xml = tostring(document, encoding='utf-8', declaration=xdec)

lxml.etree.XML ValueError for Unicode string

I'm transforming an xml document with xslt. While doing it with python3 I had this following error. But I don't have any errors with python2
-> % python3 cstm/artefact.py
Traceback (most recent call last):
File "cstm/artefact.py", line 98, in <module>
simplify_this_dataset('fisheries-service-des-peches.xml')
File "cstm/artefact.py", line 85, in simplify_this_dataset
xslt_root = etree.XML(xslt_content)
File "lxml.etree.pyx", line 3012, in lxml.etree.XML (src/lxml/lxml.etree.c:67861)
File "parser.pxi", line 1780, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:102420)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
#!/usr/bin/env python3
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
# -*- coding: utf-8 -*-
from lxml import etree
def simplify_this_dataset(dataset):
"""Create A simplify version of an xml file
it will remove all the attributes and assign them as Elements instead
"""
module_path = os.path.dirname(os.path.abspath(__file__))
data = open(module_path+'/data/ex-fire.xslt')
xslt_content = data.read()
xslt_root = etree.XML(xslt_content)
dom = etree.parse(module_path+'/../CanSTM_dataset/'+dataset)
transform = etree.XSLT(xslt_root)
result = transform(dom)
f = open(module_path+ '/../CanSTM_dataset/otra.xml', 'w')
f.write(str(result))
f.close()

data = open(module_path+'/data/ex-fire.xslt')
xslt_content = data.read()
This implicitly decodes the bytes in the file to Unicode text, using the default encoding. (This might give wrong results, if the XML file isn't in that encoding.)
xslt_root = etree.XML(xslt_content)
XML has its own handling and signalling for encodings, the <?xml encoding="..."?> prolog. If you pass a Unicode string starting with <?xml encoding="..."?> to a parser, the parser would like to reintrepret the rest of the byte string using that encoding... but can't, because you've already decoded the byte input to a Unicode string.
Instead, you should either pass the undecoded byte string to the parser:
data = open(module_path+'/data/ex-fire.xslt', 'rb')
xslt_content = data.read()
xslt_root = etree.XML(xslt_content)
or, better, just have the parser read straight from the file:
xslt_root = etree.parse(module_path+'/data/ex-fire.xslt')

You can also decode the UTF-8 string and encode it with ascii before passing it to etree.XML
xslt_content = data.read()
xslt_content = xslt_content.decode('utf-8').encode('ascii')
xslt_root = etree.XML(xslt_content)

I made it work by simply reencoding with the default options
xslt_content = data.read().encode()

LXML kills my CDATA sections

I'm batch-converting a lot of XML files, changing their character encodings to UTF-8:
with open(source_filename, "rb") as source:
tree = etree.parse(source)
with open(destination_filename, "wb") as destination:
tree.write(destination, encoding="UTF-8", xml_declaration=True)
Unfortunately, it is destroying my CDATA sections and just escaping them instead.
Source:
<d><![CDATA[áÌÀøÅàùÑÄéú ëÌÄé áÈàÅùÑ éäå''ä ðÄùÑÀôÌÈè <small><small>(ùí ëå èæ)</small></small>
Destination:
<d>בְּרֵאשִׁית כִּי בָאֵשׁ יהו''ה נִשְׁפָּט <small><small>(שם כו טז)</small></small>
Is there a setting which I can set which will tell it to leave my CDATA sections alone? I'm mainly using LXML to change the character encoding and to write the XML header properly.

Use the strip_cdata=False option:
import lxml.etree as etree
parser = etree.XMLParser(strip_cdata=False)
with open(source_filename, "rb") as source:
tree = etree.parse(source, parser=parser)

Unicode: Python / lxml file output not as expected (print vs write)

I'm parsing an xml file using the code below:
import lxml
file_name = input('Enter the file name, including .xml extension: ')
print('Parsing ' + file_name)
from lxml import etree
parser = lxml.etree.XMLParser()
tree = lxml.etree.parse(file_name, parser)
root = tree.getroot()
nsmap = {'xmlns': 'urn:tva:metadata:2010'}
with open(file_name+'.log', 'w', encoding='utf-8') as f:
for info in root.xpath('//xmlns:ProgramInformation', namespaces=nsmap):
crid = (info.get('programId'))
titlex = (info.find('.//xmlns:Title', namespaces=nsmap))
title = (titlex.text if titlex != None else 'Missing')
synopsis1x = (info.find('.//xmlns:Synopsis[1]', namespaces=nsmap))
synopsis1 = (synopsis1x.text if synopsis1x != None else 'Missing')
synopsis1 = synopsis1.replace('\r','').replace('\n','')
f.write('{}|{}|{}\n'.format(crid, title, synopsis1))
Let take an example title of 'Přešité bydlení'. If I print the title whilst parsing the file, it comes out as expected. When I write it out however, it displays as 'PÅ™eÅ¡itÃ© bydlenÃ'.
I understand that this is do to with encoding (as I was able to change the print command to use UTF-8, and 'corrupt' the output), but I couldn't get the written output to print as I desired. I had a look at the codecs library, but couldn't wasn't successful. Having 'encoding = "utf-8"' in the XML Parser line didn't make any difference.
How can I configure the written output to be human readable?

I had all sorts of troubles with this before. But the solution is rather simple. There is a chapter on how to read and write in unicode to a file in the documentation. This Python talk is also very enlightening to understand the issue. Unicode can be a pain. It gets a lot easier if you start using python 3 though.
import codecs
f = codecs.open('test', encoding='utf-8', mode='w+')
f.write(u'\u4500 blah blah blah\n')
f.seek(0)
print repr(f.readline()[:1])
f.close()

Your code looks ok, so I reckon your input is duff. Assuming you're viewing your output file with a UTF-8 viewer or shell then I suspect that the encoding in the <?xml doesn't match the actual encoding.
This would explain why printing works but not writing to a file. If your shell/IDE is set to "ISO-8859-2" and your input XML is also "ISO-8859-2" then printing is pushing out the raw encoding.

Why does printing to a utf-8 file fail?

So I ran into a problem this afternoon, I was able to solve it, but I don't quite understand why it worked.
this is related to a problem I had the other week: python check if utf-8 string is uppercase
basically, the following will not work:
#!/usr/bin/python
import codecs
from lxml import etree
outFile = codecs.open('test.xml', 'w', 'utf-8') #cannot use codecs.open()
root = etree.Element('root')
sect = etree.SubElement(root,'sect')
words = ( u'\u041c\u041e\u0421\u041a\u0412\u0410', # capital of Russia, all uppercase
u'R\xc9SUM\xc9', # RESUME with accents
u'R\xe9sum\xe9', # Resume with accents
u'R\xe9SUM\xe9', ) # ReSUMe with accents
for word in words:
print word
if word.encode('utf8').decode('utf8').isupper(): #.isupper won't function on utf8
title = etree.SubElement(sect,'title')
title.text = word
else:
item = etree.SubElement(sect,'item')
item.text = word
print>>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')
it fails with the following:
Traceback (most recent call last):
File "./temp.py", line 25, in
print >>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')
File "/usr/lib/python2.7/codecs.py",
line 691, in write
return self.writer.write(data) File "/usr/lib/python2.7/codecs.py",
line 351, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec
can't decode byte 0xd0 in position 66:
ordinal not in range(128)
but if I open the new file without codecs.open('test.xml', 'w', 'utf-8') and instead use
outFile = open('test.xml', 'w') it works perfectly.
So whats happening??
since encoding='utf-8' is specified in etree.tostring() is it encoding the file again?
if I leave codecs.open() and remove encoding='utf-8' the file then becomes an ascii file. Why? becuase etree.tostring() has a default encoding of ascii I persume?
but etree.tostring() is simply being written to stdout, and is then redirect to a file that was created as a utf-8 file??
is print>> not workings as I expect? outFile.write(etree.tostring()) behaves the same way.
Basically, why wouldn't this work? what is going on here. It might be trivial, but I am obviously a bit confused and have a desire to figure out why my solution works,

You've opened the file with UTF-8 encoding, which means that it expects Unicode strings.
tostring is encoding to UTF-8 (in the form of bytestrings, str), which you're writing to the file.
Because the file is expecting Unicode, it's decoding the bytestrings to Unicode using the default ASCII encoding so that it can then encode the Unicode to UTF-8.
Unfortunately, the bytestrings aren't ASCII.
EDIT: The best advice to avoid this kind of problem is to use Unicode internally, decoding on input and encoding on output.

Using print>>outFile is a little strange. I don't have lxml installed, but the built-in xml.etree library is similar (but doesn't support pretty_print). Wrap the root Element in an ElementTree and use the write method.
Also, if you using a # coding line to declare the encoding of the source file, you can use readable Unicode strings instead of escape codes:
#!/usr/bin/python
# coding: utf8
import codecs
from xml.etree import ElementTree as etree
root = etree.Element(u'root')
sect = etree.SubElement(root,u'sect')
words = [u'МОСКВА',u'RÉSUMÉ',u'Résumé',u'RéSUMé']
for word in words:
print word
if word.isupper():
title = etree.SubElement(sect,u'title')
title.text = word
else:
item = etree.SubElement(sect,u'item')
item.text = word
tree = etree.ElementTree(root)
tree.write('text.xml',xml_declaration=True,encoding='utf-8')

In addition to MRABs answer some lines of code:
import codecs
from lxml import etree
root = etree.Element('root')
sect = etree.SubElement(root,'sect')
# do some other xml building here
with codecs.open('test.xml', 'w', encoding='utf-8') as f:
f.write(etree.tostring(root, encoding=unicode))

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Parse XML in Python with encoding other than utf-8 - python

Related

I want to append an element to xml’s lead [duplicate]

lxml.etree.XML ValueError for Unicode string

LXML kills my CDATA sections

Unicode: Python / lxml file output not as expected (print vs write)

Why does printing to a utf-8 file fail?

Categories

Resources