ElementTree Unicode Encode/Decode Error - python

For a project I'm supposed to enhance some XML and store it in a file. The problem I encountered is that I keep getting the following error:
Traceback (most recent call last):
File "C:\Python27\lib\multiprocessing\process.py", line 258, in _bootstrap
self.run()
File "C:\Python27\lib\multiprocessing\process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\Bart\Dropbox\Studie\2013-2014\BSc-KI\cite_parser\parser.py", line 193, in parse_references
outputXML = ET.tostring(root, encoding='utf8', method='xml')
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1126, in tostring
ElementTree(element).write(file, encoding, method=method)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 820, in write
serialize(write, self._root, encoding, qnames, namespaces)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 939, in _serialize_xml
_serialize_xml(write, e, encoding, qnames, None)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 939, in _serialize_xml
_serialize_xml(write, e, encoding, qnames, None)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 939, in _serialize_xml
_serialize_xml(write, e, encoding, qnames, None)
ECLI:NL:RVS:2012:BY1564
File "C:\Python27\lib\xml\etree\ElementTree.py", line 937, in _serialize_xml
write(_escape_cdata(text, encoding))
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1073, in _escape_cdata
return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 80: ordinal not in range(128)
That error was generated by:
outputXML = ET.tostring(root, encoding='utf8', method='xml')
When looking for a solution to this problem I found several suggestions saying I should add .decode('utf-8') to the function call, but that results in an encoding error (where it was a decoding error before) from the write function, so that doesn't work either.
The encoding error:
Traceback (most recent call last):
File "C:\Python27\lib\multiprocessing\process.py", line 258, in _bootstrap
self.run()
File "C:\Python27\lib\multiprocessing\process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\Bart\Dropbox\Studie\2013-2014\BSc-KI\cite_parser\parser.py", line 197, in parse_references
myfile.write(outputXML)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xeb' in position 13559: ordinal not in range(128)
It is generated by the following code:
outputXML = ET.tostring(root, encoding='utf8', method='xml').decode('utf-8')
Source (or at least the relevant parts):
# URL encodes the parameters
encoded_parameters = urllib.urlencode({'id':ecli})
# Opens XML file
feed = urllib2.urlopen("http://data.rechtspraak.nl/uitspraken/content?"+encoded_parameters, timeout = 3)
# Parses the XML
ecliFile = ET.parse(feed)
# Fetches root element of current tree
root = ecliFile.getroot()
# Write the XML to a file without any extra indents or newlines
outputXML = ET.tostring(root, encoding='utf8', method='xml')
# Write the XML to the file
with open(file, "w") as myfile:
    myfile.write(outputXML)
And last but not least, a URL to an XML sample: http://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:RVS:2012:BY1542

The exception is caused by a byte string value.
text in the traceback is supposed to be a unicode value, but if it is a plain byte string, Python will implicitly first decode it (with the ASCII codec) to Unicode just so you can then encode it again.
It is that decoding that fails.
Because you didn't actually show us what you insert into the XML tree, it is hard to tell you what to fix, other than to make sure you always use Unicode values when inserting text.
Demo:
>>> root.attrib['oops'] = u'Data with non-ASCII codepoints \u2014 (em dash)'.encode('utf8')
>>> ET.tostring(root, encoding='utf8', method='xml')
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/xml/etree/ElementTree.py", line 1126, in tostring
ElementTree(element).write(file, encoding, method=method)
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/xml/etree/ElementTree.py", line 820, in write
serialize(write, self._root, encoding, qnames, namespaces)
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/xml/etree/ElementTree.py", line 932, in _serialize_xml
v = _escape_attrib(v, encoding)
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/xml/etree/ElementTree.py", line 1090, in _escape_attrib
return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 31: ordinal not in range(128)
>>> root.attrib['oops'] = u'Data with non-ASCII codepoints \u2014 (em dash)'
>>> ET.tostring(root, encoding='utf8', method='xml')
'<?xml version=\'1.0\' encoding=\'utf8\'?> ...'
Setting a bytestring attribute containing bytes outside the ASCII range triggers the exception; using a unicode value instead ensures the result can be produced.
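As a minimal sketch of that advice (the helper name to_text and the sample values are assumptions for illustration, not part of the question's code), byte strings can be decoded once at the point where they enter the tree, and the serialized result written to a file opened in binary mode:

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def to_text(value, encoding="utf-8"):
    """Hypothetical helper: decode byte strings, pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


root = ET.Element("root")
# UTF-8 encoded bytes, as they might arrive from a file or a socket.
raw_attribute = u"Data with non-ASCII codepoints \u2014 (em dash)".encode("utf-8")
root.set("oops", to_text(raw_attribute))
root.text = to_text(b"caf\xc3\xa9")

# With an explicit byte encoding, tostring() returns encoded bytes.
output_xml = ET.tostring(root, encoding="utf-8", method="xml")

# Write the bytes as they are; binary mode avoids a second implicit encode.
fd, path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as handle:
    handle.write(output_xml)
with open(path, "rb") as handle:
    written = handle.read()
os.remove(path)
```

On Python 2, bytes is an alias for str, so the same isinstance check covers the byte strings that trigger the exception above.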

Related

AWS lambda UnicodeDecodeError

I tried to test my Lambda function on the AWS console, but I can't understand why this error is occurring.
[ERROR] UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc6 in position 9: invalid continuation byte
Traceback (most recent call last):
File "/var/lang/lib/python3.8/site.py", line 208, in addsitedir
addpackage(sitedir, name, known_paths)
File "/var/lang/lib/python3.8/site.py", line 164, in addpackage
for n, line in enumerate(f):
File "/var/lang/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
Please give me some help to figure out this problem. Thanks.
There is not much to go on because no code was posted.
It looks like a path contains non-ASCII characters in a place that only accepts ASCII or Latin characters.
Check whether any function in your code is limited in the characters it accepts.
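The frames in the traceback (site.addsitedir calling addpackage) are the ones Python uses to process .pth files in a site directory, so one thing that may be worth checking is whether the deployment package contains a .pth file that is not valid UTF-8. The following is only a sketch under that assumption; find_non_utf8_pth_files and the sample file names are hypothetical.

```python
import os
import shutil
import tempfile


def find_non_utf8_pth_files(directory):
    """Return (path, offset) pairs for .pth files that fail to decode as UTF-8."""
    problems = []
    for dirpath, _dirnames, filenames in os.walk(directory):
        for name in sorted(filenames):
            if not name.endswith(".pth"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                data = handle.read()
            try:
                data.decode("utf-8")
            except UnicodeDecodeError as exc:
                problems.append((path, exc.start))
    return problems


# Demonstration with two made-up files in a temporary directory.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "good.pth"), "wb") as handle:
    handle.write(b"/opt/python\n")
with open(os.path.join(demo_dir, "bad.pth"), "wb") as handle:
    # 0xc6 starts a multi-byte sequence but is not followed by a continuation byte.
    handle.write(b"/opt/pyt\xc6hon\n")

problems = find_non_utf8_pth_files(demo_dir)
for path, offset in problems:
    print("%s: invalid UTF-8 at byte %d" % (path, offset))
shutil.rmtree(demo_dir)
```

Pointing the function at the unpacked deployment directory would list candidate files together with the byte offset where decoding stops.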

exception reading in large tab separated file chunked

I have a 350MB tab separated text file. If I try to read it into memory I get an out of memory exception. So I am trying something along those lines (i.e. only read in a few columns):
import pandas as pd
input_file_and_path = r'C:\Christian\ModellingData\X.txt'
column_names = [
    'X1'
    # , 'X2
]
raw_data = pd.DataFrame()
for chunk in pd.read_csv(input_file_and_path, names=column_names, chunksize=1000, sep='\t'):
    raw_data = pd.concat([raw_data, chunk], ignore_index=True)
print(raw_data.head())
Unfortunately, I get this:
Traceback (most recent call last):
File "pandas\_libs\parsers.pyx", line 1134, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas\_libs\parsers.pyx", line 1240, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas\_libs\parsers.pyx", line 1256, in pandas._libs.parsers.TextReader._string_convert
File "pandas\_libs\parsers.pyx", line 1494, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 5: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:/xxxx/EdaDataPrepRange1.py", line 17, in <module>
for chunk in pd.read_csv(input_file_and_path, header=None, names=column_names, chunksize=1000, sep='\t'):
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1007, in __next__
return self.get_chunk()
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1070, in get_chunk
return self.read(nrows=size)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1036, in read
ret = self._engine.read(nrows)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1848, in read
data = self._reader.read(nrows)
File "pandas\_libs\parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
File "pandas\_libs\parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas\_libs\parsers.pyx", line 968, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 1094, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas\_libs\parsers.pyx", line 1141, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas\_libs\parsers.pyx", line 1240, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas\_libs\parsers.pyx", line 1256, in pandas._libs.parsers.TextReader._string_convert
File "pandas\_libs\parsers.pyx", line 1494, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 5: invalid start byte
Any ideas? By the way, how can I generally deal with large files and, for example, impute missing values? Ultimately, I need to read in everything to determine, for example, the median to be imputed.
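Pass an explicit encoding to pd.read_csv. UTF-8 is already the default and is the codec that fails in your traceback, so encoding="utf-8" will not change anything; the file is most likely in a single-byte encoding.
In the question referenced below they used this encoding; see if it works for you: open(file_path, encoding='windows-1252')
Reference: 'utf-8' codec can't decode byte 0xa0 in position 4276: invalid start byte
Working solution: use encoding="ISO-8859-1".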
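To see why the codec name matters, here is a small standard-library sketch. The sample bytes are made up so that they reproduce the message from the question; they are not taken from the actual file.

```python
# Hypothetical sample: "Brand" followed by byte 0xae and a tab-separated value.
raw = b"Brand\xae\tvalue\n"

try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = str(exc)

# windows-1252 and ISO-8859-1 both map 0xae to the registered sign.
as_cp1252 = raw.decode("windows-1252")
as_latin1 = raw.decode("iso-8859-1")

print(utf8_error)
print(as_cp1252 == as_latin1)
```

ISO-8859-1 never fails because every byte maps to some character, so a successful read does not prove the guess was right; it is worth checking a few values by eye.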
Regarding your large file problem, just use a file handler and context manager:
with open("your_file.txt") as fileObject:
    for line in fileObject:
        do_something_with(line)
## No need to close file as 'with' automatically does that
This won't load the whole file into memory. Instead, it'll load a line at a time, and will 'forget' previous lines unless you store a reference.
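Also, regarding your encoding problem, pass the encoding the file was actually written in to pd.read_csv, for example encoding="ISO-8859-1"; UTF-8 is already the default and is what fails here.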
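For the median question, a line-by-line approach could look like the sketch below. It assumes the column holds numbers, that empty cells mark missing values, and that the file is windows-1252 encoded; column_median, the sample data and the column index are hypothetical. Only one column's values are kept in memory, not the whole table.

```python
import csv
import io
import os
import statistics
import tempfile


def column_median(path, column_index, encoding="windows-1252"):
    """Stream a tab-separated file and return the median of one numeric column."""
    values = []
    with io.open(path, encoding=encoding, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            cell = row[column_index] if column_index < len(row) else ""
            if cell.strip():  # skip missing values
                values.append(float(cell))
    return statistics.median(values)


# Made-up sample: four rows, one missing value, one non-UTF-8 byte (0xae).
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as handle:
    handle.write(b"Brand\xae\t1.5\nOther\t\nThird\t4.5\nFourth\t3.0\n")

median = column_median(sample_path, 1)
os.remove(sample_path)
print(median)
```

The imputed value can then be filled in during a second pass over the file. With pandas, the same encoding name can be passed to pd.read_csv through its encoding parameter.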

Python UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 12: ordinal not in range(128)

I am trying to return a file in a StreamingHttpResponse from a class based view using Django rest framework. However I get a very cryptic error message with a stack trace that does not contain any references to my code:
[16/Jun/2017 11:08:48] "GET /api/v1/models/49 HTTP/1.1" 200 0
Traceback (most recent call last):
File "/Users/jonathan/anaconda/lib/python3.6/wsgiref/handlers.py", line 138, in run
self.finish_response()
File "/Users/jonathan/anaconda/lib/python3.6/wsgiref/handlers.py", line 179, in finish_response
for data in self.result:
File "/Users/jonathan/anaconda/lib/python3.6/wsgiref/util.py", line 30, in __next__
data = self.filelike.read(self.blksize)
File "/Users/jonathan/anaconda/lib/python3.6/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 12: ordinal not in range(128)
[...]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/jonathan/anaconda/lib/python3.6/wsgiref/handlers.py", line 141, in run
self.handle_error()
File "/Users/jonathan/anaconda/lib/python3.6/site-packages/django/core/servers/basehttp.py", line 88, in handle_error
super(ServerHandler, self).handle_error()
File "/Users/jonathan/anaconda/lib/python3.6/wsgiref/handlers.py", line 368, in handle_error
self.finish_response()
File "/Users/jonathan/anaconda/lib/python3.6/wsgiref/handlers.py", line 180, in finish_response
self.write(data)
File "/Users/jonathan/anaconda/lib/python3.6/wsgiref/handlers.py", line 274, in write
self.send_headers()
File "/Users/jonathan/anaconda/lib/python3.6/wsgiref/handlers.py", line 331, in send_headers
if not self.origin_server or self.client_is_modern():
File "/Users/jonathan/anaconda/lib/python3.6/wsgiref/handlers.py", line 344, in client_is_modern
return self.environ['SERVER_PROTOCOL'].upper() != 'HTTP/0.9'
TypeError: 'NoneType' object is not subscriptable
My get method looks like this:
def get(self, request, pk, format=None):
    """
    Get model by primary key (pk)
    """
    try:
        model = QSARModel.objects.get(pk=pk)
    except Exception:
        raise Http404
    filename = model.pluginFileName
    chunk_size = 8192
    response = StreamingHttpResponse(
        FileWrapper(open(filename), chunk_size),
        content_type='application/zip')
    return response
From googling a bit I get the feeling that this is related to ASCII / UTF8 encoding but I don't understand how that applies to my situation. I am dealing with a binary file. In fact it is a Java jar file but that should be pretty much a zip file as far as I understand. What is going on here and what am I doing wrong?
This is related to the locale settings, and it shows up when non-ASCII file names are used with the Django storage system. Add the following lines to your Apache envvars file:
export LANG='en_US.UTF-8'
export LC_ALL='en_US.UTF-8'
https://code.djangoproject.com/wiki/django_apache_and_mod_wsgi#AdditionalTweaking
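Separately from the locale settings, the traceback shows the failure inside self.filelike.read(), which suggests the file object itself is decoding the jar as text. Below is a sketch of opening the file in binary mode instead, assuming that is acceptable for your view. The temporary file and payload are stand-ins, and FileWrapper is the wsgiref one that appears in the traceback.

```python
import os
import tempfile
from wsgiref.util import FileWrapper

# Stand-in for the jar: a zip-like header followed by every possible byte value.
payload = b"PK\x03\x04" + bytes(bytearray(range(256)))
fd, path = tempfile.mkstemp(suffix=".jar")
with os.fdopen(fd, "wb") as handle:
    handle.write(payload)

chunk_size = 8192
# "rb" makes read() return raw bytes, so no codec is involved at all.
with open(path, "rb") as binary_file:
    chunks = list(FileWrapper(binary_file, chunk_size))
os.remove(path)

streamed = b"".join(chunks)
```

In the view from the question this would correspond to open(filename, 'rb').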

sublime text when open browser unicodeEncodeError

Traceback (most recent call last):
File "/Applications/Sublime Text.app/Contents/MacOS/sublime_plugin.py", line 818, in run_
return self.run(edit)
File "open_in_browser in /Applications/Sublime Text.app/Contents/MacOS/Packages/Default.sublime-package", line 9, in run
File "./python3.3/webbrowser.py", line 70, in open_new_tab
File "./python3.3/webbrowser.py", line 62, in open
File "./python3.3/webbrowser.py", line 635, in open
UnicodeEncodeError: 'ascii' codec can't encode characters in position 56-57: ordinal not in range(128)
How can I fix this? I use the SidebarEnhancements plugin to open my HTML files in the browser, and that is when I run into this problem.

How to store output of os.urandom(8) in CouchDB?

I am trying to store some cryptographic data in CouchDB: a salt and an encrypted password. The salt is generated using Python's os.urandom(8), and sample output looks like:
'z/\xfe\xdf\xdeJ=y'
I'm using the python-couchdb API to store the document. When I try to save the document I get:
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "build/bdist.macosx-10.7-intel/egg/couchdb/client.py", line 343, in __setitem__
status, headers, data = resource.put_json(body=content)
File "build/bdist.macosx-10.7-intel/egg/couchdb/http.py", line 499, in put_json
**params)
File "build/bdist.macosx-10.7-intel/egg/couchdb/http.py", line 514, in _request_json
headers=headers, **params)
File "build/bdist.macosx-10.7-intel/egg/couchdb/http.py", line 510, in _request
credentials=self.credentials)
File "build/bdist.macosx-10.7-intel/egg/couchdb/http.py", line 260, in request
body = json.encode(body).encode('utf-8')
File "build/bdist.macosx-10.7-intel/egg/couchdb/json.py", line 68, in encode
return _encode(obj)
File "build/bdist.macosx-10.7-intel/egg/couchdb/json.py", line 129, in <lambda>
dumps(obj, allow_nan=False, ensure_ascii=False)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 238, in dumps
**kw).encode(obj)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 204, in encode
return ''.join(chunks)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfe in position 3: ordinal not in range(128)
Encode it as either base64 or as hex before saving, or save it in a binary field.
Encode the output of urandom in base 64 like this:
os.urandom(8).encode('base64')
As per the example in this thread
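Note that the str.encode('base64') form above exists only on Python 2. Here is a sketch using the base64 module, which works on both Python 2 and 3 (the document layout with a salt key is an assumption for illustration):

```python
import base64
import json
import os

salt = os.urandom(8)

# Base64 output is plain ASCII, so it is safe to place in a JSON document.
encoded_salt = base64.b64encode(salt).decode("ascii")
document = {"salt": encoded_salt}
body = json.dumps(document)

# Decoding restores the original bytes when the document is read back.
restored_salt = base64.b64decode(json.loads(body)["salt"])
```

The encoded string is what would be stored in the CouchDB document; the raw bytes are recovered with b64decode whenever the salt is needed.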
