Getting ascii code error even after encoding with utf-8 - python

I am posting to github's api for markdown, and in the post request I am sending json data. I discovered that I can't write lists because the characters are not a part of ascii and looked it up to find that I should always encode. I encoded the text which needed to be marked down and the api is working, but I still get the same error when I try to make lists.
The code for the POST method is:
def markDown(to_mark):
    headers = {
        'content-type': 'application/json'
    }
    text = to_mark.decode('utf8')
    payload = {
        'text': text,
        'mode': 'gfm'
    }
    data = json.dumps(payload)
    req = urllib2.Request('https://api.github.com/markdown', data, headers)
    response = urllib2.urlopen(req)
    marked_down = response.read()
    return marked_down
And the error that I get when I try making lists is as follows:
'ascii' codec can't decode byte 0xe2 in position 55: ordinal not in range(128)
Adding the full traceback:
Traceback (most recent call last):
File "/home/bigb/Programming/google_appengine/google/appengine/runtime/wsgi.py", line 266, in Handle
result = handler(dict(self._environ), self._StartResponse)
File "/home/bigb/Programming/google_appengine/lib/webapp2-2.3/webapp2.py", line 1519, in __call__
response = self._internal_error(e)
File "/home/bigb/Programming/google_appengine/lib/webapp2-2.3/webapp2.py", line 1511, in __call__
rv = self.handle_exception(request, response, e)
File "/home/bigb/Programming/google_appengine/lib/webapp2-2.3/webapp2.py", line 1505, in __call__
rv = self.router.dispatch(request, response)
File "/home/bigb/Programming/google_appengine/lib/webapp2-2.3/webapp2.py", line 1253, in default_dispatcher
return route.handler_adapter(request, response)
File "/home/bigb/Programming/google_appengine/lib/webapp2-2.3/webapp2.py", line 1077, in __call__
return handler.dispatch()
File "/home/bigb/Programming/google_appengine/lib/webapp2-2.3/webapp2.py", line 547, in dispatch
return self.handle_exception(e, self.app.debug)
File "/home/bigb/Programming/google_appengine/lib/webapp2-2.3/webapp2.py", line 545, in dispatch
return method(*args, **kwargs)
File "/home/bigb/Programming/Blog/my-ramblings/blog.py", line 232, in post
mark_blog = markDown(blog)
File "/home/bigb/Programming/Blog/my-ramblings/blog.py", line 43, in markDown
text = to_mark.decode('utf8')
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 45-46: ordinal not in range(128)
Am I misunderstanding something here? Thanks!

Your to_mark value is not a Unicode value; you already have an encoded byte string there. Trying to encode a byte string tells Python that it should first decode the value to Unicode, implicitly using the ascii codec, before encoding again. This causes your exception:
>>> '\xc3\xa5'.encode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
For the json.dumps() function, you want to use Unicode values. If to_mark contains UTF-8 data, use str.decode():
text = to_mark.decode('utf8')
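The implicit step can be made explicit. The sketch below is an illustration under stated assumptions, not the asker's actual fix: to_text and build_payload are hypothetical helper names, and the sample bytes are a UTF-8 bullet character standing in for a list item. It decodes only when the value is still a byte string, so calling it on text that is already Unicode is harmless, and it should run on both Python 2.7 and Python 3.

```python
import json


def to_text(value, encoding='utf-8'):
    """Return value as Unicode text, decoding only if it is a byte string."""
    if isinstance(value, bytes):  # on Python 2, bytes is an alias for str
        return value.decode(encoding)
    return value


def build_payload(to_mark):
    """Build the JSON body for the markdown request from text or UTF-8 bytes."""
    return json.dumps({'text': to_text(to_mark), 'mode': 'gfm'})


raw = b'\xe2\x80\xa2 first item'  # UTF-8 bytes; 0xe2 is not valid ASCII

# This is the decode Python 2 attempts implicitly when .encode() is called
# on a byte string, and it is what raises the error from the question.
try:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    print('implicit step fails: %s' % exc)

print(build_payload(raw))
```

Checking the type first means the helper can be called at the boundary of the function without knowing whether the caller already decoded the input.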

Your code snippet reads:
text = to_mark.encode('utf-8')
but in the traceback you have:
File "/home/bigb/Programming/Blog/my-ramblings/blog.py", line 43, in markDown
text = to_mark.decode('utf8')
Please first make sure you post the real code and traceback (that is, the code that actually raises the exception).

I cannot remember exactly, but I think decoding the result of response.read() worked for me when I faced the exact same error:
response.read().decode("utf8")


'utf8' codec can't decode byte 0xb5 in position 0: invalid start byte

Before marking this as a duplicate, I want to make it clear that I have tried countless solutions to make this go away, from using from __future__ import unicode_literals to every permutation and combination of str.encode('utf8') and str.decode('utf8'), putting # -*- coding: utf-8 -*- at the start of the file, and what not. I know I'm getting something wrong, so I'll be as specific as possible: I'm converting a dictionary into a JSON array/object and showing it in its raw string form on the webpage.
The string I'm having an issue with is a file name starting with "µ", so the error occurs at the fourth-to-last line in the code below. The files array shows the value at the index of that string as \xb5Torrent.lnk.
if os.path.isdir(finalDirPath):
    print "is Dir"
    for (path,dir,files) in os.walk(finalDirPath):
        if dir!=[]:
            for i in dir:
                if not hidden(os.path.join(path,i)):
                    # Here
                    JSONarray.append({"ext":"dir","path":b64(os.path.join(path,i)),"name":i})
        if files!=[]:
            for i in files:
                if not hidden(os.path.join(path,i)):
                    # Here
                    JSONarray.append({"ext":i.split('.')[-1],"path":b64(os.path.join(path,i)),"name":i})
        break
jsonStr = {"json":json.dumps(JSONarray)}
return render(request,"json.html",jsonStr)
Here's the traceback:
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\django\core\handlers\exception.py", line 39, in inner
response = get_response(request)
File "C:\Python27\lib\site-packages\django\core\handlers\base.py", line 249, in _legacy_get_response
response = self._get_response(request)
File "C:\Python27\lib\site-packages\django\core\handlers\base.py", line 187, in _get_response
response = self.process_exception_by_middleware(e, request)
File "C:\Python27\lib\site-packages\django\core\handlers\base.py", line 185, in _get_response
response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "E:\ICT\Other\Python\Django\trydjango18\src\newsletter\views.py", line 468, in getJSON
JSONarray.append({"ext":i.split('.')[-1],"path":b64(os.path.join(path.encode('utf8'),i)),"name":i})
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb5 in position 0: invalid start byte
A shorter example that demonstrates your problem:
>>> json.dumps('\xb5Torrent.lnk')
Traceback (most recent call last):
File "<pyshell#17>", line 1, in <module>
json.dumps('\xb5Torrent.lnk')
File "C:\Python27\lib\json\__init__.py", line 243, in dumps
return _default_encoder.encode(obj)
File "C:\Python27\lib\json\encoder.py", line 201, in encode
return encode_basestring_ascii(o)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb5 in position 0: invalid start byte
Your array files contains byte strings, but json.dumps() wants any strings in the data to be unicode. It assumes that any byte strings are utf-8 encoded but your strings are using a different encoding: possibly latin1 or possibly something else. You need to find out the encoding used by your filesystem and decode all of the filenames to unicode before you add them to your JSONarray structure.
First thing is to check your filesystem encoding:
import sys
print sys.getfilesystemencoding()
should tell you the encoding used for the filenames and then you just make sure that all your path manipulations use unicode:
import sys
fsencoding = sys.getfilesystemencoding()
if os.path.isdir(finalDirPath):
    print "is Dir"
    for (path,dir,files) in os.walk(finalDirPath):
        # Decode the byte-string path and names to unicode up front
        path = path.decode(fsencoding)
        for i in dir:
            i = i.decode(fsencoding)
            if not hidden(os.path.join(path,i)):
                # Here
                JSONarray.append({
                    "ext": "dir",
                    "path": b64(os.path.join(path,i)),
                    "name": i
                })
        for i in files:
            i = i.decode(fsencoding)
            if not hidden(os.path.join(path,i)):
                # Here
                JSONarray.append({
                    "ext": i.split('.')[-1],
                    "path": b64(os.path.join(path,i)),
                    "name": i
                })
        break
jsonStr = {"json":json.dumps(JSONarray)}
return render(request,"json.html",jsonStr)
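To see what the decoding step buys you, here is a small sketch that assumes the filesystem encoding turns out to be cp1252. That is common on Windows but is only an assumption here; decode_name is a hypothetical helper, and the byte string is the file name from the question. It should run on Python 2.7 and 3.

```python
import json


def decode_name(name, encoding):
    """Decode a byte-string file name to Unicode; leave Unicode names alone."""
    if isinstance(name, bytes):
        return name.decode(encoding)
    return name


raw_name = b'\xb5Torrent.lnk'  # the name os.walk() returned in the question

# Assumed encoding for illustration; on a real system use the value
# reported by sys.getfilesystemencoding().
name = decode_name(raw_name, 'cp1252')

# With Unicode input, json.dumps() no longer has to guess an encoding.
print(json.dumps({'name': name, 'ext': name.split('.')[-1]}))
```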

UnicodeDecodeError when trying to save an Excel File with Python xlwt

I'm running a Python script that writes HTML code found using BeautifulSoup into multiple rows of an Excel spreadsheet column.
[...]
Col_HTML = 19
w_sheet.write(row_index, Col_HTML, str(HTML_Code))
wb.save(output)
When trying to save the file, I get the following error message:
Traceback (most recent call last):
File "C:\Users\[..]\src\MYCODE.py", line 201, in <module>
wb.save(output)
File "C:\Python27\lib\site-packages\xlwt-0.7.5-py2.7.egg\xlwt\Workbook.py", line 662, in save
doc.save(filename, self.get_biff_data())
File "C:\Python27\lib\site-packages\xlwt-0.7.5-py2.7.egg\xlwt\Workbook.py", line 637, in get_biff_data
shared_str_table = self.__sst_rec()
File "C:\Python27\lib\site-packages\xlwt-0.7.5-py2.7.egg\xlwt\Workbook.py", line 599, in __sst_rec
return self.__sst.get_biff_record()
File "C:\Python27\lib\site-packages\xlwt-0.7.5-py2.7.egg\xlwt\BIFFRecords.py", line 76, in get_biff_record
self._add_to_sst(s)
File "C:\Python27\lib\site-packages\xlwt-0.7.5-py2.7.egg\xlwt\BIFFRecords.py", line 91, in _add_to_sst
u_str = upack2(s, self.encoding)
File "C:\Python27\lib\site-packages\xlwt-0.7.5-py2.7.egg\xlwt\UnicodeUtils.py", line 50, in upack2
us = unicode(s, encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 5181: ordinal not in range(128)
I've successfully written Python scripts that write into worksheets in the past. This is the first time I've tried to write a string of HTML into cells, and I'm wondering what is causing the error and how I could fix it.
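Use this line before passing HTML_Code to w_sheet.write:
HTML_Code = HTML_Code.decode('utf-8')
The error UnicodeDecodeError: 'ascii' codec can't decode byte means Python is decoding a byte string with the ascii codec (the unicode(s, encoding) call at the bottom of the traceback), and that string contains non-ASCII bytes. Decode it yourself using the proper encoding, presumably utf-8 here, and pass the resulting unicode value on without wrapping it in str(), since str() would try to encode it back to ascii.
So, you have:
Col_HTML = 19
HTML_Code = HTML_Code.decode('utf-8')
w_sheet.write(row_index, Col_HTML, HTML_Code)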
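The same idea can be wrapped in a reusable helper. This is a minimal sketch, assuming the cell value may be a UTF-8 byte string, Unicode text, or some other object such as a parsed HTML node; to_cell_text is a hypothetical name, not part of xlwt, and the xlwt call appears only in a comment.

```python
def to_cell_text(value, encoding='utf-8'):
    """Convert value to Unicode text suitable for writing into a cell."""
    text_type = type(u'')  # unicode on Python 2, str on Python 3
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, text_type):
        return value
    # Other objects: build Unicode text directly rather than calling str(),
    # which on Python 2 would produce a byte string again.
    return text_type(value)


html = b'<p>It\xe2\x80\x99s here</p>'  # UTF-8 bytes containing 0xe2
cell = to_cell_text(html)
print(repr(cell))

# Intended use, with the names from the question:
# w_sheet.write(row_index, Col_HTML, to_cell_text(HTML_Code))
```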

UnicodeDecodeError in django top-level template code

I have code in my views that returns information to be displayed in a textbox. My name has fadas (Irish accent) over the letters which is causing UnicodeDecodeErrors. The line in my logic is as follows:
return {
...
'wrap_up_form': WrapUpForm(data={u'message': settings.DEFAULT_WRAP_UP_MESSAGE.format(name=customer.given_name.encode('utf-8'))}),
}
and the traceback I get is this
ERROR 2014-07-24 14:48:26,540 exception_handlers.py:65] 'ascii' codec can't decode byte 0xc3 in position 5: ordinal not in range(128)
Traceback (most recent call last):
File "/home/rony/google_appengine/lib/webapp2-2.5.2/webapp2.py", line 1529, in __call__
rv = self.router.dispatch(request, response)
File "/home/rony/Documents/clone-attempt/personal-shopping/vendor/nacelle/core/dispatcher.py", line 24, in nacelle_dispatcher
response = router.default_dispatcher(request, response)
File "/home/rony/google_appengine/lib/webapp2-2.5.2/webapp2.py", line 1278, in default_dispatcher
return route.handler_adapter(request, response)
File "/home/rony/google_appengine/lib/webapp2-2.5.2/webapp2.py", line 1065, in __call__
return self.handler(request, *args, **kwargs)
File "/home/rony/Documents/clone-attempt/personal-shopping/app/utils/decorators.py", line 43, in _arguments_wrapper
return view_method(request, *args, **kwargs)
File "/home/rony/Documents/clone-attempt/personal-shopping/app/utils/decorators.py", line 89, in _arguments_wrapper
output = render_jinja2_template(template_name, context)
File "/home/rony/Documents/clone-attempt/personal-shopping/vendor/nacelle/core/template/renderers.py", line 19, in render_jinja2_template
return renderer.render_template(template_name, **context)
File "/home/rony/google_appengine/lib/webapp2-2.5.2/webapp2_extras/jinja2.py", line 158, in render_template
return self.environment.get_template(_filename).render(**context)
File "/home/rony/google_appengine/lib/jinja2-2.6/jinja2/environment.py", line 894, in render
return self.environment.handle_exception(exc_info, True)
File "templates/cms/appointments_form.html", line 2, in top-level template code
{% import 'cms/macros.html' as cms_macros %}
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 5: ordinal not in range(128)
Do I need to add some sort of encoding to my templates?
As Daniel Roseman answered, I also suspect that customer.given_name is a byte string; calling encode on it causes Python to try to decode it first.
Another issue is that DEFAULT_WRAP_UP_MESSAGE is a byte string literal.
str.format(unicode) has the same issue.
Solution:
Remove the .encode(..) part.
Make DEFAULT_WRAP_UP_MESSAGE a unicode object instead of a byte string.
It seems likely that customer.given_name is a byte string, rather than Unicode - so in calling encode on it, Python first needs to decode it to Unicode before it can then re-encode to UTF-8.
You should drop the encode call altogether.
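Both suggestions amount to keeping everything Unicode until the template renders it. A sketch under assumptions: WRAP_UP_TEMPLATE is a stand-in for settings.DEFAULT_WRAP_UP_MESSAGE (its real text is not shown in the question), and wrap_up_message is a hypothetical helper that decodes the name only if it arrives as UTF-8 bytes.

```python
# Stand-in for settings.DEFAULT_WRAP_UP_MESSAGE, written as a unicode literal.
WRAP_UP_TEMPLATE = u'Thanks, {name}!'


def wrap_up_message(given_name, encoding='utf-8'):
    """Format the message as Unicode, decoding the name only if it is bytes."""
    if isinstance(given_name, bytes):
        given_name = given_name.decode(encoding)
    return WRAP_UP_TEMPLATE.format(name=given_name)


print(repr(wrap_up_message(u'Se\xe1n')))      # name already Unicode
print(repr(wrap_up_message(b'Se\xc3\xa1n')))  # name as UTF-8 bytes
```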

Modifying a forked Pyramid application that uses MySQL to use a SQLite DB -- Unicode Decode Error

I have a question that is a bit hard to explain. I'm forking devsniper's application 'customers' as a base to start a POS system for a local computer shop. The original application uses MySQL, however it is critical that this application uses my client's original data. So I am presented with two options:
1) I can migrate the SQLite Database to a MySQL DB
2) I can modify the program to use the SQLite DB (Preferred)
However, whenever I try to pull up the customers page, I get the following:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
I am not sure where to start with detailing my problem, as there isn't much detail in precisely what is causing this problem, however I will start with the traceback.
Traceback (most recent call last):
File "/home/tabras/posenv/local/lib/python2.7/site-packages/pyramid-1.4.2-py2.7.egg/pyramid/mako_templating.py", line 232, in __call__
result = template.render_unicode(**system)
File "/home/tabras/posenv/local/lib/python2.7/site-packages/Mako-0.8.1-py2.7.egg/mako/template.py", line 452, in render_unicode
as_unicode=True)
File "/home/tabras/posenv/local/lib/python2.7/site-packages/Mako-0.8.1-py2.7.egg/mako/runtime.py", line 783, in _render
**_kwargs_for_callable(callable_, data))
File "/home/tabras/posenv/local/lib/python2.7/site-packages/Mako-0.8.1-py2.7.egg/mako/runtime.py", line 815, in _render_context
_exec_template(inherit, lclcontext, args=args, kwargs=kwargs)
File "/home/tabras/posenv/local/lib/python2.7/site-packages/Mako-0.8.1-py2.7.egg/mako/runtime.py", line 841, in _exec_template
callable_(context, *args, **kwargs)
File "/home/tabras/posenv/customers/customers/templates/base/index.html", line 102, in render_body
${next.body()}
File "/home/tabras/posenv/customers/customers/templates/customer/list.html", line 19, in render_body
<%include file="listPartial.html"/>
File "/home/tabras/posenv/local/lib/python2.7/site-packages/Mako-0.8.1-py2.7.egg/mako/runtime.py", line 710, in _include_file
callable_(ctx, **_kwargs_for_include(callable_, context._data, **kwargs))
File "/home/tabras/posenv/customers/customers/templates/customer/listPartial.html", line 50, in render_body
${pager(customers)}
File "/home/tabras/posenv/customers/customers/templates/base/uiHelpers.html", line 10, in render_pager
${items.pager(format="$link_previous ~2~ $link_next",
File "/home/tabras/posenv/local/lib/python2.7/site-packages/WebHelpers-1.3-py2.7.egg/webhelpers/paginate.py", line 716, in pager
self._pagerlink(self.next_page, symbol_next) or ''
File "/home/tabras/posenv/local/lib/python2.7/site-packages/WebHelpers-1.3-py2.7.egg/webhelpers/paginate.py", line 855, in _pagerlink
return HTML.a(text, href=link_url, onclick=onclick_action, **self.link_attr)
File "/home/tabras/posenv/local/lib/python2.7/site-packages/WebHelpers-1.3-py2.7.egg/webhelpers/html/builder.py", line 213, in __call__
return make_tag(self._tag, *args, **kw)
File "/home/tabras/posenv/local/lib/python2.7/site-packages/WebHelpers-1.3-py2.7.egg/webhelpers/html/builder.py", line 308, in make_tag
chunks.extend(escape(x) for x in args)
File "/home/tabras/posenv/local/lib/python2.7/site-packages/WebHelpers-1.3-py2.7.egg/webhelpers/html/builder.py", line 308, in <genexpr>
chunks.extend(escape(x) for x in args)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
Post Solution Edit:
The problem was here:
${items.pager(format="$link_previous ~2~ $link_next",
symbol_previous="«",
symbol_next="»",
link_attr=link_attr,
curpage_attr=curpage_attr,
dotdot_attr=dotdot_attr,
onclick="$('.list-partial').load('%s'); return false;")}
For some reason the '»' character and its counterpart were throwing the error. I simply changed them to standard ascii characters and everything was golden.
Yeah, you were right about slowing down, Michael. It was a really simple error. In uiHelpers.html there was a unicode character '»' which was causing the problem for some reason. I simply changed that to '>' and it was golden. This was a good lesson in reading the traceback more carefully; thanks for the feedback.
-Tabras

UnicodeDecodeError is raised when getting a cookie in Google App Engine

I have a GAE project in Python where I am setting a cookie in one of my RequestHandlers with this code:
self.response.headers['Set-Cookie'] = 'app=ABCD; expires=Fri, 31-Dec-2020 23:59:59 GMT'
I checked in Chrome and I can see the cookie listed, so it appears to be working.
Then later in another RequestHandler, I get the cookie to check it:
appCookie = self.request.cookies['app']
This line gives the following error when executed:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1962: ordinal not in range(128)
It seems that it is trying to decode the incoming cookie info using an ASCII codec rather than UTF-8.
How do I force Python to use UTF-8 to decode this?
Are there any other Unicode-related gotchas that I need to be aware of as a newbie to Python and Google App Engine (but an experienced programmer in other languages)?
Here is the full Traceback:
Traceback (most recent call last):
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 4144, in _HandleRequest
self._Dispatch(dispatcher, self.rfile, outfile, env_dict)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 4049, in _Dispatch
base_env_dict=env_dict)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 616, in Dispatch
base_env_dict=base_env_dict)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 3120, in Dispatch
self._module_dict)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 3024, in ExecuteCGI
reset_modules = exec_script(handler_path, cgi_path, hook)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 2887, in ExecuteOrImportScript
exec module_code in script_module.__dict__
File "/Users/ken/hgdev/juicekit/main.py", line 402, in <module>
main()
File "/Users/ken/hgdev/juicekit/main.py", line 399, in main
run_wsgi_app(application)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/util.py", line 98, in run_wsgi_app
run_bare_wsgi_app(add_wsgi_middleware(application))
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/util.py", line 116, in run_bare_wsgi_app
result = application(env, _start_response)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 721, in __call__
response.wsgi_write(start_response)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 296, in wsgi_write
body = self.out.getvalue()
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/StringIO.py", line 270, in getvalue
self.buf += ''.join(self.buflist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1962: ordinal not in range(128)
You're looking to use the decode function, somewhat like this (credit: @agf):
self.request.cookies['app'].decode('utf-8')
From official python documentation (plus a couple added details):
Python’s 8-bit strings have a .decode([encoding], [errors]) method that interprets the string using the given encoding. The following example shows the string as it goes to unicode and then back to 8-bit string:
>>> u = unichr(40960) + u'abcd' + unichr(1972) # Assemble a string
>>> type(u), u # Examine
(<type 'unicode'>, u'\ua000abcd\u07b4')
>>> utf8_version = u.encode('utf-8') # Encode as UTF-8
>>> type(utf8_version), utf8_version # Examine
(<type 'str'>, '\xea\x80\x80abcd\xde\xb4')
>>> u2 = utf8_version.decode('utf-8') # Decode using UTF-8
>>> u == u2 # The two strings match
True
First, encode any unicode value you set in the cookies. You also need to quote them in case they can break the header:
import urllib
# This is the value we want to set.
initial_value = u'äëïöü'
# WebOb version that comes with SDK doesn't quote cookie values
# in the Response, neither webapp.Response. So we have to do it.
quoted_value = urllib.quote(initial_value.encode('utf-8'))
rsp = webapp.Response()
rsp.headers['Set-Cookie'] = 'app=%s; Path=/' % quoted_value
Now let's read the value. To test it, create a fake Request to test the cookie we have set. This code was extracted from a real unittest:
cookie = rsp.headers.get('Set-Cookie')
req = webapp.Request.blank('/', headers=[('Cookie', cookie)])
# The stored value is the same quoted value from before.
# Notice that here we use .str_cookies, not .cookies.
stored_value = req.str_cookies.get('app')
self.assertEqual(stored_value, quoted_value)
Our value is still encoded and quoted. We must do the reverse to get the initial one:
# And we can get the initial value unquoting and decoding.
final_value = urllib.unquote(stored_value).decode('utf-8')
self.assertEqual(final_value, initial_value)
If you can, consider using webapp2. webob.Response does all the hard work of quoting and setting cookies, and you can set unicode values directly. See a summary of these issues here.
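The quote/encode round trip above can be checked without any web framework. A minimal sketch, assuming only the standard library: encode_cookie_value and decode_cookie_value are hypothetical helper names, and the import shim is there so the same code should run on Python 2.7 and Python 3.

```python
try:
    from urllib.parse import quote, unquote_to_bytes  # Python 3
except ImportError:
    from urllib import quote                          # Python 2
    from urllib import unquote as unquote_to_bytes    # returns a byte string there


def encode_cookie_value(text):
    """Unicode text -> ASCII-only, header-safe string."""
    return quote(text.encode('utf-8'))


def decode_cookie_value(stored):
    """Reverse encode_cookie_value: unquote, then decode the UTF-8 bytes."""
    return unquote_to_bytes(stored).decode('utf-8')


initial_value = u'\xe4\xeb\xef\xf6\xfc'  # the sample value used above
quoted_value = encode_cookie_value(initial_value)
print(quoted_value)
assert decode_cookie_value(quoted_value) == initial_value
```

Because the quoted value contains only ASCII characters, it can be concatenated into a header string without triggering an implicit decode.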
