I have a GAE project in Python where I am setting a cookie in one of my RequestHandlers with this code:
self.response.headers['Set-Cookie'] = 'app=ABCD; expires=Fri, 31-Dec-2020 23:59:59 GMT'
I checked in Chrome and I can see the cookie listed, so it appears to be working.
Then later in another RequestHandler, I get the cookie to check it:
appCookie = self.request.cookies['app']
This line gives the following error when executed:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1962: ordinal not in range(128)
It seems that it is trying to decode the incoming cookie info using an ASCII codec rather than UTF-8.
How do I force Python to use UTF-8 to decode this?
Are there any other Unicode-related gotchas that I need to be aware of as a newbie to Python and Google App Engine (but an experienced programmer in other languages)?
Here is the full Traceback:
Traceback (most recent call last):
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 4144, in _HandleRequest
self._Dispatch(dispatcher, self.rfile, outfile, env_dict)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 4049, in _Dispatch
base_env_dict=env_dict)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 616, in Dispatch
base_env_dict=base_env_dict)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 3120, in Dispatch
self._module_dict)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 3024, in ExecuteCGI
reset_modules = exec_script(handler_path, cgi_path, hook)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 2887, in ExecuteOrImportScript
exec module_code in script_module.__dict__
File "/Users/ken/hgdev/juicekit/main.py", line 402, in <module>
main()
File "/Users/ken/hgdev/juicekit/main.py", line 399, in main
run_wsgi_app(application)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/util.py", line 98, in run_wsgi_app
run_bare_wsgi_app(add_wsgi_middleware(application))
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/util.py", line 116, in run_bare_wsgi_app
result = application(env, _start_response)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 721, in __call__
response.wsgi_write(start_response)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 296, in wsgi_write
body = self.out.getvalue()
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/StringIO.py", line 270, in getvalue
self.buf += ''.join(self.buflist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1962: ordinal not in range(128)
You're looking to use the decode method, like this (credit: agf):
self.request.cookies['app'].decode('utf-8')
From the official Python documentation (plus a couple of added details):
Python’s 8-bit strings have a .decode([encoding], [errors]) method that interprets the string using the given encoding. The following example shows the string going to unicode and then back to an 8-bit string:
>>> u = unichr(40960) + u'abcd' + unichr(1972) # Assemble a string
>>> type(u), u # Examine
(<type 'unicode'>, u'\ua000abcd\u07b4')
>>> utf8_version = u.encode('utf-8') # Encode as UTF-8
>>> type(utf8_version), utf8_version # Examine
(<type 'str'>, '\xea\x80\x80abcd\xde\xb4')
>>> u2 = utf8_version.decode('utf-8') # Decode using UTF-8
>>> u == u2 # The two strings match
True
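For comparison, here is the same round trip in Python 3, where str is already unicode and bytes hold the encoded form (a sketch of the concept, not GAE code):

```python
# Python 3 equivalent of the round trip above:
# str is unicode; bytes carry the UTF-8-encoded form.
u = chr(40960) + 'abcd' + chr(1972)   # assemble a string
utf8_version = u.encode('utf-8')      # encode as UTF-8 -> bytes
u2 = utf8_version.decode('utf-8')     # decode back to str
print(utf8_version)                   # b'\xea\x80\x80abcd\xde\xb4'
print(u == u2)                        # True
```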
First, encode any unicode value you set in the cookies. You also need to quote them so that they can't break the header:
import urllib
# This is the value we want to set.
initial_value = u'äëïöü'
# WebOb version that comes with SDK doesn't quote cookie values
# in the Response, neither webapp.Response. So we have to do it.
quoted_value = urllib.quote(initial_value.encode('utf-8'))
rsp = webapp.Response()
rsp.headers['Set-Cookie'] = 'app=%s; Path=/' % quoted_value
Now let's read the value. To test it, create a fake Request to test the cookie we have set. This code was extracted from a real unittest:
cookie = rsp.headers.get('Set-Cookie')
req = webapp.Request.blank('/', headers=[('Cookie', cookie)])
# The stored value is the same quoted value from before.
# Notice that here we use .str_cookies, not .cookies.
stored_value = req.str_cookies.get('app')
self.assertEqual(stored_value, quoted_value)
Our value is still encoded and quoted. We must do the reverse to get the initial one:
# And we can get the initial value unquoting and decoding.
final_value = urllib.unquote(stored_value).decode('utf-8')
self.assertEqual(final_value, initial_value)
If you can, consider using webapp2. webob.Response does all the hard work of quoting and setting cookies, and you can set unicode values directly. See a summary of these issues here.
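For reference, the same quote-then-encode round trip in Python 3 terms, where urllib.parse.unquote() decodes the UTF-8 bytes for you by default (a standalone sketch, not webapp code):

```python
from urllib.parse import quote, unquote

initial_value = 'äëïöü'
# Encode to UTF-8 bytes, then percent-quote so the value is header-safe.
quoted_value = quote(initial_value.encode('utf-8'))
print(quoted_value)   # %C3%A4%C3%AB%C3%AF%C3%B6%C3%BC
# unquote() undoes the percent-escapes and decodes the UTF-8 bytes.
final_value = unquote(quoted_value)
print(final_value == initial_value)   # True
```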
Before marking this as a duplicate, I want to make it clear that I have tried countless solutions to make this go away: from __future__ import unicode_literals, every permutation and combination of str.encode('utf8') and str.decode('utf8'), putting # -*- coding: utf-8 -*- at the start of the file, and so on. I know I'm getting something wrong, so I'll be as specific as possible: I'm converting a dictionary into a JSON array/object and showing it in its raw string form on the webpage.
The unicode string I'm having trouble with is a file name starting with "µ", so the error occurs at the fourth-to-last line of the code below. The files array shows the value at that index as \xb5Torrent.lnk.
if os.path.isdir(finalDirPath):
    print "is Dir"
    for (path,dir,files) in os.walk(finalDirPath):
        if dir!=[]:
            for i in dir:
                if not hidden(os.path.join(path,i)):
                    # Here
                    JSONarray.append({"ext":"dir","path":b64(os.path.join(path,i)),"name":i})
        if files!=[]:
            for i in files:
                if not hidden(os.path.join(path,i)):
                    # Here
                    JSONarray.append({"ext":i.split('.')[-1],"path":b64(os.path.join(path,i)),"name":i})
        break
jsonStr = {"json":json.dumps(JSONarray)}
return render(request,"json.html",jsonStr)
Here's the traceback:
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\django\core\handlers\exception.py", line 39, in inner
response = get_response(request)
File "C:\Python27\lib\site-packages\django\core\handlers\base.py", line 249, in _legacy_get_response
response = self._get_response(request)
File "C:\Python27\lib\site-packages\django\core\handlers\base.py", line 187, in _get_response
response = self.process_exception_by_middleware(e, request)
File "C:\Python27\lib\site-packages\django\core\handlers\base.py", line 185, in _get_response
response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "E:\ICT\Other\Python\Django\trydjango18\src\newsletter\views.py", line 468, in getJSON
JSONarray.append({"ext":i.split('.')[-1],"path":b64(os.path.join(path.encode('utf8'),i)),"name":i})
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb5 in position 0: invalid start byte
A shorter example that demonstrates your problem:
>>> json.dumps('\xb5Torrent.lnk')
Traceback (most recent call last):
File "<pyshell#17>", line 1, in <module>
json.dumps('\xb5Torrent.lnk')
File "C:\Python27\lib\json\__init__.py", line 243, in dumps
return _default_encoder.encode(obj)
File "C:\Python27\lib\json\encoder.py", line 201, in encode
return encode_basestring_ascii(o)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb5 in position 0: invalid start byte
Your array files contains byte strings, but json.dumps() wants any strings in the data to be unicode. It assumes that any byte strings are utf-8 encoded but your strings are using a different encoding: possibly latin1 or possibly something else. You need to find out the encoding used by your filesystem and decode all of the filenames to unicode before you add them to your JSONarray structure.
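A Python 3 sketch of the same fix (the byte string and the latin1 assumption are taken from the question; your filesystem encoding may differ):

```python
import json

name = b'\xb5Torrent.lnk'      # a file name as raw bytes
# json.dumps() wants text, so decode with the filesystem's actual
# encoding (assumed latin-1 here) before serializing.
decoded = name.decode('latin-1')
print(json.dumps(decoded))     # "\u00b5Torrent.lnk"
```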
First thing is to check your filesystem encoding:
import sys
print sys.getfilesystemencoding()
should tell you the encoding used for the filenames and then you just make sure that all your path manipulations use unicode:
import sys

fsencoding = sys.getfilesystemencoding()

if os.path.isdir(finalDirPath):
    print "is Dir"
    for (path,dir,files) in os.walk(finalDirPath):
        path = path.decode(fsencoding)
        for i in dir:
            i = i.decode(fsencoding)
            if not hidden(os.path.join(path,i)):
                # Here
                JSONarray.append({
                    "ext": "dir",
                    "path": b64(os.path.join(path,i)),
                    "name": i
                })
        for i in files:
            i = i.decode(fsencoding)
            if not hidden(os.path.join(path,i)):
                # Here
                JSONarray.append({
                    "ext": i.split('.')[-1],
                    "path": b64(os.path.join(path,i)),
                    "name": i
                })
        break
jsonStr = {"json":json.dumps(JSONarray)}
return render(request,"json.html",jsonStr)
Traceback (most recent call last):
File "C:\Program Files (x86)\Python\Projects\test.py", line 70, in <module>
html = urlopen("https://www.google.co.jp/").read().decode('utf-8')
File "C:\Program Files (x86)\Python\lib\http\client.py", line 506, in read
return self._readall_chunked()
File "C:\Program Files (x86)\Python\lib\http\client.py", line 592, in _readall_chunked
value.append(self._safe_read(chunk_left))
File "C:\Program Files (x86)\Python\lib\http\client.py", line 664, in _safe_read
raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(5034 bytes read, 3158 more expected)
So I am trying to get data from a website, but whenever it comes across Japanese or other non-ASCII characters it raises this error. All I am using is urlopen and .read().decode('utf-8'). Is there some way I can ignore or replace all those characters so there is no error?
In the code you posted, there is no problem with character encoding. Instead you have a problem getting the whole HTTP response. (Look closely at the error message.)
I tried this in an interactive Python shell:
>>> import urllib2
>>> url = urllib2.urlopen("https://www.google.co.jp/")
>>> body = url.read()
>>> len(body)
11155
This worked.
>>> body.decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x90 in position 102: invalid start byte
Ok, there is indeed an encoding error.
>>> url.headers['Content-Type']
'text/html; charset=Shift_JIS'
This is because your HTTP response is not encoded in UTF-8, but in Shift-JIS.
You should probably not use urllib2 but a higher level library that takes care of the HTTP encoding. Or, if you want to do it yourself, see https://stackoverflow.com/a/20714761.
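If you do want to handle it yourself, the charset can be pulled out of the Content-Type header before decoding; a minimal sketch (the header value is hard-coded here for illustration):

```python
def charset_from_content_type(content_type, default='utf-8'):
    # Look for a "charset=..." parameter after the media type.
    for part in content_type.split(';')[1:]:
        key, _, value = part.strip().partition('=')
        if key.lower() == 'charset':
            return value.strip('"')
    return default

print(charset_from_content_type('text/html; charset=Shift_JIS'))  # Shift_JIS
```

With that in hand, body.decode(charset) replaces the hard-coded 'utf-8'.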
Use requests and BeautifulSoup:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.google.co.jp/")
soup = BeautifulSoup(r.content)
print soup.find_all("p")
[<p style="color:#767676;font-size:8pt">© 2013 - プライバシーと利用規約</p>]
I am posting to GitHub's API for Markdown, and in the POST request I am sending JSON data. I discovered that I can't write lists because the characters are not part of ASCII, and I read that I should always encode. I encoded the text that needed to be marked down and the API is working, but I still get the same error when I try to make lists.
The code for the POST method is:
def markDown(to_mark):
headers = {
'content-type': 'application/json'
}
text = to_mark.decode('utf8')
payload = {
'text': text,
'mode':'gfm'
}
data = json.dumps(payload)
req = urllib2.Request('https://api.github.com/markdown', data, headers)
response = urllib2.urlopen(req)
marked_down = response.read()
return marked_down
And the error that I get when I try making lists is as follows:
'ascii' codec can't decode byte 0xe2 in position 55: ordinal not in range(128)
Here is the full traceback:
Traceback (most recent call last):
File "/home/bigb/Programming/google_appengine/google/appengine/runtime/wsgi.py", line 266, in Handle
result = handler(dict(self._environ), self._StartResponse)
File "/home/bigb/Programming/google_appengine/lib/webapp2-2.3/webapp2.py", line 1519, in __call__
response = self._internal_error(e)
File "/home/bigb/Programming/google_appengine/lib/webapp2-2.3/webapp2.py", line 1511, in __call__
rv = self.handle_exception(request, response, e)
File "/home/bigb/Programming/google_appengine/lib/webapp2-2.3/webapp2.py", line 1505, in __call__
rv = self.router.dispatch(request, response)
File "/home/bigb/Programming/google_appengine/lib/webapp2-2.3/webapp2.py", line 1253, in default_dispatcher
return route.handler_adapter(request, response)
File "/home/bigb/Programming/google_appengine/lib/webapp2-2.3/webapp2.py", line 1077, in __call__
return handler.dispatch()
File "/home/bigb/Programming/google_appengine/lib/webapp2-2.3/webapp2.py", line 547, in dispatch
return self.handle_exception(e, self.app.debug)
File "/home/bigb/Programming/google_appengine/lib/webapp2-2.3/webapp2.py", line 545, in dispatch
return method(*args, **kwargs)
File "/home/bigb/Programming/Blog/my-ramblings/blog.py", line 232, in post
mark_blog = markDown(blog)
File "/home/bigb/Programming/Blog/my-ramblings/blog.py", line 43, in markDown
text = to_mark.decode('utf8')
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 45-46: ordinal not in range(128)
Am I understanding something wrong here? Thanks!
Your to_mark value is not a Unicode value; you already have an encoded byte string there. Trying to encode a byte string makes Python first decode the value to Unicode (with the default ASCII codec) before encoding it again. This is what causes your exception:
>>> '\xc3\xa5'.encode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
For the json.dumps() function, you want to use Unicode values. If to_mark contains UTF-8 data, use str.decode():
text = to_mark.decode('utf8')
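In Python 3 terms (bytes vs. str instead of str vs. unicode) the same principle applies; a sketch with made-up input text:

```python
import json

# Bytes as they might arrive from a form or file (hypothetical input).
to_mark = '- item one\n\u2022 bullet'.encode('utf-8')
# Decode to text before building the JSON payload.
text = to_mark.decode('utf-8')
payload = json.dumps({'text': text, 'mode': 'gfm'})
print(payload)
```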
Your code snippets reads:
text = to_mark.encode('utf-8')
but in the traceback you have:
File "/home/bigb/Programming/Blog/my-ramblings/blog.py", line 43, in markDown
text = to_mark.decode('utf8')
Please first make sure you post the real code and traceback (that is, post the code that actually raises the exception).
I can't remember exactly, but using decode/encode on response.read() worked for me when I faced the same error:
response.read().decode("utf8")
I am using Python & lxml and am stuck with an error
My code
>>>import urllib
>>>from lxml import html
>>>response = urllib.urlopen('http://www.edmunds.com/dealerships/Texas/Grapevine/GrapevineFordLincoln_1/fullservice-505318162.html').read()
>>>dom = html.fromstring(response)
>>>dom.xpath("//div[@class='description item vcard']")[0].xpath(".//p[@class='service-review-paragraph loose-spacing']")[0].text_content()
Traceback
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/lxml/html/__init__.py", line 249, in text_content
return _collect_string_content(self)
File "xpath.pxi", line 466, in lxml.etree.XPath.__call__ (src/lxml/lxml.etree.c:119105)
File "xpath.pxi", line 242, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:116936)
File "extensions.pxi", line 552, in lxml.etree._unwrapXPathObject (src/lxml/lxml.etree.c:112473)
File "apihelpers.pxi", line 1344, in lxml.etree.funicode (src/lxml/lxml.etree.c:21864)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x93 in position 477: invalid start byte
The problem is the special character which is present in the div I am fetching. How can I encode/decode the text without losing any data?
The parser assumes this is a UTF-8 file, but it's not. The simplest thing to do is to convert it to unicode first, using the encoding the page actually declares:
>>> url = urllib.urlopen('http://www.edmunds.com/dealerships/Texas/Grapevine/GrapevineFordLincoln_1/fullservice-505318162.html')
>>> url.headers.get('content-type')
'text/html; charset=ISO-8859-1'
>>> response = url.read()
#let's convert to unicode first
>>> import codecs
>>> response_unicode = codecs.decode(response, 'ISO-8859-1')
>>> dom = html.fromstring(response_unicode)
#and now...
>>> dom.xpath("//div[@class='description item vcard']")[0].xpath(".//p[@class='service-review-paragraph loose-spacing']")[0].text_content()
u'\n On December 5th, my vehicle completely shut down.\nI had it towed to Grapevine Ford where they told me that the intak.....
tada!
So it looks like the page is corrupted as far as the parser is concerned: it is being treated as UTF-8, but its bytes are not valid in that encoding.
urlopen(...).read() returns us a byte string (str). When you feed it to lxml, it tries to decode it with UTF-8 and fails.
This might not be the best way, but we can specify a different encoding manually, such as Latin-1:
response = urllib.urlopen(...).read().decode('latin-1')
Now response is a text string (unicode), and that's what LXML wants to work with.
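Two general ways to deal with such bytes, sketched in Python 3 (the sample bytes are invented, modeled on the 0x93 byte from the traceback):

```python
raw = b'\x93service\x94 review'   # cp1252 curly quotes; invalid as UTF-8
# Option 1: decode with the correct encoding, losing nothing.
print(raw.decode('cp1252'))
# Option 2: if the encoding is unknown, replace undecodable bytes
# with U+FFFD instead of raising.
print(raw.decode('utf-8', errors='replace'))
```

Option 1 is preferable when you know the charset, since errors='replace' discards the original bytes.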
First time doing Python in a while, and I'm having trouble doing a simple scan of a file when I run the following script with Python 3.0.1,
with open("/usr/share/dict/words", 'r') as f:
for line in f:
pass
I get this exception:
Traceback (most recent call last):
File "/home/matt/install/test.py", line 2, in <module>
for line in f:
File "/home/matt/install/root/lib/python3.0/io.py", line 1744, in __next__
line = self.readline()
File "/home/matt/install/root/lib/python3.0/io.py", line 1817, in readline
while self._read_chunk():
File "/home/matt/install/root/lib/python3.0/io.py", line 1565, in _read_chunk
self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
File "/home/matt/install/root/lib/python3.0/io.py", line 1299, in decode
output = self.decoder.decode(input, final=final)
File "/home/matt/install/root/lib/python3.0/codecs.py", line 300, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1689-1692: invalid data
The line in the file it blows up on is "Argentinian", which doesn't seem to be unusual in any way.
Update: I added
encoding="iso-8859-1"
to the open() call, and it fixed the problem.
How have you determined from "position 1689-1692" what line in the file it has blown up on? Those numbers would be offsets in the chunk that it's trying to decode. You would have had to determine what chunk it was -- how?
Try this at the interactive prompt:
buf = open('the_file', 'rb').read()
len(buf)
ubuf = buf.decode('utf8')
# splat ... but it will give you the byte offset into the file
buf[offset-50:offset+60] # should show you where/what the problem is
# By the way, from the error message, looks like a bad
# FOUR-byte UTF-8 character ... interesting
Can you check to make sure it is valid UTF-8? A way to do that is given at this SO question:
iconv -f UTF-8 /usr/share/dict/words -o /dev/null
There are other ways to do the same thing.