UnicodeDecodeError when passing GET data in Python/AppEngine - python

This feels like a really basic question, but I haven't been able to find an answer.
I would like to read data from a URL, for example GET data from a query string. I am using the webapp framework in Python. I tried the following code, but since I'm a total beginner at Python/App Engine, I've certainly done something wrong.
class MainPage(webapp.RequestHandler):
    def get(self):
        self.response.out.write(self.request.get('data'))
application = webapp.WSGIApplication([('/', MainPage),('/search', Search),('/next', Next)],debug=False)
def main():
    run_wsgi_app(application)
if __name__ == "__main__":
    main()
When testing in my test environment, the URL http://localhost/?data=test returns the error message below. Without the query string, it displays a blank page, as expected.
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd6 in position 40: ordinal not in range(128)
What am I doing wrong and what should I do instead?

You are probably handling a byte string as if it were ASCII when it actually contains data in a different charset. This can happen, for example, with Latin-1 encoded data. Try converting your input to unicode using
unicoded = unicode(non_unicode_string, source_encoding)
where source_encoding is something like 'cp1252', 'iso-8859-1', etc., and then send the result to the output.
Have a look at this HOWTO. For a list of encodings supported by Python, see this

Check out this blog post on how to do unicode right in Python. In a nutshell, you're trying to decode a byte string (implicitly) as ASCII, and it contains a byte that isn't valid in that codec. Your string is probably in UTF-8.
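As a minimal sketch of the explicit-decoding advice above, written for Python 3 (where byte strings and text are separate types): the byte value 0xd6 comes from the traceback in the question, while the sample word and the choice of cp1252 as the source encoding are assumptions made purely for illustration.

```python
# Hypothetical input: a byte string starting with 0xd6, the byte named in the traceback.
raw = b"\xd6sterreich"

# Decoding as ASCII fails, because 0xd6 is outside the 7-bit range.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding with an explicit (assumed) source encoding succeeds.
text = raw.decode("cp1252")

# Encode explicitly again when writing the response.
utf8_out = text.encode("utf-8")

print(ascii_ok, ascii(text), utf8_out)
```

If the real data turns out to be UTF-8 rather than cp1252, only the argument to decode changes; the pattern of decoding once on input and encoding once on output stays the same.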

Related

Converting widechars to system ANSI encoding in Python

I am currently trying to make my screen reader work better with Becky! Internet Mail. The problem I am facing is related to the list view there. This control is not Unicode-aware, but its items are custom-drawn on screen, so when someone looks at it, the content of all fields looks okay regardless of encoding. When accessed via MSAA or UIA, however, basic ANSI characters and mails encoded with the code page set for non-Unicode programs have their text correct, whereas mails encoded in Unicode do not.
Samples of the text :
Zażółć gęślą jaźń
is represented by:
ZaĹĽĂłĹ‚Ä‡ gÄ™Ĺ›lÄ… jaĹşĹ„
In this case it is damaged CP1250 as per answer below.
However:
⚠️
is represented by:
âš ď¸Ź
⏰
is represented by:
âŹ°
and
高生旺
is represented by:
é«ç”źć—ş
I had assumed that these strings were damaged beyond repair; however, when the Unicode beta support in Windows 10 is enabled, they are exposed correctly.
Is it possible to simulate this behavior in Python?
The solution needs to work in both Python 2 and 3.
At the moment I am simply replacing known combinations of these characters with their proper representations, but this is not a very good solution, because the lists of replacements and of characters to replace need to be updated with each newly discovered character.
Your UTF-8 is being decoded as cp1250.
What I did in Python 3 is this:
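orig = "Zażółć gęślą jaźń"
wrong = "ZaĹĽĂłĹ‚Ä‡ gÄ™Ĺ›lÄ… jaĹşĹ„"
# Try every candidate code page number and report the one that reproduces the damage.
for enc in range(437, 1300):
    try:
        res = orig.encode().decode(f"cp{enc}")
        if res == wrong:
            print('FOUND', res, enc)
    except (LookupError, UnicodeDecodeError):
        # Unknown code page, or the bytes are not valid in it.
        pass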
...and the result was the 1250 codepage.
So your solution could be:
import sys
def restore(garbaged):
    # python 3
    if sys.version_info.major > 2:
        return garbaged.encode('cp1250').decode()
    # python 2
    else:
        # is it a string
        try:
            return garbaged.decode('utf-8').encode('cp1250')
        # or is it unicode
        except UnicodeEncodeError:
            return garbaged.encode('cp1250')
EDIT:
The reason why "高生旺" cannot be recovered from é«ç”źć—ş:
"高生旺".encode('utf-8') is b'\xe9\xab\x98\xe7\x94\x9f\xe6\x97\xba'.
The problem is the \x98 byte. In cp1250, no character is assigned to that value. If you try this:
"高生旺".encode('utf-8').decode('cp1250')
You will get this error: UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 2: character maps to <undefined>
The way to get "é«ç”źć—ş" is:
"高生旺".encode('utf-8').decode('cp1250', 'ignore')
But the 'ignore' argument is the critical part: it causes data loss.
'é«ç”źć—ş'.encode('cp1250') is b'\xe9\xab\xe7\x94\x9f\xe6\x97\xba'.
If you compare these two:
b'\xe9\xab\xe7\x94\x9f\xe6\x97\xba'
b'\xe9\xab\x98\xe7\x94\x9f\xe6\x97\xba'
you will see that the \x98 byte is missing, so when you try to restore the original content you get a UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte.
If you try this:
'é«ç”źć—ş'.encode('cp1250').decode('utf-8', 'backslashreplace')
The result will be '\\xe9\\xab生旺'. \xe9\xab\x98 could be decoded to 高, but \xe9\xab alone cannot.
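The round trip described in this answer can be sketched for Python 3 as follows. The helper name try_restore and its wrong_codec parameter are made up for illustration; the sketch simply returns None for strings such as the second sample, where a byte was lost and the repair is impossible.

```python
def try_restore(garbled, wrong_codec="cp1250"):
    """Undo UTF-8 text that was mis-decoded as `wrong_codec`.

    Returns the repaired string, or None when the round trip is not possible.
    """
    try:
        return garbled.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


original = "Zażółć gęślą jaźń"
# Simulate the damage: UTF-8 bytes read as cp1250.
damaged = original.encode("utf-8").decode("cp1250")
repaired = try_restore(damaged)

# This sample loses byte 0x98 while being damaged, so it cannot be repaired.
lossy = "高生旺".encode("utf-8").decode("cp1250", "ignore")
unrecoverable = try_restore(lossy)
```

Returning None instead of a partially decoded string keeps the caller from mistaking lossy output for a successful repair; a caller could fall back to the garbled text in that case.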

Tornado Invalid x-www-form-urlencoded body: 'latin-1' codec can't encode characters in position 774-777: ordinal not in range(256)

I'm using Tornado to accept data sent from clients I don't have access to. Everything works fine if only English characters appear in the data. When UTF-8 encoded Chinese characters (3 bytes each) are in the data, Tornado gives me this warning and the 'get_argument' function can't get anything at all.
I debugged and reduced my code to the simplest possible case, yet the warning still comes up:
import tornado.ioloop
import tornado.web
class DataHandler(tornado.web.RequestHandler):
    def post(self):
        print("test")
        print(self.get_argument("data"))
        print("1")
application = tornado.web.Application([
    (r"/data", DataHandler),
])
application.listen(5000)
tornado.ioloop.IOLoop.instance().start()
The data's format looks like this:
data={"id":"00f1c423","mac":"11:22:33:44:55:66"}
The data is x-www-form-urlencoded, and Wireshark shows that the Chinese characters are well-formed 3-byte UTF-8 sequences whose lead byte starts with 0xE (binary 1110). The position mentioned in the warning (774-777) is where the Chinese characters begin, and it's always 5 bytes, regardless of which Chinese characters are sent.
I'm confused about the 'encode' in the warning. I actually did nothing about encoding in my code, so I presume it's what Tornado does within the RequestHandler class. But since Tornado defaults to use utf-8 codec, where does this latin-1 come from? And most importantly, how can I fix it?
This should no longer be a problem. Tornado was changed to support x-www-form-urlencoded bodies whose values contain raw encoded bytes that are not percent-encoded into ASCII.
See: tornado merge request
Also: github issue #2733
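For comparison, here is a sketch of what a conventionally percent-encoded form body looks like, using only the Python 3 standard library. It does not reproduce Tornado's parsing, and the clients in the question are outside the asker's control, so it only illustrates the format in which non-ASCII bytes are escaped into ASCII. The "name" field and its value are hypothetical.

```python
import json
from urllib.parse import parse_qs, urlencode

# Hypothetical payload; the "name" field is made up for illustration.
payload = json.dumps({"id": "00f1c423", "name": "高生旺"}, ensure_ascii=False)

# urlencode percent-encodes the UTF-8 bytes, so the body is pure ASCII.
body = urlencode({"data": payload})

# Parsing the body reverses the percent-encoding and restores the text.
decoded = parse_qs(body)["data"][0]

print(body)
```

A body produced this way contains no bytes above 0x7f, which is the case the answer above describes as already working before the change.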

Python UnicodeEncodeError when Outputting Parsed Data from a Webpage

I have a program that parses webpages and then writes the data out somewhere else. When I am writing the data, I get
"UnicodeEncodeError: 'ascii' codec can't encode characters in position 19-21: ordinal not in range(128)"
I am gathering the data using lxml.
name = apiTree.xpath("//boardgames/boardgame/name[@primary='true']")[0].text
worksheet.goog["Name"].append(name)
After reading http://effbot.org/pyfaq/what-does-unicodeerror-ascii-decoding-encoding-error-ordinal-not-in-range-128-mean.htm, I understand that I should keep all of my variables in unicode. This means I need to know what encoding the site is using.
My final line that actually writes the data out somewhere is:
wks.update_cell(row + 1, worksheet.goog[value + "_col"], (str(worksheet.goog[value][row])).encode('ascii', 'ignore'))
How would I incorporate using unicode assuming the encoding is UTF-8 on the way in and I want it to be ASCII on the way out?
Your error is caused by:
str(worksheet.goog[value][row])
Calling str on a unicode object implicitly encodes it as ASCII. What you should be doing is encoding to UTF-8:
worksheet.goog[value][row].encode("utf-8")
As for "How would I incorporate using unicode assuming the encoding is UTF-8 on the way in and I want it to be ASCII on the way out?": you can't, because there is no ASCII representation of characters such as the Latin ă, unless you settle for the closest ASCII equivalent using something like Unidecode.
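To illustrate the difference between simply dropping non-ASCII characters and approximating them, here is a Python 3 sketch. Note that it swaps in the standard library's unicodedata module rather than Unidecode, so it is a weaker stand-in: it only handles characters that decompose into a base letter plus combining marks. The helper name to_ascii and the sample string are made up.

```python
import unicodedata


def to_ascii(text):
    """Best-effort, lossy ASCII version of `text`."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping non-ASCII code points then removes the marks but keeps the base.
    return decomposed.encode("ascii", "ignore").decode("ascii")


sample = "Caf\u00e9 \u0103"  # "Café ă"

# Plain 'ignore' throws the accented letters away entirely.
plain_drop = sample.encode("ascii", "ignore").decode("ascii")

# Normalizing first keeps the closest base letters.
normalized = to_ascii(sample)

print(repr(plain_drop), repr(normalized))
```

Letters without a decomposition, such as ł, are still dropped by this approach, which is one reason a transliteration library may be preferable for real data.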
I think I may have figured my own problem out.
apiTree.xpath("//boardgames/boardgame/name[@primary='true']")[0].text
actually returns unicode by default. So what I did was change this line to:
name = (apiTree.xpath("//boardgames/boardgame/name[@primary='true']")[0].text).encode('ascii', errors='ignore')
And I just output without changing anything:
wks.update_cell(row + 1, worksheet.goog[value + "_col"], worksheet.goog[value][row])
Due to the nature of the data, ASCII only is mostly fine. Although, I may be able to use UTF-8 and catch some extra characters...but this is not relevant to the question.
:)

Decoding error with my Python function

I am using the Robot framework to automate some HTTP POST related tests. I wrote a custom Python library that has a function to do a HTTP POST. It looks like this:
import httplib2
# This function will do a http post and return the json response
def Http_Post_using_python(json_dict,url):
    post_data = json_dict.encode('utf-8')
    headers = {}
    headers['Content-Type'] = 'application/json'
    h = httplib2.Http()
    resp, content = h.request(url,'POST',post_data,headers)
    return resp, content
This works fine as long as I am not using any Unicode characters. When I have Unicode characters in the json_dict variable (for example, 메시지), it fails with this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xeb in position 164: ordinal not in range(128)
I am running Python 2.7.3 on Windows 7. I saw several related questions, but I have not been able to resolve the issue. I am new to Python and programming, so any help is appreciated.
Thanks.
You're getting this error because json_dict is a str, not a unicode. Without knowing anything else about the application, a simple solution would be:
if isinstance(json_dict, unicode):
    json_dict = json_dict.encode("utf-8")
post_data = json_dict
However, if you're using json.dumps(…) to create the json_dict, then you don't need to encode it – that will be done by json.dumps(…).
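A rough Python 3 counterpart of the check above might look like this, where str plays the role of Python 2's unicode and bytes the role of Python 2's str. The helper name to_utf8_bytes is hypothetical, and the sample message is the one from the question.

```python
import json


def to_utf8_bytes(value):
    """Return `value` as UTF-8 bytes, encoding only if it is still text."""
    if isinstance(value, str):
        return value.encode("utf-8")
    return value  # already bytes: leave untouched


# json.dumps escapes non-ASCII characters by default (ensure_ascii=True),
# so its result contains only ASCII characters.
body_text = json.dumps({"msg": "메시지"})
post_data = to_utf8_bytes(body_text)

# Passing bytes through a second time is harmless.
same_data = to_utf8_bytes(post_data)
```

Because the helper checks the type first, it never encodes data that is already encoded, which is the mistake behind the traceback in the question.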
Use requests:
requests.post(url, data=data, headers=headers)
It will deal with the encodings for you.
You're getting an error because of Python 2's automatic encoding/decoding, which is basically a bug and was fixed in Python 3. In brief, Python 2's str objects are really "bytes", and the right way to handle string data is in a unicode object. Since unicodes were introduced later, Python 2 will automatically try to convert between them and strings when you get them confused. To do so it needs to know an encoding; since you don't specify one, it defaults to ascii which doesn't have the characters needed.
Why is Python automatically trying to decode for you? Because you're calling .encode() on a str object. It's already encoded, so Python first tries to decode it for you, and guesses the ascii encoding.
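Python 3 no longer performs this implicit step (its bytes type has no encode method), so the sketch below spells it out by hand to show where the error comes from. The helper name py2_style_encode is made up; the sample text is the one from the question.

```python
def py2_style_encode(byte_string, encoding):
    """Mimic Python 2's str.encode(): implicit ASCII decode, then encode."""
    return byte_string.decode("ascii").encode(encoding)


# Bytes that are already UTF-8 encoded, like a Python 2 str.
already_encoded = "메시지".encode("utf-8")

try:
    py2_style_encode(already_encoded, "utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The offending byte is the first byte of the UTF-8 data.
first_bad_byte = hex(error.object[error.start]) if error else None
print(first_bad_byte)
```

The failure happens in the hidden decode step, which is why the question's traceback reports a UnicodeDecodeError even though the code only calls encode.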
You should read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Try this:
#coding=utf-8
test = "메시지"
test.decode('utf8')
In the line #coding=utf-8 I just set the file encoding to UTF-8 (to be able to write "메시지").
You need to decode the string from UTF-8 into unicode. See the decode method documentation.

UTF-8 Python issues with Google Datastore

I've been roaming around these forums asking questions about issues related to Python and UTF-8 encoding/decoding.
This time around I've stumbled upon something which initially seemed an easy problem.
In my previous question (http://stackoverflow.com/questions/7138797/problems-with-python-in-google-app-engine-utf-8-and-ascii) I asked how to ensure proper addition of UTF-8 strings to variables:
Messages.append(ChatMessage(chatter, msg))
The solution was something along those lines:
Messages.append(ChatMessage(chatter.encode( "utf-8" ), msg.encode( "utf-8" )))
Pretty simple.
However, now I am faced with the challenge of sending the data to the Google App Engine Datastore. The code from the book I was using (Code in the Cloud) looked as follows (I skipped the redundant parts):
#START: ChatMessage
class ChatMessage(db.Model):
    user = db.StringProperty(required=True)
    timestamp = db.DateTimeProperty(auto_now_add=True)
    message = db.TextProperty(required=True)
    def __str__(self):
        return "%s (%s): %s" % (self.user, self.timestamp, self.message)
#END: ChatMessage
# START: PostHandler
class ChatRoomPoster(webapp.RequestHandler):
    def post(self):
        chatter = self.request.get("name")
        msgtext = self.request.get("message")
        msg = ChatMessage(user=chatter, message=msgtext)
        msg.put() #<callout id="co.put"/>
        self.redirect('/')
# END: PostHandler
I thought that swapping a part of the PostHandler with the following bit:
msg = ChatMessage(user=chatter.encode( "utf-8" ), message=msgtext.encode( "utf-8" ))
... would do the trick. Unfortunately, that did not happen. I still keep getting
File "/base/data/home/apps/s~markcc-chatroom-one-pl/1.353054484690143927/pchat.py", line 147, in post
msg = ChatMessage(user=chatter.encode( "utf-8" ), message=msgtext.encode( "utf-8" ))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
Naturally, I declared the (# -*- coding: utf-8 -*-) statement and put:
self.response.headers['Content-Type'] = 'text/html; charset=UTF-8'
in the file. It does nothing to alleviate the issue.
As you can see, I am not very well-versed in Python, and encoding/decoding problems are a bit of a novelty for me. I would appreciate your assistance. Could anyone explain where I went wrong in this case, and what practices I should use to avoid similar quandaries in the future? Thank you in advance.
encode turns unicode into bytes, and decode turns bytes into unicode. You have to be careful not to mix the two. Your error means either:
chatter or msgtext is already bytes, and you are trying to encode it. One of the worst 'features' of Python 2 is that it lets you do this - it tries to first decode the bytes using ascii (the most limited encoding), and then re-encode them with whatever you've asked for. This is fixed in Python 3, but you can't use that on App Engine.
App Engine expects to store unicode (it does). So you need to pass it a unicode string without encoding it. In fact, if your data is already in a bytestring, you would need to decode it before you can store it.
In short, the first thing to try is simply not calling .encode before you store the data.
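The rule of thumb in this answer (decode bytes on the way in, keep text internally, encode only on the way out) can be sketched in Python 3 as below. The helper name ensure_text and the sample values are hypothetical; the first sample starts with byte 0xc4, the byte named in the question's traceback.

```python
def ensure_text(value, encoding="utf-8"):
    """Return `value` as text, decoding only when it is still bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value  # already text: never decode or encode it again


# Hypothetical form values: one arrives as bytes, one is already text.
chatter = ensure_text(b"\xc4\x85dam")
msgtext = ensure_text("cze\u015b\u0107")

# Work with text internally and encode once, at the output boundary only.
page_bytes = ("%s: %s" % (chatter, msgtext)).encode("utf-8")
```

With values normalized like this, the model object would receive text directly, in line with the advice not to call .encode before storing the data.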
(I may have pointed you to it before, but if not, please take the time to read this article about unicode)
