UTF-8 Python issues with Google Datastore - python

I've been roaming around these forums asking questions about issues related to Python and UTF-8 encoding/decoding.
This time around I've stumbled upon something which initially seemed an easy problem.
In my previous question (http://stackoverflow.com/questions/7138797/problems-with-python-in-google-app-engine-utf-8-and-ascii) I asked how to ensure proper addition of UTF-8 strings to variables:
Messages.append(ChatMessage(chatter, msg))
The solution was something along those lines:
Messages.append(ChatMessage(chatter.encode( "utf-8" ), msg.encode( "utf-8" )))
Pretty simple.
However, now I am faced with the challenge to send the data to Google App Engine Datastore. The code from the book I was using (Code in the Cloud)looked as follows (I skipped the redundant parts):
#START: ChatMessage
class ChatMessage(db.Model):
user = db.StringProperty(required=True)
timestamp = db.DateTimeProperty(auto_now_add=True)
message = db.TextProperty(required=True)
def __str__(self):
return "%s (%s): %s" % (self.user, self.timestamp, self.message)
#END: ChatMessage
# START: PostHandler
class ChatRoomPoster(webapp.RequestHandler):
def post(self):
chatter = self.request.get("name")
msgtext = self.request.get("message")
msg = ChatMessage(user=chatter, message=msgtext)
msg.put() #<callout id="co.put"/>
self.redirect('/')
# END: PostHandler
I thought that swaping a part of the PostHandler with the following bit:
msg = ChatMessage(user=chatter.encode( "utf-8" ), message=msgtext.encode( "utf-8" ))
... would do the trick. Unfortunately, that did not happen. I still keep getting
File "/base/data/home/apps/s~markcc-chatroom-one-pl/1.353054484690143927/pchat.py", line 147, in post
msg = ChatMessage(user=chatter.encode( "utf-8" ), message=msgtext.encode( "utf-8" ))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
Naturally, I declared (# -- coding: utf-8 --) statement and put:
self.response.headers['Content-Type'] = 'text/html; charset=UTF-8'
in the file. It does nothing to alleviate the issue.
As you can see I am not very well-versed in Python, and encoding/decoding problems are, for me, a bit of novelty. I would appreciate your assistance. If anyonone could explain to me where I went wrong in this case and what practices to use to avoid similar quandaries in the future? Thank you in advance.

encode turns unicode into bytes, and decode turns bytes into unicode. You have to be careful not to mix the two. Your error means either:
chatter or msgtext is already bytes, and you are trying to encode it. One of the worst 'features' of Python 2 is that it lets you do this - it tries to first decode the bytes using ascii (the most limited encoding), and then re-encode them with whatever you've asked for. This is fixed in Python 3, but you can't use that on App Engine.
App Engine expects to store unicode (it does). So you need to pass it a unicode string without encoding it. In fact, if your data is already in a bytestring, you would need to decode it before you can store it.
In short, the first thing to try is simply not calling .encode before you store the data.
(I may have pointed you to it before, but if not, please take the time to read this article about unicode)

Related

Tornado Invalid x-www-form-urlencoded body: 'latin-1' codec can't encode characters in position 774-777: ordinal not in range(256)

I'm using tornado to accept some data sended from clients I don't have access to. Everything works fine if only English characters appear in the data. When utf-8 encoded Chinese characters(3 bytes) are within the data, Tornado gives me this warning and the 'get_argument' function can't get anything at all.
I debuged and simplified my code to the simplest, yet the warning still comes up
class DataHandler(tornado.web.RequestHandler):
def post(self):
print("test")
print(self.get_argument("data"))
print("1")
application = tornado.web.Application([
(r"/data", Data),
])
application.listen(5000)
tornado.ioloop.IOLoop.instance().start()
The data's format looks like this:
data={"id":"00f1c423","mac":"11:22:33:44:55:66"}
The data is x-www-form-urlencoded and WireShark shows the Chinese characters are perfectly 3-bytes utf-8 which starts with E(1110). The position mentioned in the warning(774-777) is where the Chinese characters begins and it's always 5 bytes, despite the changing of Chinese characters.
I'm confused about the 'encode' in the warning. I actually did nothing about encoding in my code, so I presume it's what Tornado does within the RequestHandler class. But since Tornado defaults to use utf-8 codec, where does this latin-1 come from? And most importantly, how can I fix it?
This won't be a problem anymore. Tornado did some changes to support x-www-form-urlencoded body with values consisting of encoded bytes which are not url-encoded into ascii.
See: tornado merge request
Also: github issue #2733

Is there a way to get around unicode issues when using win32api/com modules in python 3?

I've looked around and haven't found anything just yet. I'm going through emails in an inbox and checking for a specific word set. It works on most emails but some of them don't parse. I checked the broken emails using.
print (msg.Body.encode('utf8'))
and my problem messages all start with b'.
like this
b'\xe6\xa0\xbc\xe6\xb5\xb4\xe3\xb9\xac\xe6\xa0\xbc\xe6\x85\xa5\xe3\xb9\xa4\xe0\xa8\x8d\xe6\xb4\xbc\xe7\x91\xa5\xe2\x81\xa1\xe7\x91\x
I think this is forcing python to read the body as bytes but I'm not sure. Either way after the b, no matter what encoding I try I don't get anything but garbage text.
I've tried other encoding methods as well decoding before but I'm just getting a ton of attribute errrors.
import win32api
import win32com.client
import datetime
import os
import time
outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
dater = datetime.date.today() - datetime.timedelta(days = 1)
dater = str(dater.strftime("%m-%d-%Y"))
print (dater)
#for folders in outlook.folders:
# print(folders)
Receipt = outlook.folders[8]
print(Receipt)
Ritems = Receipt.folders["Inbox"]
Rmessage = Ritems.items
for msg in Rmessage:
if (msg.Class == 46 and msg.CreationTime.strftime("%m-%d-%Y") == dater):
print (msg.CreationTime)
print (msg.Subject)
print (msg.Body.encode('utf8'))
print ('..............................')
End result is to have the message printed out in the console, or at least give Python a way to read it so I can find the text I'm looking for in the body.
The byte literal posted in the question is valid UTF-8. First two characters are U+683C and U+6D74 from the CJK Unified Ideographs block, U+4E00 - U+9FFF.
Since you don't know the source encoding there is no way to be completely sure about it, but chances are that email body is just Han characters encoded in UTF-8 (Determine the encoding of text in Python). If you are not being able to see the UTF-8 characters correctly you should check your terminal or display character set.
That said, you should to get the fundamentals of character representation right. Randomly encoding or decoding is hardly going to solve anything. I would suggest you begin by reading Spolsky's introduction to Unicode and then move to Batchelder on Unicode in Python.
As martineau said the proper encoding I was searching for was utf16. The other messages were encoded using utf8. So a simple mail scrape turned out to be an excellent lesson in encoding as well message Classes (off topic). Thanks for the help.

Decoding error with my Python function

I am using the Robot framework to automate some HTTP POST related tests. I wrote a custom Python library that has a function to do a HTTP POST. It looks like this:
# This function will do a http post and return the json response
def Http_Post_using_python(json_dict,url):
post_data = json_dict.encode('utf-8')
headers = {}
headers['Content-Type'] = 'application/json'
h = httplib2.Http()
resp, content = h.request(url,'POST',post_data,headers)
return resp, content
This works fine as long as I am not using any Unicode characters. When I have Unicode characters in the json_dict variable (for example, 메시지), it fails with this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xeb in position 164: ordinal not in range(128)
I am running Python 2.7.3 on Windows 7. I saw several related questions, but I have not been able to resolve the issue. I am new to Python and programming, so any help is appreciated.
Thanks.
You're getting this error because json_dict is a str, not a unicode. Without knowing anything else about the application, a simple solution would be:
if isinstance(json_dict, unicode):
json_dict = json_dict.encode("utf-8")
post_data = json_dict
However, if you're using json.dumps(…) to create the json_dict, then you don't need to encode it – that will be done by json.dumps(…).
Use requests:
requests.post(url, data=data, headers=headers)
It will deal with the encodings for you.
You're getting an error because of Python 2's automatic encoding/decoding, which is basically a bug and was fixed in Python 3. In brief, Python 2's str objects are really "bytes", and the right way to handle string data is in a unicode object. Since unicodes were introduced later, Python 2 will automatically try to convert between them and strings when you get them confused. To do so it needs to know an encoding; since you don't specify one, it defaults to ascii which doesn't have the characters needed.
Why is Python automatically trying to decode for you? Because you're calling .encode() on a str object. It's already encoded, so Python first tries to decode it for you, and guesses the ascii encoding.
You should read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Try this:
#coding=utf-8
test = "메시지"
test.decode('utf8')
In the line #coding=utf-8 i just set the file encoding to UTF-8 (to be able to write "메시지").
You need to decode the string into utf-8. decode method documentation

How to fix unicode issue when using a web service with Python Suds

I am trying to work with the HORRIBLE web services at Commission Junction (CJ). I can get the client to connect and receive information from CJ, but their database seems to include a bunch of bad characters that cause a UnicideDecodeError.
Right now I am doing:
from suds.client import Client
wsdlLink = 'https://link-search.api.cj.com/wsdl/version2/linkSearchServiceV2.wsdl'
client = Client(wsdlLink)
result = client.service.searchLinks(developerKey='XXX', websiteId='XXX', promotionType='coupon')
This works fine until I hit a record that has something like 'CorpNet® 10% Off Any Service' then the ® causes it to break and I get
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 758: ordinal not in range(128)" error.
Is there a way to encode the ® on my end so that it does not break when SUDS reads in the result?
UPDATE:
To clarify, the ® is coming from the CJ database and is in their response. SO somehow I need to decode the non-ascii characters BEFORE SUDS deals with the response. I am not sure how (or if) this is done in SUDs.
Implicit UnicodeDecodeErrors is something you get when trying to add str and unicode objects. Python will then try to decode the str into unicode, but using the ASCII encoding. If your str then contains anything that is not ascii, you will get this error.
Your solution is the decode it manually like so:
thestring = thestring.decode('utf8')
Try, as much as possible, to decode any string that may contain non-ascii characters as soo as you are handed it from whatever module you get it from, in this case suds.
Then, if suds can't handle Unicode (which may be the case) make sure you encode it back just before handing the text back to suds (or any other library that breaks if you give it unicode).
That should solve things nicely. It may be a big change, as you need to move all your internal processing from str to unicode, but it's worth it. :)
The "registered" character is U+00AE and is encoded as "\xc2\xae" in UTF-8. It looks like you have a str object encoded in UTF-8 but some code is doing (probably by default) your_str_object.decode("ascii") which will fail with the error message you showed.
What you need to do is show us a complete example (i.e. ALL the code necessary to get the error), plus the full error message and traceback, so that at least we can guess whether the problem is in your code or in imported code.
I am using SUDS to interface with Salesforce via their SOAP API. I ran into the same situation until I followed #J.F.Sabastian's advice by not mixing str and unicode string types. For example, passing a SOQL string like this does work with SUDS 0.3.9:
qstr = u"select Id, FirstName, LastName from Contact where FirstName='%s' and LastName='%s'" % (u'Jorge', u'López')
I did not seem to need to do str.decode("utf-8") either.
If you're running your script from PyDev on Eclipse, you might want to go into Project => Properties and under Resource, set "Text File Encoding" to UTF-8, on my Mac, this defaults to "MacRoman". I suppose on Windoze, the default is either Cp1252 or ISO-8859-1 (Latin). You could also set this in your Workspace of your Projects inherit this setting from their workspace. This only effects the program source code.

UnicodeDecodeError when passing GET data in Python/AppEngine

This feels like a really basic question, but I haven't been able to find an answer.
I would like to read data from an url, for example GET data from a querystring. I am using the webapp framework in Python. I tried the following code, but since I've a total beginner at Python/appengine, I've certainly done something wrong.
class MainPage(webapp.RequestHandler):
def get(self):
self.response.out.write(self.request.get('data'))
application = webapp.WSGIApplication([('/', MainPage),('/search', Search),('/next', Next)],debug=False)
def main():
run_wsgi_app(application)
if __name__ == "__main__":
main()
When testing in my test environment, the URL http://localhost/?data=test just returns this error message below. Without the querystring, it just displays a blank page as expected.
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd6 in position 40: ordinal not in range(128)
What am I doing wrong and what should I do instead?
You try to e.g. print an ASCII coded string actually containing data of a different charset. This can happen e.g. with Latin-1 encoded data. Try converting your input to unicode using
unicoded = unicode(non_unicode_string, source_encoding)
where source_encoding is something like 'cp1252', 'iso-8859-1' etc., and sending this to output.
Have a look at this HOWTO. For a list of encodings supported by Python, see this
Check out this blog post on how to do unicode right in Python. In a nutshell, you're trying to decode a byte string (implicitly) as ASCII, and it contains a byte that isn't valid in that codec. Your string is probably in UTF-8.

Categories