It's puzzling me that I've got 2 functions handling HTTP POST where one breaks foreign characters, and in both I just do self.request.POST.get('text') to get the value. The difference I see is that the one that breaks inherits BlobstoreUploadHandler, so I suspect it might have to do with that change. I don't understand, for example, why ÅÄÖ first works and then, after a seemingly unrelated change, suddenly any non-ASCII character gets mangled.
Please help me understand how python should work with unicode and utf-8.
I have the complete code for both examples, where one works and the other distorts foreign characters like ÅÄÖ, and I just need to know what to change; I think it should be possible to adjust it so that it behaves as expected.
To pin down exactly what the problem is, it may help to know that if I input ÅÄÖ, the output becomes xcTW when it should be ÅÄÖ.
The 2 pieces of code mentioned are
class AList(RequestHandler, I18NHandler):
    ...
    a.text = self.request.POST.get('text')
The above works. Then I changed to
class AList(RequestHandler, I18NHandler, blobstore_handlers.BlobstoreUploadHandler):
    ...
    a.text = self.request.POST.get('text')
And this seems to be the only difference. One idea I have is to deploy the two examples as the same app and see what is really causing the issue, since it may or may not be in the code I paste here.
This is also only a production issue; locally, foreign characters work as expected.
It seems to be related to the use of BlobstoreUploadHandler, since the following reproduces the garbled characters by email:
class ContactUploadHandler(blobstore_handlers.BlobstoreUploadHandler):
    def post(self):
        message = mail.EmailMessage(
            sender='admin#myapplicationatappspot.com',
            subject=self.request.POST.get('subject'))
        message.body = '%s \nhttp://www.myapplicationatappspot.com/' % (
            self.request.POST.get('text'))
        message.to = 'info#myapplicationatappspot.com'
        message.send()
        self.redirect('/service.html')
It looks like you've hit this bug: http://code.google.com/p/googleappengine/issues/detail?id=2749
As a workaround until it gets fixed, you can encode all your input in base64 using JavaScript. It's not ideal but it did the trick for me.
xcTW is the result of base-64 encoding the cp1252 or latin1 encoding of those 3 characters; see the following IDLE session:
>>> import base64; print repr(base64.b64decode('xcTW'))
'\xc5\xc4\xd6'
>>> print repr('ÅÄÖ')
'\xc5\xc4\xd6'
>>>
BUT base-64 encoding mangles ASCII characters as well:
>>> base64.b64encode('abcdef')
'YWJjZGVm'
>>>
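For completeness, undoing the mangling in Python 3 looks like this (assuming latin-1 bytes were base64-encoded, as the session above suggests):

```python
import base64

garbled = 'xcTW'                  # what the handler produced for ÅÄÖ
raw = base64.b64decode(garbled)   # b'\xc5\xc4\xd6'
text = raw.decode('latin-1')      # cp1252 gives the same answer for these bytes
print(text)                       # ÅÄÖ
```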
Looks like you need to look into the transfer encoding.
If you can't work out from this what is happening, try publishing your two pieces of code.
Update More of the train of thought: a "blob" is a Binary Large OBject, hence the base64 encoding to ensure that it can be transported across a network that might not be 8-bit clean. I'm not sure why you are using blobs if you are expecting text. If you really must stick that 3rd arg in there, then just use base64.b64decode() on the bytes that are returned. If all else fails, read the gae docs to see if there's a way of turning off the base 64 encoding.
Even more ToT: perhaps the blobhandler transmits in ASCII if it fits otherwise base64-encodes it -- this would fit with the reported behaviour. In that case you have to detect what the encoding is. I say again: read the gae docs.
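If the handler really does pass ASCII through untouched and base64-encode everything else (as speculated above), one hypothetical workaround on the receiving side is to attempt a strict base64 decode and fall back to the raw value. The `maybe_decode_base64` helper below is an illustration of that idea, not a GAE API:

```python
import base64
import binascii

def maybe_decode_base64(value):
    # Hypothetical heuristic: strict-decode as base64 and interpret the
    # result as latin-1 bytes; fall back to the original value when it is
    # not valid base64. Note this can misfire on real text that happens to
    # be valid base64 (e.g. 'abcd'), so it is only a stopgap.
    try:
        return base64.b64decode(value, validate=True).decode('latin-1')
    except (binascii.Error, ValueError, UnicodeDecodeError):
        return value

print(maybe_decode_base64('xcTW'))         # ÅÄÖ
print(maybe_decode_base64('hello there'))  # left untouched (space is not base64)
```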
I'm facing an issue calling an API using the requests library. The problem is as follows.
The code:
r = requests.post(url, data=json.dumps(json_data), headers=headers)
When I look at r.text, the apostrophe in the string comes back like this: Bachelor\u2019s Degree. It should actually give me Bachelor's Degree.
I tried json.loads as well, but the single-quote problem remains.
How do I get the string value correctly?
What you see here ("Bachelor\u2019s Degree") is the string's internal representation, where "\u2019" is the Unicode codepoint for RIGHT SINGLE QUOTATION MARK. This is perfectly correct; there's nothing wrong here. If you print() this string you'll get what you expect:
>>> s = 'Bachelor\u2019s Degree'
>>> print(s)
Bachelor’s Degree
Learning about unicode and encodings might save you quite some time FWIW.
EDIT:
When I save it in the db and then display it in HTML, it will cause an issue, right?
Have you tried it?
Your database connector is supposed to encode it to the proper encoding (according to your fields, tables and client encoding settings).
wrt/ "displaying it in HTML", it mostly depends on whether you're using Python 2.7.x or Python 3.x AND on how you build your HTML, but if you're using a decent framework with a decent template engine (if not, you should reconsider your stack), chances are it will work out of the box.
As I already mentioned, learning about Unicode and encodings will save you a lot of time.
It's just the escaped representation of the character; it is not "wrong".
string = 'Bachelor\u2019s Degree'
print(string)
Bachelor’s Degree
You can encode and then decode it again, but I can't see any reason why you would want to do that (this might not work in Python 2):
string = 'Bachelor\u2019s Degree'.encode().decode('utf-8')
print(string)
Bachelor’s Degree
From requests docs:
When you make a request, Requests makes educated guesses about the
encoding of the response based on the HTTP headers. The text encoding
guessed by Requests is used when you access r.text
On the response object, you may use .content instead of .text to get the raw response bytes, which you can then decode as UTF-8 yourself.
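A standalone Python 3 sketch of the difference (the bytes below stand in for what `r.content` would hold):

```python
raw = 'Bachelor\u2019s Degree'.encode('utf-8')  # stand-in for r.content
print(raw)                  # b'Bachelor\xe2\x80\x99s Degree' -- the raw bytes
print(raw.decode('utf-8'))  # Bachelor’s Degree -- the decoded text
```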
I apologize in advance as I am not sure how to ask this! Okay so I am attempting to use a twitter API within Python. Here is the snippet of code giving me issues:
trends = twitter.Api.GetTrendsCurrent(api)
print str(trends)
This returns:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-5: ordinal not in range(128)
When I attempt to .encode, the interpreter tells me I cannot encode a Trend object. How do I get around this?
Simple answer:
Use repr, not str. It should always, always work (unless the API itself is broken and that is where the error is being thrown from).
Long answer:
When you cast a Unicode string to a byte str (or vice versa) in Python 2, it uses the ascii codec by default for the conversion. This works most of the time, but not always, so nasty edge cases like this are a pain. One of the big reasons for the break in backwards compatibility in Python 3 was to change this behavior.
Use latin1 for testing. It may not be the correct encoding, but it will always (always, always, always) work and give you a jumping off point for debugging this properly so you at least can print something.
trends = twitter.Api.GetTrendsCurrent(api)
print type(trends)
print unicode(trends)
print unicode(trends).encode('latin1')
Or, better yet, when encoding force it to ignore or replace errors:
trends = twitter.Api.GetTrendsCurrent(api)
print type(trends)
print unicode(trends)
print unicode(trends).encode('utf8', 'xmlcharrefreplace')
Chances are, since you are dealing with a web based API, you are dealing with UTF-8 data anyway; it is pretty much the default encoding across the board on the web.
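For reference, the same error-handler idea in Python 3 syntax (a standalone snippet; `s` just stands in for the trend text):

```python
s = 'caf\u00e9 \u2603'  # é and a snowman: neither fits in ASCII
# xmlcharrefreplace swaps unencodable characters for numeric character
# references instead of raising UnicodeEncodeError
print(s.encode('ascii', 'xmlcharrefreplace'))  # b'caf&#233; &#9731;'
# 'replace' substitutes '?' for anything the target encoding lacks
print(s.encode('latin1', 'replace'))           # b'caf\xe9 ?'
```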
I'm struggling with a question about a Python crawler.
First, websites use two different hexadecimal representations of Chinese characters. I can convert one of them (which is E4BDA0E5A5BD), but the other one is C4E3BAC3, which I have no method to convert, or maybe I am missing some method. Both hexadecimal values represent '你好' in Chinese.
Second, I found a website which can convert the hexadecimal, and to my surprise the answer is exactly the one I cannot convert by myself.
The url is http://www.uol123.com/hantohex.html
Then my question is: how do I get the result that is in the text box (well, I don't know what it is called exactly)? I used Firefox + HttpFox to observe the POST data, and I found that the result converted by the website is in the Content; here is the pic:
And then I printed the POST; it has POST Data and some headers, but no info about Content.
Third, I googled how to use ajax, and I found some code showing how to use it.
Here is the url http://outofmemory.cn/code-snippet/1885/python-moni-ajax-request-get-ajax-request-response
But when I run this, it has an error which says "ValueError: No JSON object could be decoded."
And pardon that I am a newbie, so I cannot post images!!!
I am looking forward to your help sincerely.
Any help will be appreciated.
You're talking about different encodings of these Chinese characters. There are at least three widely used encodings: Guobiao (for mainland China), Big5 (on Taiwan) and Unicode (everywhere else).
Here's how to convert your characters into the different encodings:
>>> a = u'你好'                      # your original characters
>>> a
u'\u4f60\u597d'                     # in Unicode
>>> a.encode('utf-8')
'\xe4\xbd\xa0\xe5\xa5\xbd'          # in UTF-8
>>> a.encode('big5')
'\xa7A\xa6n'                        # in Taiwanese Big5
>>> a.encode('gb2312-80')
'\xc4\xe3\xba\xc3'                  # in Guobiao
>>>
You may check other available encodings here.
Ah, almost forgot: to convert from Unicode into an encoding you use the encode() method; to convert back from the encoded contents of the web site you use the decode() method. Just don't forget to specify the correct encoding.
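As a Python 3 cross-check of the two hex strings from the question (both are 你好, just in different encodings):

```python
guobiao = bytes.fromhex('C4E3BAC3')    # the bytes the asker could not convert
utf8 = bytes.fromhex('E4BDA0E5A5BD')   # the bytes the asker could convert
# decoding each with its own codec yields the same two characters
print(guobiao.decode('gb2312'))  # 你好
print(utf8.decode('utf-8'))      # 你好
```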
I have two applications running on different servers with different DBs. I need to post some data from one to the other, so I use the POST method. I concatenate the related info into a string, then POST it...
My data is something like:
26AU223/AHMET DEMİROĞLU/18439586958/0//2000-07-31/2000-06-11/42.00/0
For Turkish characters, I try to use
var1 = '26AU223/AHMET DEMİROĞLU/18439586958/0//2000-07-31/2000-06-11/42.00/0'
var1.encode('iso8859_9')
but when I receive this data in the second application and decode it, I find that the Turkish characters are not decoded correctly, so my result is:
26AU223/AHMET DEM�O�U/18439586958/0//2011-07-31/2008-06-11/42.00/0
So İ and Ğ cause problems, and the first letters that follow them, R and L, are mis-decoded too.
I tried different encoding parameters for Turkish, and also tried to POST the data without encode/decode (both applications use UTF-8), but I get a similar encoding error, with a strange � instead of İR and ĞL.
With Python 2.x, this is obviously wrong:
var1 = '26AU223/AHMET DEMİROĞLU/18439586958/0//2000-07-31/2000-06-11/42.00/0'
var1.encode('iso8859_9')
Python 2 has a bad design flaw in that it allows you to call .encode() on byte strings (str type). You must have a Unicode string, and then encode that before POSTing it. And using encodings other than UTF-8 is not reasonable.
var1 = u'26AU223/AHMET DEMİROĞLU/18439586958/0//2000-07-31/2000-06-11/42.00/0'
buf = var1.encode('utf-8')
# ...send buf over the network...
assert buf.decode('utf-8') == var1
And if you're constructing the POST data yourself, don't forget to do URL escaping.
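In Python 3 terms (where all strings are Unicode), the encode-then-escape step might look like the sketch below; `urllib.parse.quote`/`unquote` are the Python 3 counterparts of the Python 2 `urllib.quote`/`unquote` used elsewhere in this thread:

```python
from urllib.parse import quote, unquote

var1 = '26AU223/AHMET DEM\u0130RO\u011eLU/18439586958'  # contains İ and Ğ
escaped = quote(var1.encode('utf-8'))  # percent-encode the UTF-8 bytes
restored = unquote(escaped)            # unquote assumes UTF-8 by default
assert restored == var1
print(escaped)  # slashes are left intact by quote()'s default safe='/'
```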
I solved the problem in the easiest possible way (:
Before quoting my text, I cast it to a string:
quote(str(var1))
And on the other side, I unquote it in a similar way:
unquote(str(var1))
That solved the problem.
Are you getting a Unicode string object on the remote side? In that case, your problem is that the code responsible for reading the HTTP message body assumes the wrong character set. Set the HTTP request Content-Type header to 'text/plain; charset=ISO-8859-9'.
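A sketch of what that might look like from the sending side (a hypothetical snippet; the receiving framework still has to honour the declared charset):

```python
# ISO-8859-9 (Latin-5) covers the Turkish İ and Ğ that plain Latin-1 lacks
body = 'AHMET DEM\u0130RO\u011eLU'.encode('iso8859_9')
headers = {'Content-Type': 'text/plain; charset=ISO-8859-9'}
print(body)  # b'AHMET DEM\xddRO\xd0LU'
```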
I am trying to work with the HORRIBLE web services at Commission Junction (CJ). I can get the client to connect and receive information from CJ, but their database seems to include a bunch of bad characters that cause a UnicodeDecodeError.
Right now I am doing:
from suds.client import Client
wsdlLink = 'https://link-search.api.cj.com/wsdl/version2/linkSearchServiceV2.wsdl'
client = Client(wsdlLink)
result = client.service.searchLinks(developerKey='XXX', websiteId='XXX', promotionType='coupon')
This works fine until I hit a record that has something like 'CorpNet® 10% Off Any Service'; then the ® causes it to break and I get:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 758: ordinal not in range(128)" error.
Is there a way to encode the ® on my end so that it does not break when SUDS reads in the result?
UPDATE:
To clarify, the ® is coming from the CJ database and is in their response. So somehow I need to decode the non-ASCII characters BEFORE SUDS deals with the response. I am not sure how (or if) this is done in SUDS.
Implicit UnicodeDecodeErrors are something you get when trying to add str and unicode objects. Python will then try to decode the str into unicode using the ASCII encoding. If your str contains anything that is not ASCII, you will get this error.
Your solution is to decode it manually, like so:
thestring = thestring.decode('utf8')
Try, as much as possible, to decode any string that may contain non-ASCII characters as soon as you are handed it from whatever module you get it from, in this case suds.
Then, if suds can't handle Unicode (which may be the case) make sure you encode it back just before handing the text back to suds (or any other library that breaks if you give it unicode).
That should solve things nicely. It may be a big change, as you need to move all your internal processing from str to unicode, but it's worth it. :)
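The decode-at-the-boundary idea in a nutshell (Python 3 bytes used here to stand in for what suds would hand back):

```python
raw = b'CorpNet\xc2\xae 10% Off Any Service'  # UTF-8 bytes with the (R) sign
text = raw.decode('utf8')   # decode once, as soon as you receive it
print(text)                 # CorpNet® 10% Off Any Service
back = text.encode('utf8')  # re-encode only at the boundary going back out
assert back == raw
```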
The "registered" character is U+00AE and is encoded as "\xc2\xae" in UTF-8. It looks like you have a str object encoded in UTF-8 but some code is doing (probably by default) your_str_object.decode("ascii") which will fail with the error message you showed.
What you need to do is show us a complete example (i.e. ALL the code necessary to get the error), plus the full error message and traceback, so that at least we can guess whether the problem is in your code or in imported code.
I am using SUDS to interface with Salesforce via their SOAP API. I ran into the same situation until I followed @J.F. Sebastian's advice of not mixing str and unicode string types. For example, passing a SOQL string like this does work with SUDS 0.3.9:
qstr = u"select Id, FirstName, LastName from Contact where FirstName='%s' and LastName='%s'" % (u'Jorge', u'López')
I did not seem to need to do str.decode("utf-8") either.
If you're running your script from PyDev in Eclipse, you might want to go into Project => Properties and, under Resource, set "Text File Encoding" to UTF-8; on my Mac, this defaults to "MacRoman". I suppose on Windows, the default is either Cp1252 or ISO-8859-1 (Latin). You could also set this on your Workspace and have your Projects inherit the setting from their workspace. This only affects the program source code.