Turkish character problem in post data - python

I have two applications running on diffrent servers with diffrent DB's. I need to post some data from one to another, so ,i use post method. I concatenate related info into a string, then POST it...
My data is something like:
26AU223/AHMET DEMİROĞLU/18439586958/0//2000-07-31/2000-06-11/42.00/0
For turkish characters, i try to use
var1 = '26AU223/AHMET DEMİROĞLU/18439586958/0//2000-07-31/2000-06-11/42.00/0'
var1.encode('iso8859_9')
but when i receive this data on the second application and decode it, i realize that Turkish characters can not be decoded correctly, so my result is :
26AU223/AHMET DEM�O�U/18439586958/0//2011-07-31/2008-06-11/42.00/0
So İ and Ğ causes problem, and also following first letters R and L are mis-decoded too.
I tried diffrent encoding parameters for turish, also tries to POST daha without encode/decode (both applications use UTF-8) but i get a similar encoding error, with a strange � instead of İR and ĞL .

With Python 2.x, this is obviously wrong:
var1 = '26AU223/AHMET DEMİROĞLU/18439586958/0//2000-07-31/2000-06-11/42.00/0'
var1.encode('iso8859_9')
Python 2 has a bad design flaw in that it allows you to .encode() byte strings (str type). You must have a Unicode string, and then encode that before POSTing it. And using encodings other than UTF-8 is not reasonable.
var1 = u'26AU223/AHMET DEMİROĞLU/18439586958/0//2000-07-31/2000-06-11/42.00/0'
buf = var1.encode('utf-8')
# ...send buf over the network...
assert buf.decode('utf-8') == var1
And if you're constructing the POST data yourself, don't forget to do URL escaping.

I solve the problem with the easiest possible way (:
before quote my text, i cast it to string :
quote(str(var1))
And on the other side, unquote it in a similar way:
unquote(str(var1))
That solve the problem

Are you getting a Unicode string object on the remote side? In that case, your problem is that the code responsible for reading the HTTP message body assumes a wrong character set. Set the HTTP request Content-Type header to 'text/plain;charset=ISO-8859-9'.

Related

python request library giving wrong value single quotes

Facing some issue in calling API using request library. Problem is described as follows
The code:.
r = requests.post(url, data=json.dumps(json_data), headers=headers)
When I perform r.text the apostrophe in the string is giving me as
like this Bachelor\u2019s Degree. This should actually give me the response as Bachelor's Degree.
I tried json.loads also but the single quote problem remains the same,
How to get the string value correctly.
What you see here ("Bachelor\u2019s Degree") is the string's inner representation, where "\u2019" is the unicode codepoint for "RIGHT SINGLE QUOTATION MARK". This is perfectly correct, there's nothing wrong here, if you print() this string you'll get what you expect:
>>> s = 'Bachelor\u2019s Degree'
>>> print(s)
Bachelor’s Degree
Learning about unicode and encodings might save you quite some time FWIW.
EDIT:
When I save in db and then on displaying on HTML it will cause issue
right?
Have you tried ?
Your database connector is supposed to encode it to the proper encoding (according to your fields, tables and client encoding settings).
wrt/ "displaying it on HTML", it mostly depends on whether you're using Python 2.7.x or Python 3.x AND on how you build your HTML, but if you're using some decent framework with a decent template engine (if not you should reconsider your stack) chances are it will work out of the box.
As I already mentionned, learning about unicode and encodings will save you a lot of time.
It's just using a UTF-8 encoding, it is not "wrong".
string = 'Bachelor\u2019s Degree'
print(string)
Bachelor’s Degree
You can decode and encode it again, but I can't see any reason why you would want to do that (this might not work in Python 2):
string = 'Bachelor\u2019s Degree'.encode().decode('utf-8')
print(string)
Bachelor’s Degree
From requests docs:
When you make a request, Requests makes educated guesses about the
encoding of the response based on the HTTP headers. The text encoding
guessed by Requests is used when you access r.text
On the response object, you may use .content instead of .text to get the response in UTF-8

Handling foreign characters in HTTP POST with gae python

It's puzzling me that I've got 2 functions with HTTP POST where one breaks foreign characters and I just do self.request.POST.get('text') to get the value in both functions. The difference I see is that where it breaks it inherits blobstoreuploadhandler so therefore I suspect that it might have to do with that change. I don't understand for example why ÅÄÖ first works and then I make a seemingly unrelated change and suddently any non-ASCII character get mangled.
Please help me understand how python should work with unicode and utf-8.
I have the complete 2 code examples where one works and the other distorts foreign characters like ÅÄÖ and I just need to know what to change and I think it should be possible to adjust so that it behaves as expected.
To understand exactly what the problem is maybe it helps to know that if I input ÅÄÖ the output becomes xcTW when it should be ÅÄÖ.
The 2 pieces of code mentioned are
class AList(RequestHandler, I18NHandler):
...
a.text = self.request.POST.get('text')
The above works. Then I changed to
class AList(RequestHandler, I18NHandler, blobstore_handlers.BlobstoreUploadHandler):
...
a.text = self.request.POST.get('text')
And this seems to be the only difference. The 2 ideas I have is deploying 2 examples with the same app and see what is really causing this issue since it may or may not be in the code I paste here.
And this is also just a production issue when locally foreign characters work as expected.
It seems it is related to the usage of blobstoreuploadhandler since the following reproduces the garbled characters by email:
class ContactUploadHandler(blobstore_handlers.BlobstoreUploadHandler):
def post(self):
message = mail.EmailMessage(sender='admin#myapplicationatappspot.com', subject=self.request.POST.get('subject'))
message.body = ('%s \nhttp://www.myapplicationatappspot.com/') % ( self.request.POST.get('text') )
message.to='info#myapplicationatappspot.com'
message.send()
self.redirect('/service.html')
It looks like you've hit this bug: http://code.google.com/p/googleappengine/issues/detail?id=2749
As a workaround until it gets fixed, you can encode all your input in base64 using JavaScript. It's not ideal but it did the trick for me.
xcTW is the result of base-64 encoding the cp1252 or latin1 encoding of those 3 characters; see the following IDLE session:
>>> import base64; print repr(base64.b64decode('xcTW'))
'\xc5\xc4\xd6'
>>> print repr('ÅÄÖ')
'\xc5\xc4\xd6'
>>>
BUT base-64 encoding mangles ASCII characters as well:
>>> base64.b64encode('abcdef')
'YWJjZGVm'
>>>
Looks like you need to look into the transfer encoding.
If you can't work out from this what is happening, try publishing your two pieces of code.
Update More of the train of thought: a "blob" is a Binary Large OBject, hence the base64 encoding to ensure that it can be transported across a network that might not be 8-bit clean. I'm not sure why you are using blobs if you are expecting text. If you really must stick that 3rd arg in there, then just use base64.b64decode() on the bytes that are returned. If all else fails, read the gae docs to see if there's a way of turning off the base 64 encoding.
Even more ToT: perhaps the blobhandler transmits in ASCII if it fits otherwise base64-encodes it -- this would fit with the reported behaviour. In that case you have to detect what the encoding is. I say again: read the gae docs.

how to url-safe encode a string with python? and urllib.quote is wrong

Hello i was wondering if you know any other way to encode a string to a url-safe, because urllib.quote is doing it wrong, the output is different than expected:
If i try
urllib.quote('á')
i get
'%C3%A1'
But thats not the correct output, it should be
%E1
As demostrated by the tool provided here this site
And this is not me being difficult, the incorrect output of quote is preventing the browser to found resources, if i try
urllib.quote('\images\á\some file.jpg')
And then i try with the javascript tool i mentioned i get this strings respectively
%5Cimages%5C%C3%A1%5Csome%20file.jpg
%5Cimages%5C%E1%5Csome%20file.jpg
Note how is almost the same but the url provided by quote doesn't work and the other one it does.
I tried messing with encode('utf-8) on the string provided to quote but it does not make a difference.
I tried with other spanish words with accents and the ñ they all are differently represented.
Is this a python bug?
Do you know some module that get this right?
According to RFC 3986, %C3%A1 is correct. Characters are supposed to be converted to an octet stream using UTF-8 before the octet stream is percent-encoded. The site you link is out of date.
See Why does the encoding's of a URL and the query string part differ? for more detail on the history of handling non-ASCII characters in URLs.
Ok, got it, i have to encode to iso-8859-1 like this
word = u'á'
word = word.encode('iso-8859-1')
print word
Python is interpreted in ASCII by default, so even though your file may be encoded differently, your UTF-8 char is interpereted as two ASCII chars.
Try putting a comment as the first of second line of your code like this to match the file encoding, and you might need to use u'á' also.
# coding: utf-8
What about using unicode strings and the numeric representation (ord) of the char?
>>> print '%{0:X}'.format(ord(u'á'))
%E1
In this question it seems some guy wrote a pretty large function to convert to ascii urls, thats what i need. But i was hoping there was some encoding tool in the std lib for the job.

How to fix unicode issue when using a web service with Python Suds

I am trying to work with the HORRIBLE web services at Commission Junction (CJ). I can get the client to connect and receive information from CJ, but their database seems to include a bunch of bad characters that cause a UnicideDecodeError.
Right now I am doing:
from suds.client import Client
wsdlLink = 'https://link-search.api.cj.com/wsdl/version2/linkSearchServiceV2.wsdl'
client = Client(wsdlLink)
result = client.service.searchLinks(developerKey='XXX', websiteId='XXX', promotionType='coupon')
This works fine until I hit a record that has something like 'CorpNet® 10% Off Any Service' then the ® causes it to break and I get
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 758: ordinal not in range(128)" error.
Is there a way to encode the ® on my end so that it does not break when SUDS reads in the result?
UPDATE:
To clarify, the ® is coming from the CJ database and is in their response. SO somehow I need to decode the non-ascii characters BEFORE SUDS deals with the response. I am not sure how (or if) this is done in SUDs.
Implicit UnicodeDecodeErrors is something you get when trying to add str and unicode objects. Python will then try to decode the str into unicode, but using the ASCII encoding. If your str then contains anything that is not ascii, you will get this error.
Your solution is the decode it manually like so:
thestring = thestring.decode('utf8')
Try, as much as possible, to decode any string that may contain non-ascii characters as soo as you are handed it from whatever module you get it from, in this case suds.
Then, if suds can't handle Unicode (which may be the case) make sure you encode it back just before handing the text back to suds (or any other library that breaks if you give it unicode).
That should solve things nicely. It may be a big change, as you need to move all your internal processing from str to unicode, but it's worth it. :)
The "registered" character is U+00AE and is encoded as "\xc2\xae" in UTF-8. It looks like you have a str object encoded in UTF-8 but some code is doing (probably by default) your_str_object.decode("ascii") which will fail with the error message you showed.
What you need to do is show us a complete example (i.e. ALL the code necessary to get the error), plus the full error message and traceback, so that at least we can guess whether the problem is in your code or in imported code.
I am using SUDS to interface with Salesforce via their SOAP API. I ran into the same situation until I followed #J.F.Sabastian's advice by not mixing str and unicode string types. For example, passing a SOQL string like this does work with SUDS 0.3.9:
qstr = u"select Id, FirstName, LastName from Contact where FirstName='%s' and LastName='%s'" % (u'Jorge', u'López')
I did not seem to need to do str.decode("utf-8") either.
If you're running your script from PyDev on Eclipse, you might want to go into Project => Properties and under Resource, set "Text File Encoding" to UTF-8, on my Mac, this defaults to "MacRoman". I suppose on Windoze, the default is either Cp1252 or ISO-8859-1 (Latin). You could also set this in your Workspace of your Projects inherit this setting from their workspace. This only effects the program source code.

Decoding unicode from Javascript in Python & Django

On a website I have the word pluș sent via POST to a Django view.
It is sent as plu%25C8%2599. So I took that string and tried to figure out a way how to make %25C8%2599 back into ș.
I tried decoding the string like this:
from urllib import unquote_plus
s = "plu%25C8%2599"
print unquote_plus(unquote_plus(s).decode('utf-8'))
The result i get is pluÈ which actually has a length of 5, not 4.
How can I get the original string pluș after it's encoded ?
edit:
I managed to do it like this
def js_unquote(quoted):
quoted = quoted.encode('utf-8')
quoted = unquote_plus(unquote_plus(quoted)).decode('utf-8')
return quoted
It looks weird but works the way I needed it.
URL-decode twice, then decode as UTF-8.
You can't unless you know what the encoding is. Unicode itself is not an encoding. You might try BeautifulSoup or UnicodeDammit, which might help you get the result you were hoping for.
http://www.crummy.com/software/BeautifulSoup/
I hope this helps!
Also take a look at:
http://www.joelonsoftware.com/articles/Unicode.html
unquote_plus(s).encode('your_lang_encoding')
I was try like that. I was tried to sent a json POST request by HTML form to directly a django URI, which is included unicode characters like "şğüöçı+" and it works. I have used iso_8859-9 encoder in encode() function.

Categories