Decoding error with my Python function - python

I am using the Robot Framework to automate some HTTP POST related tests. I wrote a custom Python library with a function that does an HTTP POST. It looks like this:
import httplib2

# This function will do an HTTP POST and return the JSON response
def Http_Post_using_python(json_dict, url):
    post_data = json_dict.encode('utf-8')
    headers = {}
    headers['Content-Type'] = 'application/json'
    h = httplib2.Http()
    resp, content = h.request(url, 'POST', post_data, headers)
    return resp, content
This works fine as long as I am not using any Unicode characters. When I have Unicode characters in the json_dict variable (for example, 메시지), it fails with this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xeb in position 164: ordinal not in range(128)
I am running Python 2.7.3 on Windows 7. I saw several related questions, but I have not been able to resolve the issue. I am new to Python and programming, so any help is appreciated.
Thanks.

You're getting this error because json_dict is a str, not a unicode. Without knowing anything else about the application, a simple solution would be:
if isinstance(json_dict, unicode):
    json_dict = json_dict.encode("utf-8")
post_data = json_dict
However, if you're using json.dumps(…) to create json_dict, then you don't need to encode it at all: in Python 2, json.dumps(…) returns an ASCII-safe str (non-ASCII characters become \uXXXX escapes), which can be posted as-is.
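For illustration, a minimal Python 2 sketch of that approach (the endpoint URL is a placeholder):
# -*- coding: utf-8 -*-
import json
import httplib2

# json.dumps() turns the dict into an ASCII-safe str (non-ASCII becomes \uXXXX),
# so it can be posted directly without a further .encode() call.
payload = json.dumps({'message': u'메시지'})

headers = {'Content-Type': 'application/json'}
h = httplib2.Http()
resp, content = h.request('http://example.com/endpoint', 'POST', payload, headers)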

Use requests:
requests.post(url, data=data, headers=headers)
It will deal with the encodings for you.
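A minimal sketch of what that could look like for the original question (the URL is a placeholder, and json.dumps is used to build the body):
# -*- coding: utf-8 -*-
import json
import requests

payload = {'message': u'메시지'}
headers = {'Content-Type': 'application/json'}
# requests accepts a plain str body and sends it unchanged; json.dumps gives an ASCII-safe str
r = requests.post('http://example.com/endpoint', data=json.dumps(payload), headers=headers)
print r.status_code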
You're getting an error because of Python 2's automatic encoding/decoding, which is basically a bug and was fixed in Python 3. In brief, Python 2's str objects are really "bytes", and the right way to handle string data is in a unicode object. Since unicode was introduced later, Python 2 will automatically try to convert between the two when you mix them up. To do so it needs to know an encoding; since you don't specify one, it defaults to ascii, which doesn't have the characters needed.
Why is Python automatically trying to decode for you? Because you're calling .encode() on a str object. It's already encoded, so Python first tries to decode it for you, and guesses the ascii encoding.
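What that looks like in an interactive Python 2 session (illustrative):
>>> s = '\xeb\xa9\x94\xec\x8b\x9c\xec\xa7\x80'   # '메시지' as UTF-8 bytes (a str)
>>> s.encode('utf-8')         # Python implicitly tries s.decode('ascii') first, and fails
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xeb in position 0: ordinal not in range(128)
>>> u = s.decode('utf-8')     # decode the bytes to a unicode object instead
>>> u.encode('utf-8') == s    # now the round trip works
True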
You should read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Try this:
#coding=utf-8
test = "메시지"
test.decode('utf8')
In the line #coding=utf-8 I just set the file encoding to UTF-8 (to be able to write "메시지").
You need to decode the string from UTF-8 into unicode; see the str.decode method documentation.

Related

Python 2.7, Requests library, can't get unicode

The documentation for the Requests library says that requests.get() always returns unicode. But when I check what encoding was detected, I see "windows-1251". That's a problem: when I try to print requests.get(url).text, I get an error, because this URL's content contains Cyrillic characters.
import requests
url = 'https://www.weblancer.net/jobs/'
r = requests.get(url)
print r.encoding
print r.text
I got something like that:
windows-1251
UnicodeEncodeError: 'ascii' codec can't encode characters in position 256-263: ordinal not in range(128)
Is this a problem with Python 2.7, or is there no problem at all?
Help me.
From the docs:
Requests will automatically decode content from the server. Most
unicode charsets are seamlessly decoded.
When you make a request, Requests makes educated guesses about the
encoding of the response based on the HTTP headers.
requests.get().encoding tells you the encoding that was used to convert the byte stream from the server into the Unicode text in the response.
In your case it is correct: the headers in the response say that the character set is windows-1251.
The error you are having is after that. The python you are using is trying to encode the Unicode into ascii to print it, and failing.
You can say print r.text.encode(r.encoding) ... which is the same result as Padraic's suggestion in comments - that is r.content.
Note:
requests.get().encoding is a writable attribute: you can set it to whatever you want if Requests guessed wrongly.
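A minimal Python 2 sketch of both points (encoding explicitly before printing, and overriding the guessed encoding); whether an override is actually needed depends on the page:
import requests

r = requests.get('https://www.weblancer.net/jobs/')
print r.encoding                    # the encoding Requests guessed from the headers
print r.text.encode(r.encoding)     # encode the unicode text before printing (same bytes as r.content)
r.encoding = 'utf-8'                # the attribute is writable, so a wrong guess can be overridden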

Fetching URL and converting to UTF-8 Python

I would like to do my first project in Python, but I have a problem with encodings. When I fetch data it shows encoded bytes instead of my native letters, for example '\xc4\x87' instead of 'ć'. The code is below:
import urllib.request
import sys
page = urllib.request.urlopen("http://olx.pl/")
test = page.read()
print(test)
print(sys.stdin.encoding)
z = "ł"
print(z)
print(z.encode("utf-8"))
I know the code here is poor, but I have tried many options to change the encoding. I wrote z = "ł" to check whether it can print a 'special' letter at all, and it does. I tried to encode it and that also works as it should. sys.stdin.encoding shows cp852.
The data you read from a urlopen() response is encoded data. You'd need to first decode that data using the right encoding.
You appear to have downloaded UTF-8 data; you'd have to decode that data first before you had text:
test = page.read().decode('utf8')
However, it is up to the server to tell you what encoding the data is in. Check for a character set in the headers:
encoding = page.info().get_content_charset()
This can still be None; many data formats include the encoding as part of the format. XML for example is UTF-8 by default but the XML declaration at the start can contain information about what codec was used for that document. An XML parser would extract that information to ensure you get properly decoded Unicode text when parsing.
You may not be able to print that data; the 852 codepage can only handle 256 different codepoints, while the Unicode standard is far larger.
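Putting those steps together, a minimal sketch could look like this (falling back to UTF-8 when no charset is declared is an assumption, not a guarantee):
import urllib.request

page = urllib.request.urlopen("http://olx.pl/")
charset = page.info().get_content_charset() or "utf-8"   # charset from the headers, with a fallback
test = page.read().decode(charset)                        # bytes -> str
print(test[:200])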
urlopen is returning a bytes object to you. That means it's a raw, encoded stream of bytes. Python 3 prints that in repr format, which uses escape codes for non-ASCII characters. To get proper Unicode text you would have to decode it. The right way to do that would be to inspect the header and look for the encoding declaration. But for this we can assume UTF-8, and you can simply decode it as such, not encode it.
import urllib.request
import sys
page = urllib.request.urlopen("http://olx.pl/")
test = page.read()
print(test.decode("utf-8")) # <- note change
Now, Python 3 defaults to UTF-8 source encoding, so you can embed non-ASCII like this as long as your editor supports Unicode and saves the file as UTF-8.
z = "ł"
print(z)
Printing it will only work if your terminal supports UTF-8 encoding. On Linux and OSX they do, so this is not a problem there.
The others are correct, but I'd like to offer a simpler solution. Use requests. It's 3rd party, so you'll need to install it via pip:
pip install requests
But it's a lot simpler to use than the urllib libraries. For your particular case, it handles the decoding for you out of the box:
import requests
r = requests.get("http://olx.pl/")
print(r.encoding)
# UTF-8
print(type(r.text))
# <class 'str'>
print(r.text)
# The HTML
Breakdown:
get sends an HTTP GET request to the server and returns the response.
We print the encoding requests thinks the text is in. It chooses this based on the response header Martijn mentions.
We show that r.text is already a decoded text type (unicode in Python 2 and str in Python 3)
Then we actually print the response.
Note that we don't have to print the encoding or type; I've just done so for diagnostic purposes to show what requests is doing. requests is designed to simplify a lot of other details of working with HTTP requests, and it does a good job of it.

Python string encoding issue

I am using the Amazon MWS API to get the sales report for my store and then save that report in a table in the database. Unfortunately I am getting an encoding error when I try to encode the information as Unicode. After looking through the report (exactly as amazon sent it) I saw this string which is the location of the buyer:
'S�o Paulo'
so I tried to encode it like so:
encodeme = 'S�o Paulo'
encodeme.encode('utf-8)
but got the following error
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1: ordinal not in range(128)
The whole reason why I am trying to encode it is because as soon as Django sees the � character it throws a warning and cuts off the string, meaning that the location is saved as S instead of
São Paulo
Any help is appreciated.
It looks like you are having some kind of encoding problem.
First, you should be very certain what encoding Amazon is using in the report body they send you. Is it UTF-8? Is it ISO 8859-1? Something else?
Unfortunately the Amazon MWS Reports API documentation, especially their API Reference, is not very forthcoming about what encoding they use. The only encoding I see them mention is UTF-8, so that should be your first guess. The GetReport API documentation (p. 36-37) describes the response element Report as being of type xs:string, but I don't see where they define that data type. Maybe they mean XML Schema's string datatype.
So, I suggest you save the byte sequence you are receiving as your report body from Amazon in a file, with zero transformations. Be aware that your code which calls AWS might be modifying the report body string inadvertently. Examine the non-ASCII bytes in that file with a binary editor. Is the "São" of "São Paulo" stored as S\xC3\xA3o, indicating UTF-8 encoding? Or is it stored as S\xE3o, indicating ISO 8859-1 encoding?
I'm guessing that you receive your report as a flat file. The Amazon AWS documentation says that you can request reports be delivered to you as XML. This would have the advantage of giving you a reply with an explicit encoding declaration.
Once you know the encoding of the report body, you now need to handle it properly. You imply that you are using the Django framework and Python language code to receive the report from Amazon AWS.
One thing to get very clear (as Skirmantas also explains):
Unicode strings hold characters. Byte strings hold bytes (octets).
Encoding converts a Unicode string into a byte string.
Decoding converts a byte string into a Unicode string.
The string you get from Amazon AWS is a byte string. You need to decode it to get a Unicode string. But your code fragment, encodeme = 'São Paulo', gives you a byte string. encodeme.encode('utf-8) performs an encode() on the byte string, which isn't what you want. (The missing closing quote on 'utf-8 doesn't help.)
Try this example code:
>>> reportbody = 'S\xc3\xa3o Paulo' # UTF-8 encoded byte string
>>> reportbody.decode('utf-8') # returns a Unicode string, u'...'
u'S\xe3o Paulo'
You might find some background reading helpful. I agree with Hoxieboy that you should take the time to read Python's Unicode HOWTO. Also check out the top answers to What do I need to know about Unicode?.
I think you have to decode it using a correct encoding rather than encode it to utf-8. Try
s = s.decode('utf-8')
However, you need to know which encoding to use. Input can come in encodings other than utf-8.
The UnicodeDecodeError you received means that your object is not unicode; it is a bytestring. When you call bytestring.encode, the string is first decoded into a unicode object with the default encoding (ascii) and only then encoded with utf-8.
I'll try to explain the difference between a unicode string and a utf-8 bytestring in Python.
unicode is a Python datatype which represents a Unicode string. You use unicode for most string operations in your program. How Python stores it internally doesn't matter to you.
bytestring is a binary-safe string. It can be in any encoding. When you receive data, for example when you open a file, you get a bytestring, and in most cases you will want to decode it to unicode. When you write to a file, you have to encode unicode objects into bytestrings. Sometimes the decoding/encoding is done for you by a framework or library, but a framework cannot always do this, because it cannot always know which encoding to use.
utf-8 is an encoding which can correctly represent any unicode string as a bytestring. However, you can't decode just any bytestring with utf-8 into unicode; you need to know which encoding was used to produce the bytestring in order to decode it.
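A short Python 2 illustration of that last point: the same bytes decode to different text depending on which codec you name, so you have to know the source encoding.
data = 'S\xc3\xa3o Paulo'            # bytes produced by UTF-8 encoding
print repr(data.decode('utf-8'))     # u'S\xe3o Paulo'  -> the correct text
print repr(data.decode('latin-1'))   # u'S\xc3\xa3o Paulo' -> mojibake, but no error is raised
# data.decode('ascii') would raise UnicodeDecodeError at byte 0xc3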
Official Python unicode documentation
You might try that webpage if you haven't already and see if you can get the answer you're looking for ;)

Script having trouble passing Unicode through a REST interface

I am having trouble getting my Python script to pass Unicode data over a RESTful HTTP call.
I have a script that reads data from web site X using a REST interface and then pushes it into web site Y using its REST interface. Both systems are open source and run on our servers. Site X uses PHP, Apache and PostgreSQL. Site Y is Java, Tomcat and PostgreSQL. The script doing the processing is currently written in Python.
In general, the script works very well. We do have a few international users, and when trying to process a user with Unicode characters in their name, things break down. The original version of the script read the JSON data into Python. The data was converted automagically into Unicode. I am pretty sure everything was working fine up to this point. To output the data I used subprocess.Popen() to call curl. This works for regular ASCII, but the Unicode was getting mangled somewhere in transit. I didn't get an error anywhere, but when viewing the results on site Y the name is no longer correctly encoded.
I know that Unicode is supported for these fields because I can craft a request using Firefox that correctly adds the data to site Y.
Next idea was to not use curl, but just do everything in Python. I experimented by passing a hand constructed Unicode string to Python's urllib to make the REST call, but I received an error from urllib.urlopen():
UnicodeEncodeError: 'ascii' codec can't encode characters in position 103-105: ordinal not in range(128)
Any ideas on how to make this work? I would rather not re-write too much, but if there is another scripting language that would be better suited I wouldn't mind hearing about that also.
Here is my Python test script:
import urllib
uni = u"abc_\u03a0\u03a3\u03a9"
post = u"xdat%3Auser.login=unitest&"
post += u"xdat%3Auser.primary_password=nauihe4r93nf83jshhd83&"
post += u"xdat%3Auser.firstname=" + uni + "&"
post += u"xdat%3Auser.lastname=" + uni ;
url = u"http://localhost:8081/xnat/app/action/XDATRegisterUser"
data = urllib.urlopen(url,post).read()
With regard to your test script, it is failing because you are passing a unicode object as the POST body to urllib.urlopen(). The library needs an encoded byte string there, so the unicode is implicitly encoded using the default charset, which is ascii. Obviously, it fails.
The simplest way to handle POSTing unicode objects is to be explicit: gather your data and build a dict, encode the unicode values with an appropriate charset, urlencode the dict (to get a POSTable ascii string), then initiate the request. Your example could be rewritten as:
import urllib
import urllib2
## Build our post data dict
data = {
    'xdat:user.login': u'unitest',
    'xdat:user.primary_password': u'nauihe4r93nf83jshhd83',
    'xdat:user.firstname': u"abc_\u03a0\u03a3\u03a9",
    'xdat:user.lastname': u"abc_\u03a0\u03a3\u03a9",
}
## Encode the unicode using an appropriate charset
data = dict([(key, value.encode('utf8')) for key, value in data.iteritems()])
## Urlencode it for POSTing
data = urllib.urlencode(data)
## Build a POST request, get the response
url = "http://localhost:8081/xnat/app/action/XDATRegisterUser"
request = urllib2.Request(url, data)
response = urllib2.urlopen(request)
EDIT: More generally, when you make an HTTP request with Python (say urllib2.urlopen), the content of the response is not decoded to unicode for you. That means you need to be aware of the encoding used by the server that sent it. Look at the Content-Type header; usually it includes charset=xyz.
It is always prudent to decode your input as early as possible, and encode your output as late as possible.
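As a sketch of that advice, here is one way to decode a urllib2 response as early as possible, using the charset advertised in the Content-Type header (the fallback to UTF-8 when no charset is declared is an assumption):
import urllib2

response = urllib2.urlopen('http://example.com/')            # placeholder URL
charset = response.headers.getparam('charset') or 'utf-8'    # from the Content-Type header
text = response.read().decode(charset)                       # decode early; work with unicode from here on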

How to fix unicode issue when using a web service with Python Suds

I am trying to work with the HORRIBLE web services at Commission Junction (CJ). I can get the client to connect and receive information from CJ, but their database seems to include a bunch of bad characters that cause a UnicodeDecodeError.
Right now I am doing:
from suds.client import Client
wsdlLink = 'https://link-search.api.cj.com/wsdl/version2/linkSearchServiceV2.wsdl'
client = Client(wsdlLink)
result = client.service.searchLinks(developerKey='XXX', websiteId='XXX', promotionType='coupon')
This works fine until I hit a record that has something like 'CorpNet® 10% Off Any Service'; then the ® causes it to break and I get:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 758: ordinal not in range(128)
Is there a way to encode the ® on my end so that it does not break when SUDS reads in the result?
UPDATE:
To clarify, the ® is coming from the CJ database and is in their response. So somehow I need to decode the non-ascii characters BEFORE SUDS deals with the response. I am not sure how (or if) this is done in SUDS.
An implicit UnicodeDecodeError is something you get when trying to add str and unicode objects. Python will then try to decode the str into unicode, but using the ASCII encoding. If your str contains anything that is not ascii, you will get this error.
Your solution is the decode it manually like so:
thestring = thestring.decode('utf8')
Try, as much as possible, to decode any string that may contain non-ascii characters as soon as you are handed it from whatever module you get it from, in this case suds.
Then, if suds can't handle Unicode (which may be the case) make sure you encode it back just before handing the text back to suds (or any other library that breaks if you give it unicode).
That should solve things nicely. It may be a big change, as you need to move all your internal processing from str to unicode, but it's worth it. :)
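A generic Python 2 sketch of that decode-early / encode-late pattern (not tied to the actual suds response structure, which depends on the WSDL):
def to_unicode(value):
    # Decode incoming byte strings to unicode as soon as they arrive.
    if isinstance(value, str):
        return value.decode('utf-8')
    return value

def to_bytes(value):
    # Encode back to UTF-8 bytes only at the boundary, e.g. right before
    # handing text to a library that cannot handle unicode.
    if isinstance(value, unicode):
        return value.encode('utf-8')
    return value

title = to_unicode('CorpNet\xc2\xae 10% Off Any Service')   # u'CorpNet\xae 10% Off Any Service'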
The "registered" character is U+00AE and is encoded as "\xc2\xae" in UTF-8. It looks like you have a str object encoded in UTF-8 but some code is doing (probably by default) your_str_object.decode("ascii") which will fail with the error message you showed.
What you need to do is show us a complete example (i.e. ALL the code necessary to get the error), plus the full error message and traceback, so that at least we can guess whether the problem is in your code or in imported code.
I am using SUDS to interface with Salesforce via their SOAP API. I ran into the same situation until I followed @J.F. Sebastian's advice by not mixing str and unicode string types. For example, passing a SOQL string like this does work with SUDS 0.3.9:
qstr = u"select Id, FirstName, LastName from Contact where FirstName='%s' and LastName='%s'" % (u'Jorge', u'López')
I did not seem to need to do str.decode("utf-8") either.
If you're running your script from PyDev on Eclipse, you might want to go into Project => Properties and, under Resource, set "Text File Encoding" to UTF-8; on my Mac this defaults to "MacRoman". I suppose on Windows the default is either Cp1252 or ISO-8859-1 (Latin). You could also set this at the Workspace level so your Projects inherit the setting from the workspace. This only affects the program source code.
