Script having trouble passing Unicode through a REST interface - python

I am having trouble getting my Python script to ass Unicode data over RESTful http call.
I have a script that reads data from web site X using a REST interface and then pushes it into web site Y using it's REST interface. Both system are open source and are run on our servers. Site X uses PHP, Apache and PostgreSQL. Site Y is Java, Tomcat and PostgreSQL. The script doing the processing is currently in Python.
In general, the script works very well. We do have a few international users, and when trying to process a user with unicode characters in their name things break down. The original version of the script read the JSON data into the Python. The data was converted automagically into Unicode. I am pretty sure everything was working fine up to this point. To output the data I used subprocess.Popen() to call curl. This works for regular ascii, but the unicode was getting mangled somewhere in transit. I didn't get an error anywhere, but when viewing the results on site B it is no longer correctly encoded.
I know that Unicode is supported for these fields because I can craft a request using Firefox that correctly adds the data to site B.
Next idea was to not use curl, but just do everything in Python. I experimented by passing a hand constructed Unicode string to Python's urllib to make the REST call, but I received an error from urllib.urlopen():
UnicodeEncodeError: 'ascii' codec can't encode characters in position 103-105: ordinal not in range(128)
Any ideas on how to make this work? I would rather not re-write too much, but if there is another scripting language that would be better suited I wouldn't mind hearing about that also.
Here is my Python test script:
import urllib
uni = u"abc_\u03a0\u03a3\u03a9"
post = u"xdat%3Auser.login=unitest&"
post += u"xdat%3Auser.primary_password=nauihe4r93nf83jshhd83&"
post += u"xdat%3Auser.firstname=" + uni + "&"
post += u"xdat%3Auser.lastname=" + uni ;
url = u"http://localhost:8081/xnat/app/action/XDATRegisterUser"
data = urllib.urlopen(url,post).read()

With regard to your test script, it is failing because you are passing unicode object to urllib.urlencode() (it is being called for you by urlopen()). It does not support unicode objects, so it implicitly encodes using the default charset, which is ascii. Obviously, it fails.
The simplest way to handle POSTing unicode objects is to be explicit; Gather your data and build a dict, encode unicode values with an appropriate charset, urlencode the dict (to get a POSTable ascii string), then initiate the request. Your example could be rewritten as:
import urllib
import urllib2
## Build our post data dict
data = {
'xdat:user.login' : u'unitest',
'xdat:primary_password' : u'nauihe4r93nf83jshhd83',
'xdat:firstname' : u"abc_\u03a0\u03a3\u03a9",
'xdat:lastname' : u"abc_\u03a0\u03a3\u03a9",
}
## Encode the unicode using an appropriate charset
data = dict([(key, value.encode('utf8')) for key, value in data.iteritems()])
## Urlencode it for POSTing
data = urllib.urlencode(data)
## Build a POST request, get the response
url = "http://localhost:8081/xnat/app/action/XDATRegisterUser"
request = urllib2.Request(url, data)
response = urllib2.urlopen(request)
EDIT: More generally, when you make an http request with python (say urllib2.urlopen),
the content of the response is not decoded to unicode for you. That means you need to be aware of the encoding used by the server that sent it. Look at the content-type header; Usually it includes a charset=xyz.
It is always prudent to decode your input as early as possible, and encode your output as late as possible.

Related

python request library giving wrong value single quotes

Facing some issue in calling API using request library. Problem is described as follows
The code:.
r = requests.post(url, data=json.dumps(json_data), headers=headers)
When I perform r.text the apostrophe in the string is giving me as
like this Bachelor\u2019s Degree. This should actually give me the response as Bachelor's Degree.
I tried json.loads also but the single quote problem remains the same,
How to get the string value correctly.
What you see here ("Bachelor\u2019s Degree") is the string's inner representation, where "\u2019" is the unicode codepoint for "RIGHT SINGLE QUOTATION MARK". This is perfectly correct, there's nothing wrong here, if you print() this string you'll get what you expect:
>>> s = 'Bachelor\u2019s Degree'
>>> print(s)
Bachelor’s Degree
Learning about unicode and encodings might save you quite some time FWIW.
EDIT:
When I save in db and then on displaying on HTML it will cause issue
right?
Have you tried ?
Your database connector is supposed to encode it to the proper encoding (according to your fields, tables and client encoding settings).
wrt/ "displaying it on HTML", it mostly depends on whether you're using Python 2.7.x or Python 3.x AND on how you build your HTML, but if you're using some decent framework with a decent template engine (if not you should reconsider your stack) chances are it will work out of the box.
As I already mentionned, learning about unicode and encodings will save you a lot of time.
It's just using a UTF-8 encoding, it is not "wrong".
string = 'Bachelor\u2019s Degree'
print(string)
Bachelor’s Degree
You can decode and encode it again, but I can't see any reason why you would want to do that (this might not work in Python 2):
string = 'Bachelor\u2019s Degree'.encode().decode('utf-8')
print(string)
Bachelor’s Degree
From requests docs:
When you make a request, Requests makes educated guesses about the
encoding of the response based on the HTTP headers. The text encoding
guessed by Requests is used when you access r.text
On the response object, you may use .content instead of .text to get the response in UTF-8

Fetching URL and converting to UTF-8 Python

I would like to do my first project in python but I have problem with coding. When I fetch data it shows coded letters instead of my native letters, for example '\xc4\x87' instead of 'ć'. The code is below:
import urllib.request
import sys
page = urllib.request.urlopen("http://olx.pl/")
test = page.read()
print(test)
print(sys.stdin.encoding)
z = "ł"
print(z)
print(z.encode("utf-8"))
I know that code here is poor but I tried many options to change encoding. I wrote z = "ł" to check if it can print any 'special' letter and it shows. I tried to encode it and it works also as it should. Sys.stdin.encoding shows cp852.
The data you read from a urlopen() response is encoded data. You'd need to first decode that data using the right encoding.
You appear to have downloaded UTF-8 data; you'd have to decode that data first before you had text:
test = page.read().decode('utf8')
However, it is up to the server to tell you what data was received. Check for a characterset in the headers:
encoding = page.info().getparam('charset')
This can still be None; many data formats include the encoding as part of the format. XML for example is UTF-8 by default but the XML declaration at the start can contain information about what codec was used for that document. An XML parser would extract that information to ensure you get properly decoded Unicode text when parsing.
You may not be able to print that data; the 852 codepage can only handle 256 different codepoints, while the Unicode standard is far larger.
The urlopen is returning to you a bytes object. That means it's a raw, encoded stream of bytes. Python 3 prints that in a repr format, which uses escape codes for non-ASCII characters. To get the canonical unicode you would have to decode it. The right way to do that would be to inspect the header and look for the encoding declaration. But for this we can assume UTF-8 and you can simply decode it as such, not encode it.
import urllib.request
import sys
page = urllib.request.urlopen("http://olx.pl/")
test = page.read()
print(test.decode("utf-8")) # <- note change
Now, Python 3 defaults to UTF-8 source encoding. So you can embed non-ASCII like this if your editor supports unicode and saving as UTF-8.
z = "ł"
print(z)
Printing it will only work if your terminal supports UTF-8 encoding. On Linux and OSX they do, so this is not a problem there.
The others are correct, but I'd like to offer a simpler solution. Use requests. It's 3rd party, so you'll need to install it via pip:
pip install requests
But it's a lot simpler to use than the urllib libraries. For your particular case, it handles the decoding for you out of the box:
import requests
r = requests.get("http://olx.pl/")
print(r.encoding)
# UTF-8
print(type(r.text))
# <class 'str'>
print(r.text)
# The HTML
Breakdown:
get sends an HTTP GET request to the server and returns the respose.
We print the encoding requests thinks the text is in. It chooses this based on the response header Martijin mentions.
We show that r.text is already a decoded text type (unicode in Python 2 and str in Python 3)
Then we actually print the response.
Note that we don't have to print the encoding or type; I've just done so for diagnostic purposes to show what requests is doing. requests is designed to simplify a lot of other details of working with HTTP requests, and it does a good job of it.

lxml.html parsing and utf-8 with requests

i used requests to retrieve a url which contains some unicode characters, and want to do some processing with it , then write it out.
r=requests.get(url)
f=open('unicode_test_1.html','w');f.write(r.content);f.close()
html = lxml.html.fromstring(r.content)
htmlOut = lxml.html.tostring(html)
f=open('unicode_test_2.html','w');f.write(htmlOut);f.close()
in unicode_test_1.html, all chars looks fine, but in unicode_test_2.html, some chars changed to gibberish, why is that ?
i then tried
html = lxml.html.fromstring(r.text)
htmlOut = lxml.html.tostring(html,encoding='latin1')
f=open('unicode_test_2.html','w');f.write(htmlOut);f.close()
it seems it's working now. but i don't know why is this happening, always use latin1 ?
what's the difference between r.text and r.content, and why can't i write html out using encoding='utf-8' ?
You've not specified if you're using python 2 or 3. Encoding is handled quite differently depending on which version you're using. The following advice is more or less universal anyway.
The difference between r.text and r.content is in the Requests docs. Simply put Requests will attempt to figure out the character encoding for you and return Unicode after decoding it. This which is accessible via r.text. To get just the bytes use r.content.
You really need to get to grips with encodings. Read http://www.joelonsoftware.com/articles/Unicode.html and watch https://www.youtube.com/watch?v=sgHbC6udIqc to get started. Also, do a search for "Overcoming frustration: Correctly using unicode in python2" for additional help.
Just to clarify, it's not as simple as always use one encoding over another. Make a Unicode sandwich by doing any I/O in bytes and work with Unicode in your application. If you start with bytes (isinstance(mytext, str)) you need to know the encoding to decode to Unicode, if you start with Unicode (isinstance(mytext, unicode)) you should encode to UTF-8 as it will handle all the worlds characters.
Make sure your editor, files, server and database are configured to UTF-8 also otherwise you'll get more 'gibberish'.
If you want further help post the source files and output of your script.

Decoding error with my Python function

I am using the Robot framework to automate some HTTP POST related tests. I wrote a custom Python library that has a function to do a HTTP POST. It looks like this:
# This function will do a http post and return the json response
def Http_Post_using_python(json_dict,url):
post_data = json_dict.encode('utf-8')
headers = {}
headers['Content-Type'] = 'application/json'
h = httplib2.Http()
resp, content = h.request(url,'POST',post_data,headers)
return resp, content
This works fine as long as I am not using any Unicode characters. When I have Unicode characters in the json_dict variable (for example, 메시지), it fails with this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xeb in position 164: ordinal not in range(128)
I am running Python 2.7.3 on Windows 7. I saw several related questions, but I have not been able to resolve the issue. I am new to Python and programming, so any help is appreciated.
Thanks.
You're getting this error because json_dict is a str, not a unicode. Without knowing anything else about the application, a simple solution would be:
if isinstance(json_dict, unicode):
json_dict = json_dict.encode("utf-8")
post_data = json_dict
However, if you're using json.dumps(…) to create the json_dict, then you don't need to encode it – that will be done by json.dumps(…).
Use requests:
requests.post(url, data=data, headers=headers)
It will deal with the encodings for you.
You're getting an error because of Python 2's automatic encoding/decoding, which is basically a bug and was fixed in Python 3. In brief, Python 2's str objects are really "bytes", and the right way to handle string data is in a unicode object. Since unicodes were introduced later, Python 2 will automatically try to convert between them and strings when you get them confused. To do so it needs to know an encoding; since you don't specify one, it defaults to ascii which doesn't have the characters needed.
Why is Python automatically trying to decode for you? Because you're calling .encode() on a str object. It's already encoded, so Python first tries to decode it for you, and guesses the ascii encoding.
You should read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Try this:
#coding=utf-8
test = "메시지"
test.decode('utf8')
In the line #coding=utf-8 i just set the file encoding to UTF-8 (to be able to write "메시지").
You need to decode the string into utf-8. decode method documentation

How to fix unicode issue when using a web service with Python Suds

I am trying to work with the HORRIBLE web services at Commission Junction (CJ). I can get the client to connect and receive information from CJ, but their database seems to include a bunch of bad characters that cause a UnicideDecodeError.
Right now I am doing:
from suds.client import Client
wsdlLink = 'https://link-search.api.cj.com/wsdl/version2/linkSearchServiceV2.wsdl'
client = Client(wsdlLink)
result = client.service.searchLinks(developerKey='XXX', websiteId='XXX', promotionType='coupon')
This works fine until I hit a record that has something like 'CorpNet® 10% Off Any Service' then the ® causes it to break and I get
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 758: ordinal not in range(128)" error.
Is there a way to encode the ® on my end so that it does not break when SUDS reads in the result?
UPDATE:
To clarify, the ® is coming from the CJ database and is in their response. SO somehow I need to decode the non-ascii characters BEFORE SUDS deals with the response. I am not sure how (or if) this is done in SUDs.
Implicit UnicodeDecodeErrors is something you get when trying to add str and unicode objects. Python will then try to decode the str into unicode, but using the ASCII encoding. If your str then contains anything that is not ascii, you will get this error.
Your solution is the decode it manually like so:
thestring = thestring.decode('utf8')
Try, as much as possible, to decode any string that may contain non-ascii characters as soo as you are handed it from whatever module you get it from, in this case suds.
Then, if suds can't handle Unicode (which may be the case) make sure you encode it back just before handing the text back to suds (or any other library that breaks if you give it unicode).
That should solve things nicely. It may be a big change, as you need to move all your internal processing from str to unicode, but it's worth it. :)
The "registered" character is U+00AE and is encoded as "\xc2\xae" in UTF-8. It looks like you have a str object encoded in UTF-8 but some code is doing (probably by default) your_str_object.decode("ascii") which will fail with the error message you showed.
What you need to do is show us a complete example (i.e. ALL the code necessary to get the error), plus the full error message and traceback, so that at least we can guess whether the problem is in your code or in imported code.
I am using SUDS to interface with Salesforce via their SOAP API. I ran into the same situation until I followed #J.F.Sabastian's advice by not mixing str and unicode string types. For example, passing a SOQL string like this does work with SUDS 0.3.9:
qstr = u"select Id, FirstName, LastName from Contact where FirstName='%s' and LastName='%s'" % (u'Jorge', u'López')
I did not seem to need to do str.decode("utf-8") either.
If you're running your script from PyDev on Eclipse, you might want to go into Project => Properties and under Resource, set "Text File Encoding" to UTF-8, on my Mac, this defaults to "MacRoman". I suppose on Windoze, the default is either Cp1252 or ISO-8859-1 (Latin). You could also set this in your Workspace of your Projects inherit this setting from their workspace. This only effects the program source code.

Categories