I have a function to download a website's HTML code with the urllib3 library. I'm using the request_encode_url function to pass arguments via GET, and it works fine as long as I do not use special Latin characters like 'ñ'. If I use 'ñ', the URL is not properly encoded.
For instance, if I pass an argument like "El señor", this function converts it to "El+señor" instead of "El+se%F1or".
import urllib3

z = 'El señor'
fields = {'sec': 'search', 'value': z}
http = urllib3.PoolManager()
r = http.request_encode_url('GET', 'http://www.myurl.com/search.php', fields)
The expected URL should be:
http://www.myurl.com/search.php?sec=search&value=El+se%F1or
but if I use special characters I get the following URL instead:
http://www.myurl.com/search.php?sec=search&value=El+señor
Can somebody tell me how to pass arguments with special characters so that the URL is encoded correctly?
I'm using Python 3.4
I found a solution. It may be a silly fix (my Python level is low), but I solved it by encoding the string to Latin-1:
import urllib3

z = 'El señor'
fields = {'sec': 'search', 'value': z.encode('latin1')}
http = urllib3.PoolManager()
r = http.request_encode_url('GET', 'http://www.myurl.com/search.php', fields)
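An alternative sketch, not from the original answer and assuming Python 3: let urllib.parse.urlencode build the Latin-1 percent-encoded query string, then append it to the URL yourself, so the values don't need to be pre-encoded:

import urllib.parse
import urllib3

# Build the query string with Latin-1 percent-encoding, then request the full URL.
fields = {'sec': 'search', 'value': 'El señor'}
query = urllib.parse.urlencode(fields, encoding='latin-1')  # 'sec=search&value=El+se%F1or'
http = urllib3.PoolManager()
r = http.request('GET', 'http://www.myurl.com/search.php?' + query)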
Related
I want to mimic URL encoding for Chinese characters. For my use case, I have a search URL for an e-commerce site:
'https://search.jd.com/Search?keyword={}'.format('ipad')
When I search for a product in English, this works fine. However, the input needs to be in Chinese, so I tried
'https://search.jd.com/Search?keyword={}'.format('耐克t恤')
and found the following encoding under the network tab:
https://list.tmall.com/search_product.htm?q=%C4%CD%BF%CBt%D0%F4
So basically, I need to encode input like '耐克t恤' into '%C4%CD%BF%CBt%D0%F4'. I'm not sure which encoding the website is using. Also, how do I convert Chinese characters to this encoding with Python?
Update: I checked the headers and it seems like the content encoding is gzip?
Try the urllib.parse module, more specifically the urllib.parse.urlencode() function. You can pass the encoding (in this case it appears to be 'gb2312') and a dict containing the query parameters to get a valid URL suffix that you can use directly.
In this case, your code will look something like:
import urllib.parse
keyword = '耐克t恤'
url = 'https://search.jd.com/Search?{url_suffix}'.format(url_suffix=urllib.parse.urlencode({'keyword': keyword}, encoding='gb2312'))
More info about encoding here
More info about urlencode here
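For reference, and as an assumption on my part since I haven't queried the live site: printing url after the snippet above should give something like https://search.jd.com/Search?keyword=%C4%CD%BF%CBt%D0%F4, which matches the bytes you saw in the network tab (the ASCII 't' is left unencoded by quote_plus, which urlencode uses internally).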
The encoding used seems to be GB2312
This could help you:
def encodeGB2312(data):
    hexData = data.encode(encoding='GB2312').hex().upper()
    encoded = '%' + '%'.join(hexData[i:i + 2] for i in range(0, len(hexData), 2))
    return encoded

output = encodeGB2312('耐克t恤')
print(output)

url = f'https://list.tmall.com/search_product.htm?q={output}'
print(url)
Output:
%C4%CD%BF%CB%74%D0%F4
https://list.tmall.com/search_product.htm?q=%C4%CD%BF%CB%74%D0%F4
The only problem with my code is that its output doesn't correspond 100% with the link you are trying to achieve: it converts the 't' character into GB2312 percent-encoding as well, while your link uses the non-encoded 't' character. Although it still seems to work when opening the URL.
Edit:
Vignesh Bayari R's post handles the URL in the correct (intended) way, but in this case my solution works too.
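A shorter sketch along the same lines, assuming Python 3 and that the site really expects GB2312: urllib.parse.quote with encoding='gb2312' leaves unreserved ASCII characters such as 't' alone, which matches the URL observed in the browser:

import urllib.parse

# quote() percent-encodes the GB2312 bytes but leaves ASCII letters like 't' as-is.
keyword = '耐克t恤'
encoded = urllib.parse.quote(keyword, encoding='gb2312')
print(encoded)  # %C4%CD%BF%CBt%D0%F4
url = f'https://list.tmall.com/search_product.htm?q={encoded}'
print(url)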
I'm facing an issue when calling an API using the requests library. The problem is described as follows.
The code:
r = requests.post(url, data=json.dumps(json_data), headers=headers)
When I look at r.text, the apostrophe in the string comes back like this: Bachelor\u2019s Degree. It should actually give me the response as Bachelor's Degree.
I tried json.loads as well, but the single-quote problem remains the same.
How do I get the string value correctly?
What you see here ("Bachelor\u2019s Degree") is the string's inner representation, where "\u2019" is the Unicode code point for RIGHT SINGLE QUOTATION MARK. This is perfectly correct; there's nothing wrong here. If you print() this string you'll get what you expect:
>>> s = 'Bachelor\u2019s Degree'
>>> print(s)
Bachelor’s Degree
Learning about unicode and encodings might save you quite some time FWIW.
EDIT:
When I save it in the DB and then display it in HTML, it will cause an issue, right? Have you tried?
Your database connector is supposed to encode it to the proper encoding (according to your fields, tables and client encoding settings).
wrt/ "displaying it on HTML", it mostly depends on whether you're using Python 2.7.x or Python 3.x AND on how you build your HTML, but if you're using some decent framework with a decent template engine (if not you should reconsider your stack) chances are it will work out of the box.
As I already mentioned, learning about unicode and encodings will save you a lot of time.
It's just the escaped representation of a Unicode character; it is not "wrong".
string = 'Bachelor\u2019s Degree'
print(string)
Bachelor’s Degree
You can encode and decode it again, but I can't see any reason why you would want to do that (this might not work in Python 2):
string = 'Bachelor\u2019s Degree'.encode().decode('utf-8')
print(string)
Bachelor’s Degree
From requests docs:
When you make a request, Requests makes educated guesses about the
encoding of the response based on the HTTP headers. The text encoding
guessed by Requests is used when you access r.text
On the response object, you can use .content instead of .text to get the raw response bytes, without the decoding guessed by requests.
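A small sketch, assuming the response body is JSON (the field name here is made up): parsing it turns the \u2019 escape into the actual character, and r.json() does the same for a requests response:

import json

# The \u2019 escape in the raw JSON text becomes a real ’ once parsed.
body = '{"degree": "Bachelor\\u2019s Degree"}'
parsed = json.loads(body)
print(parsed['degree'])  # Bachelor’s Degree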
I have been searching all over the place for this, but I couldn't solve my issue.
I am using a local API to fetch some data, in that API, the wildcard is the percent character %.
The URL is like so: urlReq = 'http://myApiURL?ID=something&parameter=%w42'
And then I'm passing this to the get function:
req = requests.get(urlReq, auth=HTTPBasicAuth(user, password))
And get the following error: InvalidURL: Invalid percent-escape sequence: 'w4'
I have tried escaping the % character using %%, but in vain. I also tried the following:
urlReq = 'http://myApiURL?ID=something&parameter=%sw42' % '%', but that didn't work either.
Does anyone know how to solve this?
PS I'm using Python 2.7.8 :: Anaconda 1.9.1 (64-bit)
You should have a look at urllib.quote; that should do the trick. See the docs for reference.
To expand on this answer: the problem is that % (plus a hexadecimal number) is the escape sequence for special characters in URLs. If you want the server to interpret your % literally, you need to escape it as well, which is done by replacing it with %25. The aforementioned quote function does that kind of thing for you.
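A minimal sketch, assuming Python 2 (to match the question) and keeping the made-up URL from the question: quote the value so the literal % becomes %25 before building the URL:

import urllib

# Percent-encode the value; the literal '%' becomes '%25'.
value = urllib.quote('%w42')
urlReq = 'http://myApiURL?ID=something&parameter=' + value
# urlReq is now 'http://myApiURL?ID=something&parameter=%25w42'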
Let requests construct the query string for you by passing the parameters in the params argument to requests.get() (see documentation):
api_url = 'http://myApiURL'
params = {'ID': 'something', 'parameter': '%w42'}
r = requests.get(api_url, params=params, auth=(user, password))
requests should then percent encode the parameters in the query string for you. Having said that, at least with requests version 2.11.1 on my machine, I find that the % is encoded when passing it in the url, so perhaps you could check which version you are using.
Also for basic authentication you can simply pass the user name and password in a tuple as shown above.
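As a quick sanity check (the exact output is an assumption on my part), you can print the prepared URL; the % should already be percent-encoded as %25, which the server decodes back to a literal %:

print(r.url)  # e.g. http://myApiURL/?ID=something&parameter=%25w42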
In requests you can use requests.compat.quote_plus. Here, take a look at an example:
>>> requests.compat.quote_plus('example: parameter=%w42')
'example%3A+parameter%3D%25w42'
Credits to @Tryph:
The % is used to encode special characters in URLs. You can encode the % character with the sequence %25. See here for more detail: w3schools.com/tags/ref_urlencode.asp
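A short sketch tying this together, assuming only the parameter value needs quoting (user and password are placeholders):

import requests

# Encode just the value, then build the URL by hand.
value = requests.compat.quote_plus('%w42')  # '%25w42'
urlReq = 'http://myApiURL?ID=something&parameter=' + value
req = requests.get(urlReq, auth=('user', 'password'))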
I am using the Robot Framework to automate some HTTP POST-related tests. I wrote a custom Python library that has a function to do an HTTP POST. It looks like this:
import httplib2

# This function will do an HTTP POST and return the JSON response
def Http_Post_using_python(json_dict, url):
    post_data = json_dict.encode('utf-8')
    headers = {}
    headers['Content-Type'] = 'application/json'
    h = httplib2.Http()
    resp, content = h.request(url, 'POST', post_data, headers)
    return resp, content
This works fine as long as I am not using any Unicode characters. When I have Unicode characters in the json_dict variable (for example, 메시지), it fails with this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xeb in position 164: ordinal not in range(128)
I am running Python 2.7.3 on Windows 7. I saw several related questions, but I have not been able to resolve the issue. I am new to Python and programming, so any help is appreciated.
Thanks.
You're getting this error because json_dict is a str, not a unicode. Without knowing anything else about the application, a simple solution would be:
if isinstance(json_dict, unicode):
    json_dict = json_dict.encode("utf-8")
post_data = json_dict
However, if you're using json.dumps(…) to create the json_dict, then you don't need to encode it – that will be done by json.dumps(…).
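A minimal Python 2 sketch along those lines, with a made-up payload and URL: json.dumps returns an ASCII-safe str (non-ASCII characters become \uXXXX escapes), so it can be posted without a further .encode() call:

import json
import httplib2

# json.dumps escapes non-ASCII characters, so the result is a plain ASCII str.
payload = {u'message': u'메시지'}
post_data = json.dumps(payload)
headers = {'Content-Type': 'application/json'}
h = httplib2.Http()
resp, content = h.request('http://example.com/api', 'POST', post_data, headers)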
Use requests:
requests.post(url, data=data, headers=headers)
It will deal with the encodings for you.
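A minimal sketch of that, with a hypothetical endpoint: serialize the dict with json.dumps and let requests send the ASCII-safe body as-is:

import json
import requests

payload = {u'message': u'메시지'}
headers = {'Content-Type': 'application/json'}
r = requests.post('http://example.com/api', data=json.dumps(payload), headers=headers)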
You're getting an error because of Python 2's automatic encoding/decoding, which is basically a bug and was fixed in Python 3. In brief, Python 2's str objects are really "bytes", and the right way to handle string data is in a unicode object. Since unicodes were introduced later, Python 2 will automatically try to convert between them and strings when you get them confused. To do so it needs to know an encoding; since you don't specify one, it defaults to ascii which doesn't have the characters needed.
Why is Python automatically trying to decode for you? Because you're calling .encode() on a str object. It's already encoded, so Python first tries to decode it for you, and guesses the ascii encoding.
You should read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Try this:
#coding=utf-8
test = "메시지"
test.decode('utf8')
In the line #coding=utf-8 I just set the file encoding to UTF-8 (to be able to write "메시지").
You need to decode the string from UTF-8. See the decode method documentation.
I am having trouble getting my Python script to pass Unicode data over a RESTful HTTP call.
I have a script that reads data from web site X using a REST interface and then pushes it into web site Y using its REST interface. Both systems are open source and run on our servers. Site X uses PHP, Apache and PostgreSQL. Site Y is Java, Tomcat and PostgreSQL. The script doing the processing is currently in Python.
In general, the script works very well. We do have a few international users, and when trying to process a user with Unicode characters in their name things break down. The original version of the script read the JSON data into Python. The data was converted automagically into Unicode. I am pretty sure everything was working fine up to this point. To output the data I used subprocess.Popen() to call curl. This works for regular ASCII, but the Unicode was getting mangled somewhere in transit. I didn't get an error anywhere, but when viewing the results on site Y it is no longer correctly encoded.
I know that Unicode is supported for these fields because I can craft a request using Firefox that correctly adds the data to site Y.
My next idea was to not use curl, but just do everything in Python. I experimented by passing a hand-constructed Unicode string to Python's urllib to make the REST call, but I received an error from urllib.urlopen():
UnicodeEncodeError: 'ascii' codec can't encode characters in position 103-105: ordinal not in range(128)
Any ideas on how to make this work? I would rather not re-write too much, but if there is another scripting language that would be better suited I wouldn't mind hearing about that also.
Here is my Python test script:
import urllib

uni = u"abc_\u03a0\u03a3\u03a9"
post = u"xdat%3Auser.login=unitest&"
post += u"xdat%3Auser.primary_password=nauihe4r93nf83jshhd83&"
post += u"xdat%3Auser.firstname=" + uni + "&"
post += u"xdat%3Auser.lastname=" + uni
url = u"http://localhost:8081/xnat/app/action/XDATRegisterUser"
data = urllib.urlopen(url, post).read()
With regard to your test script, it is failing because you are passing a unicode object to urllib.urlencode() (it is being called for you by urlopen()). It does not support unicode objects, so it implicitly encodes them using the default charset, which is ascii. Obviously, it fails.
The simplest way to handle POSTing unicode objects is to be explicit: gather your data and build a dict, encode the unicode values with an appropriate charset, urlencode the dict (to get a POSTable ASCII string), then initiate the request. Your example could be rewritten as:
import urllib
import urllib2

## Build our post data dict
data = {
    'xdat:user.login': u'unitest',
    'xdat:user.primary_password': u'nauihe4r93nf83jshhd83',
    'xdat:user.firstname': u"abc_\u03a0\u03a3\u03a9",
    'xdat:user.lastname': u"abc_\u03a0\u03a3\u03a9",
}

## Encode the unicode values using an appropriate charset
data = dict([(key, value.encode('utf8')) for key, value in data.iteritems()])

## Urlencode it for POSTing
data = urllib.urlencode(data)

## Build a POST request, get the response
url = "http://localhost:8081/xnat/app/action/XDATRegisterUser"
request = urllib2.Request(url, data)
response = urllib2.urlopen(request)
EDIT: More generally, when you make an HTTP request with Python (say, urllib2.urlopen), the content of the response is not decoded to unicode for you. That means you need to be aware of the encoding used by the server that sent it. Look at the Content-Type header; usually it includes a charset=xyz.
It is always prudent to decode your input as early as possible, and encode your output as late as possible.
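A rough sketch of that advice, assuming Python 2 and a server that reports its charset in the Content-Type header (the URL is just a placeholder):

import urllib2

response = urllib2.urlopen('http://example.com/some/resource')
# Pull the charset out of the Content-Type header; fall back to UTF-8 if absent.
charset = response.info().getparam('charset') or 'utf-8'
body = response.read().decode(charset)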