Python urllib urlencode problem with æøå

How can I urlencode a string with special characters such as æøå?
For example:
urllib.urlencode('http://www.test.com/q=testæøå')
I get this error:
not a valid non-string sequence or mapping object

urlencode is intended to take a dictionary, for example:
>>> q= u'\xe6\xf8\xe5' # u'æøå'
>>> params= {'q': q.encode('utf-8')}
>>> 'http://www.test.com/?'+urllib.urlencode(params)
'http://www.test.com/?q=%C3%A6%C3%B8%C3%A5'
If you just want to URL-encode a single string, the function you're looking for is quote:
>>> 'http://www.test.com/?q='+urllib.quote(q.encode('utf-8'))
'http://www.test.com/?q=%C3%A6%C3%B8%C3%A5'
I'm guessing UTF-8 is the right encoding (it should be, for modern sites). If what you actually want is ?q=%E6%F8%E5, then the encoding you want is probably cp1252 (similar to iso-8859-1).
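On Python 3 the same functions live in urllib.parse and accept str values directly, with an encoding parameter that defaults to UTF-8. A minimal sketch, assuming Python 3 (the variable names are only illustrative):

```python
from urllib.parse import quote, urlencode

q = "æøå"

# urlencode takes a mapping (or a sequence of pairs) and percent-encodes
# str values as UTF-8 by default.
utf8_query = urlencode({"q": q})

# quote handles a single string; `encoding` selects the byte encoding
# applied before percent-encoding.
utf8_value = quote(q)
cp1252_value = quote(q, encoding="cp1252")

print("http://www.test.com/?" + utf8_query)
print("http://www.test.com/?q=" + cp1252_value)
```

Which encoding is correct depends on what the receiving site expects; UTF-8 is the usual choice.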

You should pass a dictionary to urlencode, not a string. See the correct example below:
from urllib import urlencode
print 'http://www.test.com/?' + urlencode({'q': 'testæøå'})

Python converting characters to URL encoding

I am trying to convert text into URLs, but certain characters are not being converted as I expect. For example:
>>> import urllib.parse
>>> my_text="City of Liège"
>>> my_url=urllib.parse.quote(my_text,safe='')
>>> my_url
'City%20of%20Li%C3%A8ge'
The spaces get converted properly; however, the "è" should get converted into %E8, but it is returned as %C3%A8. What am I missing?
I am using Python 3.6.
quote encodes your string as UTF-8 by default, and the URL-encoded string reflects this.
0xC3 0xA8 is the UTF-8 encoding of the Unicode code point U+00E8, which is described as "LATIN SMALL LETTER E WITH GRAVE".
In order to get the string you are after, you need to tell Python which code page to use, like this:
my_text=bytes("City of Liège",'cp1252')
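In Python 3, urllib.parse.quote also accepts an encoding argument, so the conversion to bytes can be left to quote itself. A small sketch, assuming the receiving server really expects cp1252 (that is an assumption about the target site, not something quote can detect):

```python
from urllib.parse import quote

my_text = "City of Liège"

# Default behaviour: the str is encoded as UTF-8 before percent-encoding.
utf8_url = quote(my_text, safe="")

# Ask quote to encode as cp1252 instead, so "è" becomes the single byte 0xE8.
cp1252_url = quote(my_text, safe="", encoding="cp1252")

# Passing pre-encoded bytes gives the same result as the line above.
from_bytes = quote(my_text.encode("cp1252"), safe="")

print(utf8_url)
print(cp1252_url)
```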

Converting url encoded string(utf-8) to string in python?

I have a URL-encoded token (UTF-8), "EC%2d2EC7", and I want to convert it to "EC-2EC7", i.e. convert %2d to -.
>>> token = "EC%2d2EC7"
>>> token.encode("utf-8")
'EC%2d2EC7'
I also tried urllib.quote but got the same result. Is the problem that the token is already in UTF-8, so it can't be converted? What can I do?
My python version: 2.7.10
You can use urllib.unquote:
from urllib import unquote
print unquote("EC%2d2EC7")
Another way is to use requests.utils.unquote:
from requests.utils import unquote
print unquote("EC%2d2EC7")
Output:
EC-2EC7
You are looking for unquote instead of decode.
urllib.unquote('EC%2d2EC7')
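For reference, in Python 3 the function has moved to urllib.parse. A minimal sketch of the same call there:

```python
from urllib.parse import unquote

token = "EC%2d2EC7"

# unquote reverses percent-encoding; the hex digits are case-insensitive,
# so %2d and %2D both decode to "-".
decoded = unquote(token)
print(decoded)
```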

Is there any way to make simplejson less strict?

I'm interested in having simplejson.loads() successfully parse the following:
{foo:3}
It throws a JSONDecodeError saying "expecting property name" but in reality it's saying "I require double quotes around my property names". This is annoying for my use case, and I'd prefer a less strict behavior. I've read the docs, but beyond making my own decoder class, I don't see anything obvious that changes this behavior.
You can use YAML (>= 1.2), since it is a superset of JSON:
>>> import yaml
>>> s = '{foo: 8}'
>>> yaml.safe_load(s)
{'foo': 8}
You can try demjson.
>>> import demjson
>>> demjson.decode('{foo:3}')
{u'foo': 3}
No, this is not possible. To successfully parse that using simplejson you would first need to transform it into a valid JSON string.
Depending on how strict the format of your incoming string is, this could be pretty simple or extremely complex.
For a simple case, if you will always have a JSON object that only has letters and underscores in keys (without quotes) and integers as values, you could use the following to transform it into valid JSON:
import re
your_string = re.sub(r'([a-zA-Z_]+)', r'"\1"', your_string)
For example:
>>> re.sub(r'([a-zA-Z_]+)', r'"\1"', '{foo:3, bar:4}')
'{"foo":3, "bar":4}'
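Note that the pattern above quotes every run of letters, so a bare true, false, or null value would also be turned into a string. A slightly narrower sketch, assuming keys are simple identifiers and no string value contains text that looks like `, key:` (the helper name quote_bare_keys is hypothetical, and the standard json module stands in for simplejson here):

```python
import json
import re

# Match an identifier only where a key can appear: after "{" or ",",
# and before ":".
BARE_KEY = re.compile(r'([{,]\s*)([A-Za-z_][A-Za-z0-9_]*)\s*:')

def quote_bare_keys(text):
    """Wrap unquoted object keys in double quotes (hypothetical helper)."""
    return BARE_KEY.sub(r'\1"\2":', text)

fixed = quote_bare_keys('{foo:3, bar:true}')
data = json.loads(fixed)
print(fixed)
print(data)
```

This is still a text-level patch rather than a parser, so it only suits input whose shape is known in advance.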

how to encode url in python

I have created a function for decoding a URL.
from urllib import unquote

def unquote_u(source):
    result = source
    if '%u' in result:
        result = result.replace('%u','\\u').decode('unicode_escape')
    result = unquote(result)
    print result
    return result

if __name__=='__main__':
    unquote_u('{%22%22%3A%22test_%E5%93%A6%E4%BA%88%E4%BB%A5%E8%85%BF%E5%93%A6.doc.txt%22%2C%22mimeType%22%3A%22text%2Fplain%22%2C%22compressed%22%3Afalse%7D')
But I am not able to get the proper file name.
The proper file name is: test_哦予以腿哦.doc
Can anyone tell me how to do that?
urllib.unquote can do it:
>>> urllib.unquote('{%22%22%3A%22test_%E5%93%A6%E4%BA%88%E4%BB%A5%E8%85%BF%E5%93%A6.doc.txt%22%2C%22mimeType%22%3A%22text%2Fplain%22%2C%22compressed%22%3AFalse%7D')
'{"":"test_\xe5\x93\xa6\xe4\xba\x88\xe4\xbb\xa5\xe8\x85\xbf\xe5\x93\xa6.doc.txt","mimeType":"text/plain","compressed":False}'
>>> eval(_)
{'': 'test_\xe5\x93\xa6\xe4\xba\x88\xe4\xbb\xa5\xe8\x85\xbf\xe5\x93\xa6.doc.txt', 'mimeType': 'text/plain', 'compressed': False}
>>> _['']
'test_\xe5\x93\xa6\xe4\xba\x88\xe4\xbb\xa5\xe8\x85\xbf\xe5\x93\xa6.doc.txt'
>>> print _
test_哦予以腿哦.doc.txt
Note that I had to change "false" to "False" in the quoted string. Also note that the string after unquote is still UTF-8 encoded; you can use str.decode('utf8') to get a Unicode string if that is what you require.
As JBernardo mentions, eval() of unsafe data is a very bad idea. Anybody knowing, or even suspecting, that a server-side script is eval()-ing form data can easily craft a POST with commands that can compromise the server. Better would be this:
>>> import json, urllib
>>> json.loads(urllib.unquote('{%22%22%3A%22test_%E5%93%A6%E4%BA%88%E4%BB%A5%E8%85%BF%E5%93%A6.doc.txt%22%2C%22mimeType%22%3A%22text%2Fplain%22%2C%22compressed%22%3Afalse%7D'))['']
u'test_\u54e6\u4e88\u4ee5\u817f\u54e6.doc.txt'
>>> print _
test_哦予以腿哦.doc.txt
Also note that this latter approach didn't require changing false to False; in fact, it doesn't work if I do. The json package takes care of that.
One thing to add: after getting the unquoted URL from urllib.unquote(url), you probably need to use decode('utf8') to convert the raw byte string into a Unicode string.
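On Python 3 the extra decode step is not needed, because urllib.parse.unquote decodes the percent-escaped bytes as UTF-8 by default and returns a str. A sketch of the json-based approach under that assumption:

```python
import json
from urllib.parse import unquote

quoted = ('{%22%22%3A%22test_%E5%93%A6%E4%BA%88%E4%BB%A5%E8%85%BF%E5%93%A6'
          '.doc.txt%22%2C%22mimeType%22%3A%22text%2Fplain%22%2C'
          '%22compressed%22%3Afalse%7D')

# unquote yields JSON text; json.loads parses it safely (no eval needed)
# and maps the JSON literal false to Python's False.
info = json.loads(unquote(quoted))
file_name = info[""]
print(file_name)
```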

Unescape Python Strings From HTTP

I've got a string from an HTTP header, but it's been escaped.. what function can I use to unescape it?
myemail%40gmail.com -> myemail@gmail.com
Would urllib.unquote() be the way to go?
I am pretty sure that urllib's unquote is the common way of doing this.
>>> import urllib
>>> urllib.unquote("myemail%40gmail.com")
'myemail@gmail.com'
There's also unquote_plus:
Like unquote(), but also replaces plus signs by spaces, as required for unquoting HTML form values.
In Python 3, these functions are urllib.parse.unquote and urllib.parse.unquote_plus.
The latter is used, for example, for query strings in HTTP URLs, where the space character (' ') is traditionally encoded as a plus character (+), and + itself is percent-encoded as %2B.
In addition to these, there is unquote_to_bytes, which converts the given encoded string to bytes and can be used when the encoding is not known or the encoded data is binary. However, there is no unquote_plus_to_bytes; if you need it, you can do:
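from urllib.parse import unquote_to_bytes

def unquote_plus_to_bytes(s):
    # Replace plus signs with spaces first, then percent-decode to bytes.
    if isinstance(s, bytes):
        s = s.replace(b'+', b' ')
    else:
        s = s.replace('+', ' ')
    return unquote_to_bytes(s)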
More information on whether to use unquote or unquote_plus is available at URL encoding the space character: + or %20.
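To make the difference concrete, here is a small Python 3 sketch comparing the three functions on one input (the sample string is made up for illustration):

```python
from urllib.parse import unquote, unquote_plus, unquote_to_bytes

sample = "a+b%2Bc%20d"

plain = unquote(sample)         # "+" is left alone
form = unquote_plus(sample)     # "+" becomes a space before decoding
raw = unquote_to_bytes(sample)  # like unquote, but returns bytes

print(repr(plain))
print(repr(form))
print(repr(raw))
```

Only unquote_plus treats the literal plus sign as a space, which is why it suits form-encoded query strings but not URL paths.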
Yes, it appears that urllib.unquote() accomplishes that task. (I tested it against your example on codepad.)
