Converting a URL-encoded string (UTF-8) to a string in Python?

I have a URL-encoded token (UTF-8), "EC%2d2EC7", and I want to convert it to "EC-2EC7", i.e. convert %2d to -.
>>> token = "EC%2d2EC7"
>>> token.encode("utf-8")
'EC%2d2EC7'
I also tried urllib.quote, but got the same result. Is the problem that the token is already in UTF-8, so it can't be converted? What can I do?
My Python version: 2.7.10

You can use urllib.unquote:
from urllib import unquote
print unquote("EC%2d2EC7")
Another way is to use requests.utils.unquote:
from requests.utils import unquote
print unquote("EC%2d2EC7")
Output:
EC-2EC7

You are looking for unquote, not encode or decode:
urllib.unquote('EC%2d2EC7')
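For readers on Python 3, here is a minimal sketch of the same call. There, unquote lives in urllib.parse and decodes percent-escapes as UTF-8 by default. The variable names are only illustrative.

```python
from urllib.parse import unquote

token = "EC%2d2EC7"
decoded = unquote(token)  # %2d is the percent-escape for "-"
print(decoded)
```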

Related

Converting a unicode format string into actual utf string

When I call an API, I get output like this:
\\u092f\\u0939 \\u0909\\u0926\\u093e\\u0939\\u0930\\u0923 \\u092a\\u093e\\u0920
The Unicode escapes are stored as literal text, so string[1] here would be 'u'.
Does anyone know how to convert this into an actual UTF-8 string using Python?
You can use the codecs module:
>>> s = "\\u092f\\u0939 \\u0909\\u0926\\u093e\\u0939\\u0930\\u0923 \\u092a\\u093e\\u0920"
>>> print(s)
\u092f\u0939 \u0909\u0926\u093e\u0939\u0930\u0923 \u092a\u093e\u0920
So:
>>> import codecs
>>> codecs.decode(s, 'unicode-escape')
'यह उदाहरण पाठ'
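An equivalent spelling that uses only str and bytes methods, as a rough sketch. It assumes the input contains only ASCII characters plus literal \uXXXX escapes; any other non-ASCII text would make the encode step fail. The sample value is just the first two escapes from the question.

```python
# Literal backslash-u sequences, as they arrive from the API.
s = "\\u092f\\u0939"

# Turn the text into bytes, then let the unicode_escape codec
# interpret the \uXXXX sequences.
decoded = s.encode("ascii").decode("unicode_escape")
print(decoded)
```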

Python: decode to url format

I need to convert the Cyrillic string
Астрахань
to
%C0%F1%F2%F0%E0%F5%E0%ED%FC
I tried to use
urllib.parse.quote_plus()
but it returns
%D0%90%D1%81%D1%82%D1%80%D0%B0%D1%85%D0%B0%D0%BD%D1%8C
What should I use to convert to another format?
I would guess that you want the Windows cp1251 encoding. quote_plus uses UTF-8 by default, but it also accepts a specific encoding:
>>> import urllib.parse
>>> print(urllib.parse.quote_plus('Астрахань', encoding='cp1251'))
%C0%F1%F2%F0%E0%F5%E0%ED%FC
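To check that the encoding choice is right, a quick round trip can help. This is only a sketch: it assumes the receiving side really expects cp1251, which the question suggests but does not confirm.

```python
from urllib.parse import quote_plus, unquote_plus

city = "Астрахань"

# Encode with cp1251 instead of the UTF-8 default.
encoded = quote_plus(city, encoding="cp1251")

# Decoding must use the same encoding to get the original text back.
restored = unquote_plus(encoded, encoding="cp1251")

print(encoded)
print(restored)
```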

Python converting characters to URL encoding

I am trying to convert texts into URLs, but certain characters are not being converted as I'm expecting. For example:
>>> import urllib.parse
>>> my_text="City of Liège"
>>> my_url=urllib.parse.quote(my_text,safe='')
>>> my_url
'City%20of%20Li%C3%A8ge'
The spaces get converted properly; however, the "è" should be converted to %E8, but it comes back as %C3%A8. What am I missing?
I am using Python 3.6.
quote encodes the string as UTF-8 by default, and the URL-encoded result reflects this.
The bytes 0xC3 0xA8 are the UTF-8 encoding of the Unicode code point U+00E8, which is described as "LATIN SMALL LETTER E WITH GRAVE".
To get the string you are after, you need to tell Python which code page to use, for example by encoding the text to bytes before quoting it:
my_text=bytes("City of Liège",'cp1252')
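A short sketch comparing the three variants in Python 3. quote accepts either pre-encoded bytes or an encoding argument; the variable names are illustrative.

```python
from urllib.parse import quote

my_text = "City of Liège"

# Default behaviour: the text is encoded as UTF-8 before escaping.
utf8_url = quote(my_text, safe="")

# Ask quote to use cp1252 instead.
cp1252_url = quote(my_text, safe="", encoding="cp1252")

# Same result when passing bytes that are already cp1252-encoded.
bytes_url = quote(my_text.encode("cp1252"), safe="")

print(utf8_url)
print(cp1252_url)
```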

How to read text file into pandas and convert utf8 chars to string / latin? [duplicate]

I have a list containing URLs with escaped characters in them. Those characters have been set by urllib2.urlopen when it recovers the html page:
http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=edit
http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=history
http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&variant=zh
Is there a way to transform them back to their unescaped form in python?
P.S.: The URLs are encoded in utf-8
Using the urllib package (import urllib):
Python 2.7
From the official documentation:
urllib.unquote(string)
Replace %xx escapes by their single-character equivalent.
Example: unquote('/%7Econnolly/') yields '/~connolly/'.
Python 3
From the official documentation:
urllib.parse.unquote(string, encoding='utf-8', errors='replace')
[…]
Example: unquote('/El%20Ni%C3%B1o/') yields '/El Niño/'.
If you are using Python 3, you could use:
import urllib.parse
urllib.parse.unquote(url)
Or use urllib.unquote_plus, which also converts plus signs to spaces:
>>> import urllib
>>> urllib.unquote('erythrocyte+membrane+protein+1%2C+PfEMP1+%28VAR%29')
'erythrocyte+membrane+protein+1,+PfEMP1+(VAR)'
>>> urllib.unquote_plus('erythrocyte+membrane+protein+1%2C+PfEMP1+%28VAR%29')
'erythrocyte membrane protein 1, PfEMP1 (VAR)'
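Applied to one of the URLs from the question, a Python 3 sketch might look like this. It assumes, as the question states, that the escapes are UTF-8. parse_qs is used here as one way to get at the individual query parameters; it is not part of the answers above.

```python
from urllib.parse import parse_qs, unquote, urlsplit

url = "http://www.sample1webpage.com/index.php?title=%E9%A6%96%E9%A1%B5&action=edit"

# Unescape the whole URL for display purposes.
readable = unquote(url)

# Or split off the query string and decode each parameter separately.
params = parse_qs(urlsplit(url).query)

print(readable)
print(params)
```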
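You can use urllib.unquote, or a minimal regex-based equivalent (Python 2, where chr returns a byte string):
import re
def unquote(url):
    # Replace each %xx escape with the character whose code is the hex value xx.
    return re.compile('%([0-9a-fA-F]{2})',re.M).sub(lambda m: chr(int(m.group(1),16)), url)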
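In Python 3, chr returns a Unicode character rather than a byte, so the same regex idea would garble multi-byte UTF-8 sequences. A rough sketch of a byte-aware variant follows; the name unquote_utf8 is hypothetical, and urllib.parse.unquote remains the simpler choice in practice.

```python
import re

# Matches one percent-escape such as %E9 in a bytes string.
_PERCENT_ESCAPE = re.compile(rb"%([0-9a-fA-F]{2})")

def unquote_utf8(url):
    """Replace %xx escapes with raw bytes, then decode the result as UTF-8."""
    raw = url.encode("utf-8")
    unescaped = _PERCENT_ESCAPE.sub(
        lambda m: bytes([int(m.group(1), 16)]), raw
    )
    # Invalid byte sequences become U+FFFD instead of raising.
    return unescaped.decode("utf-8", errors="replace")

print(unquote_utf8("title=%E9%A6%96%E9%A1%B5"))
```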

Python urllib urlencode problem with æøå

How can I urlencode a string with special chars æøå?
For example:
urllib.urlencode('http://www.test.com/q=testæøå')
I get this error:
not a valid non-string sequence or mapping object
urlencode is intended to take a dictionary, for example:
>>> q= u'\xe6\xf8\xe5' # u'æøå'
>>> params= {'q': q.encode('utf-8')}
>>> 'http://www.test.com/?'+urllib.urlencode(params)
'http://www.test.com/?q=%C3%A6%C3%B8%C3%A5'
If you just want to URL-encode a single string, the function you're looking for is quote:
>>> 'http://www.test.com/?q='+urllib.quote(q.encode('utf-8'))
'http://www.test.com/?q=%C3%A6%C3%B8%C3%A5'
I'm guessing UTF-8 is the right encoding (it should be, for modern sites). If what you actually want is ?q=%E6%F8%E5, then the encoding you want is probably cp1252 (similar to iso-8859-1).
You should pass a dictionary to urlencode, not a string. See the correct example below:
from urllib import urlencode
print 'http://www.test.com/?' + urlencode({'q': 'testæøå'})
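The examples above are Python 2. In Python 3, urlencode lives in urllib.parse, takes str values directly, and accepts an encoding argument. A small sketch, with illustrative variable names:

```python
from urllib.parse import urlencode

params = {"q": "testæøå"}

# Default: values are percent-encoded as UTF-8.
utf8_query = urlencode(params)

# For a site that expects cp1252 instead.
cp1252_query = urlencode(params, encoding="cp1252")

url = "http://www.test.com/?" + utf8_query
print(url)
print(cp1252_query)
```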
