How to unquote URL-quoted UTF-8 strings in Python

thestring = urllib.quote(thestring.encode('utf-8'))
This will encode it. How to decode it?

What about
backtonormal = urllib.unquote(thestring)
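For readers on Python 3, where these functions live in urllib.parse, the round trip can be sketched as follows. This is a minimal illustration, not the original poster's code; the sample string is an assumption chosen for the demo.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

original = "Jos\u00e9 El\u00edas"  # sample text ("José Elías"), assumed for illustration

# quote() encodes str input as UTF-8 by default before percent-escaping it.
quoted = quote(original)

# unquote() reverses the escaping and decodes the bytes as UTF-8 by default.
restored = unquote(quoted)

# unquote_to_bytes() stops one step earlier and returns the raw UTF-8 bytes.
raw = unquote_to_bytes(quoted)

print(quoted)
print(restored == original)
```

In Python 2 the equivalent of the last step is urllib.unquote(thestring), which returns a byte string that still has to be decoded with .decode('utf-8').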

If you mean decoding a string from UTF-8, you can first convert the byte string to unicode and then to any other encoding you like (or leave it as unicode), like this:
unicodethestring = unicode(thestring, 'utf-8')
latin1thestring = unicodethestring.encode('latin-1', 'ignore')
The 'ignore' argument means that any character that is not in the Latin-1 character set is silently dropped.
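The effect of the error handler can be seen in a small Python 3 sketch. The sample text is an assumption; it contains one character that Latin-1 can represent and one that it cannot.

```python
# 'é' (U+00E9) fits in Latin-1; the arrow (U+2197) does not.
text = "Jos\u00e9 \u2197 done"

# 'ignore' silently drops characters the target encoding cannot represent.
dropped = text.encode("latin-1", "ignore")

# 'replace' substitutes a question mark instead, so the loss stays visible.
replaced = text.encode("latin-1", "replace")

print(dropped)
print(replaced)
```

Because 'ignore' loses data without any warning, 'replace' or the default strict mode may be the safer choice when the loss matters.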

Related

Python codec safe_encode method

Given a byte string, for instance B = b"\x81\xc9\x00\x07I ABCD_\xe2\x86\x97_", I want to convert it to a printable string that keeps as much valid UTF-8 as possible: S = "\\x81\\xc9\\x00\\x07I ABCD_↗_". Note that the first group of hex bytes is not valid UTF-8, but the last three bytes do define a valid UTF-8 character (the arrow). It seems like this should be part of codecs, but I cannot figure out how to make it happen.
For instance,
>>> codecs.decode(codecs.escape_encode(B, 'utf-8')[0], 'utf-8')
'\\x81\\xc9\\x00\\x07I\\x19ABCD_\\xe2\\x86\\x97_'
escapes a valid UTF-8 character along with the invalid characters.
Specifying 'backslashreplace' as the error handler when decoding a byte string replaces undecodable bytes with backslashed escape sequences:
decoded = B.decode('utf-8', errors='backslashreplace')
Also, this is a decoding operation, not an encoding operation. Decoding is bytes -> string; encoding is string -> bytes.
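A short sketch of this on the byte string from the question, assuming Python 3.5 or later (where 'backslashreplace' is accepted for decoding). One caveat: \x00 and \x07 are valid UTF-8 (ASCII control characters), so they are decoded rather than escaped. The second step, which escapes non-printable characters, is an assumption about what the asker wants and is not part of the original answer.

```python
B = b"\x81\xc9\x00\x07I ABCD_\xe2\x86\x97_"

# Undecodable bytes become literal backslash escapes; valid sequences decode normally.
decoded = B.decode("utf-8", errors="backslashreplace")

# Optional extra step (assumption): also escape characters that decoded
# successfully but are not printable, such as NUL and BEL.
printable = "".join(
    ch if ch.isprintable() else ch.encode("unicode_escape").decode("ascii")
    for ch in decoded
)

print(repr(decoded))
print(printable)
```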

'ascii' codec can't encode character u'\xe9'

I have already tried all the previous answers and solutions.
I am trying to use this value, which gives me an encoding-related error.
ar = [u'http://dbpedia.org/resource/Anne_Hathaway', u'http://dbpedia.org/resource/Jodie_Bain', u'http://dbpedia.org/resource/Wendy_Divine', u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno', u'http://dbpedia.org/resource/Baaba_Maal']
So I tried,
d = [x.decode('utf-8') for x in ar]
which gives:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 31: ordinal not in range(128)
I tried out
d = [x.encode('utf-8') for x in ar]
which removes the error but changes the original content.
The original value was u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno', which was converted to 'http://dbpedia.org/resource/Jos\xc3\xa9_El\xc3\xadas_Moreno' by encode.
What is the correct way to deal with this scenario?
Edit
The error occurs when I feed these links into
req = urllib2.Request()
The second version of your string is the correct utf-8 representation of your original unicode string. If you want to have a meaningful comparison, you have to use the same representation for both the stored string and the user input string. The sane thing to do here is to always use Unicode string internally (in your code), and make sure both your user inputs and stored strings are correctly decoded to unicode from their respective encodings at your system's boundaries (storage subsystem and user inputs subsystem).
Also you seem to be a bit confused about unicode and encodings, so reading this and this might help.
Unicode strings in python are "raw" unicode, so make sure to .encode() and .decode() them as appropriate. Using utf8 encoding is considered a best practice among multiple dev groups all over the world.
To encode, use the quote function (urllib2 re-exports it from urllib):
from urllib2 import quote
escaped_string = quote(unicode_string.encode('utf-8'))
To decode, use unquote:
from urllib2 import unquote
src = "http://dbpedia.org/resource/Jos\xc3\xa9_El\xc3\xadas_Moreno"
unicode_string = unquote(src).decode('utf-8')
Also, if you're interested in learning more about Unicode and UTF-8, check out the Unicode HOWTO.
In your Unicode list, u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno' is an ASCII safe way to represent a Unicode string. When encoded in a form that supports the full Western European character set, such as UTF-8, it's: http://dbpedia.org/resource/José_Elías_Moreno
Your .encode("UTF-8") is correct and would have looked ok in a UTF-8 editor or browser. What you saw after the encode was an ASCII safe representation of UTF-8.
For example, your trouble chars were é and í.
é = U+00E9 (Unicode code point) = C3 A9 (UTF-8 bytes)
í = U+00ED (Unicode code point) = C3 AD (UTF-8 bytes)
In short, your .encode() method is correct and should be used for writing to files or to a browser.
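The code point and byte values above can be checked directly. This is a minimal Python 3 sketch; the variable names are chosen for illustration only.

```python
name = "Jos\u00e9_El\u00edas_Moreno"   # same text as in the question's list

# Encoding changes the representation (text -> bytes), not the content.
encoded = name.encode("utf-8")

# Code point of 'é' and its UTF-8 byte sequence, as hex.
code_point = format(ord("\u00e9"), "04X")
utf8_hex = "\u00e9".encode("utf-8").hex().upper()

print(encoded)                 # shown as an ASCII-safe bytes repr
print(code_point, utf8_hex)
print(encoded.decode("utf-8") == name)
```

The escaped form that appears when printing the bytes is only the repr; decoding the bytes gives back exactly the original text.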

Python 2.7: How to convert unicode escapes in a string into actual utf-8 characters

I use python 2.7 and I'm receiving a string from a server (not in unicode!).
Inside that string I find text with unicode escape sequences. For example like this:
<a href = "http://www.mypage.com/\u0441andmoretext">\u00b2<\a>
How do I convert those \uXXXX sequences back to UTF-8? The answers I found either dealt with &# entities or required eval(), which is too slow for my purposes. I need a universal solution for any text containing such sequences.
Edit:
<\a> is a typo, but I want tolerance for such typos as well. Only \u sequences should be converted.
In proper Python syntax, the example text is:
"<a href = \"http://www.mypage.com/\\u0441andmoretext\">\\u00b2<\\a>"
The desired output, in proper Python syntax, is:
"<a href = \"http://www.mypage.com/\xd1\x81andmoretext\">\xc2\xb2<\\a>"
Try
>>> s = "<a href = \"http://www.mypage.com/\\u0441andmoretext\">\\u00b2<\\a>"
>>> s.decode("raw_unicode_escape")
u'<a href = "http://www.mypage.com/\u0441andmoretext">\xb2<\\a>'
And then you can encode to utf8 as usual.
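The same idea in Python 3, where the input would arrive as bytes, might look like the sketch below. It also shows why "raw_unicode_escape" fits the requirement that only \u sequences are touched: the plain "unicode_escape" codec interprets other escapes such as \a as well. The variable names are illustrative.

```python
# Raw bytes literal: the backslashes are literal characters, as received from the server.
raw = rb'<a href = "http://www.mypage.com/\u0441andmoretext">\u00b2<\a>'

# raw_unicode_escape only expands \uXXXX and \UXXXXXXXX; "\a" is left alone.
text = raw.decode("raw_unicode_escape")
utf8 = text.encode("utf-8")

# For comparison: unicode_escape also turns "\a" into the BEL control character.
with_bel = raw.decode("unicode_escape")

print(utf8)
print("\x07" in with_bel)
```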
Python has some special string codecs for cases like this.
In this case, if there are no other characters outside the 32-127 range, you can safely decode your byte string using the "unicode_escape" codec to get a proper Unicode text object in Python (on which your program should perform all its textual operations). Note that, unlike "raw_unicode_escape", this codec also interprets other backslash escapes such as \a.
Whenever you output that text again, you convert it to UTF-8 as usual:
rawtext = r"""<a href="http://www.mypage.com/\u0441andmoretext">\u00b2<\a>"""
text = rawtext.decode("unicode_escape")
# Text operations go here
...
output_text = text.encode("utf-8")
If there are other bytes outside the 32-127 range, the unicode_escape codec assumes them to be in the latin1 encoding. So if your response mixes utf-8 and these \uXXXX sequences, you have to:
1. decode the original string using utf-8
2. encode back to latin1
3. decode using "unicode_escape"
4. work on the text
5. encode back to utf-8
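The steps above can be sketched in Python 3 as follows. The sample input is an assumption: it mixes a literal UTF-8 character with a \uXXXX escape. The sketch only works when every literal non-ASCII character fits in Latin-1; otherwise the second step raises UnicodeEncodeError.

```python
# Hypothetical server response: UTF-8 bytes for 'é' plus a literal "\u00b2" escape.
raw = "caf\u00e9 costs 2\\u00b2".encode("utf-8")

text = (
    raw.decode("utf-8")          # 1. decode the original bytes as UTF-8
       .encode("latin-1")        # 2. encode back to Latin-1 (one byte per character)
       .decode("unicode_escape") # 3. expand the \uXXXX escapes
)

# 4. text operations would go here

output = text.encode("utf-8")    # 5. encode back to UTF-8

print(text)
print(output)
```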

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)

When I run my code I get this error:
UserId = "{}".format(source[1])
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
My code is:
def view_menu(type, source, parameters):
    ADMINFILE = 'static/users.txt'
    fp = open(ADMINFILE, 'r')
    users = ast.literal_eval(fp.read())
    if not parameters:
        if not source[1] in users:
            UserId = "{}".format(source[1])
            users.append(UserId)
            write_file(ADMINFILE,str(users))
            fp.close()
            reply(type, source, u"test")
        else:
            reply(type, source, u"test")
register_command_handler(view_menu, 'test', ['info','muc','all'], 0, '')
How can I solve this problem?
Thank you
The problem is that "{}" is non-Unicode str, and you're trying to format a unicode into it. Python 2.x handles that by automatically encoding the unicode with sys.getdefaultencoding(), which is usually 'ascii', but you have some non-ASCII characters.
There are two ways to solve this:
Explicitly encode that unicode in the appropriate character set. For example, if it's UTF-8, do "{}".format(source[1].encode('utf-8')).
Use a unicode format string: u"{}".format(source[1]). You may still need to encode that UserId later; I have no idea how your write_file function works. But it's generally better to keep everything Unicode as long as possible, only encoding and decoding at the very edges, than to try to mix and match the two.
All that being said, this line of code is useless. "{}".format(foo) converts foo to a str, and then formats it into the exact same str. Why?
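Python 3 no longer performs this implicit conversion, but the step Python 2 takes behind the scenes can be reproduced explicitly. In this sketch the nickname is a hypothetical four-character non-ASCII value, chosen so that the message matches the "position 0-3" seen in the question.

```python
user_id = "\u0410\u043d\u043d\u0430"  # hypothetical nickname (Cyrillic, 4 characters)

# What Python 2 does implicitly when formatting unicode into a byte str:
try:
    user_id.encode("ascii")
    message = None
except UnicodeEncodeError as exc:
    message = str(exc)

# Keeping everything as text and encoding only at the edge avoids the error.
formatted = "{}".format(user_id)
stored = formatted.encode("utf-8")

print(message)
print(stored)
```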
Use these functions when handling strings of unknown encoding.
If you want to work with the text:
def read_unicode(text, charset='utf-8'):
    # Decode byte strings to unicode; pass unicode objects through unchanged.
    if isinstance(text, basestring):
        if not isinstance(text, unicode):
            text = unicode(text, charset)
    return text
If you want to store the text, for example in a database, use this:
def write_unicode(text, charset='utf-8'):
    # Encode unicode text to bytes for storage.
    return text.encode(charset)
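A rough Python 3 counterpart of these helpers, where the two types are bytes and str, might look like this. The function names to_text and to_bytes are hypothetical and not taken from the answer.

```python
def to_text(value, charset="utf-8"):
    """Return str; decode bytes input with the given charset."""
    if isinstance(value, bytes):
        return value.decode(charset)
    return value


def to_bytes(text, charset="utf-8"):
    """Return bytes suitable for writing to a file, socket, or database."""
    return text.encode(charset)


# Decode once on the way in, encode once on the way out.
incoming = b"Jos\xc3\xa9"
text = to_text(incoming)
outgoing = to_bytes(text)
print(text, outgoing)
```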
Another solution is to set the default encoding to utf-8 instead of ascii in your sitecustomize.py. See:
Changing default encoding of Python?
Your file static/users.txt must contain some non-ASCII characters. You must specify an encoding in your program, for instance UTF-8. You can read more about it here: Unicode HOWTO.

Python encode url with special characters

I want to encode a URL with special characters. In my case these are š, ä, õ, æ, ø (the list is not exhaustive).
urllib2.quote(symbol) gives a very strange result, which is not correct. How else can these symbols be encoded?
urllib2.quote("Grønlandsleiret, Oslo, Norway") gives %27Gr%B8nlandsleiret%2C%20Oslo%2C%20Norway%27
Use UTF-8 explicitly then:
urllib2.quote(u"Grønlandsleiret, Oslo, Norway".encode('UTF-8'))
And always state the encoding in your source file. See PEP 263.
A non-UTF-8 string needs to be decoded first, then encoded:
# You've got a str "s".
s = s.decode('latin-1') # (or what the encoding might be …)
# Now "s" is a unicode object.
s = s.encode('utf-8') # Encode as UTF-8 string.
# Now "s" is a str again.
s = urllib2.quote(s) # URL encode.
# Now "s" is encoded the way you need it.
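In Python 3 the same decode-then-quote sequence can be sketched as below. The input bytes and their Latin-1 encoding are assumptions for the demo; with real data the source encoding has to be known.

```python
from urllib.parse import quote

# Hypothetical input: the address as Latin-1 bytes, where 'ø' is the single byte 0xF8.
legacy = b"Gr\xf8nlandsleiret, Oslo, Norway"

# Decode with the real source encoding first, then let quote() apply UTF-8.
text = legacy.decode("latin-1")
encoded = quote(text)

# Quoting the undecoded bytes escapes them as-is and yields a Latin-1 URL.
as_latin1 = quote(legacy)

print(encoded)
print(as_latin1)
```

The two results differ only in how ø is escaped (%C3%B8 versus %F8), which is the kind of mismatch described in the question.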
