How can encode('ascii', 'ignore') throw a UnicodeDecodeError? - python

This line
data = get_url_contents(r[0]).encode('ascii', 'ignore')
produces this error
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 11450: ordinal not in range(128)
Why? I assumed that because I'm using 'ignore', it should be impossible to get decode errors when saving the output to a string variable.

Due to a quirk of Python 2, you can call encode on a byte string (i.e. text that's already encoded). In this case, it first tries to convert it to a unicode object by decoding with ascii. So, if get_url_contents is returning a byte string, your line effectively does this:
get_url_contents(r[0]).decode('ascii').encode('ascii', 'ignore')
In Python 3, byte strings don't have an encode method, so the same problem would just cause an AttributeError.
(Of course, I don't know that this is the problem - it could be related to the get_url_contents function. But what I've described above is my best guess)
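For comparison, here is a minimal Python 3 sketch of the explicit version: decode the bytes first with an error handler, then encode. It assumes the fetched content is a UTF-8 byte string. The helper name to_ascii_bytes and the sample data are made up for illustration and are not part of the original code.

```python
def to_ascii_bytes(data, source_encoding="utf-8"):
    """Return an ASCII-only byte string, dropping anything that is not ASCII.

    `data` may be bytes (still encoded) or str (already decoded).
    `source_encoding` is an assumption about how the bytes were encoded.
    """
    if isinstance(data, bytes):
        # Decode explicitly so that no implicit ASCII decode can happen.
        data = data.decode(source_encoding, errors="ignore")
    return data.encode("ascii", errors="ignore")


page = b"caf\xc3\xa9 menu"  # hypothetical response body, UTF-8 encoded
print(to_ascii_bytes(page))  # b'caf menu'
```

Because the decode step is spelled out, the error handler applies to it as well, which is what the original one-liner was missing.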

Related

Troubles in printing unicode string that I have in byte format

Reading from a database I get the following value
b'd\xe2\x80\x99int'
How can I print it to get the string d’int (note that this is different from d'int)?
I tried with print(b'd\xe2\x80\x99int'.decode('utf-8')) but I get the error:
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 1: ordinal not in range(128)
EDIT: thanks to the comment I understood that the problem is not in my Python code but in emacs, I am having exactly the same problem as described here Unicode conversion issue using Python in Emacs
I will close the question
You can use bytes.decode()
>>> bytes.decode(b'd\xe2\x80\x99int', 'utf8')
'd’int'
or analogously, the .decode() method over the bytes object itself:
>>> b'd\xe2\x80\x99int'.decode('utf-8')
'd’int'
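Since the UnicodeEncodeError in the question came from the output side (the terminal encoding) rather than from decode(), it can help to check what the output stream accepts. A rough Python 3 sketch follows. The helper name printable is hypothetical, and falling back to ASCII when sys.stdout.encoding is unset is an assumption made for this example.

```python
import sys


def printable(text, encoding):
    """Return `text` with characters that `encoding` cannot represent escaped."""
    # backslashreplace turns unencodable characters into \uXXXX escapes
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


text = b"d\xe2\x80\x99int".decode("utf-8")

# sys.stdout.encoding may be None in some setups, so fall back to ASCII.
target = sys.stdout.encoding or "ascii"
print(printable(text, target))
```

On a UTF-8 terminal this prints the string unchanged. On an ASCII-only one it prints d\u2019int instead of raising.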

'utf8' codec can't decode byte 0xc3 while decode('utf-8') in python

Today I was hit with strange error in my script:
'utf8' codec can't decode byte 0xc3 in position 21: invalid continuation byte
I'm reading data from socket sock.recv and result is buff.decode('utf-8') where buff is the returned data.
But today I ran into a real "unicorn": one of the returned characters was "▒", and this is what makes decode('utf-8') raise the exception. Is there some preprocessing step that would either remove or replace such a strange character?
There is a second parameter for .decode() named errors. You can set it to 'ignore' to drop any bytes that are not valid UTF-8, or set it to 'replace' to replace them with the diamond question mark (�, U+FFFD).
buff.decode('utf-8', 'ignore')
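To make the difference between the two handlers concrete, here is a small Python 3 sketch. The sample buffer is invented for illustration: it contains a lone 0xc3 byte followed by a space, which triggers the same kind of error as in the question.

```python
buff = b"caf\xc3 latte"  # 0xc3 starts a 2-byte sequence, but a space follows

try:
    buff.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc.reason)  # invalid continuation byte

ignored = buff.decode("utf-8", "ignore")    # bad byte silently dropped
replaced = buff.decode("utf-8", "replace")  # bad byte becomes U+FFFD

print(ascii(ignored))   # 'caf latte'
print(ascii(replaced))  # 'caf\ufffd latte'
```

'replace' keeps a visible marker where data was lost, which tends to be easier to debug than 'ignore'. If the bytes are actually in another encoding (Latin-1, for example), decoding with that encoding would recover the character instead of discarding it.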

Why does Python's `s.encode('ascii', 'replace')` fail with an encoding error

Why does using replace here:
s = s.encode('ascii', 'replace')
Give me this error?:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 6755: ordinal not in range(128)
Isn't the whole point of 'replace' or 'ignore' to not fail when it can't decode a byte? Am I not understanding this?
(sorry I can't provide the actual string, the corpus is very large)
In any case, how do I tell python to ignore or replace characters that aren't ascii?
Note that you're getting a UnicodeDecodeError, not a UnicodeEncodeError.
That's because s.encode() takes a unicode string as input, but in this case you're not giving it one; you're giving it a bytestring instead.
Thus, it's decoding the bytestring you're handing it to unicode (using the default ascii codec, with no error handler) before trying to encode it, and it's in that initial decode that the error occurs.
This three-way round-trip is silly, but if you really wanted to do it:
s_bytes = '\xcb' # standard Python 2 string, aka a Python 3 bytestring
s_unicode = s_bytes.decode('ascii', 'replace') # a unicode string now
s_ascii = s_unicode.encode('ascii', 'replace') # a bytestring again
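The same round trip written for Python 3, where byte strings need the b prefix, might look like the sketch below. The sample bytes are made up; only the 0xcb byte comes from the error message.

```python
s_bytes = b"abc\xcbdef"  # made-up sample containing the byte from the error

# Step 1: bytes -> str. 'replace' substitutes U+FFFD for each undecodable byte.
s_text = s_bytes.decode("ascii", "replace")

# Step 2: str -> bytes. U+FFFD is not ASCII either, so 'replace' emits '?'.
s_ascii = s_text.encode("ascii", "replace")

print(ascii(s_text))  # 'abc\ufffddef'
print(s_ascii)        # b'abc?def'
```

Note that the error handler has to be passed to the decode step. Passing it only to encode, as in the question, does nothing for the implicit decode that Python 2 performs first.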

Unexpected recurrence of Python unicode string ascii codec error

After days and months of desperation I recently found a solution to overcome the infamous UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 18: ordinal not in range(128). It was dealing with multilingual strings pretty well until recently, when I bumped into this error AGAIN!
I tried type(thatstring) and it returned Unicode.
So I tried:
thatstring=thatstring.decode('utf-8')
This was handling those multilanguage strings pretty well but it came back now. I also tried
thatstring=thatstring.decode('utf-8','ignore')
No use.
thatstring=thatstring.encode('utf-8','ignore')
bounces with the error
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 48: ordinal not in range(128) faster than its counterpart.
Please help me. Thanks.
You did the right thing by trying type(thatstring), but you didn't draw the right conclusion from the result.
A unicode string has already been decoded, so trying to decode it again will produce an error if it contains non-ascii characters. When you use decode() on a unicode object, you effectively force python to do something like this:
temp = thatstring.encode('ascii') # convert unicode to bytes first
thatstring = temp.decode('utf-8') # now decode bytes back to unicode
Obviously, the first line will blow up as soon as it finds a non-ascii character, which explains why you see a unicode encode error, even though you are trying to decode the string. So the simple answer to your problem is: don't do that!
Instead, whenever your program receives string inputs, and wants to make sure they're converted to unicode, it should do something like this:
if isinstance(thatstring, bytes):
    thatstring = thatstring.decode(encoding)
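Wrapped into a reusable function, that check could look like this Python 3 sketch. The name ensure_text and the UTF-8 default are assumptions for the example, not something from the question.

```python
def ensure_text(value, encoding="utf-8"):
    """Return `value` as text, decoding it only if it is still bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    # Already decoded: return unchanged rather than decoding a second time.
    return value


print(ascii(ensure_text(b"caf\xc3\xa9")))  # 'caf\xe9'
print(ascii(ensure_text("caf\u00e9")))     # 'caf\xe9'
```

Calling it on input that is already text is harmless, so it can sit at every point where the program receives strings from outside. In Python 3 the mistake from the question is also caught earlier, because str objects have no decode method at all.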

exceptions.UnicodeDecodeError - 'ascii' codec can't decode byte

I keep getting this error:
<type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
args = ('ascii', '\xe2\x9d\xb6 Senn =)', 0, 1, 'ordinal not in range(128)')
encoding = 'ascii'
end = 1
message = ''
object = '\xe2\x9d\xb6 Senn =)'
reason = 'ordinal not in range(128)'
start = 0
Using this code:
steamFriend = data['response']['players'][i]
n = steamUser(steamFriend['personaname'].encode("utf-8"), steamFriend['steamid'], steamFriend['avatarfull'], steamFriend['profileurl'], steamFriend['personastate'], False)
Some things to note here:
steamFriend is a JSON object
I get this error only sometimes, because the steamFriend['personaname'] contains some weird symbols (for example ❶), and I don't know how to parse this correctly so I don't get errors.
Any help is greatly appreciated.
Also, \xe2\x9d\xb6 Senn =) is supposed to represent ❶ Senn =), if that helps.
Without seeing the full code it is hard to tell, but it seems that steamUser expects ascii input. If that is the problem, you can solve it by:
steamFriend['personaname'].encode("ascii", errors="ignore")
or
steamFriend['personaname'].encode("ascii", errors="replace")
Obviously you will lose unicode characters in the process.
If the quoted error is occurring on the n=... line, the implication is that steamFriend['personaname'] is a byte string, not a Unicode string.
Consequently when you ask to .encode it, Python has to decode the string to Unicode in order to be able to encode it back to bytes. An implicit decoding happens using the default encoding, which is ASCII, so because the byte string does not contain only ASCII you get a failure.
Are you sure you didn't mean to do:
steamFriend['personaname'].decode("utf-8")
Decoding the byte string '\xe2\x9d\xb6 Senn =)' using UTF-8 would give you the Unicode string u'\u2776 Senn =)', where U+2776 is ❶, so that would seem more like what you are after.
(Normally, however, JSON strings are explicitly Unicode, so it's not clear where you would have got the byte string from. How are you loading the JSON content?)
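As an illustration of both points, here is a Python 3 sketch that loads a made-up JSON payload and shows the decoded name, the lossy ASCII version, and the UTF-8 bytes. The payload is hypothetical and only mimics the personaname field from the question.

```python
import json

# Hypothetical raw response body: UTF-8 encoded JSON, as bytes.
payload = b'{"personaname": "\xe2\x9d\xb6 Senn =)"}'

# Decode the bytes once, at the boundary, then parse the text.
player = json.loads(payload.decode("utf-8"))
name = player["personaname"]

print(type(name).__name__)                    # str
print(ascii(name))                            # '\u2776 Senn =)'
print(name.encode("ascii", errors="ignore"))  # b' Senn =)'
print(name.encode("utf-8"))                   # b'\xe2\x9d\xb6 Senn =)'
```

The ASCII version silently loses the ❶, so it is only worth using if the consumer really cannot handle anything else.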
