exceptions.UnicodeDecodeError - 'ascii' codec can't decode byte - python

I keep getting this error:
<type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
args = ('ascii', '\xe2\x9d\xb6 Senn =)', 0, 1, 'ordinal not in range(128)')
encoding = 'ascii'
end = 1
message = ''
object = '\xe2\x9d\xb6 Senn =)'
reason = 'ordinal not in range(128)'
start = 0
Using this code:
steamFriend = data['response']['players'][i]
n = steamUser(steamFriend['personaname'].encode("utf-8"), steamFriend['steamid'], steamFriend['avatarfull'], steamFriend['profileurl'], steamFriend['personastate'], False)
Some things to note here:
steamFriend is a JSON object
I get this error only sometimes, because the steamFriend['personaname'] contains some weird symbols (for example ❶), and I don't know how to parse this correctly so I don't get errors.
Any help is greatly appreciated.
Also, \xe2\x9d\xb6 Senn =) is supposed to represent ❶ Senn =), if that helps.

Without seeing the full code it is hard to tell, but it seems that steamUser expects ascii input. If that is the problem, you can solve it by:
steamFriend['personaname'].encode("ascii", errors="ignore")
or
steamFriend['personaname'].encode("ascii", errors="replace")
Obviously you will lose unicode characters in the process.
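To make the trade-off concrete, here is a minimal sketch of what the two error handlers produce. It assumes the name is already a Unicode string; if it is a byte string, Python 2 would still attempt the implicit ASCII decode first and fail the same way.

```python
# The display name from the question as a Unicode string
# (U+2776 is the circled digit one).
name = u"\u2776 Senn =)"

# errors="ignore" silently drops every character ASCII cannot represent.
ignored = name.encode("ascii", errors="ignore")

# errors="replace" substitutes "?" for each such character.
replaced = name.encode("ascii", errors="replace")

print(repr(ignored))
print(repr(replaced))
```

Either way the circled digit is gone for good, so this only makes sense if steamUser really cannot handle anything beyond ASCII.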

If the quoted error is occurring on the n=... line, the implication is that steamFriend['personaname'] is a byte string, not a Unicode string.
Consequently when you ask to .encode it, Python has to decode the string to Unicode in order to be able to encode it back to bytes. An implicit decoding happens using the default encoding, which is ASCII, so because the byte string does not contain only ASCII you get a failure.
Are you sure you didn't mean to do:
steamFriend['personaname'].decode("utf-8")
decoding the byte string '\xe2\x9d\xb6 Senn =)' using UTF-8 would give you the Unicode string u'\u2776 Senn =)', where U+2776=❶ so that would seem more like what you are after.
(Normally, however, JSON strings are explicitly Unicode, so it's not clear where you would have got the byte string from. How are you loading the JSON content?)
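The implicit step described above can be spelled out by hand. This is a sketch written for Python 3, where nothing is decoded implicitly, so the ASCII decode that Python 2 performs behind the scenes appears as an explicit call. The JSON payload is a made-up sample, not actual Steam API output.

```python
import json

raw = b"\xe2\x9d\xb6 Senn =)"  # the byte string shown in the traceback

# What Python 2 effectively does first when .encode() is called on bytes:
try:
    raw.decode("ascii")
    implicit_decode_failed = False
except UnicodeDecodeError as exc:
    implicit_decode_failed = True
    print(exc)  # 'ascii' codec can't decode byte 0xe2 in position 0 ...

# Decoding with the encoding the bytes are actually in works fine.
name = raw.decode("utf-8")

# json.loads already returns text for string values, so no further
# decoding should be needed (hypothetical sample payload).
payload = json.loads('{"personaname": "\\u2776 Senn =)"}')
print(payload["personaname"] == name)
```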

Related

Why does Python's `s.encode('ascii', 'replace')` fail to encode?

Why does using replace here:
s = s.encode('ascii', 'replace')
Give me this error?:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 6755: ordinal not in range(128)
Isn't the whole point of 'replace' or 'ignore' to not fail when it can't decode a byte? Am I misunderstanding this?
(sorry I can't provide the actual string, the corpus is very large)
In any case, how do I tell python to ignore or replace characters that aren't ascii?
Note that you're getting a UnicodeDecodeError, not a UnicodeEncodeError.
That's because s.encode() takes a unicode string as input, but in this case you're not giving it one; you're giving it a bytestring instead.
Thus, it's decoding the bytestring you're handing it to unicode before trying to encode it, and it's in that initial decode that the error occurs.
This three-way round-trip is silly, but if you really wanted to do it:
s_bytes = '\xcb' # standard Python 2 string, aka a Python 3 bytestring
s_unicode = s_bytes.decode('ascii', 'replace') # a unicode string now
s_ascii = s_unicode.encode('ascii', 'replace') # a bytestring again
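For reference, a sketch of what that round-trip yields, written with Python 3 byte literals. The sample bytes are an assumption, since the original corpus is not available.

```python
s_bytes = b"caf\xcb"  # hypothetical sample containing one non-ASCII byte

# Decoding with 'replace' turns each undecodable byte into U+FFFD.
s_unicode = s_bytes.decode("ascii", "replace")

# Encoding U+FFFD back to ASCII with 'replace' turns it into "?".
s_ascii = s_unicode.encode("ascii", "replace")

# Decoding with 'ignore' drops the offending byte instead.
s_dropped = s_bytes.decode("ascii", "ignore")

print(repr(s_unicode), repr(s_ascii), repr(s_dropped))
```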

How to allow encode('utf-8') twice without getting error in python?

I have a legacy code segment that always encode('utf-8') for me when I pass in an unicode string (directly from database), is there a way to change unicode string to other format to allow it to be encoded to 'utf-8' again without getting an error, since I am not allowed to change the legacy code segment.
I've tried decoding it first but it returns this error
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
If I leave the unicode string as is it returns
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 986: ordinal not in range(128)
If I change the legacy code to not encode('utf-8') it works, but this is not a viable option
Edit:
Here is the code snippet
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
if __name__ == "__main__":
    # 1
    a = u'贸易'
    # 2
    a = a.decode('utf-8')
    # 3
    a.encode('utf-8')
For some reason, if I skip #2 I don't get the error that I mentioned above. I double-checked the type of the string: in both cases it is unicode, and both contain the same characters. Yet the code I am working on does not let me encode or decode to UTF-8, while the same characters in this snippet do.
Consider the following cases:
If you want a unicode string, and you already have a unicode string, you need do nothing.
If you want a bytestring, and you already have a bytestring, you need do nothing.
If you have a unicode string and want a bytestring, you encode it.
If you have a bytestring and want a unicode string, you decode it.
In none of these cases is it appropriate to encode or decode more than once.
In order for encode('utf-8') to make sense, the string must be a unicode string (or contain all-ASCII characters...). So, unless it's a unicode instance already, you have to decode it first from whatever encoding it's in to a unicode string, after which you can pass it into your legacy interface.
At no point does it make sense for anything to be double-encoded -- encoding takes a string and transforms it to a series of bytes; decoding takes a series of bytes and transforms them back into a string. The confusion only arises because Python 2 uses the same str type for both plain-ASCII strings and byte sequences.
>>> u'é'.encode('utf-8') # unicode string
'\xc3\xa9' # bytes, not unicode string
>>> '\xc3\xa9'.decode('utf-8')
u'\xe9' # unicode string
>>> u'\xe9' == u'é'
True
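The four cases above can be sketched as a pair of helpers. The names to_text and to_bytes are hypothetical, and UTF-8 is only an assumed default; the point is that each value is converted at most once, whatever type it arrives as.

```python
def to_text(value, encoding="utf-8"):
    """Return text; decode only if value is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value  # already text: do nothing


def to_bytes(value, encoding="utf-8"):
    """Return bytes; encode only if value is not already a byte string."""
    if isinstance(value, bytes):
        return value  # already bytes: do nothing
    return value.encode(encoding)


# Whatever arrives, normalise it to text once before handing it to
# an interface that will call .encode('utf-8') on it.
print(repr(to_text(b"\xc3\xa9")))
print(repr(to_bytes(u"\xe9")))
```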

Unexpected recurrence of Python unicode string ascii codec error

After days and months of desperation I recently found a solution to overcome the infamous UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 18: ordinal not in range(128). It was dealing with multilingual strings pretty well until recently, when I bumped into this error AGAIN!
I tried type(thatstring) and it returned Unicode.
So I tried:
thatstring=thatstring.decode('utf-8')
This was handling those multilanguage strings pretty well but it came back now. I also tried
thatstring=thatstring.decode('utf-8','ignore')
No use.
thatstring=thatstring.encode('utf-8','ignore')
bounces with the error
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 48: ordinal not in range(128) faster than its counterpart.
Please help me. Thanks.
You did the right thing by trying type(thatstring), but you didn't draw the right conclusion from the result.
A unicode string has already been decoded, so trying to decode it again will produce an error if it contains non-ascii characters. When you use decode() on a unicode object, you effectively force python to do something like this:
temp = thatstring.encode('ascii') # convert unicode to bytes first
thatstring = temp.decode('utf-8') # now decode bytes back to unicode
Obviously, the first line will blow up as soon as it finds a non-ascii character, which explains why you see a unicode encode error, even though you are trying to decode the string. So the simple answer to your problem is: don't do that!
Instead, whenever your program receives string inputs, and wants to make sure they're converted to unicode, it should do something like this:
if isinstance(thatstring, bytes):
    thatstring = thatstring.decode(encoding)
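The hidden first step described above is easy to reproduce in isolation. This sketch performs the ASCII encode explicitly, the way Python 2 does implicitly when decode() is called on a unicode object; the sample text is made up.

```python
thatstring = u"wait\u2026"  # already-decoded text containing U+2026

# The conversion Python 2 inserts before the requested decode:
try:
    thatstring.encode("ascii")
    error_name = None
except UnicodeEncodeError as exc:
    error_name = type(exc).__name__
    print(exc)  # 'ascii' codec can't encode character ...

print(error_name)
```

This is why the traceback mentions an encode error even though the code only asked for a decode.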

Encoding and Decoding of text in Python

I am currently working with a python script (appengine) that takes an input from the user (text) and stores it in the database for re-distribution later.
The text that comes in is unknown, in terms of encoding and I need to have it encoded only once.
Example Texts from clients:
This%20is%20a%20test
This is a test
Now in python what I thought I could do is decode it then encode it so both samples become:
This%20is%20a%20test
This%20is%20a%20test
The code that I am using is as follows:
#
# Encode as UTF-8
#
pl = pl.encode('UTF-8')
#
#Unquote the string, then requote to assure encoding
#
pl = urllib.quote(urllib.unquote(pl))
Where pl is from the POST parameter for payload.
The Issue
The issue is that sometimes I get special (Chinese, Arabic) type chars and I get the following error.
'ascii' codec can't encode character u'\xc3' in position 0: ordinal not in range(128)
..snip..
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc3' in position 0: ordinal not in range(128)
Does anyone know the best way to process the string given the above issue?
Thanks.
Replace
pl = pl.encode('UTF-8')
with
pl = pl.decode('UTF-8')
since you're trying to decode a byte-string into a string of characters.
A design issue with Python 2 lets you .encode a bytestring (which is already encoded) by automatically decoding it as ASCII (which is why it apparently works for ASCII strings, failing only for non-ASCII bytes).
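As a rough sketch of the decode-then-requote idea, here is the same normalisation written for Python 3, where the functions live in urllib.parse rather than directly in urllib as in the question's Python 2 code. The helper name normalize_payload is hypothetical, and UTF-8 input is an assumption.

```python
from urllib.parse import quote, unquote


def normalize_payload(pl):
    """Return pl percent-encoded exactly once (hypothetical helper)."""
    if isinstance(pl, bytes):
        # Bytes from the request are decoded once, assuming UTF-8.
        pl = pl.decode("utf-8")
    # Unquote first so already-quoted input is not quoted twice.
    return quote(unquote(pl))


samples = ["This%20is%20a%20test", "This is a test"]
normalized = [normalize_payload(s) for s in samples]
print(normalized)

# Non-ASCII input arriving as UTF-8 bytes is handled the same way.
print(normalize_payload(u"贸易".encode("utf-8")))
```

Note that input containing a literal percent sign followed by two hex digits is inherently ambiguous under this scheme, since it cannot be told apart from an already-quoted sequence.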

How can encode('ascii', 'ignore') throw a UnicodeDecodeError?

This line
data = get_url_contents(r[0]).encode('ascii', 'ignore')
produces this error
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 11450: ordinal not in range(128)
Why? I assumed that because I'm using 'ignore', it should be impossible to get decode errors when saving the output to a string variable.
Due to a quirk of Python 2, you can call encode on a byte string (i.e. text that's already encoded). In this case, it first tries to convert it to a unicode object by decoding with ascii. So, if get_url_contents is returning a byte string, your line effectively does this:
get_url_contents(r[0]).decode('ascii').encode('ascii', 'ignore')
In Python 3, byte strings don't have an encode method, so the same problem would just cause an AttributeError.
(Of course, I don't know that this is the problem - it could be related to the get_url_contents function. But what I've described above is my best guess)
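If the guess above is right, decoding explicitly before encoding avoids the hidden ASCII step. A minimal sketch for Python 3, using a made-up byte string in place of the real get_url_contents result and assuming the page is UTF-8:

```python
raw = b"caf\xc3\xa9"  # stand-in for the bytes returned by the fetch

# On Python 3, bytes objects have no encode method at all.
bytes_have_encode = hasattr(raw, "encode")

# Decode explicitly with the real encoding (assumed UTF-8 here) ...
text = raw.decode("utf-8")

# ... and only then strip whatever ASCII cannot represent.
ascii_only = text.encode("ascii", "ignore")

print(bytes_have_encode, repr(text), repr(ascii_only))
```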
