Inconsistent character decoding errors in Python - python

In both cp1250 and latin2, there is no character corresponding to the byte \x88 (cf. gray cells in the code page tables). Yet, if I try to decode this byte using the two encodings in Python 3, I get different results. The first encoding yields an error:
>>> b"\x88".decode("cp1250")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.6/encodings/cp1250.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x88 in position 0: character maps to <undefined>
Which makes sense, since there really is no character defined for that byte. However, the second encoding returns a character corresponding to Unicode codepoint \u0088, even though it shouldn't be defined either:
>>> b"\x88".decode("latin2")
'\x88'
>>> "\x88" == "\u0088"
True
Why is that?

Related

UnicodeDecodeError: when decoding bytes to string

I'm trying to decode a bytes value into a readable string using decode() function and utf-16 as follows:
s1 = b'\xe2\x80\x99'
print(s1.decode("utf-16-le"))
Expecting a decoded string value, I get this error:
Traceback (most recent call last):
File "<string>", line 5, in <module>
File "/usr/lib/python3.8/encodings/utf_16_le.py", line 16, in decode
return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x99 in position 2: truncated data
There seems to be some error in the sequence character, but I can't understand why.
Are you sure that's UTF-16? Because...
A UTF-16 encoded string's length is always an even number of bytes. Never 3 bytes, because 3 is odd.
b'\xe2\x80\x99' is the UTF-8 encoding of U+2019 Right Single Quotation Mark

Python - UnicodeDecode Error - Resolving [duplicate]

Odd error with unicode for me. I was dealing with unicode fine, but when I ran it this morning one item u'\u201d' gave error and gives me
UnicodeError: ASCII encoding error: ordinal not in range(128)
I looked up the code and apparently its utf-32 but when I try to decode it in the interpreter:
c = u'\u201d'
c.decode('utf-32', 'replace')
Or any other operation with it for that matter, it just doesnt recognize it in any codec but yet I found it as "RIGHT DOUBLE QUOTATION MARK"
I get:
Traceback (most recent call last):
File "<pyshell#154>", line 1, in <module>
c.decode('utf-32')
File "C:\Python27\lib\encodings\utf_32.py", line 11, in decode
return codecs.utf_32_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201d' in position 0: ordinal not in range(128)
You already have a unicode string, there is no need to decode it to a unicode string again.
What happens in that case is that python helpfully tries to first encode it for you, so that you can then decode it from utf-32. It uses the default encoding to do so, which happens to be ASCII. Here is an explicit encode to show you the exception raised in that case:
>>> u'\u201d'.encode('ASCII')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201d' in position 0: ordinal not in range(128)
In short, when you have a unicode literal like u'', there is no need to decode it.
Read up on the unicode, encodings, and default settings in the Python Unicode HOWTO. Another invaluable article on the subject is Joel Spolsky's Minimun Unicode knowledge post.

Confusion about Python Decode method

I'm trying to run the command u'\xe1'.decode("utf-8") in python and I get this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)
Why does it say I'm trying to decode ascii when I'm passing utf-8 as the first argument? In addition to this, is there any way I can get the character á from u'\xe1' and save it in a string?
decode will take a string and convert it to unicode (eg: "\xb0".decode("utf8") ==> u"\xb0")
encode will take unicode and convert it to a string (eg: u"\xb0".encode("utf8") ==> "\xb0")
neither has much to do with the rendering of a string... it is mostly an internal representation
try
print u"\xe1"
(your terminal will need to support unicode (idle will work ... dos terminal not so much))
>>> print u"\xe1"
á
>>> print repr(u"\xe1".encode("utf8"))
'\xc3\xa1'
>>> print repr("\xc3\xa1".decode("utf8"))
u'\xe1'

python 2.7 string.join() with unicode

I have bunch of byte strings (str, not unicode, in python 2.7) containing unicode data (in utf-8 encoding).
I am trying to join them( by "".join(utf8_strings) or u"".join(utf8_strings)) which throws
UnicodeDecodeError: 'ascii' codec can't decode byte 0xec in position 0: ordinal not in range(128)`
Is there any way to make use of .join() method for non-ascii strings? sure I can concatenate them in a for loop, but that wouldn't be cost-effective.
Joining byte strings using ''.join() works just fine; the error you see would only appear if you mixed unicode and str objects:
>>> utf8 = [u'\u0123'.encode('utf8'), u'\u0234'.encode('utf8')]
>>> ''.join(utf8)
'\xc4\xa3\xc8\xb4'
>>> u''.join(utf8)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
>>> ''.join(utf8 + [u'unicode object'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
The exceptions above are raised when using the Unicode value u'' as the joiner, and adding a Unicode string to the list of strings to join, respectively.
"".join(...) will work if each parameter is a str (whatever the encoding may be).
The issue you are seeing is probably not related to the join, but the data you supply to it. Post more code so we can see what's really wrong.

Unicode error Ordinal not in range

Odd error with unicode for me. I was dealing with unicode fine, but when I ran it this morning one item u'\u201d' gave error and gives me
UnicodeError: ASCII encoding error: ordinal not in range(128)
I looked up the code and apparently its utf-32 but when I try to decode it in the interpreter:
c = u'\u201d'
c.decode('utf-32', 'replace')
Or any other operation with it for that matter, it just doesnt recognize it in any codec but yet I found it as "RIGHT DOUBLE QUOTATION MARK"
I get:
Traceback (most recent call last):
File "<pyshell#154>", line 1, in <module>
c.decode('utf-32')
File "C:\Python27\lib\encodings\utf_32.py", line 11, in decode
return codecs.utf_32_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201d' in position 0: ordinal not in range(128)
You already have a unicode string, there is no need to decode it to a unicode string again.
What happens in that case is that python helpfully tries to first encode it for you, so that you can then decode it from utf-32. It uses the default encoding to do so, which happens to be ASCII. Here is an explicit encode to show you the exception raised in that case:
>>> u'\u201d'.encode('ASCII')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201d' in position 0: ordinal not in range(128)
In short, when you have a unicode literal like u'', there is no need to decode it.
Read up on the unicode, encodings, and default settings in the Python Unicode HOWTO. Another invaluable article on the subject is Joel Spolsky's Minimun Unicode knowledge post.

Categories