Confusion about Python Decode method - python

I'm trying to run the command u'\xe1'.decode("utf-8") in Python 2.7 and I get this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)
Why does it say I'm trying to decode ascii when I'm passing utf-8 as the first argument? In addition to this, is there any way I can get the character á from u'\xe1' and save it in a string?

decode takes a byte string (str) and converts it to unicode (e.g. "\xc2\xb0".decode("utf8") ==> u"\xb0").
encode takes unicode and converts it to a byte string (e.g. u"\xb0".encode("utf8") ==> "\xc2\xb0").
Neither has much to do with how a string is rendered; they only change its internal representation.
Try:
print u"\xe1"
(Your terminal needs to support Unicode. IDLE will work; a DOS terminal, not so much.)
>>> print u"\xe1"
á
>>> print repr(u"\xe1".encode("utf8"))
'\xc3\xa1'
>>> print repr("\xc3\xa1".decode("utf8"))
u'\xe1'
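The same round trip can be sketched in Python 3 syntax, where the text type is str and the byte-string type is bytes. This is only an illustration, not part of the original answer; the variable names are made up for the example.

```python
# Python 3 sketch: text (str) and bytes are separate types.
text = "\xe1"                      # one code point, U+00E1 (a with acute accent)

# encode: text -> bytes
utf8_bytes = text.encode("utf-8")  # two bytes in UTF-8: 0xC3 0xA1

# decode: bytes -> text
round_tripped = utf8_bytes.decode("utf-8")

print(repr(utf8_bytes))
print(round_tripped == text)
```

In Python 3 a str has no decode method at all, so the mistake in the question surfaces as an AttributeError rather than a confusing ASCII encode error.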

Related

Inconsistent character decoding errors in Python

In both cp1250 and latin2, there is no character corresponding to the byte \x88 (cf. gray cells in the code page tables). Yet, if I try to decode this byte using the two encodings in Python 3, I get different results. The first encoding yields an error:
>>> b"\x88".decode("cp1250")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.6/encodings/cp1250.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x88 in position 0: character maps to <undefined>
Which makes sense, since there really is no character defined for that byte. However, the second encoding returns a character corresponding to Unicode codepoint \u0088, even though it shouldn't be defined either:
>>> b"\x88".decode("latin2")
'\x88'
>>> "\x88" == "\u0088"
True
Why is that?
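One way to inspect what the latin2 codec returned is to look at the Unicode category of the result. The sketch below assumes Python 3. It suggests that Python's latin2 (ISO 8859-2) codec maps byte 0x88 onto the C1 control character with the same number, while the cp1250 table leaves 0x88 unassigned. Treat it as a probable explanation rather than a definitive one; the variable names are illustrative.

```python
import unicodedata

raw = b"\x88"

# Python's latin2 decoding table has an entry for byte 0x88.
latin2_char = raw.decode("latin2")
category = unicodedata.category(latin2_char)   # "Cc" means "control character"

# The cp1250 table has no entry for 0x88, so decoding raises.
try:
    raw.decode("cp1250")
    cp1250_ok = True
except UnicodeDecodeError:
    cp1250_ok = False

print(repr(latin2_char), category, cp1250_ok)
```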

How to handle "UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' "?

I am trying to access a table on a SQL Server using the Python 2.7 module adodbapi, and print certain information to the command prompt (Windows). Here is my original code snippet:
query_str = "SELECT id, headline, state, severity FROM GPS3.Defect ORDER BY id"
cur.execute(query_str)
dr_data = cur.fetchall()
con.close()
for i in dr_data:
    print i
It will print out about 30 rows, all correctly formatted, but then it will stop and give me this:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 52: ordinal not in range(128)
So I looked this up online and went through the presentation explaining Unicode in Python, and I thought I understood. So I would explicitly tell the Python interpreter that it was dealing with Unicode, and should encode it into UTF-8.
This is what I came up with:
for i in dr_data:
    print (u"%s"%i).encode('utf-8')
However, I suppose I don't actually understand Unicode because I get the same exact error when I run this. I know this question is asked a lot, but could someone explain to me, simply, what is going on here? Thanks in advance.
Your error message does not agree with the statement that you are printing on a Windows command prompt. The Windows console does not default to the ascii codec; on US Windows, it defaults to cp437.
You can just print Unicode to the console without trying to encode it. Python will encode the Unicode strings to the console encoding. Here is an example. Note the source file is saved in UTF-8 encoding, and the encoding is declared with the special #coding:utf8 comment. This allows any Unicode character to be in the source code.
#coding:utf8
s1 = u'αßΓπΣσµτ' # cp437-supported
s2 = u'ÀÁÂÃÄÅ' # cp1252-supported
s3 = u'我是美国人。' # unsupported by cp437 or cp1252.
Since my US Windows console defaults to cp437, only s1 will display without error.
C:\>chcp
Active code page: 437
C:\>py -2 -i test.py
>>> print s1
αßΓπΣσµτ
>>> print s2
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-3: character maps to <undefined>
>>> print s3
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-5: character maps to <undefined>
Note the error message indicates what encoding it tried to use: cp437.
If I change the console encoding, now s2 will work correctly:
C:\>chcp 1252
Active code page: 1252
C:\>py -2 -i test.py
>>> print s1
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u03b1' in position 0: character maps to <undefined>
>>> print s2
ÀÁÂÃÄÅ
>>> print s3
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-5: character maps to <undefined>
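Whether a string will print under a given console code page can also be checked without changing the console, by encoding it explicitly. The following is a minimal sketch in Python 3 syntax. The helper name can_encode is hypothetical, and the sample strings are shortened versions of s1, s2 and s3 written with escapes so that the source file encoding does not matter.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

s1 = "\u03b1\u00df\u0393\u03c0"      # Greek letters plus sharp s, as in s1 above
s2 = "\u00c0\u00c1\u00c2\u00c3"      # accented capital A's, as in s2 above
s3 = "\u6211\u662f"                  # two Chinese characters, as in s3 above

for name, s in [("s1", s1), ("s2", s2), ("s3", s3)]:
    print(name, {enc: can_encode(s, enc) for enc in ("cp437", "cp1252", "utf-8")})
```

The encoding Python picked for the current console is available as sys.stdout.encoding, which can be passed to the same helper.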
Now s3 contains characters that common Western encodings don't support. You can go into Control Panel and change the system locale to Chinese and the console will then support Chinese encodings, but a better solution is to use a Python IDE that supports UTF-8, an encoding that supports all Unicode characters (subject to font support, of course). Below is the output of PythonWin, an editor that comes with the pywin32 Python extension:
>>> print s1
αßΓπΣσµτ
>>> print s2
ÀÁÂÃÄÅ
>>> print s3
我是美国人。
In summary, just use Unicode strings, and ideally use a terminal with UTF-8 and it will "just work". Convert text data to Unicode as soon as it is read from file, user input, network socket, etc. Process and print in Unicode, but encode it when it leaves the program (write to file, network socket, etc.).
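The "decode on input, process as Unicode, encode on output" advice can be sketched as follows. This is an illustration in Python 3 syntax under stated assumptions: the input bytes stand in for data read from a socket or file, and the function name shout, the variable names, and the temporary file are all invented for the example.

```python
import io
import os
import tempfile

def shout(text):
    """Work on text (Unicode), never on raw bytes."""
    return text.upper()

raw_input_bytes = b"\xc3\xa1rbol"            # UTF-8 bytes, e.g. read from a socket
text = raw_input_bytes.decode("utf-8")       # decode at the boundary (input)
result = shout(text)                         # process as Unicode

path = os.path.join(tempfile.mkdtemp(), "out.txt")
with io.open(path, "w", encoding="utf-8") as f:   # encode at the boundary (output)
    f.write(result)

# Read the file back as raw bytes to see what was actually written.
with io.open(path, "rb") as f:
    written = f.read()
print(repr(written))
```

Keeping the encode and decode calls at the edges means the processing code in the middle never needs to know which encoding the outside world uses.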

Python - UnicodeDecode Error - Resolving [duplicate]

Odd error with unicode for me. I was dealing with unicode fine, but when I ran it this morning one item, u'\u201d', raised an error:
UnicodeError: ASCII encoding error: ordinal not in range(128)
I looked up the code and apparently it's utf-32, but when I try to decode it in the interpreter:
c = u'\u201d'
c.decode('utf-32', 'replace')
Or any other operation with it for that matter, it just doesn't recognize it in any codec, and yet I found it listed as "RIGHT DOUBLE QUOTATION MARK".
I get:
Traceback (most recent call last):
File "<pyshell#154>", line 1, in <module>
c.decode('utf-32')
File "C:\Python27\lib\encodings\utf_32.py", line 11, in decode
return codecs.utf_32_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201d' in position 0: ordinal not in range(128)
You already have a unicode string; there is no need to decode it to a unicode string again.
What happens in that case is that Python helpfully tries to encode it for you first, so that you can then decode it from utf-32. It uses the default encoding to do so, which happens to be ASCII. Here is an explicit encode to show you the exception raised in that case:
>>> u'\u201d'.encode('ASCII')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201d' in position 0: ordinal not in range(128)
In short, when you have a unicode literal like u'', there is no need to decode it.
Read up on unicode, encodings, and default settings in the Python Unicode HOWTO. Another invaluable article on the subject is Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets".
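The hidden encode step described above can be imitated explicitly. The sketch below is written in Python 3 syntax, where str has no decode method, so the function py2_style_decode is a hypothetical stand-in that mimics what Python 2 does when decode is called on a unicode object. It is an approximation for illustration, not the interpreter's actual implementation.

```python
def py2_style_decode(text, encoding, default_encoding="ascii"):
    """Imitate Python 2's unicode.decode(): encode implicitly, then decode."""
    implicit_bytes = text.encode(default_encoding)   # the hidden step that fails
    return implicit_bytes.decode(encoding)

# ASCII-only text survives the hidden encode.
print(py2_style_decode("abc", "utf-8"))

# A non-ASCII character fails in the hidden encode, before utf-32 is ever used.
try:
    py2_style_decode("\u201d", "utf-32")
except UnicodeEncodeError as exc:
    error_text = str(exc)
    print(error_text)
```

This is why the traceback mentions the ascii codec and an encode error even though the call asked for a utf-32 decode.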

python 2.7 string.join() with unicode

I have a bunch of byte strings (str, not unicode, in Python 2.7) containing unicode data (in utf-8 encoding).
I am trying to join them (with "".join(utf8_strings) or u"".join(utf8_strings)), which throws
UnicodeDecodeError: 'ascii' codec can't decode byte 0xec in position 0: ordinal not in range(128)
Is there any way to use the .join() method on non-ASCII strings? Sure, I can concatenate them in a for loop, but that wouldn't be cost-effective.
Joining byte strings using ''.join() works just fine; the error you see would only appear if you mixed unicode and str objects:
>>> utf8 = [u'\u0123'.encode('utf8'), u'\u0234'.encode('utf8')]
>>> ''.join(utf8)
'\xc4\xa3\xc8\xb4'
>>> u''.join(utf8)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
>>> ''.join(utf8 + [u'unicode object'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
The exceptions above are raised when using the Unicode value u'' as the joiner, and adding a Unicode string to the list of strings to join, respectively.
"".join(...) will work if every element of the sequence is a str (whatever the encoding may be).
The issue you are seeing is probably not related to the join, but the data you supply to it. Post more code so we can see what's really wrong.
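Two safe ways to join UTF-8 data are to stay entirely in bytes or to decode every piece before joining. The sketch below uses Python 3 syntax, where mixing the two types raises a TypeError instead of Python 2's UnicodeDecodeError; the variable names are illustrative.

```python
utf8_parts = ["\u0123".encode("utf-8"), "\u0234".encode("utf-8")]

# Option 1: stay in bytes and join with a bytes separator.
joined_bytes = b"".join(utf8_parts)

# Option 2: decode each piece first, then join as text.
joined_text = "".join(part.decode("utf-8") for part in utf8_parts)

# Mixing the two types is the mistake; Python 3 reports it as a TypeError.
try:
    "".join(utf8_parts)
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(repr(joined_bytes), joined_text == "\u0123\u0234", mixed_ok)
```

In Python 2 the equivalent of option 2 is u"".join(s.decode("utf8") for s in utf8_strings), which avoids the implicit ASCII decode.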
