Python 3: different behaviour between latin-1 and cp1252 when decoding unmapped characters

I'm trying to read a text file in Python 3 with encoding cp1252, and the file contains bytes that encoding doesn't map (for instance byte 0x8d):
with open(inputfilename, mode='r', encoding='cp1252') as inputfile:
    print(inputfile.readlines())
I obviously get the following exception:
Traceback (most recent call last):
File "test.py", line 9, in <module>
print(inputfile.readlines())
File "/usr/lib/python3.6/encodings/cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 14: character maps to <undefined>
I'd like to understand why, when reading the same file with encoding latin-1, I don't get the same exception, and the byte 0x8d instead shows up as an escape sequence:
$ python3 test.py
['This is a test\x8d file\n']
As far as I know, byte 0x8d is not mapped to a printable character in either encoding (latin-1 or cp1252). What am I missing? Why is Python 3's behaviour different?

From the docs (https://docs.python.org/3/library/codecs.html): "The simplest text encoding (called 'latin-1' or 'iso-8859-1') maps the code points 0–255 to the bytes 0x0–0xff."
In other words, latin-1 defines a mapping for every possible byte value, so decoding it can never fail: byte 0x8d simply becomes the unprintable control character U+008D, which repr() displays as \x8d. cp1252, on the other hand, leaves a handful of bytes (0x81, 0x8d, 0x8f, 0x90, 0x9d) undefined, and decoding any of them raises UnicodeDecodeError.
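To make the difference concrete, here is a small sketch (standard library only) decoding the same bytes with both codecs:

```python
# latin-1 maps every byte 0x00-0xff straight to the code point of the
# same value, so decoding can never fail; cp1252 redefines the 0x80-0x9f
# range for printable characters but leaves 0x81, 0x8d, 0x8f, 0x90 and
# 0x9d undefined.
data = b'This is a test\x8d file\n'

decoded = data.decode('latin-1')     # 0x8d becomes U+008D, a control char
print(repr(decoded))                 # 'This is a test\x8d file\n'

try:
    data.decode('cp1252')
except UnicodeDecodeError as exc:
    print(exc)                       # character maps to <undefined>

# errors='replace' substitutes U+FFFD for each unmapped byte instead of raising
replaced = data.decode('cp1252', errors='replace')
print(repr(replaced))
```

So latin-1 never raises on decode; it silently produces control characters for bytes like 0x8d, which is exactly what the `\x8d` in the question's output is.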

Related

How to handle "UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' "?

I am trying to access a table on a SQL Server using the Python 2.7 module adodbapi, and print certain information to the command prompt (Windows). Here is my original code snippet:
query_str = "SELECT id, headline, state, severity FROM GPS3.Defect ORDER BY id"
cur.execute(query_str)
dr_data = cur.fetchall()
con.close()
for i in dr_data:
    print i
It will print out about 30 rows, all correctly formatted, but then it will stop and give me this:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 52: ordinal not in range(128)
So I looked this up online and went through a presentation explaining Unicode in Python, and I thought I understood. I figured I would explicitly tell the Python interpreter that it was dealing with Unicode and encode it to UTF-8. This is what I came up with:
for i in dr_data:
    print (u"%s" % i).encode('utf-8')
However, I suppose I don't actually understand Unicode because I get the same exact error when I run this. I know this question is asked a lot, but could someone explain to me, simply, what is going on here? Thanks in advance.
Your error message does not agree with the statement that you are printing on a Windows command prompt. It does not default to the ascii codec. On US Windows, it defaults to cp437.
You can just print Unicode to the console without trying to encode it. Python will encode the Unicode strings to the console encoding. Here is an example. Note the source file is saved in UTF-8 encoding, and the encoding is declared with the special #coding:utf8 comment. This allows any Unicode character to be in the source code.
#coding:utf8
s1 = u'αßΓπΣσµτ' # cp437-supported
s2 = u'ÀÁÂÃÄÅ' # cp1252-supported
s3 = u'我是美国人。' # unsupported by cp437 or cp1252.
Since my US Windows console defaults to cp437, only s1 will display without error.
C:\>chcp
Active code page: 437
C:\>py -2 -i test.py
>>> print s1
αßΓπΣσµτ
>>> print s2
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-3: character maps to <undefined>
>>> print s3
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-5: character maps to <undefined>
Note the error message indicates what encoding it tried to use: cp437.
If I change the console encoding, now s2 will work correctly:
C:\>chcp 1252
Active code page: 1252
C:\>py -2 -i test.py
>>> print s1
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u03b1' in position 0: character maps to <undefined>
>>> print s2
ÀÁÂÃÄÅ
>>> print s3
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-5: character maps to <undefined>
Now s3 contains characters that common Western encodings don't support. You can go into Control Panel and change the system locale to Chinese, and the console will then support Chinese encodings, but a better solution is to use a Python IDE that supports UTF-8, an encoding that supports all Unicode characters (subject to font support, of course). Below is the output of PythonWin, an editor that comes with the pywin32 Python extension:
>>> print s1
αßΓπΣσµτ
>>> print s2
ÀÁÂÃÄÅ
>>> print s3
我是美国人。
In summary, just use Unicode strings, and ideally use a terminal with UTF-8 and it will "just work". Convert text data to Unicode as soon as it is read from file, user input, network socket, etc. Process and print in Unicode, but encode it when it leaves the program (write to file, network socket, etc.).
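In Python 3 (where all strings are Unicode) the same decode-early/encode-late pattern looks like this; a minimal sketch, where treating the incoming bytes as UTF-8 is an assumption about the data source:

```python
# Bytes arrive from outside the program (file, socket, database driver, ...)
raw = '我是美国人。'.encode('utf-8')

# Decode to str as soon as the data enters the program
text = raw.decode('utf-8')

# ... all processing happens on str ...
assert '美国' in text

# Encode only at the boundary, when the data leaves the program again
out = text.encode('utf-8')
assert out == raw
```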

Python - UnicodeDecode Error - Resolving [duplicate]

Odd error with unicode for me. I was dealing with unicode fine, but when I ran it this morning one item u'\u201d' gave error and gives me
UnicodeError: ASCII encoding error: ordinal not in range(128)
I looked up the code point and apparently it's UTF-32, but when I try to decode it in the interpreter:
c = u'\u201d'
c.decode('utf-32', 'replace')
Or any other operation with it, for that matter, it just isn't recognized by any codec, yet I found it listed as "RIGHT DOUBLE QUOTATION MARK".
I get:
Traceback (most recent call last):
File "<pyshell#154>", line 1, in <module>
c.decode('utf-32')
File "C:\Python27\lib\encodings\utf_32.py", line 11, in decode
return codecs.utf_32_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201d' in position 0: ordinal not in range(128)
You already have a unicode string, there is no need to decode it to a unicode string again.
What happens in that case is that Python helpfully tries to encode it first, so that you can then decode it from UTF-32. It uses the default encoding to do so, which happens to be ASCII. Here is an explicit encode showing the exception raised in that case:
>>> u'\u201d'.encode('ASCII')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201d' in position 0: ordinal not in range(128)
In short, when you have a unicode literal like u'', there is no need to decode it.
Read up on Unicode, encodings, and the default settings in the Python Unicode HOWTO. Another invaluable article on the subject is Joel Spolsky's post on the minimum Unicode knowledge every developer needs.
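Incidentally, Python 3 removes this trap entirely: str has no .decode() method at all, so the mistake fails immediately instead of via a surprising implicit ASCII encode. A quick check:

```python
s = '\u201d'   # already text: RIGHT DOUBLE QUOTATION MARK

# In Python 3, only bytes has .decode(); text does not
assert not hasattr(s, 'decode')

# Round-tripping through an explicit encoding still works, of course
assert s.encode('utf-8').decode('utf-8') == s
```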

How to read special characters

I am getting a value of column from database like below:
`;;][#+©
When I am reading this in my Python code this is giving below error message:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 7: invalid start byte
Then I tried the code below, but it didn't work:
unicode(' `;;][#+©', 'utf-8')
Now how can I solve this problem?
First, read this article on Unicode. The string you have is encoded in some encoding, but not UTF-8. The way we can tell it's not UTF-8 is that the byte at position 7, 0xa9 (= 169), is outside the ASCII range 0–127 but isn't preceded by a UTF-8 leading byte.
So the trick is to work out what encoding it is. We've got a hint: the encoding needs to represent the byte 0xa9 as the glyph ©. I'd guess that it's either the Windows-1252 or Latin-1 encodings because they're very common, and looking up A9 in the grid (character encodings are essentially the same as playing battleships) gives the copyright sign in both.
>>> unicode(' `;;][#+©')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 8: ordinal not in range(128)
>>> unicode(' `;;][#+©', 'latin-1')
u' `;;][#+\xc2\xa9'
>>> unicode(' `;;][#+©', 'cp1252')
u' `;;][#+\xc2\xa9'
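The session above is Python 2; in Python 3 the equivalent check (a sketch, with the raw bytes spelled out explicitly) confirms that 0xa9 is the copyright sign in both candidate encodings:

```python
raw = b' `;;][#+\xa9'

as_latin1 = raw.decode('latin-1')
as_cp1252 = raw.decode('cp1252')

# 0xa9 maps to © in both latin-1 and cp1252, so either guess
# produces the expected text for this particular string
assert as_latin1 == as_cp1252 == ' `;;][#+\xa9'
print(as_latin1)
```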

Python 2.7 UnicodeDecodeError: 'ascii' codec can't decode byte

I've been parsing some docx files (UTF-8 encoded XML) with special characters (Czech alphabet). When I try to output to stdout, everything goes smoothly, but I'm unable to write the data to a file:
Traceback (most recent call last):
File "./test.py", line 360, in
ofile.write(u'\t\t\t\t\t\n')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 37: ordinal not in range(128)
Although I explicitly cast the word variable to the unicode type (type(word) returned unicode) and tried to encode it with .encode('utf-8'), I'm still stuck with this error.
Here is a sample of the code as it looks now:
for word in word_list:
    word = unicode(word)
    #...
    ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
    #...
I also tried the following:
for word in word_list:
    word = word.encode('utf-8')
    #...
    ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word+u'"/>\n')
    #...
Even the combination of these two:
word = unicode(word)
word = word.encode('utf-8')
I was kind of desperate so I even tried to encode the word variable inside the ofile.write()
ofile.write(u'\t\t\t\t\t<feat att="writtenForm" val="'+word.encode('utf-8')+u'"/>\n')
I would appreciate any hints of what I'm doing wrong.
ofile is a byte stream, but you are writing a character string to it. Python 2 tries to paper over the mismatch by implicitly encoding to a byte string, which is only safe for ASCII characters. Since word contains non-ASCII characters, it fails:
>>> open('/dev/null', 'wb').write(u'ä')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0:
ordinal not in range(128)
Make ofile a text stream by opening the file with io.open, with a mode like 'wt', and an explicit encoding:
>>> import io
>>> io.open('/dev/null', 'wt', encoding='utf-8').write(u'ä')
1L
Alternatively, you can also use codecs.open with pretty much the same interface, or encode all strings manually with encode.
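The same fix carries over unchanged to Python 3, where io.open is simply the built-in open. A runnable sketch (the temp-file path and the sample Czech word are just for the demo):

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.xml')

# Text mode plus an explicit encoding: write() accepts text and the
# stream performs the encoding for you
with io.open(path, 'w', encoding='utf-8') as ofile:
    ofile.write(u'\t<feat att="writtenForm" val="m\u011bsto"/>\n')

with io.open(path, 'r', encoding='utf-8') as f:
    content = f.read()

print(content)
os.remove(path)
```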
Phihag's answer is correct. I just want to propose to convert the unicode to a byte-string manually with an explicit encoding:
ofile.write((u'\t\t\t\t\t<feat att="writtenForm" val="' +
             word + u'"/>\n').encode('utf-8'))
(Maybe you'd like to know how it's done using basic mechanisms instead of advanced wizardry and black magic like io.open.)
I've had a similar error when writing to word documents (.docx). Specifically with the Euro symbol (€).
x = "€".encode()
Which gave the error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
How I solved it was by:
x = "€".decode()
I hope this helps!
The best solution I found on Stack Overflow is in this post:
How to fix: "UnicodeDecodeError: 'ascii' codec can't decode byte"
Put this at the beginning of the code and the default encoding will be UTF-8 (note this works in Python 2 only; sys.setdefaultencoding does not exist in Python 3):
# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')

