I am getting the value of a column from a database, like below:
`;;][#+©
When I read this in my Python code, it gives the following error message:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 7: invalid start byte
Then I tried the code below, but it does not work either:
unicode(' `;;][#+©', 'utf-8')
Now how can I solve this problem?
First, read up on how Unicode and encodings work. The string you have is encoded in some encoding, but that encoding is not UTF-8. We can tell it's not UTF-8 because the byte 0xa9 (= 169) at position 7 is outside the ASCII range 0-127, yet isn't preceded by a UTF-8 leading byte, so it can't be a valid continuation byte.
So the trick is to work out what encoding it is. We've got a hint: the encoding needs to represent the byte 0xa9 as the glyph ©. I'd guess it's either Windows-1252 or Latin-1, because both are very common, and looking up A9 in their code-page grids (character encodings are essentially the same as playing battleships) gives the copyright sign in both.
>>> unicode(' `;;][#+©')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 8: ordinal not in range(128)
>>> unicode(' `;;][#+©', 'latin-1')
u' `;;][#+\xc2\xa9'
>>> unicode(' `;;][#+©', 'cp1252')
u' `;;][#+\xc2\xa9'
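If you're on Python 3, where the unicode() builtin is gone, the equivalent is bytes.decode(). A minimal sketch, assuming the raw bytes from the database match the string above:
>>> raw = b' `;;][#+\xa9'    # the undecoded column value
>>> raw.decode('cp1252')
' `;;][#+©'
>>> raw.decode('latin-1')
' `;;][#+©'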
Related
In both cp1250 and latin2, there is no character corresponding to the byte \x88 (cf. gray cells in the code page tables). Yet, if I try to decode this byte using the two encodings in Python 3, I get different results. The first encoding yields an error:
>>> b"\x88".decode("cp1250")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.6/encodings/cp1250.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x88 in position 0: character maps to <undefined>
Which makes sense, since there really is no character defined for that byte. However, the second encoding returns a character corresponding to Unicode codepoint \u0088, even though it shouldn't be defined either:
>>> b"\x88".decode("latin2")
'\x88'
>>> "\x88" == "\u0088"
True
Why is that?
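One way to see the difference from the interpreter is to inspect the codecs' decoding tables. This is a sketch relying on CPython's charmap-based codec modules, whose decoding_table attribute is an implementation detail rather than a public API:
>>> import encodings.cp1250, encodings.iso8859_2  # 'latin2' is an alias for iso8859-2
>>> encodings.cp1250.decoding_table[0x88]         # undefined slot, marked with U+FFFE
'\ufffe'
>>> encodings.iso8859_2.decoding_table[0x88]      # mapped straight through to U+0088
'\x88'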
I'm trying to read a text file in Python 3, specifying the cp1252 encoding; the file contains bytes with no mapping in that code page (for instance 0x8d).
with open(inputfilename, mode='r', encoding='cp1252') as inputfile:
print(inputfile.readlines())
I obviously get the following exception:
Traceback (most recent call last):
File "test.py", line 9, in <module>
print(inputfile.readlines())
File "/usr/lib/python3.6/encodings/cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 14: character maps to <undefined>
I'd like to understand why, when reading the same file with encoding latin-1, I don't get the same exception, and the byte 0x8d is instead represented as a hex escape:
$ python3 test.py
['This is a test\x8d file\n']
As far as I know, the byte 0x8d is not mapped in either encoding (latin-1 or cp1252). What am I missing? Why is Python 3's behaviour different?
From the docs: "The simplest text encoding (called 'latin-1' or 'iso-8859-1') maps the code points 0–255 to the bytes 0x0–0xff."
https://docs.python.org/3/library/codecs.html
In other words, latin-1 decodes every possible byte, so nothing can be unmapped; cp1252 leaves a handful of bytes (0x8d among them) undefined, which is why decoding those fails.
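If the goal is simply to read the file despite the unmapped byte, a sketch using the standard errors parameter of open() (here 'replace', which substitutes U+FFFD for anything undecodable) might look like:
with open(inputfilename, mode='r', encoding='cp1252', errors='replace') as inputfile:
    # 0x8d has no cp1252 mapping, so it comes back as the replacement character
    print(inputfile.readlines())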
I've hit an odd error with unicode. I was dealing with unicode fine, but when I ran my code this morning, one item, u'\u201d', gave an error:
UnicodeError: ASCII encoding error: ordinal not in range(128)
I looked up the code and apparently it's UTF-32, but when I try to decode it in the interpreter:
c = u'\u201d'
c.decode('utf-32', 'replace')
Or do any other operation with it, for that matter; it just isn't recognized by any codec, yet I found it listed as "RIGHT DOUBLE QUOTATION MARK".
I get:
Traceback (most recent call last):
File "<pyshell#154>", line 1, in <module>
c.decode('utf-32')
File "C:\Python27\lib\encodings\utf_32.py", line 11, in decode
return codecs.utf_32_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201d' in position 0: ordinal not in range(128)
You already have a unicode string; there is no need to decode it into a unicode string again.
What happens in that case is that Python helpfully tries to encode it for you first, so that you can then decode it from utf-32. It uses the default encoding to do so, which happens to be ASCII. Here is an explicit encode to show you the exception raised in that case:
>>> u'\u201d'.encode('ASCII')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201d' in position 0: ordinal not in range(128)
In short, when you have a unicode literal like u'', there is no need to decode it.
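For completeness, the direction that does make sense is encode first, then decode; a quick round trip in the same Python 2 session:
>>> data = u'\u201d'.encode('utf-32')  # unicode -> bytes, with a BOM prepended
>>> data.decode('utf-32')              # bytes -> unicode
u'\u201d'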
Read up on Unicode, encodings, and default settings in the Python Unicode HOWTO. Another invaluable article on the subject is Joel Spolsky's minimum-Unicode-knowledge post.
I have a bunch of byte strings (str, not unicode, in Python 2.7) containing Unicode data (in the UTF-8 encoding).
I am trying to join them (with "".join(utf8_strings) or u"".join(utf8_strings)), which throws:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xec in position 0: ordinal not in range(128)
Is there any way to use the .join() method for non-ASCII strings? Sure, I can concatenate them in a for loop, but that wouldn't be efficient.
Joining byte strings using ''.join() works just fine; the error you see would only appear if you mixed unicode and str objects:
>>> utf8 = [u'\u0123'.encode('utf8'), u'\u0234'.encode('utf8')]
>>> ''.join(utf8)
'\xc4\xa3\xc8\xb4'
>>> u''.join(utf8)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
>>> ''.join(utf8 + [u'unicode object'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
The exceptions above are raised when using the Unicode value u'' as the joiner, and adding a Unicode string to the list of strings to join, respectively.
"".join(...) will work if each parameter is a str (whatever the encoding may be).
The issue you are seeing is probably not related to the join itself but to the data you supply to it. Post more code so we can see what's really wrong.
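If what you actually want is a single unicode result, decode each byte string first and then join the unicode objects; a minimal sketch reusing the utf8 list from above:
>>> u''.join(s.decode('utf-8') for s in utf8)
u'\u0123\u0234'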