Python 3 can't decode certain characters when reading - python

I have some super simple code trying to open a file, but it contains some Chinese/Arabic characters which I believe are stopping me from being able to open it. I'm not sure how to modify the file in order to allow it to open these characters. My code is simply
a_file = open("test2.txt")
lines = a_file.readlines()
print(lines)
and my error message is
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2948: character maps to <undefined>
How do I fix this? Thanks!

The error message
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2948: character maps to <undefined>
is telling you that the bytes in the file cannot be decoded using the system's default encoding (the "'charmap' codec can't decode" message typically appears on Windows systems that use a legacy 8-bit encoding such as cp1252).
If the file contains Chinese or Arabic characters, the correct encoding to use when opening it is more likely UTF-8 or UTF-16.
Note that the ISO-8859-1 / latin-1 encoding will decode any bytes, but the result may be meaningless, because it is an 8-bit encoding that can only represent 256 characters.
>>> s = '你好,世界'
>>> bs = s.encode('utf-8')
>>> print(bs.decode('ISO-8859-1'))
ä½ å¥½ï¼ä¸ç
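A minimal sketch of the fix this answer points toward, assuming the file really is UTF-8: pass `encoding=` explicitly instead of relying on the platform default. The helper name `read_text` and the sample text are illustrative, not from the question. The character 丁 is used because its UTF-8 encoding ends in byte 0x81, which cp1252 leaves undefined, so it reproduces the 'charmap' failure.

```python
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a whole text file, decoding it with an explicitly chosen encoding."""
    with open(path, encoding=encoding) as f:
        return f.read()


# Hypothetical sample data: '丁' is b'\xe4\xb8\x81' in UTF-8, and cp1252 has
# no character assigned to byte 0x81.
sample = "ab丁"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test2.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Decoding with the codec the file was written in round-trips the text.
    utf8_text = read_text(path, encoding="utf-8")

    # Decoding with a legacy Windows code page fails on byte 0x81.
    try:
        read_text(path, encoding="cp1252")
        cp1252_error = None
    except UnicodeDecodeError as exc:
        cp1252_error = exc
```

For the code in the question, the equivalent change would be `open("test2.txt", encoding="utf-8")`, provided UTF-8 is in fact the file's encoding.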

Related

Can't read data using read_csv due to encoding errors

So, I am facing a huge issue. I am trying to read a csv file which uses '|' as the delimiter. If I use utf-8 or utf-8-sig as the encoding, I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 0: invalid start byte
but if I use the unicode_escape encoding, I get this error:
UnicodeDecodeError: 'unicodeescape' codec can't decode byte 0x5c in position 13: \ at end of string
Is it an issue with the dataset?
It worked after I re-saved the file using 'Save with Encoding' -> UTF-8 in Sublime Text. I think the data had some issues.

'utf-8' codec can't decode byte 0xa0 in position 24: invalid start byte

I am trying to read a csv file using the following lines of Python code:
crimes = pd.read_csv('C:/Users/usuario1/Desktop/python/csv/001 Boston crimes/crime.csv', encoding = 'utf8')
crimes.head(5)
But I am getting a decode error as follows:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 24: invalid start byte
What is going wrong?
Maybe your file is not UTF-8 encoded, or it contains bytes that are not valid UTF-8. You can try other encodings such as ISO-8859-1, but it is best to check your file's encoding first. To do so, something like the following should work:
with open('Your/file/path') as f:
    print(f)
This prints the file object's details, including an encoding. Note that this is the encoding Python picked for opening the file (the platform default unless you pass encoding=), not necessarily the encoding the file was written in.
Or you can just open the csv in an editor; File -> Save As should show its encoding.
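If those don't help, you can skip the rows that are causing problems by using `error_bad_lines=False` (deprecated in pandas 1.3 and removed in 2.0 in favor of `on_bad_lines='skip'`). This skips rows with a malformed number of fields, so it is unlikely to get past a decoding error on its own.
crimes = pd.read_csv('Your/file/path', encoding='utf8', error_bad_lines=False)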
Hope these will help
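One way to "check the encoding first" is to try a few candidate codecs against the raw bytes. This is a rough sketch, not a reliable detector: the function name `guess_encoding` and the candidate list are assumptions, and latin-1 is placed last because it accepts any byte sequence and therefore always "succeeds".

```python
def guess_encoding(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate codec that decodes raw without error, else None."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Byte 0xa0 is not a valid UTF-8 start byte, but cp1252 maps it to a
# no-break space, so cp1252 is the first candidate that fits.
print(guess_encoding(b"DISTRICT\xa0A1"))
```

The returned name could then be passed as `encoding=` to `pd.read_csv`. A successful decode only means the bytes are valid in that codec, not that the resulting text is what the author intended, so it is still worth inspecting a few rows.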

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1010494: character maps to <undefined>

Please, I need help with this:
import requests
url = 'https://www.sec.gov/Archives/edgar/data/1437750/0001477932-13-004416.txt'
with open('file', 'wb') as f:
    f.write(requests.get('%s' % url).content)
with open('file', 'r') as t:
    words = t.read()
The above gives me the following error:
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1010494: character maps to <undefined>
Thank you!
I just experienced the same problem. When I was trying to read the file, one of my strings had a double space: '  '. Removing that double space fixed the 0x9d problem.
Why are you writing the file as binary and then reading it back as text? Python doesn't know how to decode the bytes from the original stream until you tell it which codec to use. Since the file you downloaded in the first step evidently cannot be decoded with your platform's default codec, try decoding it as latin-1 when reading it:
with open('file', 'r', encoding='latin-1') as t:
    words = t.read()
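It may be worth trying UTF-8 before latin-1 here. Byte 0x9d is undefined in cp1252 (hence the 'charmap' error), but it is also the last byte of the UTF-8 encoding of the right double quotation mark, which is common in documents like this one. Whether the downloaded file is actually UTF-8 is an assumption, not something the question confirms; the sketch below only shows how the codecs treat the same sample bytes.

```python
# UTF-8 bytes for a word followed by a right double quotation mark (U+201D).
raw = "done\u201d".encode("utf-8")   # b'done\xe2\x80\x9d'

# Decoding with the codec the bytes were written in recovers the text.
as_utf8 = raw.decode("utf-8")

# latin-1 never raises, but each byte becomes its own character (mojibake).
as_latin1 = raw.decode("latin-1")

# errors="replace" substitutes U+FFFD for bytes the codec cannot map,
# instead of raising UnicodeDecodeError.
as_cp1252 = raw.decode("cp1252", errors="replace")
```

When reading from a file, the same options are available as `open('file', encoding='utf-8', errors='replace')`.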

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2860: ordinal not in range(128)

I have the following record from JSON file that is giving me the error-
{"categoryId":"mpc-pc-optimization",
"categoryName":"PC Optimization",
"productMap":
{"mpp-aol-computer-checkup":"AOL Computer Checkup®",
"mpp-assist-by-aol-free-scan":"Assist by AOL Free Scan",
"mpp-mybenefits":"Monthly Statement of Benefits",
"mpp-perfectspeed":"PerfectSpeed",
"mpp-system-checkup":"System Checkup™","mpp-system-mechanic":"System Mechanic®"}}
The highlighted portion is causing the error.
How do I fix it?
The error comes from the ™ (trademark symbol), which is not an ASCII character.
The byte 0xe2 is 11100010 in binary (226), which is outside the ASCII range of 0 to 127 (01111111 in binary).
The problem is that you are trying to decode with ASCII when you should decode with a Unicode encoding (e.g. UTF-8).
You could use a try/except block to catch the exception and then handle it by decoding as UTF-8.
# Python 2: unicode() decodes a byte string with the given codec
try:
    value = unicode(my_json_string, "ascii")
except UnicodeError:
    value = unicode(my_json_string, "utf-8")
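In Python 3 the same idea is usually expressed by decoding the bytes as UTF-8 before parsing. A minimal sketch, using a shortened copy of the record from the question (the variable names are illustrative, and it assumes the file is UTF-8, which byte 0xe2 in front of a trademark sign suggests but does not prove):

```python
import json

# Shortened record from the question, as the UTF-8 bytes a file would contain.
raw = '{"productMap": {"mpp-system-checkup": "System Checkup™"}}'.encode("utf-8")

# The trademark sign (U+2122) is three bytes in UTF-8; the first is 0xe2.
tm_bytes = "™".encode("utf-8")

# ASCII only covers bytes 0x00-0x7f, so decoding these bytes as ASCII fails.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding as UTF-8 first gives json.loads a proper text string.
record = json.loads(raw.decode("utf-8"))
product_name = record["productMap"]["mpp-system-checkup"]
```

When the JSON lives in a file, opening it with `open(path, encoding="utf-8")` and passing the file object to `json.load` avoids the manual decode step.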

Unicode Using sqlite3 in Python 2.7.3

I'm trying to insert into a table, but it seems that the file I opened has non-ascii characters in it. This is the error I got:
sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.
So after doing some research, I tried putting this in my code:
encode("utf8","ignore")
Which then gave me this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 9: ordinal not in range(128)
So then I tried using the codecs library and opening the file like this:
codecs.open(fileName, encoding='utf-8')
which gave me this error:
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 0: invalid start byte
Then instead of utf-8, I used utf-16 to see if that would do anything and I got this error:
raise UnicodeError,"UTF-16 stream does not start with BOM"
UnicodeError: UTF-16 stream does not start with BOM
I'm all out of ideas...
Also I'm using Ubuntu, if it helps.
