'utf-8' codec can't decode byte 0x89

I want to read a csv file and process some columns but I keep getting issues.
Stuck with the following error:
Traceback (most recent call last):
  File "C:\Users\Sven\Desktop\Python\read csv.py", line 5, in <module>
    for row in reader:
  File "C:\Python34\lib\codecs.py", line 313, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 446: invalid start byte
>>>
My code:
import csv
with open("c:\\Users\\Sven\\Desktop\\relaties 24112014.csv", newline='', encoding="utf8") as f:
    reader = csv.reader(f, delimiter=';', quotechar='|')
    #print(sum(1 for row in reader))
    for row in reader:
        print(row)
        if row:
            value = row[6]
            value = value.replace('(', '')
            value = value.replace(')', '')
            value = value.replace(' ', '')
            value = value.replace('.', '')
            value = value.replace('0032', '0')
            if len(value) > 0:
                print(value + ' Length: ' + str(len(value)))
I'm a beginner with Python, tried googling, but hard to find the right solution.
Can anyone help me out?

The first byte of a .PNG file is 0x89. Not saying that is your problem, but the .PNG header is specifically designed so that it is NOT accidentally interpreted as text.
Why you would have a .csv file that is actually a .png, I don't know. But it definitely could happen if someone accidentally renamed the file. On Windows 10, every once in a while I mass-rename files by accident because of their stupid checkbox feature. Why Microsoft decided that giving desktop machines the same UI controls as tablets was a good idea, I don't know.
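One quick way to rule the PNG theory in or out is to look at the first bytes of the file in binary mode. This is only a minimal sketch: the helper names are made up for illustration, and the only fact it relies on is the fixed 8-byte PNG signature.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the fixed 8-byte header of every PNG file


def looks_like_png(first_bytes: bytes) -> bool:
    """Return True if the given leading bytes match the PNG signature."""
    return first_bytes.startswith(PNG_SIGNATURE)


def file_looks_like_png(path: str) -> bool:
    """Read only the first 8 bytes of the file, in binary mode, and compare."""
    with open(path, "rb") as f:
        return looks_like_png(f.read(len(PNG_SIGNATURE)))


# Hypothetical samples: a PNG header versus the start of a semicolon-separated CSV
print(looks_like_png(b"\x89PNG\r\n\x1a\n\x00\x00"))
print(looks_like_png(b"name;phone\n"))
```

Note that the traceback in the question reports the bad byte at position 446, not 0, so a check like this is mainly a way to exclude the renamed-file explanation.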

This is the most important clue:
invalid start byte
\x89 is not, as suggested in the comments, an invalid UTF-8 byte. It is a completely valid continuation byte, meaning that if it follows the right lead byte, it decodes as UTF-8 correctly:
http://hexutf8.com/?q=0xc90x89
So either (1) you do not have UTF-8 data as you expect, or (2) you have some malformed UTF-8 data. The Python codec is simply letting you know that it encountered \x89 at a position where a start byte was expected.
(More on continuation bytes here: http://en.wikipedia.org/wiki/UTF-8#Codepage_layout)
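The difference between "valid continuation byte" and "invalid start byte" can be reproduced in a few lines. This is just an illustration using the same two bytes as the link above; the variable names are arbitrary.

```python
# 0xc9 is a lead byte that announces one continuation byte, and 0x89 fills that slot.
valid = b"\xc9\x89".decode("utf-8")

# On its own, 0x89 sits where a start byte is expected, so decoding fails.
try:
    b"\x89".decode("utf-8")
    reason = None
except UnicodeDecodeError as exc:
    reason = exc.reason

print(repr(valid), "|", reason)
```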

I was also getting a similar error when trying to read or upload the following kinds of files:
CSV File
JPEG File
PNG File
Zip File
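The best way to avoid errors like:
'utf-8' codec can't decode byte 0x89
'utf-8' codec can't decode byte 0xff
is to read these files as bytes. When you treat them as bytes, you do not need to provide any encoding. So when you open them you should specify:
with open(file_path, 'rb') as file:
Or in your case, the code should be something like the following. Note that 'rb' has to come before any keyword arguments and cannot be combined with newline='', and that csv.reader in Python 3 needs text, so the byte stream still has to be decoded with the file's real encoding ('latin-1' below is only a placeholder assumption):
import csv
import io
with open("c:\\Users\\Sven\\Desktop\\relaties 24112014.csv", 'rb') as f:
    text = io.TextIOWrapper(f, encoding='latin-1', newline='')  # placeholder encoding
    reader = csv.reader(text, delimiter=';', quotechar='|')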
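When the real encoding of a CSV is unknown, one possible approach is to read the raw bytes and try a short list of candidate encodings in order. The sketch below is an assumption-laden illustration, not a fix confirmed for the file in the question: the function name, the candidate list, and the sample data are all hypothetical.

```python
import csv
import io


def read_csv_rows(raw: bytes, encodings=("utf-8", "cp1252"), delimiter=";"):
    """Decode raw CSV bytes with the first encoding that works, then parse them."""
    for name in encodings:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            continue  # this guess was wrong, try the next candidate
        # newline="" leaves line endings untouched so the csv module can handle them
        reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
        return name, list(reader)
    raise ValueError("none of the candidate encodings could decode the data")


# Hypothetical sample: a lone 0x89 is not valid UTF-8, but cp1252 maps it to U+2030
sample = b"name;note\r\nSven;5\x89\r\n"
used, rows = read_csv_rows(sample)
print(used, rows)
```

A successful decode only proves that the bytes are legal in that encoding, not that the guess is right, so the resulting text is still worth checking by eye.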

Related

pd.read_excel throws UnicodeDecodeError

I am trying to read data from Excel into pandas. The file comes from an API and is not saved (access to the file needs special permissions, so I don't want to save it). When I try to read the Excel data from a file object
with open('path_to_file') as file:
    re = pd.read_excel(file)
I get the error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x98 in position 10: invalid start byte
When I pass the path in place of the file object, everything works fine
re = pd.read_excel('path-to-exactly-the-same-file')
Is there a way to read Excel data with pandas without saving the file and passing a path?

The part that was missing was 'rb' in open
with open('path_to_file', 'rb') as file:
    re = pd.read_excel(file)
to treat the file as binary. The idea is taken from the question "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte".
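The reason binary mode matters is that an .xlsx workbook is a ZIP container, not text. The sketch below uses only the standard library and a stand-in archive built in memory (the member name and payload are invented), to show that such bytes fail to decode as UTF-8 yet work fine as a binary file-like object.

```python
import io
import zipfile


def make_fake_workbook_bytes() -> bytes:
    """Build a tiny ZIP archive in memory, standing in for bytes returned by an API."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:  # default ZIP_STORED keeps data raw
        archive.writestr("sheet.bin", b"\x98\xff binary payload")
    return buffer.getvalue()


def decodes_as_utf8(data: bytes) -> bool:
    """Return True only if the bytes form valid UTF-8 text."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


payload = make_fake_workbook_bytes()
text_ok = decodes_as_utf8(payload)          # what text mode effectively attempts
binary_file = io.BytesIO(payload)           # binary file-like object, no file on disk
is_zip = zipfile.is_zipfile(binary_file)
print(text_ok, is_zip)
```

A binary file-like object such as the `io.BytesIO` above is the kind of input that `pd.read_excel` is documented to accept in place of a path, which would avoid saving the API response to disk. That step is not exercised here, since it needs pandas and an Excel engine installed.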

'utf-8' codec can't decode byte 0xa0 in position 24: invalid start byte

I am trying to read a csv file using the following lines of Python code:
crimes = pd.read_csv('C:/Users/usuario1/Desktop/python/csv/001 Boston crimes/crime.csv', encoding = 'utf8')
crimes.head(5)
But I am getting a decode error as follows:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 24: invalid start byte
What is going wrong?

Maybe your file is not encoded in UTF-8, or it contains bytes that are not valid UTF-8. You can try other encodings like ISO-8859-1. But it is best to check your file encoding first. To do so, something like the following should work:
with open('Your/file/path') as f:
    print(f)
This prints the file object's details, including the encoding Python is using to read it. Keep in mind that this is the platform default rather than an encoding detected from the file's contents.
Or you can just open the CSV in an editor, and File -> Save As should show its encoding.
If those don't help, you can skip the rows that are causing problems by using `error_bad_lines=False`:
crimes = pd.read_csv('Your/file/path', encoding='utf8', error_bad_lines=False)
(In pandas 1.3 and later this option is deprecated in favor of on_bad_lines='skip'. It skips malformed rows, so it may not help when the failure happens during decoding.)
Hope this helps.
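Before switching encodings blindly, it can help to look at the exact byte that UTF-8 rejects and see what it would mean in a few candidate encodings. The following is a rough sketch: the function name, the candidate list, and the sample header line are assumptions for illustration, not data from the actual crime.csv.

```python
def describe_decode_failure(data: bytes, candidates=("cp1252", "iso-8859-1")):
    """Return (position, offending bytes, guesses) for the first UTF-8 failure, or None."""
    try:
        data.decode("utf-8")
        return None  # the data is valid UTF-8, nothing to report
    except UnicodeDecodeError as exc:
        bad = data[exc.start:exc.end]
        guesses = {}
        for name in candidates:
            try:
                guesses[name] = bad.decode(name)
            except UnicodeDecodeError:
                guesses[name] = None  # this encoding cannot represent the byte either
        return exc.start, bad, guesses


# Hypothetical header line containing a 0xa0 byte (a non-breaking space in cp1252)
sample = b"INCIDENT_NUMBER,OFFENSE\xa0CODE\n"
position, bad, guesses = describe_decode_failure(sample)
print(position, bad, guesses)
```

If the byte turns out to be something plausible in one of the candidates, such as a non-breaking space, passing that encoding to `pd.read_csv` is a reasonable next thing to try.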

Python 3 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

I'm implementing this notebook on Windows with Python 3.5.3 and got the following error on the load_vectors() call. I've tried different posted solutions but none worked.
<ipython-input-86-dd4c123b0494> in load_vectors(loc)
      1 def load_vectors(loc):
      2     return (load_array(loc+'.dat'),
----> 3         pickle.load(open(loc+'_words.pkl','rb')),
      4         pickle.load(open(loc+'_idx.pkl','rb')))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

I solved this issue by copying and pasting the entire csv file into text and reading it with:
with open(self.path + "/review_collection.txt", "r", encoding="utf-8") as f:
    read = f.read().splitlines()
    for row in read:
        print(row)

You should probably pass an encoding, as in pickle.load(f, encoding='latin1'), but please make sure all the characters in your file follow that encoding.
By default, pickle.load tries to decode the 8-bit strings in the file (as written by Python 2) with 'ASCII', which fails. Instead you can explicitly tell it which encoding to use. See the pickle documentation.
If latin1 doesn't solve it, try encoding='bytes' and then decode all the keys and values later on.
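The behavior of the encoding argument can be reproduced without the original .pkl files. The sketch below uses a tiny hand-written pickle that mimics what Python 2 would produce for a byte string. That payload is an assumption made up for the demo, not the notebook's data.

```python
import pickle

# Hand-written protocol-1 pickle of the Python 2 byte string '\xe2\x80\x99':
# opcode 'U' (SHORT_BINSTRING), a length byte, the 3 payload bytes, then '.' (STOP).
py2_pickle = b"U\x03\xe2\x80\x99."

try:
    pickle.loads(py2_pickle)  # default encoding is 'ASCII', so 0xe2 is rejected
    default_error = None
except UnicodeDecodeError as exc:
    default_error = exc

as_latin1 = pickle.loads(py2_pickle, encoding="latin1")  # always decodes, may be mojibake
as_bytes = pickle.loads(py2_pickle, encoding="bytes")    # keep raw bytes, decode later
as_text = as_bytes.decode("utf-8")                       # assumes the bytes are UTF-8

print(type(default_error).__name__, repr(as_latin1), as_bytes, repr(as_text))
```

In this example latin1 loads without error but yields three unrelated characters, while decoding the raw bytes as UTF-8 recovers a single right single quote, which is why the choice of encoding still matters after the error disappears.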

I got the same error as well. I realized that I had copied and pasted text from a file that had left and right double quotes (curly quotes). Once I changed them to the standard straight double quotes ("), the issue was fixed!
See this link for the difference between the quotes: https://www.cl.cam.ac.uk/~mgk25/ucs/quotes.html
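Curly quotes explain the 0xe2 in the error message: in UTF-8 they are encoded as three bytes beginning with 0xe2. A small sketch of replacing them with straight quotes follows. The mapping table and function name are illustrative choices, not part of any library.

```python
# Map the four common curly quotes to their straight ASCII equivalents.
CURLY_TO_STRAIGHT = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # left and right single quotes
    "\u201c": '"', "\u201d": '"',   # left and right double quotes
})


def straighten_quotes(text: str) -> str:
    """Replace curly quotes with straight ones, leaving everything else alone."""
    return text.translate(CURLY_TO_STRAIGHT)


curly = "\u201cIt\u2019s\u201d"
first_byte = curly.encode("utf-8")[0]      # the byte the 'ascii' codec complains about
straight = straighten_quotes(curly)
ascii_bytes = straight.encode("ascii")     # now succeeds, the text is pure ASCII

print(hex(first_byte), straight)
```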

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128), regarding reading in files

I always get this error UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128) whenever I try to read a file into my Python program that has an 's. For example, the word "It's" would crash my program and I would get this error. Why does it do this?
def readInFile(fileName):
    inputFile = open(fileName, 'r')
    SomeInput = inputFile.read()
    inputFile.close()
    return SomeInput

I'm in a Python class right now and kept running into the same problem the other night when doing exercises involving file IO. It wasn't a problem if I created the text file using IDLE and saved it as a .txt file instead of .py. I believe it has to do with the encoding used by whatever program created the file not matching the codec Python uses to read it. Many editors and word processors replace the straight ' with a curly apostrophe (’), which is not an ASCII character; in UTF-8 it is stored as three bytes starting with 0xe2, which the 'ascii' codec cannot decode. My suggestion is to start a new file from IDLE (or whatever plain-text editor you're using) and put your text there to create the file.
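An alternative to recreating the file is to tell Python which encoding to use when reading it. The sketch below is a variation on the question's function, assuming the file was saved as UTF-8. The temporary demo file and its contents are invented for the example.

```python
import os
import tempfile


def read_in_file(file_name: str, encoding: str = "utf-8") -> str:
    """Read a whole text file, decoding it with an explicit encoding."""
    with open(file_name, "r", encoding=encoding) as input_file:
        return input_file.read()


# Hypothetical demo file containing a curly apostrophe saved as UTF-8
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo.txt")
    with open(path, "wb") as f:
        f.write("It\u2019s fine".encode("utf-8"))

    content = read_in_file(path)  # decodes cleanly as UTF-8
    try:
        read_in_file(path, encoding="ascii")  # reproduces the error from the question
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True

print(repr(content), ascii_failed)
```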

Recode bytes which cannot be decoded in utf-8 in python

I'm reading in from txt files, and there is one byte that is causing decoding issues:
with open(input_filename_and_director, 'rb') as f:
    r = unicodecsv.reader(f, delimiter="|")
This results in an error message:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 26: invalid continuation byte
Is there any way to specify how I want these bytes handled (i.e. to read this byte in as another character)?

Depending upon what you want, try using unicodecsv.reader(f, delimiter="|", errors='replace') or unicodecsv.reader(f, delimiter="|", errors='ignore'). unicodecsv passes through the errors parameter to the unicode encoding. See the help for unicode or here for more information.
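The effect of the two error handlers can be seen with plain bytes.decode from the standard library, which takes the same 'replace' and 'ignore' values. The sample line below is made up: it places a 0xc3 lead byte directly before the "|" delimiter, which triggers an "invalid continuation byte" error like the one in the question.

```python
# 0xc3 announces a two-byte sequence, but the next byte is "|" rather than a continuation.
line = b"caf\xc3|x"

try:
    line.decode("utf-8")  # strict mode, the default
    strict_reason = None
except UnicodeDecodeError as exc:
    strict_reason = exc.reason

replaced = line.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD
ignored = line.decode("utf-8", errors="ignore")    # bad byte is silently dropped

print(strict_reason, repr(replaced), repr(ignored))
```

With 'replace' the damage stays visible in the output, whereas 'ignore' hides it, so 'replace' is usually the safer choice when the data will be inspected later.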
