pd.read_excel throws UnicodeDecodeError

I am trying to read data from Excel into pandas. The file comes from an API and is not saved to disk (access to it needs special permissions, so I don't want to save it). When I try to read the Excel data from a file object
with open('path_to_file') as file:
    re = pd.read_excel(file)
I get the error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x98 in position 10: invalid start byte
When I pass the path in place of the file object, everything works fine
re = pd.read_excel('path-to-exactly-the-same-file')
Is there a way to read Excel data with pandas without saving the file and passing a path?

The part that was missing was 'rb' in open, which makes Python treat the file as binary:
with open('path_to_file', 'rb') as file:
    re = pd.read_excel(file)
Idea taken from the question "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte".
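To illustrate why the mode matters, here is a minimal sketch that uses only the standard library. The payload is a made-up stand-in for binary spreadsheet content, and the names `payload`, `raw` and `buffer` are hypothetical. It also shows that bytes already held in memory (for example an API response body) can be wrapped in io.BytesIO to get a file-like object, so nothing has to be saved to disk.

```python
import io
import os
import tempfile

# Hypothetical stand-in for binary spreadsheet content.
# 0x98 cannot start a UTF-8 sequence, so decoding it as text fails.
payload = b"PK\x03\x04" + bytes([0x98, 0x00, 0xFF])

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    # Text mode decodes while reading, so binary content raises.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        text_mode_failed = False
    except UnicodeDecodeError:
        text_mode_failed = True

    # Binary mode hands back the bytes untouched.
    with open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)

# Bytes that are already in memory can be wrapped in a file-like object.
buffer = io.BytesIO(raw)
# With pandas and an Excel engine installed (an assumption here), a real
# workbook's bytes could then be passed on as: pd.read_excel(buffer)
```

In the original question the text-mode open is what triggered the decode error; pd.read_excel never got a chance to look at the data.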

Related

Trying to load a csv file which is encoded binarily in python

I am trying to load a csv file which is encoded binarily in python. When using pd.read_csv(), I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 16: invalid start byte
I have tried adding "encoding = 'utf-8'" and tried adding delimiters but that did not help.

How to find the character which is throwing UnicodeDecodeError when reading any file with 'utf-8' encoding

When reading a file with pandas read_csv, I got a UnicodeDecodeError.
Code:
df = pd.read_csv("file_name.csv", sep='|')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 0: invalid start byte
How do I find out which character is throwing the error, and at which location in the file?

If the whole file is small enough to read into memory, you can read it in binary mode and decode it yourself. The error message will then tell you the exact byte offset.
with open("file_name.csv", "rb") as f:
    f.read().decode("utf-8")
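Building on that idea, the exception object itself carries the offset, so the offending byte and its line can be reported programmatically. This is only a sketch: `find_bad_byte` and the sample data are hypothetical names made up for illustration.

```python
def find_bad_byte(data, encoding="utf-8"):
    """Return (offset, byte value, 1-based line number) of the first
    undecodable byte, or None if the data decodes cleanly."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the byte offset where decoding failed.
        line_no = data.count(b"\n", 0, exc.start) + 1
        return exc.start, data[exc.start], line_no
    return None


# Made-up sample: a pipe-separated file with a stray 0xa0 on line 2.
sample = b"id|name\n1|caf\xa0e\n"
result = find_bad_byte(sample)
if result is not None:
    offset, value, line_no = result
    print(f"byte {value:#04x} at offset {offset}, line {line_no}")
```

For a real file, the bytes would come from open("file_name.csv", "rb").read() instead of the sample literal.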

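read_csv takes an encoding option for files that are not UTF-8. Since "utf-8" is already the default, passing it explicitly will not make the error go away; pass the encoding the file was actually written in. A byte 0xa0 is a non-breaking space in Latin-1 and Windows-1252, so one of those is a plausible guess here:
import pandas as pd
# "latin-1" is an assumption; substitute the file's real encoding
df = pd.read_csv("file_name.csv", sep='|', encoding="latin-1")
Alternatively, you can also try the Python parsing engine, though it is not a substitute for specifying the right encoding:
df = pd.read_csv('file_name.csv', engine='python')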
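When the encoding is unknown, one common approach is to try a short list of candidates in order. The sketch below does this on raw bytes with the standard library only; `decode_with_fallback` and the candidate list are assumptions for illustration, not something pandas provides. Note that "latin-1" accepts every byte, so it never fails and should only come last, if at all.

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Try each candidate encoding in order and return (text, encoding)."""
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Made-up sample: 0xe9 is 'é' in cp1252 but is not valid UTF-8 here.
text, used = decode_with_fallback(b"caf\xe9|10\n")
print(used, repr(text))
```

A successful decode does not prove the guess is right, only that the bytes are legal in that encoding, so it is worth eyeballing the result.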

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 4: invalid start byte

I exported a csv file from Microsoft Excel. It showed properly in Jupyter notebook with pandas and numpy as below:
import pandas as pd
pd1 = pd.read_csv('test1.csv', encoding='utf-8')
There were no error messages the first time. But after I opened the csv file and saved it under a new name, I got this UnicodeDecodeError every time:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 4: invalid start byte
The data has strange letters as shown below. Even if there were strange letters, there was no problem at first.
2 columns, 6 rows
I have to handle all languages, so I really want to know how to encode them. How to solve this problem?

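When you use Save As, the dialog lets you select the encoding format (in recent versions of Excel, for example, the file type "CSV UTF-8 (Comma delimited)").
Try saving the file that way and see if it works. 👍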
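If re-saving from Excel is not an option, the same conversion can be sketched in Python. This assumes you already know the source encoding; cp1252 below is only a guess that fits a byte like 0xb4, and `reencode_to_utf8` plus the file names are hypothetical.

```python
import os
import tempfile


def reencode_to_utf8(src_path, dst_path, source_encoding):
    """Read a text file in its original encoding and write it back as UTF-8."""
    # newline="" keeps line endings exactly as they are in the source.
    with open(src_path, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Demo on a throwaway file containing the bytes 0xe9 and 0xb4.
with tempfile.TemporaryDirectory() as workdir:
    src = os.path.join(workdir, "test1.csv")
    dst = os.path.join(workdir, "test1_utf8.csv")
    with open(src, "wb") as f:
        f.write("name,note\nJos\u00e9,d\u00b4accord\n".encode("cp1252"))
    reencode_to_utf8(src, dst, "cp1252")
    with open(dst, "rb") as f:
        converted = f.read()
```

After the conversion, reading the new file with encoding='utf-8' should no longer raise.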

'utf-8' codec can't decode byte 0xa0 in position 24: invalid start byte

I am trying to read a csv file using the following lines of Python code:
crimes = pd.read_csv('C:/Users/usuario1/Desktop/python/csv/001 Boston crimes/crime.csv', encoding = 'utf8')
crimes.head(5)
But I am getting a decode error as follows:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 24: invalid start byte
What is going wrong?

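Maybe your file is not encoded as UTF-8, or it contains bytes that are not valid UTF-8. You can try other encodings such as ISO-8859-1, but it is best to check your file's encoding first. Note that opening the file and printing the file object only shows the encoding Python will use to decode it (a platform default unless you pass one), not the encoding the file was written in:
with open('Your/file/path') as f:
    print(f)
Alternatively, open the csv in an editor; File -> Save As should show its encoding.
If those don't help, you can tell pandas how to handle undecodable bytes. The error_bad_lines=False option skips rows with the wrong number of fields, not undecodable bytes, and it was deprecated in pandas 1.3 in favor of on_bad_lines='skip'. From pandas 1.3 on, encoding_errors='replace' substitutes a placeholder character for each byte that cannot be decoded:
crimes = pd.read_csv('Your/file/path', encoding='utf8', encoding_errors='replace')
Hope these will help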
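One cheap check that sometimes settles the question is to look for a byte order mark (BOM) at the start of the file. A 0xff or 0xfe at the very beginning is often a UTF-16 BOM, for instance. The sketch below uses the BOM constants from the standard codecs module; `sniff_bom` is a hypothetical helper, and many files have no BOM at all, in which case it returns None and tells you nothing.

```python
import codecs


def sniff_bom(data):
    """Guess an encoding name from a leading byte order mark, if any."""
    # UTF-32 is checked first because its little-endian BOM
    # begins with the same two bytes as the UTF-16 one.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return None


# Made-up samples standing in for the first bytes of a file.
utf16_sample = codecs.BOM_UTF16_LE + "a;b".encode("utf-16-le")
guess = sniff_bom(utf16_sample)
print(guess)
```

The returned name can be passed as the encoding argument when opening the file; for a real file, reading the first four bytes in 'rb' mode is enough input.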

'utf-8' codec can't decode byte 0x89

I want to read a csv file and process some columns but I keep getting issues.
Stuck with the following error:
Traceback (most recent call last):
  File "C:\Users\Sven\Desktop\Python\read csv.py", line 5, in <module>
    for row in reader:
  File "C:\Python34\lib\codecs.py", line 313, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 446: invalid start byte
>>>
My Code
import csv
with open("c:\\Users\\Sven\\Desktop\\relaties 24112014.csv",newline='', encoding="utf8") as f:
    reader = csv.reader(f,delimiter=';',quotechar='|')
    #print(sum(1 for row in reader))
    for row in reader:
        print(row)
        if row:
            value = row[6]
            value = value.replace('(', '')
            value = value.replace(')', '')
            value = value.replace(' ', '')
            value = value.replace('.', '')
            value = value.replace('0032', '0')
            if len(value) > 0:
                print(value + ' Length: ' + str(len(value)))
I'm a beginner with Python, tried googling, but hard to find the right solution.
Can anyone help me out?

The first byte of a .PNG file is 0x89. I'm not saying that is your problem, but the .PNG header is specifically designed so that it is NOT accidentally interpreted as text.
Why you would have a .csv file that is actually a .png, I don't know. But it definitely could happen if someone accidentally renamed the file. On Windows 10, every once in a while I mass-rename files by accident because of their stupid checkbox feature. Why Microsoft decided that desktop machines having identical UI controls to tablets was a good idea... I don't know.
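Ruling this out is quick: the PNG signature is a fixed eight-byte sequence, so comparing it against the first bytes of the file is enough. A minimal sketch follows, with `looks_like_png` as a hypothetical helper name.

```python
# The eight-byte signature that every PNG file starts with.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(data):
    """Return True if the given leading bytes carry the PNG signature."""
    return data.startswith(PNG_SIGNATURE)


# Made-up samples standing in for the first bytes of two files.
renamed_image = PNG_SIGNATURE + b"\x00\x00\x00\rIHDR"
real_csv = b"naam;telefoon\n"
print(looks_like_png(renamed_image), looks_like_png(real_csv))
```

For a file on disk, open it in 'rb' mode and pass the result of read(8). In the question the failing byte is at position 446 rather than 0, so a renamed PNG is less likely there, but the check costs nothing.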

This is the most important clue:
invalid start byte
\x89 is not, as suggested in the comments, an invalid UTF-8 byte. It is a completely valid continuation byte, meaning that if it follows the correct lead byte, the pair forms valid UTF-8:
http://hexutf8.com/?q=0xc90x89
So either you (1) do not have UTF-8 data as you expect, or (2) you have some malformed UTF-8 data. The Python codec is simply letting you know that it encountered \x89 at a position where a start byte was expected.
(More on continuation bytes here: http://en.wikipedia.org/wiki/UTF-8#Codepage_layout)
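The same point can be checked directly in Python. In this small sketch, `is_continuation_byte` is a hypothetical helper that tests for the 10xxxxxx bit pattern that marks continuation bytes.

```python
def is_continuation_byte(value):
    """True if the byte has the 10xxxxxx pattern of a UTF-8 continuation byte."""
    return (value & 0b1100_0000) == 0b1000_0000


# After a suitable lead byte, 0x89 is perfectly valid UTF-8.
pair = bytes([0xC9, 0x89]).decode("utf-8")

# On its own, where a start byte is expected, it is rejected.
try:
    bytes([0x89]).decode("utf-8")
    reason = None
except UnicodeDecodeError as exc:
    reason = exc.reason

print(hex(ord(pair)), reason)
```

The bytes from the other error messages on this page (0x98, 0xa0, 0xb4) fall in the same 0x80 to 0xBF range, which is why they all produce "invalid start byte".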

I was getting a similar error when trying to read or upload the following kinds of files:
CSV File
JPEG File
PNG File
Zip File
The best way to avoid errors like:
'utf-8' codec can't decode byte 0x89
'utf-8' codec can't decode byte 0xff
is to read these files as bytes. When you treat them as bytes, you do not need to provide any encoding. So when you open them you should specify:
with open(file_path, 'rb') as file:
    data = file.read()
Or in your case, since csv.reader in Python 3 needs text rather than bytes, the binary file has to be wrapped in a decoder. The code should be something like:
import csv
import io
with open("c:\\Users\\Sven\\Desktop\\relaties 24112014.csv", 'rb') as f:
    # cp1252 is an assumed encoding; substitute the file's real one
    text = io.TextIOWrapper(f, encoding='cp1252', newline='')
    reader = csv.reader(text, delimiter=';', quotechar='|')
