Getting UnicodeDecodeError while reading excel in Tornado,Python

I'm using Postman to send an Excel file, which I am reading in Tornado.
Tornado code:
self.request.files['1'][0]['body'].decode()
If I send a .csv file, the above code works.
If I send an .xlsx file, I am stuck with this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 10: invalid start byte
request.files fetches the file, but its type is bytes. To convert bytes to str I've used decode(), which works only for .csv and not for .xlsx.
I tried decode('utf-8') but still no luck.
I've tried searching but didn't find any issue mentioning the 0x87 problem.

The reason is that the .xlsx file is not UTF-8 text. It is a binary format (a ZIP archive of XML parts), so calling decode() on its bytes will not give you usable text. A text file such as a .csv, on the other hand, has to be decoded with the encoding it was originally saved in.
There's no guaranteed way of finding out the encoding of a file programmatically. I'm guessing you're making this application for general users, so you will keep encountering files with different and unexpected encodings.
A good way to deal with this for text uploads is to try several encodings in turn, moving on to the next when one fails. Example:
encodings = ['utf-8', 'iso-8859-1', 'windows-1251', 'windows-1252']
for encoding in encodings:
    try:
        decoded_file = self.request.files['1'][0]['body'].decode(encoding)
    except UnicodeDecodeError:
        # this will run when the current encoding fails
        # just ignore the error and try the next one
        pass
    else:
        # this will run when an encoding passes
        # break the loop
        # it is also a good idea to re-encode the
        # decoded files to utf-8 for your purpose
        decoded_file = decoded_file.encode("utf8")
        break
else:
    # this will run when the for loop ends
    # without successfully decoding the file
    # now you can return an error message
    # to the user asking them to change
    # the file encoding and re upload
    self.write("Error: Unidentified file encoding. Re-upload with UTF-8 encoding")
    return
# when the program reaches here, it means
# you have successfully decoded the file
# and you can access it from `decoded_file` variable
Here's a list of some common encodings: What is the most common encoding of each language?

I faced the same issue and this worked for me.
import io
import pandas as pd
df = pd.read_excel(io.BytesIO(self.request.files['1'][0]['body']))
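This works because the upload is handed to the reader as raw bytes instead of being decoded as text first. As a rough sketch using only the standard library, a handler could check whether the body is an .xlsx-style ZIP package before deciding how to treat it. The helper names looks_like_xlsx and make_fake_xlsx and the sample bodies are made up for illustration; body stands in for self.request.files['1'][0]['body'].

```python
import io
import zipfile


def looks_like_xlsx(body: bytes) -> bool:
    """Return True if the bytes form a ZIP archive with the OOXML manifest."""
    buffer = io.BytesIO(body)
    if not zipfile.is_zipfile(buffer):
        return False
    with zipfile.ZipFile(buffer) as archive:
        # Office Open XML packages carry this manifest entry.
        return "[Content_Types].xml" in archive.namelist()


def make_fake_xlsx() -> bytes:
    """Build a tiny ZIP that only mimics the outer shape of an .xlsx file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
    return buffer.getvalue()


csv_body = b"name,age\nAda,36\n"   # plain text upload
xlsx_body = make_fake_xlsx()       # binary upload

print(looks_like_xlsx(csv_body))
print(looks_like_xlsx(xlsx_body))
print(xlsx_body[:2])               # ZIP data starts with the bytes "PK"
```

If the check passes, the body can go to pd.read_excel through io.BytesIO as above; otherwise it can be decoded as text.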

Try this one, following the suggestions provided here:
self.request.files['1'][0]['body'].decode('iso-8859-1').encode('utf-8')

Related

Python not able to read "–" character from text file

Using Python, I am fetching some text data from an API and storing it in a text file after some transformations and then reading this text file from a different process.
There are no problems while reading data from API, but I am getting this error while reading the text file:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 907: invalid start byte
The byte being read as '0x96' is actually "–" character in API data and this error occurs only when encoding argument is explicitly specified as 'utf-8'. It doesn't occur when encoding is not explicitly passed to open function while opening the text file.
My questions:
Why do we get this error only when encoding is specified? I think, we should get the same error in other case as well since default encoding is also 'UTF-8'. (Please correct me if I am wrong)
Is it possible to resolve this issue without changing the way I am reading the text file? (i.e. Can I make any changes to the stage where I am creating this text file from API data?)
Really appreciate you looking into it. Thanks!

In open(), the default encoding is platform dependent; you can find out the default for your system by checking what locale.getpreferredencoding() returns. This is from the documentation.
For the second part of your question: since you are not getting an error when you do not specify utf-8 as the encoding, you could just pass the value returned by locale.getpreferredencoding() as the encoding.
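Here is a small sketch of both points. It prints the platform default, shows that byte 0x96 is an en dash in Windows-1252 (a likely candidate for the default on a Windows machine, though that is an assumption about the asker's setup), and then writes and reads a file with the encoding stated explicitly on both sides. The file name and sample text are made up.

```python
import locale
import os
import tempfile

# What open() falls back to when no encoding is passed (platform dependent).
print(locale.getpreferredencoding(False))

# Byte 0x96 is an en dash in Windows-1252, but not valid UTF-8 on its own.
dash = b"\x96".decode("cp1252")
print(ascii(dash))

# Possible fix at the writing stage: state the encoding explicitly.
text = "2010\u20132020"  # contains an en dash
path = os.path.join(tempfile.mkdtemp(), "api_data.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# The reading process can then use encoding='utf-8' safely.
with open(path, encoding="utf-8") as f:
    round_trip = f.read()
print(round_trip == text)
```

Writing with an explicit encoding means the reader no longer depends on whatever default the writing machine happened to have.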

You could replace the character in each line of the text, if you are processing it line by line, since 0x96 is treated as a non-printable character.
import re
...
line = re.sub(r'\x96', '-', line)

I keep getting the "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte" error when I write simple code [duplicate]

https://github.com/affinelayer/pix2pix-tensorflow/tree/master/tools
An error occurred when running "process.py" from the above site.
python tools/process.py --input_dir data --operation resize --output_dir data2/resize
data/0.jpg -> data2/resize/0.png
Traceback (most recent call last):
  File "tools/process.py", line 235, in <module>
    main()
  File "tools/process.py", line 167, in main
    src = load(src_path)
  File "tools/process.py", line 113, in load
    contents = open(path).read()
  File "/home/user/anaconda3/envs/tensorflow_2/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
What is the cause of the error?
Python's version is 3.5.2.

Python tries to convert a byte array (a bytes object which it assumes to be a UTF-8-encoded string) to a Unicode string (str). This process is, of course, a decoding according to UTF-8 rules. When it tries this, it encounters a byte sequence which is not allowed in UTF-8-encoded strings (namely this 0xff at position 0).
Since you did not provide any code we could look at, we can only guess at the rest.
From the stack trace we can assume that the triggering action was reading from a file (contents = open(path).read()). I propose recoding it like this:
with open(path, 'rb') as f:
    contents = f.read()
The b in the mode specifier of open() states that the file shall be treated as binary, so contents will remain a bytes object. No decoding attempt will happen this way.
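To make the difference concrete, here is a small sketch that writes a few bytes resembling the start of a JPEG file (such files begin with 0xFF 0xD8) to a temporary file and reads it back both ways. The file name and byte values are stand-ins for the real image, and text mode is forced to UTF-8 so the result does not depend on the platform default.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "0.jpg")
with open(path, "wb") as f:
    f.write(b"\xff\xd8\xff\xe0" + b"\x00" * 8)

# Text mode: Python tries to decode the bytes and fails on 0xff.
try:
    with open(path, encoding="utf-8") as f:
        f.read()
    text_mode_error = None
except UnicodeDecodeError as exc:
    text_mode_error = exc

# Binary mode: the bytes come back untouched.
with open(path, "rb") as f:
    contents = f.read()

print(text_mode_error)
print(type(contents).__name__, contents[:2])
```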

Use this solution if you want to strip out (ignore) the offending characters and get the string back without them. Only use it if you need to strip them, not convert them.
with open(path, encoding="utf8", errors='ignore') as f:
    contents = f.read()
Using errors='ignore', you'll just lose some characters. In my case I didn't care about them, as they seemed to be extra characters originating from bad formatting and programming in the clients connecting to my socket server.
If that applies to you too, it's an easy, direct solution.
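A quick way to see what errors='ignore' actually does is to try it on a bytes value directly. In this sketch the sample bytes are invented: they are Latin-1 style data containing 0xe9 and 0x96, neither of which is valid UTF-8 in this position.

```python
raw = b"caf\xe9 \x96 bar"  # made-up bytes that are not valid UTF-8

# Strict decoding (the default) raises on the first bad byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='ignore' silently drops every byte it cannot decode.
ignored = raw.decode("utf-8", errors="ignore")

print(strict_failed)
print(ascii(ignored))
```

Note that the accented letter and the dash are gone from the result without any warning, which is why this only makes sense when the dropped characters really do not matter.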

Use encoding format ISO-8859-1 to solve the issue.

I had an issue similar to this and ended up using UTF-16 to decode. My code is below.
with open(path_to_file, 'rb') as f:
    contents = f.read()
contents = contents.decode("utf-16").rstrip("\n")
contents = contents.split("\r\n")
This reads the file contents as raw bytes, decodes them as UTF-16, and then separates the result into lines.

I came across this thread when suffering from the same error. After doing some research I can confirm that this is an error that happens when you try to decode a UTF-16 file as UTF-8.
With UTF-16, the first character (2 bytes in UTF-16) is a Byte Order Mark (BOM), which is used as a decoding hint and doesn't appear as a character in the decoded string. This means the first byte will be either FE or FF and the second, the other.
Heavily edited after I found out the real answer
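The BOM is easy to check for before decoding. The following sketch encodes a made-up sample string with Python's generic utf-16 codec (which prepends a BOM), confirms that the first two bytes are one of the two BOM constants from the codecs module, and shows that UTF-8 rejects the data while UTF-16 decodes it.

```python
import codecs

data = "hello".encode("utf-16")  # the generic utf-16 codec prepends a BOM

# The first two bytes are FF FE (little endian) or FE FF (big endian).
has_bom = data[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(data[:2], has_bom)

# Neither 0xFF nor 0xFE can start a UTF-8 sequence, so this fails at position 0.
try:
    data.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

decoded = data.decode("utf-16")  # the BOM is consumed, not returned
print(utf8_failed, decoded)
```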

It simply means that one chose the wrong encoding to read the file.
On Mac, use file -I file.txt to find the correct encoding. On Linux, use file -i file.txt.

I had a similar issue with PNG files, and I tried the solutions above without success.
This one worked for me in Python 3.8:
with open(path, "rb") as f:
    contents = f.read()

use only
base64.b64decode(a)
instead of
base64.b64decode(a).decode('utf-8')
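The reason is that b64decode() already returns bytes, and those bytes are only text if the original payload was text. A small sketch, assuming the payload a is binary data (here the eight-byte PNG signature, chosen just for illustration):

```python
import base64

# Assumed payload: the eight-byte PNG signature, base64-encoded.
a = base64.b64encode(b"\x89PNG\r\n\x1a\n")

raw = base64.b64decode(a)  # bytes: fine for binary data
print(raw)

# Decoding to str is only meaningful if the payload was text to begin with.
try:
    raw.decode("utf-8")
    is_text = True
except UnicodeDecodeError:
    is_text = False
print(is_text)
```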

This is due to a different encoding being used when reading the file. By default, Python decodes the data with a platform-dependent default encoding (often UTF-8), which may not match the file on every platform.
If 'utf-8' does not work, I suggest trying another encoding:
import csv
with open(path, newline='', encoding='cp1252') as csvfile:
    reader = csv.reader(csvfile)
It should work if you change the encoding here. If the above doesn't work for you, you can find other encodings in the standard-encodings list.

If you get a similar error while loading a data frame with pandas, pass the encoding explicitly.
Example:
df = pd.read_csv("File path", encoding='cp1252')

I had this UnicodeDecodeError while trying to read a '.csv' file using pandas.read_csv(). In my case, I could not manage to overcome this issue using other encoder types. But instead of using
pd.read_csv(filename, delimiter=';')
I used:
pd.read_csv(open(filename, 'r'), delimiter=';')
which seems to work fine for me.
Note: in open(), use 'r' instead of 'rb'. 'rb' returns bytes objects, which leads to the same decoding error inside read_csv(). 'r' returns str, which is what we need since the data is a .csv; open() decodes the file with its default (platform-dependent) encoding, and read_csv() can then parse the text directly.

If you are receiving data from a serial port, make sure you are using the right baud rate (and the other settings): decoding as UTF-8 with the wrong configuration will generate the same error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
To check your serial port configuration on Linux, use: stty -F /dev/ttyUSBX -a

I had a similar issue and searched all over the internet for this problem.
If you have this problem, just copy your HTML code into a new HTML file and use the normal <meta charset="UTF-8">
and it will work.
Just create the new HTML file in the same location under a different name.

Check the path of the file to be read. My code kept on giving me errors until I changed the path name to present working directory. The error was:
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

If you are on a Mac, check whether you have a hidden file, .DS_Store, in the directory. After removing the file my program worked.

I had a similar problem.
Solved it by:
import io
with io.open(filename, 'r', encoding='utf-8') as fn:
    lines = fn.readlines()
However, I had another problem. Some HTML files (in my case) were not UTF-8, so I received a similar error. When I excluded those HTML files, everything worked smoothly.
So, apart from fixing the code, also check the files you are reading from; maybe there is an incompatibility there.

You have to use latin1 as the encoding to read this file, as it contains some special characters.
The problem here is the encoding. When Python can't convert the data being read, it raises an error.
You can use latin1 or other encoding values.
I'd say try and test to find the right one for your dataset.
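One caveat when testing encodings: latin1 maps every possible byte to some character, so it never raises an error, even when it is the wrong choice. The sketch below uses made-up sample bytes that are really cp1251 (a Cyrillic encoding) to show that "no exception" is not the same as "correct text".

```python
# Sample bytes whose real encoding is cp1251: the Russian word "Privet" (hello).
original = "\u041f\u0440\u0438\u0432\u0435\u0442"
raw = original.encode("cp1251")

as_latin1 = raw.decode("latin1")   # never raises: every byte maps to a character
as_cp1251 = raw.decode("cp1251")   # the encoding the bytes were written in

print(ascii(as_latin1))            # accented Latin letters, i.e. mojibake
print(ascii(as_cp1251))            # the Cyrillic text
print(as_latin1 == as_cp1251)
```

So when trying encodings, it helps to look at a few decoded values that contain non-ASCII characters rather than only checking that the read succeeded.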

I had the same issue when processing a file generated on Linux. It turned out to be related to files containing question marks.

Following code worked in my case:
df = pd.read_csv(filename,sep = '\t', encoding='cp1252')

If possible, open the file in a text editor and try to change the encoding to UTF-8. Otherwise do it programmatically at the OS level.

Pandas read_excel: 'utf-8' codec can't decode byte 0xa8 in position 14: invalid start byte

I'm trying to read an MS Excel file, version 2016. The file contains several sheets with data. It was downloaded from a database and opens correctly in MS Office. In the example below I changed the file name.
EDIT: the file contains Russian and English words. It most probably uses the Latin-1 encoding, but encoding='latin-1' does not help.
import pandas as pd
with open('1.xlsx', 'r', encoding='utf8') as f:
    data = pd.read_excel(f)
Result:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa8 in position 14: invalid start byte
Without encoding='utf8':
'charmap' codec can't decode byte 0x9d in position 622: character maps to <undefined>
P.S. The task is to process 52 files, merging the data in every sheet with the corresponding sheets in the other files. So please, no suggestions to do it by hand.

The problem is that the original requester is calling read_excel with a filehandle as the first argument. As demonstrated by the last responder, the first argument should be a string containing the filename.
I ran into this same error using:
df = pd.read_excel(open("file.xlsx",'r'))
but correct is:
df = pd.read_excel("file.xlsx")

Most probably you're using Python3. In Python2 this wouldn't happen.
xlsx files are binary (actually they're an xml, but it's compressed), so you need to open them in binary mode. Use this call to open:
open('1.xlsx', 'rb')
There's no full traceback, but I imagine the UnicodeDecodeError comes from the file object, not from read_excel(). That happens because the stream of bytes can contain anything, but we don't want decoding to happen too soon; read_excel() must receive raw bytes and be able to process them.

Most probably the problem is in the Russian characters.
Charmap is the default decoding method used when no encoding is specified.
As I see it, if utf-8 and latin-1 do not help, then try to read this file not with
pd.read_excel(f)
but with
pd.read_table(f)
or even just
f.readline()
in order to find out which character raises the exception, and then delete that character or characters.

pandas supports an encoding option for reading your Excel file.
In your case you can use:
df = pd.read_excel('your_file.xlsx', encoding='utf-8')
or, if you want something more system-specific without any surprises, you can use:
import sys
df = pd.read_excel('your_file.xlsx', encoding=sys.getfilesystemencoding())

Reading erroneous data from csv file using read_csv from pandas

I am trying to read data from a huge CSV file I have. It is showing me this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 13: invalid start byte. Is there any way to just skip the lines that cause this exception to be thrown? Out of millions of lines these are just a handful, and I can't manually delete them. I tried adding error_bad_lines=False, but that did not solve the problem. I am using Python 3.6.1 that I got through Anaconda 4.4.0. I am also using a Mac, if that helps. Please help me, I am new to this.

Seems to me that there are some non-ascii characters in your file that cannot be decoded. Pandas accepts an encoding as an argument for read_csv (if that helps):
my_file = pd.read_csv('Path/to/file.csv', encoding = 'encoding')
The default encoding is None, which is why you might be getting those errors. Here is a link to the standard Python encodings. Try "ISO-8859-1" (aka 'latin1') or maybe 'utf8' to start.
Pandas does allow you to specify rows to skip when reading a csv, but you would need to know the index of those rows, which in your case would be very difficult.
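If the goal really is to drop only the undecodable lines, one option is to filter them out before parsing, so the row indexes do not need to be known in advance. This is a minimal sketch with the standard library only; the sample bytes are made up, and it assumes that no CSV field contains an embedded newline (otherwise splitting on lines would cut records apart).

```python
import csv
import io

# Made-up sample: the second data row contains byte 0xae, which is invalid UTF-8.
raw = b"id,name\n1,ok\n2,bad\xae\n3,fine\n"

good_lines = []
skipped = []
for number, line in enumerate(io.BytesIO(raw), start=1):
    try:
        good_lines.append(line.decode("utf-8"))
    except UnicodeDecodeError:
        skipped.append(number)  # remember which lines were dropped

rows = list(csv.reader(io.StringIO("".join(good_lines))))
print(rows)
print(skipped)
```

For a real file, io.BytesIO(raw) would be replaced by open(path, 'rb'), and the cleaned text could be handed to pd.read_csv through io.StringIO instead of csv.reader.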

UnicodeDecodeError: 'utf8' codec can't decode byte 0xea [duplicate]

I have a CSV file that I'm uploading via an HTML form to a Python API.
The API looks like this:
import csv
import io
@app.route('/add_candidates_to_db', methods=['GET','POST'])
def add_candidates():
    file = request.files['csv_file']
    x = io.StringIO(file.read().decode('UTF8'), newline=None)
    csv_input = csv.reader(x)
    for row in csv_input:
        print(row)
I found the part of the file that causes the issue. My file has an Í character in it.
I get this error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xea in position 1317: invalid continuation byte
I thought I was decoding it with .decode('UTF8'), or is the error happening before that, in file.read()?
How do I fix this?
Edit: I have control of the file. I am creating the CSV file myself by pulling data (sometimes this data has strange characters).
On the server side, I'm reading each row in the file and inserting it into a database.

Your data is not UTF-8, it contains errors. You say that you are generating the data, so the ideal solution is to generate better data.
Unfortunately, sometimes we are unable to get high-quality data, or we have servers that give us garbage and we have to sort it out. For these situations, we can use less strict error handling when decoding text.
Instead of:
file.read().decode('UTF8')
You can use:
file.read().decode('UTF8', 'replace')
This will make it so that any “garbage” characters (anything which is not correctly encoded as UTF-8) will get replaced with U+FFFD, which looks like this:
�
You say that your file has the Í character, but you are probably viewing the file using an encoding other than UTF-8. Is your file supposed to contain Í, or is it just mojibake? Maybe you can figure out what the character is supposed to be, and from that, you can figure out what encoding your data uses if it's not UTF-8.
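One way to follow up on that last suggestion is to decode the offending byte with a few candidate encodings and see which one produces the character you expect. The sketch below does that for byte 0xea from the error message and the Í reported in the question. The short candidate list is an arbitrary assumption, not an exhaustive search.

```python
suspect = b"\xea"        # the byte named in the error message
expected = "\u00cd"      # the character the asker sees: I with acute accent

# Hypothetical shortlist of single-byte encodings to compare.
candidates = ["latin_1", "cp1252", "cp437", "cp850", "mac_roman"]

matches = []
for encoding in candidates:
    decoded = suspect.decode(encoding, errors="replace")
    print(encoding, ascii(decoded))
    if decoded == expected:
        matches.append(encoding)

print(matches)
```

In this candidate list, mac_roman is the one that maps 0xea to Í. A single byte is weak evidence, though, so it is worth checking a few more non-ASCII characters from the file before settling on an encoding.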

It seems that your file is not encoded in utf8. You can try reading the file with all the encodings that Python understands and check which ones let you read the entire content of the file. Try this script:
from codecs import open
encodings = [
"ascii",
"big5",
"big5hkscs",
"cp037",
"cp424",
"cp437",
"cp500",
"cp720",
"cp737",
"cp775",
"cp850",
"cp852",
"cp855",
"cp856",
"cp857",
"cp858",
"cp860",
"cp861",
"cp862",
"cp863",
"cp864",
"cp865",
"cp866",
"cp869",
"cp874",
"cp875",
"cp932",
"cp949",
"cp950",
"cp1006",
"cp1026",
"cp1140",
"cp1250",
"cp1251",
"cp1252",
"cp1253",
"cp1254",
"cp1255",
"cp1256",
"cp1257",
"cp1258",
"euc_jp",
"euc_jis_2004",
"euc_jisx0213",
"euc_kr",
"gb2312",
"gbk",
"gb18030",
"hz",
"iso2022_jp",
"iso2022_jp_1",
"iso2022_jp_2",
"iso2022_jp_2004",
"iso2022_jp_3",
"iso2022_jp_ext",
"iso2022_kr",
"latin_1",
"iso8859_2",
"iso8859_3",
"iso8859_4",
"iso8859_5",
"iso8859_6",
"iso8859_7",
"iso8859_8",
"iso8859_9",
"iso8859_10",
"iso8859_13",
"iso8859_14",
"iso8859_15",
"iso8859_16",
"johab",
"koi8_r",
"koi8_u",
"mac_cyrillic",
"mac_greek",
"mac_iceland",
"mac_latin2",
"mac_roman",
"mac_turkish",
"ptcp154",
"shift_jis",
"shift_jis_2004",
"shift_jisx0213",
"utf_32",
"utf_32_be",
"utf_32_le",
"utf_16",
"utf_16_be",
"utf_16_le",
"utf_7",
"utf_8",
"utf_8_sig",
]
for encoding in encodings:
    try:
        with open(file, encoding=encoding) as f:
            f.read()
        print('Seemingly working encoding: {}'.format(encoding))
    except Exception:
        # this encoding could not decode the file; try the next one
        pass
where file is again the filename of your file.
