UnicodeDecodeError: 'utf8' codec can't decode byte 0xea [duplicate] - python

I have a CSV file that I'm uploading via an HTML form to a Python API
The API looks like this:
@app.route('/add_candidates_to_db', methods=['GET','POST'])
def add_candidates():
    file = request.files['csv_file']
    x = io.StringIO(file.read().decode('UTF8'), newline=None)
    csv_input = csv.reader(x)
    for row in csv_input:
        print(row)
I found the part of the file that causes the issue: it contains the character Í.
I get this error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xea in position 1317: invalid continuation byte
I thought I was decoding it with .decode('UTF8') or is the error happening before that with file.read()?
How do I fix this?
Edit: I have control of the file. I am creating the CSV file myself by pulling data (sometimes this data has strange characters).
On the server side, I'm reading each row in the file and inserting it into a database.

Your data is not UTF-8, it contains errors. You say that you are generating the data, so the ideal solution is to generate better data.
Unfortunately, sometimes we are unable to get high-quality data, or we have servers that give us garbage and we have to sort it out. For these situations, we can use less strict error handling when decoding text.
Instead of:
file.read().decode('UTF8')
You can use:
file.read().decode('UTF8', 'replace')
This will make it so that any “garbage” characters (anything which is not correctly encoded as UTF-8) will get replaced with U+FFFD, which looks like this:
�
You say that your file has the Í character, but you are probably viewing the file using an encoding other than UTF-8. Is your file supposed to contain Í, or is it just mojibake? Maybe you can figure out what the character is supposed to be, and from that, you can figure out what encoding your data uses if it's not UTF-8.
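As a minimal sketch of the two ideas above, the following decodes some sample bytes with the 'replace' handler and then checks what the offending byte 0xEA means under a few candidate encodings. The sample bytes and the shortlist of encodings are assumptions made for illustration; they are not taken from the asker's file.

```python
import csv
import io

# Hypothetical upload contents: one CSV row containing the stray byte 0xEA.
raw = b"id,name\n1,Mar\xeaa\n"

# 'replace' swaps each undecodable byte for U+FFFD instead of raising.
text = raw.decode("utf-8", errors="replace")
rows = list(csv.reader(io.StringIO(text, newline=None)))

# Look at what 0xEA stands for in a few single-byte encodings
# (an illustrative shortlist, not an exhaustive one).
candidates = {
    enc: b"\xea".decode(enc) for enc in ("latin_1", "cp1252", "mac_roman")
}

for enc, char in candidates.items():
    print(enc, repr(char))
```

Here 0xEA decodes to ê under Latin-1 and cp1252 but to Í under Mac Roman, which is one way to work backwards from the character you expected to the encoding that may actually have been used.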

It seems that your file is not encoded in UTF-8. You can try reading the file with all the encodings that Python understands and check which ones let you read the entire content of the file. Try this script:
from codecs import open
encodings = [
"ascii",
"big5",
"big5hkscs",
"cp037",
"cp424",
"cp437",
"cp500",
"cp720",
"cp737",
"cp775",
"cp850",
"cp852",
"cp855",
"cp856",
"cp857",
"cp858",
"cp860",
"cp861",
"cp862",
"cp863",
"cp864",
"cp865",
"cp866",
"cp869",
"cp874",
"cp875",
"cp932",
"cp949",
"cp950",
"cp1006",
"cp1026",
"cp1140",
"cp1250",
"cp1251",
"cp1252",
"cp1253",
"cp1254",
"cp1255",
"cp1256",
"cp1257",
"cp1258",
"euc_jp",
"euc_jis_2004",
"euc_jisx0213",
"euc_kr",
"gb2312",
"gbk",
"gb18030",
"hz",
"iso2022_jp",
"iso2022_jp_1",
"iso2022_jp_2",
"iso2022_jp_2004",
"iso2022_jp_3",
"iso2022_jp_ext",
"iso2022_kr",
"latin_1",
"iso8859_2",
"iso8859_3",
"iso8859_4",
"iso8859_5",
"iso8859_6",
"iso8859_7",
"iso8859_8",
"iso8859_9",
"iso8859_10",
"iso8859_13",
"iso8859_14",
"iso8859_15",
"iso8859_16",
"johab",
"koi8_r",
"koi8_u",
"mac_cyrillic",
"mac_greek",
"mac_iceland",
"mac_latin2",
"mac_roman",
"mac_turkish",
"ptcp154",
"shift_jis",
"shift_jis_2004",
"shift_jisx0213",
"utf_32",
"utf_32_be",
"utf_32_le",
"utf_16",
"utf_16_be",
"utf_16_le",
"utf_7",
"utf_8",
"utf_8_sig",
]
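for encoding in encodings:
    try:
        with open(file, encoding=encoding) as f:
            f.read()
        print('Seemingly working encoding: {}'.format(encoding))
    except (UnicodeError, LookupError):
        # decoding failed, or this codec is not available
        pass
where file is the filename of your file. Note that some single-byte encodings, such as latin_1, can decode any byte sequence, so a successful read narrows down the candidates but does not prove that an encoding is the right one.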
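Because several encodings will usually "work", it can help to also check that a character you know should be in the file actually shows up. A minimal sketch, assuming you know one expected character; the helper name encodings_yielding, the sample bytes and the shortlist are all hypothetical:

```python
def encodings_yielding(data, expected, encodings):
    """Return the encodings under which `data` decodes and contains `expected`."""
    matches = []
    for enc in encodings:
        try:
            text = data.decode(enc)
        except (UnicodeError, LookupError):
            continue  # not decodable with this codec, or codec unavailable
        if expected in text:
            matches.append(enc)
    return matches


# Hypothetical sample: the bytes around the problem spot in the file.
sample = b"Mar\xeaa"
shortlist = ["utf_8", "latin_1", "cp1252", "mac_roman"]
matches = encodings_yielding(sample, "\u00cd", shortlist)  # "\u00cd" is Í
print(matches)
```

Only encodings that both decode cleanly and produce the expected character survive the filter, which is a much stronger signal than a read that merely does not raise.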

Related

I keep getting the "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte" error when I write simple code [duplicate]

https://github.com/affinelayer/pix2pix-tensorflow/tree/master/tools
An error occurred when running "process.py" from the above repository.
python tools/process.py --input_dir data --operation resize --output_dir data2/resize
data/0.jpg -> data2/resize/0.png
Traceback (most recent call last):
  File "tools/process.py", line 235, in <module>
    main()
  File "tools/process.py", line 167, in main
    src = load(src_path)
  File "tools/process.py", line 113, in load
    contents = open(path).read()
  File "/home/user/anaconda3/envs/tensorflow_2/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
What is the cause of the error?
Python's version is 3.5.2.
Python tries to convert a bytes object (which it assumes holds a UTF-8-encoded string) to a Unicode string (str). This is a decoding step that follows the UTF-8 rules. While doing so, it encounters a byte that is not allowed in UTF-8-encoded data (namely 0xff at position 0).
Since you did not provide any code we could look at, we can only guess at the rest.
From the stack trace we can assume that the triggering action was reading from a file (contents = open(path).read()). I suggest rewriting it like this:
with open(path, 'rb') as f:
    contents = f.read()
That b in the mode specifier in the open() states that the file shall be treated as binary, so contents will remain a bytes. No decoding attempt will happen this way.
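The file in the traceback is data/0.jpg, and JPEG data starts with the bytes FF D8 FF, which fits the "0xff in position 0" message. A small sketch of checking the leading bytes before deciding how to treat a file; the helper name looks_like_jpeg and the hard-coded header bytes are illustrative assumptions:

```python
def looks_like_jpeg(data):
    # JPEG streams begin with the marker bytes FF D8 FF.
    return data[:3] == b"\xff\xd8\xff"


# First bytes of a typical JPEG, hard-coded here instead of read from disk.
header = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"
is_jpeg = looks_like_jpeg(header)

# Decoding the same bytes as UTF-8 fails on the very first byte.
try:
    header.decode("utf-8")
    decoded_ok = True
    failing_position = None
except UnicodeDecodeError as exc:
    decoded_ok = False
    failing_position = exc.start

print(is_jpeg, decoded_ok, failing_position)
```

In real code you would read the first few bytes with open(path, 'rb') and only decode the contents when the file is known to be text.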
This solution strips out (ignores) the undecodable bytes and returns the string without them. Only use it if you want to drop them rather than convert them.
with open(path, encoding="utf8", errors='ignore') as f:
    contents = f.read()
With errors='ignore' you just lose some characters. In my case I did not care about them, as they seemed to be stray bytes caused by badly written clients connecting to my socket server, so this was an easy, direct solution.
Use encoding format ISO-8859-1 to solve the issue.
I had an issue similar to this and ended up using UTF-16 to decode. My code is below.
with open(path_to_file, 'rb') as f:
    contents = f.read()
# decode first, then strip: calling rstrip("\n") on bytes raises TypeError in Python 3
contents = contents.decode("utf-16").rstrip("\n")
contents = contents.split("\r\n")
This reads the file contents as raw bytes, decodes them as UTF-16, and splits the resulting text into lines.
I came across this thread while suffering from the same error. After doing some research I can confirm that this error can happen when you try to decode a UTF-16 file as UTF-8.
With UTF-16 the first character (2 bytes in UTF-16) is typically a Byte Order Mark (BOM), which is used as a decoding hint and doesn't appear as a character in the decoded string. This means the first byte will be either FE or FF, and the second byte will be the other.
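A minimal sketch of checking for a BOM before picking a codec, using the BOM constants from the standard codecs module. The helper name sniff_bom is made up for this example, and data without a BOM simply yields None:

```python
import codecs


def sniff_bom(data):
    """Return a codec name suggested by a leading BOM, or None if there is none."""
    # UTF-32 is checked before UTF-16 because the UTF-32-LE BOM (FF FE 00 00)
    # starts with the UTF-16-LE BOM (FF FE).
    boms = [
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    return None


# Encoding with plain "utf-16" makes Python prepend a BOM.
encoded = "hello".encode("utf-16")
guess = sniff_bom(encoded)
print(guess, encoded.decode(guess))
```

The "utf-16", "utf-32" and "utf-8-sig" codecs all consume the BOM while decoding, so the returned name can be passed straight to decode() or open().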
Heavily edited after I found out the real answer
It simply means that one chose the wrong encoding to read the file.
On Mac, use file -I file.txt to find the correct encoding. On Linux, use file -i file.txt.
I had a similar issue with PNG files and tried the solutions above without success.
This one worked for me in Python 3.8:
with open(path, "rb") as f:
Use only
base64.b64decode(a)
instead of
base64.b64decode(a).decode('utf-8')
This is caused by the encoding used when reading the file. By default, Python decodes text files with a platform-dependent encoding, which may not match the encoding the file was written in.
If 'utf-8' does not work, try a different encoding:
with open(path, newline='', encoding='cp1252') as csvfile:
    reader = csv.reader(csvfile)
It should work once you change the encoding here. If cp1252 doesn't work for you either, you can find other encodings in the list of standard encodings in the Python codecs documentation.
Those getting similar errors while loading data frames with pandas can use the following solution:
df = pd.read_csv("File path", encoding='cp1252')
I had this UnicodeDecodeError while trying to read a '.csv' file using pandas.read_csv(). In my case, I could not manage to overcome this issue using other encoder types. But instead of using
pd.read_csv(filename, delimiter=';')
I used:
pd.read_csv(open(filename, 'r'), delimiter=';')
which seems to work fine for me.
Note: in open(), use 'r' instead of 'rb'. With 'rb' the file object yields bytes, which read_csv() then has to decode itself, so the same decoder error occurs. With 'r' the file is decoded by open() using its default encoding, which is platform-dependent (and is often not UTF-8 on Windows); that is probably why this worked in my case. read_csv() then receives str and can parse the data directly.
If you are receiving data from a serial port, make sure you are using the right baud rate (and the other settings): decoding as UTF-8 with the wrong configuration will generate the same error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
To check your serial port configuration on Linux, use: stty -F /dev/ttyUSBX -a
I had a similar issue and searched all over the internet for a solution.
If you have this problem, copy your HTML code into a new HTML file (in the same location, under a different name) and use the normal <meta charset="UTF-8">, and it will work.
Check the path of the file to be read. My code kept on giving me errors until I changed the path name to present working directory. The error was:
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
If you are on a Mac, check whether the directory contains a hidden .DS_Store file. After removing that file, my program worked.
I had a similar problem.
Solved it by:
import io
with io.open(filename, 'r', encoding='utf-8') as fn:
    lines = fn.readlines()
However, I had another problem. Some html files (in my case) were not utf-8, so I received a similar error. When I excluded those html files, everything worked smoothly.
So, apart from fixing the code, also check the files you are reading from; there may indeed be an incompatibility there.
You have to read this file with the latin1 encoding, as it contains some special characters.
The problem here is the encoding. When Python can't decode the data being read, it raises an error.
You can use latin1 or other encodings.
Try and test to find the right one for your dataset.
I had the same issue when processing a file generated on Linux. It turned out to be related to files containing question marks.
The following code worked in my case:
df = pd.read_csv(filename,sep = '\t', encoding='cp1252')
If possible, open the file in a text editor and try to change the encoding to UTF-8. Otherwise do it programmatically at the OS level.
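If you would rather do the conversion in Python than with an OS tool, a sketch like the following re-encodes a file as UTF-8. It assumes you already know the source encoding (cp1252 here, purely as an example); the function name transcode_to_utf8 and the temporary file names are made up for the demonstration:

```python
import os
import tempfile


def transcode_to_utf8(src_path, dst_path, src_encoding):
    """Copy a text file, decoding with src_encoding and writing UTF-8."""
    # newline="" keeps the original line endings untouched on both sides.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "legacy.csv")
    dst = os.path.join(tmp, "utf8.csv")
    # Create a small cp1252-encoded file to convert.
    with open(src, "wb") as f:
        f.write("name\nJos\u00e9\n".encode("cp1252"))
    transcode_to_utf8(src, dst, "cp1252")
    with open(dst, "rb") as f:
        converted = f.read()

print(converted)
```

If the source encoding is wrong, the decode step raises UnicodeDecodeError or silently produces mojibake, so it is worth spot-checking the output.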

'utf-8' codec can't decode byte - Python

My Django application is working with both .txt and .doc filetypes. And this application opens a file, compares it with other files in db and prints out some report.
Now the problem is that, when file type is .txt, I get 'utf-8' codec can't decode byte error (here I'm using encoding='utf-8'). When I switch encoding='utf-8' to encoding='ISO-8859-1' error changes to 'latin-1' codec can't decode byte.
I want to find such encoding format that works with every type of a file. This is a small part of my function:
views.py:
@login_required(login_url='sign_in')
def result(request):
    last_uploaded = OriginalDocument.objects.latest('id')
    original = open(str(last_uploaded.document), 'r', encoding='utf-8')
    original_words = original.read().lower().split()
    words_count = len(original_words)
    open_original = open(str(last_uploaded.document), "r")
    read_original = open_original.read()
    report_fives = open("static/report_documents/" + str(last_uploaded.student_name) +
                        "-" + str(last_uploaded.document_title) + "-5.txt", 'w')
    # Path to the documents with which original doc is comparing
    path = 'static/other_documents/doc*.txt'
    files = glob.glob(path)
    rows, found_count, fives_count, rounded_percentage_five, percentage_for_chart_five, fives_for_report, founded_docs_for_report = search_by_five(last_uploaded, 5, original_words, report_fives, files)
    context = {
        ...
    }
    return render(request, 'result.html', context)
There is no universal encoding that can decode a file regardless of the encoding it was written in.
UTF-8 is a good default and is compatible with ASCII. You can, for example, simply ignore or replace characters which aren't decodable like this:
from codecs import open
original = open(str(last_uploaded.document), encoding="utf-8", errors="ignore")
original_words = original.read().lower().split()
...
original.close()
Or even better, use a context manager (with statement), which closes the file for you:
with open(str(last_uploaded.document), encoding="utf-8", errors="ignore") as fr:
    original_words = fr.read().lower().split()
    ...
(Note: You do not need to use the codecs library if you're using Python 3, but you have tagged your question with python-2.7.)
You can see advantages and disadvantages of using different error handlers here and here. Be aware that not specifying an error handler defaults to errors="strict", which you probably do not want. Other options are nearly self-explanatory, e.g.:
using errors="replace" will replace each undecodable byte sequence with a suitable replacement marker
using errors="ignore" will simply skip the undecodable bytes and continue reading the file data.
What you should use depends on your needs and usecase(s).
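To make the difference between the handlers concrete, here is a small sketch that decodes the same invalid bytes with each of them. The sample bytes are an assumption chosen for illustration (the word "café" encoded as Latin-1, which is not valid UTF-8):

```python
# "café au lait" encoded as Latin-1: the byte 0xE9 is invalid in UTF-8 here.
data = b"caf\xe9 au lait"

results = {
    handler: data.decode("utf-8", errors=handler)
    for handler in ("ignore", "replace", "backslashreplace")
}

# errors="strict" is the default and raises instead of returning text.
try:
    data.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

for handler, text in results.items():
    print(handler, repr(text))
```

With "ignore" the bad byte disappears without a trace, "replace" leaves a U+FFFD marker, and "backslashreplace" keeps the original byte value visible as an escape, which is handy when you need to investigate later.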
You say that you have encoding problems not only with plain text files, but also with proprietary .doc files:
The .doc format is not a plain text format which you can simply read with open() or codecs.open(), since much of the information is stored in binary form; see this site for more information. So you need a special reader for .doc files to get the text out of them. Which library to use depends on your Python version and maybe also on your operating system. Maybe here is a good starting point for you.
Unfortunately, using a library does not completely protect you from encoding errors. (Maybe it does, but I'm not sure whether the encoding is saved in the file itself, as it is in a .docx file.) You may also be able to figure out the encoding of the file. How you can handle encoding errors likely depends on the library itself.
So I guess that you are trying to open .doc files as simple text files. Then you will get decoding errors, because the content is not saved as human-readable text. And even if you get rid of the error, you will only see non-human-readable text (I created a simple text file with LibreOffice in .doc format (Microsoft Word 1997-2003)):
In [1]: open("./test.doc", "r").read()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte
In [2]: open("./test.doc", "r", errors="replace").read() # or open("./test.doc", "rb").read()
'��\x11\u0871\x1a�\x00\x00\x00' ...

Getting UnicodeDecodeError while reading excel in Tornado,Python

I'm using Postman to send an Excel file, which I am reading in Tornado.
Tornado code:
self.request.files['1'][0]['body'].decode()
If I send a .csv file, the above code works.
If I send an .xlsx file, I am stuck with this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 10: invalid start byte
request.files fetches the file, but its type is bytes. To convert bytes to str I used decode(), which works only for .csv and not for .xlsx.
I tried decode('utf-8') but still no luck.
I've tried searching but didn't find any issue mentioning the 0x87 problem.
The reason is that the .xlsx file has a different encoding, not utf-8. You'll need to use the original encoding to decode the file.
There's no guaranteed way of finding out the encoding of a file programmatically. I'm guessing you're making this application for general users and so you will keep encountering files with different and unexpected encodings.
A good way to deal with this is by trying to decode using multiple encodings, in case one fails. Example:
encodings = ['utf-8', 'iso-8859-1', 'windows-1251', 'windows-1252']
for encoding in encodings:
    try:
        decoded_file = self.request.files['1'][0]['body'].decode(encoding)
    except UnicodeDecodeError:
        # this will run when the current encoding fails
        # just ignore the error and try the next one
        pass
    else:
        # this will run when an encoding passes
        # break the loop
        # it is also a good idea to re-encode the
        # decoded files to utf-8 for your purpose
        decoded_file = decoded_file.encode("utf8")
        break
else:
    # this will run when the for loop ends
    # without successfully decoding the file
    # now you can return an error message
    # to the user asking them to change
    # the file encoding and re upload
    self.write("Error: Unidentified file encoding. Re-upload with UTF-8 encoding")
    return
# when the program reaches here, it means
# you have successfully decoded the file
# and you can access it from `decoded_file` variable
Here's a list of some common encodings: What is the most common encoding of each language?
I faced the same issue and this worked for me.
import io
df = pd.read_excel(io.BytesIO(self.request.files['1'][0]['body']))
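Passing the raw bytes works because an .xlsx file is a ZIP archive rather than encoded text, so it should never be run through decode(). A minimal sketch of telling the two kinds of upload apart with the standard zipfile module; the helper name looks_like_xlsx is hypothetical, the check for an xl/workbook.xml entry is an assumption about how such workbooks are normally laid out, and the "upload" is a tiny stand-in archive built in memory:

```python
import io
import zipfile


def looks_like_xlsx(body):
    """Heuristic: is `body` a ZIP archive containing a workbook part?"""
    buffer = io.BytesIO(body)
    if not zipfile.is_zipfile(buffer):
        return False
    with zipfile.ZipFile(buffer) as archive:
        # Assumed layout: the main workbook part lives at xl/workbook.xml.
        return "xl/workbook.xml" in archive.namelist()


# Build a tiny stand-in archive in memory instead of using a real upload.
fake = io.BytesIO()
with zipfile.ZipFile(fake, "w") as archive:
    archive.writestr("xl/workbook.xml", "<workbook/>")
upload = fake.getvalue()

print(looks_like_xlsx(upload), looks_like_xlsx(b"a,b\n1,2\n"))
```

A handler could use a check like this to route ZIP-based uploads to a spreadsheet reader and only call decode() on uploads that are plain text such as CSV.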
Try this one, following the suggestions provided here:
self.request.files['1'][0]['body'].decode('iso-8859-1').encode('utf-8')

Ignore UnicodeEncodeError when saving utf8 file

I have a large string of a novel that I downloaded from Project Gutenberg. I am trying to save it to my computer, but I'm getting a UnicodeEncodeError and I don't know how to fix or ignore it.
from urllib import request
# Get the text
response = request.urlopen('http://www.gutenberg.org/files/2701/2701-0.txt')
# Decode it using utf8
raw = response.read().decode('utf8')
# Save the file
file = open('corpora/canon_texts/' + 'test', 'w')
file.write(raw)
file.close()
This gives me the following error:
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>
First, I tried to remove the BOM at the beginning of the file:
# We have to get rid of the pesky Byte Order Mark before we save it
raw = raw.replace(u'\ufeff', '')
but I get the same error, just with a different position number:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 7863-7864: character maps to <undefined>
If I look in that area I can't find the offending characters, so I don't know what to remove:
raw[7850:7900]
just prints out:
' BALLENA, Spanish.\r\n PEKEE-NUEE-'
which doesn't look like it would be a problem.
So then I tried to skip the bad lines with a try statement:
file = open('corpora/canon_texts/' + 'test', 'w')
try:
    file.write(raw)
except UnicodeEncodeError:
    pass
file.close()
but this skips the entire text, giving me a file of 0 size.
How can I fix this?
EDIT:
A couple people have noted that '\ufeff' is utf16. I tried switching to utf16:
# Get the text
response = request.urlopen('http://www.gutenberg.org/files/2701/2701-0.txt')
# Decode it using utf16
raw = response.read().decode('utf-16')
But I can't even download the data before I get this error:
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x0a in position 1276798: truncated data
SECOND EDIT:
I also tried decoding with utf-8-sig as suggested in u'\ufeff' in Python string because that includes BOM, but then I'm back to this error when I try to save it:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 7863-7864: character maps to <undefined>
Decoding and re-encoding a file just to save it to disk is pointless. Just write out the bytes you have downloaded, and you will have the file on disk:
raw = response.read()
with open('corpora/canon_texts/' + 'test', 'wb') as outfile:
    outfile.write(raw)
This is the only reliable way to write to disk exactly what you downloaded.
Sooner or later you'll want to read in the file and work with it, so let's consider your error. You didn't provide a full stack trace (always a bad idea), but your error is during encoding, not decoding. The decoding step succeeded. The error must be arising on the line file.write(raw), which is where the text gets encoded for saving. But to what encoding is it being converted? Nobody knows, because you opened file without specifying an encoding! The encoding you're getting depends on your location, OS, and probably the tides and weather forecast. In short: Specify the encoding.
text = response.read().decode('utf8')
with open('corpora/canon_texts/' + 'test', 'w', encoding="utf-8") as outfile:
    outfile.write(text)
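The 'charmap' codec in the error message points to a legacy Windows code page being used for the write. As a sketch of why that fails while UTF-8 succeeds, the following assumes cp1252 as the platform default (an assumption; the actual default depends on the system) and uses a short stand-in string instead of the downloaded novel:

```python
import os
import tempfile

# Stand-in for the downloaded text, including the leading U+FEFF character.
text = "\ufeffCall me Ishmael."

# cp1252 has no mapping for U+FEFF, so encoding with it raises.
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Writing with an explicit UTF-8 encoding works for any str.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.txt")
    with open(path, "w", encoding="utf-8") as outfile:
        outfile.write(text)
    with open(path, encoding="utf-8") as infile:
        round_trip = infile.read()

print(cp1252_ok, round_trip == text)
```

UTF-8 can represent every Unicode code point, so an explicit encoding="utf-8" on the output file removes the dependency on the platform default.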
U+FEFF is for UTF-16. Try that instead.
.decode(encoding="utf-8", errors="strict") offers error handling as a built-in feature:
The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace' and any other name registered via codecs.register_error(), see section Error Handlers.
Probably the safest option is
decode("utf8", errors='backslashreplace')
which will escape encoding errors with a backslash, so you have a record of what failed to decode.
Conveniently, your Moby Dick text contains no backslashes, so it will be quite easy to check what characters are failing to decode.
What is strange about this text is that the website says it is in utf-8, but \ufeff is the BOM for utf-16. Decoding as utf-16, it looks like you're just having trouble with the very last byte, 0x0a (which is a utf-8 line ending), which can probably safely be dropped with
decode("utf-16", errors='ignore')
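It may help to know that U+FEFF can also appear at the start of UTF-8 data, where it is encoded as the three bytes EF BB BF. A short sketch of how the utf-8 and utf-8-sig codecs treat such data; the payload is a made-up stand-in for the downloaded bytes:

```python
import codecs

# Stand-in for the downloaded bytes: a UTF-8 BOM followed by UTF-8 text.
payload = codecs.BOM_UTF8 + "MOBY-DICK".encode("utf-8")

with_bom = payload.decode("utf-8")         # keeps U+FEFF as the first character
without_bom = payload.decode("utf-8-sig")  # strips a leading BOM if present

print(repr(with_bom), repr(without_bom))
```

So seeing '\ufeff' at position 0 after decoding as UTF-8 does not by itself mean the data is UTF-16; decoding with utf-8-sig removes the mark and leaves the rest of the text unchanged.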

What kind of Encoding does a standard midi file use?

Here's what brought this question up:
with open(path + "/OneChance1.mid") as f:
    for line in f.readline():
        print(line)
Here I am simply trying to read a midi file to scour its contents. I then receive this error message: UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 153: character maps to <undefined>
If I use open()'s encoding parameter like so: with open(path + "/OneChance1.mid", encoding='utf-8') as f: then I receive this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 13: invalid start byte
If I change the encoding param to ascii I get another error about an ordinal being out of range. Lastly I tried utf-16 and it said that the file didn't start with BOM (which made me smile for some reason). Also, if I ignore the errors I get characters that resemble nothing of the kind of data I am expecting. My expectations are based on this source: http://www.sonicspot.com/guide/midifiles.html
Anyway, does anyone know what kind of encoding a midi file uses? My research is coming up short in that regard so I thought it would be worth asking on SO. Or maybe someone can point out some other possibilities or blunders?
MIDI files are binary content. By opening the file as a text file however, Python applies the default system encoding in trying to decode the text as Unicode.
Open the file in binary mode instead:
with open(midifile, 'rb') as mfile:
    leader = mfile.read(4)
    if leader != b'MThd':
        raise ValueError('Not a MIDI file!')
You'd have to study the MIDI standard file format if you wanted to learn more from the file. Also see What is the structure of a MIDI file?
It's a binary file, it's not text using a text encoding like you seem to expect.
To open a file in binary mode in Python, pass a string containing "b" as the second argument to open().
This page contains a description of the format.
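As a rough sketch of what reading the binary content looks like in practice, the header chunk of a Standard MIDI File can be unpacked with the struct module: the tag MThd, a 4-byte big-endian length (6), then three 16-bit big-endian fields for format, track count and time division. The function name parse_midi_header and the sample bytes are invented for this example; a real file would be read with open(..., 'rb'):

```python
import struct


def parse_midi_header(data):
    """Parse the 14-byte MThd header chunk of a Standard MIDI File."""
    if len(data) < 14 or data[:4] != b"MThd":
        raise ValueError("Not a MIDI file")
    # ">IHHH": big-endian 32-bit chunk length, then three 16-bit fields.
    length, fmt, ntracks, division = struct.unpack(">IHHH", data[4:14])
    return {
        "length": length,
        "format": fmt,
        "tracks": ntracks,
        "division": division,
    }


# Hand-built header: format 1, 2 tracks, 480 ticks per quarter note.
sample = b"MThd" + struct.pack(">IHHH", 6, 1, 2, 480)
header = parse_midi_header(sample)
print(header)
```

The track chunks that follow the header are binary as well, so none of this data should go through a text codec.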
