Here's what brought this question up:
with open(path + "/OneChance1.mid") as f:
for line in f.readline():
print(line)
Here I am simply trying to read a midi file to scour its contents. I then receive this error message: UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 153: character maps to <undefined>
If I use open()'s second param like so: with open(path + "/OneChance1.mid"m encoding='utf-8) as f: then I receive this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 13: invalid start byte
If I change the encoding param to ascii I get another error about an ordinal being out of range. Lastly I tried utf-16 and it said that the file didn't start with BOM (which made me smile for some reason). Also, if I ignore the errors I get characters that resemble nothing of the kind of data I am expecting. My expectations are based on this source: http://www.sonicspot.com/guide/midifiles.html
Anyway, does anyone know what kind of encoding a midi file uses? My research is coming up short in that regard so I thought it would be worth asking on SO. Or maybe someone can point out some other possibilities or blunders?
MIDI files are binary content. By opening the file as a text file however, Python applies the default system encoding in trying to decode the text as Unicode.
Open the file in binary mode instead:
with open(midifile, 'rb') as mfile:
leader = mfile.read(4)
if leader != b'MThd':
raise ValueError('Not a MIDI file!')
You'd have to study the MIDI standard file format if you wanted to learn more from the file. Also see What is the structure of a MIDI file?
It's a binary file, it's not text using a text encoding like you seem to expect.
To open a file in binary mode in Python, pass a string containing "b" as the second argument to open().
This page contains a description of the format.
Related
https://github.com/affinelayer/pix2pix-tensorflow/tree/master/tools
An error occurred when compiling "process.py" on the above site.
python tools/process.py --input_dir data -- operation resize --outp
ut_dir data2/resize
data/0.jpg -> data2/resize/0.png
Traceback (most recent call last):
File "tools/process.py", line 235, in <module>
main()
File "tools/process.py", line 167, in main
src = load(src_path)
File "tools/process.py", line 113, in load
contents = open(path).read()
File"/home/user/anaconda3/envs/tensorflow_2/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
What is the cause of the error?
Python's version is 3.5.2.
Python tries to convert a byte-array (a bytes which it assumes to be a utf-8-encoded string) to a unicode string (str). This process of course is a decoding according to utf-8 rules. When it tries this, it encounters a byte sequence which is not allowed in utf-8-encoded strings (namely this 0xff at position 0).
Since you did not provide any code we could look at, we only could guess on the rest.
From the stack trace we can assume that the triggering action was the reading from a file (contents = open(path).read()). I propose to recode this in a fashion like this:
with open(path, 'rb') as f:
contents = f.read()
That b in the mode specifier in the open() states that the file shall be treated as binary, so contents will remain a bytes. No decoding attempt will happen this way.
Use this solution it will strip out (ignore) the characters and return the string without them. Only use this if your need is to strip them not convert them.
with open(path, encoding="utf8", errors='ignore') as f:
Using errors='ignore'
You'll just lose some characters. but if your don't care about them as they seem to be extra characters originating from a the bad formatting and programming of the clients connecting to my socket server.
Then its a easy direct solution.
reference
Use encoding format ISO-8859-1 to solve the issue.
Had an issue similar to this, Ended up using UTF-16 to decode. my code is below.
with open(path_to_file,'rb') as f:
contents = f.read()
contents = contents.rstrip("\n").decode("utf-16")
contents = contents.split("\r\n")
this would take the file contents as an import, but it would return the code in UTF format. from there it would be decoded and seperated by lines.
I've come across this thread when suffering the same error, after doing some research I can confirm, this is an error that happens when you try to decode a UTF-16 file with UTF-8.
With UTF-16 the first characther (2 bytes in UTF-16) is a Byte Order Mark (BOM), which is used as a decoding hint and doesn't appear as a character in the decoded string. This means the first byte will be either FE or FF and the second, the other.
Heavily edited after I found out the real answer
It simply means that one chose the wrong encoding to read the file.
On Mac, use file -I file.txt to find the correct encoding. On Linux, use file -i file.txt.
I had a similar issue with PNG files. and I tried the solutions above without success.
this one worked for me in python 3.8
with open(path, "rb") as f:
use only
base64.b64decode(a)
instead of
base64.b64decode(a).decode('utf-8')
This is due to the different encoding method when read the file. In python, it defaultly
encode the data with unicode. However, it may not works in various platforms.
I propose an encoding method which can help you solve this if 'utf-8' not works.
with open(path, newline='', encoding='cp1252') as csvfile:
reader = csv.reader(csvfile)
It should works if you change the encoding method here. Also, you can find other encoding method here standard-encodings , if above doesn't work for you.
Those getting similar errors while handling Pandas for data frames use the following solution.
example solution.
df = pd.read_csv("File path", encoding='cp1252')
I had this UnicodeDecodeError while trying to read a '.csv' file using pandas.read_csv(). In my case, I could not manage to overcome this issue using other encoder types. But instead of using
pd.read_csv(filename, delimiter=';')
I used:
pd.read_csv(open(filename, 'r'), delimiter=';')
which just seems working fine for me.
Note that: In open() function, use 'r' instead of 'rb'. Because 'rb' returns bytes object that causes to happen this decoder error in the first place, that is the same problem in the read_csv(). But 'r' returns str which is needed since our data is in .csv, and using the default encoding='utf-8' parameter, we can easily parse the data using read_csv() function.
if you are receiving data from a serial port, make sure you are using the right baudrate (and the other configs ) : decoding using (utf-8) but the wrong config will generate the same error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
to check your serial port config on linux use : stty -F /dev/ttyUSBX -a
I had a similar issue and searched all the internet for this problem
if you have this problem just copy your HTML code in a new HTML file and use the normal <meta charset="UTF-8">
and it will work....
just create a new HTML file in the same location and use a different name
Check the path of the file to be read. My code kept on giving me errors until I changed the path name to present working directory. The error was:
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
If you are on a mac check if you for a hidden file, .DS_Store. After removing the file my program worked.
I had a similar problem.
Solved it by:
import io
with io.open(filename, 'r', encoding='utf-8') as fn:
lines = fn.readlines()
However, I had another problem. Some html files (in my case) were not utf-8, so I received a similar error. When I excluded those html files, everything worked smoothly.
So, except from fixing the code, check also the files you are reading from, maybe there is an incompatibility there indeed.
You have to use the encoding as latin1 to read this file as there are some special character in this file, use the below code snippet to read the file.
The problem here is the encoding type. When Python can't convert the data to be read, it gives an error.
You can you latin1 or other encoding values.
I say try and test to find the right one for your dataset.
I have the same issue when processing a file generated from Linux. It turns out it was related with files containing question marks..
Following code worked in my case:
df = pd.read_csv(filename,sep = '\t', encoding='cp1252')
If possible, open the file in a text editor and try to change the encoding to UTF-8. Otherwise do it programatically at the OS level.
Trying to read MS Excel file, version 2016. File contains several lists with data. File downloaded from DataBase and it can be opened in MS Office correctly. In example below I changed the file name.
EDIT: file contains russian and english words. Most probably used the Latin-1 encoding, but encoding='latin-1' does not help
import pandas as pd
with open('1.xlsx', 'r', encoding='utf8') as f:
data = pd.read_excel(f)
Result:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa8 in position 14: invalid start byte
Without encoding ='utf8'
'charmap' codec can't decode byte 0x9d in position 622: character maps to <undefined>
P.S. Task is to process 52 files, to merge data in every sheet with corresponded sheets in the 52 files. So, please no handle work advices.
The problem is that the original requester is calling read_excel with a filehandle as the first argument. As demonstrated by the last responder, the first argument should be a string containing the filename.
I ran into this same error using:
df = pd.read_excel(open("file.xlsx",'r'))
but correct is:
df = pd.read_excel("file.xlsx")
Most probably you're using Python3. In Python2 this wouldn't happen.
xlsx files are binary (actually they're an xml, but it's compressed), so you need to open them in binary mode. Use this call to open:
open('1.xlsx', 'rb')
There's no full traceback, but I imagine the UnicodeDecodeError comes from the file object, not from read_excel(). That happens because the stream of bytes can contain anything, but we don't want decoding to happen too soon; read_excel() must receive raw bytes and be able to process them.
Most probably the problem is in Russian symbols.
Charmap is default decoding method used in case no encoding is beeing noticed.
As I see if utf-8 and latin-1 do not help then try to read this file not as
pd.read_excel(f)
but
pd.read_table(f)
or even just
f.readline()
in order to check what is a symbol raise an exeception and delete this symbol/symbols.
Panda support encoding feature to read your excel
In your case you can use:
df=pd.read_excel('your_file.xlsx',encoding='utf-8')
or if you want in more of system specific without any surpise you can use:
df=pd.read_excel('your_file.xlsx',encoding='sys.getfilesystemencoding()')
I have a large string of a novel that I downloaded from Project Gutenberg. I am trying to save it to my computer, but I'm getting a UnicodeEncodeError and I don't know how to fix or ignore it.
from urllib import request
# Get the text
response = request.urlopen('http://www.gutenberg.org/files/2701/2701-0.txt')
# Decode it using utf8
raw = response.read().decode('utf8')
# Save the file
file = open('corpora/canon_texts/' + 'test', 'w')
file.write(raw)
file.close()
This gives me the following error:
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>
First, I tried to remove the BOM at the beginning of the file:
# We have to get rid of the pesky Byte Order Mark before we save it
raw = raw.replace(u'\ufeff', '')
but I get the same error, just with a different position number:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 7863-7864: character maps to <undefined>
If I look in that area I can't find the offending characters, so I don't know what to remove:
raw[7850:7900]
just prints out:
' BALLENA, Spanish.\r\n PEKEE-NUEE-'
which doesn't look like it would be a problem.
So then I tried to skip the bad lines with a try statement:
file = open('corpora/canon_texts/' + 'test', 'w')
try:
file.write(raw)
except UnicodeEncodeError:
pass
file.close()
but this skips the entire text, giving me a file of 0 size.
How can I fix this?
EDIT:
A couple people have noted that '\ufeff' is utf16. I tried switching to utf16:
# Get the text
response = request.urlopen('http://www.gutenberg.org/files/2701/2701-0.txt')
# Decode it using utf16
raw = response.read().decode('utf-16')
But I can't even download the data before I get this error:
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x0a in position 1276798: truncated data
SECOND EDIT:
I also tried decoding with utf-8-sig as suggested in u'\ufeff' in Python string because that includes BOM, but then I'm back to this error when I try to save it:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 7863-7864: character maps to <undefined>
Decoding and re-encoding a file just to save it to disk is pointless. Just write out the bytes you have downloaded, and you will have the file on disk:
raw = response.read()
with open('corpora/canon_texts/' + 'test', 'wb') as outfile:
outfile.write(raw)
This is the only reliable way to write to disk exactly what you downloaded.
Sooner or later you'll want to read in the file and work with it, so let's consider your error. You didn't provide a full stack trace (always a bad idea), but your error is during encoding, not decoding. The decoding step succeeded. The error must be arising on the line file.write(raw), which is where the text gets encoded for saving. But to what encoding is it being converted? Nobody knows, because you opened file without specifying an encoding! The encoding you're getting depends on your location, OS, and probably the tides and weather forecast. In short: Specify the encoding.
text = response.read().decode('utf8')
with open('corpora/canon_texts/' + 'test', 'w', encoding="utf-8") as outfile:
outfile.write(text)
U + feff is for UTF-16. Try that instead.
.decode(encoding="utf-8", errors="strict") offers error handling as a built-in feature:
The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace' and any other name registered via codecs.register_error(), see section Error Handlers.
Probably the safest option is
decode("utf8", errors='backslashreplace')
which will escape encoding errors with a backslash, so you have a record of what failed to decode.
Conveniently, your Moby Dick text contains no backslashes, so it will be quite easy to check what characters are failing to decode.
What is strange about this text is the website says it is in utf-8, but \efeff is the BOM for utf-16. Decoding in utf-16, it looks like your just having trouble with the very last character 0x0a (which is a utf-8 line ending), which can probably safely be dropped with
decode("utf-16", errors='ignore')
I'm using python to read a text file with the segment below
(can't post a screenshot since i'm a noob) but this is what is looks like in notepad++:
NULSOHSOHNULNULNULSUBMesssage-ID:
error:
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
print(f.readline())
File "C:\Python32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 7673: character maps to <undefined>
Opening the file as binary:
f = open('file.txt','rb')
f.readline()
gives me the text as binary
b'\x00\x01\x01\x00\x00\x00\x1a\xb7Message-ID:
but how do I get the text as ascii ? And whats the easiest/pythonic way of handling this ?
The problem is with "byte 0x8f in position 7673", not with "byte 0x00 in position 1". I.e., your NUL is not the problem. If you look at the cp-1252 codepage on wikipedia, you can see that 0x8f has no corresponding character.
The larger issue is that your file is not in a single encoding: it appears to be a mix of binary framing of text segments. What you really need to do is figure out the format of this file and parse it into binary pieces (or perhaps some richer data structure, like a tuple, list, dict, object, etc), then decode the text pieces into unicode if you need to process further.
When opening a file in text mode, you can specifically tell which encoding to use:
f = open('file.txt','r',encoding='ascii')
However your real problem is different: the binary piece that you cited can not be read as ASCII, because the byte \xb7 is outside of ASCII range (0-127). The exception traceback tells that Python is using cp1252 codec by default, which cannot decode your file either.
You need either to figure out which encoding the file has, or to handle it as binary all the time.
Perhaps open it in read the correct read mode?
f = open('file.txt','r')
f.readline()
I'm working on a series of parsers where I get a bunch of tracebacks from my unit tests like:
File "c:\Python31\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 112: character maps to <undefined>
The files are opened with open() with no extra arguemnts. Can I pass extra arguments to open() or use something in the codec module to open these differently?
This came up with code that was written in Python 2 and converted to 3 with the 2to3 tool.
UPDATE: it turns out this is a result of feeding a zipfile into the parser. The unit test actually expects this to happen. The parser should recognize it as something that can't be parsed. So, I need to change my exception handling. In the process of doing that now.
Position 0x81 is unassigned in Windows-1252 (aka cp1252). It is assigned to U+0081 HIGH OCTET PRESET (HOP) control character in Latin-1 (aka ISO 8859-1). I can reproduce your error in Python 3.1 like this:
>>> b'\x81'.decode('cp1252')
Traceback (most recent call last):
...
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>
or with an actual file:
>>> open('test.txt', 'wb').write(b'\x81\n')
2
>>> open('test.txt').read()
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 0: unexpected code byte
Now to treat this file as Latin-1 you pass the encoding argument, like codeape suggested:
>>> open('test.txt', encoding='latin-1').read()
'\x81\n'
Beware that there are differences between Windows-1257 and Latin-1 encodings, e.g. Latin-1 doesn't have “smart quotes”. If the file you're processing is a text file, ask yourself what that \x81 is doing in it.
You can relax the error handling.
For instance:
f = open(filename, encoding="...", errors="replace")
Or:
f = open(filename, encoding="...", errors="ignore")
See the docs.
EDIT:
But are you certain that the problem is in reading the file? Could it be that the exception happens when something is written to the console? Check http://wiki.python.org/moin/PrintFails
All files are "not Unicode". Unicode is an internal representation which must be encoded. You need to determine for each file what encoding has been used, and specify that where necessary when the file is opened.
As the traceback and error message indicate, the file in question is NOT encoded in cp1252.
If it is encoded in latin1, the "\x81" that it is complaining about is a C1 control character that doesn't even have a name (in Unicode). Consider latin1 extremely unlikely to be valid.
You say "some of the files are parsed with xml.dom.minidom" -- parsed successfully or unsuccessfully?
A valid XML file should declare its encoding (default is UTF-8) in the first line, and you should not need to specify an encoding in your code. Show us the code that you are using to do the xml.dom.minidom parsing.
"others read directly as iterables" -- sample code please.
Suggestion: try opening some of each type of file in your browser. Then click View and click Character Encoding (Firefox) or Encoding (Internet Explorer). What encoding has the browser guessed [usually reliably]?
Other possible encoding clues: What languages are used in the text in the files? Where did you get the files from?
Note: please edit your question with clarifying information; don't answer in the comments.