Ignore UnicodeEncodeError when saving utf8 file - python

I have a large string of a novel that I downloaded from Project Gutenberg. I am trying to save it to my computer, but I'm getting a UnicodeEncodeError and I don't know how to fix or ignore it.
from urllib import request
# Get the text
response = request.urlopen('http://www.gutenberg.org/files/2701/2701-0.txt')
# Decode it using utf8
raw = response.read().decode('utf8')
# Save the file
file = open('corpora/canon_texts/' + 'test', 'w')
file.write(raw)
file.close()
This gives me the following error:
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>
First, I tried to remove the BOM at the beginning of the file:
# We have to get rid of the pesky Byte Order Mark before we save it
raw = raw.replace(u'\ufeff', '')
but I get the same error, just with a different position number:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 7863-7864: character maps to <undefined>
If I look in that area I can't find the offending characters, so I don't know what to remove:
raw[7850:7900]
just prints out:
' BALLENA, Spanish.\r\n PEKEE-NUEE-'
which doesn't look like it would be a problem.
So then I tried to skip the bad lines with a try statement:
file = open('corpora/canon_texts/' + 'test', 'w')
try:
    file.write(raw)
except UnicodeEncodeError:
    pass
file.close()
but this skips the entire text, giving me a file of 0 size.
How can I fix this?
EDIT:
A couple people have noted that '\ufeff' is utf16. I tried switching to utf16:
# Get the text
response = request.urlopen('http://www.gutenberg.org/files/2701/2701-0.txt')
# Decode it using utf16
raw = response.read().decode('utf-16')
But I can't even download the data before I get this error:
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x0a in position 1276798: truncated data
SECOND EDIT:
I also tried decoding with utf-8-sig as suggested in u'\ufeff' in Python string, since that codec handles the BOM, but then I'm back to this error when I try to save it:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 7863-7864: character maps to <undefined>

Decoding and re-encoding a file just to save it to disk is pointless. Just write out the bytes you have downloaded, and you will have the file on disk:
raw = response.read()
with open('corpora/canon_texts/' + 'test', 'wb') as outfile:
    outfile.write(raw)
This is the only reliable way to write to disk exactly what you downloaded.
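If the file is large, you can also stream it straight to disk without holding the whole novel in memory. A minimal sketch of that variant, using shutil.copyfileobj on the same URL (target path taken from your own code):
import shutil
from urllib import request

response = request.urlopen('http://www.gutenberg.org/files/2701/2701-0.txt')
with open('corpora/canon_texts/test', 'wb') as outfile:
    # copyfileobj copies the raw bytes in chunks and never decodes them
    shutil.copyfileobj(response, outfile)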
Sooner or later you'll want to read in the file and work with it, so let's consider your error. You didn't provide a full stack trace (always a bad idea), but your error is during encoding, not decoding. The decoding step succeeded. The error must be arising on the line file.write(raw), which is where the text gets encoded for saving. But to what encoding is it being converted? Nobody knows, because you opened file without specifying an encoding! The encoding you're getting depends on your location, OS, and probably the tides and weather forecast. In short: Specify the encoding.
text = response.read().decode('utf8')
with open('corpora/canon_texts/' + 'test', 'w', encoding="utf-8") as outfile:
    outfile.write(text)

U+FEFF is the UTF-16 byte order mark. Try decoding as UTF-16 instead.
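If you're not sure which encoding the BOM indicates, one way to check (a sketch, reusing the response object from the question) is to look at the first bytes before decoding:
import codecs
from urllib import request

response = request.urlopen('http://www.gutenberg.org/files/2701/2701-0.txt')
data = response.read()
if data.startswith(codecs.BOM_UTF8):
    text = data.decode('utf-8-sig')    # UTF-8 with a BOM prefix
elif data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
    text = data.decode('utf-16')       # UTF-16; the BOM selects the byte order
else:
    text = data.decode('utf-8')        # plain UTF-8, no BOM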

.decode(encoding="utf-8", errors="strict") offers error handling as a built-in feature:
The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace' and any other name registered via codecs.register_error(), see section Error Handlers.
Probably the safest option is
decode("utf8", errors='backslashreplace')
which will escape undecodable bytes with a backslash sequence, so you have a record of what failed to decode.
Conveniently, your Moby Dick text contains no backslashes, so it will be quite easy to check what characters are failing to decode.
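As a quick illustration (with made-up bytes, not taken from the novel), errors='backslashreplace' keeps the bad byte visible as an escape sequence (Python 3.5+):
>>> b'caf\xe9 whale'.decode('utf8', errors='backslashreplace')
'caf\\xe9 whale'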
What is strange about this text is that the website says it is in utf-8, but \ufeff is the BOM for utf-16. Decoding in utf-16, it looks like you're just having trouble with the very last character 0x0a (which is a utf-8 line ending), which can probably safely be dropped with
decode("utf-16", errors='ignore')

Related

Encoding text with multiple encodings

I am trying to open a txt file in Python and read it using open() and read(). The problem is that some of the text is not UTF-8. Here is the error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 1911885: character maps to <undefined>
How can I read this document?
You might wanna check all the answers in this question as it seems pretty similar to yours: UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>
As said in the site, try:
file = open(filename, encoding="utf8")
Was planning to share this as a comment but I don't have enough reputation for that :)
EDIT: After reading your comment in response to my previous answer, and as Cett suggested, here is an improved version:
Probably the best way to deal with encoding errors is by using the errors argument. As said in your question, only some of the text fails to decode, so this should be fine to use:
file = open(filename, encoding="utf8", errors="ignore")
NOTE: using this argument will make Python ignore any character it cannot decode. So I would recommend this only if you are fine with losing some data.

UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>

I'm working on an application which uses utf-8 encoding. For debugging purposes I need to print the text. If I use print() directly on the variable containing my Unicode string, e.g. print(pred_str), I get this error:
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>
So I tried print(pred_str.encode('utf-8')) and my output looks like this:
b'\xef\xbb\xbfpudgala-dharma-nair\xc4\x81tmyayo\xe1\xb8\xa5 apratipanna-vipratipann\xc4\x81n\xc4\x81m'
b'avipar\xc4\xabta-pudgala-dharma-nair\xc4\x81tmya-pratip\xc4\x81dana-artham'
b'tri\xe1\xb9\x83\xc5\x9bik\xc4\x81-vij\xc3\xb1apti-prakara\xe1\xb9\x87a-\xc4\x81rambha\xe1\xb8\xa5'
b'pudgala-dharma-nair\xc4\x81tmya-pratip\xc4\x81danam punar kle\xc5\x9ba-j\xc3\xb1eya-\xc4\x81vara\xe1\xb9\x87a-prah\xc4\x81\xe1\xb9\x87a-artham'
But, I want my output to look like this:
pudgala-dharma-nairātmyayoḥ apratipanna-vipratipannānām
aviparīta-pudgala-dharma-nairātmya-pratipādana-artham
triṃśikā-vijñapti-prakaraṇa-ārambhaḥ
pudgala-dharma-nairātmya-pratipādanam punar kleśa-jñeya-āvaraṇa-prahāṇa-artham
If I save my string to a file using:
with codecs.open('out.txt', 'w', 'UTF-8') as f:
    f.write(pred_str)
it saves string as expected.
Your data is encoded with the "UTF-8-SIG" codec, which is sometimes used in Microsoft environments.
This variant of UTF-8 prefixes encoded text with a byte order mark '\xef\xbb\xbf', to make it easier for applications to detect UTF-8 encoded text vs other encodings.
You can decode such bytestrings like this:
>>> bs = b'\xef\xbb\xbfpudgala-dharma-nair\xc4\x81tmyayo\xe1\xb8\xa5 apratipanna-vipratipann\xc4\x81n\xc4\x81m'
>>> text = bs.decode('utf-8-sig')
>>> print(text)
pudgala-dharma-nairātmyayoḥ apratipanna-vipratipannānām
To read such data from a file:
with open('myfile.txt', 'r', encoding='utf-8-sig') as f:
    text = f.read()
Note that even after decoding from UTF-8-SIG, you may still be unable to print your data because your console's default code page may not be able to encode other non-ascii characters in the data. In that case you will need to adjust your console settings to support UTF-8.
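One way to do that from inside the script (a sketch; sys.stdout.reconfigure requires Python 3.7+, and on older versions you can set the PYTHONIOENCODING environment variable instead):
import sys

# Re-wrap stdout so that printing never raises UnicodeEncodeError;
# characters the console still cannot show are replaced.
sys.stdout.reconfigure(encoding='utf-8', errors='replace')
print(pred_str)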
try this code:
if pred_str.startswith('\ufeff'):
    pred_str = pred_str.split('\ufeff')[1]

UnicodeDecodeError: 'utf8' codec can't decode byte 0xea [duplicate]

This question already has answers here:
How to determine the encoding of text
I have a CSV file that I'm uploading via an HTML form to a Python API.
The API looks like this:
@app.route('/add_candidates_to_db', methods=['GET','POST'])
def add_candidates():
    file = request.files['csv_file']
    x = io.StringIO(file.read().decode('UTF8'), newline=None)
    csv_input = csv.reader(x)
    for row in csv_input:
        print(row)
I found the part of the file that causes the issue. In my file it has Í character.
I get this error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xea in position 1317: invalid continuation byte
I thought I was decoding it with .decode('UTF8') or is the error happening before that with file.read()?
How do I fix this?
Edit: I have control of the file. I am creating the CSV file myself by pulling data (sometimes this data has strange characters).
On the server side, I'm reading each row in the file and inserting it into a database.
Your data is not UTF-8, it contains errors. You say that you are generating the data, so the ideal solution is to generate better data.
Unfortunately, sometimes we are unable to get high-quality data, or we have servers that give us garbage and we have to sort it out. For these situations, we can use less strict error handling when decoding text.
Instead of:
file.read().decode('UTF8')
You can use:
file.read().decode('UTF8', 'replace')
This will make it so that any “garbage” characters (anything which is not correctly encoded as UTF-8) will get replaced with U+FFFD, which looks like this:
�
You say that your file has the Í character, but you are probably viewing the file using an encoding other than UTF-8. Is your file supposed to contain Í, or is it just mojibake? Maybe you can figure out what the character is supposed to be, and from that, you can figure out what encoding your data uses if it's not UTF-8.
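A quick way to see what that single byte would mean in a few candidate encodings (a sketch; the candidate list is just a guess, extend it as needed):
# 0xea is the byte from the error message
for enc in ('latin-1', 'cp1252', 'mac_roman'):
    print(enc, b'\xea'.decode(enc, errors='replace'))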
It seems that your file is not encoded in utf8. You can try reading the file with all the encodings that Python understands and check which ones let you read the entire content of the file. Try this script:
from codecs import open
encodings = [
"ascii",
"big5",
"big5hkscs",
"cp037",
"cp424",
"cp437",
"cp500",
"cp720",
"cp737",
"cp775",
"cp850",
"cp852",
"cp855",
"cp856",
"cp857",
"cp858",
"cp860",
"cp861",
"cp862",
"cp863",
"cp864",
"cp865",
"cp866",
"cp869",
"cp874",
"cp875",
"cp932",
"cp949",
"cp950",
"cp1006",
"cp1026",
"cp1140",
"cp1250",
"cp1251",
"cp1252",
"cp1253",
"cp1254",
"cp1255",
"cp1256",
"cp1257",
"cp1258",
"euc_jp",
"euc_jis_2004",
"euc_jisx0213",
"euc_kr",
"gb2312",
"gbk",
"gb18030",
"hz",
"iso2022_jp",
"iso2022_jp_1",
"iso2022_jp_2",
"iso2022_jp_2004",
"iso2022_jp_3",
"iso2022_jp_ext",
"iso2022_kr",
"latin_1",
"iso8859_2",
"iso8859_3",
"iso8859_4",
"iso8859_5",
"iso8859_6",
"iso8859_7",
"iso8859_8",
"iso8859_9",
"iso8859_10",
"iso8859_13",
"iso8859_14",
"iso8859_15",
"iso8859_16",
"johab",
"koi8_r",
"koi8_u",
"mac_cyrillic",
"mac_greek",
"mac_iceland",
"mac_latin2",
"mac_roman",
"mac_turkish",
"ptcp154",
"shift_jis",
"shift_jis_2004",
"shift_jisx0213",
"utf_32",
"utf_32_be",
"utf_32_le",
"utf_16",
"utf_16_be",
"utf_16_le",
"utf_7",
"utf_8",
"utf_8_sig",
]
for encoding in encodings:
    try:
        with open(file, encoding=encoding) as f:
            f.read()
        print('Seemingly working encoding: {}'.format(encoding))
    except:
        pass
where file is the filename of your file.
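If installing a third-party package is an option, a statistical detector is usually quicker than brute force. A sketch using the chardet package (not part of the standard library; charset-normalizer is a similar alternative):
import chardet

with open(file, 'rb') as f:    # read raw bytes, no decoding yet
    raw = f.read()
print(chardet.detect(raw))     # e.g. {'encoding': ..., 'confidence': ..., ...}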

What kind of encoding does a standard MIDI file use?

Here's what brought this question up:
with open(path + "/OneChance1.mid") as f:
    for line in f.readline():
        print(line)
Here I am simply trying to read a midi file to scour its contents. I then receive this error message: UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 153: character maps to <undefined>
If I use open()'s second param like so: with open(path + "/OneChance1.mid", encoding='utf-8') as f: then I receive this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 13: invalid start byte
If I change the encoding param to ascii I get another error about an ordinal being out of range. Lastly I tried utf-16 and it said that the file didn't start with BOM (which made me smile for some reason). Also, if I ignore the errors I get characters that resemble nothing of the kind of data I am expecting. My expectations are based on this source: http://www.sonicspot.com/guide/midifiles.html
Anyway, does anyone know what kind of encoding a midi file uses? My research is coming up short in that regard so I thought it would be worth asking on SO. Or maybe someone can point out some other possibilities or blunders?
MIDI files are binary content. By opening the file in text mode, however, Python applies the default system encoding and tries to decode the contents as Unicode text.
Open the file in binary mode instead:
with open(midifile, 'rb') as mfile:
    leader = mfile.read(4)
    if leader != b'MThd':
        raise ValueError('Not a MIDI file!')
You'd have to study the MIDI standard file format if you wanted to learn more from the file. Also see What is the structure of a MIDI file?
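Going one small step further (a sketch, following the standard MIDI header layout): after the b'MThd' tag come a 4-byte big-endian chunk length and three 2-byte big-endian fields, the format, the number of tracks, and the time division.
import struct

with open(midifile, 'rb') as mfile:
    if mfile.read(4) != b'MThd':
        raise ValueError('Not a MIDI file!')
    (length,) = struct.unpack('>I', mfile.read(4))        # always 6 for MThd
    fmt, ntracks, division = struct.unpack('>HHH', mfile.read(6))
    print(fmt, ntracks, division)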
It's a binary file, it's not text using a text encoding like you seem to expect.
To open a file in binary mode in Python, pass a string containing "b" as the second argument to open().
This page contains a description of the format.

Parse a file in a robust way with Python 3

I have a log file that I need to go through line by line, and apparently it contains some "bad bytes". I get an error message along the following lines:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 9: invalid start byte
I have been able to strip down the problem to a file "log.test" containing the following line:
Message: \260
(At least this is how it shows up in my Emacs.)
I have a file "demo_error.py" which looks like this:
import sys
with open(sys.argv[1], 'r') as lf:
    for i, l in enumerate(lf):
        print(i, l.strip())
I then run, from the command line:
$ python3 demo_error.py log.test
The full traceback is:
Traceback (most recent call last):
  File "demo_error.py", line 5, in <module>
    for i, l in enumerate(lf):
  File "/usr/local/Cellar/python3/3.4.0/Frameworks/Python.framework/Versions/3.4/lib/python3.4/codecs.py", line 313, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 13: invalid start byte
My hunch is that I have to somehow specify a more general codec ("raw ascii" for instance) - but I'm not quite sure how to do this.
Note that this is not really a problem in Python 2.7.
And just to make my point clear: I don't mind getting an exception for the line in question - then I can simply discard the line. The problem is that the exception seems to happen on the "for" loop itself, which makes special handling of that particular line impossible.
You can also use the codecs module. When you use the codecs.open() function, you can specify how it handles errors using the errors argument:
codecs.open(filename, mode[, encoding[, errors[, buffering]]])
The errors argument can be one of several different keywords that specify how you want Python to behave when it attempts to decode a character that is invalid for the current encoding. You'll probably be most interested in codecs.ignore_errors or codecs.replace_errors, which cause invalid characters to be either ignored or replaced with a default character, respectively.
This method can be a good alternative when you know you have corrupt data that will cause the UnicodeDecodeError to be raised even when you specify the correct encoding.
Example:
import codecs

with codecs.open('file.txt', mode='r', errors='ignore') as f:
    # ...stuff...
    # Even if there is corrupt data and invalid characters for the default
    # encoding, this open() will still succeed
So apparently your file does not contain valid UTF-8 (which is the default encoding).
If you know, what encoding is used (e.g. iso-8859-1 which was afaik the python2 default), you can specify it when opening by using
open(sys.argv[1], mode='r', encoding='iso-8859-1')
If the encoding is unknown or not valid as all, you can open the file as binary.
open(sys.argv[1], mode='rb')
This will make the content accessible as bytes rather than trying to interpret them as characters.
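That also gives you the per-line handling you describe at the end of your question: read raw bytes, then decode each line on its own, so a bad line can be discarded without killing the loop. A sketch built on your demo_error.py:
import sys

with open(sys.argv[1], 'rb') as lf:
    for i, raw_line in enumerate(lf):
        try:
            line = raw_line.decode('utf-8')
        except UnicodeDecodeError:
            continue          # discard lines that are not valid UTF-8
        print(i, line.strip())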
In Python 2.7 and earlier, strings (str) are arrays of 8-bit characters. So when reading a file composed of 8-bit characters or bytes, you get the bytes without problem, no matter what the actual encoding is. You may simply read them with a wrong representation, but it will never throw any exception.
In Python 3, strings are Unicode strings (sequences of Unicode code points). So when reading a file, Python has to decode it, and by default it uses the system encoding, which is not necessarily UTF-8. In your case it seems to assume UTF-8, but your log file is not UTF-8 encoded, hence the exception.
If you are not sure of the encoding, you may reasonably try ISO-8859-1 with
open(sys.argv[1], 'r', encoding='iso-8859-1')
