I'm reading a bunch of RTF files into Python strings.
On some of the texts, I get this error:
Traceback (most recent call last):
File "11.08.py", line 47, in <module>
X = vectorizer.fit_transform(texts)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line
716, in fit_transform
X = super(TfidfVectorizer, self).fit_transform(raw_documents)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line
398, in fit_transform
term_count_current = Counter(analyze(doc))
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line
313, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line
224, in decode
doc = doc.decode(self.charset, self.charset_error)
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 462: invalid
start byte
I've tried:
Copying and pasting the text of the files into new files
Saving the RTF files as TXT files
Opening the TXT files in Notepad++ and choosing 'Convert to UTF-8', and also setting the encoding to UTF-8
Opening the files with Microsoft Word and saving them as new files
Nothing works. Any ideas?
It's probably not related, but here's the code in case you are wondering:
f = open(dir+location, "r")
doc = Rtf15Reader.read(f)
t = PlaintextWriter.write(doc).getvalue()
texts.append(t)
f.close()
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X = vectorizer.fit_transform(texts)
This will solve your issues:
import codecs
f = codecs.open(dir+location, 'r', encoding='utf-8')
txt = f.read()
From that point on, txt is in unicode format and you can use it anywhere in your code.
If you want to generate UTF-8 files after your processing, do:
f.write(txt.encode('utf-8'))
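And a minimal sketch of the writing side (output.txt is a placeholder name); since txt is already unicode, you open a plain file in binary mode and write the encoded bytes:
out = open('output.txt', 'wb')     # plain file, binary mode
out.write(txt.encode('utf-8'))     # unicode text -> UTF-8 bytes
out.close()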
As I said on the mailing list, it is probably easiest to use the charset_error option and set it to 'ignore'.
If the file is actually utf-16, you can also set the charset to utf-16 in the Vectorizer.
See the docs.
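For example, a sketch using the parameter names that appear in your traceback (older scikit-learn releases used charset/charset_error; later releases renamed them to encoding/decode_error):
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english',
                             charset_error='ignore')   # silently drop undecodable bytes
X = vectorizer.fit_transform(texts)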
You can dump the CSV file rows into a JSON file without any encoding error as follows:
json.dump(row,jsonfile, encoding="ISO-8859-1")
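A slightly fuller sketch of that idea, assuming Python 2 (the encoding keyword of json.dump only exists there) and placeholder file names:
import csv
import json

with open('input.csv', 'rb') as csvfile, open('rows.json', 'w') as jsonfile:
    for row in csv.reader(csvfile):
        # encoding tells json how to decode the byte strings in each row
        json.dump(row, jsonfile, encoding="ISO-8859-1")
        jsonfile.write('\n')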
Keep this line:
vectorizer = TfidfVectorizer(encoding='latin-1', sublinear_tf=True, max_df=0.5, stop_words='english')
encoding = 'latin-1' worked for me.
I am trying to do some rudimentary corpus analysis with Python. I am getting the following error message(s):
Traceback (most recent call last):
File "<pyshell#28>", line 2, in <module>
print(len(poems.words(f)), f)
File "C:\Python38-32\lib\site-packages\nltk\corpus\reader\util.py", line 240, in __len__
for tok in self.iterate_from(self._toknum[-1]):
File "C:\Python38-32\lib\site-packages\nltk\corpus\reader\util.py", line 306, in iterate_from
tokens = self.read_block(self._stream)
File "C:\Python38-32\lib\site-packages\nltk\corpus\reader\plaintext.py", line 134, in _read_word_block
words.extend(self._word_tokenizer.tokenize(stream.readline()))
File "C:\Python38-32\lib\site-packages\nltk\data.py", line 1220, in readline
new_chars = self._read(readsize)
File "C:\Python38-32\lib\site-packages\nltk\data.py", line 1458, in _read
chars, bytes_decoded = self._incr_decode(bytes)
File "C:\Python38-32\lib\site-packages\nltk\data.py", line 1489, in _incr_decode
return self.decode(bytes, 'strict')
File "C:\Python38-32\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 12: invalid start byte
My assumption is that there is a UTF error in one of the 202 text files I am looking at.
Is there any way of telling, from the error messages, which file or files have the problem?
Assuming that you know the file ids (the paths of your corpus files), you can open all of them with encoding="utf-8".
If you don't know the paths, assuming that you are using the nltk corpus loader, you can get them by using:
poems.fileids()
After that, for every file in your list of files (for example fileids) you can try:
for file_ in fileids:
    try:
        with open(file_, encoding="utf-8") as f_i:
            f_i.readlines()
    except UnicodeDecodeError:
        print("You got problems with the file: ", file_)
Anyway, your loader also has a parameter named "encoding" that you can use to set the correct encoding of your corpus. By default it is set to "utf-8".
More details here: nltk corpus loader
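For example, assuming poems was built with PlaintextCorpusReader (my assumption; the path and file pattern below are placeholders), you can pass the encoding when constructing it:
from nltk.corpus import PlaintextCorpusReader

# 'latin-1' maps every possible byte to a character, so reading cannot fail;
# byte 0x97 is a dash in the Windows cp1252 code page, a common culprit.
poems = PlaintextCorpusReader('path/to/poems', r'.*\.txt', encoding='latin-1')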
I feel like an absolute idiot for posting this...
So, I'm making a file crypter that reads a text file, outputs it to an encrypted file, and then allows you to turn that file back into plaintext. I've got writing the file down, but reading it is a problem.
From the encryption:
newf.write(bytes(result[0], "utf-8"))
newf.write(bytes('{[:|:;:|:]}', "utf-8"))
newf.write(bytes(result[1], "utf-8"))
newf.close()
And also the decryption:
name = fudder.askopenfilename(defaultextension =("Text Files","*.txt"),title = "Choose a file to decrypt.")
with open(name,'rb') as Usefile:
filecont = bytes(Usefile.read(),'utf-8')
It brings up this error:
File "C:\STUFF\FILE.py", line 93, in <lambda>
self.fileO = Button(text = 'Decrypt File', command = lambda: cryptFile())
File "C:\STUFF\FILE.py", line 60, in cryptFile
filecont = Usefile.read()
File "C:\Program Files (x86)\Python35-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 68: character maps to <undefined>
The traceback shows that in your real code, the error occurs in the cryptFile function on this line:
filecont = Usefile.read()
The UnicodeDecodeError indicates that Usefile is a file-like object that was probably opened in text mode without specifying an encoding. That means Python falls back to the platform default, cp1252 on Windows, to decode a file that was actually written as UTF-8, and the codec fails as soon as it hits a byte it cannot map (such as 0x81).
The solution is to specify the correct encoding when opening the file:
with open(name, 'rt', encoding='utf-8') as Usefile:
    filecont = Usefile.read()
This will result in filecont being a unicode string object.
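Alternatively, if the encrypted file can contain arbitrary non-text bytes, it may be safer to keep it as bytes and decode only the pieces you know were written as UTF-8 text. A rough sketch, reusing the separator from your encryption code:
with open(name, 'rb') as Usefile:
    filecont = Usefile.read()              # raw bytes, nothing is decoded

# split on the separator written between the two encrypted parts
part1, part2 = filecont.split(b'{[:|:;:|:]}', 1)
text1 = part1.decode('utf-8')              # decode each part back to text
text2 = part2.decode('utf-8')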
I have the following code to extract features from a set of files (folder name is the category name) for text classification.
import sklearn.datasets
from sklearn.feature_extraction.text import TfidfVectorizer
train = sklearn.datasets.load_files('./train', description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=0)
print len(train.data)
print train.target_names
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train.data)
It throws the following stack trace:
Traceback (most recent call last):
File "C:\EclipseWorkspace\TextClassifier\main.py", line 16, in <module>
X_train = vectorizer.fit_transform(train.data)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 1285, in fit_transform
X = super(TfidfVectorizer, self).fit_transform(raw_documents)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 804, in fit_transform
self.fixed_vocabulary_)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 739, in _count_vocab
for feature in analyze(doc):
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 236, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 113, in decode
doc = doc.decode(self.encoding, self.decode_error)
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 32054: invalid start byte
I run Python 2.7. How can I get this to work?
EDIT:
I have just discovered that this works perfectly well for files with utf-8 encoding (my files are ANSI encoded). Is there any way I can get sklearn.datasets.load_files() to work with ANSI encoding?
ASCII is a strict subset of UTF-8, so plain ASCII files decode fine as UTF-8. Windows "ANSI" (usually cp1252), however, is not: any byte above 0x7F can trip the UTF-8 decoder, and the stack trace shows that your input contains the byte 0xFF somewhere, which is never valid in UTF-8.
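Assuming the files really are Windows "ANSI", you can tell load_files to decode them with that codec instead of leaving the decoding to the vectorizer; 'cp1252' is my guess at the actual code page:
train = sklearn.datasets.load_files('./train', description=None, categories=None,
                                    load_content=True, shuffle=True,
                                    encoding='cp1252',       # decode the raw bytes as Windows ANSI
                                    decode_error='replace',  # or 'ignore' to drop bad bytes
                                    random_state=0)
X_train = TfidfVectorizer().fit_transform(train.data)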
I fixed the problem by changing the errors setting from 'strict' to 'ignore':
vectorizer = CountVectorizer(binary = True, decode_error = u'ignore')
word_tokenizer = vectorizer.build_tokenizer()
doc_terms_list_train = [word_tokenizer(str(doc_str, encoding = 'utf-8', errors = 'ignore')) for doc_str in doc_str_list_train]
doc_train_vec = vectorizer.fit_transform(doc_str_list_train)
Here is the detailed explanation of the CountVectorizer function.
I have a lengthy json file that contains utf-8 characters (and is encoded in utf-8). I want to read it in python using the built-in json module.
My code looks like this:
dat = json.load(open("data.json"), "utf-8")
I understand the "utf-8" argument should be unnecessary, since it is assumed to be the default. However, I get this error:
Traceback (most recent call last):
File "winratio.py", line 9, in <module>
dat = json.load(open("data.json"), "utf-8")
File "C:\Python33\lib\json\__init__.py", line 271, in load
return loads(fp.read(),
File "C:\Python33\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 28519: character maps to <undefined>
My question is: Why does python seem to ignore my encoding specification and try to load the file in cp1252?
Try this:
import codecs
dat = json.load(codecs.open("data.json", "r", "utf-8"))
Some tips about write modes in the context of the codecs library are also described here: Write to UTF-8 file in Python
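Since your traceback shows Python 3.3, a minimal sketch without the codecs module also works, because the built-in open() accepts an encoding argument there:
import json

# open() decodes the file as UTF-8, so json.load() receives proper text
with open("data.json", encoding="utf-8") as f:
    dat = json.load(f)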
I have a strange problem. A friend of mine sent me a text file. If I copy the text, paste it into my text editor, and save it, the following code works. If I choose the option to save the file directly from the browser, the following code breaks. What's going on? Is it the browser's fault for saving invalid characters?
This is an example line.
When I save it, the line says
What�s going on?
When I copy/paste it, the line says
What’s going on?
This is the code:
import codecs

def do_stuff(filename):
    with codecs.open(filename, encoding='utf-8') as f:
        def process_line(line):
            return line.strip()

        lines = f.readlines()
        for line in lines:
            line = process_line(line)
            print line

do_stuff('stuff.txt')
This is the traceback I get:
Traceback (most recent call last):
File "test-encoding.py", line 13, in <module>
do_stuff('stuff.txt')
File "test-encoding.py", line 8, in do_stuff
lines = f.readlines()
File "/home/somebody/.python/lib64/python2.7/codecs.py", line 679, in readlines
return self.reader.readlines(sizehint)
File "/home/somebody/.python/lib64/python2.7/codecs.py", line 588, in readlines
data = self.read()
File "/home/somebody/.python/lib64/python2.7/codecs.py", line 477, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 4: invalid start byte
What can I do in such cases?
How can I distribute the script if I don't know what encoding the user who runs it will use?
Fixed:
with codecs.open(filename, encoding='utf-8', errors='ignore') as f:
The "file-oriented" part of the browser works with raw bytes, not characters. The specific encoding used by the page should be specified either in the HTTP headers or in the HTML itself. You must use this encoding instead of assuming that you have UTF-8 data.