Terrible "invalid start byte" Unicode Error with Opening a CSV file - python

Please please please help. I've been struggling with this for a while and ran into problem after problem. I'm just trying to make a loop that opens every csv file in a folder. Here's the loop:
folder = '/Users/jolijttamanaha/Documents/Senior/Thesis/Python/TextAnalysis/datedmatchedngrams2/'
for file in os.listdir(folder):
    with codecs.open(file, mode='rU', encoding='utf-8') as f:
        m = min(int(line[1]) for line in csv.reader(f))
        f.seek(0)
        for line in csv.reader(f):
            if int(line[1]) == m:
                print line
Here's the error:
Traceback (most recent call last):
  File "findfirsttrigram.py", line 11, in <module>
    m=min(int(line[1]) for line in csv.reader(f))
  File "findfirsttrigram.py", line 11, in <genexpr>
    m=min(int(line[1]) for line in csv.reader(f))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 684, in next
    return self.reader.next()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 615, in next
    line = self.readline()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 530, in readline
    data = self.read(readsize, firstline=True)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 477, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x87 in position 0: invalid start byte
I got here because first I had a "Null Byte" error, which I solved with this post: "Line contains NULL byte" in CSV reader (Python)
Then I got an integer error, which I solved with this post "an integer is required" when open()'ing a file as utf-8?
Then I got an error that said: 'UnicodeException: UTF-16 stream does not start with BOM' which I solved with this post utf-16 file seeking in python. how?
Then I realized that the csv module requires utf-8 so here I am.
But I've finally hit the limit of the existing questions. I can't figure out what is going on. Please please help.

I'm not sure why but this ultimately worked:
import csv
import os
import unicodecsv

folder = '/Users/jolijttamanaha/Documents/Senior/Thesis/Python/TextAnalysis/datedmatchedngrams3/'
for file in os.listdir(folder):
    with open(os.path.join(folder, file), mode='rU') as f:
        try:
            m = min(int(line[1]) for line in unicodecsv.reader(f, encoding='utf-8', errors='replace'))
        except:
            print "one no work"
            continue
        f.seek(0)
        for line in unicodecsv.reader(f):
            if int(line[1]) == m:
                print line

Perhaps try using os.walk and iterating over the files it yields? Note that you need to join the directory back onto each filename:
folder = '/Users/jolijttamanaha/Documents/Senior/Thesis/Python/TextAnalysis/datedmatchedngrams2/'
for subdir, dirs, files in os.walk(folder):
    for file in files:
        with codecs.open(os.path.join(subdir, file), mode='rU', encoding='utf-16-be') as f:
            # Your code here

Clearly then your file is not encoded in UTF-8. Try another encoding. If you're using Windows, 'mbcs' will use the default encoding for your version of Windows.

Related

How do I determine which text in a corpus contains an error generated by the NLTK suite in Python?

I am trying to do some rudimentary corpus analysis with Python. I am getting the following error message(s):
Traceback (most recent call last):
  File "<pyshell#28>", line 2, in <module>
    print(len(poems.words(f)), f)
  File "C:\Python38-32\lib\site-packages\nltk\corpus\reader\util.py", line 240, in __len__
    for tok in self.iterate_from(self._toknum[-1]):
  File "C:\Python38-32\lib\site-packages\nltk\corpus\reader\util.py", line 306, in iterate_from
    tokens = self.read_block(self._stream)
  File "C:\Python38-32\lib\site-packages\nltk\corpus\reader\plaintext.py", line 134, in _read_word_block
    words.extend(self._word_tokenizer.tokenize(stream.readline()))
  File "C:\Python38-32\lib\site-packages\nltk\data.py", line 1220, in readline
    new_chars = self._read(readsize)
  File "C:\Python38-32\lib\site-packages\nltk\data.py", line 1458, in _read
    chars, bytes_decoded = self._incr_decode(bytes)
  File "C:\Python38-32\lib\site-packages\nltk\data.py", line 1489, in _incr_decode
    return self.decode(bytes, 'strict')
  File "C:\Python38-32\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 12: invalid start byte
My assumption is that there is a UTF error in one of the 202 text files I am looking at.
Is there any way of telling, from the error messages, which file or files have the problem?
Assuming that you know the file ids (the paths of your corpus files), you can try opening each of them with encoding="utf-8".
If you don't know the paths, and assuming you are using the NLTK corpus loader, you can get them with:
poems.fileids()
After that, for every file in your list of files (for example fileids) you can try:
for file_ in fileids:
    try:
        with open(file_, encoding="utf-8") as f_i:
            f_i.readlines()
    except UnicodeDecodeError:
        print("You got problems with the file: ", file_)
Anyway, your loader also has a parameter named "encoding" that you can use to set the correct encoding for your corpus. By default it is set to "utf-8".
More details here: nltk corpus loader
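The try/except loop above tells you which file fails; the UnicodeDecodeError object can also tell you where, since it carries the byte offset and lets you recover the offending byte. A sketch with stand-in files (the 0x97 byte mirrors the error in the question):

```python
import os
import tempfile

# Two stand-in corpus files: one clean, one with a stray cp1252 byte
# (0x97 is an em dash in cp1252, but an invalid UTF-8 start byte).
tmpdir = tempfile.mkdtemp()
good = os.path.join(tmpdir, "good.txt")
bad = os.path.join(tmpdir, "bad.txt")
with open(good, "wb") as f:
    f.write(b"plain ascii poem\n")
with open(bad, "wb") as f:
    f.write(b"dash \x97 here\n")

problems = []
for path in (good, bad):
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as e:
        # e.start is the byte offset of the failure; data[e.start]
        # is the bad byte itself (an int in Python 3).
        problems.append((os.path.basename(path), e.start, data[e.start]))

print(problems)
```

Knowing the bad byte's value (here 0x97) often identifies the true encoding at a glance.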

Python - UnicodeDecodeError when trying to parse HTML file which is ASCII

Am using Python 2.7.6.
Have an HTML file which contains values prepended with "$". Wrote a program which takes in JSON data and replaces the values prepended with $ with the JSON values.
This was working fine until someone opened up the set of HTML files with a different editor and changed it from UTF-8 to ASCII.
class FileUtil:
    @staticmethod
    def replace_all(output_file, data):
        homedir = os.path.expanduser("~")
        dest_dir = homedir + "/dest_dir"
        with open(output_file, "r") as my_file:
            contents = my_file.read()
        destination_file = dest_dir + "/" + data["filename"]
        fp = open(destination_file, "w")
        for key, value in data.iteritems():
            contents = contents.replace("$" + str(key), value)
        fp.write(contents)
        fp.close()
Whenever my program encounters a file which is in ASCII it throws this error:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/web.py-0.37-py2.7.egg/web/application.py", line 239, in process
    return self.handle()
  File "/usr/local/lib/python2.7/dist-packages/web.py-0.37-py2.7.egg/web/application.py", line 230, in handle
    return self._delegate(fn, self.fvars, args)
  File "/usr/local/lib/python2.7/dist-packages/web.py-0.37-py2.7.egg/web/application.py", line 420, in _delegate
    return handle_class(cls)
  File "/usr/local/lib/python2.7/dist-packages/web.py-0.37-py2.7.egg/web/application.py", line 396, in handle_class
    return tocall(*args)
    FileUtil.replace_all(output_file, data)
  File "/home/devuser/demo/utils/fileutils.py", line 11, in replace_all
    contents = contents.replace("$" + str(key), value)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 54826: ordinal not in range(128)
Question(s):
Is there a way to make the contents value to be strictly UTF-8 in python?
Is it better to use a command line utility in Ubuntu Linux to convert the file before running this python script?
Is the error an encoding problem (e.g. file is ASCII and not UTF8)?
@Apalala
Thank you very much regarding chardet! It was a very useful tool.
@Ulrich Ekhardt
You are right, it is UTF-8 and not ASCII.
This was the solution:
iconv --from-code UTF-8 --to-code US-ASCII -c hello.htm > hello.html
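If you'd rather stay inside Python than shell out, the iconv command above has a rough equivalent using encode with errors='ignore', which drops non-ASCII characters much like iconv's -c flag (a sketch on an invented string):

```python
# Mirrors: iconv --from-code UTF-8 --to-code US-ASCII -c
# Anything outside ASCII (accents, em dashes) is silently dropped.
text = u"caf\u00e9 \u2014 r\u00e9sum\u00e9"
ascii_text = text.encode("ascii", errors="ignore").decode("ascii")
print(ascii_text)
```

Like iconv -c, this is lossy; errors='replace' would substitute '?' instead of dropping characters, which is easier to spot later.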

Unicode problems in Python again

I have a strange problem. A friend of mine sent me a text file. If I copy the text and paste it into my text editor and save it, the following code works. If I choose the option to save the file directly from the browser, the following code breaks. What's going on? Is it the browser's fault for saving invalid characters?
This is an example line.
When I save it, the line says
What�s going on?
When I copy/paste it, the line says
What’s going on?
This is the code:
import codecs

def do_stuff(filename):
    with codecs.open(filename, encoding='utf-8') as f:
        def process_line(line):
            return line.strip()
        lines = f.readlines()
        for line in lines:
            line = process_line(line)
            print line

do_stuff('stuff.txt')
This is the traceback I get:
Traceback (most recent call last):
  File "test-encoding.py", line 13, in <module>
    do_stuff('stuff.txt')
  File "test-encoding.py", line 8, in do_stuff
    lines = f.readlines()
  File "/home/somebody/.python/lib64/python2.7/codecs.py", line 679, in readlines
    return self.reader.readlines(sizehint)
  File "/home/somebody/.python/lib64/python2.7/codecs.py", line 588, in readlines
    data = self.read()
  File "/home/somebody/.python/lib64/python2.7/codecs.py", line 477, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 4: invalid start byte
What can I do in such cases?
How can I distribute the script if I don't know what encoding the user who runs it will use?
Fixed:
with codecs.open(filename, encoding='utf-8', errors='ignore') as f:
The "file-oriented" part of the browser works with raw bytes, not characters. The specific encoding used by the page should be specified either in the HTTP headers or in the HTML itself. You must use this encoding instead of assuming that you have UTF-8 data.
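For what it's worth, byte 0x92 is a strong clue on its own: it is the curly apostrophe in windows-1252, which many Windows editors and browsers emit, and it is never a valid UTF-8 start byte. Decoding with that codec recovers the character instead of discarding it with errors='ignore' (a sketch):

```python
raw = b"What\x92s going on?"  # bytes as the browser saved them

# raw.decode("utf-8") would raise: 0x92 is an invalid start byte.
# In windows-1252, 0x92 maps to U+2019 RIGHT SINGLE QUOTATION MARK.
text = raw.decode("cp1252")
print(text)
```

The same reasoning applies whenever the mojibake shows a '�' where a curly quote or dash should be.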

Python: UnicodeDecodeError: 'utf8' codec can't decode byte

I'm reading a bunch of RTF files into python strings.
On SOME texts, I get this error:
Traceback (most recent call last):
  File "11.08.py", line 47, in <module>
    X = vectorizer.fit_transform(texts)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 716, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 398, in fit_transform
    term_count_current = Counter(analyze(doc))
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 313, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 224, in decode
    doc = doc.decode(self.charset, self.charset_error)
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 462: invalid start byte
I've tried:
Copying and pasting the text of the files to new files
saving the rtf files as txt files
Opening the txt files in Notepad++ and choosing 'convert to utf-8' and also setting the encoding to utf-8
Opening the files with Microsoft Word and saving them as new files
Nothing works. Any ideas?
It's probably not related, but here's the code in case you are wondering:
f = open(dir+location, "r")
doc = Rtf15Reader.read(f)
t = PlaintextWriter.write(doc).getvalue()
texts.append(t)
f.close()
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X = vectorizer.fit_transform(texts)
This will solve your issues:
import codecs
f = codecs.open(dir+location, 'r', encoding='utf-8')
txt = f.read()
From that moment on, txt is in unicode format and you can use it everywhere in your code.
If you want to generate UTF-8 files after your processing do:
f.write(txt.encode('utf-8'))
As I said on the mailing list, it is probably easiest to use the charset_error option and set it to ignore.
If the file is actually utf-16, you can also set the charset to utf-16 in the Vectorizer.
See the docs.
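If you'd rather not rely on the vectorizer's parameter (it was later renamed decode_error in newer scikit-learn releases, if memory serves), you can apply the same 'ignore' idea yourself and hand the vectorizer already-decoded unicode (a sketch with invented stand-in documents):

```python
# Decode each raw document up front so the vectorizer only ever sees
# unicode text; undecodable bytes are dropped instead of raising.
raw_docs = [b"plain text", b"curly\x92quote"]
texts = [d.decode("utf-8", errors="ignore") for d in raw_docs]
print(texts)
```

Doing the decode yourself also makes it easy to log which documents lost bytes, which the vectorizer's option hides.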
You can dump the CSV file rows into a JSON file without any encoding error as follows:
json.dump(row, jsonfile, encoding="ISO-8859-1")
Keep this line:
vectorizer = TfidfVectorizer(encoding='latin-1', sublinear_tf=True, max_df=0.5, stop_words='english')
encoding='latin-1' worked for me.

UnicodeDecodeError when reading dictionary words file with simple Python script

First time doing Python in a while, and I'm having trouble doing a simple scan of a file when I run the following script with Python 3.0.1,
with open("/usr/share/dict/words", 'r') as f:
    for line in f:
        pass
I get this exception:
Traceback (most recent call last):
  File "/home/matt/install/test.py", line 2, in <module>
    for line in f:
  File "/home/matt/install/root/lib/python3.0/io.py", line 1744, in __next__
    line = self.readline()
  File "/home/matt/install/root/lib/python3.0/io.py", line 1817, in readline
    while self._read_chunk():
  File "/home/matt/install/root/lib/python3.0/io.py", line 1565, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "/home/matt/install/root/lib/python3.0/io.py", line 1299, in decode
    output = self.decoder.decode(input, final=final)
  File "/home/matt/install/root/lib/python3.0/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1689-1692: invalid data
The line in the file it blows up on is "Argentinian", which doesn't seem to be unusual in any way.
Update: I added
encoding="iso-8859-1"
to the open() call, and it fixed the problem.
How have you determined from "position 1689-1692" what line in the file it has blown up on? Those numbers would be offsets in the chunk that it's trying to decode. You would have had to determine what chunk it was -- how?
Try this at the interactive prompt:
buf = open('the_file', 'rb').read()
len(buf)
ubuf = buf.decode('utf8')
# splat ... but it will give you the byte offset into the file
buf[offset-50:offset+60]  # should show you where/what the problem is
# By the way, from the error message, looks like a bad
# FOUR-byte UTF-8 character ... interesting
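The chunk-offset guesswork can be skipped entirely: the UnicodeDecodeError raised by decoding the whole file carries the absolute byte offset, which maps to a line number by counting newlines. A sketch using a stand-in words file with one deliberately bad entry:

```python
import os
import tempfile

# Stand-in for /usr/share/dict/words; \xed with no continuation
# byte after it is invalid UTF-8 (it is 'i-acute' in latin-1).
path = os.path.join(tempfile.mkdtemp(), "words")
with open(path, "wb") as f:
    f.write(b"Argent\nArgentina\nArgentin\xedan\nArgentite\n")

buf = open(path, "rb").read()
try:
    buf.decode("utf-8")
except UnicodeDecodeError as e:
    # e.start is the absolute offset of the failure; newlines before
    # it give a 1-based line number.
    line_no = buf[:e.start].count(b"\n") + 1
    print(line_no)
```

This pinpoints the exact dictionary entry rather than the chunk, which also explains why a plain-looking word such as "Argentinian" can appear to be the culprit.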
Can you check to make sure it is valid UTF-8? A way to do that is given at this SO question:
iconv -f UTF-8 /usr/share/dict/words -o /dev/null
There are other ways to do the same thing.
