Tokenizing text with scikit-learn - python

I have the following code to extract features from a set of files (folder name is the category name) for text classification.
import sklearn.datasets
from sklearn.feature_extraction.text import TfidfVectorizer
train = sklearn.datasets.load_files('./train', description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=0)
print len(train.data)
print train.target_names
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train.data)
It throws the following stack trace:
Traceback (most recent call last):
File "C:\EclipseWorkspace\TextClassifier\main.py", line 16, in <module>
X_train = vectorizer.fit_transform(train.data)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 1285, in fit_transform
X = super(TfidfVectorizer, self).fit_transform(raw_documents)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 804, in fit_transform
self.fixed_vocabulary_)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 739, in _count_vocab
for feature in analyze(doc):
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 236, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 113, in decode
doc = doc.decode(self.encoding, self.decode_error)
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 32054: invalid start byte
I'm running Python 2.7. How can I get this to work?
EDIT:
I have just discovered that this works perfectly well for files with UTF-8 encoding (my files are ANSI encoded). Is there any way I can get sklearn.datasets.load_files() to work with ANSI encoding?

ASCII is a strict subset of UTF-8, so plain-ASCII files would decode just fine. "ANSI" on Windows, however, usually means a code page such as cp1252, and from the stack trace your input contains the byte 0xFF somewhere, which is never valid in UTF-8. That is why the strict UTF-8 decode fails.
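For example, you could tell load_files the real encoding up front (a sketch; cp1252 is only an assumption about what "ANSI" means on your machine, and decode_error='replace' keeps stray bytes from aborting the load):
import sklearn.datasets
from sklearn.feature_extraction.text import TfidfVectorizer

# encoding='cp1252' is an assumption about the files' "ANSI" code page;
# decode_error='replace' substitutes undecodable bytes such as 0xFF instead of raising
train = sklearn.datasets.load_files('./train', encoding='cp1252',
                                    decode_error='replace', random_state=0)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train.data)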

I fixed the problem by changing the decode_error setting from 'strict' to 'ignore':
from sklearn.feature_extraction.text import CountVectorizer

# decode_error='ignore' skips bytes that cannot be decoded instead of raising
vectorizer = CountVectorizer(binary=True, decode_error='ignore')
word_tokenizer = vectorizer.build_tokenizer()
doc_terms_list_train = [word_tokenizer(str(doc_str, encoding='utf-8', errors='ignore')) for doc_str in doc_str_list_train]
doc_train_vec = vectorizer.fit_transform(doc_str_list_train)
Here is the detailed documentation of the CountVectorizer function.

Related

How do I determine which text in a corpus contains an error generated by the NLTK suite in Python?

I am trying to do some rudimentary corpus analysis with Python. I am getting the following error message(s):
Traceback (most recent call last):
File "<pyshell#28>", line 2, in <module>
print(len(poems.words(f)), f)
File "C:\Python38-32\lib\site-packages\nltk\corpus\reader\util.py", line 240, in __len__
for tok in self.iterate_from(self._toknum[-1]):
File "C:\Python38-32\lib\site-packages\nltk\corpus\reader\util.py", line 306, in iterate_from
tokens = self.read_block(self._stream)
File "C:\Python38-32\lib\site-packages\nltk\corpus\reader\plaintext.py", line 134, in _read_word_block
words.extend(self._word_tokenizer.tokenize(stream.readline()))
File "C:\Python38-32\lib\site-packages\nltk\data.py", line 1220, in readline
new_chars = self._read(readsize)
File "C:\Python38-32\lib\site-packages\nltk\data.py", line 1458, in _read
chars, bytes_decoded = self._incr_decode(bytes)
File "C:\Python38-32\lib\site-packages\nltk\data.py", line 1489, in _incr_decode
return self.decode(bytes, 'strict')
File "C:\Python38-32\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 12: invalid start byte
My assumption is that there is a UTF error in one of the 202 text files I am looking at.
Is there any way of telling, from the error messages, which file or files have the problem?
Assuming that you know the file ids (the paths of your corpus files), you can try to open all of them with encoding="utf-8".
If you don't know the paths, assuming that you are using the nltk corpus loader, you can get them by using:
poems.fileids()
After that, for every file in your list of files (for example fileids) you can try:
for file_ in fileids:
    try:
        with open(file_, encoding="utf-8") as f_i:
            f_i.readlines()
    except UnicodeDecodeError:
        print("You got problems with the file: ", file_)
Anyway, your loader also has a parameter named "encoding" that you can use to specify the correct encoding of your corpus. By default it is set to "utf-8".
More details here: nltk corpus loader
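For example, a sketch assuming a plain-text corpus of .txt files (the root path here is made up, and latin-1 is only a fallback that accepts every byte; substitute the real path and encoding of your corpus):
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# 'path/to/poems' is a placeholder; latin-1 maps every byte to a character,
# so reading will not raise UnicodeDecodeError (though some characters may be wrong)
poems = PlaintextCorpusReader('path/to/poems', r'.*\.txt', encoding='latin-1')
print(len(poems.words(poems.fileids()[0])))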

TypeError: write() argument must be str, not bytes while saving .npy file

I tried running the code in a Keras blog post.
The code writes to a .npy file as follows:
bottleneck_features_train = model.predict_generator(generator, nb_train_samples // batch_size)
np.save(open('bottleneck_features_train.npy', 'w'),bottleneck_features_train)
It then reads from this file:
def train_top_model():
    train_data = np.load(open('bottleneck_features_train.npy'))
Now I get an error saying:
Found 2000 images belonging to 2 classes.
Traceback (most recent call last):
File "kerasbottleneck.py", line 103, in <module>
save_bottlebeck_features()
File "kerasbottleneck.py", line 69, in save_bottlebeck_features
np.save(open('bottleneck_features_train.npy', 'w'),bottleneck_features_train)
File "/opt/anaconda3/lib/python3.6/site-packages/numpy/lib/npyio.py", line 511, in save
pickle_kwargs=pickle_kwargs)
File "/opt/anaconda3/lib/python3.6/site-packages/numpy/lib/format.py", line 565, in write_array
version)
File "/opt/anaconda3/lib/python3.6/site-packages/numpy/lib/format.py", line 335, in _write_array_header
fp.write(header_prefix)
TypeError: write() argument must be str, not bytes
After this, I tried changing the file mode from 'w' to 'wb'. This resulted in an error while reading the file:
Found 2000 images belonging to 2 classes.
Found 800 images belonging to 2 classes.
Traceback (most recent call last):
File "kerasbottleneck.py", line 104, in <module>
train_top_model()
File "kerasbottleneck.py", line 82, in train_top_model
train_data = np.load(open('bottleneck_features_train.npy'))
File "/opt/anaconda3/lib/python3.6/site-packages/numpy/lib/npyio.py", line 404, in load
magic = fid.read(N)
File "/opt/anaconda3/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 0: invalid start byte
How can I fix this error?
The code in the blog post is aimed at Python 2, where writing to and reading from a file works with bytestrings. In Python 3, you need to open the file in binary mode, both for writing and then reading again:
np.save(open('bottleneck_features_train.npy', 'wb'),
        bottleneck_features_train)
And when reading:
train_data = np.load(open('bottleneck_features_train.npy', 'rb'))
Note the b character in the mode arguments there.
I'd use the file as a context manager to ensure it is cleanly closed:
with open('bottleneck_features_train.npy', 'wb') as features_train_file:
    np.save(features_train_file, bottleneck_features_train)
and
with open('bottleneck_features_train.npy', 'rb') as features_train_file:
    train_data = np.load(features_train_file)
The code in the blog post should use both of these changes anyway, because in Python 2, without the b flag in the mode, text files have platform-specific newline conventions translated, and on Windows certain characters in the stream have special meaning (including causing the file to appear shorter than it really is if an EOF character appears). With binary data that could be a real problem.
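As a side note (not from the blog post): np.save and np.load also accept a plain file path and then open and close the file in binary mode themselves, which sidesteps the mode question entirely:
import numpy as np

# Passing the path lets NumPy manage the file object and the binary mode;
# bottleneck_features_train is the array computed in the question's code
np.save('bottleneck_features_train.npy', bottleneck_features_train)
train_data = np.load('bottleneck_features_train.npy')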

UnicodeDecodeError when I use CUDA to train a dataset

I used Chainer to train on some images, but I get an error.
I don't know whether it's a UnicodeDecodeError or an error in the installation of CuPy.
P:\dcgans\chainer-DCGAN\chainer-DCGAN>python DCGAN.py
Traceback (most recent call last):
File "DCGAN.py", line 279, in <module>
train_dcgan_labeled(gen, dis)
File "DCGAN.py", line 171, in train_dcgan_labeled
zvis = (xp.random.uniform(-1, 1, (100, nz), dtype=np.float32))
File "P:\Python35\lib\site-packages\cupy\random\distributions.py", line 132, in uniform
return rs.uniform(low, high, size=size, dtype=dtype)
File "P:\Python35\lib\site-packages\cupy\random\generator.py", line 235, in uniform
rand = self.random_sample(size=size, dtype=dtype)
File "P:\Python35\lib\site-packages\cupy\random\generator.py", line 153, in random_sample
RandomState._1m_kernel(out)
File "cupy/core/elementwise.pxi", line 552, in cupy.core.core.ElementwiseKernel.__call__ (cupy\core\core.cpp:43810)
File "cupy/util.pyx", line 39, in cupy.util.memoize.decorator.ret (cupy\util.cpp:1480)
File "cupy/core/elementwise.pxi", line 409, in cupy.core.core._get_elementwise_kernel (cupy\core\core.cpp:42156)
File "cupy/core/elementwise.pxi", line 12, in cupy.core.core._get_simple_elementwise_kernel (cupy\core\core.cpp:34787)
File "cupy/core/elementwise.pxi", line 32, in cupy.core.core._get_simple_elementwise_kernel (cupy\core\core.cpp:34609)
File "cupy/core/carray.pxi", line 87, in cupy.core.core.compile_with_cache (cupy\core\core.cpp:34264)
File "P:\Python35\lib\site-packages\cupy\cuda\compiler.py", line 133, in compile_with_cache
base = _empty_file_preprocess_cache[env] = preprocess('', options)
File "P:\Python35\lib\site-packages\cupy\cuda\compiler.py", line 99, in preprocess
pp_src = pp_src.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 27-28: invalid continuation byte
It seems nvcc generated non-UTF-8 output and CuPy failed to decode it.
This is a bug in CuPy (I posted an issue: #378).
A possible workaround for the time being is to replace 'utf-8' in cupy/cuda/compiler.py, at the line pp_src = pp_src.decode('utf-8'), with something that matches your environment. For example, in a Japanese environment 'cp932' should work, and 'cp936' should perhaps work for simplified Chinese.
You could also try locale.getdefaultlocale()[1] as a universal solution (be sure to import locale).
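A minimal sketch of that temporary edit, assuming the surrounding code in cupy/cuda/compiler.py stays unchanged (this is a local workaround, not the upstream fix):
import locale

# Original line in cupy/cuda/compiler.py:
#     pp_src = pp_src.decode('utf-8')
# Workaround: decode nvcc's output with the system's locale encoding instead
pp_src = pp_src.decode(locale.getdefaultlocale()[1])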
Update: The fix has been merged. It should be fixed in upcoming CuPy v1.0.3.

Umlauts in JSON files lead to errors in Python code created by ANTLR4

I've created Python modules from the JSON grammar on GitHub / antlr4 with
antlr4 -Dlanguage=Python3 JSON.g4
I've written a main program "JSON2.py" following this guide: https://github.com/antlr/antlr4/blob/master/doc/python-target.md
and downloaded example1.json, also from GitHub.
python3 ./JSON2.py example1.json # works perfectly, but
python3 ./JSON2.py bookmarks-2017-05-24.json # the bookmarks contain German Umlauts like "ü"
...
File "/home/xyz/lib/python3.5/site-packages/antlr4/FileStream.py", line 27, in readDataFrom
return codecs.decode(bytes, encoding, errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 227: ordinal not in range(128)
The offending line in JSON2.py is
input = FileStream(argv[1])
I've searched stackoverflow and tried this instead of using the above FileStream:
fp = codecs.open(argv[1], 'rb', 'utf-8')
try:
    input = fp.read()
finally:
    fp.close()
lexer = JSONLexer(input)
stream = CommonTokenStream(lexer)
parser = JSONParser(stream)
tree = parser.json()  # This is line 39, mentioned in the error message
Execution of this program ends with an error message, even if the input file doesn't contain Umlauts:
python3 ./JSON2.py example1.json
Traceback (most recent call last):
File "./JSON2.py", line 46, in <module>
main(sys.argv)
File "./JSON2.py", line 39, in main
tree = parser.json()
File "/home/x/Entwicklung/antlr/links/JSONParser.py", line 108, in json
self.enterRule(localctx, 0, self.RULE_json)
File "/home/xyz/lib/python3.5/site-packages/antlr4/Parser.py", line 358, in enterRule
self._ctx.start = self._input.LT(1)
File "/home/xyz/lib/python3.5/site-packages/antlr4/CommonTokenStream.py", line 61, in LT
self.lazyInit()
File "/home/xyz/lib/python3.5/site-packages/antlr4/BufferedTokenStream.py", line 186, in lazyInit
self.setup()
File "/home/xyz/lib/python3.5/site-packages/antlr4/BufferedTokenStream.py", line 189, in setup
self.sync(0)
File "/home/xyz/lib/python3.5/site-packages/antlr4/BufferedTokenStream.py", line 111, in sync
fetched = self.fetch(n)
File "/home/xyz/lib/python3.5/site-packages/antlr4/BufferedTokenStream.py", line 123, in fetch
t = self.tokenSource.nextToken()
File "/home/xyz/lib/python3.5/site-packages/antlr4/Lexer.py", line 111, in nextToken
tokenStartMarker = self._input.mark()
AttributeError: 'str' object has no attribute 'mark'
This parses correctly:
javac *.java
grun JSON json -gui bookmarks-2017-05-24.json
So the grammar itself is not the problem.
So finally the question: how should I process the input file in Python so that the lexer and parser can digest it?
Thanks in advance.
Make sure your input file is actually encoded as UTF-8. Many problems with character recognition by the lexer are caused by using other encodings. I just took a testbed application, added ë to the list of available characters for an IDENTIFIER, and it works again. UTF-8 is the key -- and make sure your grammar also allows these characters where you want to accept them.
I solved it by passing the encoding info:
input = FileStream(sys.argv[1], encoding = 'utf8')
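For context, a sketch of the corrected main built from the code shown in the question (JSONLexer and JSONParser are the modules generated by antlr4 -Dlanguage=Python3 JSON.g4):
import sys
from antlr4 import FileStream, CommonTokenStream
from JSONLexer import JSONLexer
from JSONParser import JSONParser

def main(argv):
    # Decode the input file as UTF-8 so umlauts survive tokenization
    input_stream = FileStream(argv[1], encoding='utf8')
    lexer = JSONLexer(input_stream)
    stream = CommonTokenStream(lexer)
    parser = JSONParser(stream)
    tree = parser.json()

if __name__ == '__main__':
    main(sys.argv)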
Without the encoding info, I get the same issue as yours:
Traceback (most recent call last):
File "test.py", line 20, in <module>
main()
File "test.py", line 9, in main
input = FileStream(sys.argv[1])
File ".../lib/python3.5/site-packages/antlr4/FileStream.py", line 20, in __init__
super().__init__(self.readDataFrom(fileName, encoding, errors))
File ".../lib/python3.5/site-packages/antlr4/FileStream.py", line 27, in readDataFrom
return codecs.decode(bytes, encoding, errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 1: ordinal not in range(128)
Where my input data is
[今明]天(台南|高雄)的?天氣如何
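As an aside on the codecs.open attempt in the question: the AttributeError shows up because the lexer needs an ANTLR input stream, not a plain str. A sketch of that variant, wrapping the decoded text in antlr4.InputStream (assuming the same generated JSONLexer/JSONParser):
import sys
import codecs
from antlr4 import InputStream, CommonTokenStream
from JSONLexer import JSONLexer
from JSONParser import JSONParser

# Decode the file manually, then wrap the text so the lexer gets a stream with mark()
with codecs.open(sys.argv[1], 'r', 'utf-8') as fp:
    text = fp.read()

lexer = JSONLexer(InputStream(text))
stream = CommonTokenStream(lexer)
parser = JSONParser(stream)
tree = parser.json()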

Python: UnicodeDecodeError: 'utf8' codec can't decode byte

I'm reading a bunch of RTF files into python strings.
On SOME texts, I get this error:
Traceback (most recent call last):
File "11.08.py", line 47, in <module>
X = vectorizer.fit_transform(texts)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line
716, in fit_transform
X = super(TfidfVectorizer, self).fit_transform(raw_documents)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line
398, in fit_transform
term_count_current = Counter(analyze(doc))
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line
313, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line
224, in decode
doc = doc.decode(self.charset, self.charset_error)
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 462: invalid
start byte
I've tried:
Copying and pasting the text of the files to new files
saving the rtf files as txt files
Opening the txt files in Notepad++, choosing 'convert to utf-8', and also setting the encoding to utf-8
Opening the files with Microsoft Word and saving them as new files
Nothing works. Any ideas?
It's probably not related, but here's the code in case you are wondering:
f = open(dir+location, "r")
doc = Rtf15Reader.read(f)
t = PlaintextWriter.write(doc).getvalue()
texts.append(t)
f.close()
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X = vectorizer.fit_transform(texts)
This will solve your issues:
import codecs
f = codecs.open(dir+location, 'r', encoding='utf-8')
txt = f.read()
From that moment on, txt is in unicode format and you can use it everywhere in your code.
If you want to generate UTF-8 files after your processing do:
f.write(txt.encode('utf-8'))
As I said on the mailing list, it is probably easiest to use the charset_error option and set it to 'ignore'.
If the file is actually utf-16, you can also set the charset to utf-16 in the Vectorizer.
See the docs.
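For example, keeping the setup from the question (in the scikit-learn version shown in the traceback the parameters are called charset/charset_error; newer releases rename them to encoding/decode_error):
from sklearn.feature_extraction.text import TfidfVectorizer

# charset_error='ignore' drops undecodable bytes such as 0x92 instead of raising
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english',
                             charset_error='ignore')
X = vectorizer.fit_transform(texts)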
You can dump the CSV file rows into a JSON file without any encoding error as follows:
json.dump(row, jsonfile, encoding="ISO-8859-1")
Keep this line:
vectorizer = TfidfVectorizer(encoding='latin-1', sublinear_tf=True, max_df=0.5, stop_words='english')
encoding='latin-1' worked for me.
