How can I complete the text classification task using less memory? (Python)

(1) My goal:
I am trying to use an SVM to classify 10,000 documents (each about 400 words long) into 10 evenly distributed classes. The features explored in my work include word n-grams (n = 1~4) and character n-grams (n = 1~6).
(2) My approach: I represent each document as a vector of frequency values, one for each feature present in the document, and use TF-IDF to normalize the vectors. Parts of my code are below:
def commonVec(dicts, count1, count2):
    '''put features with frequency between count1 and count2 into a common vector used for SVM training'''
    global_vector = []
    master = {}
    for i, d in enumerate(dicts):
        for k in d:
            master.setdefault(k, []).append(i)
    for key in master:
        if len(master[key]) >= count1 and len(master[key]) <= count2:
            global_vector.append(key)
    global_vector1 = sorted(global_vector)
    return global_vector1
def featureComb(mix, count1, count2, res1):
    '''combine word n-gram and character n-gram features into a vector'''
    if mix[0]:
        common_vector1 = []
        for i in mix[0]:
            dicts1 = []
            for res in res1:  # I stored all documents in a database. res1 is the document result set and res is each document.
                dicts1.append(ngram.characterNgrams(res[1], i))  # characterNgrams() returns a dictionary with the feature name as the key and its frequency as the value.
            common_vector1.extend(commonVec(dicts1, count1, count2))
    else:
        common_vector1 = []
    if mix[1]:
        common_vector2 = []
        for j in mix[1]:
            dicts2 = []
            for res in res1:
                dicts2.append(ngram.wordNgrams(res[1], j))
            common_vector2.extend(commonVec(dicts2, count1, count2))
    else:
        common_vector2 = []
    return common_vector1 + common_vector2
def svmCombineVector(mix, global_combine, label, X, y, res1):
    '''Construct the X matrix that can be used to train the SVM'''
    lstm = []
    for res in res1:
        y.append(label[res[0]])  # insert class label into y
        dici1 = {}
        dici2 = {}
        freq_term_vector = []
        for i in mix[0]:
            dici1.update(ngram.characterNgrams(res[1], i))
        freq_term_vector.extend(dici1[gram] if gram in dici1 else 0 for gram in global_combine)
        for j in mix[1]:
            dici2.update(ngram.wordNgrams(res[1], j))
        freq_term_vector.extend(dici2[gram] if gram in dici2 else 0 for gram in global_combine)
        lstm.append(freq_term_vector)
    freq_term_matrix = np.matrix(lstm)
    transformer = TfidfTransformer(norm="l2")
    tfidf = transformer.fit_transform(freq_term_matrix)
    X.extend(tfidf.toarray())
X = []
y = []
character = [1,2,3,4,5,6]
word = [1,2,3,4]
mix = [character,word]
global_vector_combine = featureComb(mix, 2, 5000, res1)
print len(global_vector_combine) # 542401
svmCombineVector(mix,global_vector_combine,label,X,y,res1)
clf1 = svm.LinearSVC()
clf1.fit(X, y)
(3) My problem: when I run this code, a MemoryError occurs:
Traceback (most recent call last):
File "svm.py", line 110, in <module>
functions.svmCombineVector(mix,global_vector_combine,label,X,y,res1)
File "/home/work/functions.py", line 201, in svmCombineVector
X.extend(tfidf.toarray())
File "/home/anaconda/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 901, in toarray
return self.tocoo(copy=False).toarray(order=order, out=out)
File "/home/anaconda/lib/python2.7/site-packages/scipy/sparse/coo.py", line 269, in toarray
B = self._process_toarray_args(order, out)
File "/home/anaconda/lib/python2.7/site-packages/scipy/sparse/base.py", line 789, in _process_toarray
_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
I am really having a hard time with this and need help.
Could anyone explain in some detail how to solve it? Could anyone check my source code and show me other ways to use memory more effectively?

The main problem you're facing is that you're using far too many features. It's actually quite extraordinary that you've managed to generate 542401 features from documents that contain just 400 words! I've seen SVM classifiers separate spam from non-spam with high accuracy using just 150 features -- word counts of selected words that say a lot about whether the document is spam. These use stemming and other normalization tricks to make the features more effective.
You need to spend some time thinning out your features. Think about which features are most likely to contain information useful for this task. Experiment with different features. As long as you keep throwing everything but the kitchen sink in, you'll get memory errors. Right now you're trying to pass 10000 data points with 542401 dimensions each to your SVM. That's 542401 * 10000 * 4 = 21 gigabytes (conservatively) of data. My computer only has 4 gigabytes of RAM. You've got to pare this way down.[1]
A first step towards doing so would be to think about how big your total vocabulary size is. Each document has only 400 words, but let's say those 400 words are taken from a vocabulary of 5000 words. That means there will be 5000 ** 4 = 6.25 * 10 ** 14 possible 4-grams. That's half a quadrillion possible 4-grams. Of course not all those 4-grams will appear in your documents, but this goes a long way towards explaining why you're running out of memory. Do you really need these 4-grams? Could you get away with 2-grams only? There are a measly 5000 ** 2 = 25 million possible 2-grams. That will fit much more easily in memory, even if all possible 2-grams appear (unlikely).
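If it helps, a quick way to check how many distinct word n-grams your corpus actually produces before committing to them as features is sketched below; the helper name is mine, and it assumes each document is a plain whitespace-separated string (e.g. res[1] from your result set):

def count_distinct_word_ngrams(docs, n):
    '''Count the distinct word n-grams that actually occur across all documents.'''
    seen = set()
    for doc in docs:
        tokens = doc.split()
        seen.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(seen)

# e.g. with the documents stored in res1 (as in the question):
# count_distinct_word_ngrams([res[1] for res in res1], 2)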
Also keep in mind that even if the SVM could handle quadrillions of datapoints, it would probably give bad results, because when you give any learning algorithm too many features, it will tend to overfit, picking up on irrelevant patterns and overgeneralizing from them. There are ways of dealing with this, but it's best not to deal with it at all if you can help it.
I will also mention that these are not "newbie" problems. These are problems that machine learning specialists with PhDs have to deal with. They come up with lots of clever solutions, but we're not so clever that way, so we have to be clever a different way.
Although I can't offer you specific suggestions for cleverness without knowing more, I would say that, first, stemming is a good idea in at least some cases. Stemming simply removes grammatical inflection, so that different forms of the same word ("swim" and "swimming") are treated as identical. This will probably reduce your vocabulary size significantly, at least if you're dealing with English text. A common choice is the porter stemmer, which is included in nltk, as well as in a number of other packages. Also, if you aren't already, you should probably strip punctuation and reduce all words to lower-case. From there, it really depends. Stylometry (identifying authors) sometimes requires only particles ("a", "an", "the"), conjunctions ("and", "but") and other very common words; spam, on the other hand, has its own oddball vocabularies of interest. At this level, it is very difficult to say in advance what will work; you'll almost certainly need to try different approaches to see which is most effective. As always, testing is crucial!
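To make that concrete, here is a rough, untested sketch of how stemming, lower-casing and a much smaller n-gram range could be combined while keeping the TF-IDF matrix sparse all the way into the classifier. It reuses the res1 result set and label mapping from the question, and the cut-off values (min_df, max_df, max_features) are arbitrary placeholders, not recommendations. Since LinearSVC accepts sparse input, the toarray() call that triggered the MemoryError is no longer needed:

import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

stemmer = PorterStemmer()

def preprocess(text):
    # lower-case, strip punctuation, stem each token
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(stemmer.stem(t) for t in tokens)

docs = [preprocess(res[1]) for res in res1]   # res1 / label as in the question
labels = [label[res[0]] for res in res1]

# word 1- and 2-grams only; drop very rare and very common terms and cap the
# vocabulary so the feature count stays manageable (values here are illustrative)
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.9,
                             max_features=50000, norm="l2")
X = vectorizer.fit_transform(docs)  # stays a scipy sparse matrix -- no toarray()

clf = LinearSVC()
clf.fit(X, labels)                  # LinearSVC accepts sparse input directly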
[1] Well, possibly you have a huge amount of RAM at your disposal. For example, I have access to a machine with 48G of RAM at my current workplace. But I doubt it could handle this either, because the SVM will have its own internal representation of the data, which means there will be at least one copy at some point; if a second copy is needed at any point -- kaboom.

Related

cosine similarity doc vectors and word vectors for topical prevalence using doc2vec

I have a corpus of 250k Dutch news articles 2010-2020 to which I've applied word2vec models to uncover relationships between sets of neutral words and dimensions (e.g. good-bad). Since my aim is also to analyze the prevalence of certain topics over time, I was thinking of using doc2vec instead so as to simultaneously learn word and document embeddings. The 'prevalence' of topics in a document could then be calculated as the cosine similarities between doc vectors and word embeddings (or combinations of word vectors). In this way, I can calculate the annual topical prevalence in the corpus and see whether there's any changes over time. An example of such an approach can be found here.
My issue is that the avg. yearly cosine similarities yield really strange results. As an example, the cosine similarities between document vectors and a mixture of keywords related to covid-19/coronavirus show a decrease in topical prevalence since 2016 (which obviously cannot be the case).
My question is whether the approach I'm following is actually valid, or whether there's maybe something I'm missing. Shouldn't 250k documents and a 100k+ vocabulary be sufficient?
Below is the code that I've written:
# Doc2Vec model
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
docs = [TaggedDocument(doc, [i]) for i, doc in enumerate(tokenized_docs)]
d2vmodel = Doc2Vec(docs, min_count=5, vector_size=200, window=10, dm=1)
docvecs = d2vmodel.docvecs
wordvecs = d2vmodel.wv

# normalize vector
from numpy.linalg import norm
def nrm(x):
    return x / norm(x)

# topical prevalence per doc
def topicalprevalence(topic, docvecs, wordvecs):
    proj_lst = []
    for i in range(0, len(docvecs)):
        topic_lst = []
        for j in topic:
            cossim = nrm(docvecs[i]) @ nrm(wordvecs[j])  # dot product of unit vectors = cosine similarity
            topic_lst.append(cossim)
        topic_avg = sum(topic_lst) / len(topic_lst)
        proj_lst.append(topic_avg)
    topicsyrs = {
        'topic': proj_lst,
        'year': df['datetime'].dt.year
    }
    return pd.DataFrame(topicsyrs)

# avg topic prevalence per year
def avgtopicyear(topic, docvecs, wordvecs):
    docs = topicalprevalence(topic, docvecs, wordvecs)
    return pd.DataFrame(docs.groupby("year")["topic"].mean())

# run
covid = ['corona', 'coronapandemie', 'coronacrisis', 'covid', 'pandemie']
covid_scores = topicalprevalence(covid, docvecs, wordvecs)
The word-vec-to-doc-vec relationships in modes that train both are interesting, but a bit hard to characterize as to what they really mean. In a sense, the CBOW-like mode of dm=1 (PV-DM) mixes the doc-vector in as one equal word among the whole window when training to predict the 'target' word. But in the skip-gram-mixed mode dm=0, dbow_words=1, there will be window-count context-word-vec-to-target-word pair cycles for every one doc-vec-to-target-word pair cycle, changing the relative weight.
So if you saw a big improvement in dm=0, dbow_words=1, it might also be because that made the model relatively more word-to-word trained. Varying window is another way to change that balance, or increase epochs, in plain dm=1 mode – which should also result in doc/word compatible training, though perhaps not at the same rate/balance.
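For reference, a minimal sketch of that alternative configuration, reusing the docs list from the question (the other parameter values are simply carried over, not tuned):

from gensim.models.doc2vec import Doc2Vec

# PV-DBOW plus interleaved skip-gram word training: word vectors and doc vectors
# are trained into the same space, with more word-to-word update cycles per doc-vec cycle
d2vmodel_dbow = Doc2Vec(docs, min_count=5, vector_size=200, window=10,
                        dm=0, dbow_words=1, epochs=20)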
Whether a single topicalprevalence() mean vector for a full year would actually be reflective of individual word occurrences for a major topic may or may not be a valid conjecture, depending on possible other changes in the training data. Something like a difference in the relative mix of other major categories in the corpus might swamp even a giant new news topic. (E.g., what if in 2020 some new section or subsidiary with a different focus, like entertainment, launched? It might swamp the effects of other words, especially when compressing down to a single vector of some particular dimensionality.)
Something like a clustering of the year's articles, and identification of the closest 1 or N clusters to the target words, with their similarities, might be more reflective even if the population of articles is changing. Or, a plot of each year's full set of articles as a histogram of similarities to the target words, which might show a 'lump' of individual articles (not losing their distinctiveness to a full-year average) developing, over time, closer to the new phenomenon.
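A rough, untested sketch of that histogram idea, assuming the nrm() helper, the covid keyword list, docvecs, wordvecs and df from the question, and using a simple average of the keyword vectors as the topic direction:

import numpy as np
import matplotlib.pyplot as plt

def topic_vector(keywords, wordvecs):
    # unit-normalized average of the keyword vectors (skipping out-of-vocabulary words)
    vecs = [nrm(wordvecs[w]) for w in keywords if w in wordvecs]
    return nrm(np.mean(vecs, axis=0))

tvec = topic_vector(covid, wordvecs)
sims = np.array([nrm(docvecs[i]) @ tvec for i in range(len(docvecs))])
years = df['datetime'].dt.year.to_numpy()

# one histogram of per-article similarities per year, instead of a single yearly mean
for yr in sorted(set(years)):
    plt.hist(sims[years == yr], bins=50, alpha=0.4, label=str(yr), density=True)
plt.xlabel("cosine similarity to topic keywords")
plt.legend()
plt.show()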
Turns out that setting parameters to dm=0, dbow_words=1 allows for training documents and words in the same space, now yielding valid results.

ValueError: cannot reshape array of size 3800 into shape (1,200)

I am trying to apply word embeddings to tweets. I was trying to create a vector for each tweet by taking the average of the vectors of the words present in the tweet, as follows:
def word_vector(tokens, size):
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in tokens:
        try:
            vec += model_w2v[word].reshape((1, size))
            count += 1.
        except KeyError:  # handling the case where the token is not in the vocabulary
            continue
    if count != 0:
        vec /= count
    return vec
Next, when I try to prepare the word2vec feature set as follows:
wordvec_arrays = np.zeros((len(tokenized_tweet), 200))
# the length of the vector is 200
for i in range(len(tokenized_tweet)):
    wordvec_arrays[i, :] = word_vector(tokenized_tweet[i], 200)
wordvec_df = pd.DataFrame(wordvec_arrays)
wordvec_df.shape
I get the following error inside the loop:
ValueError Traceback (most recent call last)
<ipython-input-32-72aee891e885> in <module>
4 # wordvec_arrays.reshape(1,200)
5 for i in range(len(tokenized_tweet)):
----> 6 wordvec_arrays[i,:] = word_vector(tokenized_tweet[i], 200)
7
8 wordvec_df = pd.DataFrame(wordvec_arrays)
<ipython-input-31-9e6501810162> in word_vector(tokens, size)
4 for word in tokens:
5 try:
----> 6 vec += model_w2v.wv.__getitem__(word).reshape((1, size))
7 count += 1.
8 except KeyError: # handling the case where the token is not in vocabulary
ValueError: cannot reshape array of size 3800 into shape (1,200)
I checked all the related posts on Stack Overflow but none of them really helped me.
I tried reshaping the array and it still gives me the same error.
My model is:
tokenized_tweet = df['tweet'].apply(lambda x: x.split())  # tokenizing
model_w2v = gensim.models.Word2Vec(
    tokenized_tweet,
    size=200,       # desired no. of features/independent variables
    window=5,       # context window size
    min_count=2,
    sg=1,           # 1 for skip-gram model
    hs=0,
    negative=10,    # for negative sampling
    workers=2,      # no. of cores
    seed=34)
model_w2v.train(tokenized_tweet, total_examples=len(df['tweet']), epochs=20)
Any suggestions, please?
It looks like the intent of your word_vector() method is to take a list of words, and then with respect to a given Word2Vec model, return the average of all those words' vectors (when present).
To do that, you shouldn't need to do any explicit re-shaping of vectors – or even specification of size, because that's forced by what the model already provides. You could use utility methods from numpy to simplify the code a lot. For example, the gensim n_similarity() method, as part of its comparison of two lists-of-words, already does an averaging much like what you're trying, and you can look at its source as a model:
https://github.com/RaRe-Technologies/gensim/blob/f97d0e793faa57877a2bbedc15c287835463eaa9/gensim/models/keyedvectors.py#L996
So, while I haven't tested this code, I think your word_vector() method could be essentially replaced with:
import numpy as np

def average_words_vectors(tokens, wv_model):
    vectors = [wv_model[word] for word in tokens
               if word in wv_model]  # avoiding KeyError
    return np.array(vectors).mean(axis=0)
(It's sometimes the case that it makes sense to work with vectors that have been normalized to unit length, as the linked gensim code does by applying gensim.matutils.unitvec() to the average. I haven't done this here, as your method hadn't taken that step – but it is something to consider.)
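For example (a sketch reusing tokenized_tweet and model_w2v from the question; note that a tweet whose tokens are all out-of-vocabulary would yield an empty list and still need special handling), the feature matrix could then be built like this:

import numpy as np
import pandas as pd

wordvec_arrays = np.vstack([average_words_vectors(tokens, model_w2v.wv)
                            for tokens in tokenized_tweet])
wordvec_df = pd.DataFrame(wordvec_arrays)  # shape: (number of tweets, 200)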
Separate observations about your Word2Vec training code:
typically words with just 1, 2, or a few occurrences don't get good vectors (due to limited number & variety of examples), but do interfere with the improvement of other more-common-word vectors. That's why the default is min_count=5. So just be aware: your surviving vectors may get better if you use a default (or even larger) value here, discarding more of the rarer words.
the dimensions of a "dense embedding" like word2vec-vectors aren't really "independent variables" (or standalone individually-interpretable "features") as implied by your code-comment, even though they may seem that way as separate values/slots in the data. For example, you can't pick one dimension out and conclude, "that's the foo-ness of this sample" (like 'coldness' or 'hardness' or 'positiveness' etc). Rather, any of those human-describable meanings tend to be other directions in the combined-space, not perfectly aligned with any of the individual dimensions. You can sort-of tease those out by comparing vectors, and downstream ML algorithms can make use of those complicated/entangled multi-dimensional interactions. But if you think of each dimension as its own "feature" – in any way other than "yes, it's technically a single number associated with the item" – you may be prone to misinterpreting the vector-space.

Memory Error when trying to create numpy matrix

text = codecs.open("lith.txt", encoding= 'utf-8')
text = text.read().lower().replace('"','').replace('?','').replace(',','').replace('!','').replace('.','')
text = text.split()
words = sorted(list(set(text)))
Unigram = np.zeros([len(words)])
ind = range(len(words))
Lexicon = dict(zip(words,ind))
Bigram = np.zeros([len(words),len(words)])
I keep running into major issues with the last line of this portion of the program. The text file is maybe about 7,000,000 words long. Currently, the number of distinct words (the length of words) is about 200,000. When I cut the text file down to the point where the length of words becomes 40,000 or so, the program works. Is there any way to get around this memory limitation? Thanks for any help. The results I get in later parts of the program really seem to suffer if I just keep cutting out portions of the text until the memory error goes away.
for n in range(len(text)-1):
    Unigram[Lexicon[text[n]]] = Unigram[Lexicon[text[n]]] + 1
    Bigram[Lexicon[text[n]]][Lexicon[text[n+1]]] = Bigram[Lexicon[text[n]]][Lexicon[text[n+1]]] + 1
Unigram_sorted = np.argsort(Unigram)
Unigram_sorted = Unigram_sorted[::-1]
Unigram_sorted = Unigram_sorted[0:4999]
I assume that the line that raises the exception is:
Bigram = np.zeros([len(words),len(words)])
If len(words) is 200,000, then the size of the matrix is 200,000^2 entries. Assuming 8 bytes each (int64, or the float64 that np.zeros creates by default), this requires 320 GB of memory.
Assuming most entries will remain zeros, sparse matrices could be helpful. For example, scipy's sparse matrices. In the case of counting joint pairs, this snippet could be of help:
from scipy.sparse import dok_matrix
Bigrams = dok_matrix((len(words), len(words)))
# Bigrams[i, j] += 1
Regarding the code itself, the first part may have a relatively similar implementation in scikit-learn's text vectorizers.
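As an untested sketch, reusing words, Lexicon and text from the question (note that a dok_matrix is indexed as Bigram[i, j], not Bigram[i][j]), the counting could then look like this:

import numpy as np
from scipy.sparse import dok_matrix

Unigram = np.zeros(len(words))                 # dense is fine: only len(words) entries
Bigram = dok_matrix((len(words), len(words)))  # sparse: stores only the pairs actually seen

for n in range(len(text) - 1):
    Unigram[Lexicon[text[n]]] += 1
    Bigram[Lexicon[text[n]], Lexicon[text[n + 1]]] += 1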

Splitting a long string without breaking words, filling each line as much as possible

Before you think it's a duplicate (there are many questions asking how to split long strings without breaking words), keep in mind that my problem is a bit different: order is not important, and I have to fit the words so that each line is filled as much as possible.
I have an unordered set of words and I want to combine them without using more than 253 characters.
def compose(words):
    result = " ".join(words)
    if len(result) > 253:
        pass  # this should not happen!
    return result
My problem is that I want to try to fill the line as much as possible. For example:
words = "a bc def ghil mno pq r st uv"
limit = 5 # max 5 characters
# This is good because it's the shortest possible list,
# but I don't know how I could get it
# Note: order is not important
good = ["a def", "bc pq", "ghil", "mno r", "st uv"]
# This is bad because len(bad) > len(good)
# even if the limit of 5 characters is respected
# This is equivalent to:
# bad = ["a bc", "def", "ghil", "mno", "pq r", "st uv"]
import textwrap
bad = textwrap.wrap(words, limit)
How could I do this?
This is the bin packing problem; it is NP-hard, although there exist non-optimal heuristic algorithms, principally first fit decreasing and best fit decreasing. See https://github.com/type/Bin-Packing for implementations.
Non-optimal offline fast 1D bin packing Python algorithm
def binPackingFast(words, limit, sep=" "):
    if max(map(len, words)) > limit:
        raise ValueError("limit is too small")
    words.sort(key=len, reverse=True)
    res, part, others = [], words[0], words[1:]
    for word in others:
        if len(sep) + len(word) > limit - len(part):
            res.append(part)
            part = word
        else:
            part += sep + word
    if part:
        res.append(part)
    return res
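For example, with the toy input from the question (just a usage sketch: the function expects a list and sorts it in place, and because the greedy strategy keeps only one open line it may need a line or two more than the truly optimal packing):

words = "a bc def ghil mno pq r st uv".split()
lines = binPackingFast(words, 5)
print(lines)
print(all(len(line) <= 5 for line in lines))  # every line respects the limit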
Performance
Tested over /usr/share/dict/words (provided by words-3.0-20.fc18.noarch), it can do half a million words in a second on my slow dual-core laptop, with an efficiency of at least 90% with these parameters:
limit = max(map(len, words))
sep = ""
With limit *= 1.5 I get 92%, with limit *= 2 I get 96% (same execution time).
Optimal (theoretical) value is calculated with: math.ceil(len(sep.join(words))/limit)
no efficient bin-packing algorithm can be guaranteed to do better
Source: http://mathworld.wolfram.com/Bin-PackingProblem.html
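The efficiency figure above is, presumably, the ratio of this theoretical minimum to the number of lines the algorithm actually produces; a small sketch of that measurement (my own formulation, not taken from the original benchmark):

import math

def efficiency(words, limit, sep=" "):
    optimal = math.ceil(len(sep.join(words)) / float(limit))  # theoretical minimum number of lines
    actual = len(binPackingFast(list(words), limit, sep))     # pass a copy: binPackingFast sorts in place
    return optimal / float(actual)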
Moral of the story
While it's interesting to find the best solution, I think that in most cases it would be much better to use this algorithm for 1D offline bin packing problems.
Resources
http://mathworld.wolfram.com/Bin-PackingProblem.html
https://github.com/hudora/pyShipping/
Notes
I didn't use textwrap for my implementation because it's slower than my simple Python code.
Maybe it's related to: Why are textwrap.wrap() and textwrap.fill() so slow?
It seems to work perfectly even if the sorting is not reversed.

Implementation of Naive Bayes - accuracy issues

EDIT: The correct version of the code, which works, can be found at:
https://github.com/a7x/NaiveBayes-Classifier
I used data from openClassroom and started working on a small version of Naive Bayes in Python. The steps were the usual training and then prediction. I have a few questions and want to know why the accuracy is quite bad.
For training, I calculated the log likelihood with the formula:
log( (count(word in spam) + 1) / (spamSize + vocabSize) )
My question is: why do we add vocabSize in this case, and is this the correct way to go about it? The code used is below:
#This is for training. Calculate all probabilities and store them in a vector.
#Better to store it in a file for easier access.
from __future__ import division
import sys, os
'''
1. The spam and non-spam is already 50%. So by default they are 0.5
2. Now we need to calculate probability of each word, in spam and non-spam separately
   2.1 we can make two dictionaries, defaultdicts basically, for spam and non-spam
   2.2 When time comes to calculate probabilities, we just need to substitute values
'''
from collections import *
from math import *

spamDict = defaultdict(int)
nonspamDict = defaultdict(int)
spamFolders = ["spam-train"]
nonspamFolders = ["nonspam-train"]
path = sys.argv[1]                      # base path
spamVector = open(sys.argv[2], 'w')     # write all spam values into this
nonspamVector = open(sys.argv[3], 'w')  # non-spam values

# Go through all files in spam and iteratively add values
spamSize = 0
nonspamSize = 0
vocabSize = 264821
for f in os.listdir(os.path.join(path, spamFolders[0])):
    data = open(os.path.join(path, spamFolders[0], f), 'r')
    for line in data:
        words = line.split(" ")
        spamSize = spamSize + len(words)
        for w in words:
            spamDict[w] += 1
for f in os.listdir(os.path.join(path, nonspamFolders[0])):
    data = open(os.path.join(path, nonspamFolders[0], f), 'r')
    for line in data:
        words = line.split(" ")
        nonspamSize = nonspamSize + len(words)
        for w in words:
            nonspamDict[w] += 1

logProbspam = {}
logProbnonSpam = {}  # this is to store the log probabilities
for k in spamDict.keys():
    # Need to calculate P(x | y = 1)
    numerator = spamDict[k] + 1  # frequency
    print 'Word', k, ' frequency', spamDict[k]
    denominator = spamSize + vocabSize
    p = log(numerator / denominator)
    logProbspam[k] = p
for k in nonspamDict.keys():
    numerator = nonspamDict[k] + 1  # frequency
    denominator = nonspamSize + vocabSize
    p = log(numerator / denominator)
    logProbnonSpam[k] = p
for k in logProbnonSpam.keys():
    nonspamVector.write(k + " " + str(logProbnonSpam[k]) + "\n")
for k in logProbspam.keys():
    spamVector.write(k + " " + str(logProbspam[k]) + "\n")
For prediction, I just took a mail, split it into words, added up all the probabilities separately for spam and non-spam, and multiplied them by 0.5. Whichever was higher was the class label. The code is below:
http://pastebin.com/8Y6Gm2my (Stack Overflow was again playing games for some reason :-/)
EDIT: I have removed the spam = spam + 1 thing. Instead, I just ignore those words.
Problem: my accuracy is quite bad, as noted below.
No of files in spam is 130
No. of spam in ../NaiveBayes/spam-test is 53 no. of non-spam 77
No of files in non-spam is 130
No. of spam in ../NaiveBayes/nonspam-test/ is 6 no. of non-spam 124
Please tell me where all I am going wrong. I think an accuracy of less than 50% means there must be some glaring error(s) in the implementation.
There are multiple errors and bad assumptions in your program - in both parts of it. Here are several.
You hardcode into your program the fact that you have the same number of spam and non-spam emails. I'd recommend not hardcoding this assumption. This is not absolutely essential, but in a more general case you will need to remove it.
You hardcode into your program some number that you treat as the vocabulary size. I'd recommend not doing this, as this number may change with any modification of the training set. Furthermore, it's actually incorrect. I'd recommend calculating it during learning.
This may not be a mistake, but you seem to have a vocabulary of all the words in the training set. This may be suboptimal; actually, the page you refer to recommends taking into account only the top 2500 words over all the emails. However, that's not essential for obtaining correct results - even without this filtering, my implementation gets only several emails misclassified.
You incorrectly account for words that have been observed only in spam or only in non-spam. The log-probability of such a word being found in the other subset is not the 1 you add, but log(1/(spamSize+vocabSize)) or log(1/(nonspamSize+vocabSize)), depending on the group. This is actually very important - you need to store this probability with your data for the program to function correctly.
You do not ignore words never observed in the training set. Actually these may be treated in different ways, but you should take them into account.
Due to incorrect indentation in the prediction function, you predict using not the whole message, but only the first line of the message. Just a programming bug.
Update. You have fixed 6. Also, 1 is not strictly necessary to fix while you're working with this dataset, and 3 is not required either.
Your modification did not correctly fix either 4 or 5. First, if a word has never been observed in some set, the probability of the message belonging to that set should decrease. Ignoring the word is not a good idea; you need to account for it as a highly improbable one.
Second, your current code is asymmetric: a word being absent in spam cancels the check for non-spam (but not the other way around). If you need to do nothing in an exception handler, use pass, not continue, because the latter immediately jumps to the next for w in words: iteration.
Problem number 2 is also still in place: the vocabulary size you use does not match the real one. It must be the number of distinct words observed in the training set, not the total number of words in all the messages together.
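For instance, a one-line sketch of computing it during training, using the spamDict and nonspamDict counters already built in the question's code:

# number of distinct words observed anywhere in the training set
vocabSize = len(set(spamDict) | set(nonspamDict))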
Here's at least one of the errors you're making: you're storing log-probabilities in your model file (as you should), but then in the prediction code you're pretending that they are straight probabilities:
totalSpamP = spamP * 0.5
should be
totalSpamP = spamP + math.log(0.5)
Also, I don't get what this line is doing:
spamP = spamP + 1
It seems to be making up for a feature not found in the spam part of the training set, but those words should simply be ignored. Right now the code adds 1 to a log-probability, which amounts to multiplying the underlying probability by e (exp(1)) - by definition not a valid probability.
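As an illustration of the log-probability fix, the prediction side might look roughly like the untested sketch below. It assumes the vector files written during training (one "word log-probability" pair per line) and that spamSize, nonspamSize and vocabSize were saved from the training run; for unseen words it follows the earlier answer's suggestion of using the smoothed zero-count value rather than skipping them:

import math

def load_logprobs(path):
    # each line of the vector file is "<word> <log-probability>"
    logprobs = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            word, lp = line.rsplit(" ", 1)
            logprobs[word] = float(lp)
    return logprobs

def score(words, logprobs, unseen_logprob, prior=0.5):
    # sum of log-probabilities plus the log-prior; unseen words get the smoothed
    # "zero count" value instead of being skipped or having 1 added
    return math.log(prior) + sum(logprobs.get(w, unseen_logprob) for w in words)

spam_logprobs = load_logprobs("spamVector.txt")        # hypothetical file names
nonspam_logprobs = load_logprobs("nonspamVector.txt")  # (sys.argv[2] / sys.argv[3] in the training script)
spam_unseen = math.log(1.0 / (spamSize + vocabSize))
nonspam_unseen = math.log(1.0 / (nonspamSize + vocabSize))

# classify one message already split into words:
# is_spam = score(words, spam_logprobs, spam_unseen) > score(words, nonspam_logprobs, nonspam_unseen)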
(As an aside, I just tried classification on this training set with my own implementation of Naive Bayes and got 97.6% accuracy, so that's the figure you should be aiming at :)
