I'm using the NLTK book, Natural Language Processing with Python (2009), and looking at the Naive Bayes classifier. In particular, Example 6-3 on pg. 228 in my version.
The training set is movie reviews.
classifier = nltk.NaiveBayesClassifier.train(train_set)
I peek at the most informative features -
classifier.show_most_informative_features(5)
and I get 'outstanding', 'mulan' and 'wonderfully' among the top ranking ones for the sentence to be tagged 'positive'.
So, I try the following -
in1 = 'wonderfully mulan'
classifier.classify(document_features(in1.split()))
And I get 'neg'. Now this makes no sense. These were supposed to be the top features.
The document_features function is taken directly from the book -
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
Note that the feature vector in that example is comprised of the "2000 most frequent words in the overall corpus." So assuming that the corpus is comprehensive, a regular review will probably have quite a few of those words. (In real-world reviews of the latest Jackass movie and Dallas Buyers Club, I get 26/2000 and 28/2000 features respectively.)
If you feed it a review containing only "wonderfully mulan", the resulting feature vector only has 2/2000 features set to True. Basically, you're giving it a pseudoreview with little to no information that it knows about or that it can do anything with. For that vector, it's hard to tell what it will predict.
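For instance, here's a quick check of how much of the feature vector that pseudo-review actually activates (this assumes the book's document_features and the 2000-word word_features list are in scope):

feats = document_features('wonderfully mulan'.split())
print(sum(1 for value in feats.values() if value))   # only 2 of the 2000 features come out True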
The feature vector should be healthily populated with features leaning in a positive direction for it to output pos. Maybe look at the most informative, say, 500 features, see which ones lean positive, and then create a string with only those? That might get you closer to pos, but not necessarily.
Some feature vectors in the train_set classify as pos. (Anecdotally, I found one of them to have 417 features equal to True). However, in my tests, no documents from the neg or pos training set partitions classified to pos, so while you may be right that the classifier doesn't seem to be doing a great job - at least the pos training examples should classify to pos - the example you're giving it is not a great measure of that.
There are at least two different flavors of the naive Bayes classifier. In a quick search, it appears that NLTK implements the Bernoulli flavor (see: Different results between the Bernoulli Naive Bayes in NLTK and in scikit-learn). In any case, some flavors of naive Bayes pay as much attention to words/features missing from a document as to the visible words. So, if you try to classify a document containing a few positive words but missing the many other words whose absence the model associates with negative documents, it is very reasonable for the document to be categorized as negative. So, the bottom line is: pay attention not only to the visible features but also to the missing features (depending on the details of the naive Bayes implementation).
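If you want to see how the absent features pull the decision, inspect the label probabilities rather than just the hard label; prob_classify() is part of NLTK's standard classifier interface (this again assumes the classifier and document_features from above):

dist = classifier.prob_classify(document_features('wonderfully mulan'.split()))
for label in dist.samples():
    print(label, round(dist.prob(label), 4))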
I am doing a project on multi-class text classification and could do with some advice.
I have a dataset of reviews which are classified into 7 product categories.
Firstly, I create a term-document matrix using TF-IDF (TfidfVectorizer from sklearn). This generates a matrix of n x m, where n is the number of reviews in my dataset and m is the number of features.
Then, after splitting the term-document matrix 80:20 into train and test sets, I pass it through the K-Nearest Neighbours (KNN) algorithm and achieve an accuracy of 53%.
In another experiment, I used the Google News Word2Vec pretrained embeddings (300-dimensional) and averaged all the word vectors for each review. So, each review consists of x words, each word has a 300-dimensional vector, and these vectors are averaged to produce one 300-dimensional vector per review.
Then I pass this matrix through KNN. I get an accuracy of 72%.
As for other classifiers that I tested on the same dataset, all of them performed better on the TF-IDF method of vectorization. However, KNN performed better on word2vec.
Can anyone help me understand why there is a jump in accuracy for KNN when using the word2vec method compared to the TF-IDF method?
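For reference, a minimal sketch of the TF-IDF + KNN setup described above (placeholder texts and labels stand in for the real dataset):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# placeholder reviews and product-category labels standing in for the real data
texts = ["great phone, love the battery", "laptop keyboard is terrible",
         "these headphones sound lovely", "phone screen cracked quickly"]
labels = ["phones", "laptops", "audio", "phones"]

X = TfidfVectorizer().fit_transform(texts)    # n reviews x m features
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)   # n_neighbors=1 only because of the tiny placeholder set
print(knn.score(X_test, y_test))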
By using the external word-vectors, you've introduced extra info about the words to the word2vec-derived features – info that simply may not be deducible at all from the plain word-occurrence (TF-IDF) model.
For example, imagine that just a single review in your train set, and another single review in your test set, use some less-common word for car like jalopy – but otherwise contain zero other car-associated words.
A TF-IDF model will have a weight for that unique term in a particular slot - but may have no other hints in the training dataset that jalopy is related to cars at all. In TF-IDF space, that weight will just make those 2 reviews more distant from all other reviews (which have a 0.0 in that dimension). It doesn't help or hurt much. A review 'nice jalopy' will be no closer to 'nice car' than it is to 'nice movie'.
On the other hand, if the GoogleNews vector set has a vector for that word, and that vector is fairly close to car, auto, wheels, etc., then reviews with all those words will be shifted a little in the same direction in the word2vec space, giving an extra hint to some classifiers – especially, perhaps, the KNN one. Now, 'nice jalopy' is quite a bit closer to 'nice car' than to 'nice movie' or most other 'nice [X]' reviews.
Word-vectors from an outside source may not have great coverage of your dataset's domain words. (Words in GoogleNews, from a circa-2013 training run on news articles, might miss both words and word senses that appear in your different & more-recent reviews.) And summarizing a text by averaging all its words is a very crude method: it can learn nothing from word ordering/grammar (which can often reverse the intended sense), and the contributions of individual words may cancel out or dilute each other in longer texts.
But still, it's bringing in more language info that otherwise wouldn't be in the data at all, so in some cases it may help.
If your dataset is sufficiently large, training your own word-vectors may help a bit, too. (Though, the gain you've seen so far suggests some useful patterns of word-similarities may not be well-taught from your limited dataset.)
Of course, also note that you can use blended techniques. Perhaps each text can be even better represented by a concatenation of the N TF-IDF dimensions and the M word2vec-average dimensions. (If your texts have many significant 2-word phrases that mean things different from the individual words, adding word-2-gram features may help. If your texts have many typos or rare word variants that still share word roots with other words, then adding in character n-grams – word fragments – may help.)
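A rough sketch of that blended idea, assuming gensim, scikit-learn, numpy and a local copy of the GoogleNews vectors (the file path and texts below are placeholders):

import numpy as np
from gensim.models import KeyedVectors
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["nice jalopy", "nice car", "nice movie"]   # placeholder reviews
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

tfidf_block = TfidfVectorizer().fit_transform(texts).toarray()   # N x m TF-IDF dimensions

def average_vector(text):
    # average the vectors of in-vocabulary words; zeros if none are known
    vectors = [kv[w] for w in text.lower().split() if w in kv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(kv.vector_size)

w2v_block = np.vstack([average_vector(t) for t in texts])   # N x 300 word2vec-average dimensions
X = np.hstack([tfidf_block, w2v_block])                     # concatenated representation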
I don't have a large corpus of data to train word similarities, e.g. that 'hot' is more similar to 'warm' than to 'cold'. However, I would like to train doc2vec on a relatively small corpus (~100 docs) so that it can classify my domain-specific documents.
To elaborate, let me use this toy example. Assume I have only 4 training docs, given by 4 sentences: "I love hot chocolate.", "I hate hot chocolate.", "I love hot tea.", and "I love hot cake.".
Given a test document "I adore hot chocolate", I would expect doc2vec to invariably return "I love hot chocolate." as the closest document. This expectation would hold if word2vec already supplied the knowledge that "adore" is very similar to "love". However, I'm getting "I hate hot chocolate" as the most similar document -- which is bizarre!
Any suggestions on how to circumvent this, i.e. how to use pre-trained word embeddings so that I don't need to train from scratch that "adore" is close to "love", "hate" is close to "detest", and so on?
Code (Jupyter Notebook, Python 3.7, gensim 3.8.1)
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
data = ["I love hot chocolate.",
"I hate hot chocolate",
"I love hot tea.",
"I love hot cake."]
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(data)]
print(tagged_data)
#Train and save
max_epochs = 10
vec_size = 5
alpha = 0.025
model = Doc2Vec(vector_size=vec_size,  # it was size earlier
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm=1)
model.build_vocab(tagged_data)
for epoch in range(max_epochs):
    if epoch % 10 == 0:
        print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.epochs)  # it was model.iter earlier
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha
print("Model Ready")
test_sentence="I adore hot chocolate"
test_data = word_tokenize(test_sentence.lower())
v1 = model.infer_vector(test_data)
#print("V1_infer", v1)
# to find most similar doc using tags
sims = model.docvecs.most_similar([v1])
print("\nTest: %s\n" %(test_sentence))
for indx, score in sims:
    print("\t(score: %.4f) %s" % (score, data[int(indx)]))
Just ~100 documents is way too small to meaningfully train a Doc2Vec (or Word2Vec) model. Published Doc2Vec work tends to use tens-of-thousands to millions of documents.
To the extent you may be able to get slightly meaningful results from smaller datasets, you'll usually need to reduce the vector sizes a lot – to far smaller than the number of words/examples – and increase the training epochs. (Your toy data has 4 texts & 6 unique words. Even to get 5-dimensional vectors, you probably want something like 5^2 contrasting documents.)
Also, gensim's Doc2Vec doesn't offer any official option to import word-vectors from elsewhere. The internal Doc2Vec training is not a process where word-vectors are trained 1st, then doc-vectors calculated. Rather, doc-vectors & word-vectors are trained in a simultaneous process, gradually improving together. (Some modes, like the fast & often highly effective DBOW that can be enabled with dm=0, don't create or use word-vectors at all.)
There's not really anything bizarre about your 4-sentence results, when looking at the data as if we were the Doc2Vec or Word2Vec algorithms, which have no prior knowledge about words, only what's in the training data. In your training data, the token 'love' and the token 'hate' are used in nearly exactly the same way, with the same surrounding words. Only by seeing many subtly varied alternative uses of words, alongside many contrasting surrounding words, can these "dense embedding" models move the word-vectors to useful relative positions, where they are closer to related words & farther from other words. (And, since you've provided no training data with the token 'adore', the model knows nothing about that word – and if it appears inside a test document passed to the model's infer_vector() method, it will be ignored. So the test document it 'sees' is only the known words ['i', 'hot', 'chocolate'].)
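As a quick check of that last point, reusing the model and imports from the snippet in the question (both gensim 3.x and 4.x let you test vocabulary membership with `in`):

tokens = word_tokenize("I adore hot chocolate".lower())
print([w for w in tokens if w in model.wv])   # should print ['i', 'hot', 'chocolate']; 'adore' is unknown and dropped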
But also, even if you did manage to train on a larger dataset, or somehow inject the knowledge from other word-vectors that 'love' and 'adore' are somewhat similar, it's important to note that antonyms are typically quite similar in sets of word-vectors, too – as they are used in the same contexts, are often syntactically interchangeable, and belong to the same general category. These models often aren't very good at detecting the flip in human-perceived meaning from the swapping of a word for its antonym (or the insertion of a single 'not' or other intent-reversing words).
Ultimately, if you want to use gensim's Doc2Vec, you should train it with far more data. (If you were willing to grab some other pre-trained word-vectors, why not grab some other source of somewhat-similar bulk sentences? The effect of using data that isn't exactly like your actual problem will be similar whether you leverage that outside data via bulk text or a pre-trained model.)
Finally: it's a bad, error-prone pattern to be calling train() more than once in your own loop, with your own alpha adjustments. You can just call it once, with the right number of epochs, and the model will perform the multiple training passes & manage the internal alpha smoothly over the right number of epochs.
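A minimal corrected sketch of that last point – same toy data, but a single train() call (the epoch count here is just an illustrative guess for such a tiny corpus):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

data = ["I love hot chocolate.", "I hate hot chocolate",
        "I love hot tea.", "I love hot cake."]
tagged_data = [TaggedDocument(words=word_tokenize(d.lower()), tags=[str(i)])
               for i, d in enumerate(data)]

model = Doc2Vec(vector_size=5, min_count=1, dm=1, epochs=100)
model.build_vocab(tagged_data)
# one train() call; gensim handles the passes and the alpha decay internally
model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)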
I am using NLTK to classify documents, each having 1 label, with 10 types of documents in total.
For text extraction, I am cleaning the text (punctuation removal, HTML tag removal, lowercasing) and removing nltk.corpus.stopwords as well as my own collection of stopwords.
For my document features I am looking across all 50k documents and gathering the top 2k words by frequency (frequency_words), then for each document identifying which of its words are also in the global frequency_words.
I am then passing each document as a hashmap of {word: boolean} into nltk.NaiveBayesClassifier(...). I have a 20:80 test:train ratio with regard to the total number of documents.
The issues I am having:
Is this NLTK classifier suitable for multi-labelled data? All the examples I have seen are about 2-class classification, such as whether something is declared positive or negative.
The documents should each contain a set of key skills; unfortunately, I haven't got a corpus listing these skills. So I have taken this approach on the understanding that a word count per document would not be a good feature extractor - is this correct? Each document has been written by an individual, so I need to allow for individual variation in the documents. I am aware of sklearn's multinomial naive Bayes, which deals with word counts.
Is there an alternative library I should be using, or variation of this algorithm?
Thanks!
Terminology: documents are to be classified into 10 different classes, which makes it a multi-class classification problem. If, along with that, you want to assign multiple labels to a document, you can call it multi-class multi-label classification.
For the issues which you are facing,
nltk.NaiveBayesClassifier() is an out-of-the-box multi-class classifier, so yes, you can use it to solve this problem. As for multi-labelled data, if your labels are a,b,c,d,e,f,g,h,i,j then you would have to encode label 'b' of a particular document as '0,1,0,0,0,0,0,0,0,0'.
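For the plain multi-class case (one label string per document), a minimal sketch of what the training pairs can look like (the word lists and labels here are placeholders):

import nltk

top_words = ["python", "sales", "design"]                   # placeholder for your 2k frequency_words
docs = [(["python", "unit", "tests"], "engineering"),
        (["sales", "targets", "pipeline"], "sales")]        # placeholder (tokens, label) pairs

def doc_features(words):
    word_set = set(words)
    return {"contains(%s)" % w: (w in word_set) for w in top_words}

train_set = [(doc_features(words), label) for words, label in docs]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(classifier.classify(doc_features(["python", "tests"])))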
Feature extraction is the hardest part of classification (machine learning). I recommend you look into different algorithms and select the one that best suits your data (without looking at your data, it is tough to recommend which algorithm/implementation to use).
There are many different libraries out there for classification. I have personally used scikit-learn and can say it offers good out-of-the-box classifiers.
Note: using scikit-learn, I was able to achieve results within a week, even though the data set was huge and there were other setbacks.
I have a dataset with annotations in the form: <Word/Phrase, Ontology Class>, where Ontology Class can be one of the following {Physical Object, Action, Quantity}. I have created this dataset manually for my particular ontology model from a large corpus of text.
Because this process was manual, I am sure I may have missed some words/phrases from my corpus. If that is the case, I am looking for ways to automatically extract other words from the same corpus that have the same "characteristics" as the words in the labeled dataset. Therefore, the first task is to define "characteristics" before I even begin the task of extracting other words.
Are there any standard techniques that I can use to achieve this?
EDIT: Sorry. I should have mentioned that these are domain-specific words not found in WordNet.
Take a look at chapter 6 of the NLTK book. From what you have described, it sounds like a supervised classification technique based on feature ("characteristic") extraction might be a good choice. From the book:
A classifier is called supervised if it is built based on training
corpora containing the correct label for each input.
You can use some of the data that you have manually encoded to train your classifier. It might look like this:
def word_features(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features
Next you can train your classifier on some of the data you have already tagged:
>>> words = [('Rock', 'Physical Object'), ('Play', 'Action'), ... ]
>>> featuresets = [(word_features(n), g) for (n,g) in words]
>>> train_set, test_set = featuresets[500:], featuresets[:500]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
You should probably train on half of the data you already tagged. That way you can test the accuracy of the classifier with the other half. Keep working on the features until the accuracy of the classifier is as you desire.
nltk.classify.accuracy(classifier, test_set)
You can check individual classifications as follows:
classifier.classify(word_features('Gold'))
If you are not familiar with NLTK, then you can read the previous chapters as well.
As jfocht has said, you need a classifier to do this. To train a classifier, you need a set of training data of 'things' with features and their classification. You can then feed in a new 'thing' with features and get out the classification.
The kicker here is that you don't have features, you just have the words. One idea is to use WordNet, which is a fancy dictionary, to generate features from the definitions of the words. One of WordNet's best features is that it has a hierarchy for a word, e.g.,
cat -> animal -> living thing -> thing ....
You might be able to do this simply by following the hierarchy, but if you can't, you could add features from it and train it. This will likely work much better than using the words themselves as features.
Regardless of whether you find WordNet useful, you need a feature set to train your classifier, and you also have to describe all your unclassified data with those features; so unless you have some way to do the feature part computationally, it's going to be less work to do it by hand.
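A hedged sketch of that WordNet idea, using NLTK's WordNet interface to pull hypernym names in as extra features (it only helps for the subset of your words that WordNet actually knows):

from nltk.corpus import wordnet as wn

def hypernym_features(word):
    features = {}
    synsets = wn.synsets(word)
    if synsets:
        # walk up the hypernym chain of the first sense, e.g. cat -> feline -> ... -> entity
        for path in synsets[0].hypernym_paths():
            for synset in path:
                features["hyper(%s)" % synset.name()] = True
    return features

print(sorted(hypernym_features("cat")))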
I'm doing a project on document classification using a naive Bayes classifier in Python. I have used the nltk Python module for this. The docs are from the Reuters dataset. I performed preprocessing steps such as stemming and stopword elimination and proceeded to compute the tf-idf of the index terms. I used these values to train the classifier, but the accuracy is very poor (53%). What should I do to improve the accuracy?
A few points that might help:
Don't use a stoplist, it lowers accuracy (but do remove punctuation)
Look at word features, and take only the top 1000 for example. Reducing dimensionality will improve your accuracy a lot;
Use bigrams as well as unigrams - this will up the accuracy a bit.
You may also find alternative weighting techniques such as log(1 + TF) * log(IDF) will improve accuracy. Good luck!
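For example, that alternative weighting, computed by hand for one term with made-up numbers:

import math

tf = 3          # the term appears 3 times in this document
n_docs = 1000   # documents in the corpus
df = 50         # documents containing the term
weight = math.log(1 + tf) * math.log(n_docs / df)
print(round(weight, 3))   # ~4.15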
There could be many reasons for the classifier not working, and there are many ways to tweak it.
did you train it with enough positive and negative examples?
how did you train the classifier? did you give it every word as a feature, or did you also add more features for it to train on (like the length of the text, for example)?
what exactly are you trying to classify? does the specified classification have specific words that are related to it?
So the question is rather broad. If you give more details, you could get more relevant suggestions.
If you are using the NLTK naive Bayes classifier, it's likely you're actually using smoothed multivariate Bernoulli naive Bayes text classification. This could be an issue if your feature extraction function maps into the set of all floating-point values (which it sounds like it might, since you're using tf-idf) rather than the set of all boolean values.
If your feature extractor returns tf-idf values, then I think nltk.NaiveBayesClassifier will check if it is true that
tf-idf(word1_in_doc1) == tf-idf(word1_in_class1)
rather than the appropriate question for whatever continuous distribution is appropriate to tf-idf.
This could explain your low accuracy, especially if one category occurs 53% of the time in your training set.
You might want to check out the multinomial naive bayes classifier implemented in scikit-learn.
For more information on multinomial and multivariate Bernoulli classifiers, see this very readable paper.
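A hedged sketch of that scikit-learn route (word counts into MultinomialNB; the texts and labels below are placeholders for the Reuters documents):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

texts = ["grain exports rise sharply", "crude oil prices fall",
         "grain harvest looks strong", "oil output cut again"]   # placeholder documents
labels = ["grain", "crude", "grain", "crude"]                    # placeholder labels

X = CountVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=0)
clf = MultinomialNB().fit(X_train, y_train)
print(clf.score(X_test, y_test))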
As Maus was saying, NLTK Naive Bayes (NB) uses a Bernoulli model plus smoothing to control for feature conditional probabilities of 0 (for features not seen by the classifier in training). A common smoothing technique is Laplace smoothing, where you add 1 to the numerator of the conditional probability, but I believe NLTK adds 0.5 to the numerator. The NLTK NB model uses boolean values and computes its conditionals based on those, so using tf-idf as a feature will not produce good or even meaningful results.
If you want to stay within NLTK, then you should use the words themselves, plus bigrams, as features. Check out this article by Jacob Perkins on text processing with NB in NLTK: http://streamhacker.com/tag/information-gain/. This article does a great job explaining and demonstrating some of the things you can do to pre-process your data; it uses the movie reviews corpus from NLTK for sentiment classification.
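A small sketch of that words-plus-bigrams feature idea in NLTK (boolean presence features, as the Bernoulli-style model expects):

import nltk

def bag_of_ngrams(words):
    # boolean presence features for unigrams and bigrams
    features = {word: True for word in words}
    features.update({bigram: True for bigram in nltk.bigrams(words)})
    return features

print(bag_of_ngrams(["not", "a", "good", "movie"]))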
There is another Python module for text processing called scikit-learn, which has various NB models in it, like multinomial NB, which uses the frequency of each word rather than its mere occurrence to compute the conditional probabilities.
Here is some literature on NB and how both the multinomial and Bernoulli models work:
http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html; navigate through the literature using the previous/next buttons on the webpage.