I'm doing a project on document classification using a naive Bayes classifier in Python. I used the NLTK module for this. The documents are from the Reuters dataset. I performed preprocessing steps such as stemming and stopword elimination and then computed the tf-idf of the index terms. I used these values to train the classifier, but the accuracy is very poor (53%). What should I do to improve the accuracy?
A few points that might help:
Don't use a stoplist; it lowers accuracy (but do remove punctuation).
Look at word features, and take only the top 1000, for example. Reducing dimensionality will improve your accuracy a lot.
Use bigrams as well as unigrams; this will raise the accuracy a bit.
You may also find that alternative weighting techniques such as log(1 + TF) * log(IDF) will improve accuracy. Good luck!
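As a rough illustration of that weighting scheme, here is a hand-rolled sketch (the toy corpus is made up, and IDF is read as the raw ratio N/df, so log(IDF) = log(N/df)):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["oil prices rise", "oil exports fall", "wheat prices rise"]  # toy corpus

# Raw term counts (documents x terms)
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)  # document frequency of each term

# log(1 + TF) * log(IDF), reading IDF as the raw ratio N/df
weights = np.log1p(counts) * np.log(n_docs / df)
print(weights)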
There could be many reasons for the classifier not working, and there are many ways to tweak it.
Did you train it with enough positive and negative examples?
How did you train the classifier? Did you give it every word as a feature, or did you also add more features for it to train on (like the length of the text, for example)?
What exactly are you trying to classify? Does the specified classification have specific words that are related to it?
So the question is rather broad. Maybe if you give more details, you could get more relevant suggestions.
If you are using the NLTK naive Bayes classifier, it's likely you're actually using smoothed multivariate Bernoulli naive Bayes text classification. This could be an issue if your feature extraction function maps into the set of all floating-point values (which it sounds like it might, since you're using tf-idf) rather than the set of all boolean values.
If your feature extractor returns tf-idf values, then I think nltk.NaiveBayesClassifier will check whether
tf-idf(word1_in_doc1) == tf-idf(word1_in_class1)
rather than asking the appropriate question for whatever continuous distribution is appropriate to tf-idf.
This could explain your low accuracy, especially if one category occurs 53% of the time in your training set.
You might want to check out the multinomial naive bayes classifier implemented in scikit-learn.
For more information on multinomial and multivariate Bernoulli classifiers, see this very readable paper.
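If you try that route, a minimal sketch might look like this (the toy documents and labels are made up; multinomial NB works on term counts rather than tf-idf floats or booleans):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data standing in for the Reuters documents and their category labels
docs = ["oil prices rise", "grain exports fall", "crude oil futures climb", "wheat harvest delayed"]
labels = ["oil", "grain", "oil", "grain"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)
print(model.predict(["oil futures"]))  # expected: ['oil']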
Like what Maus was saying, NLTK Naive Bayes (NB) uses a Bernoulli model plus smoothing to control for feature conditional probabilities of 0 (for features not seen by the classifier in training). A common technique for smoothing is Laplace smoothing, where you add 1 to the numerator of the conditional probability, but I believe NLTK adds 0.5 to the numerator. The NLTK NB model uses boolean values and computes its conditionals based on that, so using tf-idf as a feature will not produce good, or even meaningful, results.
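As a quick illustration of the smoothing idea (a hand-rolled sketch, not NLTK's actual implementation):

# Smoothed conditional probability P(word | class). count is how often `word`
# appears in documents of the class, total is the number of word tokens in the
# class, and vocab_size is the number of distinct words. alpha=1.0 gives
# Laplace smoothing; alpha=0.5 mimics the add-0.5 variant mentioned above.
def smoothed_prob(count, total, vocab_size, alpha=1.0):
    return (count + alpha) / (total + alpha * vocab_size)

# An unseen word still gets a small nonzero probability instead of zeroing
# out the whole product of conditionals
print(smoothed_prob(0, 1000, 2000))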
If you want to stay within NLTK, then you should use the words themselves (and bigrams) as features. Check out this article by Jacob Perkins on text processing with NB in NLTK: http://streamhacker.com/tag/information-gain/. This article does a great job explaining and demonstrating some of the things you can do to preprocess your data; it uses the movie reviews corpus from NLTK for sentiment classification.
There is another Python module for text processing called scikit-learn, which has various NB models in it, such as multinomial NB, which uses the frequency of each word rather than just its occurrence to compute its conditional probabilities.
Here is some literature on NB and how both the multinomial and Bernoulli models work:
http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html; navigate through the literature using the previous/next buttons on the webpage.
I am new to machine learning. I am trying to build a classifier that classifies text as having a URL or not having a URL. The data is not labelled. I just have textual data, and I don't know how to proceed with it. Any help or examples would be appreciated.
Since it's text, you can use the bag-of-words technique to create vectors.
You can use cosine similarity to cluster texts of a common type.
Then use a classifier, the choice of which will depend on the number of clusters.
This way you have a labeled training set.
If you have two clusters, a binary classifier like logistic regression would work.
If you have multiple classes, you need to train a model based on multinomial logistic regression, or train multiple logistic models using the one-vs-rest technique.
Lastly, you can test your model using k-fold cross validation. A rough sketch of this pipeline is shown below.
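Here is a rough sketch of that cluster-then-classify idea (the texts and the choice of two clusters are assumptions for illustration):

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["check this out http://example.com", "see www.example.org now",
         "meeting moved to tuesday", "lunch at noon tomorrow"]

# Bag-of-words vectors; TfidfVectorizer L2-normalizes rows, so k-means on them
# behaves much like clustering by cosine similarity
vec = TfidfVectorizer()
X = vec.fit_transform(texts)

# Pseudo-label the texts with 2 clusters (with/without a URL, hopefully)
pseudo_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Train a binary classifier on the pseudo-labels; on real data you would also
# evaluate it with k-fold cross validation (sklearn.model_selection.cross_val_score)
clf = LogisticRegression().fit(X, pseudo_labels)
print(clf.predict(vec.transform(["urgent: visit http://example.net"])))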
You cannot train a classifier with unlabeled data. You need labeled examples. There are services that will label it for you, but it might be simpler for you to do it by hand (I assume you can go through one per minute).
Stack Overflow is for programming; this question would be better suited to, say, Cross Validated. Maybe they'll have better suggestions than me.
After you've labeled the data, there's a lot of info on the web on this subject - for example, this blog is a good place to start if you already have some grip on the issue.
Good luck!
I am trying to solve an NLP multilabel classification problem. I have a huge amount of documents that should be classified into 29 categories.
My approach to the problem, after cleaning up the text (stop word removal, tokenizing, etc.), was the following:
To create the feature matrix I looked at the frequency distribution of the terms of each document, then created a table of these terms (with duplicate terms removed) and calculated the term frequency for each word in its corresponding text (tf). So eventually I ended up with around 1000 terms and their respective frequency in each document.
I then used SelectKBest to narrow them down to around 490, and after scaling them I used OneVsRestClassifier(SVC) to do the classification.
I am getting an F1 score around 0.58 but it is not improving at all and I need to get 0.62.
Am I handling the problem correctly?
Do I need to use a tf-idf vectorizer instead of tf, and how?
I am very new to NLP and I am not sure at all what to do next and how to improve the score.
Any help in this subject is priceless.
Thanks
The tf method can give common words more importance than necessary; use the tf-idf method instead, which gives importance to words that are rare and unique to a particular document in the dataset.
Also, rather than selecting the k best features up front, train on the whole set of features first and then use the feature importances to pick the best ones.
You can also try using tree classifiers or XGBoost for a better model, although SVC is also a very good classifier.
Try using naive Bayes as the baseline F1 score and try improving your results on other classifiers with the help of grid search. A sketch of a tf-idf pipeline is shown below.
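Here is a minimal sketch of swapping raw tf for tf-idf in the setup described above (the documents, labels, and the small k are placeholders; in the real problem k would be around 490 and the labels the 29 categories):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

docs = ["first document text", "second document text", "third document text"]
labels = [0, 1, 2]  # stand-ins for the 29 categories

pipeline = make_pipeline(
    TfidfVectorizer(),       # replaces the hand-computed tf features
    SelectKBest(chi2, k=2),  # k=490 in the setting described above
    OneVsRestClassifier(SVC()),
)
pipeline.fit(docs, labels)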
I have a large training set of words labeled pos and neg to classify texts. I used TextBlob (according to this tutorial) to classify texts. While it works fairly well, it can be very slow for a large training set (e.g. 8k words).
I would like to try doing this with scikit-learn, but I'm not sure where to start. What would the above tutorial look like in scikit-learn? I'd also like the training set to include weights for certain words: some should pretty much guarantee that a particular text is classified as "positive", while others guarantee that it's classified as "negative". And lastly, is there a way to indicate that certain parts of the analyzed text are more valuable than others?
Any pointers to existing tutorials or docs appreciated!
There is an excellent chapter on this topic in Sebastian Raschka's Python Machine Learning book and the code can be found here: https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch08/ch08.ipynb.
He does sentiment analysis (what you are trying to do) on an IMDB dataset. His data is not as clean as yours - from the looks of it - so he needs to do a bit more pre-processing work. Your problem can be solved with these steps:
Create numerical features by vectorizing your text: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html
Train test split: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
Train and test your favourite model, e.g.: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
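Stitched together, those three steps might look like this (a sketch; the texts and labels are placeholders):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

texts = ["loved it", "hated it", "great film", "terrible film"]
labels = ["pos", "neg", "pos", "neg"]

# 1. Vectorize the text into numerical features
X = HashingVectorizer(n_features=2**10).fit_transform(texts)

# 2. Train/test split (stratified so both classes appear in each split)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, stratify=labels, random_state=0)

# 3. Train and test a model
clf = LogisticRegression().fit(X_train, y_train)
print(clf.score(X_test, y_test))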
There are many ways to do this, such as tf-idf (term frequency - inverse document frequency), count vectorizers, Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and Word2Vec.
Among all of the above-mentioned methods, Word2Vec is the best. You can use a pre-trained Word2Vec model by Google, available at:
https://github.com/mmihaltz/word2vec-GoogleNews-vectors
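A sketch of loading those pre-trained vectors with gensim (the file path assumes you have downloaded and unpacked the archive linked above):

from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Nearest neighbours in the embedding space
print(vectors.most_similar("movie", topn=3))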
The data that I've got is mostly tweets or small comments (300-400 chars). I used a bag-of-words model with naive Bayes classification. Now I'm getting a lot of misclassified cases of the type mentioned below:
1.] He sucked on a lemon early morning to get rid of hangover.
2.] That movie sucked big time.
Now the problem is that during sentiment classification both are classified as "Negative", just because of the word "sucked".
Sentiment Classification : 1.] Negative 2.] Negative
Similarly, during document classification both get classified as "movies" due to the presence of the word "sucked".
Document classification : 1.] Movie 2.] Movie
This is just one of such instances, I'm facing a huge number of wrong classifications and don't have any idea on how to improve the accuracy.
(1)
One straightforward possible change from Bag-of-Words with Naive Bayes is to generate polynomial combinations of Bag-of-Words features. It might solve the problems you have shown above.
"sucked" + "lemon" (positive)
"sucked" + "movie" (negative)
Of course, you can also generate polynomial combinations of n-grams but the number of features might be too large.
The scikit-learn library provides a preprocessing class for this purpose:
sklearn.preprocessing.PolynomialFeatures (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)
Theoretically, an SVM with the polynomial kernel does the same thing as PolynomialFeatures + linear SVM, differing only in how the model information is stored.
In my experience, PolynomialFeatures + linear SVM performs reasonably well for short text classification including sentiment analysis.
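A sketch of that combination (the example sentences echo the "sucked" ambiguity above; recent scikit-learn versions accept sparse input to PolynomialFeatures, while older ones may need a dense conversion):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

texts = ["sucked on a lemon", "that movie sucked", "great movie", "fresh lemon drink"]
labels = ["pos", "neg", "pos", "pos"]

model = make_pipeline(
    CountVectorizer(binary=True),
    # interaction_only keeps pairwise products like "sucked" x "movie"
    # without the squared terms
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearSVC(),
)
model.fit(texts, labels)
print(model.predict(["the movie sucked big time"]))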
If the dataset is not large enough, the training dataset might not contain the pair "sucked" + "lemon". In that case, dimensionality reduction such as Singular Value Decomposition (SVD) and topic models such as Latent Dirichlet Allocation (LDA) are suitable tools for finding semantic clusters of words.
(2)
Another direction is to utilize more sophisticated natural language processing (NLP) techniques to extract additional information from short texts. For example, Part-of-Speech (POS) tagging and Named Entity Recognition (NER) will give more information than plain BoW features. A Python library for NLP called the Natural Language Toolkit (NLTK) implements those functions.
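For a quick look at the extra signal these techniques provide (this assumes the relevant NLTK data packages, such as punkt and the POS tagger models, have been downloaded):

import nltk

sentence = "He sucked on a lemon early morning in Boston."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)     # e.g. ('sucked', 'VBD'), ('lemon', 'NN')
entities = nltk.ne_chunk(tagged)  # marks 'Boston' as a named entity
print(tagged)
print(entities)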
(3)
You can also take the slow but steady way: analyzing the current model's prediction errors in order to design new hand-crafted features is a promising way to improve the accuracy of the model.
There is a library for short-text classification called LibShortText, which also contains an error analysis function and preprocessing functions such as TF-IDF weighting. It might help you to learn how to improve the model via error analysis.
LibShortText (https://www.csie.ntu.edu.tw/~cjlin/libshorttext/)
(4)
For further information, take a look at the literature on sentiment analysis of tweets, which will give you more advanced information.
Maybe you could try to use a more powerful classifier like a Support Vector Machine. Also, depending on the amount of data you have, you could try deep learning with convolutional neural nets; for this you will need a huge number of training examples (100k-1M).
I'm using the NLTK book - Natural Language Processing with Python (2009) - and looking at the naive Bayes classifier. In particular, Example 6-3 on p. 228 in my version.
The training set is movie reviews.
classifier = nltk.NaiveBayesClassifier.train(train_set)
I peek at the most informative features -
classifier.show_most_informative_features(5)
and I get 'outstanding', 'mulan' and 'wonderfully' among the top ranking ones for the sentence to be tagged 'positive'.
So, I try the following -
in1 = 'wonderfully mulan'
classifier.classify(document_features(in1.split()))
And I get 'neg'. Now this makes no sense. These were supposed to be the top features.
The document_features function is taken directly from the book:
def document_features(document):
    document_words = set(document)
    features = {}
    # word_features is the book's list of the 2000 most frequent words in the
    # overall corpus; each feature records whether that word occurs in the document
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
Note that the feature vector in that example is comprised of the "2000 most frequent words in the overall corpus." So assuming that the corpus is comprehensive, a regular review will probably have quite a few of those words. (In real-world reviews of the latest Jackass movie and Dallas Buyers Club, I get 26/2000 and 28/2000 features respectively.)
If you feed it a review containing only "wonderfully mulan", the resulting feature vector only has 2/2000 features set to True. Basically, you're giving it a pseudoreview with little to no information that it knows about or that it can do anything with. For that vector, it's hard to tell what it will predict.
The feature vector should be healthily populated with features leaning in a positive direction for it to output pos. Maybe look at the most informative, say, 500 features, see which ones lean positively, and then create a string with only those? That might get you closer to pos, but not necessarily.
Some feature vectors in the train_set classify as pos. (Anecdotally, I found one of them to have 417 features equal to True). However, in my tests, no documents from the neg or pos training set partitions classified to pos, so while you may be right that the classifier doesn't seem to be doing a great job - at least the pos training examples should classify to pos - the example you're giving it is not a great measure of that.
There are at least two different flavors of the naive Bayes classifier. In a quick search, it appears that NLTK implements the Bernoulli flavor: Different results between the Bernoulli Naive Bayes in NLTK and in scikit-learn . In any case, some flavors of naive Bayes pay attention to words/features missing from a document as much as the visible words. So, if you try to classify a document containing a few positive words but that document is also lacking many words that indicate a negative document when they are missing, it is very reasonable that the document will be categorized as negative. So, the bottom line is, pay attention to not only the visible features but also the missing features (depending on the details of the naive Bayes implementation).
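A toy demonstration of that point, using scikit-learn's BernoulliNB rather than NLTK (the data is made up): the same two positive indicators yield a much weaker prediction once other expected words are missing, because each absent feature contributes a (1 - p) factor.

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# 4 boolean features; class 1 documents tend to have all four set
X = np.array([[1, 1, 1, 1],
              [1, 1, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 0]])
y = np.array([1, 1, 0, 0])

clf = BernoulliNB().fit(X, y)
# First row: all four "class 1" words present. Second row: only two present;
# the two missing features still count as evidence through their (1 - p)
# factors, so the class 1 probability drops noticeably.
print(clf.predict_proba(np.array([[1, 1, 1, 1],
                                  [1, 1, 0, 0]])))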