I am trying to solve an NLP multilabel classification problem. I have a huge amount of documents that should be classified into 29 categories.
My approach to the problem was, after cleaning up the text, stop word removal, tokenizing etc., is to do the following:
To create the features matrix I looked at the frequency distribution of the terms of each document, I then created a table of these terms (where duplicate terms are removed), I then calculated the term frequency for each word in its corresponding text (tf). So, eventually I ended up with around a 1000 terms and their respected frequency in each document.
I then used selectKbest to narrow them down to around 490. and after scaling them I used OneVsRestClassifier(SVC) to do the classification.
I am getting an F1 score around 0.58 but it is not improving at all and I need to get 0.62.
Am I handling the problem correctly?
Do I need to use tfidf vectorizer instead of tf, and how?
I am very new to NLP and I am not sure at all what to do next and how to improve the score.
Any help in this subject is priceless.
Thanks
Tf method can give importance to common words more than necessary rather use Tfidf method which gives importance to words that are rare and unique in the particular document in the dataset.
Also before selecting Kbest rather train on the whole set of features and then use feature importance to get the best features.
You can also try using Tree Classifiers or XGB ones to better model but SVC is also very good classifier.
Try using Naive Bayes as the minimum standard of f1 score and try improving your results on other classifiers with the help of grid search.
Related
The data that I've got are mostly tweets or small comments (300-400 chars). I used a Bag-Of-Word model and used NaiveBayes classification. Now I'm having a lot of misclassified cases which are of the type mentioned below :-
1.] He sucked on a lemon early morning to get rid of hangover.
2.] That movie sucked big time.
Now the problem is that during sentiment classification both are getting "Negative" just because of the word "sucked"
Sentiment Classification : 1.] Negative 2.] Negative
Similarly during document classification both are getting classified into "movies" due to the presence of word "sucked".
Document classification : 1.] Movie 2.] Movie
This is just one of such instances, I'm facing a huge number of wrong classifications and don't have any idea on how to improve the accuracy.
(1)
One straightforward possible change from Bag-of-Words with Naive Bayes is to generate polynomial combinations of Bag-of-Words features. It might solve the problems you have shown above.
"sucked" + "lemon" (positive)
"sucked" + "movie" (negative)
Of course, you can also generate polynomial combinations of n-grams but the number of features might be too large.
The scikit-learn library prepares a preprocessing class for the purpose.
sklearn.preprocessing.PolynomialFeatures (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)
Theoretically, SVM with the polynomial kernel does the same thing as PolynomialFeatures + linear SVM but slightly different regarding how you store the model information.
In my experience, PolynomialFeatures + linear SVM performs reasonably well for short text classification including sentiment analysis.
If the dataset size is not large enough, the training dataset might not contain "sucked" + "lemon". In the case, dimensionality reduction such as Singular Value Decomposition (SVD) and topic models such as Latent Dirichlet Allocation (LDA) are suitable tools to semantic clusters for words.
(2)
Another direction is to utilize more sophisticated natural language processing (NLP) techniques to extract additional information from short texts. For example, Part-of-Speech (POS) tagging, Named Entity Recognition (NER) will give more information than plain BoWs. A python library for NLP called Natural Language Toolkit (NLTK) implements those functions.
(3)
You can also take slow but steady way. Analyzing errors in prediction by the current model to design new hand-crafted features is a promising way to improve the accuracy of the model.
There is a library for short-text classification called LibShortText, which also contains an error analysis function and preprocessing functions such as TF-IDF weighting. It might help you to learn how to improve the model via error analysis.
LibShortText (https://www.csie.ntu.edu.tw/~cjlin/libshorttext/)
(4)
For further information, take a look at the literature on sentiment analysis of Tweets will give you more advanced information.
Maybe you could try to use a more powerful classifier like Support Vector Machines. Also depending on the amount of data you have, you could try deep learning with convolutional neural nets. For this you will need a huge number of training examples (100k-1M).
I am using NLTK, to classify documents - having 1 label each, with there being 10 type of documents.
For text extraction, I am cleaning text (punctuation removal, html tag removal, lowercasing), removing nltk.corpus.stopwords, as well as my own collection of stopwords.
For my document feature I am looking across all 50k documents, and gathering the top 2k words, by frequency (frequency_words) then for each document identifying which words in the document that are also in the global frequency_words.
I am then passing in each document as hashmap of {word: boolean} into the nltk.NaiveBayesClassifier(...) I have a 20:80 test-training ratio in regards to the total number of documents.
The issues I am having:
Is this classifier by NLTK, suitable to multi labelled data? - all examples I have seen are more about 2-class classification, such as whether something is declared as a positive or negative.
The documents are such that they should have a set of key skills in - unfortunately I haven't got a corpus where these skills lie. So I have taken an approach with the understanding, a word count per document would not be a good document extractor - is this correct? Each document has been written by individuals, so I need to leave way for individual variation in the document. I am aware SkLearn MBNaiveBayes which deals with word count.
Is there an alternative library I should be using, or variation of this algorithm?
Thanks!
Terminology: Documents are to be classified into 10 different classes which makes it a multi-class classification problem. Along with that if you want to classify documents with multiple labels then you can call it as multi-class multi-label classification.
For the issues which you are facing,
nltk.NaiveBayesClassifier() is a out-of-box multi-class classifier. So yes you can use this to solve this problem. As per the multi-labelled data, if your labels are a,b,c,d,e,f,g,h,i,j then you have to define label 'b' of a particular document as '0,1,0,0,0,0,0,0,0,0'.
Feature extraction is the hardest part of Classification (Machine learning). I recommend you to look into different algorithms to understand and select the one best suits for your data(without looking at your data, it is tough to recommend which algorithm/implementation to use)
There are many different libraries out there for classification. I personally used scikit-learn and i can say it was good out-of-box classifier.
Note: Using scikit-learn, i was able to achieve results within a week, given data set was huge and other setbacks.
I'm trying to apply SVM from Scikit learn to classify the tweets I collected.
So, there will be two categories, name them A and B.
For now, I have all the tweets categorized in two text file, 'A.txt' and 'B.txt'.
However, I'm not sure what type of data inputs the Scikit Learn SVM is asking for.
I have a dictionary with labels (A and B) as its keys and a dictionary of features (unigrams) and their frequencies as values.
Sorry, I'm really new to machine learning and not sure what I should do to get the SVM work.
And I found that SVM is using numpy.ndarray as the type of its data input. Do I need to create one based on my own data?
Should it be something like this?
Labels features frequency
A 'book' 54
B 'movies' 32
Any help is appreciated.
Have a look at the documentation on text feature extraction.
Also have a look at the text classification example.
There is also a tutorial here:
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
In particular don't focus too much on SVM models (in particular not sklearn.svm.SVC that is more interesting for kernel models hence not text classification): a simple Perceptron, LogisticRegression or Bernoulli naive Bayes models might work as good while being much faster to train.
I'm doing a project on document classification using naive bayes classifier in python. I have used the nltk python module for the same. The docs are from reuters dataset. I performed preprocessing steps such as stemming and stopword elimination and proceeded to compute tf-idf of the index terms. i used these values to train the classifier but the accuracy is very poor(53%). What should I do to improve the accuracy?
A few points that might help:
Don't use a stoplist, it lowers accuracy (but do remove punctuation)
Look at word features, and take only the top 1000 for example. Reducing dimensionality will improve your accuracy a lot;
Use bigrams as well as unigrams - this will up the accuracy a bit.
You may also find alternative weighting techniques such as log(1 + TF) * log(IDF) will improve accuracy. Good luck!
There could be many reasons for the classifier not working, and there are many ways to tweak it.
did you train it with enough positive and negative examples?
how did you train the classifier? did you give it every word as a feature, or did you also add more features for it to train on(like length of the text for example)?
what exactly are you trying to classify? does the specified classification have specific words that are related to it?
So the question is rather broad. Maybe If you give more details You could get more relevant suggestions.
If you are using the nltk naive bayes classifier, it's likely your actually using smoothed multi-variate bernoulli naive bayes text classification. This could be an issue if your feature extraction function maps into the set of all floating point values (which it sounds like it might since
your using tf-idf) rather than the set of all boolean values.
If your feature extractor returns tf-idf values, then I think nltk.NaiveBayesClassifier will check if it is true that
tf-idf(word1_in_doc1) == tf-idf(word1_in_class1)
rather than the appropriate question for whatever continuous distribution is appropriate to tf-idf.
This could explain your low accuracy, especially if one category occurs 53% of the time in your training set.
You might want to check out the multinomial naive bayes classifier implemented in scikit-learn.
For more information on multinomial and multivariate Bernoulli classifiers, see this very readable paper.
Like what Maus was saying, NLTK Naive Bayes(NB) uses a Bernoulli model plus smoothing to control for feature conditional probabilities==0(for features not seen by the classifier in training) A common technique for smoothing is Laplace-smoothing where you add 1 to the numerator of the conditional probability, but I believe NLTK adds 0.5 to the numerator.The NLTK NB model uses boolean values and computes its conditionals based on that, so using tf-idf as a feature will not produce good or even meaningful results.
If you want to stay within NLTK, then you should use the words themselves as features and bigrams. Check out this article by Jacob Perkins on text processing with NB in NLTK: http://streamhacker.com/tag/information-gain/. This article does a great job explaining and demonstrating some of the things you can do to pre-process your data; it uses the movie reviews corpus from NLTK for sentiment classification.
There is another module Python for text processing called scikit-learn and that has various NB models in it like Multinomial NB, which uses the frequency each word instead of occurrence of each word for computing its conditional probabilities.
Here is some literature on NB and the how both the Multinomial and Bernoulli models work:
http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html; navigate through the literature using the previous/next buttons on the webpage.
I've got about 300k documents stored in a Postgres database that are tagged with topic categories (there are about 150 categories in total). I have another 150k documents that don't yet have categories. I'm trying to find the best way to programmaticly categorize them.
I've been exploring NLTK and its Naive Bayes Classifier. Seems like a good starting point (if you can suggest a better classification algorithm for this task, I'm all ears).
My problem is that I don't have enough RAM to train the NaiveBayesClassifier on all 150 categoies/300k documents at once (training on 5 categories used 8GB). Furthermore, accuracy of the classifier seems to drop as I train on more categories (90% accuracy with 2 categories, 81% with 5, 61% with 10).
Should I just train a classifier on 5 categories at a time, and run all 150k documents through the classifier to see if there are matches? It seems like this would work, except that there would be a lot of false positives where documents that don't really match any of the categories get shoe-horned into on by the classifier just because it's the best match available... Is there a way to have a "none of the above" option for the classifier just in case the document doesn't fit into any of the categories?
Here is my test class http://gist.github.com/451880
You should start by converting your documents into TF-log(1 + IDF) vectors: term frequencies are sparse so you should use python dict with term as keys and count as values and then divide by total count to get the global frequencies.
Another solution is to use the abs(hash(term)) for instance as positive integer keys. Then you an use scipy.sparse vectors which are more handy and more efficient to perform linear algebra operation than python dict.
Also build the 150 frequencies vectors by averaging the frequencies of all the labeled documents belonging to the same category. Then for new document to label, you can compute the cosine similarity between the document vector and each category vector and choose the most similar category as label for your document.
If this is not good enough, then you should try to train a logistic regression model using a L1 penalty as explained in this example of scikit-learn (this is a wrapper for liblinear as explained by #ephes). The vectors used to train your logistic regression model should be the previously introduced TD-log(1+IDF) vectors to get good performance (precision and recall). The scikit learn lib offers a sklearn.metrics module with routines to compute those score for a given model and given dataset.
For larger datasets: you should try the vowpal wabbit which is probably the fastest rabbit on earth for large scale document classification problems (but not easy to use python wrappers AFAIK).
How big (number of words) are your documents? Memory consumption at 150K trainingdocs should not be an issue.
Naive Bayes is a good choice especially when you have many categories with only a few training examples or very noisy trainingdata. But in general, linear Support Vector Machines do perform much better.
Is your problem multiclass (a document belongs only to one category exclusivly) or multilabel (a document belongs to one or more categories)?
Accuracy is a poor choice to judge classifier performance. You should rather use precision vs recall, precision recall breakeven point (prbp), f1, auc and have to look at the precision vs recall curve where recall (x) is plotted against precision (y) based on the value of your confidence-threshold (wether a document belongs to a category or not). Usually you would build one binary classifier per category (positive training examples of one category vs all other trainingexamples which don't belong to your current category). You'll have to choose an optimal confidence threshold per category. If you want to combine those single measures per category into a global performance measure, you'll have to micro (sum up all true positives, false positives, false negatives and true negatives and calc combined scores) or macro (calc score per category and then average those scores over all categories) average.
We have a corpus of tens of million documents, millions of training examples and thousands of categories (multilabel). Since we face serious training time problems (the number of documents are new, updated or deleted per day is quite high), we use a modified version of liblinear. But for smaller problems using one of the python wrappers around liblinear (liblinear2scipy or scikit-learn) should work fine.
Is there a way to have a "none of the
above" option for the classifier just
in case the document doesn't fit into
any of the categories?
You might get this effect simply by having a "none of the above" pseudo-category trained each time. If the max you can train is 5 categories (though I'm not sure why it's eating up quite so much RAM), train 4 actual categories from their actual 2K docs each, and a "none of the above" one with its 2K documents taken randomly from all the other 146 categories (about 13-14 from each if you want the "stratified sampling" approach, which may be sounder).
Still feels like a bit of a kludge and you might be better off with a completely different approach -- find a multi-dimensional doc measure that defines your 300K pre-tagged docs into 150 reasonably separable clusters, then just assign each of the other yet-untagged docs to the appropriate cluster as thus determined. I don't think NLTK has anything directly available to support this kind of thing, but, hey, NLTK's been growing so fast that I may well have missed something...;-)