Classifying Documents into Categories

Classifying Documents into Categories - python

I've got about 300k documents stored in a Postgres database that are tagged with topic categories (there are about 150 categories in total). I have another 150k documents that don't yet have categories. I'm trying to find the best way to programmaticly categorize them.
I've been exploring NLTK and its Naive Bayes Classifier. Seems like a good starting point (if you can suggest a better classification algorithm for this task, I'm all ears).
My problem is that I don't have enough RAM to train the NaiveBayesClassifier on all 150 categoies/300k documents at once (training on 5 categories used 8GB). Furthermore, accuracy of the classifier seems to drop as I train on more categories (90% accuracy with 2 categories, 81% with 5, 61% with 10).
Should I just train a classifier on 5 categories at a time, and run all 150k documents through the classifier to see if there are matches? It seems like this would work, except that there would be a lot of false positives where documents that don't really match any of the categories get shoe-horned into on by the classifier just because it's the best match available... Is there a way to have a "none of the above" option for the classifier just in case the document doesn't fit into any of the categories?
Here is my test class http://gist.github.com/451880

You should start by converting your documents into TF-log(1 + IDF) vectors: term frequencies are sparse so you should use python dict with term as keys and count as values and then divide by total count to get the global frequencies.
Another solution is to use the abs(hash(term)) for instance as positive integer keys. Then you an use scipy.sparse vectors which are more handy and more efficient to perform linear algebra operation than python dict.
Also build the 150 frequencies vectors by averaging the frequencies of all the labeled documents belonging to the same category. Then for new document to label, you can compute the cosine similarity between the document vector and each category vector and choose the most similar category as label for your document.
If this is not good enough, then you should try to train a logistic regression model using a L1 penalty as explained in this example of scikit-learn (this is a wrapper for liblinear as explained by #ephes). The vectors used to train your logistic regression model should be the previously introduced TD-log(1+IDF) vectors to get good performance (precision and recall). The scikit learn lib offers a sklearn.metrics module with routines to compute those score for a given model and given dataset.
For larger datasets: you should try the vowpal wabbit which is probably the fastest rabbit on earth for large scale document classification problems (but not easy to use python wrappers AFAIK).

How big (number of words) are your documents? Memory consumption at 150K trainingdocs should not be an issue.
Naive Bayes is a good choice especially when you have many categories with only a few training examples or very noisy trainingdata. But in general, linear Support Vector Machines do perform much better.
Is your problem multiclass (a document belongs only to one category exclusivly) or multilabel (a document belongs to one or more categories)?
Accuracy is a poor choice to judge classifier performance. You should rather use precision vs recall, precision recall breakeven point (prbp), f1, auc and have to look at the precision vs recall curve where recall (x) is plotted against precision (y) based on the value of your confidence-threshold (wether a document belongs to a category or not). Usually you would build one binary classifier per category (positive training examples of one category vs all other trainingexamples which don't belong to your current category). You'll have to choose an optimal confidence threshold per category. If you want to combine those single measures per category into a global performance measure, you'll have to micro (sum up all true positives, false positives, false negatives and true negatives and calc combined scores) or macro (calc score per category and then average those scores over all categories) average.
We have a corpus of tens of million documents, millions of training examples and thousands of categories (multilabel). Since we face serious training time problems (the number of documents are new, updated or deleted per day is quite high), we use a modified version of liblinear. But for smaller problems using one of the python wrappers around liblinear (liblinear2scipy or scikit-learn) should work fine.

Is there a way to have a "none of the
above" option for the classifier just
in case the document doesn't fit into
any of the categories?
You might get this effect simply by having a "none of the above" pseudo-category trained each time. If the max you can train is 5 categories (though I'm not sure why it's eating up quite so much RAM), train 4 actual categories from their actual 2K docs each, and a "none of the above" one with its 2K documents taken randomly from all the other 146 categories (about 13-14 from each if you want the "stratified sampling" approach, which may be sounder).
Still feels like a bit of a kludge and you might be better off with a completely different approach -- find a multi-dimensional doc measure that defines your 300K pre-tagged docs into 150 reasonably separable clusters, then just assign each of the other yet-untagged docs to the appropriate cluster as thus determined. I don't think NLTK has anything directly available to support this kind of thing, but, hey, NLTK's been growing so fast that I may well have missed something...;-)

Related

Can sentiment classification problem be resolved using regression?

I have a data set of tweets, where each tweet has an average confidence score.
For example
Tweet Average Confidence Standard Deviation
too much thoughts inside his headdd we can t even imagine 0.3 0.163951
His ass need to stay up 0.8 0.161962
First time I heard his name in camp, he seems amazing 0.19 0.181962
Average Confidence is the average of the confidences predicted by several supervised models for a specific instance to belong to the positive class.
Standard Deviation is the confidences' standard deviation from Average Confidence for a particular instance.
If i consider it as a regression task, how to handle multi label data
EDIT

Didn't understood your question quite yet, so answered as best I thought :)
Basically the score (in your case average) is used alone in sentiment analysis to classify good-bad sentences (data), pick a threshold which yield the best classification results, lets say 0.6 so
if score >= 0.6
classify as GOOD
else
classify as BAD
I suggest to see if this simple approach is good enough for your requirements
in case you want to classify using more variables (information) e.g 'averageandstd` you can use another classification model (like logistic-regression, decision trees, svm and more...)
in case you would like to use some regression approach, I suggest a logistic regression (it's pretty strait forward)
because your current model consists of just 2 variables average and std a svm, might give better results (basically it project the data on higher dimension and do the classification there)
keep in mind all methods (maybe except decision-trees and such) will output another score, like prob of classification between 0 to 1 so a thresholds will always have to be applied at the end

dealing with imbalanced data after encoding for classification

I have a data of dimension (13961,48 ) initially, and after one hot encoding and also basic massaging of data the dimension observed around (13961,862). the data is imbalance with two categories of 'Retained' around 6% and 'not Retained' around 94%.
While running any algorithms such as logistic,knn,decision tree,random forest, the data results in very high accuracy even without any feature selection process carried out and the accuracy crosses more than 94% mostly except 'Naive bias classifier'.
This seems like odd and even by having any two features randomly also--> that gives accuracy more than 94% , which seems non reality in general.
Applying SMOTE also, provide result of more than 94% of accuracy even for baseline model of any algorithms said above such as logistic,knn,decision tree,random forest,
After removing the top 20 features also , this gives accuracy of good result more than 94% ( checked for understanding the genuineness )
g = data[Target_col_Y_name]
df = pd.concat([g.value_counts(),
g.value_counts(normalize=True).mul(100)],axis=1, keys=('counts','percentage'))
print('The % distribution between the retention and non-retention flag\n')
print (df)
# The code o/p to show the imbalance is
The % distribution between the retention and non-retention flag
counts percentage
Non Retained 13105 93.868634
Retained 856 6.131366
My data have 7 numerical variables such as month, amount, interest rate and all others ( around 855) as one-hot-encoding transformed categorical variables.
Any methodology , to handle this kind of data on baseline,feature selection or imbalance optimization techniques ? please guide by looking at the dimensionality and the imbalance count for each levels.

I would like to add something in addition to Elias answer.
Firstly, you have to understand that even if you's create "dumb classifier", which always predicts "not retained", you'd still be correct 94% of times. So accuracy is clearly weak metric in this case.
You should definitely learn about confusion matrix and metrics that come along with it (like AUC).
One of these metrics is F1 score, which is harmonic average of precision and recall. It is better that accuracy in imbalanced class setting, but... it doesn't have to be the best. F1 will favor these classifiers that have similar precision and recall. But this is not necessary something that is important for you.
For instance, if you'd build sfw content filter, you would be ok with labeling some SFW content as nsfw (negative class), which would increase false negative rate (and decrease recall), but you would like to be sure that you kept only safe ones (high precision).
In your case you can reason what is worse: retaining something bad or abandoning something good, and pick the metric in that way.
As far as strategy is concerned: there are plenty of ways to handle class imbalance: sampling techniques (try down-sampling, up-sampling besides SMOTE or ROSE) and check out whether your validation score (training metrics alone are almost useless) improved. Just remember to apply sampling/augmentations techniques after the train-validation split.
Moreover, some models have special hyperparametrs to focus more on rare class (for instance in xgboost there is scale_pos_weight parameter). From my experience, tunning this hyperparam is way more effective than SMOTE.
Good luck

Accuracy is not a very good measure in general, particularly for imbalanced classes. I would recommend this other stackoverflow answer, that explains when to use F1 score and when to use AUROC, which are both far better measures than accuracy; in this case F1 is better.
Few points just to clear up:
For models such as random forest, you should not have to remove features to improve the accuracy, as it will just regard them as insignificant features. I recommend random forests as it tends to be very accurate (except in some cases) and can show significant features just by using clf.feature_significances_ (if using the scipy random forest).
Decision trees will almost always perform worse than random forests, as random forests are many aggregated decision trees.

additional of features decrease the accuracy- random forest

I am using sklearn's random forests module to predict a binary target variable based on 166 features.
When I increase the number of dimensions to 175 the accuracy of the model decreases (from accuracy = 0.86 to 0.81 and from recall = 0.37 to 0.32) .
I would expect more data to only make the model more accurate, especially when the added features were with business value.
I built the model using sklearn in python.
Why the new features did not get weight 0 and left the accuracy as it was ?

Basically, you may be "confusing" your model with useless features. MORE FEATURES or MORE DATA WILL NOT ALWAYS MAKE YOUR MODEL BETTER. The new features will also not get weight zero because the model will try hard to use them! Because there are so many (175!), RF is just not able to come back to the previous "pristine" model with better accuracy and recall (maybe these 9 features are really not adding anything useful).
Think about how a decision tree essentially works. These new features will cause some new splits that can worsen the results. Try to work it out from the basics and slowly adding new information always checking the performance. In addition, pay attention to for example the number of features used per split (mtry). For so many features, you would need to have a very high mtry (to allow for a big sample to be considered for every split). Have you considered adding 1 or 2 more and checking how the accuracy responds? Also, don't forget mtry!

More data does not always make the model more accurate. Random forest is a traditional machine learning method where the programmer has to do the feature selection. If the model is given a lot of data but it is bad, then the model will try to make sense out of that bad data too and will end up messing things up. More data is better for neural networks as those networks select the best possible features out of the data on their own.
Also, 175 features is too much and you should definitely look into dimensionality reduction techniques and select the features which are highly correlated with the target. there are several methods in sklearn to do that. You can try PCA if your data is numerical or RFE to remove bad features, etc.

nlp multilabel classification tf vs tfidf

I am trying to solve an NLP multilabel classification problem. I have a huge amount of documents that should be classified into 29 categories.
My approach to the problem was, after cleaning up the text, stop word removal, tokenizing etc., is to do the following:
To create the features matrix I looked at the frequency distribution of the terms of each document, I then created a table of these terms (where duplicate terms are removed), I then calculated the term frequency for each word in its corresponding text (tf). So, eventually I ended up with around a 1000 terms and their respected frequency in each document.
I then used selectKbest to narrow them down to around 490. and after scaling them I used OneVsRestClassifier(SVC) to do the classification.
I am getting an F1 score around 0.58 but it is not improving at all and I need to get 0.62.
Am I handling the problem correctly?
Do I need to use tfidf vectorizer instead of tf, and how?
I am very new to NLP and I am not sure at all what to do next and how to improve the score.
Any help in this subject is priceless.
Thanks

Tf method can give importance to common words more than necessary rather use Tfidf method which gives importance to words that are rare and unique in the particular document in the dataset.
Also before selecting Kbest rather train on the whole set of features and then use feature importance to get the best features.
You can also try using Tree Classifiers or XGB ones to better model but SVC is also very good classifier.
Try using Naive Bayes as the minimum standard of f1 score and try improving your results on other classifiers with the help of grid search.

When using multiple classifiers - How to measure the ensemble's performance? [SciKit Learn]

I have a classification problem (predicting whether a sequence belongs to a class or not), for which I decided to use multiple classification methods, in order to help filter out the false positives.
(The problem is in bioinformatics - classifying protein sequences as being Neuropeptide precursors sequences. Here's the original article if anyone's interested, and the code used to generate features and to train a single predictor) .
Now, the classifiers have roughly similar performance metrics (83-94% accuracy/precision/etc' on the training set for 10-fold CV), so my 'naive' approach was to simply use multiple classifiers (Random Forests, ExtraTrees, SVM (Linear kernel), SVM (RBF kernel) and GRB) , and to use a simple majority vote.
MY question is:
How can I get the performance metrics for the different classifiers and/or their votes predictions?
That is, I want to see if using the multiple classifiers improves my performance at all, or which combination of them does.
My intuition is maybe to use the ROC score, but I don't know how to "combine" the results and to get it from a combination of classifiers. (That is, to see what the ROC curve is just for each classifier alone [already known], then to see the ROC curve or AUC for the training data using combinations of classifiers).
(I currently filter the predictions using "predict probabilities" with the Random Forests and ExtraTrees methods, then I filter arbitrarily for results with a predicted score below '0.85'. An additional layer of filtering is "how many classifiers agree on this protein's positive classification").
Thank you very much!!
(The website implementation, where we're using the multiple classifiers - http://neuropid.cs.huji.ac.il/ )
The whole shebang is implemented using SciKit learn and python. Citations and all!)

To evaluate the performance of the ensemble, simply follow the same approach as you would normally. However, you will want to get the 10 fold data set partitions first, and for each fold, train all of your ensemble on that same fold, measure the accuracy, rinse and repeat with the other folds and then compute the accuracy of the ensemble. So the key difference is to not train the individual algorithms using k fold cross-validation when evaluating the ensemble. The important thing is not to let the ensemble see the test data either directly or by letting one of it's algorithms see the test data.
Note also that RF and Extra Trees are already ensemble algorithms in their own right.
An alternative approach (again making sure the ensemble approach) is to take the probabilities and \ or labels output by your classifiers, and feed them into another classifier (say a DT, RF, SVM, or whatever) that produces a prediction by combining the best guesses from these other classifiers. This is termed "Stacking"

You can use a linear regression for stacking. For each 10-fold, you can split the data with:
8 training sets
1 validation set
1 test set
Optimise the hyper-parameters for each algorithm using the training set and validation set, then stack yours predictions by using a linear regression - or a logistic regression - over the validation set. Your final model will be p = a_o + a_1 p_1 + … + a_k p_K, where K is the number of classifier, p_k is the probability given by model k and a_k is the weight of the model k. You can also directly use the predicted outcomes, if the model doesn't give you probabilities.
If yours models are the same, you can optimise for the parameters of the models and the weights in the same time.
If you have obvious differences, you can do different bins with different parameters for each. For example one bin could be short sequences and the other long sequences. Or different type of proteins.
You can use the metric whatever metric you want, as long as it makes sens, like for not blended algorithms.
You may want to look at the 2007 Belkor solution of the Netflix challenges, section Blending. In 2008 and 2009 they used more advances technics, it may also be interesting for you.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.