How to reduce the number of features in text classification? - python

I'm doing dialect text classification using CountVectorizer with naive Bayes. The number of features is too large: I have collected 20k tweets across 4 dialects, 5,000 tweets per dialect, and the total number of features is 43k. I suspect this is why I'm overfitting, because accuracy drops a lot when I test on new data. So how can I limit the number of features to avoid overfitting?

You can set the max_features parameter to 5000, for instance; it might help with overfitting. You could also tinker with max_df (for instance, set it to 0.95).
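For instance (a minimal sketch; the multinomial naive Bayes pipeline and the variable names are assumptions, since the question doesn't show the model code):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
# Keep only the 5,000 most frequent terms and drop terms that occur in more than 95% of the tweets
vectorizer = CountVectorizer(max_features=5000, max_df=0.95)
model = make_pipeline(vectorizer, MultinomialNB())
model.fit(train_tweets, train_dialects)  # placeholder lists of texts and labels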

This drop on test data is caused by the curse of dimensionality. You can use a dimensionality reduction method to reduce this effect. A possible choice is Latent Semantic Analysis, implemented in sklearn.
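A minimal sketch of that idea with TruncatedSVD (scikit-learn's LSA implementation); the 100-component size is an arbitrary starting point:
from sklearn.decomposition import TruncatedSVD
# Project the 43k-dimensional count matrix down to 100 latent dimensions
lsa = TruncatedSVD(n_components=100, random_state=0)
X_train_lsa = lsa.fit_transform(X_train_counts)  # X_train_counts: output of CountVectorizer
X_test_lsa = lsa.transform(X_test_counts)
Note that the reduced matrix contains negative values, so it pairs with a classifier such as logistic regression or a linear SVM rather than multinomial naive Bayes.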

Related

Vocabulary size vs. number of samples

Community,
I am training an SVM for sentiment classification. For the training I use the sentiment140 twitter dataset.
For this process I tried two different approaches:
Use only 10% of the data (160,000 messages) for training, without limiting the feature size (only 10% for computational reasons).
Limit the feature size (vocabulary size) to 12,000 or 20,000 words, which allows me to use up to 400,000 twitter messages.
The tokenizer/vectorizer I use uses 1-grams and 2-grams.
Comparing the results, method 1 gives about 80% accuracy whereas the second method only reaches about 79.5% accuracy.
I am quite unsure about these results and what will perform best for an unlabeled dataset I use for sentiment classification.
Limiting the vocabulary size would focus the classification on the most common words, but on the other hand it could provide more robust learning because of the larger dataset, and maybe a more accurate classification of the unlabeled dataset?
Having no limit on the vocabulary size with a smaller dataset brings the benefit of marginally higher classification accuracy, and therefore perhaps a better classification of the unlabeled dataset?
I am rather new to this topic and thus not sure about how to decide...
Maybe someone of you can give me a recommendation on what would be better?
Thanks in advance!
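For reference, the two setups boil down to something like the following (a sketch only; the question doesn't show the actual vectorizer or SVM code, so the TF-IDF/LinearSVC pipeline and the variable names are assumptions):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
# Setup 1: 10% of the messages, unlimited vocabulary
pipe_unlimited = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
score_1 = cross_val_score(pipe_unlimited, texts_10pct, labels_10pct, cv=5).mean()
# Setup 2: all usable messages, vocabulary capped at 20,000 terms
pipe_capped = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), max_features=20000), LinearSVC())
score_2 = cross_val_score(pipe_capped, texts_400k, labels_400k, cv=5).mean()
Comparing cross-validated scores like these, rather than a single train/test split, gives a more reliable basis for choosing between the two setups.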

Accuracy with TF-IDF and non-TF-IDF features

I run a Random Forest algorithm with TF-IDF and non-TF-IDF features.
In total the features are around 130k in number (after a feature selection conducted on the TF-IDF features) and the observations of the training set are around 120k in number.
Around 500 of them are the non-TF-IDF features.
The issue is that, on the same test set etc., the accuracy of the Random Forest with
- only the non-TF-IDF features is 87%
- the TF-IDF and non-TF-IDF features together is 76%
This significant drop in accuracy raises some questions in my mind.
The relevant piece of code of mine with the training of the models is the following:
import os
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
drop_columns = ['labels', 'complete_text_1', 'complete_text_2']
# Split into predictors and targets
X_train = df.drop(columns=drop_columns).values
y_train = df['labels'].values
# Instantiate, train and transform with tf-idf models
vectorizer_1 = TfidfVectorizer(analyzer="word", ngram_range=(1,2), vocabulary=tf_idf_feature_names_selected)
X_train_tf_idf_1 = vectorizer_1.fit_transform(df['complete_text_1'])
vectorizer_2 = TfidfVectorizer(analyzer="word", ngram_range=(1,2), vocabulary=tf_idf_feature_names_selected)
X_train_tf_idf_2 = vectorizer_2.fit_transform(df['complete_text_2'])
# Convert the general features to a sparse array
X_train = np.array(X_train, dtype=float)
X_train = csr_matrix(X_train)
# Concatenate the general features and the tf-idf feature arrays
X_train_all = hstack([X_train, X_train_tf_idf_1, X_train_tf_idf_2])
# Instantiate and train the model
rf_classifier = RandomForestClassifier(n_estimators=150, random_state=0, class_weight='balanced', n_jobs=os.cpu_count()-1)
rf_classifier.fit(X_train_all, y_train)
Personally, I have not seen any bug in my code (this piece above and in general).
The hypothesis which I have formulated to explain this decrease in accuracy is the following.
The number of non-TF-IDF features is only 500 (out of the 130k features in total).
This means there is a good chance that the non-TF-IDF features are not picked much at each split by the trees of the random forest (e.g. because of max_features etc.).
So if the non-TF-IDF features do actually matter, this will create problems because they are not taken into account enough.
Related to this, when I check the feature importances of the random forest after training it, I see the importances of the non-TF-IDF features being very, very low (although I am not sure how reliable an indicator the feature importances are, especially with TF-IDF features included).
Can you explain the decrease in my classifier's accuracy differently?
In any case, what would you suggest doing?
Some other ideas for combining the TF-IDF and non-TF-IDF features are the following.
One option would be to have two separate (random forest) models - one for the TF-IDF features and one for the non-TF-IDF features.
Then the results of these two models will be combined either by (weighted) voting or meta-classification.
Your view that 130k features is way too much for the random forest sounds right. You didn't mention how many examples you have in your dataset, and that would be crucial to the choice of possible next steps. Here are a few ideas off the top of my head.
If the number of datapoints is large enough, you may want to train some transformation for the TF-IDF features - e.g. you might want to train a small-dimensional embedding of these TF-IDF features into, say, 64-dimensional space, and then e.g. a small NN on top of that (maybe even a linear model). Once you have the embeddings, you could use them as a transform to generate 64 additional features per example, replacing the TF-IDF features for the RandomForest training. Alternatively, you could replace the whole random forest with a NN whose architecture combines all the TF-IDFs into a few neurons via fully-connected layers and later concatenates them with the other features (pretty much the same as the embeddings, but as part of the NN).
If you don't have enough data to train a large NN, maybe you can try training a GBDT ensemble instead of a random forest. It should probably do a much better job of picking out the good features than a random forest, which is likely to be affected a lot by the many noisy, useless features. You could also first train some crude version and then do feature selection based on that (again, I would expect it to do a more reasonable job than a random forest).
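A rough sketch of the embedding idea, using TruncatedSVD as a simple stand-in for a learned 64-dimensional embedding (the variable names come from the question's code; the approach and values are illustrative, not the answerer's implementation):
import numpy as np
from scipy.sparse import hstack
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
# Compress the ~130k sparse TF-IDF columns into 64 dense components,
# then let the forest see them next to the ~500 general features
svd = TruncatedSVD(n_components=64, random_state=0)
tf_idf_64 = svd.fit_transform(hstack([X_train_tf_idf_1, X_train_tf_idf_2]))
X_train_compact = np.hstack([X_train.toarray(), tf_idf_64])
rf = RandomForestClassifier(n_estimators=150, random_state=0, class_weight='balanced')
rf.fit(X_train_compact, y_train)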
My guess is that your hypothesis is partly correct.
When using the full dataset (in the 130K-feature model), each split in a tree considers only a small fraction of the 500 non-TF-IDF features. So if the non-TF-IDF features are important, each split misses out on a lot of useful information. The features ignored at one split will probably be used at a different split in the tree, but the result isn't as good as it would be if more of them were considered at every split.
I would argue that there are some very important TF-IDF features, too. The fact that we have so many features means that a small fraction of those features is considered at each split.
In other words: the problem isn't that we're weakening the non-TF-IDF features. The problem is that we're weakening all of the useful features (both non-TF-IDF and TF-IDF). This is along the lines of Alexander's answer.
In light of this, your proposed solutions won't solve the problem very well. If you make two random forest models, one with 500 non-TF-IDF features and the other with 125K TF-IDF features, the second model will perform poorly, and negatively influence the results. If you pass the results of the 500 model as an additional feature to the 125K model, you're still underperforming.
If we want to stick with random forests, a better solution would be to increase the max_features and/or the number of trees. This will increase the odds that useful features are considered at each split, leading to a more accurate model.
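In code, that amounts to raising max_features and n_estimators from their defaults (the exact values below are illustrative, not tuned):
from sklearn.ensemble import RandomForestClassifier
# The default max_features='sqrt' considers only ~sqrt(130k) ≈ 360 candidate
# features per split; a fractional max_features considers far more
rf_classifier = RandomForestClassifier(
    n_estimators=500,        # more trees
    max_features=0.05,       # ~6,500 candidate features per split
    random_state=0,
    class_weight='balanced',
    n_jobs=-1,
)
rf_classifier.fit(X_train_all, y_train)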

Adding features decreases the accuracy - random forest

I am using sklearn's random forests module to predict a binary target variable based on 166 features.
When I increase the number of dimensions to 175, the accuracy of the model decreases (accuracy drops from 0.86 to 0.81 and recall from 0.37 to 0.32).
I would expect more data to only make the model more accurate, especially when the added features have business value.
I built the model using sklearn in python.
Why did the new features not get zero weight and leave the accuracy as it was?
Basically, you may be "confusing" your model with useless features. MORE FEATURES or MORE DATA WILL NOT ALWAYS MAKE YOUR MODEL BETTER. The new features will also not get zero weight, because the model will try hard to use them! Because there are so many features (175!), RF is simply not able to come back to the previous "pristine" model with better accuracy and recall (and maybe those 9 extra features are really not adding anything useful).
Think about how a decision tree essentially works. The new features will cause some new splits that can worsen the results. Try to work up from the basics, slowly adding new information and always checking the performance, as in the sketch below. In addition, pay attention to, for example, the number of features considered per split (mtry in R, max_features in sklearn). With so many features you would need a fairly high mtry, to allow a big enough sample of features to be considered at every split. Have you considered adding just 1 or 2 more features and checking how the accuracy responds? And don't forget mtry!
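A sketch of that incremental check (X is assumed to be a pandas DataFrame, the feature lists are placeholders, and max_features plays the role of mtry in sklearn):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
features = list(original_166_features)     # placeholder: the original columns
for new_col in candidate_features:         # placeholder: the 9 added columns
    features.append(new_col)
    rf = RandomForestClassifier(n_estimators=200, max_features=0.5, random_state=0)
    score = cross_val_score(rf, X[features], y, cv=5, scoring='accuracy').mean()
    print(len(features), 'features -> accuracy %.3f' % score)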
More data does not always make the model more accurate. Random forest is a traditional machine learning method where the programmer has to do the feature selection. If the model is given a lot of data but it is bad, then the model will try to make sense out of that bad data too and will end up messing things up. More data is better for neural networks as those networks select the best possible features out of the data on their own.
Also, 175 features is a lot, and you should definitely look into dimensionality reduction techniques, selecting the features that are most related to the target. There are several methods in sklearn to do that: you can try PCA if your data is numerical, or RFE to remove bad features, etc.
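A minimal sketch of the RFE route (keeping 100 features is an arbitrary illustration):
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
# Recursively drop the weakest features, ranked by the forest's own
# feature importances, until 100 remain
selector = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=0),
               n_features_to_select=100, step=5)
X_selected = selector.fit_transform(X, y)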

Doc2Vec Worse Than Mean or Sum of Word2Vec Vectors

I'm training a Word2Vec model like:
from gensim.models import Word2Vec
model = Word2Vec(documents, size=200, window=5, min_count=0, workers=4, iter=5, sg=1)
and a Doc2Vec model like:
from gensim.models import Doc2Vec
doc2vec_model = Doc2Vec(size=200, window=5, min_count=0, iter=5, workers=4, dm=1)
doc2vec_model.build_vocab(doc2vec_tagged_documents)
doc2vec_model.train(doc2vec_tagged_documents, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.iter)
with the same data and comparable parameters.
After this I'm using these models for my classification task. I have found that simply averaging or summing the word2vec embeddings of a document performs considerably better than using the doc2vec vectors. I also tried many more doc2vec iterations (25, 80 and 150), and it makes no difference.
Any tips or ideas why and how to improve doc2vec results?
Update: This is how doc2vec_tagged_documents is created:
from gensim.models.doc2vec import TaggedDocument
doc2vec_tagged_documents = list()
counter = 0
for document in documents:
    doc2vec_tagged_documents.append(TaggedDocument(document, [counter]))
    counter += 1
Some more facts about my data:
My training data contains 4000 documents with 900 words on average.
My vocabulary size is about 1000 words.
My data for the classification task is much smaller on average (12 words per document), but I also tried splitting the training data into lines and training the doc2vec model on those, with almost the same result.
My data is not about natural language, please keep this in mind.
Summing/averaging word2vec vectors is often quite good!
It is more typical to use 10 or 20 iterations with Doc2Vec, rather than the default 5 inherited from Word2Vec. (I see you've tried that, though.)
If your main interest is the doc-vectors – and not the word-vectors that are in some Doc2Vec modes co-trained – definitely try the PV-DBOW mode (dm=0) as well. It'll train faster and is often a top-performer.
If your corpus is very small, or the docs very short, it may be hard for the doc-vectors to become generally meaningful. (In some cases, decreasing the vector size may help.) But especially if window is a large proportion of the average doc size, what's learned by the word-vectors and what's learned by the doc-vectors will be very, very similar. And since the words may get trained more times, in more diverse contexts, they may have more generalizable meaning – unless you have a larger collection of longer docs.
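Putting the PV-DBOW and iteration suggestions together, using the same older-gensim parameter names as the question (the specific values are just a starting point, and min_count is raised per the last point in the list below):
from gensim.models import Doc2Vec
doc2vec_model = Doc2Vec(size=200, window=5, min_count=5, iter=20, workers=4, dm=0)
doc2vec_model.build_vocab(doc2vec_tagged_documents)
doc2vec_model.train(doc2vec_tagged_documents, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.iter)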
Other things that sometimes help improve Doc2Vec vectors for classification purposes:
re-inferring all document vectors at the end of training, perhaps even using parameters different from the infer_vector() defaults, such as infer_vector(tokens, steps=50, alpha=0.025) – while quite slow, this means all docs get vectors from the same final model state, rather than whatever was left over from bulk training (see the sketch after this list)
where classification labels are known, adding them as trained doc-tags, using the capability of TaggedDocument tags to be a list of tags
rare words are essentially just noise to Word2Vec or Doc2Vec - so a min_count above 1, perhaps significantly higher, often helps. (Singleton words mixed in may be especially damaging to individual doc-ID doc-vectors that are also, by design, singletons. The training process is also, in competition with the doc-vector, trying to make those singleton word-vectors predictive of their single-document neighborhoods... when really, for your purposes, you just want the doc-vector to be most descriptive. So this suggests both trying PV-DBOW and increasing min_count.)
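A sketch of the re-inference idea, reusing doc2vec_tagged_documents and the parameters mentioned above:
# After training, discard the bulk-trained vectors and re-infer each
# document from the final model state
doc_vectors = [doc2vec_model.infer_vector(doc.words, steps=50, alpha=0.025)
               for doc in doc2vec_tagged_documents]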
Hope this helps.

Classifying Documents into Categories

I've got about 300k documents stored in a Postgres database that are tagged with topic categories (there are about 150 categories in total). I have another 150k documents that don't yet have categories. I'm trying to find the best way to programmatically categorize them.
I've been exploring NLTK and its Naive Bayes Classifier. Seems like a good starting point (if you can suggest a better classification algorithm for this task, I'm all ears).
My problem is that I don't have enough RAM to train the NaiveBayesClassifier on all 150 categories/300k documents at once (training on 5 categories used 8GB). Furthermore, the accuracy of the classifier seems to drop as I train on more categories (90% accuracy with 2 categories, 81% with 5, 61% with 10).
Should I just train a classifier on 5 categories at a time, and run all 150k documents through the classifier to see if there are matches? It seems like this would work, except that there would be a lot of false positives where documents that don't really match any of the categories get shoe-horned into one by the classifier just because it's the best match available... Is there a way to have a "none of the above" option for the classifier just in case the document doesn't fit into any of the categories?
Here is my test class http://gist.github.com/451880
You should start by converting your documents into TF-log(1 + IDF) vectors: term frequencies are sparse, so you should use a python dict with terms as keys and counts as values, and then divide by the total count to get the global frequencies.
Another solution is to use abs(hash(term)), for instance, as positive integer keys. Then you can use scipy.sparse vectors, which are handier and more efficient for linear algebra operations than a python dict.
Then build the 150 frequency vectors by averaging the frequencies of all the labeled documents belonging to the same category. For each new document to label, compute the cosine similarity between the document vector and each category vector and choose the most similar category as the label for your document.
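A minimal sketch of that centroid approach, using scikit-learn's TfidfVectorizer as a close stand-in for the weighting described above (all variable names are placeholders):
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
vectorizer = TfidfVectorizer(sublinear_tf=True)
X_labeled = vectorizer.fit_transform(labeled_texts)   # the 300k tagged documents
labels = np.array(labeled_categories)                 # their category tags
centroids = np.vstack([
    np.asarray(X_labeled[labels == cat].mean(axis=0)).ravel()
    for cat in categories                             # the 150 category names
])
X_new = vectorizer.transform(unlabeled_texts)         # the 150k untagged documents
best = cosine_similarity(X_new, centroids).argmax(axis=1)
predicted = [categories[i] for i in best]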
If this is not good enough, then you should try to train a logistic regression model with an L1 penalty, as explained in this example from scikit-learn (this is a wrapper for liblinear, as explained by #ephes). The vectors used to train your logistic regression model should be the previously introduced TF-log(1+IDF) vectors to get good performance (precision and recall). The scikit-learn lib offers a sklearn.metrics module with routines to compute those scores for a given model and a given dataset.
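A sketch of that route, reusing the TF-IDF matrices from the previous sketch (C is left at its default; liblinear is the solver behind scikit-learn's L1 logistic regression here):
from sklearn.linear_model import LogisticRegression
# The L1 penalty drives most vocabulary weights to exactly zero,
# acting as built-in feature selection
clf = LogisticRegression(penalty='l1', solver='liblinear')
clf.fit(X_labeled, labels)
predicted = clf.predict(X_new)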
For larger datasets, you should try vowpal wabbit, which is probably the fastest rabbit on earth for large-scale document classification problems (but no easy-to-use python wrappers, AFAIK).
How big (in number of words) are your documents? Memory consumption with 150K training docs should not be an issue.
Naive Bayes is a good choice, especially when you have many categories with only a few training examples or very noisy training data. But in general, linear Support Vector Machines perform much better.
Is your problem multiclass (each document belongs to exactly one category) or multilabel (each document belongs to one or more categories)?
Accuracy is a poor choice for judging classifier performance. You should rather use precision vs recall, the precision-recall breakeven point (PRBP), F1, or AUC, and look at the precision vs recall curve, where recall (x) is plotted against precision (y) based on the value of your confidence threshold (whether a document belongs to a category or not). Usually you would build one binary classifier per category (positive training examples of one category vs all other training examples which don't belong to the current category). You'll have to choose an optimal confidence threshold per category. If you want to combine those single measures per category into a global performance measure, you'll have to micro-average (sum up all true positives, false positives, false negatives and true negatives and calculate combined scores) or macro-average (calculate the scores per category and then average those scores over all categories).
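For example, with sklearn.metrics (assuming y_true and y_pred are binary indicator arrays of shape (n_docs, n_categories), one column per binary classifier):
from sklearn.metrics import precision_recall_fscore_support
# micro: pool true/false positives and negatives over all categories;
# macro: compute the scores per category, then average them
micro_p, micro_r, micro_f1, _ = precision_recall_fscore_support(y_true, y_pred, average='micro')
macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(y_true, y_pred, average='macro')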
We have a corpus of tens of millions of documents, millions of training examples and thousands of categories (multilabel). Since we face serious training-time problems (the number of documents that are new, updated or deleted per day is quite high), we use a modified version of liblinear. But for smaller problems, using one of the python wrappers around liblinear (liblinear2scipy or scikit-learn) should work fine.
Is there a way to have a "none of the above" option for the classifier just in case the document doesn't fit into any of the categories?
You might get this effect simply by having a "none of the above" pseudo-category trained each time. If the max you can train is 5 categories (though I'm not sure why it's eating up quite so much RAM), train 4 actual categories from their actual 2K docs each, and a "none of the above" one with its 2K documents taken randomly from all the other 146 categories (about 13-14 from each if you want the "stratified sampling" approach, which may be sounder).
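A sketch of building that pseudo-category (docs_by_category is a placeholder dict mapping each category name to its documents):
import random
real_cats = batch_categories                    # placeholder: the 4 categories in this batch
other_cats = [c for c in all_categories if c not in real_cats]
# ~14 random docs from each of the remaining ~146 categories gives ~2K
# roughly stratified "none of the above" examples
none_of_the_above = []
for cat in other_cats:
    none_of_the_above += random.sample(docs_by_category[cat], 14)
train_set = [(doc, cat) for cat in real_cats for doc in docs_by_category[cat]]
train_set += [(doc, 'none_of_the_above') for doc in none_of_the_above]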
Still, this feels like a bit of a kludge, and you might be better off with a completely different approach -- find a multi-dimensional document representation that separates your 300K pre-tagged docs into 150 reasonably distinct clusters, then just assign each of the other yet-untagged docs to the appropriate cluster as thus determined. I don't think NLTK has anything directly available to support this kind of thing, but, hey, NLTK's been growing so fast that I may well have missed something...;-)
