Vocabulary size vs. number of samples

Vocabulary size vs. number of samples - python

Community,
I am training an SVM for sentiment classification. For the training I use the sentiment140 twitter dataset.
For this process I tried two different approaches:
Use only 10 % (160,000 messages) of the Data for training but don't limit the feature size (only 10% for computational reasons)
Limit the feature size (vocabulary size) to 12,000 or 20,000 words which allows me to use up to 400,000 twitter messages.
The tokenizer/vectorizer I use uses 1-grams and 2-grams.
Comparing the results, method 1 brings about 80 % accuracy whereas the second method only shows about 79.5 % accuracy.
I am quite unsure about these results and what will perform best for an unlabeled dataset I use for sentiment classification.
The limitation of the vocabulary size would focus the classification on the most common words and on the other hand, could provide more robust learning because of a larger dataset and maybe more accurate classification for the unlabeled dataset?
Having no limitations on the vocab size with a smaller dataset brings the benefit of marginally higher classification accuracy and therefore a better classification for the unlabeled dataset?
I am rather new to this topic and thus not sure about how to decide...
Maybe someone of you can give me an recommendation what would be better?
Thanks in advance!

Related

Accuracy with TF-IDF and non-TF-IDF features

I run a Random Forest algorithm with TF-IDF and non-TF-IDF features.
In total the features are around 130k in number (after a feature selection conducted on the TF-IDF features) and the observations of the training set are around 120k in number.
Around 500 of them are the non-TF-IDF features.
The issue is that the accuracy of the Random Forest on the same test set etc with
- only the non-TF-IDF features is 87%
- the TF-IDF and non-TF-IDF features is 76%
This significant aggravation of the accuracy raises some questions in my mind.
The relevant piece of code of mine with the training of the models is the following:
drop_columns = ['labels', 'complete_text_1', 'complete_text_2']
# Split to predictors and targets
X_train = df.drop(columns=drop_columns).values
y_train = df['labels'].values
# Instantiate, train and transform with tf-idf models
vectorizer_1 = TfidfVectorizer(analyzer="word", ngram_range=(1,2), vocabulary=tf_idf_feature_names_selected)
X_train_tf_idf_1 = vectorizer_1.fit_transform(df['complete_text_1'])
vectorizer_2 = TfidfVectorizer(analyzer="word", ngram_range=(1,2), vocabulary=tf_idf_feature_names_selected)
X_train_tf_idf_2 = vectorizer_2.fit_transform(df['complete_text_2'])
# Covert the general features to sparse array
X_train = np.array(X_train, dtype=float)
X_train = csr_matrix(X_train)
# Concatenate the general features and tf-idf features array
X_train_all = hstack([X_train, X_train_tf_idf_1, X_train_tf_idf_2])
# Instantiate and train the model
rf_classifier = RandomForestClassifier(n_estimators=150, random_state=0, class_weight='balanced', n_jobs=os.cpu_count()-1)
rf_classifier.fit(X_train_all, y_train)
Personally, I have not seen any bug in my code (this piece above and in general).
The hypothesis which I have formulated to explain this decrease in accuracy is the following.
The number of non-TF-IDF features is only 500 (out of the 130k features in total)
This gives some chances that the non-TF-IDF features are not picked that much at each split by the trees of the random forest (eg because of max_features etc)
So if the non-TF-IDF features do actually matter then this will create problems because they are not taken enough into account.
Related to this, when I check the features' importances of the random forest after training it I see the importances of the non-TF-IDF features being very very low (although I am not sure how reliable indicator are the feature importances especially with TF-IDF features included).
Can you explain differently the decrease in accuracy at my classifier?
In any case, what would you suggest doing?
Some other ideas of combining the TF-IDF and non-TF-IDF features are the following.
One option would be to have two separate (random forest) models - one for the TF-IDF features and one for the non-TF-IDF features.
Then the results of these two models will be combined either by (weighted) voting or meta-classification.

Your view that 130K of features is way too much for the Random forest sounds right. You didn't mention how many examples you have in your dataset and that would be cruccial to the choice of the possible next steps. Here are a few ideas on top of my head.
If number of datapoints is large enough you myabe want to train some transformation for the TF-IDF features - e.g. you might want to train a small-dimensional embeddings of these TF-IDF features into, say 64-dimensional space and then e.g. a small NN on top of that (even a linear model maybe). After you have embeddings you could use them as transforms to generate 64 additional features for each example to replace TF-IDF features for RandomForest training. Or alternatively just replace the whole random forest with a NN of such architecture that e.g. TF-IDFs are all combined into a few neurons via fully-connected layers and later concatened with other features (pretty much same as embeddings but as a part of NN).
If you don't have enough data to train a large NN maybe you can try to train GBDT ensemble instead of random forest. It probably should do much better job at picking the good features compared to random forest which definitely likely to be affected a lot by a lot of noisy useless features. Also you can first train some crude version and then do a feature selection based on that (again, I would expect it should do a more reasonable job compared to random forest).

My guess is that your hypothesis is partly correct.
When using the full dataset (in the 130K feature model), each split in the tree uses only a small fraction of the 500 non-TF-IDF features. So if the non-TF-IDF features are important, then each split misses out on a lot of useful data. The data that is ignored for one split will probably be used for a different split in the tree, but the result isn't as good as it would be when more of the data is used at every split.
I would argue that there are some very important TF-IDF features, too. The fact that we have so many features means that a small fraction of those features is considered at each split.
In other words: the problem isn't that we're weakening the non-TF-IDF features. The problem is that we're weakening all of the useful features (both non-TF-IDF and TF-IDF). This is along the lines of Alexander's answer.
In light of this, your proposed solutions won't solve the problem very well. If you make two random forest models, one with 500 non-TF-IDF features and the other with 125K TF-IDF features, the second model will perform poorly, and negatively influence the results. If you pass the results of the 500 model as an additional feature to the 125K model, you're still underperforming.
If we want to stick with random forests, a better solution would be to increase the max_features and/or the number of trees. This will increase the odds that useful features are considered at each split, leading to a more accurate model.

Given the dataset, how to select the learning algorithm?

I've to build an ML model to classify sentences into different categories. I have a dataset with 2 columns (sentence and label) and 350 rows i.e. with shape (350, 2). To convert the sentences into numeric representation I've used TfIdf vectorization, and so the transformed dataset now has 452 columns (451 columns were obtained using TfIdf, and 1 is the label) i.e. with shape (350, 452). More generally speaking, I have a dataset with a lot more features than training samples. In such a scenario what's the best classification algorithm to use? Logistic Regression, SVM (again what kernel?), neural networks (again which architecture?), naive Bayes or is there any other algorithm?
How about if I get more training samples in the future (but the number of columns doesn't increase much), say with a shape (10000, 750)?
Edit: The sentences are actually narrations from bank statements. I have around 10 to 15 labels, all of which I have labelled manually. Eg. Tax, Bank Charges, Loan etc. In future I do plan to get more statements and I will be labelling them as well. I believe I may end up having around 20 labels at most.

With such a small training set, I think you would only get any reasonable results by getting some pre-trained language model such as GPT-2 and fine tune to your problem. That probably is still true even for a larger dataset, a neural net would probably still do best even if you train your own from scratch. Btw, how many labels do you have? What kind of labels are those?

What should be the ideal validation accuracy of a LSTM based text generator?

I modelled a LSTM based text generator using a data set I have. The purpose of the model is to predict the end of sentences. My training is showing a validation accuracy of around 81%. When reading through a couple of articles, I found that unlike a classification problem I should be worried more about loss rather than accuracy. Is this the case, and if so what would be an ideal loss value? Right now my loss is around 1.5+.

There is no minimum limit for accuracy in any of the machine learning or Deep Learning problem.It's as many say garbage IN, garbage OUT
Quality of data and with a decent model will give you good accuracy.
Generally, these accuracy benchmark is set for the standard dataset available on an open internet like SQUAD, RACE, SWAG, GLUE and many more.
Usually, the state of the art models will check their performance on these datasets and set a accuarcy benchmark specific to these dataset.
Coming to your problem, you can tell the model is performing goog based on accuracy, and the evaluation metric you are using, generally in NLP to calculate loss is bit tricky. Considering your case where you are trying to predict the end of a sentence where there is no fixed dimension the reason being that the same information can be expressed in multiple ways with varying number of words.
By looking at the validation and test accuracy of your model it looks decent, but before pushing the accuracy you should worry about the overfitting problem also, the model should not be biased on your data.
You can try with different metrics to evaluate the model and you can compare the results on your own.
I hope this answers your question, Happy Learning!

How to reduce the number of features in text classification?

I'm doing dialect text classification and I'm using countVectorizer with naive bayes. The number of features are too many, I have collected 20k tweets with 4 dialects. every dialect have 5000 tweets. And the total number of features are 43K. I was thinking maybe that's why I could be having overfitting. Because the accuracy has dropped a lot when I tested on new data. So how can I fix the number of features to avoid overfitting the data?

You can set the parameter max_features to 5000 for instance, It might help with overfitting. You could also tinker with max_df (for instance set it to 0.95)

This drop on test data is caused by curse of dimensionality. You can use some dimensionality reduction method to reduce this effect. Possible choice is Latent Semantic Analysis implemented in sklearn.

Classifying Documents into Categories

I've got about 300k documents stored in a Postgres database that are tagged with topic categories (there are about 150 categories in total). I have another 150k documents that don't yet have categories. I'm trying to find the best way to programmaticly categorize them.
I've been exploring NLTK and its Naive Bayes Classifier. Seems like a good starting point (if you can suggest a better classification algorithm for this task, I'm all ears).
My problem is that I don't have enough RAM to train the NaiveBayesClassifier on all 150 categoies/300k documents at once (training on 5 categories used 8GB). Furthermore, accuracy of the classifier seems to drop as I train on more categories (90% accuracy with 2 categories, 81% with 5, 61% with 10).
Should I just train a classifier on 5 categories at a time, and run all 150k documents through the classifier to see if there are matches? It seems like this would work, except that there would be a lot of false positives where documents that don't really match any of the categories get shoe-horned into on by the classifier just because it's the best match available... Is there a way to have a "none of the above" option for the classifier just in case the document doesn't fit into any of the categories?
Here is my test class http://gist.github.com/451880

You should start by converting your documents into TF-log(1 + IDF) vectors: term frequencies are sparse so you should use python dict with term as keys and count as values and then divide by total count to get the global frequencies.
Another solution is to use the abs(hash(term)) for instance as positive integer keys. Then you an use scipy.sparse vectors which are more handy and more efficient to perform linear algebra operation than python dict.
Also build the 150 frequencies vectors by averaging the frequencies of all the labeled documents belonging to the same category. Then for new document to label, you can compute the cosine similarity between the document vector and each category vector and choose the most similar category as label for your document.
If this is not good enough, then you should try to train a logistic regression model using a L1 penalty as explained in this example of scikit-learn (this is a wrapper for liblinear as explained by #ephes). The vectors used to train your logistic regression model should be the previously introduced TD-log(1+IDF) vectors to get good performance (precision and recall). The scikit learn lib offers a sklearn.metrics module with routines to compute those score for a given model and given dataset.
For larger datasets: you should try the vowpal wabbit which is probably the fastest rabbit on earth for large scale document classification problems (but not easy to use python wrappers AFAIK).

How big (number of words) are your documents? Memory consumption at 150K trainingdocs should not be an issue.
Naive Bayes is a good choice especially when you have many categories with only a few training examples or very noisy trainingdata. But in general, linear Support Vector Machines do perform much better.
Is your problem multiclass (a document belongs only to one category exclusivly) or multilabel (a document belongs to one or more categories)?
Accuracy is a poor choice to judge classifier performance. You should rather use precision vs recall, precision recall breakeven point (prbp), f1, auc and have to look at the precision vs recall curve where recall (x) is plotted against precision (y) based on the value of your confidence-threshold (wether a document belongs to a category or not). Usually you would build one binary classifier per category (positive training examples of one category vs all other trainingexamples which don't belong to your current category). You'll have to choose an optimal confidence threshold per category. If you want to combine those single measures per category into a global performance measure, you'll have to micro (sum up all true positives, false positives, false negatives and true negatives and calc combined scores) or macro (calc score per category and then average those scores over all categories) average.
We have a corpus of tens of million documents, millions of training examples and thousands of categories (multilabel). Since we face serious training time problems (the number of documents are new, updated or deleted per day is quite high), we use a modified version of liblinear. But for smaller problems using one of the python wrappers around liblinear (liblinear2scipy or scikit-learn) should work fine.

Is there a way to have a "none of the
above" option for the classifier just
in case the document doesn't fit into
any of the categories?
You might get this effect simply by having a "none of the above" pseudo-category trained each time. If the max you can train is 5 categories (though I'm not sure why it's eating up quite so much RAM), train 4 actual categories from their actual 2K docs each, and a "none of the above" one with its 2K documents taken randomly from all the other 146 categories (about 13-14 from each if you want the "stratified sampling" approach, which may be sounder).
Still feels like a bit of a kludge and you might be better off with a completely different approach -- find a multi-dimensional doc measure that defines your 300K pre-tagged docs into 150 reasonably separable clusters, then just assign each of the other yet-untagged docs to the appropriate cluster as thus determined. I don't think NLTK has anything directly available to support this kind of thing, but, hey, NLTK's been growing so fast that I may well have missed something...;-)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.