Given the dataset, how to select the learning algorithm? - python

I have to build an ML model to classify sentences into different categories. I have a dataset with 2 columns (sentence and label) and 350 rows, i.e. with shape (350, 2). To convert the sentences into a numeric representation I've used TF-IDF vectorization, so the transformed dataset now has 452 columns (451 obtained from TF-IDF, plus 1 for the label), i.e. shape (350, 452). More generally speaking, I have a dataset with far more features than training samples. In such a scenario, what's the best classification algorithm to use? Logistic regression, SVM (and if so, which kernel?), neural networks (and if so, which architecture?), naive Bayes, or some other algorithm?
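For reference, a minimal sketch of the vectorization step described above (scikit-learn assumed; the sentences and labels below are placeholders, and the 451 TF-IDF columns in the question simply correspond to the corpus vocabulary size):
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
# Placeholder stand-in for the 350-row (sentence, label) dataset
df = pd.DataFrame({"sentence": ["payment received for invoice 12",
                                "monthly account maintenance fee",
                                "interest credited to account"],
                   "label": ["CategoryA", "CategoryB", "CategoryC"]})
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["sentence"])  # shape: (n_rows, vocabulary_size)
y = df["label"]
print(X.shape)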
How about if I get more training samples in the future (but the number of columns doesn't increase much), say with a shape (10000, 750)?
Edit: The sentences are actually narrations from bank statements. I have around 10 to 15 labels, all of which I have labelled manually, e.g. Tax, Bank Charges, Loan, etc. In the future I plan to get more statements, and I will be labelling them as well. I believe I may end up with around 20 labels at most.

With such a small training set, I think you would only get reasonable results by taking a pre-trained language model such as GPT-2 and fine-tuning it on your problem. That is probably still true even for a larger dataset; a neural net would probably still do best even if you trained your own from scratch. By the way, how many labels do you have, and what kind of labels are they?
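A rough sketch of that fine-tuning route with the Hugging Face transformers library (an assumption: the answer mentions GPT-2, but this sketch uses DistilBERT as a stand-in classification backbone, and every model name, label and hyperparameter below is illustrative):
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
# Tiny placeholder dataset standing in for the 350 labelled sentences
sentences = ["tds deducted as per section 194a", "atm withdrawal charge"]
labels = [0, 1]  # e.g. 0 = Tax, 1 = Bank Charges
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=15)  # ~10-15 categories per the question
ds = Dataset.from_dict({"text": sentences, "label": labels})
ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True,
                                    padding="max_length", max_length=32),
            batched=True)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=ds,
)
trainer.train()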

Related

Vocabulary size vs. number of samples

Community,
I am training an SVM for sentiment classification. For the training I use the sentiment140 twitter dataset.
For this process I tried two different approaches:
Use only 10% (160,000 messages) of the data for training, but don't limit the feature size (only 10% for computational reasons).
Limit the feature size (vocabulary size) to 12,000 or 20,000 words which allows me to use up to 400,000 twitter messages.
The tokenizer/vectorizer I use works with 1-grams and 2-grams.
Comparing the results, the first method gives about 80% accuracy, whereas the second method only reaches about 79.5%.
I am quite unsure about these results and about what will perform best on the unlabeled dataset I use for sentiment classification.
Limiting the vocabulary size focuses the classification on the most common words; on the other hand, it could provide more robust learning because of the larger dataset and maybe more accurate classification of the unlabeled dataset.
Having no limit on the vocabulary size with a smaller dataset brings the benefit of marginally higher classification accuracy and therefore possibly better classification of the unlabeled dataset.
I am rather new to this topic and thus not sure how to decide...
Maybe someone can give me a recommendation on which would be better?
Thanks in advance!
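For concreteness, here is a minimal sketch of the two configurations described above (scikit-learn assumed; the original post shows no code, so the estimator and parameter values are placeholders):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
# Approach 1: ~10% of the data, unrestricted vocabulary
pipe_unrestricted = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LinearSVC(),
)
# Approach 2: full data, vocabulary capped at e.g. 20,000 terms
pipe_capped = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=20000),
    LinearSVC(),
)
# pipe_unrestricted.fit(texts_10_percent, labels_10_percent)
# pipe_capped.fit(texts_full, labels_full)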

Fine-tuning SentenceTransformer on text classification task

I wish to fine-tune a SentenceTransformer model on a multi-class labeled dataset for text classification.
The tutorials I have seen so far need the training data in a specific format, such as a list of positive triplets like (sentence1, sentence2, 1) and a list of negative triplets like (sentence1, sentence3, 0).
A typical classification dataset is not like that. It's a list of (sentence1, class1), (sentence2, class2), (sentence3, class1), (sentence4, class3), etc.
Is there any ready-made logic/code/tutorial which demonstrates how, given a typical classification dataset, to generate the necessary triplet lists by permutations and combinations, and then train the SentenceTransformer successfully, hopefully with better accuracy?
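One possible (assumed, not canonical) recipe: build labelled pairs from the (sentence, class) rows - same class gives label 1, different class gives label 0 - and fine-tune with the sentence-transformers fit() API. The model name and toy data below are placeholders:
import itertools
import random
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses
data = [("sentence about topic a", "class1"),
        ("another sentence about topic a", "class1"),
        ("sentence about topic b", "class2"),
        ("sentence about topic c", "class3")]
# Every pair of rows becomes a training example; the label marks same/different class
examples = []
for (s1, c1), (s2, c2) in itertools.combinations(data, 2):
    examples.append(InputExample(texts=[s1, s2], label=float(c1 == c2)))
random.shuffle(examples)
model = SentenceTransformer("all-MiniLM-L6-v2")
loader = DataLoader(examples, batch_size=8, shuffle=True)
loss = losses.ContrastiveLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)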
If you have a small number of samples, i.e. for few-shot training, SetFit can be used.
If you have a large number of samples for fine-tuning, there is an unsupervised approach called TSDAE.

Accuracy with TF-IDF and non-TF-IDF features

I run a Random Forest algorithm with TF-IDF and non-TF-IDF features.
In total there are around 130k features (after a feature selection conducted on the TF-IDF features), and the training set has around 120k observations.
Around 500 of these features are non-TF-IDF features.
The issue is the accuracy of the random forest on the same test set etc.:
- with only the non-TF-IDF features it is 87%
- with the TF-IDF and non-TF-IDF features it is 76%
This significant drop in accuracy raises some questions in my mind.
The relevant piece of code of mine with the training of the models is the following:
import os
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
drop_columns = ['labels', 'complete_text_1', 'complete_text_2']
# Split into predictors and targets
X_train = df.drop(columns=drop_columns).values
y_train = df['labels'].values
# Instantiate, train and transform with tf-idf models
vectorizer_1 = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), vocabulary=tf_idf_feature_names_selected)
X_train_tf_idf_1 = vectorizer_1.fit_transform(df['complete_text_1'])
vectorizer_2 = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), vocabulary=tf_idf_feature_names_selected)
X_train_tf_idf_2 = vectorizer_2.fit_transform(df['complete_text_2'])
# Convert the general features to a sparse array
X_train = np.array(X_train, dtype=float)
X_train = csr_matrix(X_train)
# Concatenate the general features and the tf-idf feature arrays
X_train_all = hstack([X_train, X_train_tf_idf_1, X_train_tf_idf_2])
# Instantiate and train the model
rf_classifier = RandomForestClassifier(n_estimators=150, random_state=0, class_weight='balanced', n_jobs=os.cpu_count() - 1)
rf_classifier.fit(X_train_all, y_train)
Personally, I have not seen any bug in my code (this piece above and in general).
The hypothesis which I have formulated to explain this decrease in accuracy is the following.
The number of non-TF-IDF features is only 500 (out of the 130k features in total)
This means there is a good chance that the non-TF-IDF features are not picked very often at each split by the trees of the random forest (e.g. because of max_features etc.).
So if the non-TF-IDF features do actually matter, this creates problems because they are not taken into account enough.
Related to this, when I check the feature importances of the random forest after training, I see that the importances of the non-TF-IDF features are very, very low (although I am not sure how reliable an indicator the feature importances are, especially with TF-IDF features included).
Can you explain the decrease in accuracy of my classifier differently?
In any case, what would you suggest doing?
Some other ideas of combining the TF-IDF and non-TF-IDF features are the following.
One option would be to have two separate (random forest) models - one for the TF-IDF features and one for the non-TF-IDF features.
Then the results of these two models will be combined either by (weighted) voting or meta-classification.
Your view that 130K features is way too much for the random forest sounds right. You didn't mention how many examples you have in your dataset, and that would be crucial to the choice of possible next steps. Here are a few ideas off the top of my head.
If the number of data points is large enough, you may want to train some transformation of the TF-IDF features - e.g. a low-dimensional embedding of these TF-IDF features into, say, a 64-dimensional space, with a small NN (or even a linear model) on top of that. Once you have the embeddings, you could use them as a transform to generate 64 additional features per example that replace the TF-IDF features for random forest training. Alternatively, just replace the whole random forest with a NN whose architecture combines all TF-IDF inputs into a few neurons via fully-connected layers and later concatenates them with the other features (pretty much the same as the embeddings, but as part of the NN).
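A hedged sketch of that second variant (Keras assumed; all dimensions and layer sizes below are illustrative, not tuned):
from tensorflow import keras
from tensorflow.keras import layers
n_tfidf, n_other, n_classes = 130000, 500, 10  # placeholder dimensions
tfidf_in = keras.Input(shape=(n_tfidf,), name="tfidf")
other_in = keras.Input(shape=(n_other,), name="other")
# Squeeze the huge TF-IDF block down to a 64-dimensional representation
tfidf_emb = layers.Dense(64, activation="relu", name="tfidf_embedding")(tfidf_in)
# Concatenate with the non-TF-IDF features and classify
x = layers.Concatenate()([tfidf_emb, other_in])
x = layers.Dense(64, activation="relu")(x)
out = layers.Dense(n_classes, activation="softmax")(x)
model = keras.Model(inputs=[tfidf_in, other_in], outputs=out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# The trained "tfidf_embedding" layer could also be reused to produce 64 extra
# features per example for random forest training, as suggested above.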
If you don't have enough data to train a large NN, maybe you can try training a GBDT ensemble instead of the random forest. It should probably do a much better job of picking out the good features than the random forest, which is likely to be affected a lot by the many noisy, useless features. You can also first train some crude version and then do feature selection based on it (again, I would expect it to do a more reasonable job than the random forest).
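A sketch of that GBDT route (LightGBM assumed; X_train_all and y_train come from the question's code, and the hyperparameters are illustrative):
import numpy as np
import lightgbm as lgb
gbdt = lgb.LGBMClassifier(n_estimators=300, class_weight="balanced")
gbdt.fit(X_train_all, y_train)
# Crude feature selection based on the trained ensemble: keep only features
# that were actually used in at least one split
kept = np.where(gbdt.feature_importances_ > 0)[0]
X_train_selected = X_train_all.tocsr()[:, kept]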
My guess is that your hypothesis is partly correct.
When using the full dataset (in the 130K feature model), each split in the tree uses only a small fraction of the 500 non-TF-IDF features. So if the non-TF-IDF features are important, then each split misses out on a lot of useful data. The data that is ignored for one split will probably be used for a different split in the tree, but the result isn't as good as it would be when more of the data is used at every split.
I would argue that there are some very important TF-IDF features, too. The fact that we have so many features means that a small fraction of those features is considered at each split.
In other words: the problem isn't that we're weakening the non-TF-IDF features. The problem is that we're weakening all of the useful features (both non-TF-IDF and TF-IDF). This is along the lines of Alexander's answer.
In light of this, your proposed solutions won't solve the problem very well. If you make two random forest models, one with 500 non-TF-IDF features and the other with 125K TF-IDF features, the second model will perform poorly, and negatively influence the results. If you pass the results of the 500 model as an additional feature to the 125K model, you're still underperforming.
If we want to stick with random forests, a better solution would be to increase the max_features and/or the number of trees. This will increase the odds that useful features are considered at each split, leading to a more accurate model.
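In code, that suggestion amounts to something like the following (values are illustrative, not tuned; X_train_all and y_train are from the question):
from sklearn.ensemble import RandomForestClassifier
rf_classifier = RandomForestClassifier(
    n_estimators=500,     # more trees
    max_features=0.05,    # ~6,500 of the 130k features per split, vs. the sqrt default of roughly 360
    random_state=0,
    class_weight="balanced",
    n_jobs=-1,
)
rf_classifier.fit(X_train_all, y_train)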

Does the order of the training data matter when building a CNN model?

I'm reviewing the Keras CNN example located here, and I see that the input data has the positive and negative sentiment training samples randomly shuffled. I was wondering whether the CNN is sensitive to the ordering of the training data.
For clarity: if my y_train were of shape 100x1, in which indices 0-50 were all positive sentiments and 50-100 were all negative, would the results be any different compared to when every even index has a positive sentiment and every odd index a negative one?
Theoretically, for the last epoch, if the last half of the samples were only positive, your model might end up with a slight bias towards the positive class. This is why Keras' fit() function has a shuffle option: it shuffles the training samples every epoch to ensure there is no such bias, so your model trains on different batches and looks at your problem from many different angles. Unless you have a reason to believe you should not be doing this, I'd definitely recommend it.
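A minimal illustration of that (Keras assumed; the toy data and model are placeholders, not the example from the question):
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
# Toy "sorted" labels: first half positive, second half negative
x_train = np.random.rand(100, 20)
y_train = np.array([1] * 50 + [0] * 50)
model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# shuffle=True (the default) re-shuffles the samples before every epoch,
# so no epoch ends on a long run of a single class
model.fit(x_train, y_train, batch_size=10, epochs=5, shuffle=True)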
Shuffling the data when training a neural network in batches might be crucial for the performance of your model. A more detailed discussion on this topic is presented here on the data science stackexchange.
I just want to add that shuffling is in general also beneficial for the evaluation of your model, for example when you do cross-validation. In each train-test fold you want random samples, so that you can be sure your model generalizes well.

How to cancel the huge negative effect of my training data distribution on the subsequent neural network classification?

I need to train my network on data that has a normal distribution. I've noticed that my neural net has a very strong tendency to predict only the most frequent class label (comparing its predictions against the actual labels in a CSV file I exported).
What are some suggestions (other than cleaning the data to produce evenly distributed training data) that would help my neural net not to predict only the most frequent label?
UPDATE: Just wanted to mention that the suggestions made in the comments section did indeed work. I found, however, that adding an extra layer to my NN mitigated the problem.
Assuming the NN is trained using mini-batches, it is possible to simulate (instead of generating) an evenly distributed training set by making sure each mini-batch is evenly distributed.
For example, assuming a 3-class classification problem and a minibatch size=30, construct each mini-batch by randomly selecting 10 samples per class (with repetition, if necessary).
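A minimal sketch of that balanced-mini-batch construction (NumPy only; class labels and sizes are illustrative):
import numpy as np
def balanced_batch(X, y, classes, per_class=10, rng=np.random.default_rng(0)):
    # Draw per_class samples from each class (with replacement if a class has
    # too few samples), then shuffle the combined mini-batch
    idx = []
    for c in classes:
        pool = np.flatnonzero(y == c)
        idx.extend(rng.choice(pool, size=per_class, replace=len(pool) < per_class))
    idx = rng.permutation(np.array(idx))
    return X[idx], y[idx]
# Example: 3-class problem, mini-batch of 30 = 10 samples per class
X = np.random.rand(1000, 5)
y = np.random.choice([0, 1, 2], size=1000, p=[0.8, 0.15, 0.05])  # imbalanced labels
X_batch, y_batch = balanced_batch(X, y, classes=[0, 1, 2])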
