Fine-tuning SentenceTransformer on text classification task - python

I want to fine-tune a SentenceTransformer model on a multi-class labeled dataset for text classification.
The tutorials I have seen so far require training data in a specific format, such as a list of positive triplets like (sentence1, sentence2, 1) and a list of negative triplets like (sentence1, sentence3, 0).
A typical classification dataset does not look like that. It is a list of (sentence1, class1), (sentence2, class2), (sentence3, class1), (sentence4, class3), etc.
Is there any ready-made logic, code, or tutorial that shows how to take a typical classification dataset, generate the necessary triplet lists from it by permutations and combinations, and then train a SentenceTransformer successfully, hopefully with better accuracy?

If you have a small number of samples, i.e. few-shot training, SetFit can be used.
If you have a large number of samples for fine-tuning, there is an unsupervised method called TSDAE.
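The pair generation asked about in the question can be sketched in plain Python. This is only an illustration, not SetFit's or sentence-transformers' own code: the function name `make_pairs` and the parameters `negatives_per_positive` and `seed` are made up for this example. It builds every same-class combination as a positive pair and samples negatives from the other classes.

```python
import itertools
import random


def make_pairs(samples, negatives_per_positive=1, seed=0):
    """Turn (sentence, label) samples into (sentence_a, sentence_b, target) pairs.

    target is 1 when both sentences share a label and 0 otherwise.
    """
    rng = random.Random(seed)

    # Group sentences by their class label.
    by_label = {}
    for sentence, label in samples:
        by_label.setdefault(label, []).append(sentence)

    pairs = []
    for label, sentences in by_label.items():
        # Sentences from every other class are the negative candidates.
        others = [s for other, group in by_label.items() if other != label for s in group]
        for a, b in itertools.combinations(sentences, 2):
            pairs.append((a, b, 1))
            for _ in range(negatives_per_positive):
                if others:
                    pairs.append((a, rng.choice(others), 0))
    return pairs


samples = [
    ("sentence1", "class1"),
    ("sentence2", "class2"),
    ("sentence3", "class1"),
    ("sentence4", "class3"),
]
pairs = make_pairs(samples)
for pair in pairs:
    print(pair)
```

The output has the (sentence1, sentence2, 0/1) shape that the pair-based tutorials expect. Note that the number of positive pairs grows quadratically with class size, so for large classes some subsampling would be needed.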


Vocabulary size vs. number of samples

I am training an SVM for sentiment classification. For the training I use the sentiment140 twitter dataset.
For this process I tried two different approaches:
1. Use only 10 % (160,000 messages) of the data for training, but don't limit the feature size (only 10 % for computational reasons).
2. Limit the feature size (vocabulary size) to 12,000 or 20,000 words, which allows me to use up to 400,000 Twitter messages.
The tokenizer/vectorizer I use uses 1-grams and 2-grams.
Comparing the results, method 1 gives about 80 % accuracy, whereas method 2 gives only about 79.5 %.
I am quite unsure about these results and about which approach will perform best on the unlabeled dataset I want to classify.
Limiting the vocabulary size would focus the classification on the most common words, but it could also give more robust learning because of the larger dataset, and maybe a more accurate classification of the unlabeled dataset?
Leaving the vocabulary size unlimited with a smaller dataset gives marginally higher classification accuracy, and therefore a better classification of the unlabeled dataset?
I am rather new to this topic and thus not sure how to decide...
Maybe someone can give me a recommendation on which would be better?
Thanks in advance!
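To make the trade-off concrete, here is a minimal pure-Python sketch of what limiting the vocabulary does with 1-grams and 2-grams. The helper names (`ngrams`, `build_vocabulary`, `coverage`) and the toy tweets are invented for illustration, and whitespace tokenization is an assumption, not the tokenizer used in the question. It keeps only the most frequent n-grams and measures how many n-gram occurrences the reduced vocabulary can still represent.

```python
from collections import Counter


def ngrams(text, n_max=2):
    """Yield all 1-grams up to n_max-grams of a whitespace-tokenized text."""
    tokens = text.lower().split()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def build_vocabulary(texts, max_features=None):
    """Keep the max_features most frequent n-grams (all of them if None)."""
    counts = Counter(g for t in texts for g in ngrams(t))
    return [term for term, _ in counts.most_common(max_features)]


def coverage(texts, vocabulary):
    """Fraction of n-gram occurrences in texts that the vocabulary contains."""
    vocab = set(vocabulary)
    grams = [g for t in texts for g in ngrams(t)]
    return sum(g in vocab for g in grams) / len(grams)


tweets = ["i love this phone", "i hate this phone", "love love love it"]
full = build_vocabulary(tweets)
limited = build_vocabulary(tweets, max_features=5)
print(len(full), coverage(tweets, full))
print(len(limited), coverage(tweets, limited))
```

Running the same coverage check on a held-out sample of the real tweets would show how much information a 12,000 or 20,000 term limit actually discards.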

Accuracy with TF-IDF and non-TF-IDF features

I run a Random Forest algorithm with TF-IDF and non-TF-IDF features.
In total the features are around 130k in number (after a feature selection conducted on the TF-IDF features) and the observations of the training set are around 120k in number.
Around 500 of them are the non-TF-IDF features.
The issue is that the accuracy of the Random Forest on the same test set etc with
- only the non-TF-IDF features is 87%
- the TF-IDF and non-TF-IDF features is 76%
This significant drop in accuracy raises some questions in my mind.
The relevant piece of code of mine with the training of the models is the following:
import os
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
drop_columns = ['labels', 'complete_text_1', 'complete_text_2']
# Split to predictors and targets
X_train = df.drop(columns=drop_columns).values
y_train = df['labels'].values
# Instantiate, train and transform with tf-idf models
vectorizer_1 = TfidfVectorizer(analyzer="word", ngram_range=(1,2), vocabulary=tf_idf_feature_names_selected)
X_train_tf_idf_1 = vectorizer_1.fit_transform(df['complete_text_1'])
vectorizer_2 = TfidfVectorizer(analyzer="word", ngram_range=(1,2), vocabulary=tf_idf_feature_names_selected)
X_train_tf_idf_2 = vectorizer_2.fit_transform(df['complete_text_2'])
# Convert the general features to a sparse array
X_train = np.array(X_train, dtype=float)
X_train = csr_matrix(X_train)
# Concatenate the general features and tf-idf features arrays
X_train_all = hstack([X_train, X_train_tf_idf_1, X_train_tf_idf_2])
# Instantiate and train the model
rf_classifier = RandomForestClassifier(n_estimators=150, random_state=0, class_weight='balanced', n_jobs=os.cpu_count()-1).fit(X_train_all, y_train)
Personally, I have not seen any bug in my code (this piece above and in general).
The hypothesis which I have formulated to explain this decrease in accuracy is the following.
The number of non-TF-IDF features is only 500 (out of the 130k features in total)
This gives some chances that the non-TF-IDF features are not picked that much at each split by the trees of the random forest (eg because of max_features etc)
So if the non-TF-IDF features do actually matter then this will create problems because they are not taken enough into account.
Related to this, when I check the feature importances of the random forest after training, the importances of the non-TF-IDF features are very low (although I am not sure how reliable feature importances are as an indicator, especially with TF-IDF features included).
Is there another explanation for the decrease in my classifier's accuracy?
In any case, what would you suggest doing?
Some other ideas of combining the TF-IDF and non-TF-IDF features are the following.
One option would be to have two separate (random forest) models - one for the TF-IDF features and one for the non-TF-IDF features.
Then the results of these two models will be combined either by (weighted) voting or meta-classification.
Your view that 130K features is far too many for the random forest sounds right. You didn't mention how many examples you have in your dataset, and that would be crucial to the choice of possible next steps. Here are a few ideas off the top of my head.
If the number of data points is large enough, you may want to train a transformation for the TF-IDF features. For example, you could train a low-dimensional embedding of the TF-IDF features into, say, a 64-dimensional space, with a small NN on top of that (maybe even a linear model). Once you have the embeddings, you could use them as a transform that generates 64 additional features for each example, replacing the TF-IDF features in the random forest training. Alternatively, replace the whole random forest with an NN whose architecture combines all the TF-IDF features into a few neurons via fully connected layers and later concatenates them with the other features (much the same as the embeddings, but as part of the NN).
If you don't have enough data to train a large NN, you could try training a GBDT ensemble instead of a random forest. It should probably do a much better job of picking the good features than a random forest, which is very likely to be hurt by a large number of noisy, useless features. You could also first train a crude version and then do feature selection based on it (again, I would expect this to do a more reasonable job than a random forest).
My guess is that your hypothesis is partly correct.
When using the full feature set (in the 130K feature model), each split in the tree considers only a small fraction of the 500 non-TF-IDF features. So if the non-TF-IDF features are important, then each split misses out on a lot of useful data. The data that is ignored for one split will probably be used for a different split in the tree, but the result isn't as good as it would be when more of the data is used at every split.
I would argue that there are some very important TF-IDF features, too. The fact that we have so many features means that a small fraction of those features is considered at each split.
In other words: the problem isn't that we're weakening the non-TF-IDF features. The problem is that we're weakening all of the useful features (both non-TF-IDF and TF-IDF). This is along the lines of Alexander's answer.
In light of this, your proposed solutions won't solve the problem very well. If you make two random forest models, one with 500 non-TF-IDF features and the other with 125K TF-IDF features, the second model will perform poorly, and negatively influence the results. If you pass the results of the 500 model as an additional feature to the 125K model, you're still underperforming.
If we want to stick with random forests, a better solution would be to increase the max_features and/or the number of trees. This will increase the odds that useful features are considered at each split, leading to a more accurate model.
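A quick back-of-the-envelope calculation shows the size of the effect. It assumes each split draws floor(sqrt(n_features)) candidate features uniformly without replacement, which is the common "sqrt" heuristic for classification forests; the actual setting depends on the library version and configuration, and real implementations may inspect more features when no valid split is found. The function name `p_at_least_one` is made up for this sketch.

```python
import math

n_total = 130_000   # all features, as stated in the question
n_general = 500     # non-TF-IDF features
# Assumption: candidates per split follow the "sqrt" heuristic.
k = int(math.sqrt(n_total))


def p_at_least_one(n_total, n_special, k):
    """Probability that k features drawn without replacement include
    at least one of the n_special features."""
    p_none = 1.0
    for i in range(k):
        p_none *= (n_total - n_special - i) / (n_total - i)
    return 1.0 - p_none


# Expected number of non-TF-IDF features among the candidates of one split.
expected_general = k * n_general / n_total
p_default = p_at_least_one(n_total, n_general, k)

# The same quantities with a much larger candidate set per split.
k_large = 5_000
expected_large = k_large * n_general / n_total
p_large = p_at_least_one(n_total, n_general, k_large)

print(k, round(expected_general, 2), round(p_default, 2))
print(k_large, round(expected_large, 2), round(p_large, 2))
```

Under these assumptions a split sees on average only about 1.4 of the 500 non-TF-IDF features, and roughly one split in four sees none at all. Raising max_features changes that picture substantially, at the cost of slower training and more correlated trees.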

Given the dataset, how to select the learning algorithm?

I've to build an ML model to classify sentences into different categories. I have a dataset with 2 columns (sentence and label) and 350 rows i.e. with shape (350, 2). To convert the sentences into numeric representation I've used TfIdf vectorization, and so the transformed dataset now has 452 columns (451 columns were obtained using TfIdf, and 1 is the label) i.e. with shape (350, 452). More generally speaking, I have a dataset with a lot more features than training samples. In such a scenario what's the best classification algorithm to use? Logistic Regression, SVM (again what kernel?), neural networks (again which architecture?), naive Bayes or is there any other algorithm?
How about if I get more training samples in the future (but the number of columns doesn't increase much), say with a shape (10000, 750)?
Edit: The sentences are actually narrations from bank statements. I have around 10 to 15 labels, all of which I have labelled manually. Eg. Tax, Bank Charges, Loan etc. In future I do plan to get more statements and I will be labelling them as well. I believe I may end up having around 20 labels at most.
With such a small training set, I think you would only get reasonable results by taking a pre-trained language model such as GPT-2 and fine-tuning it on your problem. That is probably still true for a larger dataset: a neural net would probably still do best, even if you train your own from scratch. Btw, how many labels do you have? What kind of labels are they?

Text Classification Approach

I have data with 2 important columns, Product Name and Product Category. I wanted to classify a search term into a category. The approach (in Python using Sklearn & DaskML) to create a classifier was:
Clean Product Name column for stopwords, numbers, etc.
Create 90% 10% train-test split
Convert text to vector using OneHotEncoder
Create classifier (Naive Bayes) on the training data
Test the classifier
I realized the OneHotEncoder (or any encoder) converts the text to numbers by creating a matrix that takes into account where and how many times a word occurs.
Q1. Do I need to convert from Word to Vectors before train-test split or after train-test split?
Q2. When I search for new words (which may not be in the text already), how will I classify them? If I encode the search term separately, the encoding will have nothing to do with the encoder used for the training data. Can anybody help me with an approach that lets me classify a search term into a category even if the term doesn't exist in the training data?
Q1. Do I need to convert from Words to Vectors before train-test split?
Answer: Every algorithm takes some numeric representation of its inputs, so you have to convert from words to vectors. There is no alternative to this. Apart from OneHotEncoder, there are other approaches such as CountVectorizer and TfidfVectorizer, which are recommended over one-hot encoding.
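As a rough illustration of both questions, the sketch below uses scikit-learn's CountVectorizer and MultinomialNB on invented toy data (the product names and categories are hypothetical stand-ins for the Product Name and Product Category columns). One common pattern, shown here as a suggestion rather than the only option, is to split first, fit the vectorizer on the training part only, and reuse the same fitted vectorizer for the test set and for new search terms. Words the vectorizer has never seen are simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for Product Name / Product Category.
names = [
    "red cotton shirt", "white cotton shirt", "green linen shirt",
    "blue striped shirt", "black formal shirt",
    "blue denim jeans", "black denim jeans", "grey slim jeans",
    "white ripped jeans", "dark stretch jeans",
]
categories = ["shirts"] * 5 + ["jeans"] * 5

# Split first, so the test set stays unseen by the vectorizer.
X_train, X_test, y_train, y_test = train_test_split(
    names, categories, test_size=0.1, random_state=0
)

# Fit the vocabulary on the training text only, then reuse it everywhere.
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

clf = MultinomialNB().fit(X_train_vec, y_train)

# A search term with a word ("xl") that never occurred in training:
# transform() drops the unknown word and keeps the known ones.
search_term = ["cotton shirt xl"]
predicted = clf.predict(vectorizer.transform(search_term))
print(predicted[0])
```

If every word in a search term is unknown, the vector is all zeros and the prediction falls back to the class priors, so terms with no overlap with the training vocabulary cannot be classified meaningfully with this representation.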

Text classification with Naive Bayes

I am learning NLP and noticed that TextBlob classification based on Naive Bayes (TextBlob is built on top of NLTK) works fine when the training data is a list of sentences, and does not work at all when the training data consists of individual words (each word with an assigned classification).
Because you don't have single words in the training data.
Usually the training and evaluation/testing data are supposed to be selected with identical distribution. Biases or skews are usually problematic. In very few cases you can train the model to do one thing and use it to do something else.
In your case, the model likely spreads the weights over the words in the sentence. So when you pick a single word, you only get a small portion of the weight represented.
To get it to work you should add single word examples to your training data.
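The effect of evidence being spread over the words of a sentence can be seen in a tiny hand-rolled multinomial Naive Bayes with Laplace smoothing. This is a simplified sketch on invented data, not TextBlob's or NLTK's implementation (which uses a different feature representation); the names `fit` and `posterior` are made up for this example.

```python
import math
from collections import Counter, defaultdict

train = [
    ("I love this great movie", "pos"),
    ("what a great wonderful day", "pos"),
    ("I hate this awful movie", "neg"),
    ("what a terrible awful day", "neg"),
]


def fit(samples):
    """Count words per label and collect the vocabulary."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in samples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def posterior(text, model, label):
    """P(label | text) under multinomial Naive Bayes with add-one smoothing."""
    word_counts, label_counts, vocab = model
    n_docs = sum(label_counts.values())
    scores = {}
    for lab in label_counts:
        total = sum(word_counts[lab].values())
        score = math.log(label_counts[lab] / n_docs)
        for w in text.lower().split():
            if w in vocab:  # words never seen in training add no evidence
                score += math.log((word_counts[lab][w] + 1) / (total + len(vocab)))
        scores[lab] = score
    norm = sum(math.exp(s) for s in scores.values())
    return math.exp(scores[label]) / norm


model = fit(train)
p_word = posterior("great", model, "pos")
p_sentence = posterior("I love this great movie", model, "pos")
print(round(p_word, 3), round(p_sentence, 3))
```

In this toy setup the single word "great" yields a weaker positive posterior than the full sentence, because the sentence combines evidence from several words while the single word contributes only its own share.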
