I am learning NLP and noticed that TextBlob classification based on Naive Bayes (TextBlob is built on top of NLTK; https://textblob.readthedocs.io/en/dev/classifiers.html) works fine when the training data is a list of sentences, but does not work at all when the training data consists of individual words (each word with an assigned classification).
Why?
Because you don't have single words in the training data.
Usually the training and evaluation/testing data are supposed to be drawn from the same distribution; biases or skews are usually problematic. Only in rare cases can you train a model to do one thing and use it to do something else.
In your case, the model likely spreads the weights over the words in the sentence. So when you pick a single word, you only get a small portion of the weight represented.
To get it to work you should add single word examples to your training data.
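For example, here is a minimal sketch using the TextBlob classifier from the linked docs (the labels and example texts are made up) that mixes single-word examples in with the sentences:

# train on a mix of sentences and single words so word-level inputs are covered
from textblob.classifiers import NaiveBayesClassifier

train = [
    ("I love this sandwich.", "pos"),          # sentence-level examples
    ("This is an amazing place!", "pos"),
    ("I do not like this restaurant.", "neg"),
    ("great", "pos"),                          # single-word examples added on top
    ("terrible", "neg"),
]
cl = NaiveBayesClassifier(train)
print(cl.classify("terrible"))                 # single-word input now has direct evidence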
Related
I wish to fine-tune a SentenceTransformer model with a multi-class labeled dataset for text classification.
The tutorials I have seen so far need training data in a specific format, such as a list of positive triplets like (sentence1, sentence2, 1) and a list of negative triplets like (sentence1, sentence3, 0).
A typical classification dataset is not like that. It's a list of (sentence1, class1), (sentence2, class2), (sentence3, class1), (sentence4, class3), etc.
Is there any ready-made logic/code/tutorial that demonstrates how, given a typical classification dataset, to generate the necessary triplet lists by permutations and combinations, and then train a SentenceTransformer successfully, hopefully with better accuracy?
If you have a small number of samples, i.e. for few-shot training, SetFit can be used.
If you have a large number of samples for fine-tuning, there is an unsupervised approach called TSDAE.
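For the pair-generation part of the question, a minimal plain-Python sketch (with placeholder data) could look like this: same-class sentences become positive pairs, cross-class sentences become negative pairs, and the resulting triplets can then be wrapped in sentence_transformers InputExample objects for a contrastive loss.

import itertools
import random

# placeholder dataset in the usual (sentence, class) form
data = [("sentence1", "class1"), ("sentence2", "class2"),
        ("sentence3", "class1"), ("sentence4", "class3")]

pairs = []
for (s1, c1), (s2, c2) in itertools.combinations(data, 2):
    pairs.append((s1, s2, 1 if c1 == c2 else 0))   # 1 = same class, 0 = different class

# optionally downsample the negatives so the two labels stay roughly balanced
positives = [p for p in pairs if p[2] == 1]
negatives = [p for p in pairs if p[2] == 0]
random.shuffle(negatives)
pairs = positives + negatives[:len(positives)]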
I'm trying to classify whether or not I liked books that I've read this year based on the text in the books. I'm using the preprocessing described here, and a variety of sklearn classification models.
At first I was just feeding the models the raw text, but I cleaned it based on GloVe embeddings (a process described here). The text was improved from 40% vocab, 80% coverage to 80% vocab, 98% coverage based on GloVe embeddings. However, for some reason, after cleaning the text, the accuracy of the classification models seemed to be the same or lower.
(The uncleaned-text and cleaned-text model results were attached in the original post.)
One thing to note is that the classes are quite imbalanced (75% of books were good as compared to 25% bad), so accuracy above 75% should be the baseline expectation, since 75% is what the model would get if it guessed good every single time.
I've linked my full notebook here so you can check out the specific code if that will be helpful for solving this issue. I'm incredibly confused; I can't see where I'm going wrong, but it can't be right that cleaning the text data has zero or negative impact on model accuracy.
I think the main point you are missing is that data cleaning is an empirical process. Text preprocessing may consist of removing stop words, punctuation, and numerals, and lowercasing, but whether this adds to the model's ability to learn and generalize remains to be seen through cross-validation, i.e. feeding the results of your preprocessing into model training and seeing whether it generalizes well to the test set.
In general, preprocessing (stop-word removal, etc.) works well for bag-of-words models because it reduces data dimensionality, since BOW data is long and sparse (see the curse of dimensionality for a possible theoretical foundation). The need for data preprocessing is diminished with word embeddings like word2vec or BERT.
In short, if you have any data preprocessing in mind, check if it helps your model to learn and generalize through properly constructed CV.
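As a concrete, hedged sketch of such a check (not the asker's exact pipeline; texts and labels are placeholders for your documents and good/bad labels): cross-validate the same classifier on raw and on preprocessed text and compare the scores.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

raw = Pipeline([("vect", TfidfVectorizer()),
                ("clf", LogisticRegression(max_iter=1000))])
cleaned = Pipeline([("vect", TfidfVectorizer(stop_words="english", lowercase=True)),
                    ("clf", LogisticRegression(max_iter=1000))])

# with imbalanced classes, a scoring metric such as "f1" may be more informative than accuracy
print("raw:     ", cross_val_score(raw, texts, labels, cv=5).mean())
print("cleaned: ", cross_val_score(cleaned, texts, labels, cv=5).mean())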
I have training data as two columns:
1. 'Sentences'
2. 'Relevant_text' (the text in this column is a subset of the text in the 'Sentences' column)
I tried training an RNN with LSTM directly, treating 'Sentences' as input and 'Relevant_text' as output, but the results were disappointing.
I want to know how to approach this type of problem? Does this kind of problem have a name? Which models should I explore?
If the target text is the subset of the input text, then, I believe, this problem can be solved as a tagging problem: make your neural network for each word predict whether it is "relevant" or not.
On the one hand, the problem of taking a text and selecting its subset that best reflects its meaning is called extractive summarization, and has lots of solutions, from the well known unsupervised textRank algorithm to complex BERT-based neural models.
On the other hand, technically your problem is just binary token-wise classification: you label each token (word or other symbol) of your input text as "relevant" or not, and train any neural network architecture that is good for tagging on this data. Specifically, I would look into architectures for POS tagging, because they are very well studied. Typically, it is a BiLSTM, maybe with a CRF head. More modern models are based on pretrained contextual word embeddings, such as BERT (maybe you won't even need to fine-tune them - just use them as a feature extractor and add a BiLSTM on top). If you want a more lightweight model, you can consider a CNN over pretrained and fixed word embeddings.
One final parameter you should spend time playing with is the threshold for classifying a word as relevant - maybe the default one, 0.5, is not the best choice. Maybe, instead of keeping all the tokens with probability-of-being-important higher than 0.5, you would like to keep the top k tokens, where k is fixed or is some percentage of the whole text.
Of course, more specific recommendations would be dataset-specific, so if you could share your dataset, it would be a great help.
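In the meantime, here is a small sketch of how the token-level labels could be constructed from the two columns (whitespace tokenization and bag-of-words matching are simplifying assumptions; a real setup would align spans):

def tag_tokens(sentence, relevant_text):
    # label each token of 'Sentences' with 1 if it occurs in 'Relevant_text', else 0
    relevant = set(relevant_text.lower().split())
    return [(tok, 1 if tok.lower() in relevant else 0) for tok in sentence.split()]

print(tag_tokens("The delivery was late but the food was great",
                 "the food was great"))
# these (token, label) sequences can then be fed to a BiLSTM/CRF or BERT token classifier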
I've been searching for an answer to this specific question for a few hours and while I've learned a lot, I still haven't figured it out.
I have a dataset of ~70,000 sentences with a subset of about 4,000 sentences that have been appropriately categorized; the rest are uncategorized. Currently I'm using a scikit-learn pipeline with CountVectorizer and TfidfTransformer to vectorize the data, however I'm only vectorizing based off the 4,000 sentences and then testing various models via cross-validation.
I'm wondering if there is a way to use Word2Vec or something similar to vectorize the entire corpus of data and then use these vectors with my subset of 4,000 sentences. My intention is to increase the accuracy of my model predictions by using word vectors that incorporate all of the semantic data in the corpus rather than just data from the 4,000 sentences.
The code I'm currently using is:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

svc = Pipeline([('vect', CountVectorizer(ngram_range=(3, 5))),
                ('tfidf', TfidfTransformer()),
                ('clf', LinearSVC()),
                ])
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
Where X_train and y_train are my features and labels, respectively. I also have a list z_all which includes all remaining uncategorized features.
Just getting pointed in the right direction (or told whether or not this is possible) would be super helpful.
Thank you!
I would say that the answer is yes: you can use Word2Vec or another similar word-embedding method to get vectors of each sentence in your data, and then use these vectors both as training and testing data in a linear Support Vector Machine (SVC).
And yes, you can first create those vectors for your entire corpus of ~70,000 sentences before actually doing any training on your data.
It is however not as straightforward as the approach you're currently using.
There are many different ways to do this so I'll just go through one of them to help you get the basics of how this can be done.
Before we start and see what possible steps you can follow, let's remember that the goal here is to get one vector for each and every sentence of your corpus.
If you don't know what word embeddings are, I highly suggest you read about them, but in short, a word embedding is just a way to link each word of a pre-defined vocabulary to a vector of a given dimension.
For instance, you would have:
# the vector associated with the word "cat" is the following vector of fixed-length
word_embeddings["cat"] = [0.0014, 0.6710, ..., 0.3281]
Now that you know this, here are the steps you could be following:
Tokenization - The first thing that you want to do is to tokenize each of your sentences. This can be done using an NLP library (spaCy for instance) that will help you to:
split each sentence into a list of words
remove any punctuation from these words and convert them to lowercase
remove stopwords - optionally
lemmatize all the words - optionally
Train a word embedding model - Now that you have each sentence as a pre-processed list of words, you need to train a word-embedding model using your corpus. There are many different algorithms to do that. I would suggest using Gensim with Word2Vec or fastText. You can also use pre-trained word embeddings, like GloVe, or anything that best fits your corpus in terms of language/context. Either way, this will allow you to:
have one vector of pre-defined size for each and every word in your corpus' vocabulary
get a list of equally-sized vectors for each sentence in your corpus
Adopting a weighting method - Once you have a list of vectors for each sentence in your corpus, and mainly because your sentences vary in length (some have 6 words, some others have 13 words, etc.), what you want to do is get a single vector for each and every sentence. To do this, you can simply weight the vectors corresponding to the words in each sentence. You can:
average all vectors
use weights like TF-IDF weights to give some words more importance than others
use other weighting methods...
Training and testing - Finally, all you're left to do is training a model using these vectors, for instance with a linear Support Vector Machine (SVC), and testing the accuracy of your model on a test dataset (you can also use a validation dataset).
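Putting the four steps above together, here is a minimal end-to-end sketch (assuming spaCy for tokenization, Gensim's Word2Vec for the embeddings, and plain averaging; all_sentences, labeled_sentences, and labels are placeholder names for the ~70,000-sentence corpus, the 4,000 labeled sentences, and their labels):

import numpy as np
import spacy
from gensim.models import Word2Vec
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def tokenize(text):
    # lowercase, drop punctuation and stop words
    return [tok.lower_ for tok in nlp(text) if not tok.is_punct and not tok.is_stop]

# Steps 1-2: tokenize the whole corpus and train the embedding on all of it
tokenized_corpus = [tokenize(s) for s in all_sentences]
w2v = Word2Vec(tokenized_corpus, vector_size=100, min_count=2, workers=4)

# Step 3: average word vectors to get one fixed-length vector per sentence
def sentence_vector(tokens):
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.wv.vector_size)

X = np.vstack([sentence_vector(tokenize(s)) for s in labeled_sentences])
y = labels

# Step 4: cross-validate a linear SVM on the labeled subset
print(cross_val_score(LinearSVC(), X, y, cv=5).mean())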
My opinion is, if you are going to use a word2vec embedding, use a pre-trained one or use generic text to generate it.
Word2vec embeddings are usually used to give meaning and context to your text data; if you train an embedding using only your own data, it might be biased and not representative of the language, which means its vectors don't carry any meaning.
After getting your embedding working, you also have to think about what to do with your words, because a sentence has 1 or more words (embeddings work at the word level), and you want to feed your models 1 sentence -> 1 vector, not 1 sentence -> N vectors.
People usually average or multiply those vectors, so for example, for the sentence "Hello there" and an embedding of 5 dims:
Hello -> [0, 0, .2, .3, .8]
there -> [.1, .2, 0, 0, .5]
AVG Hello there -> [.05, .1, .1, .15, .65]
This is what you want to use for your models!
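The same averaging can be written with NumPy (using the toy 5-dim vectors above):

import numpy as np

hello = np.array([0, 0, .2, .3, .8])
there = np.array([.1, .2, 0, 0, .5])
sentence_vec = np.mean([hello, there], axis=0)   # -> [0.05, 0.1, 0.1, 0.15, 0.65]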
So instead of using TF-IDF to generate your sentence vectors, use word2vec like this and you shouldn't have any problems. I already worked on a text classification project and we ended up using a self-trained w2v embedding and an ExtraTrees model, with amazing results.
I have a dataset with annotations in the form: <Word/Phrase, Ontology Class>, where Ontology Class can be one of the following {Physical Object, Action, Quantity}. I have created this dataset manually for my particular ontology model from a large corpus of text.
Because this process was manual, I am sure that I may have missed some words/phrases from my corpus. If that is the case, I am looking at ways to automatically extract other words from the same corpus that have the same "characteristics" as the words in the labeled dataset. Therefore, the first task is to define "characteristics" before I even take on the task of extracting other words.
Are there any standard techniques that I can use to achieve this?
EDIT: Sorry. I should have mentioned that these are domain-specific words not found in WordNet.
Take a look at chapter 6 of the NLTK book. From what you have described, it sounds like a supervised classification technique based on feature ("characteristic") extraction might be a good choice. From the book:
A classifier is called supervised if it is built based on training corpora containing the correct label for each input.
You can use some of the data that you have manually encoded to train your classifier. It might look like this:
def word_features(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features
Next you can train your classifier on some of the data you have already tagged:
>>> import nltk
>>> words = [('Rock', 'Physical Object'), ('Play', 'Action'), ... ]
>>> featuresets = [(word_features(n), g) for (n, g) in words]
>>> train_set, test_set = featuresets[500:], featuresets[:500]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
You should probably train on half of the data you already tagged. That way you can test the accuracy of the classifier with the other half. Keep working on the features until the accuracy of the classifier is as you desire.
nltk.classify.accuracy(classifier, test_set)
You can check individual classifications as follows:
classifier.classify(word_features('Gold'))
If you are not familiar with NLTK, then you can read the previous chapters as well.
As jfocht has said, you need a classifier to do this. To train a classifier, you need a set of training data of 'things' with features and their classification. You can then feed in a new 'thing' with features and get out the classification.
The kicker here is that you don't have features, you just have the words. One idea is to use WordNet, which is a fancy dictionary, to generate features from the definitions of the words. One of WordNet's best features is that it has a hierarchy for each word, e.g.:
cat -> animal -> living thing -> thing ....
You might be able to do this simply by following the hierarchy, but if you can't, you could add features from it and train it. This will likely work much better than using the words themselves as features.
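As a hedged sketch of that idea (assuming the words do appear in WordNet, which the asker notes may not hold for domain-specific terms), hypernym names can be pulled out with NLTK's WordNet interface and merged into the feature dictionary:

from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

def hypernym_features(word):
    # collect the names of the hypernyms on the first sense's path to the root,
    # e.g. cat -> feline -> carnivore -> ... -> entity
    features = {}
    synsets = wn.synsets(word)
    if synsets:
        for path in synsets[0].hypernym_paths():
            for syn in path:
                features["hypernym(%s)" % syn.lemma_names()[0]] = True
    return features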
Regardless of whether you find WordNet to be useful, you need a feature set to train your classifier, and you also have to label all your unclassified data with those features; so unless you have some way to do the feature part computationally, it's going to be less work to do it by hand.