How to combine embedding vectors of BERT with other features?

I am working on a classification task with 3 labels (0, 1, 2 = neg, pos, neu). The data are sentences. To produce sentence embeddings, I use a BERT encoder, and then I use a simple kNN to make predictions.
Each sentence has a label and two other numerical classification values.
For example, my data look like this:
Sentence   embeddings_BERT   level   sub-level   label
je mange   [0.21, 0.56]      2       2.1         pos
il hait    [0.25, 0.39]      3       3.1         neg
.....
As you can see, each sentence has other categories. They are not the final label, but indicators that helped the human annotators decide on the label. I want my model to take those two values into consideration when predicting the label. Do I have to concatenate them with the embeddings generated by the BERT encoder, or is there another way?

There is no single perfect way to tackle this problem, but a simple solution is to concatenate the BERT embeddings with the hand-crafted features. The BERT sentence embeddings have dimension 768 (if you used BERT base) and can be treated as features of the sentence itself. The additional features can be concatenated to them to form a higher-dimensional vector. If the features are categorical, it is best to convert them to one-hot vectors before concatenating. For example, to use level from your example as an input feature, convert it into a one-hot vector and then concatenate it with the BERT embedding. However, in some cases the hand-crafted features can dominate and bias the classifier, and in other cases they may have no influence at all. It all depends on the data that you have.
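A minimal sketch of that concatenation, using NumPy only. The 2-dimensional toy vectors stand in for the 768-dimensional BERT embeddings, and one_hot and feature_weight are hypothetical names introduced here for illustration, not part of any library:

```python
import numpy as np

def one_hot(values):
    """One-hot encode a 1-D sequence of categorical values."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)))
    for row, value in enumerate(values):
        encoded[row, index[value]] = 1.0
    return encoded, categories

# Toy stand-ins: 2-D vectors instead of 768-D BERT sentence embeddings.
embeddings = np.array([[0.21, 0.56], [0.25, 0.39], [0.30, 0.41]])
level = [2, 3, 3]
sub_level = [2.1, 3.1, 3.2]

level_oh, level_cats = one_hot(level)
sub_oh, sub_cats = one_hot(sub_level)

# Hypothetical knob: scales how much the extra features count
# relative to the embedding dimensions.
feature_weight = 1.0

features = np.hstack([embeddings,
                      feature_weight * level_oh,
                      feature_weight * sub_oh])
print(features.shape)  # (3, 7)
```

Because kNN is distance-based, the relative scale of the two feature groups matters, so a weight like feature_weight is worth tuning on a validation set.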

Related

How to tackle open set classification problem in Python?

I am given an open set Insect classification problem using DNA Barcodes. The goal is to predict species labels for testing samples represented in the training set and predict genus labels for testing samples not represented in the training set. Given data variables are something like this:
gtrain: This is a column vector of size 16128. This variable contains genus level labels for each insect instance in the training set. You can think of these as the parent nodes of the leaf nodes in a tree, where leaf nodes are the species and parent nodes are the genera. All instances with the same gtrain value share the same genus.
ytrain: This is a column vector of size 16128. This variable contains species level labels for each insect instance in the training set. All insect instances with the same ytrain value belong to the same species.
emb_train: This is a 2D matrix of size 16128x1000. Each row in this matrix is a high dimensional encoding (or embedding) of the corresponding nucleotide sequence in the training set.
emb_test: This is a 2D matrix of size 5989x1000. Each row in this matrix is a high dimensional encoding (or embedding) of the corresponding nucleotide sequence in the test set.
I can predict either genus or species labels with the code below by passing gtrain or ytrain as the labels:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
labels = gtrain  # or ytrain
xtrain, xtest, labels_train, labels_test = train_test_split(emb_train, labels, test_size=0.3)
classifier = RandomForestClassifier(n_estimators=5)
classifier.fit(xtrain, labels_train.ravel())
ypred = classifier.predict(emb_test)
But I think these predictions are inaccurate because, as stated above, I need to use both gtrain and ytrain to train my model in some way and make final accurate predictions on emb_test. I have not been able to do this.
Can someone provide some guidance/resources/ideas on how to tackle a problem like this? I can provide more info if something is unclear about the problem.

If gtrain holds the parent labels of ytrain (IIUC, connecting each genus node to its species children gives a depth-2 tree over all the labels), you could learn to predict both the genus label and the species label at training time. If I were doing this, I would simply concatenate the label outputs over the genus label space and the species label space.
Let's assume your genus space is 100 (you have 100 unique genus categories), and your species space is 1000 (you have 1000 unique species across all genera).
Your gtrain is 16128x1; this can be transformed to 16128x100, one one-hot vector per row.
Your ytrain is 16128x1; this can be transformed to 16128x1000, one one-hot vector per row.
After concatenation, you have a label matrix with shape [16128, 1100].
You could build a model that takes the 1000-dimensional input embedding, passes it through a few fully-connected hidden layers, and finally connects to the 1100-dimensional output.
At training time, each step picks a small batch of examples (say 64 examples out of the 16128 in total):
input: 64 x 1000 (batch size x embedding dimension)
output: 64 x 1100 (batch size x output label dimension)
Then simply minimize the cross-entropy loss at the output.
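The label construction described above could look roughly like this. It is a NumPy-only sketch on toy data (4 samples instead of 16128), and one_hot_labels is a hypothetical helper name:

```python
import numpy as np

def one_hot_labels(labels):
    """Return a (n_samples, n_classes) one-hot matrix and the class list."""
    classes, idx = np.unique(np.ravel(labels), return_inverse=True)
    return np.eye(len(classes))[idx], classes

# Toy column vectors standing in for the real gtrain / ytrain.
gtrain = np.array([[10], [10], [20], [30]])       # genus labels
ytrain = np.array([[101], [102], [201], [301]])   # species labels

genus_oh, genera = one_hot_labels(gtrain)      # shape (4, 3)
species_oh, species = one_hot_labels(ytrain)   # shape (4, 4)

# One row per sample: genus block first, then species block.
targets = np.hstack([genus_oh, species_oh])
print(targets.shape)  # (4, 7)
```

Each row of targets contains exactly two ones (one genus, one species), so in practice the loss is usually computed per block, for example one softmax cross-entropy over the genus columns plus one over the species columns.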
At prediction time, you can use some heuristics. For example:
Use the confidence of the species output. If all the logits from the species output nodes are low (the threshold can be determined with a validation dataset), predict nothing at the species level and instead pick the top prediction from the genus logits.
Consider the mutual agreement between the genus-level and species-level logits. IIUC, if one genus label has a very high logit but all of its species logits are low (or vice versa), this can be treated as a "disagreement" that triggers the logic of predicting only the genus-level label and not the species label.
Edit: I also looked at your code that uses a random forest. In that case, you could build two classifiers on the same embedding features, one predicting the genus label and the other predicting the species label. At inference time, you run the two classifiers in parallel and get both genus-level and species-level predictions. Then you could use heuristics similar to the ones above to decide the final prediction.
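The confidence-threshold heuristic can be sketched as follows. The probability matrices are hard-coded toy values; in practice they might come from something like predict_proba on the two classifiers. predict_with_fallback and the threshold of 0.5 are assumptions made for this example:

```python
import numpy as np

def predict_with_fallback(species_probs, genus_probs,
                          species_names, genus_names, threshold=0.5):
    """Predict a species when confident, otherwise fall back to the genus."""
    predictions = []
    for sp, gp in zip(species_probs, genus_probs):
        if sp.max() >= threshold:
            predictions.append(("species", species_names[int(sp.argmax())]))
        else:
            predictions.append(("genus", genus_names[int(gp.argmax())]))
    return predictions

# Toy class probabilities for two test samples.
species_probs = np.array([[0.90, 0.05, 0.05],   # confident species
                          [0.40, 0.30, 0.30]])  # no confident species
genus_probs = np.array([[0.95, 0.05],
                        [0.20, 0.80]])

preds = predict_with_fallback(species_probs, genus_probs,
                              ["sp_a", "sp_b", "sp_c"], ["g_x", "g_y"])
print(preds)  # [('species', 'sp_a'), ('genus', 'g_y')]
```

The threshold should be picked on held-out data, ideally with some species deliberately left out of training so that the fallback path is actually exercised.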

How to generate independent(X) variable using Word2vec?

I have a movie review data set which has two columns Review(Sentences) and Sentiment(1 or 0).
I want to create a classification model using word2vec for the embedding and a CNN for the classification.
I've looked for tutorials on YouTube, but all they do is create vectors for every word and show the similar words, like this:
# gensim < 4.0 API; in gensim >= 4.0 use vector_size=100 and model.wv.key_to_index
model = gensim.models.Word2Vec(cleaned_dataset, min_count=2, size=100, window=5)
words = model.wv.vocab
similar = model.wv.most_similar("bad")
I already have my dependent variable (y), which is my 'Sentiment' column; all I need is the independent variable (X) that I can pass to my CNN model.
Before using word2vec I used the Bag of Words (BOW) model, which generated a sparse matrix that served as my independent variable (X). How can I achieve something similar using word2vec?
Kindly correct me if I'm doing something wrong.

To get the vector for a single word, you have to do this:
model.wv['word_that_you_want']
You may also want to handle the KeyError that could arise if you don't find that given word in your model. You also might want to read about what an embedding layer is, which is usually used as the first layer of the neural network (for NLP generally) and is basically a lookup mapping of a word to its corresponding word vector.
To get the word vectors for an entire sentence, you need to first initialize a numpy array of zeros to the dimensions you want.
You might need other variables such as the length of the longest sentence so that you can pad all sentences to that length. The documentation of the pad_sequences method for Keras is here.
A simple example of getting a sentence of word vectors is:
import numpy as np
embedding_matrix = np.zeros((vocab_len, size_of_your_word_vector))
Then iterate over the index of embedding_matrix and add to it, if you find a word vector in your model.
I use this resource which has a lot of examples and I have referenced some of the code there (which I have also used myself sometimes):
# vocab_length should be len(word_tokenizer.word_index) + 1 (index 0 is reserved for padding)
embedding_matrix = np.zeros((vocab_length, 100))
for word, index in word_tokenizer.word_index.items():
    # skip words missing from the w2v model instead of raising a KeyError
    if word in model.wv:
        embedding_matrix[index] = model.wv[word]
And in your model (I'm assuming TensorFlow with Keras):
embedding_layer = Embedding(vocab_length, 100, weights=[embedding_matrix], input_length=length_long_sentence, trainable=False)
I hope this helps.

Word2Vec doesn't inherently create vectors for a text (set of words) – just individual words.
But, sometimes a not-so-bad vector for a multi-word text is the average of all its word-vectors.
If list_of_words is a list of the words in your text, and all the words are in the Word2Vec model, a simple way to get the average of those words' vectors is:
avg_vector_of_words = model.wv[list_of_words].mean(axis=0)
(If some words aren't present, you'd need to filter them before attempting this to avoid KeyErrors. If you wanted to leave out some words, or use unit-normed word-vectors, or unit-normalize the final vector, you'd need more code.)
Then avg_vector_of_words is a small, dense/'embedded' feature vector for the list_of_words text.
You could pass these vectors, one per text, to another downstream classifier, like your CNN, exactly analogously to how you were previously using sparse BOW vectors.
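As a rough sketch of the filtering mentioned above, here is the averaging with unknown words skipped. A plain dict of toy 2-D vectors stands in for model.wv, and average_vector is a hypothetical helper name:

```python
import numpy as np

def average_vector(words, word_vectors, dim):
    """Average the vectors of known words; zeros if none are known."""
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Toy lookup standing in for a trained Word2Vec model's vectors.
toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "movie": np.array([0.0, 1.0]),
}

reviews = [["good", "movie"], ["unknown", "movie"], ["unknown"]]

# One dense row per review, analogous to one BOW row per review.
X = np.vstack([average_vector(r, toy_vectors, dim=2) for r in reviews])
print(X.shape)  # (3, 2)
```

Returning a zero vector for texts with no known words is only one possible choice; dropping such texts is another.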

Python: LSTM model and word embedding

My problem is mainly theoretical. I would like to use an LSTM model to classify the sentiment of sentences in this way 1 = positive, 0 = neutral and -1 = negative. I have a bag of word (BOW) that I would like to use to train the model. BOW is dataframe with two columns like this:
Text | Sentiment
hello dear... 1
I hate you... -1
... ...
According to the example proposed by Keras, I should transform the sentences in the 'Text' column of my BOW into numerical vectors, where each number represents a word of the vocabulary.
Now my question is: how do I turn my sentences into vectors of numbers, and what are the best techniques to do it?
For now my code is this. What am I doing wrong?
model = Sequential()
model.add(LSTM(units=50))
model.add(Dense(2, activation='softmax')) # 2 because I have 3 classes
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
X_train, X_test, y_train, y_test = train_test_split(df['Text'], df['Sentiment'], test_size=0.3, random_state=1)  # 'Sentiment' is capitalized for the other dataframe
clf = model.fit(X_train, y_train)
predicted = clf.predict(X_test)
print(predicted)

First of all, as Marat commented, you are not using the term Bag of Words (BOW) correctly here. What you are claiming to be your BOW is simply just a labeled dataset of sentences. While there are a lot of questions here, I will try to answer the first one on how to convert your sentences into vectors that can be used in an LSTM model.
The most basic way to do this is to create one-hot-encoding vectors for each word in each sentence. To create these, you first need to iterate through your dataset and assign a unique index to each word. So for example:
vocab = {
    'hello': 0,
    'dear': 1,
    # ...
    'hate': 999,
}
Once you have this dictionary created, you can then go through each sentence and assign each word in each sentence a vector of len(vocab) with zeros at every index except for the index corresponding to that word. For example, using vocab above, dear would look like:
[0,1,0,0,0,...,0,0].
The advantage of one-hot vectors is that they are easy to create and pretty simple to work with. The downside is that you can quickly end up with very high-dimensional vectors if you have a large vocabulary. That's where word embeddings come into play, and honestly they are the superior route compared to one-hot vectors. However, they are a bit more complex, and it is harder to understand what exactly they are doing behind the scenes. You can read more about that here if you want: https://towardsdatascience.com/what-the-heck-is-word-embedding-b30f67f01c81
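The vocabulary and one-hot steps described above can be sketched like this. Whitespace tokenization with lowercasing is an assumption made for the example, and one_hot_sentence is a hypothetical helper name:

```python
import numpy as np

sentences = ["hello dear", "I hate you"]

# Assign a unique index to each word, in order of first appearance.
vocab = {}
for sentence in sentences:
    for word in sentence.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)

def one_hot_sentence(sentence, vocab):
    """Return a (n_words, len(vocab)) matrix, one one-hot row per word."""
    words = sentence.lower().split()
    encoded = np.zeros((len(words), len(vocab)))
    for position, word in enumerate(words):
        encoded[position, vocab[word]] = 1.0
    return encoded

print(vocab)  # {'hello': 0, 'dear': 1, 'i': 2, 'hate': 3, 'you': 4}
print(one_hot_sentence("hello dear", vocab))
# [[1. 0. 0. 0. 0.]
#  [0. 1. 0. 0. 0.]]
```

Each sentence becomes a sequence of one-hot rows, which is the (timesteps, features) shape an LSTM expects for a single sample.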

You should first create an index of your vocabulary, i.e. assign an index to each token. Then transform the text to a numeric form by replacing each token with its corresponding index. Your model should then be:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
model = Sequential()
model.add(Embedding(len(vocab), 64, input_length=sent_len))
model.add(LSTM(units=50))
model.add(Dense(3, activation='softmax'))
Note that you need to pad your sentences to a common length before feeding them to the network. You can use np.pad to do so.
Another alternative is to use pre-trained word embeddings; you can download them from fastText.
P.S. You are misusing the term BOW; however, BOW is a good baseline model that you can use for sentiment analysis.
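A small sketch of the indexing and padding steps, using np.pad. Reserving index 0 for padding and tokenizing on whitespace are assumptions of this example; build_vocab and encode are hypothetical helper names:

```python
import numpy as np

PAD = 0  # index reserved for padding (an assumption of this sketch)

def build_vocab(texts):
    """Map each token to an index, starting at 1 so that 0 stays free."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab) + 1)
    return vocab

def encode(text, vocab, sent_len):
    """Replace tokens by indices, truncate, then right-pad to sent_len."""
    ids = [vocab[t] for t in text.lower().split() if t in vocab][:sent_len]
    return np.pad(ids, (0, sent_len - len(ids)), constant_values=PAD)

texts = ["hello dear", "I hate you"]
vocab = build_vocab(texts)
X = np.stack([encode(t, vocab, sent_len=4) for t in texts])
print(X)
# [[1 2 0 0]
#  [3 4 5 0]]
```

With this indexing scheme the Embedding layer needs len(vocab) + 1 input rows, since index 0 is also a valid input.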

How to get output with maximum probability from the all the predicted outputs from dense layer?

I trained a neural network for sign language recognition. Here's my output layer: model.add(Dense(units=26, activation="softmax"))
Now I'm getting probabilities for all 26 letters. Somehow I'm getting 99% accuracy when I test this model with accuracy = model.evaluate(x=test_X, y=test_Y, batch_size=32). I'm new at this; I don't understand how this code works, and I think I'm missing something major here. How do I get a 1D list containing just the predicted letters?

To get the probability of the predicted class for each sample, you need to do something like this:
prediction = model.predict(test_X)
probs = prediction.max(1)
But it is important to remember that softmax outputs, although non-negative and summing to 1, are not necessarily well-calibrated probabilities for each class.

To get outputs with maximum probability in a single list, run:
np.argmax(model.predict(x_test),axis=1)

Supposing alphabet is a list with all alphabet symbols, alphabet = ['a', 'b', ...], then
pred = model.predict(test_X)
pred_ind = pred.argmax(1)  # index of the most probable class (max would return the probability itself)
pred_alphabet = [alphabet[ind] for ind in pred_ind]
will give you the list of predicted symbols.
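For illustration, here is the same mapping on a hand-made array that stands in for the output of model.predict(test_X). The values are made up so that the result is easy to check:

```python
import string
import numpy as np

alphabet = list(string.ascii_lowercase)  # ['a', 'b', ..., 'z']

# Stand-in for model.predict(test_X): 3 samples x 26 class scores.
pred = np.full((3, 26), 0.01)
pred[0, 0] = 0.75    # sample 0 -> 'a'
pred[1, 2] = 0.75    # sample 1 -> 'c'
pred[2, 25] = 0.75   # sample 2 -> 'z'

pred_ind = pred.argmax(axis=1)   # index of the top class per sample
pred_prob = pred.max(axis=1)     # score of that top class
pred_alphabet = [alphabet[ind] for ind in pred_ind]
print(pred_alphabet)  # ['a', 'c', 'z']
```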

In a neural network, the first layer is the input layer for your image. Let's say your image is 32x32 pixels. In that case you would have 32x32x3 nodes in the input layer; the 3 comes from the RGB color channels. Then, depending on your design and model, you use an appropriate number of hidden layers. In many simple setups two hidden layers are used. The final layer has one node per distinct class. If you are going to identify 26 distinct signs, you will have 26 nodes in the final layer.
model.evaluate(x=test_X,y=test_Y,batch_size=32)
I think here you're evaluating the model on your test data set. You probably split your data set into train and test sets at first. Here test_X stands for the images in the test set and test_Y for the corresponding labels. The network is evaluated 32 images at a time; that's the meaning of batch_size=32.
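To make the accuracy number less mysterious, here is a rough NumPy illustration of what an accuracy metric corresponds to, assuming test_Y is one-hot encoded. The arrays are toy values (3 classes instead of 26), not real model output:

```python
import numpy as np

# Toy one-hot labels and predicted class probabilities for 4 samples.
test_Y = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [0, 1, 0]])
pred = np.array([[0.8, 0.1, 0.1],
                 [0.2, 0.7, 0.1],
                 [0.3, 0.4, 0.3],   # wrong: true class is 2
                 [0.1, 0.6, 0.3]])

# A sample counts as correct when the top-scoring class matches the label.
correct = pred.argmax(axis=1) == test_Y.argmax(axis=1)
accuracy = correct.mean()
print(accuracy)  # 0.75
```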
I think this information might help you understand what you're doing, but your question is not entirely clear. Please refer to the tutorial below; it might be helpful.
https://www.pyimagesearch.com/2018/09/10/keras-tutorial-how-to-get-started-with-keras-deep-learning-and-python/

POS tagging using RNN

I implemented a POS tagger using RNN. There are 3 features, if current word is W_i:
Feature 1: W_i-2, W_i-1, W_i, W_i+1, W_i+2
Feature 2: suffix of Feature 1, 2 characters
Feature 3: [If W_i is all uppercase, If W_i is all lowercase, If first character of W_i is uppercase]
In my model, I have two RNNs, one for Feature 1 and one for Feature 2. The outputs of the RNNs and Feature 3 are then concatenated, followed by a softmax. The RNN for Feature 1 is bidirectional.
I tried my model on the Penn Treebank, but the accuracy is very low (<50% on both training and evaluation). Does anyone know of an open-source POS tagger using an RNN (with word-based features) in Python that I can compare my model against? Then I can find out whether there is a bug in my code or whether this model simply does not work.
Thanks,

There is one that is implemented using a bi-directional LSTM and CRF. It can be found here.
