How to do Sentence Similarity with XLNet?

I want to perform a sentence similarity task and tried the following:
from transformers import XLNetTokenizer, XLNetModel
import torch
import scipy
import torch.nn as nn
import torch.nn.functional as F
tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
model = XLNetModel.from_pretrained('xlnet-large-cased')
input_ids = torch.tensor(tokenizer.encode("Hello, my animal is cute", add_special_tokens=False)).unsqueeze(0)
outputs = model(input_ids)
last_hidden_states = outputs[0]
input_ids = torch.tensor(tokenizer.encode("I like your cat", add_special_tokens=False)).unsqueeze(0)
outputs1 = model(input_ids)
last_hidden_states1 = outputs1[0]
cos = nn.CosineSimilarity(dim=1, eps=1e-6)
output = cos(last_hidden_states, last_hidden_states1)
However, I get the following error:
RuntimeError: The size of tensor a (7) must match the size of tensor b (4) at non-singleton dimension 1
Can anybody tell me, what I am doing wrong? Is there a better way to do it?

There are several things you are doing wrong.
add_special_tokens should be set to True. The model was trained with a <sep> token for separating sentences and a <cls> token for sentence classification. Not using them leads to weird behavior because of the train-test data mismatch.
outputs[0] gives you the first element of a single-member Python tuple. All models from the Transformers package return tuples, hence this single-member tuple. It contains one vector per input token, including the special ones. Because your two sentences have different token counts (7 and 4 in your error message), these per-token tensors cannot be compared directly; you need one fixed-size vector per sentence.
Unlike BERT, whose [CLS] token is the first one, here the <cls> token is the very last one (see the Transformers documentation). If you want to compare the classification vectors, you should take the last vector from the sequence, i.e. outputs[0][:, -1].
Alternatively, you might want to compare the average (mean-pool) of the embeddings rather than the <cls> token embedding. In that case, you can just do outputs[0].mean(1).
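The two pooling options above can be sketched without loading the model. This is only an illustration: the random arrays stand in for outputs[0] of each sentence, the hidden size of 8 is a made-up small value, and cosine_similarity is a hypothetical helper, not part of any library.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-6):
    """Cosine similarity between two 1-D vectors."""
    denom = max(np.linalg.norm(a) * np.linalg.norm(b), eps)
    return float(np.dot(a, b) / denom)

rng = np.random.default_rng(0)
hidden_size = 8  # assumption for the sketch; the real model is much wider

# Stand-ins for outputs[0]: shape (batch, num_tokens, hidden_size).
# The token counts 7 and 4 mirror the error message in the question.
states_a = rng.normal(size=(1, 7, hidden_size))
states_b = rng.normal(size=(1, 4, hidden_size))

# Option 1: the vector of the last token (where XLNet places <cls>).
cls_a, cls_b = states_a[0, -1], states_b[0, -1]

# Option 2: the mean over all token vectors.
mean_a, mean_b = states_a[0].mean(axis=0), states_b[0].mean(axis=0)

cls_similarity = cosine_similarity(cls_a, cls_b)
mean_similarity = cosine_similarity(mean_a, mean_b)
print(cls_similarity, mean_similarity)
```

Either way, each sentence is reduced to a single vector of length hidden_size, so the mismatch in token counts no longer matters.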

Related

How to use a batch size bigger than zero in Bert Sequence Classification

Hugging Face documentation describes how to do a sequence classification using a Bert model:
from transformers import BertTokenizer, BertForSequenceClassification
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0) # Batch size 1
labels = torch.tensor([1]).unsqueeze(0) # Batch size 1
outputs = model(input_ids, labels=labels)
loss, logits = outputs[:2]
However, there is only an example for batch size 1. How can this be implemented when we have a list of phrases and want to use a bigger batch size?

In that example, unsqueeze is used to add a dimension to the input and labels, so that the input is an array of size (batch_size, sequence_length). If you want to use a batch size > 1, you can build an array of sequences instead, like in the following example:
from transformers import BertTokenizer, BertForSequenceClassification
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
sequences = ["Hello, my dog is cute", "My dog is cute as well"]
input_ids = torch.tensor([tokenizer.encode(sequence, add_special_tokens=True) for sequence in sequences])
labels = torch.tensor([[1], [0]]) # Labels depend on the task
outputs = model(input_ids, labels=labels)
loss, logits = outputs[:2]
In that example, both sequences get encoded in the same number of tokens so it's easy to build a tensor containing both sequences, but if they have a differing amount of elements you would need to pad the sequences and tell the model which tokens it should attend to (so that it ignores the padded values) using an attention mask.
There is an entry in the glossary concerning attention masks which explains their purpose and usage. You pass this attention mask to the model when calling its forward method.
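A minimal sketch of the padding step described above, assuming the padding id is 0 (check tokenizer.pad_token_id for your tokenizer). The helper pad_batch and the token ids are made up for illustration.

```python
import numpy as np

def pad_batch(token_id_lists, pad_id=0):
    """Right-pad variable-length id lists and build the matching attention mask."""
    max_len = max(len(ids) for ids in token_id_lists)
    input_ids = np.full((len(token_id_lists), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(token_id_lists), max_len), dtype=np.int64)
    for row, ids in enumerate(token_id_lists):
        input_ids[row, :len(ids)] = ids
        attention_mask[row, :len(ids)] = 1  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Hypothetical token ids of two sequences with different lengths.
batch = [[101, 7592, 1010, 102], [101, 2026, 3899, 2003, 10140, 102]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

Both arrays can then be converted with torch.tensor(...) and passed to the model, the mask through its attention_mask argument.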

How to generate independent(X) variable using Word2vec?

I have a movie review data set which has two columns Review(Sentences) and Sentiment(1 or 0).
I want to create a classification model using word2vec for the embedding and a CNN for the classification.
I've looked for tutorials on YouTube, but all they do is create vectors for every word and show the most similar words, like this:
model= gensim.models.Word2Vec(cleaned_dataset, min_count = 2, size = 100, window = 5)
words= model.wv.vocab
simalar= model.wv.most_similar("bad")
I already have my dependent variable (y), which is my 'Sentiment' column; all I need is the independent variable (X), which I can pass to my CNN model.
Before using word2vec I used the Bag Of Words(BOW) model which generated a sparse matrix which was my independent(X) variable. How can I achieve something similar using word2vec?
Kindly correct me if I'm doing something wrong.

To get the word vector, you have to do this:
model.wv['word_that_you_want']
You may also want to handle the KeyError that could arise if you don't find that given word in your model. You also might want to read about what an embedding layer is, which is usually used as the first layer of the neural network (for NLP generally) and is basically a lookup mapping of a word to its corresponding word vector.
To get the word vectors for an entire sentence, you need to first initialize a numpy array of zeros to the dimensions you want.
You might need other variables such as the length of the longest sentence so that you can pad all sentences to that length. The documentation of the pad_sequences method for Keras is here.
A simple example of getting a sentence of word vectors is:
import numpy as np
embedding_matrix = np.zeros((vocab_len, size_of_your_word_vector))
Then iterate over the index of embedding_matrix and add to it, if you find a word vector in your model.
I use this resource which has a lot of examples and I have referenced some of the code there (which I have also used myself sometimes):
embedding_matrix = np.zeros((vocab_length, 100))
for word, index in word_tokenizer.word_index.items():
    if word in model.wv:  # skip words missing from your w2v model to avoid a KeyError
        embedding_matrix[index] = model.wv[word]
And in your model (I'm assuming Tensorflow with Keras)
embedding_layer = Embedding(vocab_length, 100, weights=[embedding_matrix], input_length=length_long_sentence, trainable=False)
I hope this helps.

Word2Vec doesn't inherently create vectors for a text (set of words) – just individual words.
But, sometimes a not-so-bad vector for a multi-word text is the average of all its word-vectors.
If list_of_words is a list of the words in your text, and all the words are in the Word2Vec model, a simple way to get the average of those words' vectors is:
avg_vector_of_words = model.wv[list_of_words].mean(axis=0)
(If some words aren't present, you'd need to filter them before attempting this to avoid KeyErrors. If you wanted to leave out some words, or use unit-normed word-vectors, or unit-normalize the final vector, you'd need more code.)
Then avg_vector_of_words is a small, dense/'embedded' feature vector for the list_of_words text.
You could pass these vectors, one per text, to another downstream classifier, like your CNN, exactly analogously to how you were previously using sparse BOW vectors.
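As a rough sketch of the averaging approach with the filtering mentioned above: the dictionary toy_vectors stands in for a trained model's word vectors, and average_vector is a hypothetical helper. With gensim, model.wv could be passed in place of the dictionary, since it supports both `in` checks and indexing by word.

```python
import numpy as np

def average_vector(words, vectors, dim):
    """Mean of the vectors of known words; a zero vector if none are known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Toy 3-dimensional vectors, made up for illustration.
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([3.0, 2.0, 0.0]),
}
reviews = [["good", "movie"], ["unseen", "good"], ["unseen"]]

# One row per review: this dense matrix plays the role of X.
X = np.vstack([average_vector(r, toy_vectors, dim=3) for r in reviews])
print(X)
```

Returning a zero vector for a text with no known words is one possible choice; dropping such texts is another.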

Outputting attention for bert-base-uncased with huggingface/transformers (torch)

I was following a paper on BERT-based lexical substitution (specifically trying to implement equation (2); if someone has already implemented the whole paper, that would also be great). Thus, I wanted to obtain both the last hidden layers (the only thing I am unsure about is the ordering of the layers in the output: last first or first first?) and the attention from a basic BERT model (bert-base-uncased).
However, I am a bit unsure whether the huggingface/transformers library actually outputs the attention for bert-base-uncased (I was using torch, but am open to using TF instead).
From what I had read, I expected to get a tuple of (logits, hidden_states, attentions), but with the example below (runs e.g. in Google Colab), I get a tuple of length 2 instead.
Am I misinterpreting what I am getting or going about this the wrong way? I did the obvious test and used output_attention=False instead of output_attention=True (while output_hidden_states=True does indeed seem to add the hidden states, as expected) and nothing changed in the output I got. That's clearly a bad sign about my understanding of the library, or it indicates an issue.
import numpy as np
import torch
!pip install transformers
from transformers import (AutoModelWithLMHead,
AutoTokenizer,
BertConfig)
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
config = BertConfig.from_pretrained('bert-base-uncased', output_hidden_states=True, output_attention=True) # Nothing changes when I switch to output_attention=False
bert_model = AutoModelWithLMHead.from_config(config)
sequence = "We went to an ice cream cafe and had a chocolate ice cream."
bert_tokenized_sequence = bert_tokenizer.tokenize(sequence)
indexed_tokens = bert_tokenizer.encode(bert_tokenized_sequence, return_tensors='pt')
predictions = bert_model(indexed_tokens)
########## Now let's have a look at what the predictions look like #############
print(len(predictions)) # Length is 2, I expected 3: logits, hidden_layers, attention
print(predictions[0].shape) # torch.Size([1, 16, 30522]) - seems to be logits (shape is 1 x sequence length x vocabulary size)
print(len(predictions[1])) # Length is 13 - the hidden layers?! There are meant to be 12, right? Is one somehow the attention?
for k in range(len(predictions[1])):
    print(predictions[1][k].shape) # These all seem to be torch.Size([1, 16, 768]), so presumably the hidden layers?
Explanation of what worked in the end inspired by accepted answer
import numpy as np
import torch
!pip install transformers
from transformers import BertModel, BertConfig, BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
config = BertConfig.from_pretrained('bert-base-uncased', output_hidden_states=True, output_attentions=True)
model = BertModel.from_pretrained('bert-base-uncased', config=config)
sequence = "We went to an ice cream cafe and had a chocolate ice cream."
tokenized_sequence = tokenizer.tokenize(sequence)
indexed_tokens = tokenizer.encode(tokenized_sequence, return_tensors='pt')
outputs = model(indexed_tokens)
print( len(outputs) ) # 4
print( outputs[0].shape ) #1, 16, 768
print( outputs[1].shape ) # 1, 768
print( len(outputs[2]) ) # 13 = input embedding (index 0) + 12 hidden layers (indices 1 to 12)
print( outputs[2][0].shape ) # for each of these 13: 1,16,768 = input sequence, index of each input id in sequence, size of hidden layer
print( len(outputs[3]) ) # 12 (= attention for each layer)
print( outputs[3][0].shape ) # index 0 = first layer; 1,12,16,16 = batch, attention heads, index of each input id in sequence, index of each input id in sequence
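To make the 1,12,16,16 shape of each attention tensor more concrete, here is a small numpy sketch of per-head attention weights. It is not the library's implementation: the random queries and keys stand in for the learned projections, and attention_weights is a hypothetical helper.

```python
import numpy as np

def attention_weights(queries, keys):
    """softmax(Q K^T / sqrt(d)) over the key axis, computed per head."""
    head_dim = queries.shape[-1]
    scores = queries @ np.swapaxes(keys, -1, -2) / np.sqrt(head_dim)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
# bert-base: 12 heads of size 64 (12 * 64 = 768); 16 tokens as in the example.
batch, heads, seq_len, head_dim = 1, 12, 16, 64
queries = rng.normal(size=(batch, heads, seq_len, head_dim))
keys = rng.normal(size=(batch, heads, seq_len, head_dim))

weights = attention_weights(queries, keys)
print(weights.shape)  # (1, 12, 16, 16)
```

For each head, row i holds how strongly token i attends to every token in the sequence, so each row sums to 1.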

It may be too late to answer here, but with the recent updates to Hugging Face's transformers, I think we can use this:
config = BertConfig.from_pretrained('bert-base-uncased',
output_hidden_states=True, output_attentions=True)
bert_model = BertModel.from_pretrained('bert-base-uncased',
config=config)
with torch.no_grad():
    out = bert_model(input_ids)
last_hidden_states = out.last_hidden_state
pooler_output = out.pooler_output
hidden_states = out.hidden_states
attentions = out.attentions

The reason is that you are using AutoModelWithLMHead, which is a wrapper for the actual model. It calls the BERT model (i.e., an instance of BertModel) and then uses the embedding matrix as a weight matrix for the word prediction. In between, the underlying model indeed returns attentions, but the wrapper does not care and only returns the logits. Note also that the config in the question passes output_attention, whereas the code that worked in the end passes output_attentions.
You can either get the BERT model directly by calling AutoModel. Note that this model does not return the logits, but the hidden states.
bert_model = AutoModel.from_config(config)
Or you can get it from the BertWithLMHead object by calling:
wrapped_model = bert_model.base_model

keras LSTM get hidden-state (converting sentence-sequence to document context vectors)

I'm trying to create document context vectors from sentence vectors via an LSTM using Keras (so each document consists of a sequence of sentence vectors).
My goal is to replicate the following blog post using Keras: https://andriymulyar.com/blog/bert-document-classification
I have a (toy) tensor that looks like this: X = np.array(features).reshape(5, 200, 768). So there are 5 documents, each a sequence of 200 sentence vectors, and each sentence vector has 768 features.
So to get an embedding from my sentence vectors, I encoded my documents as one-hot vectors to train an LSTM:
y = [1,2,3,4,5] # 5 documents in toy-tensor
y = np.array(y)
yy = to_categorical(y)
yy = yy[0:5,1:6]
Until now, my code looks like this
inputs1=Input(shape=(200,768))
lstm1, states_h, states_c =LSTM(5,dropout=0.3,recurrent_dropout=0.2, return_state=True)(inputs1)
model1=Model(inputs1,lstm1)
model1.compile(loss='categorical_crossentropy',optimizer='rmsprop',metrics=['acc'])
model1.summary()
model1.fit(x=X,y=yy,batch_size=100,epochs=10,verbose=1,shuffle=True,validation_split=0.2)
When I print states_h I get a tensor of shape=(?,5), and I don't really know how to access the vectors inside the tensor, which should represent my documents.
print(states_h)
Tensor("lstm_51/while/Exit_3:0", shape=(?, 5), dtype=float32)
Or am I doing something wrong? To my understanding there should be 5 document vectors e.g. doc1=[...] ; ...; doc5=[...] so that I can reuse the document vectors for a classification task.

Well, printing a tensor shows exactly this: it's a tensor, it has that shape and that type.
If you want to see data, you need to feed data.
States are not weights, they are not persistent, they only exist with input data, just as any other model output.
You should create a model that outputs this information (yours doesn't) in order to grab it. You can have two models:
#this is the model you compile and train - exactly as you are already doing
training_model = Model(inputs1,lstm1)
#this is just for getting the states, nothing else, don't compile, don't train
state_getting_model = Model(inputs1, [lstm1, states_h, states_c])
(Don't worry, these two models will share the same weights and be updated together, even if you only train the training_model)
Now you can:
With eager mode off (and probably "on" too):
lstm_out, states_h_out, states_c_out = state_getting_model.predict(X)
print(states_h_out)
print(states_c_out)
With eager mode on:
lstm_out, states_h_out, states_c_out = state_getting_model(X)
print(states_h_out.numpy())
print(states_c_out.numpy())
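To illustrate why the states only exist once data is fed in, here is a plain numpy sketch of an LSTM forward pass. It is not Keras code: the random weights stand in for trained ones, and lstm_final_states and the gate ordering are choices made for this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_final_states(inputs, kernel, recurrent_kernel, bias):
    """Run one LSTM layer over inputs of shape (batch, steps, features).

    Returns the final hidden state h and cell state c, each (batch, units).
    """
    batch, steps, _ = inputs.shape
    units = recurrent_kernel.shape[0]
    h = np.zeros((batch, units))
    c = np.zeros((batch, units))
    for t in range(steps):
        z = inputs[:, t, :] @ kernel + h @ recurrent_kernel + bias
        i, f, g, o = np.split(z, 4, axis=1)  # input, forget, candidate, output
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
docs, sentences, features, units = 5, 200, 768, 5  # shapes from the question
X = rng.normal(size=(docs, sentences, features))
kernel = rng.normal(scale=0.01, size=(features, 4 * units))
recurrent_kernel = rng.normal(scale=0.01, size=(units, 4 * units))
bias = np.zeros(4 * units)

states_h, states_c = lstm_final_states(X, kernel, recurrent_kernel, bias)
print(states_h.shape)  # (5, 5)
```

The weights are fixed, but states_h is recomputed from scratch for every X passed in: one row per document, which is what state_getting_model.predict(X) returns in the Keras version.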

TF 1.x with tf.keras (Tested with TF 1.15)
Keras does operations using symbolic tensors. Therefore, print(states_h) won't give you anything unless you pass data to the placeholders states_h depends on (in this case inputs1). You can do that as follows.
import tensorflow.keras.backend as K
inputs1=Input(shape=(200,768))
lstm1, states_h, states_c =LSTM(5,dropout=0.3,recurrent_dropout=0.2, return_state=True)(inputs1)
model1=Model(inputs1,lstm1)
model1.compile(loss='categorical_crossentropy',optimizer='rmsprop',metrics=['acc'])
model1.summary()
model1.fit(x=X,y=yy,batch_size=100,epochs=10,verbose=1,shuffle=True,validation_split=0.2)
sess = K.get_session()
out = sess.run(states_h, feed_dict={inputs1:X})
Then out will be (batch_size, 5) sized output.
TF 2.x with tf.keras
The above code won't work as it is. And I still haven't found how to get this to work with TF 2.0 (even though TF 2.0 will still produce a placeholder according to docs). I will edit my answer when I find how to fix this for TF 2.x.

Making predictions on text data using keras

I trained and tested a CNN for sentiment analysis. The train and test data were prepared the same way, by tokenizing the sentences and assigning unique integers:
tokenizer = Tokenizer(filters='$%&()*/:;<=>#[\\]^`{|}~\t\n')
tokenizer.fit_on_texts(text)
vocab_size = len(tokenizer.word_index) + 1
sequences = tokenizer.texts_to_sequences(text)
Then I used a pre-trained GloVe model to create the embedding matrix for the CNN:
filepath_glove = 'glove.twitter.27B.200d.txt'
glove_vocab = []
glove_embd=[]
embedding_dict = {}
file = open(filepath_glove,'r',encoding='UTF-8')
for line in file.readlines():
    row = line.strip().split(' ')
    vocab_word = row[0]
    glove_vocab.append(vocab_word)
    embed_vector = [float(i) for i in row[1:]] # convert to list of float
    embedding_dict[vocab_word]=embed_vector
file.close()
for word, index in tokenizer.word_index.items():
    embedding_matrix[index] = embedding_dict[word]
At this point I also used the test sentences to create this matrix, which was later passed as weights into the embedding layer:
e= Embedding(vocab_size, 200, input_length=maxSeqLength, weights=[embedding_matrix], trainable=False)(inp)
Now I want to reload my model and test it with some new data, but that would mean the embedding matrix won't include some words from the new data. This makes me wonder whether I should have included the test data when creating the embedding matrix in the first place. And if not, how does the embedding layer work for those new words? This part is similar to the following question, but I couldn't find an answer:
How does the Keras Embedding Layer work if word is not found?
Thanks

It's quite simple. You are providing vocab_size, which is the number of words the embedding layer knows. If you pass an index that is out of bounds for vocab_size (a new word), it will be ignored, or an error will be thrown by Keras.
This answers your question about whether you should include all data for your embedding matrix. Yes, you should.
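If including all data up front is not possible, one common alternative is to reserve an index for unseen words so that new text never produces an out-of-range index. The sketch below shows the idea in plain Python; build_word_index and texts_to_sequences are hypothetical helpers written for this example. The Keras Tokenizer offers a comparable oov_token argument.

```python
def build_word_index(texts, oov_token="<OOV>"):
    """Map each training word to an integer; 0 is left for padding, 1 for unseen words."""
    word_index = {oov_token: 1}
    for text in texts:
        for word in text.lower().split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index

def texts_to_sequences(texts, word_index, oov_token="<OOV>"):
    """Convert texts to index lists, sending unknown words to the OOV index."""
    oov_index = word_index[oov_token]
    return [[word_index.get(w, oov_index) for w in text.lower().split()]
            for text in texts]

train_texts = ["good movie", "bad movie"]
word_index = build_word_index(train_texts)
vocab_size = len(word_index) + 1  # +1 for the padding index 0

# "film" was never seen in training, so it maps to the OOV index.
new_sequences = texts_to_sequences(["good film"], word_index)
print(new_sequences)
```

The row of the embedding matrix at the OOV index can then be left as zeros or set to some shared vector, so every index the layer receives stays below vocab_size.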
