Prior to passing my tokens through BERT, I would like to perform some processing on their embeddings (the result of the embedding lookup layer). The HuggingFace BERT TensorFlow implementation allows us to access the output of the embedding lookup using:
import tensorflow as tf
from transformers import BertConfig, BertTokenizer, TFBertModel
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
input_ids = tf.constant(bert_tokenizer.encode("Hello, my dog is cute", add_special_tokens=True))[None, :]
attention_mask = tf.stack([tf.ones(shape=(len(sent),)) for sent in input_ids])
token_type_ids = tf.stack([tf.ones(shape=(len(sent),)) for sent in input_ids])
config = BertConfig.from_pretrained('bert-base-uncased', output_hidden_states=True)
bert_model = TFBertModel.from_pretrained('bert-base-uncased', config=config)
result = bert_model(inputs={'input_ids': input_ids,
                            'attention_mask': attention_mask,
                            'token_type_ids': token_type_ids})
inputs_embeds = result[-1][0] # output of embedding lookup
Subsequently, one can process inputs_embeds and then feed it back into the same model as input using:
inputs_embeds = process(inputs_embeds) # some processing on inputs_embeds done here (dimensions kept the same)
result = bert_model(inputs={'inputs_embeds': inputs_embeds,
                            'attention_mask': attention_mask,
                            'token_type_ids': token_type_ids})
output = result[0]
where output now contains the output of BERT for the modified input. However, this requires two full passes through BERT. Instead of running BERT all the way through just to perform embedding lookup, I would like to just get the output of the embedding lookup layer. Is this possible, and if so, how?
It is in fact incorrect to treat the first output result[-1][0] as the result of an embedding lookup. The raw embedding lookup is given by:
embeddings = bert_model.bert.get_input_embeddings()
word_embeddings = embeddings.word_embeddings
inputs_embeds = tf.gather(word_embeddings, input_ids)
while result[-1][0] gives the embedding lookup plus positional embeddings and token type embeddings. The above code does not require a full pass through BERT, and the result can be processed prior to feeding into the remaining layers of BERT.
EDIT: To add the positional and token type embeddings to an arbitrary inputs_embeds, one can use:
full_embeddings = embeddings(inputs=[None, None, token_type_ids, inputs_embeds])
Here, the call method for the embeddings object accepts a list which is fed into the _embeddings method. The first value is input_ids, the second position_ids, the third token_type_ids, and the fourth inputs_embeds. (See here for more details.) If you have multiple sentences in one input, you may need to set position_ids.
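Putting the above together, a minimal sketch that performs only the embedding lookup, applies some processing, and then feeds the result back through BERT via inputs_embeds (the process function is just a placeholder for whatever transformation you have in mind):
def process(embeds):
    # placeholder: any shape-preserving transformation
    return embeds

embeddings = bert_model.bert.get_input_embeddings()
word_embeddings = embeddings.word_embeddings
inputs_embeds = tf.gather(word_embeddings, input_ids)  # raw lookup, no full pass through BERT
inputs_embeds = process(inputs_embeds)

result = bert_model(inputs={'inputs_embeds': inputs_embeds,
                            'attention_mask': attention_mask,
                            'token_type_ids': token_type_ids})
output = result[0]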
Related
I am trying to use the tokenizer to tokenize the sentences and then mean-pool over the attention mask to get a vector for each sentence. However, the current default embedding size is 768 and I wish to reduce it to 200 instead, but I failed. Below is my code.
from transformers import AutoTokenizer, AutoModel
import torch
#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')
model = AutoModel.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')
model.resize_token_embeddings(200)
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
Error:
2193 # Note [embedding_renorm set_grad_enabled]
2194 # XXX: equivalent to
2195 # with torch.no_grad():
2196 # torch.embedding_renorm_
2197 # remove once script supports set_grad_enabled
2198 _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2199 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
My expected output when using:
print(len(sentence_embeddings[0]))
-> 200
I think you've misunderstood resize_token_embeddings. According to the docs, it
Resizes input token embeddings matrix of the model if new_num_tokens != config.vocab_size.
Takes care of tying weights embeddings afterwards if the model class has a tie_weights() method.
meaning it is used when you add or remove tokens from the vocabulary. Here, resizing refers to resizing the token->embedding dictionary.
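For illustration, a minimal sketch of what resize_token_embeddings is actually intended for, namely growing the embedding matrix after adding new tokens to the vocabulary (the added tokens below are made-up examples):
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')
model = AutoModel.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')

tokenizer.add_tokens(['newtoken1', 'newtoken2'])  # made-up example tokens
model.resize_token_embeddings(len(tokenizer))     # resizes the vocabulary dimension, not hidden_size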
I guess what you want to do is change the hidden_size of the BERT model. In order to do that, you have to change hidden_size in config.json, which will re-initialize all the weights, and you will have to re-train everything, i.e. it is very computationally expensive.
I think your best option is to add a linear layer on top of BertModel of dimension (768x200) and fine-tune on your downstream task.
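A minimal sketch of that suggestion, reusing the tokenizer, model, and mean_pooling function from the question (with the resize_token_embeddings(200) call removed); the 768-to-200 projection is untrained here and would need to be fine-tuned on your downstream task:
import torch.nn as nn

projection = nn.Linear(768, 200)  # untrained projection; fine-tune it on your task

with torch.no_grad():
    model_output = model(**encoded_input)
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

reduced_embeddings = projection(sentence_embeddings)  # shape: (batch_size, 200)
print(len(reduced_embeddings[0]))  # -> 200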
I have a dataset of synonyms and non-synonyms. These are stored in a list of Python dictionaries like {"sentence1": <string>, "sentence2": <string>, "label": <1.0 or 0.0> }. Note that these words (or sentences) do not have to be a single token in the tokenizer.
I want to fine-tune a BERT-based model to take both sentences like: [[CLS], <sentence1_token1>, ...,<sentence1_tokenN>, [SEP], <sentence2_token1>, ..., <sentence2_tokenM>, [SEP]]. I want to take the embedding for the [CLS] token (or the pooled_output available in some models) and run it through one or more perceptron layers (MLP).
Once I have this new model with the additional layers, I want to train it using my data. I have found some examples and I have been able to create the desired pipeline (using PyTorch's torch.nn for the perceptron layers, although I am open to hearing recommendations on what is best).
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(modelname)
tokenizer = AutoTokenizer.from_pretrained(modelname)
# input_sentences1 is a list of the first sentence of every pair
# input_sentences2 is a list of the second sentence of every pair
input = tokenizer(input_sentences1,
                  input_sentences2,
                  add_special_tokens=True,
                  padding=True,
                  return_tensors="pt")
bert_output = model(**input)
# Extract embedding that will go through additional layers
pooled_output = bert_output.pooler_output
pooled_output_CLS_embedding = pooled_output[:]
## OR
# sequence_output = bert_output.last_hidden_state
# sequence_output_CLS_embedding = sequence_output[:,0,:]
# First layer
linear1 = nn.Linear(768, 256)
linear1_output = linear1(pooled_output_CLS_embedding)
# Second layer
linear2 = nn.Linear(256, 1)
linear2_output = linear2(linear1_output)
linear2_output # Random results because the layers have not been trained
How do I encapsulate this to facilitate training, and how do I perform the fine-tuning?
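One possibility, sketched here only as a rough starting point: wrap the pretrained model and the two linear layers in a single nn.Module so that every parameter is registered and trained together (the class name PairClassifier and the ReLU between the layers are my own illustrative choices, not something from the original post):
import torch
import torch.nn as nn
from transformers import AutoModel

class PairClassifier(nn.Module):
    def __init__(self, modelname):
        super().__init__()
        self.bert = AutoModel.from_pretrained(modelname)
        self.linear1 = nn.Linear(768, 256)
        self.linear2 = nn.Linear(256, 1)

    def forward(self, **encoded_pair):
        pooled_output = self.bert(**encoded_pair).pooler_output
        hidden = torch.relu(self.linear1(pooled_output))
        return self.linear2(hidden)  # one logit per sentence pair

# Fine-tuning could then use e.g. nn.BCEWithLogitsLoss() on these logits
# and an optimizer over PairClassifier(...).parameters().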
I am trying to get the embeddings from pre-trained wav2vec2 models (e.g., from jonatasgrosman/wav2vec2-large-xlsr-53-german) using my own dataset.
My aim is to use these features for a downstream task (not specifically speech recognition). Namely, since the dataset is relatively small, I would train an SVM with these embeddings for the final classification.
So far I have tried this:
model_name = "facebook/wav2vec2-large-xlsr-53-german"
feature_extractor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)
input_values = feature_extractor(train_dataset[:10]["speech"], return_tensors="pt", padding=True,
feature_size=1, sampling_rate=16000 ).input_values
Then, I am not sure whether the embeddings here correspond to the sequence of last_hidden_states:
hidden_states = model(input_values).last_hidden_state
or to the sequence of features of the last conv layer of the model:
features_last_cnn_layer = model(input_values).extract_features
Also, is this the correct way to extract features from a pre-trained model?
How can one get embeddings from a specific layer?
P.S.: Posting here as the HuggingFace forum seems to be less active.
Just check the documentation:
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.
extract_features (torch.FloatTensor of shape (batch_size, sequence_length, conv_dim[-1])) – Sequence of extracted feature vectors of the last convolutional layer of the model.
The last_hidden_state vector represents so-called contextualized embeddings (i.e. every feature (CNN output) has a vector representation that is to some extent influenced by the other tokens of the sequence).
The extract_features vector represents the embeddings of your input (after the CNNs).
Also, is this the correct way to extract features from a pre-trained model?
Yes.
How can one get embeddings from a specific layer?
Set output_hidden_states=True:
o = model(input_values, output_hidden_states=True)
o.keys()
Output:
odict_keys(['last_hidden_state', 'extract_features', 'hidden_states'])
The hidden_states value contains the embeddings and the contextualized embeddings of each attention layer.
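For example, assuming output_hidden_states=True as above, an individual layer can be indexed out of that tuple (the layer index 9 below is arbitrary; index 0 is the input to the transformer and the last entry equals last_hidden_state):
o = model(input_values, output_hidden_states=True)
layer_9 = o.hidden_states[9]    # hidden states after the 9th transformer layer
first = o.hidden_states[0]      # embeddings fed into the first transformer layer
last = o.hidden_states[-1]      # same as o.last_hidden_state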
P.S.: The jonatasgrosman/wav2vec2-large-xlsr-53-german model was trained with feat_extract_norm==layer. That means you should also pass an attention mask to the model:
model_name = "facebook/wav2vec2-large-xlsr-53-german"
feature_extractor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)
i= feature_extractor(train_dataset[:10]["speech"], return_tensors="pt", padding=True,
feature_size=1, sampling_rate=16000 )
model(**i)
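Since the stated goal is an SVM on these embeddings, one possible (untested) sketch is to average last_hidden_state over the time dimension to get one fixed-size vector per utterance and hand those to scikit-learn; train_labels below is an assumed array of class labels for the same examples:
import torch
from sklearn.svm import SVC

with torch.no_grad():
    out = model(**i)
utterance_embeddings = out.last_hidden_state.mean(dim=1).numpy()  # shape: (num_examples, hidden_size)

clf = SVC()
clf.fit(utterance_embeddings, train_labels)  # train_labels: assumed labels for these examples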
The Hugging Face documentation describes how to do sequence classification using a BERT model:
from transformers import BertTokenizer, BertForSequenceClassification
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0) # Batch size 1
labels = torch.tensor([1]).unsqueeze(0) # Batch size 1
outputs = model(input_ids, labels=labels)
loss, logits = outputs[:2]
However, there is only an example for batch size 1. How can this be implemented when we have a list of phrases and want to use a bigger batch size?
In that example, unsqueeze is used to add a dimension to the input/labels, so that they are arrays of size (batch_size, sequence_length). If you want to use a batch size > 1, you can build an array of sequences instead, like in the following example:
from transformers import BertTokenizer, BertForSequenceClassification
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
sequences = ["Hello, my dog is cute", "My dog is cute as well"]
input_ids = torch.tensor([tokenizer.encode(sequence, add_special_tokens=True) for sequence in sequences])
labels = torch.tensor([[1], [0]]) # Labels depend on the task
outputs = model(input_ids, labels=labels)
loss, logits = outputs[:2]
In that example, both sequences get encoded into the same number of tokens, so it's easy to build a tensor containing both sequences; if they had a differing number of tokens, you would need to pad the sequences and tell the model which tokens it should attend to (so that it ignores the padded values) using an attention mask.
There is an entry in the glossary concerning attention masks which explains their purpose and usage. You pass this attention mask to the model when calling its forward method.
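For completeness, a sketch of the same example where the tokenizer handles the padding and builds the attention mask itself (this relies on the tokenizer's __call__ API available in recent transformers versions):
from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

sequences = ["Hello, my dog is cute", "My dog is cute as well"]
encoded = tokenizer(sequences, padding=True, return_tensors='pt')  # includes attention_mask
labels = torch.tensor([1, 0])  # labels depend on the task

outputs = model(**encoded, labels=labels)
loss, logits = outputs[:2]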
I'm quite new to TensorFlow and trying to do multitask classification with BERT (I have done this with GloVe in another part of the project). My problem is with the concept of placeholders in TensorFlow. I know that a placeholder just stands in for some values that will be fed in later. Below is the part of my classification model that I have a problem with; I'll explain the exact problem underneath.
import tensorflow as tf

def bert_emb_lookup(input_ids):
    # TODO to be implemented;
    """
    X is the input IDs, but a placeholder
    """
    pass

class BertClassificationModel(object):
    def __init__(self, num_class, args):
        self.embedding_size = args.embedding_size
        self.num_layers = args.num_layers
        self.num_hidden = args.num_hidden
        self.input_ids = tf.placeholder(tf.int32, [None, args.max_document_len])
        self.Y1 = tf.placeholder(tf.int32, [None])
        self.Y2 = tf.placeholder(tf.int32, [None])
        self.dropout = tf.placeholder(tf.float64, [])
        self.input_len = tf.reduce_sum(tf.sign(self.input_ids), 1)

        with tf.name_scope("embedding"):
            self.input_emb = bert_emb_lookup(self.input_ids)
        ...
It was easy to get the word embeddings from GloVe; I first loaded the glove vectors and then simply used tf.nn.embedding_lookup(embeddings, self.input_ids) to fetch the embeddings.
So in the BERT classification model, I'm trying to do something similar by defining a function whose argument is input_ids, where I want to map input IDs back to their associated vocabulary tokens (strings). Thereafter, I'll use an API (bert-as-service) that gives BERT embeddings for any given list of strings at the string level or token level. The problem is that since self.input_ids is just a placeholder, it shows up as a null object. Is there any workaround that helps me with this?
Thanks!
You cannot use bert-as-service as a tensor directly. So you have two options:
Use bert-as-service to look up the embeddings. You give the sentences as input and get a numpy array of embeddings as output. You then feed the numpy array of embeddings to a placeholder self.embeddings = tf.placeholder(tf.float32, [None, 768]).
Then you use self.embeddings wherever you would have used tf.nn.embedding_lookup(embeddings, self.input_ids).
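A rough sketch of option 1 using bert-as-service's Python client (this assumes a bert-serving server is already running; some_output stands for whatever op you build on top of the placeholder):
import tensorflow as tf
from bert_serving.client import BertClient

bc = BertClient()
sentence_embeddings = bc.encode(["first sentence ...", "second sentence ..."])  # numpy array, shape (n, 768)

embeddings_ph = tf.placeholder(tf.float32, [None, 768])
# ... build the rest of the graph on top of embeddings_ph ...

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    result = sess.run(some_output, feed_dict={embeddings_ph: sentence_embeddings})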
The other option is probably overkill in this case, but it may give you some context. Here you do not use bert-as-service to get embeddings. Instead, you use the bert model graph directly. You would use BertModel from https://github.com/google-research/bert/blob/master/run_classifier.py to create a tensor which again can be used wherever you would use tf.nn.embedding_lookup(embeddings, self.input_ids).
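And a rough sketch of option 2, building the tensor with the original google-research BERT code; the config path is a placeholder, modeling refers to modeling.py from that repository, and the checkpoint weights would still need to be restored separately (e.g. with tf.train.init_from_checkpoint):
import tensorflow as tf
from bert import modeling  # modeling.py from the google-research/bert repository

input_ids = tf.placeholder(tf.int32, [None, 128])  # 128 = example max_document_len
bert_config = modeling.BertConfig.from_json_file("/path/to/bert_config.json")  # placeholder path

bert = modeling.BertModel(config=bert_config,
                          is_training=False,
                          input_ids=input_ids,
                          use_one_hot_embeddings=False)

# Token-level contextual embeddings, usable where tf.nn.embedding_lookup(...) was used before
input_emb = bert.get_sequence_output()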