I have a dataset of synonyms and non-synonyms. These are stored in a list of python dictionaries like {"sentence1": <string>, "sentence2": <string>, "label": <1.0 or 0.0> }. Note that this words (or sentences) do not have to be a single token in the tokenizer.
I want to fine-tune a BERT-based model to take both sentences like: [[CLS], <sentence1_token1>, ...,<sentence1_tokenN>, [SEP], <sentence2_token1>, ..., <sentence2_tokenM>, [SEP]]. I want to take the embedding for the [CLS] token (or the pooled_ouput available in some models) and run it through one or more perceptron layers (MLP).
Once I have this new model with the additional layers I want to train it using my data. I have found some examples and I have been able to create the desired pipeline (using PyTorch's torch.nn for the perceptron layers, although I am open to hear recommendations on what is best).
model = AutoModel.from_pretrained(modelname)
tokenizer = AutoTokenizer.from_pretrained(modelname)
# input_sentences1 is a list of the first sentence of every pair
# input_sentences2 is a list of the second sentence of every pair
input = tokenizer( input_sentences1,
input_sentences2,
add_special_tokens = True,
padding=True,
return_tensors="pt" )
bert_output = model(**input)
# Extract embedding that will go through additional layers
pooled_output = bert_output.pooler_output
pooled_ouput_CLS_embedding = pooled_output[:]
## OR
# sequence_output = bert_output.last_hidden_state
# sequence_ouput_CLS_embedding = sequence_output[:,0,:]
# First layer
linear1 = nn.Linear(768, 256)
linear1_output = linear1(pooled_ouput_CLS_embedding)
# Second layer
linear2 = nn.Linear(256, 1)
linear2_output = linear2(linear1_output)
linear2_output # Random results becuase the layers have not been trained
How do I encapsulate this to facilitate training and how do I perform the fine tuning?
Related
Is the mean initialisation of new tokens correct? Also how should I save new tokenizer( after adding new tokens to it) to use it in downstream model?
I train a MLM model by adding new tokens and taking mean. How should I use the fine tuned MLM model for new classification task?
tokenizer_org = tr.BertTokenizer.from_pretrained("/home/pc/bert_base_multilingual_uncased")
tokenizer.add_tokens(joined_keywords)
model = tr.BertForMaskedLM.from_pretrained("/home/pc/bert_base_multilingual_uncased", return_dict=True)
# prepare input
text = ["Replace me by any text you'd like"]
encoded_input = tokenizer(text, truncation=True, padding=True, max_length=512, return_tensors="pt")
print(encoded_input)
# add embedding params for new vocab words
model.resize_token_embeddings(len(tokenizer))
weights = model.bert.embeddings.word_embeddings.weight
# initialize new embedding weights as mean of original tokens
with torch.no_grad():
emb = []
for i in range(len(joined_keywords)):
word = joined_keywords[i]
# first & last tokens are just string start/end; don't keep
tok_ids = tokenizer_org(word)["input_ids"][1:-1]
tok_weights = weights[tok_ids]
# average over tokens in original tokenization
weight_mean = torch.mean(tok_weights, axis=0)
emb.append(weight_mean)
weights[-len(joined_keywords):,:] = torch.vstack(emb).requires_grad_()
model.to(device)
trainer.save_model("/home/pc/Bert_multilingual_exp_TCM/model_mlm_exp1")
It saves model, config, training_args. How to save the new tokenizer as well??
What you are going to do is a convenient method for adding new markers and information to raw text. huggingface provided several method to do that I used the simplest one IMO.
BASE_MODEL = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
print('Vocab size before manipulation: ', len(tokenizer))
special_tokens_dict = {'additional_special_tokens': ['[C1]','[C2]','[C3]','[C4]']}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
print('Vocab size after manipulation: ', len(tokenizer))
tokenizer.save_pretrained("./models/tokenizer/")
tokenizer2 = AutoTokenizer.from_pretrained("./models/tokenizer/")
print('Vocab size after saving and loading: ', len(tokenizer))
output:
Vocab size before manipulation: 119547
Vocab size after manipulation: 119551
Vocab size after saving and loading: 119551
The big caveat : When you manipulated the tokenizer you need to update the embedding layer of the model accordingly. Some thing like this model.resize_token_embeddings(len(tokenizer)).
I am trying to get the embeddings from pre-trained wav2vec2 models (e.g., from jonatasgrosman/wav2vec2-large-xlsr-53-german) using my own dataset.
My aim is to use these features for a downstream task (not specifically speech recognition). Namely, since the dataset is relatively small, I would train an SVM with these embeddings for the final classification.
So far I have tried this:
model_name = "facebook/wav2vec2-large-xlsr-53-german"
feature_extractor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)
input_values = feature_extractor(train_dataset[:10]["speech"], return_tensors="pt", padding=True,
feature_size=1, sampling_rate=16000 ).input_values
Then, I am not sure whether the embeddings here correspond to the sequence of last_hidden_states:
hidden_states = model(input_values).last_hidden_state
or to the sequence of features of the last conv layer of the model:
features_last_cnn_layer = model(input_values).extract_features
Also, is this the correct way to extract features from a pre-trained model?
How one can get embeddings from a specific layer?
PD: Posting here as the HuggingFace's forum seems to be less active.
Just check the documentation:
last_hidden_state (torch.FloatTensor of shape (batch_size,
sequence_length, hidden_size)) – Sequence of hidden-states at the
output of the last layer of the model.
extract_features (torch.FloatTensor of shape (batch_size,
sequence_length, conv_dim[-1])) – Sequence of extracted feature
vectors of the last convolutional layer of the model.
The last_hidden_state vector represents so called contextualized embeddings (i.e. every feature (CNN output) has a vector representation that is to some extend influenced by the other tokens of the sequence).
The extract_features vector represents the embeddings of your input (after the CNNs).
.
Also, is this the correct way to extract features from a pre-trained
model?
Yes.
How one can get embeddings from a specific layer?
Set output_hidden_states=True:
o = model(input_values,output_hidden_states=True)
o.keys()
Output:
odict_keys(['last_hidden_state', 'extract_features', 'hidden_states'])
The hidden_states value contains the embeddings and the contextualized embeddings of each attention layer.
P.S.: jonatasgrosman/wav2vec2-large-xlsr-53-german model was trained with feat_extract_norm==layer. That means, you should also pass an attention mask to the model:
model_name = "facebook/wav2vec2-large-xlsr-53-german"
feature_extractor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)
i= feature_extractor(train_dataset[:10]["speech"], return_tensors="pt", padding=True,
feature_size=1, sampling_rate=16000 )
model(**i)
Hugging Face documentation describes how to do a sequence classification using a Bert model:
from transformers import BertTokenizer, BertForSequenceClassification
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0) # Batch size 1
labels = torch.tensor([1]).unsqueeze(0) # Batch size 1
outputs = model(input_ids, labels=labels)
loss, logits = outputs[:2]
However, there is only example for batch size 1. How to implement it when we have a list of phrases and want to use a bigger batch size?
in that example unsqueeze is used to add a dimension to the input/labels, so that it is an array of size (batch_size, sequence_length). If you want to use a batch size > 1, you can build an array of sequences instead, like in the following example:
from transformers import BertTokenizer, BertForSequenceClassification
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
sequences = ["Hello, my dog is cute", "My dog is cute as well"]
input_ids = torch.tensor([tokenizer.encode(sequence, add_special_tokens=True) for sequence in sequences])
labels = torch.tensor([[1], [0]]) # Labels depend on the task
outputs = model(input_ids, labels=labels)
loss, logits = outputs[:2]
In that example, both sequences get encoded in the same number of tokens so it's easy to build a tensor containing both sequences, but if they have a differing amount of elements you would need to pad the sequences and tell the model which tokens it should attend to (so that it ignores the padded values) using an attention mask.
There is an entry in the glossary concerning attention masks which explains their purpose and usage. You pass this attention mask to the model when calling its forward method.
When doing training, I initialize my embedding matrix, using the pretrained embeddings picked for words in training set vocabulary.
import torchtext as tt
contexts = tt.data.Field(lower=True, sequential=True, tokenize=tokenizer, use_vocab=True)
contexts.build_vocab(data, vectors="fasttext.en.300d",
vectors_cache=config["vectors_cache"])
In my model I pass contexts.vocab as parameter and initialize embeddings:
embedding_dim = vocab.vectors.shape[1]
self.embeddings = nn.Embedding(len(vocab), embedding_dim)
self.embeddings.weight.data.copy_(vocab.vectors)
self.embeddings.weight.requires_grad=False
I train my model and during training I save its 'best' state via torch.save(model, f).
Then I want to test/create demo for model in separate file for evaluation. I load the model via torch.load. How do I extend the embedding matrix to contain test vocabulary? I tried to replace embedding matrix
# data is TabularDataset with test data
contexts.build_vocab(data, vectors="fasttext.en.300d",
vectors_cache=config["vectors_cache"])
model.embeddings = torch.nn.Embedding(len(contexts.vocab), contexts.vocab.vectors.shape[1])
model.embeddings.weight.data.copy_(contexts.vocab.vectors)
model.embeddings.weight.requires_grad = False
But the results are terrible (almost 0 accuracy). Model was doing good during training. What is the 'correct way' of doing this?
In my model, I use GloVe pre-trained embeddings. I wish to keep them non-trainable in order to decrease the number of model parameters and avoid overfit. However, I have a special symbol whose embedding I do want to train.
Using the provided Embedding Layer, I can only use the parameter 'trainable' to set the trainability of all embeddings in the following way:
embedding_layer = Embedding(voc_size,
emb_dim,
weights=[embedding_matrix],
input_length=MAX_LEN,
trainable=False)
Is there a Keras-level solution to training only a subset of embeddings?
Please note:
There is not enough data to generate new embeddings for all words.
These answers only relate to native TensorFlow.
Found some nice workaround, inspired by Keith's two embeddings layers.
Main idea:
Assign the special tokens (and the OOV) with the highest IDs. Generate a 'sentence' containing only special tokens, 0-padded elsewhere. Then apply non-trainable embeddings to the 'normal' sentence, and trainable embeddings to the special tokens. Lastly, add both.
Works fine to me.
# Normal embs - '+2' for empty token and OOV token
embedding_matrix = np.zeros((vocab_len + 2, emb_dim))
# Special embs
special_embedding_matrix = np.zeros((special_tokens_len + 2, emb_dim))
# Here we may apply pre-trained embeddings to embedding_matrix
embedding_layer = Embedding(vocab_len + 2,
emb_dim,
mask_zero = True,
weights = [embedding_matrix],
input_length = MAX_SENT_LEN,
trainable = False)
special_embedding_layer = Embedding(special_tokens_len + 2,
emb_dim,
mask_zero = True,
weights = [special_embedding_matrix],
input_length = MAX_SENT_LEN,
trainable = True)
valid_words = vocab_len - special_tokens_len
sentence_input = Input(shape=(MAX_SENT_LEN,), dtype='int32')
# Create a vector of special tokens, e.g: [0,0,1,0,3,0,0]
special_tokens_input = Lambda(lambda x: x - valid_words)(sentence_input)
special_tokens_input = Activation('relu')(special_tokens_input)
# Apply both 'normal' embeddings and special token embeddings
embedded_sequences = embedding_layer(sentence_input)
embedded_special = special_embedding_layer(special_tokens_input)
# Add the matrices
embedded_sequences = Add()([embedded_sequences, embedded_special])
I haven't found a nice solution like a mask for the Embedding layer. But here's what I've been meaning to try:
Two embedding layers - one trainable and one not
The non-trainable one has all the Glove embeddings for in-vocab words and zero vectors for others
The trainable one only maps the OOV words and special symbols
The output of these two layers is added (I was thinking of this like ResNet)
The Conv/LSTM/etc below the embedding is unchanged
That would get you a solution with a small number of free parameters allocated to those embeddings.