I am trying to get the embeddings from pre-trained wav2vec2 models (e.g., from jonatasgrosman/wav2vec2-large-xlsr-53-german) using my own dataset.
My aim is to use these features for a downstream task (not necessarily speech recognition). In particular, since the dataset is relatively small, I would train an SVM on these embeddings for the final classification.
So far I have tried this:
model_name = "facebook/wav2vec2-large-xlsr-53-german"
feature_extractor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)
input_values = feature_extractor(train_dataset[:10]["speech"], return_tensors="pt", padding=True,
feature_size=1, sampling_rate=16000 ).input_values
Then, I am not sure whether the embeddings here correspond to the sequence of last_hidden_states:
hidden_states = model(input_values).last_hidden_state
or to the sequence of features of the last conv layer of the model:
features_last_cnn_layer = model(input_values).extract_features
Also, is this the correct way to extract features from a pre-trained model?
How can one get embeddings from a specific layer?
PS: Posting here as the HuggingFace forum seems to be less active.
Just check the documentation:
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.
extract_features (torch.FloatTensor of shape (batch_size, sequence_length, conv_dim[-1])) – Sequence of extracted feature vectors of the last convolutional layer of the model.
The last_hidden_state vector represents the so-called contextualized embeddings (i.e., every feature (CNN output) has a vector representation that is, to some extent, influenced by the other tokens of the sequence).
The extract_features vector represents the embeddings of your input (after the CNNs).
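For a quick sanity check (a sketch, reusing input_values from the question's snippet), you can compare the shapes of the two outputs:
out = model(input_values)
print(out.last_hidden_state.shape)  # (batch, seq_len, hidden_size), hidden_size = 1024 for the large model
print(out.extract_features.shape)   # (batch, seq_len, conv_dim[-1]), conv_dim[-1] = 512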
Also, is this the correct way to extract features from a pre-trained model?
Yes.
How can one get embeddings from a specific layer?
Set output_hidden_states=True:
o = model(input_values, output_hidden_states=True)
o.keys()
Output:
odict_keys(['last_hidden_state', 'extract_features', 'hidden_states'])
The hidden_states value contains the initial embeddings (the projected CNN output) and the contextualized embeddings of every transformer layer.
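For example, to pull out the embeddings of one specific transformer layer (a sketch; the layer index 9 is arbitrary):
# hidden_states is a tuple with num_hidden_layers + 1 entries:
# index 0 is the projected CNN output fed into the transformer,
# index k is the output of transformer layer k.
layer_9 = o.hidden_states[9]  # (batch, seq_len, hidden_size)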
P.S.: The jonatasgrosman/wav2vec2-large-xlsr-53-german model was trained with feat_extract_norm==layer. That means you should also pass an attention mask to the model:
model_name = "facebook/wav2vec2-large-xlsr-53-german"
feature_extractor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)
i= feature_extractor(train_dataset[:10]["speech"], return_tensors="pt", padding=True,
feature_size=1, sampling_rate=16000 )
model(**i)
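Since the stated goal is an SVM on top of these embeddings, one common recipe (a sketch, reusing the inputs object from the snippet above) is to mean-pool the frame-level hidden states into one fixed-size vector per utterance:
import torch

with torch.no_grad():
    out = model(**inputs)
# Mean over the time axis gives one fixed-size vector per utterance; padded frames are
# included in this simple version (masking them out is a refinement). The resulting
# matrix can be fed to e.g. sklearn.svm.SVC as features.
utterance_embeddings = out.last_hidden_state.mean(dim=1)  # (batch, hidden_size)
features = utterance_embeddings.cpu().numpy()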
I created a word embedding layer outside the model and used it to embed my input before fitting the model. Now I need to predict new sentences with this model; how can I save the pre-trained embedding layer and apply it to the new sentences?
Code example:
Before input to model and fitting:
embedding_sentence = tf.keras.layers.Embedding(vocab_size, model_dimension, trainable=True)
embedded_sentence = embedding_sentence(vectorized_sentence)
Model fitting:
model = tf.keras.Sequential()
model.add(tf.keras.layers.GlobalAveragePooling1D())
...
Now I need to predict new sentences; how can I apply the trained embedding to them?
The above information is not quite sufficient to answer this accurately, but I will still give it a try. In TensorFlow, you can use the get_weights method to get the weights of a pre-trained embedding layer, save them to a NumPy/HDF5 file, and load them later into an embedding layer in a new architecture.
import numpy as np
import tensorflow as tf

weights = embedding_sentence.get_weights()   # one (vocab_size, model_dimension) matrix
np.save('embedding_weights.npy', weights[0])
# Now load the weights into the embedding layer again
loaded_weights = [np.load('embedding_weights.npy')]
new_embedding_sentence = tf.keras.layers.Embedding(vocab_size, model_dimension, trainable=True)
new_embedding_sentence.build((None,))        # required before calling set_weights
new_embedding_sentence.set_weights(loaded_weights)
# The new sentence must first be turned into token ids with the same vectorization
# used during training (vectorize below is a placeholder for that step)
new_sentence = "This is a dummy sentence"
new_sentence_ids = vectorize(new_sentence)
new_sentence_embedding = new_embedding_sentence(new_sentence_ids)
predictions = model(new_sentence_embedding)
I have a dataset of synonyms and non-synonyms. These are stored in a list of Python dictionaries like {"sentence1": <string>, "sentence2": <string>, "label": <1.0 or 0.0>}. Note that these words (or sentences) do not have to be a single token for the tokenizer.
I want to fine-tune a BERT-based model to take both sentences as: [[CLS], <sentence1_token1>, ..., <sentence1_tokenN>, [SEP], <sentence2_token1>, ..., <sentence2_tokenM>, [SEP]]. I want to take the embedding of the [CLS] token (or the pooled_output available in some models) and run it through one or more perceptron layers (an MLP).
Once I have this new model with the additional layers, I want to train it using my data. I have found some examples, and I have been able to create the desired pipeline (using PyTorch's torch.nn for the perceptron layers, although I am open to recommendations on what works best).
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(modelname)
tokenizer = AutoTokenizer.from_pretrained(modelname)

# input_sentences1 is a list of the first sentence of every pair
# input_sentences2 is a list of the second sentence of every pair
input = tokenizer(input_sentences1,
                  input_sentences2,
                  add_special_tokens=True,
                  padding=True,
                  return_tensors="pt")
bert_output = model(**input)

# Extract the embedding that will go through the additional layers
pooled_output = bert_output.pooler_output
pooled_output_CLS_embedding = pooled_output[:]
## OR
# sequence_output = bert_output.last_hidden_state
# sequence_output_CLS_embedding = sequence_output[:, 0, :]

# First layer
linear1 = nn.Linear(768, 256)
linear1_output = linear1(pooled_output_CLS_embedding)

# Second layer
linear2 = nn.Linear(256, 1)
linear2_output = linear2(linear1_output)

linear2_output  # random results because the layers have not been trained
How do I encapsulate this to facilitate training, and how do I perform the fine-tuning?
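One way to encapsulate this (a minimal sketch with my own class name, not the only option) is to wrap the pretrained encoder and the MLP head in a single nn.Module; training is then a standard PyTorch loop with an optimizer over model.parameters() and nn.BCEWithLogitsLoss between the logits and the 0.0/1.0 labels:
import torch.nn as nn
from transformers import AutoModel

class SynonymClassifier(nn.Module):
    def __init__(self, modelname, hidden_dim=256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(modelname)
        self.mlp = nn.Sequential(
            nn.Linear(self.encoder.config.hidden_size, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, **encoded):  # pass the tokenizer output directly
        pooled = self.encoder(**encoded).pooler_output
        return self.mlp(pooled).squeeze(-1)  # one logit per sentence pair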
I have a trained TF model which has the following architecture:
Inputs:
word_a, one-hot representation, vocab-size: 50000
word_b, one-hot representation, vocab-size: 50
Output:
probs, size: 1x10000
The network looks up an embedding of size 1x100 for word_a (dense_word_a) from an embedding matrix. word_b is transformed by a character CNN into a dense vector of size 1x250. Both vectors are concatenated into a 1x350 vector, which a decoder layer with a sigmoid maps to the 1x10000 output.
I need to run this model on the client, and for this I'm converting it to TFLite.
However, I also need to break the model into two sub-models with the following inputs and outputs:
Model 1:
Inputs:
word_a: one-hot representation, (1x50000) vocab-size: 50000
Output:
dense_word_a: dense word-embedding looked up from embedding matrix (1x100)
Network:
Simple embedding lookup for word_a from embedding matrix.
Model 2:
Inputs:
dense_word_a: embedding for word_a received from Model 1. (1x100).
word_b, one-hot representation, vocab-size: 50 (1x50)
Output:
probs, size: 1x10000
In Model 1, the input word_a is a placeholder and dense_word_a is a variable. In Model 2, dense_word_a is a placeholder and its value is concatenated with word_b's embedding.
The embeddings in my model are not pre-trained, and are trained as part of the model training process itself. So I need to train the model as a combined model but during inference I want to break it up into Model 1 and Model 2 as described above.
The idea is to run Model 1 on the server side and pass the embedding values to the client, so the client can perform inference using word_b without storing a 5MB embedding matrix. So I'm not constrained on the size of Model 1, but since Model 2 runs on the client I need it to be small.
Here's what I've tried:
1. During model freezing, I freeze the full model but add the variable name dense_word_a to the output-node list along with probs. I then convert the model to TFLite. During inference, I'm able to see the dense_word_a output as a 1x100 vector, and I'm also getting probs as output. This seems to work fine.
2. For generating Model 2, I remove the dense_word_a variable and convert it into a placeholder (using tf.placeholder), remove the placeholder for word_a, and freeze the graph again.
However, I'm not able to match the probs value. The probs vector generated by the full model doesn't match the probs vector generated by Model 2.
How can I go about breaking the model into sub-models and also match the results?
I think what you described should work.
Is it easy to reproduce the problem that you're seeing? If you can isolate the reproducible steps and you believe there's a bug, could you file a bug on GitHub? Thanks!
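As a side note, one way to express this kind of split without hand-editing the graph is to point the TF 1.x frozen-graph converter at intermediate tensors directly; a minimal sketch (the .pb path and node names below are assumptions, not the actual ones from the graph described above):
import tensorflow as tf

# Model 1: word_a -> dense_word_a (runs on the server)
converter1 = tf.compat.v1.lite.TFLiteConverter.from_frozen_graph(
    "full_model_frozen.pb",
    input_arrays=["word_a"],
    output_arrays=["dense_word_a"])
open("model1.tflite", "wb").write(converter1.convert())

# Model 2: dense_word_a + word_b -> probs (runs on the client)
converter2 = tf.compat.v1.lite.TFLiteConverter.from_frozen_graph(
    "full_model_frozen.pb",
    input_arrays=["dense_word_a", "word_b"],
    output_arrays=["probs"],
    input_shapes={"dense_word_a": [1, 100], "word_b": [1, 50]})
open("model2.tflite", "wb").write(converter2.convert())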
I'm new to Keras and the deep learning field. I want to build a dense vector for each document in my data, so I built a simple autoencoder using the Keras library.
The input data are normalized using Word2vec with an embedding size of 200, and all features are between -1 and 1. I prepared a 3D tensor that contains 137 samples (the number of documents) with 469 columns (the maximum number of words), and the third dimension is the embedding size. I used the MSE loss function and a GRU as the recurrent neural network. The autoencoder predicts the same vector for all documents, while the loss starts at a very low value and becomes constant after a few epochs.
I tried a different number of epochs but got the same result. I also tried changing the batch size, but nothing changed. Can anyone help me find the problem, please?
from tensorflow.keras.layers import Input, GRU, Dense, RepeatVector, TimeDistributed
from tensorflow.keras.models import Model

inputs = Input(shape=(469, 200))
encoder = GRU(120, activation='sigmoid', dropout=0.2)(inputs)
neck = Dense(20)(encoder)
decoder1 = RepeatVector(469)(neck)
decoder1 = GRU(120, return_sequences=True, activation='sigmoid', dropout=0.2)(decoder1)
decoder1 = TimeDistributed(Dense(200, activation='tanh'))(decoder1)

model = Model(inputs=inputs, outputs=decoder1)
model.compile(optimizer='adam', loss='mse')
history = model.fit(x_train, x_train, validation_data=(x_test, x_test), epochs=10, batch_size=8)
print(model.predict(x_train)) returns the same vector for all 137 samples. Why does the model predict the same vector for every sample?
Thank you in advance.
When training, I initialize my embedding matrix using the pretrained embeddings for the words in the training-set vocabulary.
import torchtext as tt
contexts = tt.data.Field(lower=True, sequential=True, tokenize=tokenizer, use_vocab=True)
contexts.build_vocab(data, vectors="fasttext.en.300d",
                     vectors_cache=config["vectors_cache"])
In my model I pass contexts.vocab as a parameter and initialize the embeddings:
embedding_dim = vocab.vectors.shape[1]
self.embeddings = nn.Embedding(len(vocab), embedding_dim)
self.embeddings.weight.data.copy_(vocab.vectors)
self.embeddings.weight.requires_grad=False
I train my model and during training I save its 'best' state via torch.save(model, f).
Then I want to test the model / build a demo in a separate file for evaluation. I load the model via torch.load. How do I extend the embedding matrix to contain the test vocabulary? I tried to replace the embedding matrix:
# data is a TabularDataset with the test data
contexts.build_vocab(data, vectors="fasttext.en.300d",
                     vectors_cache=config["vectors_cache"])
model.embeddings = torch.nn.Embedding(len(contexts.vocab), contexts.vocab.vectors.shape[1])
model.embeddings.weight.data.copy_(contexts.vocab.vectors)
model.embeddings.weight.requires_grad = False
But the results are terrible (almost 0 accuracy). The model was doing well during training. What is the 'correct' way of doing this?
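One commonly suggested alternative (a hedged sketch, relying on the legacy torchtext internals implied above and not verified against this exact setup) is to keep the training-time vocab, so existing indices keep their meaning, and only append embedding rows for test words that were never seen during training:
import torch

train_vocab = contexts.vocab  # the vocab built on the training data (i.e. do not rebuild it)
# test_vocab is assumed to be a separate vocab built on the test data with the same fasttext vectors
new_rows = []
for word in test_vocab.itos:
    if word not in train_vocab.stoi:
        train_vocab.stoi[word] = len(train_vocab.itos)
        train_vocab.itos.append(word)
        new_rows.append(test_vocab.vectors[test_vocab.stoi[word]])

if new_rows:
    extended = torch.cat([model.embeddings.weight.data, torch.stack(new_rows)], dim=0)
    model.embeddings = torch.nn.Embedding.from_pretrained(extended, freeze=True)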