Train only some word embeddings (Keras) - python

In my model, I use GloVe pre-trained embeddings. I wish to keep them non-trainable in order to decrease the number of model parameters and avoid overfitting. However, I have a special symbol whose embedding I do want to train.
Using the provided Embedding Layer, I can only use the parameter 'trainable' to set the trainability of all embeddings in the following way:
embedding_layer = Embedding(voc_size,
                            emb_dim,
                            weights=[embedding_matrix],
                            input_length=MAX_LEN,
                            trainable=False)
Is there a Keras-level solution to training only a subset of embeddings?
Please note:
There is not enough data to generate new embeddings for all words.
These answers only relate to native TensorFlow.

I found a nice workaround, inspired by Keith's two embedding layers.
Main idea:
Assign the highest IDs to the special tokens (and the OOV token). Generate a 'sentence' that contains only the special tokens and is 0-padded elsewhere. Then apply non-trainable embeddings to the 'normal' sentence and trainable embeddings to the special tokens. Lastly, add the two.
It works fine for me.
import numpy as np
from keras.layers import Input, Embedding, Lambda, Activation, Add

# Normal embs - '+2' for empty token and OOV token
embedding_matrix = np.zeros((vocab_len + 2, emb_dim))
# Special embs
special_embedding_matrix = np.zeros((special_tokens_len + 2, emb_dim))
# Here we may apply pre-trained embeddings to embedding_matrix
embedding_layer = Embedding(vocab_len + 2,
                            emb_dim,
                            mask_zero=True,
                            weights=[embedding_matrix],
                            input_length=MAX_SENT_LEN,
                            trainable=False)
special_embedding_layer = Embedding(special_tokens_len + 2,
                                    emb_dim,
                                    mask_zero=True,
                                    weights=[special_embedding_matrix],
                                    input_length=MAX_SENT_LEN,
                                    trainable=True)
valid_words = vocab_len - special_tokens_len
sentence_input = Input(shape=(MAX_SENT_LEN,), dtype='int32')
# Create a vector of special tokens, e.g: [0,0,1,0,3,0,0]
special_tokens_input = Lambda(lambda x: x - valid_words)(sentence_input)
special_tokens_input = Activation('relu')(special_tokens_input)
# Apply both 'normal' embeddings and special token embeddings
embedded_sequences = embedding_layer(sentence_input)
embedded_special = special_embedding_layer(special_tokens_input)
# Add the matrices
embedded_sequences = Add()([embedded_sequences, embedded_special])
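To see what the two lookups compute, here is a minimal NumPy sketch of the same arithmetic outside Keras. The sizes, the token ids and the "trained" row are made up for illustration; only the `x - valid_words` step followed by ReLU mirrors the code above.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
vocab_len = 6            # real tokens, special ones included (ids 1..6)
special_tokens_len = 2   # the two highest ids (5 and 6) are special
emb_dim = 3
valid_words = vocab_len - special_tokens_len  # ids 1..4 are ordinary words

rng = np.random.default_rng(0)
# Frozen table: stands in for the pre-trained vectors; row 0 is padding.
frozen = rng.normal(size=(vocab_len + 2, emb_dim))
frozen[0] = 0.0
# Trainable table: starts at zero, so at first it changes nothing.
special = np.zeros((special_tokens_len + 2, emb_dim))

sentence = np.array([3, 1, 5, 2, 6, 0, 0])            # 0 = padding
special_ids = np.maximum(sentence - valid_words, 0)    # relu(x - valid_words)

# Pretend training has moved the row of the first special token.
special[1] = [1.0, 1.0, 1.0]

# Sum of the two lookups, as the Add() layer would produce.
combined = frozen[sentence] + special[special_ids]
print(special_ids)
```

One thing the sketch makes visible: row 0 of the trainable table is looked up at every ordinary position, so if that row is allowed to move during training it acts as a shared offset on all ordinary words.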

I haven't found a nice solution like a mask for the Embedding layer. But here's what I've been meaning to try:
Two embedding layers - one trainable and one not
The non-trainable one has all the GloVe embeddings for in-vocab words and zero vectors for others
The trainable one only maps the OOV words and special symbols
The output of these two layers is added (I was thinking of this like ResNet)
The Conv/LSTM/etc below the embedding is unchanged
That would get you a solution with a small number of free parameters allocated to those embeddings.
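As a rough back-of-the-envelope check on "a small number of free parameters", the two counts can be compared directly. The vocabulary size, embedding dimension and number of trainable rows below are hypothetical and not taken from the question.

```python
# Hypothetical sizes, for illustration only.
vocab_size = 400_000    # words covered by the frozen pre-trained table
emb_dim = 100
n_trainable_rows = 3    # e.g. padding row, one OOV row, one special symbol

frozen_params = vocab_size * emb_dim            # not updated during training
trainable_params = n_trainable_rows * emb_dim   # the only free embedding weights

print(frozen_params, trainable_params)
```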

Related

Fine-tune huggingface transformer to classify synonyms

I have a dataset of synonyms and non-synonyms. These are stored in a list of Python dictionaries like {"sentence1": <string>, "sentence2": <string>, "label": <1.0 or 0.0> }. Note that these words (or sentences) do not have to be a single token in the tokenizer.
I want to fine-tune a BERT-based model to take both sentences like: [[CLS], <sentence1_token1>, ...,<sentence1_tokenN>, [SEP], <sentence2_token1>, ..., <sentence2_tokenM>, [SEP]]. I want to take the embedding for the [CLS] token (or the pooler_output available in some models) and run it through one or more perceptron layers (MLP).
Once I have this new model with the additional layers, I want to train it using my data. I have found some examples and have been able to create the desired pipeline (using PyTorch's torch.nn for the perceptron layers, although I am open to recommendations on what is best).
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(modelname)
tokenizer = AutoTokenizer.from_pretrained(modelname)
# input_sentences1 is a list of the first sentence of every pair
# input_sentences2 is a list of the second sentence of every pair
inputs = tokenizer(input_sentences1,
                   input_sentences2,
                   add_special_tokens=True,
                   padding=True,
                   return_tensors="pt")
bert_output = model(**inputs)
# Extract embedding that will go through additional layers
pooled_output = bert_output.pooler_output
pooled_output_CLS_embedding = pooled_output[:]
## OR
# sequence_output = bert_output.last_hidden_state
# sequence_output_CLS_embedding = sequence_output[:, 0, :]
# First layer
linear1 = nn.Linear(768, 256)
linear1_output = linear1(pooled_output_CLS_embedding)
# Second layer
linear2 = nn.Linear(256, 1)
linear2_output = linear2(linear1_output)
linear2_output  # Random results because the layers have not been trained
How do I encapsulate this to facilitate training and how do I perform the fine tuning?
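One possible way to encapsulate this, sketched under assumptions rather than offered as a definitive recipe, is an `nn.Module` that owns both the encoder and the MLP head, so that a single optimizer updates both. `PairClassifier` and `ToyEncoder` are hypothetical names. `ToyEncoder` is only a stand-in so the sketch runs without downloading a model; the learning rate, the ReLU between the layers and the use of `BCEWithLogitsLoss` are illustrative choices, not something the question specifies.

```python
from types import SimpleNamespace

import torch
import torch.nn as nn


class ToyEncoder(nn.Module):
    """Stand-in for AutoModel, only so the sketch runs without a download."""

    def __init__(self, vocab_size=100, hidden_size=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids, attention_mask=None):
        return SimpleNamespace(last_hidden_state=self.emb(input_ids))


class PairClassifier(nn.Module):
    """Encoder plus a small MLP head that scores a sentence pair."""

    def __init__(self, encoder, hidden_size=768, mlp_size=256):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Sequential(
            nn.Linear(hidden_size, mlp_size),
            nn.ReLU(),                      # non-linearity between the layers
            nn.Linear(mlp_size, 1),
        )

    def forward(self, **inputs):
        outputs = self.encoder(**inputs)
        cls_embedding = outputs.last_hidden_state[:, 0, :]  # [CLS] position
        return self.head(cls_embedding).squeeze(-1)         # one logit per pair


torch.manual_seed(0)
model = PairClassifier(ToyEncoder(hidden_size=32), hidden_size=32, mlp_size=16)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.BCEWithLogitsLoss()  # takes raw logits and float 0/1 labels

# One made-up batch of 4 pairs, 12 token ids each.
batch = {"input_ids": torch.randint(0, 100, (4, 12)),
         "attention_mask": torch.ones(4, 12, dtype=torch.long)}
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])

# A single training step; a real loop would repeat this over all batches.
model.train()
optimizer.zero_grad()
logits = model(**batch)
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
```

With a real model, the encoder would be `AutoModel.from_pretrained(modelname)` and the batch would be the tokenizer output. Because the encoder is a submodule, its weights are fine-tuned together with the head.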

Getting embeddings from wav2vec2 models in HuggingFace

I am trying to get the embeddings from pre-trained wav2vec2 models (e.g., from jonatasgrosman/wav2vec2-large-xlsr-53-german) using my own dataset.
My aim is to use these features for a downstream task (not specifically speech recognition). Namely, since the dataset is relatively small, I would train an SVM with these embeddings for the final classification.
So far I have tried this:
model_name = "facebook/wav2vec2-large-xlsr-53-german"
feature_extractor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)
input_values = feature_extractor(train_dataset[:10]["speech"], return_tensors="pt", padding=True,
                                 feature_size=1, sampling_rate=16000).input_values
Then, I am not sure whether the embeddings here correspond to the sequence of last_hidden_states:
hidden_states = model(input_values).last_hidden_state
or to the sequence of features of the last conv layer of the model:
features_last_cnn_layer = model(input_values).extract_features
Also, is this the correct way to extract features from a pre-trained model?
How can one get embeddings from a specific layer?
P.S.: Posting here because the HuggingFace forum seems to be less active.
Just check the documentation:
last_hidden_state (torch.FloatTensor of shape (batch_size,
sequence_length, hidden_size)) – Sequence of hidden-states at the
output of the last layer of the model.
extract_features (torch.FloatTensor of shape (batch_size,
sequence_length, conv_dim[-1])) – Sequence of extracted feature
vectors of the last convolutional layer of the model.
The last_hidden_state vector represents so-called contextualized embeddings (i.e., every feature (CNN output) has a vector representation that is to some extent influenced by the other tokens of the sequence).
The extract_features vector represents the embeddings of your input (after the CNNs).
Also, is this the correct way to extract features from a pre-trained model?
Yes.
How can one get embeddings from a specific layer?
Set output_hidden_states=True:
o = model(input_values,output_hidden_states=True)
o.keys()
Output:
odict_keys(['last_hidden_state', 'extract_features', 'hidden_states'])
The hidden_states value contains the embeddings and the contextualized embeddings of each attention layer.
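Since the question mentions feeding the embeddings to an SVM, each utterance eventually needs one fixed-size vector. One common option (an assumption here, not something the answer prescribes) is to pick one entry of hidden_states and average it over time. The sketch below uses NumPy arrays with made-up sizes as a stand-in for the real tensors, and the frame-level mask is hypothetical: the attention mask passed to the model is defined per audio sample, not per output frame.

```python
import numpy as np

# Stand-in for `o.hidden_states`: a tuple with one array per layer plus one
# for the input embeddings, each shaped (batch, frames, hidden). Sizes are
# made up; a real model returns torch tensors, not NumPy arrays.
rng = np.random.default_rng(0)
num_layers, batch, frames, hidden = 4, 2, 5, 8
hidden_states = tuple(rng.normal(size=(batch, frames, hidden))
                      for _ in range(num_layers + 1))

layer = 2                                # which layer to read
layer_output = hidden_states[layer]      # (batch, frames, hidden)

# Hypothetical frame-level mask: 1 for real frames, 0 for padded ones.
frame_mask = np.array([[1, 1, 1, 1, 1],
                       [1, 1, 1, 0, 0]], dtype=float)

# Masked mean over time gives one fixed-size vector per utterance.
summed = (layer_output * frame_mask[:, :, None]).sum(axis=1)
utterance_vectors = summed / frame_mask.sum(axis=1, keepdims=True)
print(utterance_vectors.shape)
```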
P.S.: The jonatasgrosman/wav2vec2-large-xlsr-53-german model was trained with feat_extract_norm="layer". That means you should also pass an attention mask to the model:
model_name = "facebook/wav2vec2-large-xlsr-53-german"
feature_extractor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)
i = feature_extractor(train_dataset[:10]["speech"], return_tensors="pt", padding=True,
                      feature_size=1, sampling_rate=16000)
model(**i)

How to calculate the mean of the words' GloVe embeddings in a sentence

I have downloaded the GloVe trained matrix and used it in a Keras layer. However, I need the sentence embedding for another task.
I want to calculate the mean of all the word embeddings that are in that sentence.
What is the most efficient way to do that, since there are about 25000 sentences?
Also, I don't want to use a Lambda layer in Keras to get the mean of them.
The best way to do this is to use a GlobalAveragePooling1D layer. It receives the embeddings of the tokens inside the sentences from the Embedding layer with shape (n_sentence, n_token, emb_dim) and averages over the tokens in each sentence. The result has shape (n_sentence, emb_dim).
Here is a code example:
import numpy as np
from keras.layers import Input, Embedding, GlobalAveragePooling1D
from keras.models import Model

embedding_dim = 128
vocab_size = 100
sentence_len = 20
# random stand-ins for the GloVe matrix and the integer-encoded sentences
embedding_matrix = np.random.uniform(-1, 1, (vocab_size, embedding_dim))
test_sentences = np.random.randint(0, vocab_size, (3, sentence_len))
inp = Input((sentence_len,))
embedder = Embedding(vocab_size, embedding_dim,
                     trainable=False, weights=[embedding_matrix])(inp)
avg = GlobalAveragePooling1D()(embedder)
model = Model(inp, avg)
model.summary()
model(test_sentences)  # the mean of all the word embeddings inside sentences
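For a frozen embedding matrix, the same averages can also be computed directly in NumPy, without building a model. This is a minimal sketch that reuses the random stand-in data from the example. It treats every id as a real token, so padding ids, if present, would be counted in the mean.

```python
import numpy as np

embedding_dim, vocab_size, sentence_len = 128, 100, 20
rng = np.random.default_rng(0)
embedding_matrix = rng.uniform(-1, 1, (vocab_size, embedding_dim))
test_sentences = rng.integers(0, vocab_size, (3, sentence_len))

# Row lookup plays the role of the Embedding layer: (3, 20) -> (3, 20, 128).
token_vectors = embedding_matrix[test_sentences]
# Averaging over the token axis plays the role of GlobalAveragePooling1D.
sentence_vectors = token_vectors.mean(axis=1)   # (3, 128)
print(sentence_vectors.shape)
```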

What do input layers represent in a Hierarchical Attention Network

I'm trying to grasp the idea of a Hierarchical Attention Network (HAN). Most of the code I find online is more or less similar to the code here: https://medium.com/jatana/report-on-text-classification-using-cnn-rnn-han-f0e887214d5f :
embedding_layer = Embedding(len(word_index) + 1, EMBEDDING_DIM, weights=[embedding_matrix],
                            input_length=MAX_SENT_LENGTH, trainable=True)
sentence_input = Input(shape=(MAX_SENT_LENGTH,), dtype='int32', name='input1')
embedded_sequences = embedding_layer(sentence_input)
l_lstm = Bidirectional(LSTM(100))(embedded_sequences)
sentEncoder = Model(sentence_input, l_lstm)
review_input = Input(shape=(MAX_SENTS,MAX_SENT_LENGTH), dtype='int32', name='input2')
review_encoder = TimeDistributed(sentEncoder)(review_input)
l_lstm_sent = Bidirectional(LSTM(100))(review_encoder)
preds = Dense(len(macronum), activation='softmax')(l_lstm_sent)
model = Model(review_input, preds)
My question is: What do the input layers here represent? I'm guessing that input1 represents the sentences wrapped with the embedding layer, but in that case what is input2? Is it the output of the sentEncoder? In that case it should be a float, or if it's another layer of embedded words, then it should be wrapped with an embedding layer as well.
The HAN model processes the text in a hierarchy: it takes a document already split into sentences (that's why the shape of input2 is (MAX_SENTS,MAX_SENT_LENGTH)); then it processes each sentence independently using the sentEncoder model (that's why the shape of input1 is (MAX_SENT_LENGTH,)), and finally it processes all the encoded sentences together.
So in your code the whole model is stored in model and its input layer is input2, which you would feed with documents that have been split into sentences and whose words have been integer encoded (to make them compatible with the embedding layer). The other input layer belongs to the sentEncoder model, which is used inside the model (and not directly by you):
review_encoder = TimeDistributed(sentEncoder)(review_input)
Masoud's answer is correct but I'll rewrite it here in my own words:
The data (X_train) is fed as indexes to the model and is received by input2
X_train is then forwarded to the encoder model and is received by input1
input1 is wrapped by an embedding layer so the indexes are converted to vectors
So input2 is more of a proxy for the model's input.
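The hand-off between the two inputs can be sketched in NumPy. This is only a conceptual picture of what TimeDistributed(sentEncoder) does with the integer ids. The sizes are made up, and `sentence_encoder` is a hypothetical stand-in that averages word vectors where the real model uses a bidirectional LSTM.

```python
import numpy as np

MAX_SENTS, MAX_SENT_LENGTH, EMBEDDING_DIM = 3, 4, 5   # made-up sizes
vocab_size = 10
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, EMBEDDING_DIM))


def sentence_encoder(word_ids):
    """Stand-in for sentEncoder: (n, MAX_SENT_LENGTH) ints -> (n, EMBEDDING_DIM).

    The real model uses a BiLSTM; a mean over words keeps the sketch short.
    """
    return embedding_matrix[word_ids].mean(axis=1)


# input2: integer word ids for 2 documents, already split into sentences.
reviews = rng.integers(0, vocab_size, (2, MAX_SENTS, MAX_SENT_LENGTH))

# Conceptually, TimeDistributed runs the sentence encoder on every sentence
# slice (this is what arrives at input1) ...
flat = reviews.reshape(-1, MAX_SENT_LENGTH)           # (2 * 3, 4)
encoded = sentence_encoder(flat)                       # (6, 5)
# ... and stacks the results back per document for the second LSTM.
review_encoded = encoded.reshape(2, MAX_SENTS, -1)     # (2, 3, 5)
print(review_encoded.shape)
```

So input2 only ever sees integers; floats first appear after the embedding lookup inside the sentence encoder.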

Using pretrained gensim Word2vec embedding in keras

I have trained word2vec in gensim. In Keras, I want to use it to build the matrix of a sentence from those word embeddings. Storing the matrices of all the sentences is very space- and memory-inefficient, so I want to make an embedding layer in Keras instead, so that it can be used in further layers (LSTM). Can you tell me in detail how to do this?
PS: It is different from other questions because I am using gensim for word2vec training instead of keras.
Let's say you have the following data that you need to encode:
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']
You must then tokenize it using the Tokenizer from Keras like this and find the vocab_size:
t = Tokenizer()
t.fit_on_texts(docs)
vocab_size = len(t.word_index) + 1
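To make the `+ 1` concrete, here is a simplified pure-Python stand-in for what the Tokenizer's word_index looks like for these documents. `build_word_index` is a hypothetical helper, not the Keras implementation: it lowercases, strips punctuation and numbers words from 1 by descending frequency, which leaves index 0 free for padding.

```python
import string

docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!',
        'Weak', 'Poor effort!', 'not good', 'poor work',
        'Could have done better.']


def build_word_index(texts):
    """Simplified stand-in for Tokenizer.fit_on_texts + word_index."""
    strip = str.maketrans(string.punctuation, " " * len(string.punctuation))
    counts = {}
    for text in texts:
        for word in text.lower().translate(strip).split():
            counts[word] = counts.get(word, 0) + 1
    # Most frequent word gets index 1; ties keep first-seen order.
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: i + 1 for i, (word, _) in enumerate(ordered)}


word_index = build_word_index(docs)
vocab_size = len(word_index) + 1   # +1 because index 0 is never assigned
print(word_index["work"], vocab_size)
```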
You can then encode it to sequences like this
encoded_docs = t.texts_to_sequences(docs)
print(encoded_docs)
You can then pad the sequences so that all the sequences are of a fixed length
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
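What the padding step does can be sketched in a few lines of plain Python. `pad_post` is a hypothetical helper that mimics `padding='post'` together with the default `truncating='pre'`; the integer sequences are made up rather than the real Tokenizer output.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad on the right with `value`; sequences longer than `maxlen` keep
    their last `maxlen` items, as with Keras' default truncating='pre'."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


# Hypothetical integer sequences, not the real Tokenizer output.
example = [[6, 2], [3, 1], [5, 6, 7, 8, 9]]
padded_example = pad_post(example, maxlen=4)
print(padded_example)
```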
Then use the word2vec model to make embedding matrix
from numpy import asarray, zeros

# load embedding as a dict
def load_embedding(filename):
    # load embedding into memory, skip first line
    with open(filename, 'r') as file:
        lines = file.readlines()[1:]
    # create a map of words to vectors
    embedding = dict()
    for line in lines:
        parts = line.split()
        # key is string word, value is numpy array for vector
        embedding[parts[0]] = asarray(parts[1:], dtype='float32')
    return embedding

# create a weight matrix for the Embedding layer from a loaded embedding
def get_weight_matrix(embedding, vocab):
    # total vocabulary size plus 0 for unknown words
    vocab_size = len(vocab) + 1
    # define weight matrix dimensions with all 0
    weight_matrix = zeros((vocab_size, 100))
    # step vocab, store vectors using the Tokenizer's integer mapping
    for word, i in vocab.items():
        vector = embedding.get(word)
        # words missing from the embedding file keep their all-zero row
        if vector is not None:
            weight_matrix[i] = vector
    return weight_matrix

# load embedding from file
raw_embedding = load_embedding('embedding_word2vec.txt')
# get vectors in the right order
embedding_vectors = get_weight_matrix(raw_embedding, t.word_index)
Once you have the embedding matrix you can use it in Embedding layer like this
e = Embedding(vocab_size, 100, weights=[embedding_vectors], input_length=4, trainable=False)
This layer can be used in making a model like this
from keras.models import Sequential
from keras.layers import Dense, Flatten, Embedding

model = Sequential()
e = Embedding(vocab_size, 100, weights=[embedding_vectors], input_length=4, trainable=False)
model.add(e)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
model.summary()
# fit the model (labels is assumed to hold one 0/1 class label per document)
model.fit(padded_docs, labels, epochs=50, verbose=0)
All the code is adapted from this awesome blog post. Follow it to learn more about embeddings using GloVe.
For using word2vec, see this post.
With Gensim 3.x this is pretty easy (the method was removed in Gensim 4.0):
w2v_model.wv.get_keras_embedding(train_embeddings=False)
There you have your Keras embedding layer.
My code for a gensim-trained w2v model. Assume the words the w2v model was trained on are in a list variable called all_words.
from keras.preprocessing.text import Tokenizer
from keras.layers import Embedding
import gensim
import numpy as np

w2v = gensim.models.Word2Vec.load("models/w2v.model")
t = Tokenizer()
vocab_size = len(all_words) + 1
t.fit_on_texts(all_words)

def get_weight_matrix():
    # define weight matrix dimensions with all 0
    weight_matrix = np.zeros((vocab_size, w2v.vector_size))
    # row i + 1 holds the vector of all_words[i]; row 0 stays zero for padding
    # (this assumes the Tokenizer numbers the words in the same order)
    for i in range(len(all_words)):
        weight_matrix[i + 1] = w2v.wv[all_words[i]]
    return weight_matrix

embedding_vectors = get_weight_matrix()
# FIXED_LENGTH is the padded sequence length, defined elsewhere
emb_layer = Embedding(vocab_size, output_dim=w2v.vector_size, weights=[embedding_vectors], input_length=FIXED_LENGTH, trainable=False)
