I'm trying to grasp the idea of a Hierarchical Attention Network (HAN); most of the code I find online is more or less similar to the one here: https://medium.com/jatana/report-on-text-classification-using-cnn-rnn-han-f0e887214d5f :
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, TimeDistributed, Dense
from tensorflow.keras.models import Model

embedding_layer = Embedding(len(word_index) + 1, EMBEDDING_DIM, weights=[embedding_matrix],
                            input_length=MAX_SENT_LENGTH, trainable=True)

# Sentence encoder: turns one sentence (a sequence of word indices) into a single vector
sentence_input = Input(shape=(MAX_SENT_LENGTH,), dtype='int32', name='input1')
embedded_sequences = embedding_layer(sentence_input)
l_lstm = Bidirectional(LSTM(100))(embedded_sequences)
sentEncoder = Model(sentence_input, l_lstm)

# Document encoder: applies sentEncoder to every sentence, then encodes the sentence vectors
review_input = Input(shape=(MAX_SENTS, MAX_SENT_LENGTH), dtype='int32', name='input2')
review_encoder = TimeDistributed(sentEncoder)(review_input)
l_lstm_sent = Bidirectional(LSTM(100))(review_encoder)
preds = Dense(len(macronum), activation='softmax')(l_lstm_sent)
model = Model(review_input, preds)
My question is: what do the input layers here represent? I'm guessing that input1 represents the sentences wrapped with the embedding layer, but in that case what is input2? Is it the output of sentEncoder? If so, it should be floats; or if it's another layer of embedded words, then it should be wrapped with an embedding layer as well.
The HAN model processes the text as a hierarchy: it takes a document that has already been split into sentences (that's why the shape of input2 is (MAX_SENTS, MAX_SENT_LENGTH)); it then processes each sentence independently using the sentEncoder model (that's why the shape of input1 is (MAX_SENT_LENGTH,)), and finally it processes all the encoded sentences together.
So in your code the whole model is stored in model, and its input layer is input2, which you feed with documents that have been split into sentences and whose words have been integer-encoded (to make them compatible with the embedding layer). The other input layer belongs to the sentEncoder model, which is used inside model (and not directly by you):
review_encoder = TimeDistributed(sentEncoder)(review_input)
Masoud's answer is correct but I'll rewrite it here in my own words:
- The data (X_train) is fed as indexes to the model and is received by input2.
- X_train is then forwarded to the encoder model and is received by input1.
- input1 is wrapped by an embedding layer, so the indexes are converted to vectors.

So input2 is more of a proxy for the model's input.
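To make that concrete, here is a rough sketch (the variable names docs and labels are my own, not from the article) of the array you would feed to input2: every document becomes a MAX_SENTS x MAX_SENT_LENGTH grid of word indices, padded with zeros.

import numpy as np

# `docs` is assumed to be a list of documents, each a list of sentences,
# each sentence a list of word indices from the same word_index used above.
X = np.zeros((len(docs), MAX_SENTS, MAX_SENT_LENGTH), dtype='int32')
for i, doc in enumerate(docs):
    for j, sent in enumerate(doc[:MAX_SENTS]):
        for k, word_idx in enumerate(sent[:MAX_SENT_LENGTH]):
            X[i, j, k] = word_idx

# X goes to input2; input1 is fed internally, one sentence at a time,
# by the TimeDistributed(sentEncoder) wrapper.
model.fit(X, labels, epochs=10)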
I have a Keras model that takes text as input and vectorizes it via a TextVectorization layer with multi_hot encoding. The problem is that this layer outputs a 2-D tensor of shape (None, vocabulary_size). The next layer, SimpleRNN, requires a tensor with 3 dimensions. How can I convert this output without using an Embedding layer?
I have already tried different layers and reshaping the np arrays. Here is my current code:
import numpy as np
from tensorflow.keras import Sequential, layers
from tensorflow.keras.layers import Input, Dense, Embedding, TextVectorization

X_Train = np.asarray(X_Train)
X_Test = np.asarray(X_Test)
Y_Train = np.asarray(Y_Train)
Y_Test = np.asarray(Y_Test)

vectorizer = TextVectorization(standardize=None, split=None, output_mode="multi_hot")
vectorizer.adapt(X_Train)

model = Sequential()
model.add(Input(shape=(1,), dtype="string"))
model.add(vectorizer)
print(vectorizer.output_shape)
#model.add(Embedding(datasetSize, 2))
model.add(layers.SimpleRNN(nodes, activation=activationFunc))
model.add(Dense(1, activation=None, use_bias=False))
model.compile(optimizer=optimizerFunc, loss=lossFunc)
model.summary()
It works when using an Embedding layer, but I need it to work without it.
My data is basically a CSV file with the format:

Word,Weight
word1,1
word2,2
The goal is to let the network predict the weight of any given word based on this data.
Any idea how to make it work with this?
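For reference, here is an untested sketch of one possible way (my own assumption, not necessarily what you are after) to give the RNN the 3-D input it expects without an Embedding layer: treat the whole multi-hot vector as a single timestep by adding a Reshape layer between the vectorizer and the RNN.

model = Sequential()
model.add(Input(shape=(1,), dtype="string"))
model.add(vectorizer)
# (batch, vocab_size) -> (batch, 1, vocab_size): one timestep whose features are the multi-hot vector
model.add(layers.Reshape((1, -1)))
model.add(layers.SimpleRNN(nodes, activation=activationFunc))
model.add(Dense(1, activation=None, use_bias=False))
model.compile(optimizer=optimizerFunc, loss=lossFunc)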
I have a dataset of synonyms and non-synonyms. These are stored in a list of Python dictionaries like {"sentence1": <string>, "sentence2": <string>, "label": <1.0 or 0.0>}. Note that these words (or sentences) do not have to be a single token in the tokenizer.
I want to fine-tune a BERT-based model to take both sentences like: [[CLS], <sentence1_token1>, ..., <sentence1_tokenN>, [SEP], <sentence2_token1>, ..., <sentence2_tokenM>, [SEP]]. I want to take the embedding for the [CLS] token (or the pooled_output available in some models) and run it through one or more perceptron layers (MLP).
Once I have this new model with the additional layers, I want to train it using my data. I have found some examples and I have been able to create the desired pipeline (using PyTorch's torch.nn for the perceptron layers, although I am open to recommendations on what is best).
from transformers import AutoModel, AutoTokenizer
import torch.nn as nn

model = AutoModel.from_pretrained(modelname)
tokenizer = AutoTokenizer.from_pretrained(modelname)

# input_sentences1 is a list of the first sentence of every pair
# input_sentences2 is a list of the second sentence of every pair
input = tokenizer(input_sentences1,
                  input_sentences2,
                  add_special_tokens=True,
                  padding=True,
                  return_tensors="pt")
bert_output = model(**input)

# Extract the embedding that will go through the additional layers
pooled_output = bert_output.pooler_output
pooled_output_CLS_embedding = pooled_output[:]
## OR
# sequence_output = bert_output.last_hidden_state
# sequence_output_CLS_embedding = sequence_output[:, 0, :]

# First layer
linear1 = nn.Linear(768, 256)
linear1_output = linear1(pooled_output_CLS_embedding)

# Second layer
linear2 = nn.Linear(256, 1)
linear2_output = linear2(linear1_output)

linear2_output  # Random results because the layers have not been trained
How do I encapsulate this to facilitate training and how do I perform the fine tuning?
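Not an official recipe, but one common way to encapsulate this is to wrap the pretrained model and the new layers in a single torch.nn.Module so that everything is trained end to end. In the sketch below the class name, the ReLU between the linear layers, the AdamW hyperparameters and the BCEWithLogitsLoss are my own assumptions:

import torch
import torch.nn as nn
from transformers import AutoModel

class SynonymClassifier(nn.Module):  # hypothetical name
    def __init__(self, modelname):
        super().__init__()
        self.bert = AutoModel.from_pretrained(modelname)
        self.linear1 = nn.Linear(self.bert.config.hidden_size, 256)
        self.activation = nn.ReLU()
        self.linear2 = nn.Linear(256, 1)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        bert_output = self.bert(input_ids=input_ids,
                                attention_mask=attention_mask,
                                token_type_ids=token_type_ids)
        cls_embedding = bert_output.pooler_output  # or bert_output.last_hidden_state[:, 0, :]
        return self.linear2(self.activation(self.linear1(cls_embedding)))

# Minimal training step, assuming `input` from the tokenizer above and
# `labels` as a float tensor of 0.0/1.0 values:
clf = SynonymClassifier(modelname)
optimizer = torch.optim.AdamW(clf.parameters(), lr=2e-5)
loss_fn = nn.BCEWithLogitsLoss()

logits = clf(**input).squeeze(-1)
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()

In practice you would iterate over mini-batches from a DataLoader and call optimizer.zero_grad() before each step, but the structure stays the same.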
I am trying to get the embeddings from pre-trained wav2vec2 models (e.g., from jonatasgrosman/wav2vec2-large-xlsr-53-german) using my own dataset.
My aim is to use these features for a downstream task (not specifically speech recognition). Namely, since the dataset is relatively small, I would train an SVM with these embeddings for the final classification.
So far I have tried this:
model_name = "facebook/wav2vec2-large-xlsr-53-german"
feature_extractor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)
input_values = feature_extractor(train_dataset[:10]["speech"], return_tensors="pt", padding=True,
feature_size=1, sampling_rate=16000 ).input_values
Then, I am not sure whether the embeddings here correspond to the sequence of last_hidden_states:
hidden_states = model(input_values).last_hidden_state
or to the sequence of features of the last conv layer of the model:
features_last_cnn_layer = model(input_values).extract_features
Also, is this the correct way to extract features from a pre-trained model?
How can one get embeddings from a specific layer?
P.S.: Posting here as the HuggingFace forum seems to be less active.
Just check the documentation:
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.

extract_features (torch.FloatTensor of shape (batch_size, sequence_length, conv_dim[-1])) – Sequence of extracted feature vectors of the last convolutional layer of the model.
The last_hidden_state vector represents the so-called contextualized embeddings (i.e. every feature (CNN output) has a vector representation that is to some extent influenced by the other tokens of the sequence).
The extract_features vector represents the embeddings of your input (after the CNNs).
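As a quick sanity check (the exact sizes depend on the checkpoint; for a wav2vec2-large model hidden_size is 1024 and conv_dim[-1] is 512):

out = model(input_values)
print(out.last_hidden_state.shape)   # (batch_size, sequence_length, hidden_size)   – contextualized embeddings
print(out.extract_features.shape)    # (batch_size, sequence_length, conv_dim[-1])  – CNN feature vectors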
Also, is this the correct way to extract features from a pre-trained model?
Yes.
How can one get embeddings from a specific layer?
Set output_hidden_states=True:
o = model(input_values, output_hidden_states=True)
o.keys()
Output:
odict_keys(['last_hidden_state', 'extract_features', 'hidden_states'])
The hidden_states value contains the embeddings and the contextualized embeddings of each attention layer.
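For example, to take the embeddings of one particular transformer layer (the layer index below is arbitrary; the first entry of the tuple is the input to the first transformer layer rather than a layer output):

o = model(input_values, output_hidden_states=True)
hidden_states = o.hidden_states     # tuple of tensors of shape (batch_size, sequence_length, hidden_size)
print(len(hidden_states))           # number of transformer layers + 1
layer_9 = hidden_states[9]          # e.g. embeddings after the 9th transformer layer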
P.S.: The jonatasgrosman/wav2vec2-large-xlsr-53-german model was trained with feat_extract_norm=="layer". That means you should also pass an attention mask to the model:
model_name = "facebook/wav2vec2-large-xlsr-53-german"
feature_extractor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)
i= feature_extractor(train_dataset[:10]["speech"], return_tensors="pt", padding=True,
feature_size=1, sampling_rate=16000 )
model(**i)
I trained and tested a CNN for sentiment analysis. The train and test data were prepared the same way, by tokenizing the sentences and assigning unique integers:
from tensorflow.keras.preprocessing.text import Tokenizer
import numpy as np

tokenizer = Tokenizer(filters='$%&()*/:;<=>#[\\]^`{|}~\t\n')
tokenizer.fit_on_texts(text)
vocab_size = len(tokenizer.word_index) + 1
sequences = tokenizer.texts_to_sequences(text)
Then I used a pre-trained GloVe model to create the embedding matrix for the CNN:
filepath_glove = 'glove.twitter.27B.200d.txt'
glove_vocab = []
glove_embd = []
embedding_dict = {}

file = open(filepath_glove, 'r', encoding='UTF-8')
for line in file.readlines():
    row = line.strip().split(' ')
    vocab_word = row[0]
    glove_vocab.append(vocab_word)
    embed_vector = [float(i) for i in row[1:]]  # convert to list of float
    embedding_dict[vocab_word] = embed_vector
file.close()
embedding_matrix = np.zeros((vocab_size, 200))
for word, index in tokenizer.word_index.items():
    if word in embedding_dict:  # words without a GloVe vector keep an all-zeros row
        embedding_matrix[index] = embedding_dict[word]
At this point I also used the test sentences to create this matrix, which was later passed as weights into the embedding layer:
e = Embedding(vocab_size, 200, input_length=maxSeqLength, weights=[embedding_matrix], trainable=False)(inp)
Now I want to reload my model and test it with some new data, but that would mean the embedding matrix won't include some words from the new data. This makes me wonder whether I should not have included the test data while creating the embedding matrix in the first place? And if not, how does the embedding layer work for those new words? This part is similar to this question, but I couldn't find an answer:
How does the Keras Embedding Layer work if word is not found?
Thanks
It's quite simple. You are providing the vocab_size, which is the number of words the embedding layer knows. If you pass an index that is out of bounds of the vocab_size (a new word), it will be ignored, or an error will be thrown by Keras.
This answers your question regarding whether you should include all data in your embedding matrix: yes, you should.
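If you also want unseen test-time words to be handled explicitly rather than silently ignored, one option (my suggestion, not part of the answer above) is to reserve a dedicated out-of-vocabulary index when fitting the tokenizer; train_text and new_text below are placeholders for your own data:

from tensorflow.keras.preprocessing.text import Tokenizer

# '<OOV>' gets its own index, so unknown words at test time map to it
# instead of being dropped from the sequences.
tokenizer = Tokenizer(filters='$%&()*/:;<=>#[\\]^`{|}~\t\n', oov_token='<OOV>')
tokenizer.fit_on_texts(train_text)
vocab_size = len(tokenizer.word_index) + 1

test_sequences = tokenizer.texts_to_sequences(new_text)  # unknown words -> the '<OOV>' index

The row of embedding_matrix at the OOV index (all zeros, or a random vector) then stands in for every word the model has never seen.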
I'm working on text summarization, starting with the pointer-generator network described in the paper: https://arxiv.org/pdf/1704.04368.pdf, with a code release at: https://github.com/abisee/pointer-generator
I want to add hierarchical encoding to this network to be able to handle larger input documents for summarization. Right now, input documents are truncated at a length of 400 words because LSTMs have trouble keeping memory over very long inputs. We want to reweight the attention distribution by taking the sentence-level attention into account.
The reweighting I am considering is from the paper: https://arxiv.org/pdf/1602.06023.pdf, where the final distribution is

P_a(j) = P_a^w(j) * P_a^s(s(j)) / sum_{k=1}^{N_d} P_a^w(k) * P_a^s(s(k))     (equation 1)

where "P_a^w(j) is the word-level attention weight at jth position of the source document, and s(j) is the ID of the sentence at jth word position, P_a^s(l) is the sentence-level attention weight for the lth sentence in the source, N_d is the number of words in the source document, and P_a(j) is the re-scaled attention at the jth word position".
The Pointer-Generator attention is calculated as

e^t_j = v^T tanh(W_h h_j + W_s s_t + b_attn)

then

a^t = softmax(e^t)
So I think I need to add another pointer-generator-style attention over a sentence-level LSTM, feed its outputs through the same attention calculation with softmax as for the words, and then at the end reweight as described in equation 1.
I have added a new bidirectional LSTM under def _add_encoder. I am wondering how I should handle the input to the sentence LSTM. The model needs some kind of indication of the position of each sentence and of the words in it. How should I structure the sentences so they can be processed by the sentence LSTM?
def _add_encoder(self, encoder_inputs, seq_len):
    """Add a single-layer bidirectional LSTM encoder to the graph.

    Args:
      encoder_inputs: A tensor of shape [batch_size, <=max_enc_steps, emb_size].
      seq_len: Lengths of encoder_inputs (before padding). A tensor of shape [batch_size].

    Returns:
      encoder_outputs:
        A tensor of shape [batch_size, <=max_enc_steps, 2*hidden_dim]. It's 2*hidden_dim because it's the concatenation of the forwards and backwards states.
      fw_state, bw_state:
        Each are LSTMStateTuples of shape ([batch_size, hidden_dim], [batch_size, hidden_dim])
    """
    with tf.variable_scope('encoder'):
        cell_fw = tf.contrib.rnn.LSTMCell(self._hps.hidden_dim, initializer=self.rand_unif_init, state_is_tuple=True)
        cell_bw = tf.contrib.rnn.LSTMCell(self._hps.hidden_dim, initializer=self.rand_unif_init, state_is_tuple=True)
        (encoder_outputs, (fw_st, bw_st)) = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, encoder_inputs, dtype=tf.float32, sequence_length=seq_len, swap_memory=True)
        encoder_outputs = tf.concat(axis=2, values=encoder_outputs)  # concatenate the forwards and backwards states

        sentence_encoder_inputs = ???  # how should these be built from the sentences?
        sentence_lens = len(sentence_encoder_inputs)

        ### NEW CODE ###
        sentence_cell_fw = tf.contrib.rnn.LSTMCell(self._hps.hidden_dim, initializer=self.rand_unif_init, state_is_tuple=True)
        sentence_cell_bw = tf.contrib.rnn.LSTMCell(self._hps.hidden_dim, initializer=self.rand_unif_init, state_is_tuple=True)
        (sentence_encoder_outputs, (sentence_fw_st, sentence_bw_st)) = tf.nn.bidirectional_dynamic_rnn(sentence_cell_fw, sentence_cell_bw, sentence_encoder_inputs, dtype=tf.float32, sequence_length=sentence_lens, swap_memory=True)
        sentence_encoder_outputs = tf.concat(axis=2, values=sentence_encoder_outputs)  # concatenate the forwards and backwards states
    return encoder_outputs, sentence_encoder_outputs, fw_st, bw_st
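To show what the reweighting from equation 1 might look like once both attention distributions exist, here is a rough TensorFlow sketch (my own, not code from the pointer-generator repo); word_attn, sent_attn, word_to_sent and num_sents are assumed tensors/values:

import tensorflow as tf

# word_attn:    [batch_size, N_d]        word-level attention P_a^w(j)
# sent_attn:    [batch_size, num_sents]  sentence-level attention P_a^s(l)
# word_to_sent: [batch_size, N_d] int32  sentence id s(j) of every word position j

# Broadcast each word's sentence-level attention onto the word positions via a one-hot lookup.
sent_onehot = tf.one_hot(word_to_sent, depth=num_sents)                                  # [batch_size, N_d, num_sents]
sent_attn_per_word = tf.reduce_sum(sent_onehot * tf.expand_dims(sent_attn, 1), axis=2)   # [batch_size, N_d]

# Equation 1: multiply word- and sentence-level attention, then renormalize over the document.
unnormalized = word_attn * sent_attn_per_word
rescaled_attn = unnormalized / tf.reduce_sum(unnormalized, axis=1, keep_dims=True)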