I have questions about Google's implementation of the Transformer here.
In the train_step(input, tar) function: inp is a 256x40 tensor and the transformer returns a 256x39x8089 tensor. Is each row in inp a sentence? I expected the Transformer to take a batch of sentences (a batch of 2D matrices in which each row is a word), compute the attention weights and outputs at once, and then pass them to the decoder (see here). However, I cannot see that being implemented in the code.
In the train_step(input, tar) function: "predictions" is a 256x39x8089 tensor. Is it [batch size, max number of words in a sentence, target vocab size]? How does loss_function calculate the loss when this shape is different from that of `tar_real`, which is [256, 39]?
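For what it's worth, my understanding is that a sparse categorical cross-entropy can consume exactly these two shapes; a minimal sketch (my assumption, not necessarily the tutorial's exact loss_function):

import tensorflow as tf

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
    # real: [batch, seq_len] integer token IDs; pred: [batch, seq_len, vocab] logits
    mask = tf.math.logical_not(tf.math.equal(real, 0))  # ignore padding (token ID 0)
    loss_ = loss_object(real, pred)                      # per-token loss, [batch, seq_len]
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_sum(loss_) / tf.reduce_sum(mask)    # mean over non-padding tokens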
In def evaluate(inp_sentence): why does each iteration send the entire encoder input to the Transformer? What I expected is that the encoder computes its attention weights and output once, and then inside the for loop we send only that attention output and the predictions so far.
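What I had in mind is something like the following sketch, where the encoder runs once and only the decoder runs inside the loop. The names transformer.encoder, transformer.decoder and transformer.final_layer are my assumptions about how the model could be split up, not the tutorial's actual API:

# Hypothetical decoding loop: encode once, decode step by step.
enc_output = transformer.encoder(encoder_input, False, enc_padding_mask)  # run encoder once

output = tf.expand_dims([start_token], 0)  # decoder input generated so far, shape [1, 1]
for i in range(MAX_LENGTH):
    look_ahead_mask = create_look_ahead_mask(tf.shape(output)[1])
    dec_output, _ = transformer.decoder(output, enc_output, False,
                                        look_ahead_mask, enc_padding_mask)
    logits = transformer.final_layer(dec_output)                     # [1, cur_len, vocab]
    predicted_id = tf.argmax(logits[:, -1:, :], axis=-1, output_type=tf.int32)
    if predicted_id == end_token:
        break
    output = tf.concat([output, predicted_id], axis=-1)              # feed prediction back in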
Thank you
Normally all inputs fed to BiLSTM are of shape [batch_size, time_steps, input_size].
However, I'm working on a problem of Automatic Grading of an Essay in which there's an extra dimension called number of sentences in each essay. So in my case, a typical batch after embedding using word2vec, is of shape [2,16,25,300].
Here, there are 2 essays in each batch (batch_size=2), each essay has 16 sentences, each sentence is 25 words long(time_step=25) and I'm using word2vec of 300 dimensions (input_size=300).
So clearly I need to loop this batch over dimension 16 somehow such that the shape of the input becomes [2,25,300] in each iteration. I have tried for a long time but haven't been able to find a way to do it. For example, if you put tf.nn.bidirectional_dynamic_rnn() inside a loop, it gives an error on the second iteration saying that the tf.nn.bidirectional_dynamic_rnn() kernel already exists. I can't directly write a for loop over the sentence dimension because these are tensors of shape [None,None,None,300]; I only gave concrete values for the sake of simplicity. Is there any other way to do it? Thanks. Please note that I am not using Keras or any other framework.
Here's a sample encoding layer for reference.
def bidirectional_lstm(input_data):
    cell = tf.nn.rnn_cell.LSTMCell(num_units=200, state_is_tuple=True)
    outputs, states = tf.nn.bidirectional_dynamic_rnn(cell_fw=cell,
                                                      cell_bw=cell,
                                                      dtype=tf.float32,
                                                      inputs=input_data)
    return tf.concat(outputs, 2)
Here input_data (the embedded batch) is of shape [2,16,25,300].
And here's a sample input:
for batch_i, texts_batch in enumerate(get_batches(X_train, batch_size)):  # get_batches() yields batches one by one
    # X_train is of shape [2,16,25] ([batch_size, sentence_length, num_words])
    # word2vec lookup
    embeddings = tf.nn.embedding_lookup(word_embedding_matrix, texts_batch)
    # embeddings shape is now [2,16,25,300] ([batch_size, sentence_length, num_words, word2vec_dim])
    # Now I need some kind of loop here over the sentence dimension. Can't use a Python for loop since these are all placeholders with dimensions None.
    # ??? output = bidirectional_lstm(embeddings,3,200,0.7) ???  This would be correct if there were no sentence dimension.
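One workaround, instead of looping in Python: merge the batch and sentence dimensions into one, run the BiLSTM once over every sentence, and split the dimensions again afterwards. A sketch under the assumption that the same BiLSTM weights should apply to every sentence of every essay:

import tensorflow as tf

def sentence_level_bilstm(embeddings, num_units=200):
    """Sketch: fold the sentence dimension into the batch dimension so a single
    bidirectional_dynamic_rnn call handles every sentence with shared weights.
    embeddings: [batch_size, num_sentences, num_words, embed_dim]
    returns:    [batch_size, num_sentences, num_words, 2 * num_units]
    """
    shape = tf.shape(embeddings)
    batch_size, num_sentences, num_words = shape[0], shape[1], shape[2]
    embed_dim = embeddings.get_shape().as_list()[-1]  # static, e.g. 300

    # [batch_size * num_sentences, num_words, embed_dim]
    flat = tf.reshape(embeddings, [batch_size * num_sentences, num_words, embed_dim])

    with tf.variable_scope('sentence_bilstm', reuse=tf.AUTO_REUSE):
        cell_fw = tf.nn.rnn_cell.LSTMCell(num_units, state_is_tuple=True)
        cell_bw = tf.nn.rnn_cell.LSTMCell(num_units, state_is_tuple=True)
        outputs, _ = tf.nn.bidirectional_dynamic_rnn(
            cell_fw, cell_bw, flat, dtype=tf.float32)
    outputs = tf.concat(outputs, 2)  # [batch*sentences, num_words, 2*num_units]

    # Restore the sentence dimension.
    return tf.reshape(outputs, [batch_size, num_sentences, num_words, 2 * num_units])

An alternative under the same assumption is tf.map_fn over the sentence dimension with the RNN built inside a reused variable scope, but the reshape trick avoids any per-sentence loop entirely.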
I'm working on text summarization, starting with the pointer-generator network described in the paper: https://arxiv.org/pdf/1704.04368.pdf, with a code release at: https://github.com/abisee/pointer-generator
I want to add hierarchical encoding to this network to be able to handle larger input documents for summarization. Right now, input documents are truncated at a length of 400 words because LSTMs have trouble keeping memory over very long inputs. We want to reweight the attention distribution by taking the sentence structure of the input into account.
The reweighting I am considering is from the paper https://arxiv.org/pdf/1602.06023.pdf, where the final distribution is

P_a(j) = P_a^w(j) * P_a^s(s(j)) / sum_{k=1}^{N_d} P_a^w(k) * P_a^s(s(k))    (1)

where "P_a^w(j) is the word-level attention weight at jth position of the source document, and s(j) is the ID of the sentence at jth word position, P_a^s(l) is the sentence-level attention weight for the lth sentence in the source, N_d is the number of words in the source document, and P_a(j) is the re-scaled attention at the jth word position".
The Pointer-Generator attention is calculated as

e_i^t = v^T tanh(W_h h_i + W_s s_t + b_attn)

then

a^t = softmax(e^t)
So I think I need to add another Pt-gen attention LSTM for the sentences, feed its outputs through the same attention calculation with softmax as is used for the words, and then at the end reweight as described in equation (1).
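The reweighting itself looks cheap once both attention distributions exist; a sketch of equation (1) with made-up tensor names (word_attn, sent_attn, sentence_ids are my placeholders, not variables from the repo):

import tensorflow as tf

def rescale_attention(word_attn, sent_attn, sentence_ids, num_sentences):
    """Sketch of eq. (1): P_a(j) = P_a^w(j) * P_a^s(s(j)) / sum_k P_a^w(k) * P_a^s(s(k)).
    word_attn:    [batch, num_words]      word-level attention P_a^w(j)
    sent_attn:    [batch, num_sentences]  sentence-level attention P_a^s(l)
    sentence_ids: [batch, num_words] int  sentence index s(j) of each word position
    """
    # One-hot gather of P_a^s(s(j)) for every word position (portable across TF versions).
    sent_onehot = tf.one_hot(sentence_ids, num_sentences)               # [batch, words, sents]
    sent_attn_per_word = tf.reduce_sum(
        sent_onehot * tf.expand_dims(sent_attn, 1), axis=-1)            # [batch, words]
    unnormalised = word_attn * sent_attn_per_word
    norm = tf.expand_dims(tf.reduce_sum(unnormalised, axis=1), 1)       # [batch, 1]
    return unnormalised / norm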
I have added a new bidirectional LSTM under def _add_encoder. I am wondering how I should handle the input to the sentence LSTM. The model needs some kind of indication of the position of each sentence and of the words in it. How should I structure the sentences so they can be processed by the sentence LSTM?
def _add_encoder(self, encoder_inputs, seq_len):
    """Add a single-layer bidirectional LSTM encoder to the graph.

    Args:
      encoder_inputs: A tensor of shape [batch_size, <=max_enc_steps, emb_size].
      seq_len: Lengths of encoder_inputs (before padding). A tensor of shape [batch_size].

    Returns:
      encoder_outputs:
        A tensor of shape [batch_size, <=max_enc_steps, 2*hidden_dim]. It's 2*hidden_dim because it's the concatenation of the forwards and backwards states.
      fw_state, bw_state:
        Each are LSTMStateTuples of shape ([batch_size,hidden_dim],[batch_size,hidden_dim])
    """
    with tf.variable_scope('encoder'):
        cell_fw = tf.contrib.rnn.LSTMCell(self._hps.hidden_dim, initializer=self.rand_unif_init, state_is_tuple=True)
        cell_bw = tf.contrib.rnn.LSTMCell(self._hps.hidden_dim, initializer=self.rand_unif_init, state_is_tuple=True)
        (encoder_outputs, (fw_st, bw_st)) = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, encoder_inputs, dtype=tf.float32, sequence_length=seq_len, swap_memory=True)
        encoder_outputs = tf.concat(axis=2, values=encoder_outputs)  # concatenate the forwards and backwards states

        sentence_encoder_inputs = ???
        sentence_lens = len(sentence_encoder_inputs)  # should be per-example sentence counts, shape [batch_size]
        ### NEW CODE ###
        sentence_cell_fw = tf.contrib.rnn.LSTMCell(self._hps.hidden_dim, initializer=self.rand_unif_init, state_is_tuple=True)
        sentence_cell_bw = tf.contrib.rnn.LSTMCell(self._hps.hidden_dim, initializer=self.rand_unif_init, state_is_tuple=True)
        (sentence_encoder_outputs, (sent_fw_st, sent_bw_st)) = tf.nn.bidirectional_dynamic_rnn(sentence_cell_fw, sentence_cell_bw, sentence_encoder_inputs, dtype=tf.float32, sequence_length=sentence_lens, swap_memory=True)
        sentence_encoder_outputs = tf.concat(axis=2, values=sentence_encoder_outputs)  # concatenate the forwards and backwards states
    return encoder_outputs, sentence_encoder_outputs, fw_st, bw_st
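One option I am considering for sentence_encoder_inputs (a sketch, not part of the original repo): mean-pool the word-level encoder outputs within each sentence, using an extra mask placeholder built from the sentence boundaries in the batcher. The sentence LSTM then gets positional information implicitly from the order of the sentence vectors:

# Hypothetical: sentence_mask is an extra placeholder of shape
# [batch_size, max_num_sentences, max_enc_steps], 1.0 where word j belongs to
# sentence l and 0.0 elsewhere (built in the batcher from sentence boundaries).
word_counts = tf.reduce_sum(sentence_mask, axis=2, keep_dims=True)        # [B, S, 1]
sent_sums = tf.matmul(sentence_mask, encoder_outputs)                     # [B, S, 2*hidden_dim]
sentence_encoder_inputs = sent_sums / tf.maximum(word_counts, 1.0)        # mean per sentence
sentence_lens = tf.reduce_sum(
    tf.cast(tf.reduce_any(sentence_mask > 0, axis=2), tf.int32), axis=1)  # sentences per example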
I am referring to this sample code, in the snippet below:
embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

# Construct the variables for the NCE loss
nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size],
                                              stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=train_labels,
                   inputs=embed,
                   num_sampled=num_sampled,
                   num_classes=vocabulary_size))

optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
Now the NCE loss function is nothing but a single-hidden-layer neural network with softmax at the output layer [except that it takes only a few negative samples].
This part of the graph will only update the weights of that network; it is not doing anything to the "embeddings" matrix/tensor.
So ideally, once the network is trained, we must pass each word through the embeddings matrix once more and then multiply by the transpose of "nce_weights" [treating it as the tied weights of an auto-encoder, at the input and output layers] to reach the hidden-layer representation of each word, which is what we are calling word2vec (?)
But if you look at the later part of the code, the value of the embeddings matrix is being used directly as the word representation.
Even the TensorFlow doc for NCE loss mentions input (to which we are passing embed, which uses embeddings) as just the first layer's input activation values:
inputs: A Tensor of shape [batch_size, dim]. The forward activations of the input network.
Normal back-propagation stops at the first layer of the network; does this implementation of NCE loss go beyond that and propagate the loss to the input values (and hence to the embeddings)?
This seems like an extra step.
Refer to this for why I am calling it an extra step: he gives the same explanation.
What I have figured out from reading and going through the TensorFlow code is that,
although the entire thing is a single-hidden-layer neural network, an auto-encoder indeed, the weights are not tied, as I had assumed.
The encoder is made of the weight matrix embeddings and the decoder is made of nce_weights. And embed is nothing but the hidden-layer output, given by multiplying the input with embeddings.
So with this, embeddings and nce_weights are both updated in the graph, and we can choose either of the two weight matrices; embeddings is the one preferred here.
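One way to see that embed really is the hidden-layer output: embedding_lookup is just a one-hot matrix multiplication in disguise. A tiny sketch reusing embeddings and vocabulary_size from the snippet above:

ids = tf.constant([3, 7])
lookup = tf.nn.embedding_lookup(embeddings, ids)                   # [2, embedding_size]
matmul = tf.matmul(tf.one_hot(ids, vocabulary_size), embeddings)   # same values as lookup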
Edit1:
Actually, for both tf.nn.nce_loss and tf.nn.sampled_softmax_loss, the weights and biases parameters belong to weights(transpose) * X + bias, which is fed to the objective function, and that can be logistic regression or softmax [refer].
But the back-propagation/gradient descent runs down to the very base of the graph you are building and does not stop at the weights and biases of that function. Hence the inputs parameter of both tf.nn.nce_loss and tf.nn.sampled_softmax_loss is also updated, and it is in turn built from the embeddings matrix.
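A quick way to confirm this in the graph (a sketch, assuming the loss, embeddings and nce_weights variables from the snippet above):

# Neither gradient is None: the NCE loss back-propagates into both matrices,
# so the GradientDescentOptimizer updates embeddings as well as nce_weights.
grads = tf.gradients(loss, [embeddings, nce_weights])
print(grads)                      # two non-None gradients (IndexedSlices)
print(tf.trainable_variables())   # includes embeddings, nce_weights and nce_biases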
I have data that contains various conversations between two people. Each sentence has some type of classification. I am attempting to use an NLP net to classify each sentence of the conversation. I tried a convolutional net and got decent results (not groundbreaking though). I figured that since this is a back-and-forth conversation, an LSTM net may produce better results, because what was previously said may have a large impact on what follows.
If I follow the structure above, I would assume that I am doing many-to-many. My data looks like:
X_train = [[sentence 1],
           [sentence 2],
           [sentence 3]]

Y_train = [[0],
           [1],
           [0]]
Data has been processed using word2vec. I then design my network as follows:
model = Sequential()
model.add(Embedding(len(vocabulary), embedding_dim,
                    input_length=X_train.shape[1]))
model.add(LSTM(88))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(X_train, Y_train, verbose=2, nb_epoch=3, batch_size=15)
I assume that this setup feeds in one batch of sentences at a time. However, if shuffle is not set to False in model.fit, it receives shuffled batches, so why is an LSTM net even useful in this case? From research on the subject, to achieve a many-to-many structure one would need to change the LSTM layer to
model.add(LSTM(88, return_sequences=True))
and the output layer would need to be...
model.add(TimeDistributed(Dense(1,activation='sigmoid')))
When switching to this structure I get an error on the input size. I'm unsure of how to reformat the data to meet this requirement, and also how to edit the embedding layer to receive the new data format.
Any input would be greatly appreciated. Or if you have any suggestions on a better method, I am more than happy to hear them!
Your first attempt was good. The shuffling takes place between sentences: it only shuffles the training samples among themselves so that they don't always come in the same order. The words inside sentences are not shuffled.
Or maybe I didn't understand the question correctly?
EDIT :
After a better understanding of the question, here is my proposal.
Data preparation: you slice your corpus into blocks of n sentences (they can overlap).
You should then have a shape like (number_blocks_of_sentences, n, number_of_words_per_sentence), so basically a list of 2D arrays which contain blocks of n sentences. n shouldn't be too big because an LSTM can't handle a huge number of elements in the sequence when training (vanishing gradient).
Your targets should be an array of shape (number_blocks_of_sentences, n, 1), so also a list of 2D arrays containing the class of each sentence in your block of sentences.
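A sketch of that data preparation, assuming sentences is the whole corpus as a 2D array of word IDs of shape [total_sentences, number_of_words_per_sentence] and labels holds the class of each sentence:

import numpy as np

def make_blocks(sentences, labels, n):
    # Slice the corpus into consecutive, non-overlapping blocks of n sentences.
    num_blocks = len(sentences) // n
    X = np.array([sentences[i * n:(i + 1) * n] for i in range(num_blocks)])
    Y = np.array([labels[i * n:(i + 1) * n] for i in range(num_blocks)])
    # X: (number_blocks_of_sentences, n, number_of_words_per_sentence)
    # Y: (number_blocks_of_sentences, n, 1)
    return X, Y.reshape(num_blocks, n, 1)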
Model :
n_sentences = X_train.shape[1] # number of sentences in a sample (n)
n_words = X_train.shape[2] # number of words in a sentence
model = Sequential()
# Reshape the input because Embedding only accepts shape (batch_size, input_length) so we just transform list of sentences in huge list of words
model.add(Reshape((n_sentences * n_words,),input_shape = (n_sentences, n_words)))
# Embedding layer - output shape will be (batch_size, n_sentences * n_words, embedding_dim) so each sample in the batch is a big 2D array of words embedded
model.add(Embedding(len(vocabulary), embedding_dim, input_length=n_sentences * n_words))
# Recreate the sentence shaped array
model.add(Reshape((n_sentences, n_words, embedding_dim)))
# Encode each sentence - output shape is (batch_size, n_sentences, 88)
model.add(TimeDistributed(LSTM(88)))
# Go over the sentence sequence and output hidden states which carry info about previous sentences - output shape is (batch_size, n_sentences, hidden_dim)
model.add(LSTM(hidden_dim, return_sequences=True))
# Predict output binary class - output shape is (batch_size, n_sentences, 1)
model.add(TimeDistributed(Dense(1,activation='sigmoid')))
...
This should be a good start.
I hope this helps
I want to build a network which takes sentences as input to predict the sentiment. So my input looks something like (num of samples x num of sentences x num of words). I then want to feed this into an embedding layer to learn the word vectors, which can then be summed to get a sentence vector. Is this type of architecture possible in Keras or TensorFlow? From the documentation, Keras's Embedding layer only takes input of shape (nb_samples, sequence_length). Is there any workaround possible?
I guess this class resolves it for Keras:
class AnyShapeEmbedding(Embedding):
    '''
    This Embedding works with inputs of any number of dimensions.
    This can be accomplished by simply changing the output shape computation.
    '''
    ##overrides
    def compute_output_shape(self, input_shape):
        return input_shape + (self.output_dim,)
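A usage sketch (vocab_size, embedding_dim, num_sentences and num_words are illustrative names): embed the (num_samples, num_sentences, num_words) input, then sum the word vectors of each sentence to get sentence vectors that a downstream LSTM can consume:

from keras.models import Sequential
from keras.layers import Lambda, LSTM, Dense
from keras import backend as K

model = Sequential()
# (num_sentences, num_words) -> (num_sentences, num_words, embedding_dim)
model.add(AnyShapeEmbedding(vocab_size, embedding_dim,
                            input_shape=(num_sentences, num_words)))
# Sum word vectors within each sentence -> (num_sentences, embedding_dim)
model.add(Lambda(lambda x: K.sum(x, axis=2),
                 output_shape=(num_sentences, embedding_dim)))
# Read the sequence of sentence vectors and predict sentiment
model.add(LSTM(64))
model.add(Dense(1, activation='sigmoid'))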