I have a precipitation map time series dataset with input shape (None, seq_length=7, c=75, w=112, h=112) and output shape (None, lead_times=60, c=51, w=28, h=28). The model (Conv downsampler + ConvGRU + Axial Attention) predicts precipitation in a 28x28 region in the middle with 51 categorical precipitation intervals and is conditioned on 60 different lead times (5, 10, ..., 300 minutes).
Right now my forward pass looks like this:
def forward(self, imgs):
    """It takes a rank 5 tensor
    - imgs [bs, seq_len, channels, h, w]
    """
    # Compute all timesteps, probably can be parallelized
    res = []
    for i in range(self.forecast_steps):
        x_i = self.encode_timestep(imgs, i)
        out = self.head(x_i)
        res.append(out)
    res = torch.stack(res, dim=1)
    return res
Here imgs is the input tensor without the lead time encoding, so only 15 channels. imgs is then one-hot encoded with each respective lead time, and the output is the entire predicted time series (5-300 min). However, this leads to severe memory issues even with batch_size = 1, so I want the forward loop to only do one random lead time at a time. I am training this with a pytorch-lightning module for easier parallelization, so I don't have much control over the training loop.
The issue is that the effective batch size with this training loop is 60*batch_size. The paper solves this by only doing one random lead time per sample, which now makes sense to me. This solves the memory issue by allowing an effective minimum batch size of 1. How can I pass a random integer (the lead time) to the forward pass and couple it with the correct Y when pytorch-lightning computes the loss?
I want
y_hat = forward(self, X[n], lead_time=random)
...
loss(y_hat, Y[n, lead_time, :, :])
My code is available at https://github.com/ValterFallenius/metnet.
I figured out how to fix it. Once I explained the problem to someone else, I realized how simple the solution was...
def forward(self, imgs, lead_time):
    """It takes a rank 5 tensor and a lead time index
    - imgs [bs, seq_len, channels, h, w]
    - lead_time: random int in [0, self.forecast_steps)
    """
    x_i = self.encode_timestep(imgs, lead_time)
    out = self.head(x_i)
    return out
The trick was to simply add the lead_time variable to the training_step method:
def training_step(self, batch, batch_idx):
    x, y = batch
    lead_time = np.random.randint(0, self.forecast_steps)
    y_hat = self(x.float(), lead_time)
    loss = F.mse_loss(y_hat, y[:, lead_time])
    pbar = {"training_loss": loss}
    return {"loss": loss, "progress_bar": pbar}
Related
I am trying to use a shared LSTM layer with state in a Keras model, but it seems that the internal state is modified by each parallel use. This raises two questions:
When training a model with a shared LSTM layer and using stateful=True, are the parallel uses updating the same state also during training?
If my observation is valid, is there a way to use weight-sharing LSTMs such that the state is stored independently for each of the parallel uses?
The code below exemplifies the problem with three sequences sharing the LSTM. The prediction of a full input is compared with the result from splitting the prediction input into two halves and feeding them into the network consecutively.
What can be observed is that a1 is the same as the first half of aFull, meaning that the uses of the LSTM really are in parallel with independent states during the first prediction. I.e., z1 is not affected by the parallel calls creating z2 and z3. But a2 is different from the second half of aFull, so there is some interaction between the states of the parallel uses.
What I was hoping is that the concatenation of the two pieces a1 and a2 would be the same as the result of calling the prediction with the longer input sequence, but this doesn't seem to be the case. A further concern is that if this kind of interaction takes place during prediction, does it also happen during training?
import keras
import keras.backend as K
import numpy as np
nOut = 3
xShape = (3, 50, 4)
inShape = (xShape[0], None, xShape[2])
batchInShape = (1, ) + inShape
x = np.random.randn(*xShape)
# construct network
xIn = keras.layers.Input(shape=inShape, batch_shape=batchInShape)
# shared LSTM layer
sharedLSTM = keras.layers.LSTM(units=nOut, stateful=True, return_sequences=True, return_state=False)
# split the input on the first axis
x1 = keras.layers.Lambda(lambda x: x[:,0,:,:])(xIn)
x2 = keras.layers.Lambda(lambda x: x[:,1,:,:])(xIn)
x3 = keras.layers.Lambda(lambda x: x[:,2,:,:])(xIn)
# pass each input through the LSTM
z1 = sharedLSTM(x1)
z2 = sharedLSTM(x2)
z3 = sharedLSTM(x3)
# add a singleton dimension
y1 = keras.layers.Lambda(lambda x: K.expand_dims(x, axis=1))(z1)
y2 = keras.layers.Lambda(lambda x: K.expand_dims(x, axis=1))(z2)
y3 = keras.layers.Lambda(lambda x: K.expand_dims(x, axis=1))(z3)
# combine the outputs
y = keras.layers.Concatenate(axis=1)([y1, y2, y3])
model = keras.models.Model(inputs=xIn, outputs=y)
model.compile(loss='mse', optimizer='adam')
model.summary()
# no need to train, since we're interested only what is happening mechanically
# reset to a known state and predict for full input
model.reset_states()
aFull = model.predict(x[np.newaxis,:,:,:])
# reset to a known state and predict for the same input, but in two pieces
model.reset_states()
a1 = model.predict(x[np.newaxis,:,:xShape[1]//2,:])
a2 = model.predict(x[np.newaxis,:,xShape[1]//2:,:])
# combine the pieces
aSplit = np.concatenate((a1, a2), axis=2)
print('full diff: {}, first half diff: {}, second half diff: {}'.format(
    np.sum(np.abs(aFull - aSplit)),
    np.sum(np.abs(aFull[:, :, :xShape[1]//2, :] - aSplit[:, :, :xShape[1]//2, :])),
    np.sum(np.abs(aFull[:, :, xShape[1]//2:, :] - aSplit[:, :, xShape[1]//2:, :]))))
Update: The behaviour described above was observed with Keras using Tensorflow 1.14 and 1.15 as the backend. Running the same code with tf2.0 (with the adjusted imports) changes the result so that a1 is no longer the same as the first half of aFull. This can be still accomplished by setting stateful=False in the layer instantiation.
This would suggest to me that the way I'm trying to use the recursive layer with shared parameters, but own states for parallel uses, is not really possible like this.
Update 2: It seems that others have run into the same missing functionality before: a closed, unanswered question at Keras' github.
For a comparison, here is a scribbling in pytorch (the first time I've tried to use it) implementing a simple network with N parallel LSTMs sharing the weights, but having independent states. In this case the states are stored explicitly in a list and provided to the LSTM cell manually.
import torch
import numpy as np

class sharedLSTM(torch.nn.Module):
    def __init__(self, batchSz, nBands, nDims, outDim):
        super(sharedLSTM, self).__init__()
        self.internalLSTM = torch.nn.LSTM(input_size=nDims, hidden_size=outDim, num_layers=1, bias=True, batch_first=True)
        allStates = list()
        for bandIdx in range(nBands):
            h_0 = torch.zeros(1, batchSz, outDim)
            c_0 = torch.zeros(1, batchSz, outDim)
            allStates.append((h_0, c_0))
        self.allStates = allStates
        self.nBands = nBands

    def forward(self, x):
        allOut = list()
        for dimIdx in range(self.nBands):
            thisSlice = x[:, dimIdx, :, :]  # (batchSz, nSteps, nFeats)
            thisState = self.allStates[dimIdx]
            thisY, thisState = self.internalLSTM(thisSlice, thisState)
            self.allStates[dimIdx] = thisState
            allOut.append(thisY[:, None, :, :])  # => (batchSz, 1, nSteps, outDim)
        y = torch.cat(allOut, dim=1)  # => (batchSz, nBands, nSteps, outDim)
        return y

    def resetStates(self):
        for bandIdx in range(self.nBands):
            self.allStates[bandIdx][0][:] = 0.0
            self.allStates[bandIdx][1][:] = 0.0
batchSz = 5
nBands = 3
nFeats = 4
nOutDims = 2
net = sharedLSTM(batchSz, nBands, nFeats, nOutDims)
net = net.float()
print(net)
N = 20
x = torch.from_numpy(np.random.rand(batchSz, nBands, N, nFeats)).float()
x1 = x[:, :, :N//2, :]
x2 = x[:, :, N//2:, :]
aa = net.forward(x)
net.resetStates()
a1 = net.forward(x1)
a2 = net.forward(x2)
print('(with reset) first half abs diff: {}'.format(str(torch.sum(torch.abs(a1 - aa[:,:,:N//2,:])).detach().numpy())))
print('(with reset) second half abs diff: {}'.format(str(torch.sum(torch.abs(a2 - aa[:,:,N//2:,:])).detach().numpy())))
Result: the output is the same regardless if we do the prediction in one go or in pieces.
I've tried to replicate this in Keras using sub-classing, but without success:
import keras
import numpy as np
class sharedLSTM(keras.Model):
    def __init__(self, batchSz, nBands, nDims, outDim):
        super(sharedLSTM, self).__init__()
        self.internalLSTM = keras.layers.LSTM(units=outDim, stateful=True, return_sequences=True, return_state=True)
        self.internalLSTM.build((batchSz, None, nDims))
        self.internalLSTM.reset_states()
        allStates = list()
        allSlicers = list()
        for bandIdx in range(nBands):
            allStates.append(None)
            allSlicers.append(keras.layers.Lambda(lambda x, b: x[:, :, b, :], arguments={'b': bandIdx}))
        self.allStates = allStates
        self.allSlicers = allSlicers
        self.Concat = keras.layers.Lambda(lambda x: keras.backend.concatenate(x, axis=2))
        self.nBands = nBands

    def call(self, x):
        allOut = list()
        for bandIdx in range(self.nBands):
            thisSlice = self.allSlicers[bandIdx](x)
            thisState = self.allStates[bandIdx]
            thisY, *thisState = self.internalLSTM(thisSlice, initial_state=thisState)
            self.allStates[bandIdx] = thisState.copy()
            allOut.append(thisY[:, :, None, :])
        y = self.Concat(allOut)
        return y
batchSz = 1
nBands = 3
nFeats = 4
nOutDims = 2
N = 20
model = sharedLSTM(batchSz, nBands, nFeats, nOutDims)
model.compile(optimizer='SGD', loss='mae')
x = np.random.rand(batchSz, N, nBands, nFeats)
x1 = x[:, :N//2, :, :]
x2 = x[:, N//2:, :, :]
aa = model.predict(x)
model.reset_states()
a1 = model.predict(x1)
a2 = model.predict(x2)
print('(with reset) first half abs diff: {}'.format(str(np.sum(np.abs(a1 - aa[:,:N//2,:,:])))))
print('(with reset) second half abs diff: {}'.format(str(np.sum(np.abs(a2 - aa[:,N//2:,:,:])))))
If you now ask "why don't you then use torch and shut up?", the answer is that the surrounding experimental framework has been built assuming Keras and changing it would be a non-negligible amount of work.
Based on my current understanding of the behaviour of LSTMs (and other RNNs) in Keras, using a shared LSTM layer in stateful=True mode does not work as one would expect: there is only one state variable, and it gets updated through all the parallel uses. So the answers to the questions appear to be:
Yes, they are. The processing runs over one of the parallel sequences, stores the state at the end, and uses this as the initial state for the next parallel sequence, and so forth.
Yes, but it requires some work. See below for the details.
I've managed to accomplish handling the states in two ways. First is deriving sub-classes from Keras' LSTM and LSTMCell, and overloading LSTMCell.call() to handle the parallel data streams by splitting the input, and storing and recovering the state of each parallel stream. A drawback here is that the input shape to an RNN is fixed to be 3D, which means that the parallel inputs need to be reshaped into the feature dimension along with the real features.
The second approach is to create a wrapper Layer not completely dissimilar to the sharedLSTM-Model in the question, containing slicing of the input to parallel streams, calling the internal LSTM with the correct state for each stream, and storing the returned states. The state storage update in the list works through add_update() call inserted into the end of call(). This add_update() does not (seem to) work with Model, hence Layer. However, when run with Keras <2.3, the weights of the nested layers are not tracked or updated, so Keras 2.3+ or TF2 is needed.
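To illustrate the second approach, here is a rough sketch of such a wrapper layer. This is my own reconstruction, not the exact code: it is written against tf.keras 2.x, assumes a fixed batch size known at build time, and stores the per-stream states as non-trainable weights that are written back via add_update() at the end of call().

import tensorflow as tf
from tensorflow import keras
import tensorflow.keras.backend as K

class ParallelStatefulLSTM(keras.layers.Layer):
    """Shared-weight LSTM applied to several parallel streams, each with its own state."""
    def __init__(self, units, n_streams, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.n_streams = n_streams
        self.lstm = keras.layers.LSTM(units, return_sequences=True, return_state=True)

    def build(self, input_shape):
        batch = input_shape[0]  # assumes a fixed, known batch size
        self.h = [self.add_weight(name=f'h_{i}', shape=(batch, self.units),
                                  initializer='zeros', trainable=False)
                  for i in range(self.n_streams)]
        self.c = [self.add_weight(name=f'c_{i}', shape=(batch, self.units),
                                  initializer='zeros', trainable=False)
                  for i in range(self.n_streams)]
        super().build(input_shape)

    def call(self, x):
        # x: (batch, time, n_streams, features)
        outs, updates = [], []
        for i in range(self.n_streams):
            y, h, c = self.lstm(x[:, :, i, :], initial_state=[self.h[i], self.c[i]])
            updates += [K.update(self.h[i], h), K.update(self.c[i], c)]
            outs.append(y[:, :, None, :])
        self.add_update(updates)  # persist the per-stream states, as described above
        return K.concatenate(outs, axis=2)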
I'm trying to implement a recurrent neural network with numpy.
My current input and output designs are as follows:
x is of shape: (sequence length, batch size, input dimension)
h : (number of layers, number of directions, batch size, hidden size)
initial weight: (number of directions, 2 * hidden size, input size + hidden size)
weight: (number of layers -1, number of directions, hidden size, directions*hidden size + hidden size)
bias: (number of layers, number of directions, hidden size)
I have looked at the PyTorch RNN API as a reference (https://pytorch.org/docs/stable/nn.html?highlight=rnn#torch.nn.RNN), but have slightly changed it to include the initial weight as input. (Output shapes are supposedly the same as in PyTorch.)
While it is running, I cannot determine whether it is behaving correctly, as I am feeding it randomly generated numbers as input.
In particular, I am not so certain whether my input shapes are designed correctly.
Could any expert give me some guidance?
def rnn(xs, h, w0, w=None, b=None, num_layers=2, nonlinearity='tanh', dropout=0.0, bidirectional=False, training=True):
    num_directions = 2 if bidirectional else 1
    batch_size = xs.shape[1]
    input_size = xs.shape[2]
    hidden_size = h.shape[3]
    hn = []
    y = [None]*len(xs)
    for l in range(num_layers):
        for d in range(num_directions):
            if l==0 and d==0:
                wi = w0[d, :hidden_size, :input_size].T
                wh = w0[d, hidden_size:, input_size:].T
                wi = np.reshape(wi, (1,)+wi.shape)
                wh = np.reshape(wh, (1,)+wh.shape)
            else:
                wi = w[max(l-1,0), d, :, :hidden_size].T
                wh = w[max(l-1,0), d, :, hidden_size:].T
            for i,x in enumerate(xs):
                if l==0 and d==0:
                    ht = np.tanh(np.dot(x, wi) + np.dot(h[l, d], wh) + b[l, d][np.newaxis])
                    ht = np.reshape(ht, (batch_size, hidden_size))  # otherwise, shape is (bs,1,hs)
                else:
                    ht = np.tanh(np.dot(y[i], wi) + np.dot(h[l, d], wh) + b[l, d][np.newaxis])
                y[i] = ht
            hn.append(ht)
    y = np.asarray(y)
    y = np.reshape(y, y.shape+(1,))
    return np.asarray(y), np.asarray(hn)
Regarding the shape, it probably makes sense if that's how PyTorch does it, but the Tensorflow way is a bit more intuitive: (batch_size, seq_length, input_size), i.e. batch_size sequences of length seq_length where each element has size input_size. Both approaches can work, so I guess it's a matter of preference.
To see whether your rnn is behaving appropriately, I'd just print the hidden state at each time step, run it on some small random data (e.g. 5 vectors, 3 elements each) and compare the results with your manual calculations.
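For instance, a minimal sanity check along those lines could look like this (a sketch with made-up dimensions, assuming the vanilla update h_t = tanh(W_ih x_t + W_hh h_{t-1} + b)):

import numpy as np

np.random.seed(0)
input_size, hidden_size, n_steps = 3, 2, 5
x = np.random.randn(n_steps, input_size)        # 5 vectors, 3 elements each
W_ih = np.random.randn(hidden_size, input_size)
W_hh = np.random.randn(hidden_size, hidden_size)
b = np.random.randn(hidden_size)

h = np.zeros(hidden_size)
for t in range(n_steps):
    h = np.tanh(W_ih @ x[t] + W_hh @ h + b)     # vanilla RNN update
    print(f"step {t}: h = {h}")                 # compare against your implementation's hidden states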
Looking at your code, I'm unsure if it does what it's supposed to, but instead of doing this on your own based on an existing API, I'd recommend you read and try to replicate this awesome tutorial from wildml (in part 2 there's a pure numpy implementation).
I am trying to implement an RNN in Tensorflow (0.11), based on this paper.
They have a Theano implementation here that I am comparing my implementation against. When I try to run their Theano implementation, it finishes 10 epochs in about 1 hour. My Tensorflow implementation needs about 17 hours just to finish 1 epoch. I am wondering if anyone could look at my code and tell me if there are some obvious problems that are slowing it down.
The purpose of the RNN is to predict the next item a user is going to click on, given his previous clicks. The items are represented by unique IDs that are given as input to the RNN as a 1-HOT vector.
So the RNN is built like this:
[INPUT (1-HOT representation, size 37803)] -> [GRU layer (size 100)] -> [FeedForward layer]
and the output from the FF layer is a vector with the same size as the input vector, where high values indicate that the item corresponding to that index is very likely to be the next one clicked.
num_hidden = 100
x = tf.placeholder(tf.float32, [None, max_length, n_items], name="InputX")
y = tf.placeholder(tf.float32, [None, max_length, n_items], name="TargetY")
session_length = tf.placeholder(tf.int32, [None], name="SeqLenOfInput")
output, state = rnn.dynamic_rnn(
    rnn_cell.GRUCell(num_hidden),
    x,
    dtype=tf.float32,
    sequence_length=session_length
)
layer = {'weights': tf.Variable(tf.random_normal([num_hidden, n_items])),
         'biases': tf.Variable(tf.random_normal([n_items]))}
output = tf.reshape(output, [-1, num_hidden])
prediction = tf.matmul(output, layer['weights'])
y_flat = tf.reshape(y, [-1, n_items])
final_output = tf.nn.softmax_cross_entropy_with_logits(prediction,y_flat)
cost = tf.reduce_sum(final_output)
optimizer = tf.train.AdamOptimizer().minimize(cost)
Both implementations are tested on the same hardware. Both implementations utilize the GPU.
EDIT:
The Theano model has the same structure. (1-HOT input -> GRU layer with 100 units -> FeedForward)
I tested the Theano version with the same parameters as I used in my model (using cross entropy for the loss, batch size=200, adam optimizer, with the same learning rate, no dropout in either model) but the speed difference is still the same.
EDIT (2016-12-07):
Using file queues to queue batches instead of using feed_dict helped a lot.
I still need to do other optimizations to make it faster. Anyway, here is how I used file queues to make it faster.
# Create filename_queue
filename_queue = tf.train.string_input_producer(train_files, shuffle=True)

min_after_dequeue = 1024
capacity = min_after_dequeue + 3*batch_size
examples_queue = tf.RandomShuffleQueue(
    capacity=capacity,
    min_after_dequeue=min_after_dequeue,
    dtypes=[tf.string])

# Create multiple readers to populate the queue of examples
enqueue_ops = []
for i in range(n_readers):
    reader = tf.TextLineReader()
    _key, value = reader.read(filename_queue)
    enqueue_ops.append(examples_queue.enqueue([value]))
tf.train.queue_runner.add_queue_runner(
    tf.train.queue_runner.QueueRunner(examples_queue, enqueue_ops))
example_string = examples_queue.dequeue()

# Default values, and type of the columns, first is sequence_length
# +1 since first field is sequence length
record_defaults = [[0]]*(max_sequence_length+1)
enqueue_examples = []
for thread_id in range(n_preprocess_threads):
    # decode a row dequeued from the shuffle queue
    example = tf.decode_csv(example_string, record_defaults=record_defaults)
    # Split the row into input/target values
    sequence_length = example[0]
    features = example[1:-1]
    targets = example[2:]
    enqueue_examples.append([sequence_length, features, targets])

# Batch together examples
session_length, x_unparsed, y_unparsed = tf.train.batch_join(
    enqueue_examples,
    batch_size=batch_size,
    capacity=2*n_preprocess_threads*batch_size)

# Parse the examples in a batch
x = tf.one_hot(x_unparsed, depth=n_classes)
y = tf.one_hot(y_unparsed, depth=n_classes)

# From here on, x, y and session_length can be used in the model
I have a batch of sequences (say 50) of input elements having variable lengths. I am feeding all elements from all batches at time T as input and comparing network output with all elements from all batches at time T+1.
Whenever some sequences from the batch are finished, I replace them with next sequences and reset corresponding hidden state.
Following is the pseudo-code to represent what I am doing roughly.
# elements at time T
x = placeholder(shape=[batch_size, n_features])
# elements at time T+1
y = placeholder(shape=[batch_size, n_features])
I have defined RNN unit as follows:
def RNN_unit(previous_hidden_state, x):
    r = tf.sigmoid(tf.matmul(x, Wr) + br)
    z = tf.sigmoid(tf.matmul(x, Wz) + bz)
    h_ = tf.tanh(tf.matmul(x, Wx) + tf.matmul(previous_hidden_state, Wh) * r)
    current_hidden_state = tf.mul((1 - z), h_) + tf.mul(previous_hidden_state, z)
    return current_hidden_state
And using it as:
hidden_state = RNN_unit(hidden_state, x)
output = activation_function(hidden_state)
cost = loss_function(output, y)
train_step = some_optimizer.minimize(cost)
Now when I feed the network in the following fashion:
while data_is_available:
    x_batch = elements at time T
    y_batch = elements at time T+1
    sess.run(train_step, feed_dict={x: x_batch, y: y_batch})
    # Now I want to reset those hidden states in the batch
    # for which the sequence is finished at time step T+1
    # CODE TO RESET THOSE ROWS OF HIDDEN STATE HERE...
When I try something like the following:
for rowindex in rows_to_be_reset:
    hidden_state_matrix = sess.run(hidden_state)
    hidden_state_matrix[rowindex, :] = 0
    hidden_state = tf.constant(hidden_state_matrix)
But as hidden_state depends on the x placeholder, I am unable to fetch its value matrix without feeding actual x_batch instances.
I know I can do it in theano but not sure how to do the same in tensorflow.
How should I define a node of hidden_state in such case?
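One direction that seems plausible (my own sketch, not from the original thread) is to keep the hidden state in a tf.Variable and reset individual rows with tf.scatter_update, so its value never has to be fetched and re-wrapped as a constant. It reuses the RNN_unit and placeholders defined above; n_hidden is a made-up size:

import tensorflow as tf

n_hidden = 64  # hypothetical hidden size

# keep the hidden state in a variable so it persists between sess.run calls
hidden_state_var = tf.Variable(tf.zeros([batch_size, n_hidden]), trainable=False)

# graph nodes that compute the next state and write it back
new_state = RNN_unit(hidden_state_var, x)
update_state = tf.assign(hidden_state_var, new_state)

# op that zeroes only the selected rows (sequences that just finished)
rows_to_reset = tf.placeholder(tf.int32, [None])
reset_rows = tf.scatter_update(hidden_state_var, rows_to_reset,
                               tf.zeros([tf.shape(rows_to_reset)[0], n_hidden]))

# hypothetical usage inside the loop:
# sess.run([train_step, update_state], feed_dict={x: x_batch, y: y_batch})
# sess.run(reset_rows, feed_dict={rows_to_reset: finished_row_indices})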
I am trying to use the tensorflow LSTM model to make next word predictions.
As described in this related question (which has no accepted answer), the example contains pseudocode to extract next word probabilities:
lstm = rnn_cell.BasicLSTMCell(lstm_size)
# Initial state of the LSTM memory.
state = tf.zeros([batch_size, lstm.state_size])
loss = 0.0
for current_batch_of_words in words_in_dataset:
    # The value of state is updated after processing each batch of words.
    output, state = lstm(current_batch_of_words, state)
    # The LSTM output can be used to make next word predictions
    logits = tf.matmul(output, softmax_w) + softmax_b
    probabilities = tf.nn.softmax(logits)
    loss += loss_function(probabilities, target_words)
I am confused about how to interpret the probabilities vector. I modified the __init__ function of the PTBModel in ptb_word_lm.py to store the probabilities and logits:
class PTBModel(object):
    """The PTB model."""

    def __init__(self, is_training, config):
        # General definition of LSTM (unrolled)
        # identical to tensorflow example ...
        # omitted for brevity ...

        # computing the logits (also from example code)
        logits = tf.nn.xw_plus_b(output,
                                 tf.get_variable("softmax_w", [size, vocab_size]),
                                 tf.get_variable("softmax_b", [vocab_size]))
        loss = seq2seq.sequence_loss_by_example([logits],
                                                [tf.reshape(self._targets, [-1])],
                                                [tf.ones([batch_size * num_steps])],
                                                vocab_size)
        self._cost = cost = tf.reduce_sum(loss) / batch_size
        self._final_state = states[-1]

        # my addition: storing the probabilities and logits
        self.probabilities = tf.nn.softmax(logits)
        self.logits = logits

        # more model definition ...
Then printed some info about them in the run_epoch function:
def run_epoch(session, m, data, eval_op, verbose=True):
    """Runs the model on the given data."""
    # first part of function unchanged from example

    for step, (x, y) in enumerate(reader.ptb_iterator(data, m.batch_size,
                                                      m.num_steps)):
        # evaluate probability and logit tensors too:
        cost, state, probs, logits, _ = session.run([m.cost, m.final_state, m.probabilities, m.logits, eval_op],
                                                    {m.input_data: x,
                                                     m.targets: y,
                                                     m.initial_state: state})
        costs += cost
        iters += m.num_steps

        if verbose and step % (epoch_size // 10) == 10:
            print("%.3f perplexity: %.3f speed: %.0f wps, n_iters: %s" %
                  (step * 1.0 / epoch_size, np.exp(costs / iters),
                   iters * m.batch_size / (time.time() - start_time), iters))
            chosen_word = np.argmax(probs, 1)
            print("Probabilities shape: %s, Logits shape: %s" %
                  (probs.shape, logits.shape))
            print(chosen_word)
            print("Batch size: %s, Num steps: %s" % (m.batch_size, m.num_steps))

    return np.exp(costs / iters)
This produces output like this:
0.000 perplexity: 741.577 speed: 230 wps, n_iters: 220
(20, 10000) (20, 10000)
[ 14 1 6 589 1 5 0 87 6 5 3 5 2 2 2 2 6 2 6 1]
Batch size: 1, Num steps: 20
I was expecting the probs vector to be an array of probabilities, with one for each word in the vocabulary (e.g. with shape (1, vocab_size)), meaning that I could get the predicted word using np.argmax(probs, 1) as suggested in the other question.
However, the first dimension of the vector is actually equal to the number of steps in the unrolled LSTM (20 if the small config settings are used), which I'm not sure what to do with. To access the predicted word, do I just need to use the last value (because it's the output of the final step)? Or is there something else that I'm missing?
I tried to understand how the predictions are made and evaluated by looking at the implementation of seq2seq.sequence_loss_by_example, which must perform this evaluation, but this ends up calling gen_nn_ops._sparse_softmax_cross_entropy_with_logits, which doesn't seem to be included in the github repo, so I'm not sure where else to look.
I'm quite new to both tensorflow and LSTMs, so any help is appreciated!
The output tensor contains the concatenation of the LSTM cell outputs for each timestep (see its definition here). Therefore you can find the prediction for the next word by taking chosen_word[-1] (or chosen_word[sequence_length - 1] if the sequence has been padded to match the unrolled LSTM).
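In other words, with the shapes from the question (probs of shape (num_steps, vocab_size)):

chosen_word = np.argmax(probs, axis=1)  # best word id at every unrolled step
next_word_id = chosen_word[-1]          # the prediction that follows the last input word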
The tf.nn.sparse_softmax_cross_entropy_with_logits() op is documented in the public API under a different name. For technical reasons, it calls a generated wrapper function that does not appear in the GitHub repository. The implementation of the op is in C++, here.
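For reference, the public entry point can be called directly like this (the labels argument takes integer word ids, not one-hot vectors; targets here is a stand-in for whatever integer-id tensor your model uses):

loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=targets)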
I am implementing a seq2seq model too, so let me try to explain based on my understanding:
The outputs of your LSTM model form a list (of length num_steps) of 2D tensors of size [batch_size, size].
The code line:
output = tf.reshape(tf.concat(1, outputs), [-1, size])
will produce a new output which is a 2D tensor of size [batch_size x num_steps, size].
For your case, batch_size = 1 and num_steps = 20 --> output shape is [20, size].
Code line:
logits = tf.nn.xw_plus_b(output, tf.get_variable("softmax_w", [size, vocab_size]), tf.get_variable("softmax_b", [vocab_size]))
<=> output[batch_size x num_steps, size] x softmax_w[size, vocab_size] will output logits of size [batch_size x num_steps, vocab_size].
For your case, logits has size [20, vocab_size]
--> the probs tensor has the same size as logits, i.e. [20, vocab_size].
Code line:
chosen_word = np.argmax(probs, 1)
will output a chosen_word array of size [20], where each value is the predicted index of the word that follows the current word.
Code line:
loss = seq2seq.sequence_loss_by_example([logits], [tf.reshape(self._targets, [-1])], [tf.ones([batch_size * num_steps])])
computes the softmax cross-entropy loss over the batch of sequences.
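Putting it together (a short illustration under the shapes above), you can reshape probs back into per-sequence, per-step predictions and take the last step as the next-word prediction:

# probs: (batch_size * num_steps, vocab_size)
probs_per_step = probs.reshape(batch_size, num_steps, -1)
predicted_ids = np.argmax(probs_per_step, axis=2)  # (batch_size, num_steps)
next_word_id = predicted_ids[:, -1]                # word predicted after the last input word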