I am writing a custom loss function for semi supervised learning on cifar-10 dataset, for which I need to duplicate columns of my tensor for creating a sort of mask which I then multiply with the activation values to later sum over.
My loss function is a sum of entropy and cross entropy for unlabelled and labeled samples. I add an extra class and set it to 1 for unlabelled samples.
I then create a mask for identifying row indices of unlabelled samples from the y_true tensor. From that I should get a (n_samples, 1) tensor which I need to repeat/duplicate/copy to a (n_samples, 11) tensor that I can multiply with the activation values in y_pred
Loss function code:
a = np.ones((mini_batch_size, 1)) * 10
a_var = K.variable(value=a)
v = K.cast(K.equal(K.cast(K.argmax(y_true, axis=1), 'float32'), a_var), 'float32')
e_loss = K.sum(K.concatenate([v,v,v,v,v,v,v,v,v,v,v], axis=-1) * K.log(y_pred) * y_pred)
m_u = K.sum(K.cast(K.equal(K.cast(K.argmax(y_true, axis=1), 'float32'), a_var), 'float32'))
b = np.ones((mini_batch_size, 1)) * 10
b_var = K.variable(value=b)
v2 = K.cast(K.not_equal(K.cast(K.argmax(y_true, axis=1), 'float32'), b_var), 'float32')
ce_loss = K.sum(K.concatenate([v2, v2, v2, v2, v2, v2, v2, v2, v2, v2, v2], axis=1) * K.log(y_pred))
m_l = K.variable(value=float(mini_batch_size), dtype='float32') #- m_u
return -((e_loss/m_u) + (ce_loss/m_l))
The error I get is:
InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Incompatible shapes: [40,11] vs. [40,440]
[[{{node loss_36/dense_74_loss/mul_2}}]]
[[metrics_28/acc/Mean/_2627]]
(1) Invalid argument: Incompatible shapes: [40,11] vs. [40,440]
[[{{node loss_36/dense_74_loss/mul_2}}]]
0 successful operations.
0 derived errors ignored.
My batch size is 40.
I need my concatenated tensor to be of size [40, 11] not [40, 440]
I don't have real data to test whether the loss properly works, but this got rid of that InvalidArgumentError and did work with model.fit() for a dense model.
Few changes I did,
You don't have to repeat your v 11 times to multiply that with y_pred. All you need is reshape it to (-1,1) - (Will save you memory)
Got rid of all the K.variables. Now this is something I want to check with you, you are not trying to optimize a_var and b_var right (i.e. that's not a part of the model)? (Apparently, that's what's causing the issue. I need to dive deeper to see why). It seems the whole point of a_var and b_var is to perform boolean logics equal and not_equal, which works just fine with the constant.
Made m_l a K.constant
def loss_fn(y_true, y_pred):
v = K.cast(K.equal(K.cast(K.argmax(y_true, axis=-1), 'float32'), 10), 'float32')
e_loss = K.sum(K.reshape(v, (-1,1)) * K.log(y_pred) * y_pred)
m_u = K.sum(K.cast(K.equal(K.cast(K.argmax(y_true, axis=-1), 'float32'), 10), 'float32'))
v2 = K.cast(K.not_equal(K.cast(K.argmax(y_true, axis=-1), 'float32'), 10), 'float32')
ce_loss = K.sum(K.reshape(v2, (-1,1)) * K.log(y_pred))
m_l = K.constant(value=float(mini_batch_size), dtype='float32') #- m_u
return -((e_loss/m_u) + (ce_loss/m_l))
Note: Depending on the batch size within the loss function is a bad idea. Try to get rid of any batch_size dependent operations (especially for shape of tensors). You can see that I only have kept mini_batch_size to set m_l. But I would suggest setting this to some constant instead of min_batch_size. Because, if a batch with <40 comes through, you are using a different loss function for that batch. And your results aren't comparable between different batch sizes, as your loss function changes.
Related
I've encountered a problem while writing Siamese net. Definition of the net takes as an input 2 vectors which represents 2 pieces of text. The vectors length is padded and different with respect to batches (in batch 1: vectors length = 32, in batch 2: vectors length = 64 and so on).
# model definition
def create_model(vocab_size=512, d_model=128):
def normalize(x):
norm = tf.norm(x, axis=-1, keepdims=True)
return tf.divide(x, norm)
component = tf.keras.Sequential([
tf.keras.layers.Embedding(vocab_size, d_model),
tf.keras.layers.LSTM(d_model),
tf.keras.layers.Lambda(lambda x: tf.reduce_mean(x, axis=1)),
tf.keras.layers.Lambda(normalize),
])
# due to the variability in text, input shape differs with respect to batch
inputs = [tf.keras.Input(shape=(None,)) for _ in range(2)]
outputs = tf.tuple([component(ins) for ins in inputs])
return tf.keras.Model(inputs=inputs, outputs=outputs)
# loss function
class MyLoss(tf.keras.losses.Loss):
def __init__(self):
super().__init__(name='TripletLoss')
def call(self, y_true, y_pred):
# >>> HERE IS THE PROBLEM, y_pred has different shape then I'd expect,
# its shape is (batch_size,) instead of (2, batch_size)
l, r = y_pred
# compute and return loss
return loss
When calling Model#fit(loss=MyLoss(), ...) the parameter passed to the MyLoss#call is a projection of the first coordinate of the model prediction, i.e. model.predict(z) returns [x, y] where x, y are vectors with length equal to the batch size. I'd expected that y_pred passed as a parameter to Loss#call would have had that exact value, that is [x,y], but it equals to the first vector of the given list, that is x. Furthermore I've looked up at the call stack and I've spotted that before y_pred is passed to the MyLoss#call it has expected value ([x,y]) which changes to the x in the keras' Loss.__call__ body.
I tried to reshape input, but other problems arised.
I saw this question: Implementing custom loss function in keras with condition And I need to do the same thing but with code that seems to need loops.
I have a custom numpy function which calculates the mean Euclid distance from the mean vector. I wrote this based on the paper https://arxiv.org/pdf/1801.05365.pdf:
import numpy as np
def mean_euclid_distance_from_mean_vector(n_vectors):
dists = []
for (i, v) in enumerate(n_vectors):
n_vectors_rest = n_vectors[np.arange(len(n_vectors)) != i]
print("rest of vectors: ")
print(n_vectors_rest)
# calculate mean vector
mean_rest = n_vectors_rest.mean(axis=0)
print("mean rest vector")
print(mean_rest)
dist = v - mean_rest
print("dist vector")
print(dist)
dists.append(dist)
# dists is now a matrix of distance vectors (distance from the mean vector)
dists = np.array(dists)
print("distance vector matrix")
print(dists)
# here we matmult each vector
# sum them up
# and divide by the total number of elements
result = np.sum([np.matmul(d, d) for d in dists]) / dists.size
return result
features = np.array([
[1,2,3,4],
[4,3,2,1]
])
c = mean_euclid_distance_from_mean_vector(features)
print(c)
I need this function however to work inside tensorflow with Keras. So a custom lambda https://www.tensorflow.org/api_docs/python/tf/keras/layers/Lambda
However, I'm not sure how to implement the above in Keras/Tensorflow since it has loops, and the way the paper talked about calculating the m_i seems to require loops like the way I implemented the above.
For reference, the PyTorch version of this code is here: https://github.com/PramuPerera/DeepOneClass
Given a feature map like:
features = np.array([
[1, 2, 3, 4],
[2, 4, 4, 3],
[3, 2, 1, 4],
], dtype=np.float64)
reflecting a batch_size of
batch_size = features.shape[0]
and
k = features.shape[1]
One has that implementing the above Formulas in Tensorflow could be expressed (prototyped) by:
dim = (batch_size, features.shape[1])
def zero(i):
arr = np.ones(dim)
arr[i] = 0
return arr
mapper = [zero(i) for i in range(batch_size)]
elems = (features, mapper)
m = (1 / (batch_size - 1)) * tf.map_fn(lambda x: tf.math.reduce_sum(x[0] * x[1], axis=0), elems, dtype=tf.float64)
pairs = tf.map_fn(lambda x: tf.concat(x, axis=0) , tf.stack([features, m], 1), dtype=tf.float64)
compactness_loss = (1 / (batch_size * k)) * tf.map_fn(lambda x: tf.math.reduce_euclidean_norm(x), pairs, dtype=tf.float64)
with tf.Session() as sess:
print("loss value output is: ", compactness_loss.eval())
Which yields:
loss value output is: [0.64549722 0.79056942 0.64549722]
However a single measure is required for the batch, therefore it is necessary to reduce it; by the summation of all values.
The wanted Compactness Loss function à la Tensorflow is:
def compactness_loss(actual, features):
features = Flatten()(features)
k = 7 * 7 * 512
dim = (batch_size, k)
def zero(i):
z = tf.zeros((1, dim[1]), dtype=tf.dtypes.float32)
o = tf.ones((1, dim[1]), dtype=tf.dtypes.float32)
arr = []
for k in range(dim[0]):
arr.append(o if k != i else z)
res = tf.concat(arr, axis=0)
return res
masks = [zero(i) for i in range(batch_size)]
m = (1 / (batch_size - 1)) * tf.map_fn(
# row-wise summation
lambda mask: tf.math.reduce_sum(features * mask, axis=0),
masks,
dtype=tf.float32,
)
dists = features - m
sqrd_dists = tf.pow(dists, 2)
red_dists = tf.math.reduce_sum(sqrd_dists, axis=1)
compact_loss = (1 / (batch_size * k)) * tf.math.reduce_sum(red_dists)
return compact_loss
Of course the Flatten() could be moved back into the model for convenience and the k could be derived directly from the feature map; this answers your question. You may just have some trouble finding out the the expected values for the model are - feature maps from the VGG16 (or any other architechture) trained against the imagenet for instance?
The paper says:
In our formulation (shown in Figure 2 (e)), starting froma pre-trained deep model, we freeze initial features (gs) and learn (gl) and (hc). Based on the output of the classification sub-network (hc), two losses compactness loss and descriptiveness loss are evaluated. These two losses, introduced in the subsequent sections, are used to assess the quality of the learned deep feature. We use the provided one-class dataset to calculate the compactness loss. An external multi-class reference dataset is used to evaluate the descriptiveness loss.As shown in Figure 3, weights of gl and hc are learned in the proposed method through back-propagation from the composite loss. Once training is converged, system shown in setup in Figure 2(d) is used to perform classification where the resulting model is used as the pre-trained model.
then looking at the "Framework" backbone here plus:
AlexNet Binary and VGG16 Binary (Baseline). A binary CNN is trained by having ImageNet samples and one-class image samples as the two classes using AlexNet andVGG16 architectures, respectively. Testing is performed using k-nearest neighbor, One-class SVM [43], Isolation Forest [3]and Gaussian Mixture Model [3] classifiers.
Makes me wonder whether it would not be reasonable to add suggested the dense layers to both the Secondary and the Reference Networks to a single class output (Sigmoid) or even and binary class output (using Softmax) and using the mean_squared_error as the so called Compactness Loss and binary_cross_entropy as the Descriptveness Loss.
What I am trying to achieve now is to create a custom loss function in Keras that takes in two tensors (y_true, y_pred) with shapes (None, None, None) and (None, None, 3), respectively. However, the None's are so, that the two shapes are always equal for every (y_true, y_pred). From these tensors I want to produce two distance matrices that contain the squared distances between every possible point pair (the third, length 3 dimension contains x, y, and z spatial values) inside them and then return the difference between these distance matrices. The first code I tried was this:
def distanceMatrixLoss1(y_true, y_pred):
distMatrix1 = [[K.sum(K.square(y_true[i] - y_true[j])) for j in range(i + 1, y_true.shape[1])] for j in range(y_true.shape[1])]
distMatrix2 = [[K.sum(K.square(y_pred[i] - y_pred[j])) for j in range(i + 1, y_pred.shape[1])] for j in range(y_pred.shape[1])]
return K.mean(K.square(K.flatten(distMatrix1) - K.flatten(distMatrix2)))
(K is the TensorFlow backend.) Needless to say, I got the following error:
'NoneType' object cannot be interpreted as an integer
This is understandable, since range(None) does not make a lot of sense and y_true.shape[0] or y_pred.shape[0] is None. I searched whether others got somehow the same problem or not and I found that I could use the scan function of TensorFlow:
def distanceMatrixLoss2(y_true, y_pred):
subtractYfromXi = lambda x, y: tf.scan(lambda xi: K.sum(K.square(xi - y)), x)
distMatrix = lambda x, y: K.flatten(tf.scan(lambda yi: subtractYfromXi(x, yi), y))
distMatrix1 = distMatrix(y_true, y_true)
distMatrix2 = distMatrix(y_pred, y_pred)
return K.mean(K.square(distMatrix1-distMatrix2))
What I got from this is a different error, that I do not fully understand.
TypeError: <lambda>() takes 1 positional argument but 2 were given
So this went into the trash too. My last try was using the backend's map_fn function:
def distanceMatrixLoss3(y_true, y_pred):
subtractYfromXi = lambda x, y: K.map_fn(lambda xi: K.sum(K.square(xi - y)), x)
distMatrix = lambda x, y: K.flatten(K.map_fn(lambda yi: subtractYfromXi(x, yi), y))
distMatrix1 = distMatrix(y_true, y_true)
distMatrix2 = distMatrix(y_pred, y_pred)
return K.mean(K.square(distMatrix1-distMatrix2))
This did not throw an error, but when the training started the loss was constant 0 and stayed that way. So now I am out of ideas and I kindly ask you to help me untangle this problem. I have already tried to do the same in Mathematica and also failed (here is the link to the corresponding question, if it helps).
Assuming that dimension 0 is the batch size as usual and you don't want to mix samples.
Assuming that dimension 1 is the one you want to make pairs
Assuming that the last dimension is 3 for all cases although your model returns None.
Iterating tensors is a bad idea. It might be better just to make a 2D matrix from the original 1D, though having repeated values.
def distanceMatrix(true, pred): #shapes (None1, None2, 3)
#------ creating the distance matrices 1D to 2D -- all vs all
true1 = K.expand_dims(true, axis=1) #shapes (None1, 1, None2, 3)
pred1 = K.expand_dims(pred, axis=1)
true2 = K.expand_dims(true, axis=2) #shapes (None1, None2, 1, 3)
pred2 = K.expand_dims(pred, axis=2)
trueMatrix = true1 - true2 #shapes (None1, None2, None2, 3)
predMatrix = pred1 - pred2
#--------- euclidean x, y, z distance
#maybe needs a sqrt?
trueMatrix = K.sum(K.square(trueMatrix), axis=-1) #shapes (None1, None2, None2)
predMatrix = K.sum(K.square(predMatrix), axis=-1)
#-------- loss for each pair
loss = K.square(trueMatrix - predMatrix) #shape (None1, None2, None2)
#----------compensate the duplicated non-diagonals
diagonal = K.eye(K.shape(true)[1]) #shape (None2, None2)
#if Keras complains because the input is a tensor, use `tf.eye`
diagonal = K.expand_dims(diagonal, axis=0) #shape (1, None2, None2)
diagonal = 0.5 + (diagonal / 2.)
loss = loss * diagonal
#--------------
return K.mean(loss, axis =[1,2]) #or just K.mean(loss)
In setting up the model I sometimes see the code:
# Scenario 1
# Define loss and optimizer
loss_op = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
logits=logits, labels=Y))
or
# Scenario 2
# Evaluate model (with test logits, for dropout to be disabled)
prediction = tf.equal(tf.argmax(prediction, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(prediction, tf.float32))
The definition of tf.reduce_mean states that it "calculates the mean of tensor elements along various dimensions of the tensor." I am confused about what it does in simpler language? When do we need to use it, maybe with reference to # Scenario 1 & 2 ? Thank you
As far as I understand, tensorflow.reduce_mean is the same as numpy.mean. It creates an operation in the underlying tensorflow graph which computes the mean of a tensor.
The most important keyword argument of tensorflow.reduce_mean is axis. Basically, if you have a tensor with shape (4, 3, 2) and axis=1, an empty array with shape (4, 2) will be created, and the mean values along the selected axis will be computed to fill in the empty array. (This is just a pseudo-process to help you make sense of the output, but may not be the actual process)
Here is a simple example to help you understand
import tensorflow as tf
import numpy as np
one = np.linspace(1, 30, 30).reshape(5, 3, 2)
x = tf.placeholder('float32', shape=[5, 3, 2])
op_1 = tf.reduce_mean(x)
op_2 = tf.reduce_mean(x, axis=0)
op_3 = tf.reduce_mean(x, axis=1)
op_4 = tf.reduce_mean(x, axis=2)
with tf.Session() as sess:
print(sess.run(op_1, feed_dict={x: one}))
print(sess.run(op_2, feed_dict={x: one}))
print(sess.run(op_3, feed_dict={x: one}))
print(sess.run(op_4, feed_dict={x: one}))
The first output is a number because we didn't provide an axis. The shapes of the rest of the outputs are (3, 2), (5, 2) and (5, 3), respectively.
reduce_mean can be useful when the target value is a matrix.
User #meTchaikovsky explained the general case of tf.reduce_mean. In both of your cases tf.reduce_mean simply works as any mean calculator i.e,. you're not taking mean along any particular axis of a tensor, you simply divide the sum of the elements in a tensor by number of elements.
Let's decode what exactly is happening in both the cases. For the both the cases assume batch_size = 2 and num_classes = 5, meaning that there are two examples per batch.
Now for the first case, tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y) returns an array of shape (2,).
>>import numpy as np
>>import tensorflow as tf
>>sess= tf.InteractiveSession()
>>batch_size = 2
>>num_classes = 5
>>logits = np.random.rand(batch_size,num_classes)
>>print(logits)
[[0.94108451 0.68186329 0.04000461 0.25996487 0.50391948]
[0.22781201 0.32305269 0.93359371 0.22599208 0.05942905]]
>>labels = np.array([[1,0,0,0,0],[0,1,0,0,0]])
>>print(labels)
[[1 0 0 0 0]
[0 1 0 0 0]]
>>logits_ = tf.placeholder(dtype=tf.float32,shape=(batch_size,num_classes))
>>Y_ = tf.placeholder(dtype=tf.int32,shape=(batch_size,num_classes))
>>loss_op = tf.nn.softmax_cross_entropy_with_logits(logits=logits_, labels=Y_)
>>loss_per_example = sess.run(loss_op,feed_dict={Y_:labels,logits_:logits})
>>print(loss_per_example)
array([1.2028817, 1.6912657], dtype=float32)
You can see that loss_per_example is of shape (2,). If we take the mean of this variable then we can approximate the average loss for the full batch. Hence we calculate
>>loss_per_example_holder = tf.placeholder(dtype=tf.float32,shape=(batch_size))
>>final_loss_per_batch = tf.reduce_mean(loss_per_example_holder)
>>final_loss = sess.run(final_loss_per_batch,feed_dict={loss_per_example_holder:loss_per_example})
>>print(final_loss)
1.4470737
Coming to your second case:
>>predictions_holder = tf.placeholder(dtype=tf.float32,shape=(batch_size,num_classes))
>>labels_holder = tf.placeholder(dtype=tf.int32,shape=(batch_size,num_classes))
>>prediction_tf = tf.equal(tf.argmax(predictions_holder, 1), tf.argmax(labels_holder, 1))
>>labels_match = sess.run(prediction_tf,feed_dict={predictions_holder:logits,labels_holder:labels})
>>print(labels_match)
[ True False]
The above output was expected because only the first example of the variable logits says that the neuron with highest activation (0.9410) is zeroth which is same as labels. Now we want to calculate the accuracy, which means we have to take the average of the variable labels_match.
>>labels_match_holder = tf.placeholder(dtype=tf.float32,shape=(batch_size))
>>accuracy_calc = tf.reduce_mean(tf.cast(labels_match_holder, tf.float32))
>>accuracy = sess.run(accuracy_calc, feed_dict={labels_match_holder:labels_match})
>>print(accuracy)
0.5
I am trying to use the tensorflow LSTM model to make next word predictions.
As described in this related question (which has no accepted answer) the example contains pseudocode to extract next word probabilities:
lstm = rnn_cell.BasicLSTMCell(lstm_size)
# Initial state of the LSTM memory.
state = tf.zeros([batch_size, lstm.state_size])
loss = 0.0
for current_batch_of_words in words_in_dataset:
# The value of state is updated after processing each batch of words.
output, state = lstm(current_batch_of_words, state)
# The LSTM output can be used to make next word predictions
logits = tf.matmul(output, softmax_w) + softmax_b
probabilities = tf.nn.softmax(logits)
loss += loss_function(probabilities, target_words)
I am confused about how to interpret the probabilities vector. I modified the __init__ function of the PTBModel in ptb_word_lm.py to store the probabilities and logits:
class PTBModel(object):
"""The PTB model."""
def __init__(self, is_training, config):
# General definition of LSTM (unrolled)
# identical to tensorflow example ...
# omitted for brevity ...
# computing the logits (also from example code)
logits = tf.nn.xw_plus_b(output,
tf.get_variable("softmax_w", [size, vocab_size]),
tf.get_variable("softmax_b", [vocab_size]))
loss = seq2seq.sequence_loss_by_example([logits],
[tf.reshape(self._targets, [-1])],
[tf.ones([batch_size * num_steps])],
vocab_size)
self._cost = cost = tf.reduce_sum(loss) / batch_size
self._final_state = states[-1]
# my addition: storing the probabilities and logits
self.probabilities = tf.nn.softmax(logits)
self.logits = logits
# more model definition ...
Then printed some info about them in the run_epoch function:
def run_epoch(session, m, data, eval_op, verbose=True):
"""Runs the model on the given data."""
# first part of function unchanged from example
for step, (x, y) in enumerate(reader.ptb_iterator(data, m.batch_size,
m.num_steps)):
# evaluate proobability and logit tensors too:
cost, state, probs, logits, _ = session.run([m.cost, m.final_state, m.probabilities, m.logits, eval_op],
{m.input_data: x,
m.targets: y,
m.initial_state: state})
costs += cost
iters += m.num_steps
if verbose and step % (epoch_size // 10) == 10:
print("%.3f perplexity: %.3f speed: %.0f wps, n_iters: %s" %
(step * 1.0 / epoch_size, np.exp(costs / iters),
iters * m.batch_size / (time.time() - start_time), iters))
chosen_word = np.argmax(probs, 1)
print("Probabilities shape: %s, Logits shape: %s" %
(probs.shape, logits.shape) )
print(chosen_word)
print("Batch size: %s, Num steps: %s" % (m.batch_size, m.num_steps))
return np.exp(costs / iters)
This produces output like this:
0.000 perplexity: 741.577 speed: 230 wps, n_iters: 220
(20, 10000) (20, 10000)
[ 14 1 6 589 1 5 0 87 6 5 3 5 2 2 2 2 6 2 6 1]
Batch size: 1, Num steps: 20
I was expecting the probs vector to be an array of probabilities, with one for each word in the vocabulary (eg with shape (1, vocab_size)), meaning that I could get the predicted word using np.argmax(probs, 1) as suggested in the other question.
However, the first dimension of the vector is actually equal to the number of steps in the unrolled LSTM (20 if the small config settings are used), which I'm not sure what to do with. To access to the predicted word, do I just need to use the last value (because it's the output of the final step)? Or is there something else that I'm missing?
I tried to understand how the predictions are made and evaluated by looking at the implementation of seq2seq.sequence_loss_by_example, which must perform this evaluation, but this ends up calling gen_nn_ops._sparse_softmax_cross_entropy_with_logits, which doesn't seem to be included in the github repo, so I'm not sure where else to look.
I'm quite new to both tensorflow and LSTMs, so any help is appreciated!
The output tensor contains the concatentation of the LSTM cell outputs for each timestep (see its definition here). Therefore you can find the prediction for the next word by taking chosen_word[-1] (or chosen_word[sequence_length - 1] if the sequence has been padded to match the unrolled LSTM).
The tf.nn.sparse_softmax_cross_entropy_with_logits() op is documented in the public API under a different name. For technical reasons, it calls a generated wrapper function that does not appear in the GitHub repository. The implementation of the op is in C++, here.
I am implementing seq2seq model too.
So lets me try to explain with my understanding:
The outputs of your LSTM model is a list (with length num_steps) of 2D tensor of size [batch_size, size].
The code line:
output = tf.reshape(tf.concat(1, outputs), [-1, size])
will produce a new output which is a 2D tensor of size [batch_size x num_steps, size].
For your case, batch_size = 1 and num_steps = 20 --> output shape is [20, size].
Code line:
logits = tf.nn.xw_plus_b(output, tf.get_variable("softmax_w", [size, vocab_size]), tf.get_variable("softmax_b", [vocab_size]))
<=> output[batch_size x num_steps, size] x softmax_w[size, vocab_size] will output logits of size [batch_size x num_steps, vocab_size].
For your case, logits of size [20, vocab_size]
--> probs tensor has same size as logits by [20, vocab_size].
Code line:
chosen_word = np.argmax(probs, 1)
will output chosen_word tensor of size [20, 1] with each value is the next prediction word index of current word.
Code line:
loss = seq2seq.sequence_loss_by_example([logits], [tf.reshape(self._targets, [-1])], [tf.ones([batch_size * num_steps])])
is to compute the softmax cross entropy loss for batch_size of sequences.