I'm currently stuck on an implementation problem with TFRecordReader.
This is the setup:
trainQ = tf.train.string_input_producer(fileList)
RecReader = tf.TFRecordReader()
batch_strings = RecReader.read(trainQ)
encoder_inputs,decoder_inputs,enc_len,dec_len = seq['utterance'],seq['response'],con['utter_length'],con['resp_length']
mini_batch = tf.train.batch([encoder_inputs,decoder_inputs,enc_len,dec_len,decoder_inputs],batch_size,2,capacity=50*batch_size,dynamic_pad = True,enqueue_many=False)
encoder_inp,decoder_inp,encoder_lens,decoder_lens,labels = mini_batch
<build rest of the model>
loss = <some loss>
train_ops = <optimizer>.minimize(loss)
Now when I do train_ops.run(), it automatically reads off the queue and trains the model over a batch. But if I want to evaluate some intermediate variable, I cannot do variable.eval(), since that would read a new batch off the trainQ queue, with different values.
One way I can think of to circumvent this is to use a placeholder to feed parse_single_example and populate the placeholder in the train loop each time. But is there a better way of doing this, i.e. evaluating variables without reading off the queue again?
Hope this is not confusing.
If you want to evaluate an intermediate layer (call it conv3) that depends on the batch input every 100 iterations, you can do the following:
for step in range(100000):
    if step % 100 != 0:
        # only run the training operation
        sess.run(train_op)
    else:
        # run train_op AND `conv3` at the same time
        _, conv3_value = sess.run([train_op, conv3])
The trick here is to fetch train_op and conv3 in the same call to sess.run(). That way, a batch is read off the training queue to train one step, but it is also used at the same time to compute conv3.
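The point about one call consuming one batch can be illustrated with a toy model in plain Python. This is not TensorFlow: ToySession and the train_op and conv3 functions below are hypothetical stand-ins, assumed only to mimic the "one run call dequeues one batch" behaviour described above.

```python
from collections import deque


class ToySession:
    """Toy stand-in for a session reading from an input queue (not TensorFlow)."""

    def __init__(self, batches):
        self.queue = deque(batches)

    def run(self, fetches):
        # One call dequeues exactly one batch, however many fetches it serves.
        batch = self.queue.popleft()
        return [fetch(batch) for fetch in fetches]


def train_op(batch):
    # Hypothetical training step: just report which batch it saw.
    return ("trained_on", batch)


def conv3(batch):
    # Hypothetical intermediate value computed from the same batch.
    return ("conv3_of", batch)


sess = ToySession(["batch0", "batch1", "batch2"])

# Fetched together: both results come from the same batch.
together = sess.run([train_op, conv3])

# Fetched separately: each call consumes its own batch.
first = sess.run([train_op])
second = sess.run([conv3])
```

Here `together` holds two results computed from `batch0`, while `first` and `second` come from two different batches, which is the situation the question wants to avoid.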
I am implementing an encoder-(dual-)decoder model in tensorflow. The decoder is RNN-type. The input to the decoder is a feature map, the output of the previous time-step, and the hidden state of the decoder from the previous time-step. I only want to trigger the decoder(s) when the prediction from the previous time-step is one of a particular set of tokens.
I have tried using tf.boolean_mask on the prediction of the previous time-step to remove those examples that do not predict a trigger-token. Below is an example:
# initialize input
dec_input = tf.expand_dims([token2integer['<start>']] * target.shape[0], 1)
features = encoder(img_tensor)
hidden = decoder.reset_state(batch_size=target.shape[0])
# make first prediction
predictions, hidden, _ = decoder(dec_input, features, hidden)
# add to total loss
loss += loss_function(dec_input, predictions)
# construct input of next time-step (here with teacher forcing)
dec_input = tf.expand_dims(target[:, i], -1)
# compute mask to only trigger for certain predictions
mask_struc = compute_mask_struc(dec_input)
# apply mask to input
features = tf.boolean_mask(features, mask_struc)
hidden = tf.boolean_mask(hidden, mask_struc)
target = tf.boolean_mask(target, mask_struc)
dec_input = tf.boolean_mask(dec_input, mask_struc)
# make next prediction and so on ...
I have implemented this in a training function. My implementation works, but it is slow. And when I run the function as a graph (with @tf.function), it gets 10x slower. If I remove the boolean_mask and run as a graph (with @tf.function), it is faster than without @tf.function.
How can I speed up the execution (with or without @tf.function)?
My ideas:
- fix whatever is making the graph execution slow: I don't know how.
- find an alternative approach (without boolean_mask): I need inspiration.
- give up and try with PyTorch, which I am more familiar with: not guaranteed to be faster.
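One possible direction for the second idea, sketched in plain Python rather than TensorFlow: keep every example in the batch so shapes stay fixed, and multiply the per-example loss by the mask instead of filtering with boolean_mask. Whether this removes the slowdown here is an assumption; boolean_mask produces data-dependent shapes, which is one plausible cause but not a confirmed one. The function names and numbers below are hypothetical.

```python
def filtered_loss(per_example_loss, mask):
    # Drops untriggered examples, so the length of the kept list depends on the data.
    kept = [loss for loss, keep in zip(per_example_loss, mask) if keep]
    return sum(kept)


def weighted_loss(per_example_loss, mask):
    # Keeps every example and zeroes the untriggered ones; the shape never changes.
    return sum(loss * float(keep) for loss, keep in zip(per_example_loss, mask))


per_example_loss = [0.5, 1.25, 2.0, 0.25]   # hypothetical losses for a batch of 4
mask = [True, False, True, False]           # hypothetical trigger mask

loss_a = filtered_loss(per_example_loss, mask)
loss_b = weighted_loss(per_example_loss, mask)
```

Both functions give the same total, so the gradient signal from triggered examples is unchanged. The trade-off is that untriggered examples still pass through the decoder, and since the original code drops an example for all later time-steps, the mask would have to be accumulated across steps to match that behaviour.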
Recently I have been trying to learn how to use TensorFlow on multiple GPUs by reading the official tutorial. However, there is something I am confused about. The following code is part of the official tutorial; it calculates the loss on a single GPU.
def tower_loss(scope, images, labels):
    # Build inference Graph.
    logits = cifar10.inference(images)
    # Build the portion of the Graph calculating the losses. Note that we will
    # assemble the total_loss using a custom function below.
    _ = cifar10.loss(logits, labels)
    # Assemble all of the losses for the current tower only.
    losses = tf.get_collection('losses', scope)
    # Calculate the total loss for the current tower.
    total_loss = tf.add_n(losses, name='total_loss')
    # Attach a scalar summary to all individual losses and the total loss; do the
    # same for the averaged version of the losses.
    for l in losses + [total_loss]:
        # Remove 'tower_[0-9]/' from the name in case this is a multi-GPU training
        # session. This helps the clarity of presentation on tensorboard.
        loss_name = re.sub('%s_[0-9]*/' % cifar10.TOWER_NAME, '', l.op.name)
        tf.summary.scalar(loss_name, l)
    return total_loss
The training process is as follows.
def train():
    with tf.device('/cpu:0'):
        # Create a variable to count the number of train() calls. This equals the
        # number of batches processed * FLAGS.num_gpus.
        global_step = tf.get_variable(
            'global_step', [],
            initializer=tf.constant_initializer(0), trainable=False)
        # Calculate the learning rate schedule.
        num_batches_per_epoch = (cifar10.NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN /
                                 FLAGS.batch_size / FLAGS.num_gpus)
        decay_steps = int(num_batches_per_epoch * cifar10.NUM_EPOCHS_PER_DECAY)
        # Decay the learning rate exponentially based on the number of steps.
        lr = tf.train.exponential_decay(cifar10.INITIAL_LEARNING_RATE,
        # Create an optimizer that performs gradient descent.
        opt = tf.train.GradientDescentOptimizer(lr)
        # Get images and labels for CIFAR-10.
        images, labels = cifar10.distorted_inputs()
        batch_queue = tf.contrib.slim.prefetch_queue.prefetch_queue(
            [images, labels], capacity=2 * FLAGS.num_gpus)
        # Calculate the gradients for each model tower.
        tower_grads = []
        with tf.variable_scope(tf.get_variable_scope()):
            for i in xrange(FLAGS.num_gpus):
                with tf.device('/gpu:%d' % i):
                    with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
                        # Dequeues one batch for the GPU
                        image_batch, label_batch = batch_queue.dequeue()
                        # Calculate the loss for one tower of the CIFAR model. This function
                        # constructs the entire CIFAR model but shares the variables across
                        # all towers.
                        loss = tower_loss(scope, image_batch, label_batch)
                        # Reuse variables for the next tower.
                        tf.get_variable_scope().reuse_variables()
                        # Retain the summaries from the final tower.
                        summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope)
However, I am confused about the loop 'for i in xrange(FLAGS.num_gpus)'. It seems that I have to get a new image batch from batch_queue and calculate every gradient. I think this process is serialized instead of parallel. Is there anything wrong with my understanding? By the way, I can also use an iterator to feed images to my model rather than the dequeue, right?
Thank you everybody!
This is a common misconception with Tensorflow's coding model.
What you are showing here is the computation graph's construction, NOT the actual execution.
The block:
for i in xrange(FLAGS.num_gpus):
    with tf.device('/gpu:%d' % i):
        with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
            # Dequeues one batch for the GPU
            image_batch, label_batch = batch_queue.dequeue()
            # Calculate the loss for one tower of the CIFAR model. This function
            # constructs the entire CIFAR model but shares the variables across
            # all towers.
            loss = tower_loss(scope, image_batch, label_batch)
translates to:
For each GPU device (`for i in range..` & `with device...`):
- build operations needed to dequeue a batch
- build operations needed to run the batch through the network and compute the loss
Note how, via tf.get_variable_scope().reuse_variables(), you're telling the graph that the variables created for the first GPU's tower must be shared among all towers (i.e., all graphs on the multiple devices "reuse" the same variables).
None of this actually runs the network once (note how there is no sess.run()): you're just giving instructions on how data must flow.
Then, when you start the actual training (I guess you missed that piece of the code when copying it here), each GPU will pull its own batch and produce the per-tower loss. I guess these losses are averaged somewhere in the subsequent code and the average is the loss passed to the optimizer.
Up until the point where the tower losses are averaged together, every tower is independent of the other devices, so fetching the batch and computing the loss can be done in parallel. Then the gradient computation and the parameter update are done only once, the variables are updated, and the cycle repeats.
So, to answer your question: no, the per-batch loss computation is not serialized. But since this is synchronous distributed computation, you need to collect the losses from all GPUs before continuing with the gradient computation and the parameter update, so some part of the graph still cannot run independently.
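The synchronization point described above can be sketched in plain Python. This is a minimal sketch under assumptions: it combines per-tower gradients (the answer above guesses that losses are combined; the synchronization point is the same either way), gradients are plain floats, and the variable names and values are hypothetical.

```python
def average_gradients(tower_grads):
    """Average gradients per variable across towers.

    tower_grads holds one list per tower of (gradient, variable_name) pairs,
    with the variables in the same order in every tower.
    """
    averaged = []
    # zip(*tower_grads) groups the entries for one variable across all towers.
    for grads_and_vars in zip(*tower_grads):
        grads = [grad for grad, _ in grads_and_vars]
        name = grads_and_vars[0][1]
        averaged.append((sum(grads) / len(grads), name))
    return averaged


tower_grads = [
    [(0.5, "w"), (2.0, "b")],   # hypothetical gradients from GPU 0
    [(1.5, "w"), (4.0, "b")],   # hypothetical gradients from GPU 1
]
avg = average_gradients(tower_grads)
```

Each tower's list can be produced independently of the others; only the call to `average_gradients` has to wait for all of them, which is the part of the graph that cannot run in parallel.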
I am trying to train a double hidden layer CNN with policy gradient. I have a custom loss function for this. I am trying to implement it in Keras (backend TF) using the code for Cartpole game by kkweon. The code is as follows,
def _build_train_fn(network):
    action_prob_placeholder = network.output
    action_onehot_placeholder = K.placeholder(shape=(None, output_dim),
                                              name="action_onehot")
    discount_reward_placeholder = K.placeholder(shape=(None,),
                                                name="discount_reward")
    # probability the network assigned to the action actually taken
    action_prob = K.sum(action_prob_placeholder * action_onehot_placeholder,
                        axis=1)
    log_action_prob = K.log(action_prob)
    loss = -log_action_prob * discount_reward_placeholder
    loss = K.mean(loss)
    adam = optimizers.Adam()
    updates = adam.get_updates(loss=loss, params=network.trainable_weights)
    train_fn = K.function(inputs=[network.input,
                                  action_onehot_placeholder,
                                  discount_reward_placeholder],
                          outputs=[], updates=updates)
    return train_fn
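To make the loss concrete, here is the same arithmetic worked through in plain Python for two time-steps. This is only an illustration of the formula (mean of the negative log-probability of the taken action, weighted by the discounted reward); the function name and the numbers are hypothetical, and no Keras is involved.

```python
import math


def policy_gradient_loss(action_probs, action_onehot, discounted_rewards):
    per_step = []
    for probs, onehot, reward in zip(action_probs, action_onehot, discounted_rewards):
        # Probability the policy assigned to the action actually taken.
        taken_prob = sum(p * a for p, a in zip(probs, onehot))
        per_step.append(-math.log(taken_prob) * reward)
    # Mean over the time-steps in the mini batch.
    return sum(per_step) / len(per_step)


action_probs = [[0.5, 0.5], [0.25, 0.75]]   # hypothetical network outputs
action_onehot = [[1, 0], [0, 1]]            # actions taken at each step
discounted_rewards = [1.0, 2.0]             # hypothetical discounted rewards

loss = policy_gradient_loss(action_probs, action_onehot, discounted_rewards)
```

The first step contributes -log(0.5) * 1.0 and the second -log(0.75) * 2.0; the loss is their mean.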
Now I run the training function inside a for loop, where I take mini batches of episodes and train for that. So I need to build the train function for each mini batch, as it depends on the network output. The batch-wise training inside the for loop goes like this,
if episode_number % batch_size == 0:
    mini_batch_states = np.asarray(mini_batch_states).reshape([len(mini_batch_states), 14])
    mini_batch_moves = np.asarray(mini_batch_moves)
    train_fn = _build_train_fn(network)
    train_fn([mini_batch_states, mini_batch_moves, mini_batch_rewards])
The issue is that memory usage keeps growing as the number of mini batches grows. I have tried building the function once before the loop, which solves the memory issue but removes the aspect of dynamic learning: the agent converges to a 0.5 win rate instead of improving consistently.
Also using a
del train_fn
does not solve the memory issue.
I am very new to both TF and Keras. Any help is much appreciated.
I'm implementing an algorithm involving alternating optimization. That is, at each iteration, the algorithm fetches a data batch and uses it to optimize two losses sequentially. My current implementation with tf.data.Dataset and tf.data.Iterator is something like this (which is in fact incorrect, as detailed below):
data_batch = iterator.get_next()
train_op_1 = get_train_op(data_batch)
train_op_2 = get_train_op(data_batch)
for _ in range(num_steps):
    sess.run(train_op_1)
    sess.run(train_op_2)
Note that the above is incorrect because each call of sess.run will advance the iterator to get next data batch. So train_op_1 and train_op_2 are indeed using different data batches.
I cannot do something like sess.run([train_op_1, train_op_2]) either, because the two optimization steps need to be sequential (i.e., the 2nd optimization step depends on the latest variable value by the 1st optimization step.)
I'm wondering whether there is any way to somehow "freeze" the iterator, so that it won't advance in a sess.run call?
I was doing something similar, so this is part of my code with the unnecessary parts stripped out. It does a bit more, as it has train and validation iterators, but you should get the idea of using the is_keep_previous flag. Basically, when passed as True it will force reuse of the previous value of the iterator; when False it will get a new value.
iterator_t = ds_t.make_initializable_iterator()
iterator_v = ds_v.make_initializable_iterator()
iterator_handle = tf.placeholder(tf.string, shape=[], name="iterator_handle")
iterator = tf.data.Iterator.from_string_handle(iterator_handle,
def get_next_item():
# sometimes items need casting
next_elem = iterator.get_next(name="next_element")
x, y = tf.cast(next_elem[0], tf.float32), next_elem[1]
return x, y
def old_data():
# just forward the existing batch
return inputs, target
is_keep_previous = tf.placeholder_with_default(tf.constant(False),shape=[], name="keep_previous_flag")
inputs, target = tf.cond(is_keep_previous, old_data, new_data)
with tf.Session() as sess:
handle_t = sess.run(iterator_t.string_handle())
handle_v = sess.run(iterator_v.string_handle())
# Run data iterator initialisation
while True:
inputs_, target_ = sess.run([inputs, target], feed_dict={iterator_handle: handle_t, is_keep_previous:False})
print(inputs_, target_)
inputs_, target_ = sess.run([inputs, target], feed_dict={iterator_handle: handle_t, is_keep_previous:True})
print(inputs_, target_)
inputs_, target_ = sess.run([inputs, target], feed_dict={iterator_handle: handle_v})
print(inputs_, target_)
except tf.errors.OutOfRangeError:
# now we know we have run out of elements in the validation iterator
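The idea behind the keep-previous flag can be shown with a toy iterator in plain Python. This is not the tf.data API: ReplayableIterator and its get method are hypothetical names, assumed only to mirror the behaviour described above (advance when the flag is False, return the cached batch when it is True).

```python
class ReplayableIterator:
    """Toy iterator that can hand out its last batch again (not tf.data)."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._last = None
        self._has_last = False

    def get(self, keep_previous=False):
        if keep_previous:
            if not self._has_last:
                raise RuntimeError("no previous batch to reuse")
            return self._last
        # Advance the underlying iterator; raises StopIteration when exhausted.
        self._last = next(self._it)
        self._has_last = True
        return self._last


batches = ReplayableIterator([[1, 2], [3, 4], [5, 6]])
seen = []
for _ in range(2):
    batch_for_op_1 = batches.get()                     # advance once per step
    batch_for_op_2 = batches.get(keep_previous=True)   # reuse the same batch
    seen.append((batch_for_op_1, batch_for_op_2))
```

In each step both optimization passes see the same batch, and the iterator only advances once per step.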
Use control dependencies when building the graph for train_op_2 so it can see the updated values of the variables.
Or use eager execution.
Let's say I defined a network Net and the example code below runs well.
# ... input processing using TFRecord ... # reading from TFRecord
x, y = tf.train.batch([image, label]) # encode batch
net = Net(x,y) # connect to network
# ... initialize and session ...
for iteration:
    loss, _ = sess.run([net.loss, net.train_op])
The Net does not have tf.placeholder, since input is provided by tensors from TFRecord provider. What if I would like to run validation set as well, e.g., every 500 steps? How can I switch input flow?
x, y = tf.train.batch([image, label], ...) # training set
vx, vy = tf.train.batch([vimage, vlabel], ...) # validation set
net = Net(x,y)
for iteration:
    loss, _ = sess.run([net.loss, net.train_op])
    if step % 500 == 0:
        # graph is already defined from input to loss.
        # how can I run net.loss with vx and vy??
The only thing I can imagine is modifying Net to have placeholders and running it each time like
sess.run([...], feed_dict = {Net.x:sess.run(x), Net.y:sess.run(y)})
sess.run([...], feed_dict = {Net.x:sess.run(vx), Net.y:sess.run(vy)})
However, it seems to me that this loses the benefits of using TFRecord (e.g., full TF integration). In the middle of the computation flow, I have to stop the flow, run the session, and continue (doesn't this lower the speed by forcing the use of the CPU in the middle?)
I am wondering:
- if there is a better way.
- if my solution is not as bad as I imagine.
Thanks in advance.
There is a better way (than placeholders). I ran into this issue with the CIFAR10 tutorial in TensorFlow, which I adjusted to check accuracy on the test set alongside training, every 500 batches or so. This is where sharing variables comes in handy.
x, y = tf.train.batch([image, label], ...) # training set
vx, vy = tf.train.batch([vimage, vlabel], ...) # validation set
with tf.variable_scope("model") as scope:
    net = Net(x,y)
    # reuse the variables created by the first Net for the second one
    scope.reuse_variables()
    vnet = Net(vx,vy)
for iteration:
    loss, _ = sess.run([net.loss, net.train_op])
    if step % 500 == 0:
        loss, acc = sess.run([vnet.loss, vnet.accuracy])
By setting the scope to reuse variables before the second call to Net(), you will use the same variables and values created in the first call, but with a different set of inputs. Just make sure that vimage and vlabel aren't reusing tensors from image and label (which could possibly be solved by creating their own variable scopes).
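As a rough analogy in plain Python (not TensorFlow), variable sharing means that two network objects read the same parameters while consuming different inputs. ToyNet and the parameter dictionary below are hypothetical and only illustrate why the validation network sees every update made through the training network.

```python
class ToyNet:
    """Toy linear model that reads its parameters from a shared dict."""

    def __init__(self, params, inputs):
        self.params = params   # shared reference, not a copy
        self.inputs = inputs

    def output(self):
        return [self.params["w"] * x + self.params["b"] for x in self.inputs]


params = {"w": 2.0, "b": 1.0}
net = ToyNet(params, [1.0, 2.0])   # "training" inputs
vnet = ToyNet(params, [10.0])      # "validation" inputs

before = vnet.output()
net.params["w"] = 3.0              # hypothetical update made while training
after = vnet.output()
```

`vnet` never updates anything itself, yet its output changes after the update made through `net`, because both objects read the same parameters.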