Tensorflow distributed training with custom training step

My training runs are slow, so I have tried to scale up the training procedure by using TensorFlow's Strategy API to utilize all 4 GPUs.
I'm using MirroredStrategy and experimental_distribute_dataset to partition the dataset.
My training data is a mix of sparse and dense matrices. I'm using a generator to construct my dataset (it picks random indices into the data). However, in my current version of TF (2.1), generators don't support sparse matrices. The sparse_matrix does not have a static size and is a ragged tensor.
This bit is an ugly workaround: I pass my sparse_matrix_list directly to the train function and index into it through a global queue that is populated with the random indices chosen inside the generator.
This approach was working fine, but it was far too slow, so I wanted to try training with all GPUs. That makes things even more problematic, as I have to manually partition the sparse_matrix_list into num_workers splits.
However, the main problem right now is that the training procedure does not seem to run in parallel; the replicas (GPUs) seem to run sequentially. I verified this through nvidia-smi and logs in the train_process function.
I have no prior experience with distributed training and am not sure why this would be the case. I would really appreciate pointers to a better way of handling this mix of sparse and dense data. I'm currently facing a huge bottleneck in fetching my data, which underutilizes the GPUs (utilization fluctuates between 10% and 30%).
def distributed_train_step(inputs, sparse_matrix_list):
    per_replica_losses = strategy.experimental_run_v2(
        train_process, args=(inputs, sparse_matrix_list))
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses,
                           axis=None)

def train_process(inputs, sparse_matrix_list):
    worker_id = tf.distribute.get_replica_context().replica_id_in_sync_group
    replica_batch_size = inputs.shape[0]
    slice_start = replica_batch_size * worker_id
    replica_sparse_matrix = sparse_matrix_list[slice_start:slice_start + replica_batch_size]
    return train_step(inputs, replica_sparse_matrix)

def train_step(inputs, sparse_matrix_list):
    with tf.GradientTape() as tape:
        outputs, mu, sigma, feat_out, logit = model(inputs)
        loss = K.backend.mean(custom_loss(inputs, sparse_matrix_list))
    return loss

def get_batch_data(sparse_matrix_list):
    # Queue with the random indices into the training data (list of lists, each
    # entry has len == batch_size)
    # train_indicies is a global queue
    next_batch_indicies = train_indicies.get()
    batch_sparse_list = sparse_matrix_list[next_batch_indicies]
    return batch_sparse_list

dist_dataset = strategy.experimental_distribute_dataset(train_dataset)
for batch, inputs in enumerate(dist_dataset, 1):
    # sparse_matrix_list is passed to this main "train" function from outside this module.
    batch_sparse_matrix_slice = get_batch_data(sparse_matrix_list)
    loss = distributed_train_step(inputs, batch_sparse_matrix_slice)
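The manual partitioning done in train_process (slice_start = replica_batch_size * worker_id) can be checked in isolation. The following is only a framework-free sketch of that index arithmetic: replica_slice and the example indices are hypothetical names and values, and it assumes the global batch divides evenly across replicas.

```python
def replica_slice(global_batch, num_replicas, replica_id):
    """Return the contiguous slice of a global batch owned by one replica."""
    if len(global_batch) % num_replicas != 0:
        raise ValueError("global batch must divide evenly across replicas")
    per_replica = len(global_batch) // num_replicas
    start = per_replica * replica_id
    return global_batch[start:start + per_replica]


# Hypothetical global batch of 8 sparse-matrix indices split across 4 replicas.
global_batch = [17, 3, 42, 8, 23, 5, 11, 30]
slices = [replica_slice(global_batch, 4, r) for r in range(4)]
```

Each replica sees a disjoint slice, and concatenating the slices in replica order gives back the global batch, which is the property the manual split has to preserve.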

Related

Efficient conditional triggering of decoder in tensorflow

I am implementing an encoder-(dual-)decoder model in tensorflow. The decoder is RNN-type. The input to the decoder is a feature map, the output of the previous time-step and the hidden state of the decoder from the previous time-step. I only want to trigger the decoder(s) when the prediction from the previous time-step is one of a particular set of tokens.
I have tried using tf.boolean_mask on the prediction of the previous time-step to remove those examples that do not predict a trigger-token. Below is an example:
# initialize input
dec_input = tf.expand_dims([token2integer['<start>']] * target.shape[0], 1)
features = encoder(img_tensor)
hidden = decoder.reset_state(batch_size=target.shape[0])
# make first prediction
predictions, hidden, _ = decoder(dec_input, features, hidden)
# add to total loss
loss += loss_function(dec_input, predictions)
# construct input of next time-step (here with teacher forcing)
dec_input = tf.expand_dims(target[:, i], -1)
# compute mask to only trigger for certain predictions
mask_struc = compute_mask_struc(dec_input)
# apply mask to input
features = tf.boolean_mask(features, mask_struc)
hidden = tf.boolean_mask(hidden, mask_struc)
target = tf.boolean_mask(target, mask_struc)
dec_input = tf.boolean_mask(dec_input, mask_struc)
# make next prediction and so on ...
I have implemented this in a training function. My implementation works, but it is slow. When I run the function as a graph (with @tf.function) it gets 10x slower. If I remove the boolean_mask and run as a graph (with @tf.function), it is faster than without @tf.function.
How can I speed up the execution (with or without the #tf.function)?
My ideas:
fix whatever is making the graph execution slow: I don't know how.
find alternative approach (without boolean_mask): I need inspiration
give up and try with PyTorch which I am more familiar with: not guaranteed to be faster.
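For the second idea, one commonly used alternative to tf.boolean_mask is to keep every tensor at its full batch shape and multiply the per-example loss by a 0/1 mask, so that shapes stay static across time-steps. The sketch below shows only that arithmetic in plain Python. masked_mean_loss and the sample values are hypothetical, and it is an assumption (not something measured here) that this runs faster in graph mode. Note that it still runs the decoder on every example and only ignores the untriggered ones in the loss.

```python
def masked_mean_loss(per_example_loss, trigger_mask):
    """Average the loss over triggered examples only, without changing shapes."""
    weights = [1.0 if m else 0.0 for m in trigger_mask]
    total = sum(l * w for l, w in zip(per_example_loss, weights))
    count = sum(weights)
    # Guard against a batch in which no example predicts a trigger token.
    return total / count if count > 0 else 0.0


per_example_loss = [0.5, 2.0, 1.5, 4.0]
trigger_mask = [True, False, True, False]
loss = masked_mean_loss(per_example_loss, trigger_mask)
```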

How to write an efficient custom Keras data generator

I would like to train a convolutional recurrent neural network for video frame prediction. The individual frames are quite big so it is challenging to fit the entire training data in memory at once. As such, I followed some tutorials online to create a custom data generator. When testing it, it seems to work but it is slower by a factor of at least 100 than using the pre-loaded data directly. Since I can only fit about a batch size of 8 on the GPU I understand that the data needs to be generated really fast, however, this does not seem to be the case.
I train my model on a single P100 and have 32 GB of memory available to be used by up to 16 cores.
import glob
import random
import numpy as np
import tensorflow as tf

class DataGenerator(tf.keras.utils.Sequence):
    def __init__(self, images, input_images=5, predict_images=5, batch_size=16, image_size=(200, 200),
                 channels=1):
        self.images = images
        self.input_images = input_images
        self.predict_images = predict_images
        self.batch_size = batch_size
        self.image_size = image_size
        self.channels = channels
        self.nr_images = int(len(self.images)-input_images-predict_images)

    def __len__(self):
        # Number of full batches per epoch
        return int(np.floor(self.nr_images / self.batch_size))

    def __getitem__(self, item):
        # Randomly select the beginning image of each batch
        batch_indices = random.sample(range(0, self.nr_images), self.batch_size)
        # Allocate the output images
        x = np.empty((self.batch_size, self.input_images,
                      *self.image_size, self.channels), dtype='uint8')
        y = np.empty((self.batch_size, self.predict_images,
                      *self.image_size, self.channels), dtype='uint8')
        # Get the list of input and prediction images
        for i in range(self.batch_size):
            list_images_input = range(batch_indices[i], batch_indices[i]+self.input_images)
            list_images_predict = range(batch_indices[i]+self.input_images,
                                        batch_indices[i]+self.input_images+self.predict_images)
            # Read in the input images
            for j, ID in enumerate(list_images_input):
                x[i, j] = np.reshape(np.load(self.images[ID]), (*self.image_size, self.channels))
            # Read in the prediction images
            for j, ID in enumerate(list_images_predict):
                y[i, j] = np.reshape(np.load(self.images[ID]), (*self.image_size, self.channels))
        return x, y

# Training the model using fit_generator
params = {'batch_size': 8,
          'input_images': 5,
          'predict_images': 5,
          'image_size': (100, 100),
          'channels': 1
          }
data_path = "input_frames/"
input_images = sorted(glob.glob(data_path + "*.png"))
training_generator = DataGenerator(input_images, **params)
model.fit_generator(generator=training_generator, epochs=10, workers=6)
I would have expected that Keras will prepare the next data batch while the current batch is being processed on the GPU but it does not seem to catch up. In other words, preparing the data before sending it to the GPU seems to be the bottleneck.
Any idea on how to improve the performance of a data generator like this? Is there something missing that guarantees that the data is being prepared in a timely manner?
Thanks a lot!

When you use fit_generator, there is a workers= setting that can be used to scale up the number of generator workers. However, you should make sure that the 'item' parameter of __getitem__ is taken into account, so that the different workers (which are not synchronised) return different batches depending on the item index, i.e. instead of a random sample, return a slice of the data based on the index. You can shuffle the entire dataset before starting to make sure the order is randomised.
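The suggestion above can be sketched as follows. This is a framework-free stand-in, not a drop-in replacement: IndexedBatches is a hypothetical class that mirrors the __len__ / __getitem__ / on_epoch_end interface of a Keras Sequence without subclassing it, and the "samples" are plain integers standing in for start indices of frame sequences.

```python
import random


class IndexedBatches:
    """Stand-in for a Keras Sequence that honours the batch index."""

    def __init__(self, samples, batch_size, seed=0):
        self.samples = list(samples)
        self.batch_size = batch_size
        self._rng = random.Random(seed)
        self._rng.shuffle(self.samples)  # shuffle once, up front

    def __len__(self):
        # Number of full batches per epoch
        return len(self.samples) // self.batch_size

    def __getitem__(self, item):
        # Batch `item` is always the same slice, whichever worker asks for it.
        start = item * self.batch_size
        return self.samples[start:start + self.batch_size]

    def on_epoch_end(self):
        self._rng.shuffle(self.samples)  # reshuffle between epochs


batches = IndexedBatches(range(10), batch_size=3)
epoch = [batches[i] for i in range(len(batches))]
```

Because every index maps to a fixed slice within an epoch, unsynchronised workers cannot hand back overlapping batches, while the up-front shuffle keeps the order random.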

Can you please try with use_multiprocessing=True? These are the numbers I observe on my GTX 1080Ti based system with the data generator you provided.
model.fit_generator(generator=training_generator, epochs=10, workers=6)
148/148 [==============================] - 9s 60ms/step
model.fit_generator(generator=training_generator, epochs=10, workers=6, use_multiprocessing=True)
148/148 [==============================] - 2s 11ms/step

You can try the prefetching of tf.data.Dataset. Prefetching lets the CPU prepare the next batch(es) while the GPU runs the gradient step on the current one. Be careful: you need to change the numpy array into tf.constant in the data generator. Then try:
import tensorflow as tf
generator = DataGenerator(images)
spec = (tf.TensorSpec(shape=(generator.batch_size, generator.input_images,
                             *generator.image_size, generator.channels), dtype=tf.uint8),
        tf.TensorSpec(shape=(generator.batch_size, generator.predict_images,
                             *generator.image_size, generator.channels), dtype=tf.uint8))
# from_generator needs a callable that returns an iterable of (x, y) pairs
dataset = tf.data.Dataset.from_generator(
    lambda: (generator[i] for i in range(len(generator))), output_signature=spec)
# the generator already yields whole batches, so only prefetch is chained here;
# if you batch inside tf.data instead, call batch() before prefetch() (this order is important)
dataset = dataset.prefetch(-1)
# a custom training loop is better than model.fit() otherwise prefetching can fail
def train_loop():
    ...
You can change the "-1" in prefetch() (which lets tf.data choose the buffer size automatically) to another value like 1, 2 or more to get the maximum speed depending on your machine and the batch size.

This blog helps with setting up an input data pipeline with tf.data, which is also much more efficient than using ImageDataGenerator; the code is explained using a custom data directory.
It also improves performance with prefetch and cache.
Prefetch prepares the next batch while the current batch is being used.
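To make the prefetch idea concrete, here is a minimal sketch of the same producer/consumer pattern in plain Python. It is not how tf.data is implemented: prefetch and slow_batches are hypothetical helpers, error handling is omitted, and a background thread fills a bounded queue while the consumer works on the current batch.

```python
import queue
import threading


def prefetch(batches, buffer_size=2):
    """Yield items from `batches` while a background thread prepares the next ones."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for batch in batches:
            buffer.put(batch)  # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = buffer.get()
        if batch is done:
            return
        yield batch


def slow_batches(n):
    for i in range(n):
        yield [i, i + 1]  # stands in for expensive loading and decoding


consumed = list(prefetch(slow_batches(4)))
```

The buffer size plays the same role as the argument to prefetch(): a larger buffer smooths out slow batches at the cost of more memory.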

TensorFlow on multiple GPU

Recently, I have been trying to learn how to use TensorFlow on multiple GPUs by reading the official tutorial. However, there is something I am confused about. The following code is part of the official tutorial; it calculates the loss on a single GPU.
def tower_loss(scope, images, labels):
    # Build inference Graph.
    logits = cifar10.inference(images)
    # Build the portion of the Graph calculating the losses. Note that we will
    # assemble the total_loss using a custom function below.
    _ = cifar10.loss(logits, labels)
    # Assemble all of the losses for the current tower only.
    losses = tf.get_collection('losses', scope)
    # Calculate the total loss for the current tower.
    total_loss = tf.add_n(losses, name='total_loss')
    # Attach a scalar summary to all individual losses and the total loss; do the
    # same for the averaged version of the losses.
    for l in losses + [total_loss]:
        # Remove 'tower_[0-9]/' from the name in case this is a multi-GPU training
        # session. This helps the clarity of presentation on tensorboard.
        loss_name = re.sub('%s_[0-9]*/' % cifar10.TOWER_NAME, '', l.op.name)
        tf.summary.scalar(loss_name, l)
    return total_loss
The training process is as follows.
def train():
    with tf.device('/cpu:0'):
        # Create a variable to count the number of train() calls. This equals the
        # number of batches processed * FLAGS.num_gpus.
        global_step = tf.get_variable(
            'global_step', [],
            initializer=tf.constant_initializer(0), trainable=False)
        # Calculate the learning rate schedule.
        num_batches_per_epoch = (cifar10.NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN /
                                 FLAGS.batch_size / FLAGS.num_gpus)
        decay_steps = int(num_batches_per_epoch * cifar10.NUM_EPOCHS_PER_DECAY)
        # Decay the learning rate exponentially based on the number of steps.
        lr = tf.train.exponential_decay(cifar10.INITIAL_LEARNING_RATE,
                                        global_step,
                                        decay_steps,
                                        cifar10.LEARNING_RATE_DECAY_FACTOR,
                                        staircase=True)
        # Create an optimizer that performs gradient descent.
        opt = tf.train.GradientDescentOptimizer(lr)
        # Get images and labels for CIFAR-10.
        images, labels = cifar10.distorted_inputs()
        batch_queue = tf.contrib.slim.prefetch_queue.prefetch_queue(
            [images, labels], capacity=2 * FLAGS.num_gpus)
        # Calculate the gradients for each model tower.
        tower_grads = []
        with tf.variable_scope(tf.get_variable_scope()):
            for i in xrange(FLAGS.num_gpus):
                with tf.device('/gpu:%d' % i):
                    with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
                        # Dequeues one batch for the GPU
                        image_batch, label_batch = batch_queue.dequeue()
                        # Calculate the loss for one tower of the CIFAR model. This function
                        # constructs the entire CIFAR model but shares the variables across
                        # all towers.
                        loss = tower_loss(scope, image_batch, label_batch)
                        # Reuse variables for the next tower.
                        tf.get_variable_scope().reuse_variables()
                        # Retain the summaries from the final tower.
                        summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope)
However, I am confused about the for loop 'for i in xrange(FLAGS.num_gpus)'. It seems that I have to get a new image batch from batch_queue and calculate every gradient. I think this process is serialized instead of parallel. Is there anything wrong with my understanding? By the way, I can also use an iterator to feed images to my model rather than the dequeue, right?
Thank you everybody!

This is a common misconception with Tensorflow's coding model.
What you are showing here is the computation graph's construction, NOT the actual execution.
The block:
for i in xrange(FLAGS.num_gpus):
    with tf.device('/gpu:%d' % i):
        with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
            # Dequeues one batch for the GPU
            image_batch, label_batch = batch_queue.dequeue()
            # Calculate the loss for one tower of the CIFAR model. This function
            # constructs the entire CIFAR model but shares the variables across
            # all towers.
            loss = tower_loss(scope, image_batch, label_batch)
translates to:
For each GPU device (`for i in range..` & `with device...`):
- build operations needed to dequeue a batch
- build operations needed to run the batch through the network and compute the loss
Note how via tf.get_variable_scope().reuse_variables() you're telling TensorFlow that the variables created for the first GPU's tower must be shared among all towers (i.e., the sub-graphs on all devices "reuse" the same variables).
None of this actually runs the network once (note how there is no sess.run()): you're just giving instructions on how data must flow.
Then, when you start the actual training (I guess you missed that piece of the code when copying it here), each GPU will pull its own batch and produce the per-tower loss. I guess these losses are averaged somewhere in the subsequent code and the average is the loss passed to the optimizer.
Up until the point where the tower losses are averaged together, everything is independent of the other devices, so getting the batch and computing the loss can be done in parallel. Then the gradient computation and parameter update are done only once, the variables are updated and the cycle repeats.
So, to answer your question, no, per-batch loss computation is not serialized, but since this is synchronous distributed computation you need to collect all losses from all GPUs before being allowed to continue with gradients computation and parameters update, so you still have some part of the graph that cannot be independent.
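The distinction between building and running can be illustrated with a small analogy in plain Python. This is not TensorFlow's execution model: build_towers and run_step are hypothetical helpers, a thread pool stands in for the GPUs, and the "loss" is just the mean of a list. The loop only records one function per device; all of them are executed together later, and the step waits for every result before averaging.

```python
from concurrent.futures import ThreadPoolExecutor


def build_towers(num_devices):
    """Phase 1: the loop only records one 'tower' function per device; nothing runs."""
    towers = []
    for i in range(num_devices):
        def tower(batch, device=i):
            # Stand-in for the per-tower loss; a real tower would run the network.
            return device, sum(batch) / len(batch)
        towers.append(tower)
    return towers


def run_step(towers, batches):
    """Phase 2: run all towers concurrently, then wait for every result (synchronous)."""
    with ThreadPoolExecutor(max_workers=len(towers)) as pool:
        results = list(pool.map(lambda tb: tb[0](tb[1]), zip(towers, batches)))
    losses = [loss for _, loss in results]
    return sum(losses) / len(losses)


towers = build_towers(2)                                # sequential construction
mean_loss = run_step(towers, [[1.0, 3.0], [5.0, 7.0]])  # concurrent execution
```

The final averaging is the synchronisation point described above: it cannot start until every tower has reported its loss.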

Food101 SqueezeNet Caffe2 number of iterations

I am trying to classify the ETH Food-101 dataset using squeezenet in Caffe2. My model is imported from the Model Zoo and I made two types of modifications to the model:
1) Changing the dimensions of the last layer to have 101 outputs
2) The images from the database are in NHWC form and I just flipped the dimensions of the weights to match. (I plan on changing this)
The Food101 dataset has 75,000 images for training and I am currently using a batch size of 128 and a starting learning rate of -0.01 with a gamma of 0.999 and stepsize of 1. What I noticed is that for the first 2000 iterations of the network the accuracy hovered around 1/128 and this took an hour or so to complete.
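It may help to see what these settings imply for the learning rate. Assuming the values describe a "step" policy in which the rate is base_lr * gamma ** floor(iteration / stepsize) (the policy name is an assumption; the question only gives gamma and stepsize), the rate has shrunk to roughly 13.5% of its starting magnitude after 2000 iterations. With a batch size of 128, those 2000 iterations cover 256,000 images, or about 3.4 passes over the 75,000 training images. step_lr below is a hypothetical helper for the arithmetic only.

```python
def step_lr(base_lr, gamma, stepsize, iteration):
    """Learning rate under a 'step' policy: base_lr * gamma ** floor(iteration / stepsize)."""
    return base_lr * gamma ** (iteration // stepsize)


base_lr, gamma, stepsize = -0.01, 0.999, 1  # values quoted in the question
for it in (0, 500, 1000, 2000):
    print(it, step_lr(base_lr, gamma, stepsize, it))

lr_2000 = step_lr(base_lr, gamma, stepsize, 2000)
```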
I added all the weights to the model.params so they can get updated during gradient descent (except for data) and reinitialized all weights as Xavier and biases to constant. I would expect the accuracy to grow fairly quickly in the first hundred to thousand iterations and then tail off as the number of iterations grows. In my case, the learning is staying constant around 0.
When I look at the gradient file I find that the average is on the order of 10^-6 with a standard deviation of 10^-7. This explains the slow learning, but I haven't been able to get the gradient to start much higher.
These are the gradient statistics for the first convolution after a few iterations
Min           Max          Avg          Sdev
-1.69821e-05  2.10922e-05  1.52149e-06  5.7707e-06
-1.60263e-05  2.01478e-05  1.49323e-06  5.41754e-06
-1.62501e-05  1.97764e-05  1.49046e-06  5.2904e-06
-1.64293e-05  1.90508e-05  1.45681e-06  5.22742e-06
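For reference, statistics of this kind can be computed from a flattened gradient as in the sketch below. gradient_stats is a hypothetical helper and the input values are made up for illustration; they are not taken from the question, and the question does not say whether its Sdev column is a population or sample standard deviation (the population form is assumed here).

```python
import statistics


def gradient_stats(values):
    """Return (min, max, mean, population standard deviation) of a flat list of gradients."""
    return (min(values), max(values),
            statistics.mean(values), statistics.pstdev(values))


grad = [-2e-05, 1e-05, 0.0, 2e-05, -1e-05]  # made-up values, not from the question
g_min, g_max, g_avg, g_sdev = gradient_stats(grad)
```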
Here are the core parts of my code:
#init_path is path to init_net protobuf
#pred_path is path to pred_net protobuf
def main(init_path, pred_path):
    ws.ResetWorkspace()
    data_folder = '/home/myhome/food101/'
    #some debug code here
    arg_scope = {"order":"NCHW"}
    train_model = model_helper.ModelHelper(name="food101_train", arg_scope=arg_scope)
    if not debug:
        data, label = AddInput(
            train_model, batch_size=128,
            db=os.path.join(data_folder, 'food101-train-nchw-leveldb'),
            db_type='leveldb')
    init_net_def, pred_net_def = update_squeeze_net(init_path, pred_path)
    #print str(init_net_def)
    train_model.param_init_net.AppendNet(core.Net(init_net_def))
    train_model.net.AppendNet(core.Net(pred_net_def))
    ws.RunNetOnce(train_model.param_init_net)
    add_params(train_model, init_net_def)
    AddTrainingOperators(train_model, 'softmaxout', 'label')
    AddBookkeepingOperators(train_model)
    ws.RunNetOnce(train_model.param_init_net)
    if debug:
        ws.FeedBlob('data', data)
        ws.FeedBlob('label', label)
    ws.CreateNet(train_model.net)
    total_iters = 10000
    accuracy = np.zeros(total_iters)
    loss = np.zeros(total_iters)
    # Now, we will manually run the network for total_iters iterations.
    for i in range(total_iters):
        #try:
        conv1_w = ws.FetchBlob('conv1_w')
        print conv1_w[0][0]
        ws.RunNet("food101_train")
        #except RuntimeError:
        # print ws.FetchBlob('conv1').shape
        # print ws.FetchBlob('pool1').shape
        # print ws.FetchBlob('fire2/squeeze1x1_w').shape
        # print ws.FetchBlob('fire2/squeeze1x1_b').shape
        #softmax = ws.FetchBlob('softmaxout')
        #print softmax[i]
        #print softmax[i][0][0]
        #print softmax[i][0][:5]
        #print softmax[64*i]
        accuracy[i] = ws.FetchBlob('accuracy')
        loss[i] = ws.FetchBlob('loss')
        print accuracy[i], loss[i]
My add_params function initializes the weights as follows
#ops allows me to only initialize the weights of specific ops because i initially was going to do last layer training
def add_params(model, init_net_def, ops=[]):
    def add_param(op):
        for output in op.output:
            if "_w" in output:
                weight_shape = []
                for arg in op.arg:
                    if arg.name == 'shape':
                        weight_shape = arg.ints
                weight_initializer = initializers.update_initializer(
                    None,
                    None,
                    ("XavierFill", {}))
                model.create_param(
                    param_name=output,
                    shape=weight_shape,
                    initializer=weight_initializer,
                    tags=ParameterTags.WEIGHT)
            elif "_b" in output:
                weight_shape = []
                for arg in op.arg:
                    if arg.name == 'shape':
                        weight_shape = arg.ints
                weight_initializer = initializers.update_initializer(
                    None,
                    None,
                    ("ConstantFill", {}))
                model.create_param(
                    param_name=output,
                    shape=weight_shape,
                    initializer=weight_initializer,
I find that my loss function fluctuates when I use the full training set, but if I use just one batch and iterate over it several times, the loss goes down, though very slowly.

While SqueezeNet has 50x fewer parameters than AlexNet, it is still a very large network. The original paper does not mention a training time, but the SqueezeNet-based SQ required 22 hours to train using two Titan X graphics cards - and that was with some of the weights pre-trained! I haven't gone over your code in detail, but what you describe is expected behavior - your network is able to learn on the single batch, just not as quickly as you expected.
I suggest reusing as many of the weights as possible instead of reinitializing them, just as the creators of SQ did. This is known as transfer learning, and it works because many of the lower-level features (lines, curves, basic shapes) in an image are the same regardless of the image's content, and reusing the weights for these layers saves the network from having to re-learn them from scratch.
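In practice, reusing weights comes down to deciding which parameters keep their pretrained values and which are reinitialized (here, only the resized final layer). The sketch below shows that selection step in a framework-free way. split_params is a hypothetical helper, the string values are placeholders for real weight tensors, and "conv10" as the name of the final layer is an assumption; check the blob names in your own model.

```python
def split_params(pretrained, reinit_prefixes):
    """Partition parameters into those kept from the pretrained model and those to reinitialize."""
    keep, reinit = {}, []
    for name, value in pretrained.items():
        if any(name.startswith(prefix) for prefix in reinit_prefixes):
            reinit.append(name)
        else:
            keep[name] = value
    return keep, reinit


# Hypothetical blob names and placeholder values; a real model holds weight tensors.
pretrained = {"conv1_w": "w1", "conv1_b": "b1",
              "fire2/squeeze1x1_w": "w2", "conv10_w": "w10", "conv10_b": "b10"}
keep, reinit = split_params(pretrained, reinit_prefixes=("conv10",))
```

Only the names in reinit would then go through an initializer such as XavierFill; everything in keep is loaded as-is.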

When using TFRecord, how can I run intermediate validation check? (a better way?)

Let's say I defined a network Net and the example code below runs well.
# ... input processing using TFRecord ... # reading from TFRecord
x, y = tf.train.batch([image, label]) # encode batch
net = Net(x,y) # connect to network
# ... initialize and session ...
for iteration:
    loss, _ = sess.run([net.loss, net.train_op])
The Net does not have a tf.placeholder, since its input is provided by tensors from the TFRecord provider. What if I would like to run the validation set as well, e.g., every 500 steps? How can I switch the input flow?
x, y = tf.train.batch([image, label], ...) # training set
vx, vy = tf.train.batch([vimage, vlabel], ...) # validation set
net = Net(x,y)
for iteration:
    loss, _ = sess.run([net.loss, net.train_op])
    if step % 500 == 0:
        # graph is already defined from input to loss.
        # how can I run net.loss with vx and vy??
The only thing I can imagine is modifying Net to have placeholders and running it every time like
sess.run([...], feed_dict = {Net.x:sess.run(x), Net.y:sess.run(y)})
sess.run([...], feed_dict = {Net.x:sess.run(vx), Net.y:sess.run(vy)})
However, it seems to me that this loses the benefits of using TFRecord (e.g., full TF integration). In the middle of the computation flow, I have to stop the flow, run sess.run, and continue (doesn't this lower the speed by forcing a round trip through the CPU in the middle?)
I am wondering,
1) if there is a better way.
2) if my solution is not as bad as I imagine.
Thanks in advance.

There is a better way (than placeholders). I ran into this issue with the CIFAR10 tutorial in TensorFlow, which I adjusted to check accuracy on the test set simultaneous to the training every 500 batches or so. This is where sharing variables comes in handy.
x, y = tf.train.batch([image, label], ...) # training set
vx, vy = tf.train.batch([vimage, vlabel], ...) # validation set
with tf.variable_scope("model") as scope:
    net = Net(x,y)
    scope.reuse_variables()
    vnet = Net(vx,vy)
for iteration:
    loss, _ = sess.run([net.loss, net.train_op])
    if step % 500 == 0:
        loss, acc = sess.run([vnet.loss, vnet.accuracy])
By setting the scope to reuse variables on the second call to Net(), you will use the same variables and values created in the first call but with a different set of inputs. Just make sure that vimage and vlabel aren't reusing tensors from image and label (which could possibly be solved by creating their own variable scopes).
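The effect of sharing variables between net and vnet can be shown with a toy model in plain Python. This is only an analogy under stated assumptions: the Net class below is a hypothetical one-weight linear model, and a shared dict stands in for the reused TensorFlow variables. Training through one instance changes what the other instance computes on its own inputs.

```python
class Net:
    """Toy linear model; `params` is shared by reference, inputs differ per instance."""

    def __init__(self, params, x, y):
        self.params, self.x, self.y = params, x, y

    def loss(self):
        w = self.params["w"]
        return sum((w * xi - yi) ** 2 for xi, yi in zip(self.x, self.y)) / len(self.x)

    def train_step(self, lr=0.1):
        # One step of gradient descent on the mean squared error.
        w = self.params["w"]
        grad = sum(2 * (w * xi - yi) * xi for xi, yi in zip(self.x, self.y)) / len(self.x)
        self.params["w"] = w - lr * grad


params = {"w": 0.0}                        # one set of variables
net = Net(params, [1.0, 2.0], [2.0, 4.0])  # training inputs
vnet = Net(params, [3.0], [6.0])           # validation inputs, same variables

val_before = vnet.loss()
for _ in range(50):
    net.train_step()
val_after = vnet.loss()
```

The validation loss drops even though vnet is never trained directly, because both instances read the same parameters.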
