During training, my GPU utilization is around 40%, and the TensorFlow profiler clearly shows a data-copy operation taking a lot of time (see attached picture). I presume the "MEMCPYHtoD" operation is copying the batch from CPU to GPU and is blocking the GPU from being used. Is there any way to prefetch data to the GPU, or is there another problem that I am not seeing?
Here is the code for dataset:
# Placeholders keep the (potentially large) numpy arrays out of the graph definition.
X_placeholder = tf.placeholder(tf.float32, data.train.X.shape)
y_placeholder = tf.placeholder(tf.float32, data.train.y[label].shape)
dataset = tf.data.Dataset.from_tensor_slices({"X": X_placeholder,
                                              "y": y_placeholder})
dataset = dataset.repeat(1000)
dataset = dataset.batch(1000)
dataset = dataset.prefetch(2)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()
Prefetching to a single GPU:
Consider using a more flexible approach than prefetch_to_device, e.g. explicitly copying to the GPU with tf.data.experimental.copy_to_device(...) and then prefetching. This avoids the restriction that prefetch_to_device must be the last transformation in a pipeline, and lets you incorporate further tricks to optimize Dataset pipeline performance (e.g. overriding the threadpool distribution).
Try out the experimental tf.contrib.data.AUTOTUNE option for prefetching, which allows the tf.data runtime to automatically tune the prefetch buffer sizes based on your system and environment.
At the end, you might end up doing something like this:
dataset = dataset.apply(tf.data.experimental.copy_to_device("/gpu:0"))
dataset = dataset.prefetch(tf.contrib.data.AUTOTUNE)
I believe you can now fix this problem by using prefetch_to_device. Instead of the line:
dataset = dataset.prefetch(2)
do
dataset = dataset.apply(tf.contrib.data.prefetch_to_device('/gpu:0', buffer_size=2))
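Applied to the pipeline in the question, that would look roughly like the sketch below; note that prefetch_to_device has to be the last transformation in the pipeline, and the iterator setup stays the same as before:
dataset = tf.data.Dataset.from_tensor_slices({"X": X_placeholder,
                                              "y": y_placeholder})
dataset = dataset.repeat(1000)
dataset = dataset.batch(1000)
# prefetch_to_device replaces the plain prefetch(2) and must come last
dataset = dataset.apply(tf.contrib.data.prefetch_to_device('/gpu:0', buffer_size=2))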
Related
I wanted to know how data-loading and data transfer between CPU RAM and GPU memory is handled when I have a tf.data.Dataset and use the fit method of Keras.
Is one batch of data transferred at a time, with forward and backward propagation done on that batch, before the next batch is sent from CPU RAM to GPU memory?
I know that in Keras's fit method there is max_queue_size, however, it says that
"Used for generator or keras.utils.Sequence input only"
How does tf.data data loading work under the hood? Does anything change if, instead of using the fit method, I create a custom training loop like here?
Are there links/guides where this is explained in enough detail?
The dataset will be sampled sequentially unless otherwise specified. Since you're in control of the dataset with tf.data.Dataset, you can handle the prefetching yourself, with tf.data.Dataset.prefetch.
Most dataset input pipelines should end with a call to prefetch. This allows later elements to be prepared while the current element is being processed. This often improves latency and throughput, at the cost of using additional memory to store prefetched elements.
>>> dataset = tf.data.Dataset.range(3)
>>> dataset = dataset.prefetch(2)
>>> list(dataset.as_numpy_iterator())
[0, 1, 2]
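In the Keras case, that simply means ending the input pipeline with prefetch and handing the dataset to fit; fit then pulls one batch per step while the pipeline prepares the next batches in the background. A minimal self-contained sketch (TF 2.x, with toy data and a toy model purely for illustration):
import numpy as np
import tensorflow as tf

# Toy data and model, just to make the example runnable.
features = np.random.rand(1000, 20).astype("float32")
labels = np.random.randint(0, 2, size=(1000,))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(1024).batch(32)
# prefetch lets the pipeline prepare the next batches (including the host-side
# work) while the current batch is being trained on.
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)

model.fit(dataset, epochs=2)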
I'm implementing an RL algorithm and using tf.data.Dataset (with prefetch) to feed data to the neural network. However, in order to interact with the environment, I have to explicitly feed data through feed_dict to take an action. I'm wondering whether using feed_dict together with Dataset would impair the speed.
Here's a simplified version of my code
# code related to Dataset
ds = tf.data.Dataset.from_generator(buffer, sample_types, sample_shapes)
ds = ds.prefetch(5)
iterator = ds.make_one_shot_iterator()
samples = iterator.get_next(name='samples')
# pass samples to network
# network training, no feed_dict is needed because of Dataset
sess.run([self.opt_op])
# run the actor network to choose an action at the current state.
# manually feed the current state to samples
# will this impair the performance?
action = sess.run(self.action, feed_dict={samples['state']: state})
There is nothing wrong with mixing Dataset and feed_dict. If the state that you provide to feed_dict is large, it might lead to an underutilized GPU, depending on the size of the data. But that would in no way be related to Dataset being used.
One of the reasons the Dataset API exists is to avoid model starvation and improve GPU utilization during training. Starvation might happen because data is being copied from one location to another: disk to memory, memory to GPU memory, you name it. Dataset tries to start executing bulky IO operations early enough to avoid starving the model when the time comes to process the next batch. So, basically, Dataset tries to reduce the time between batches.
In your case you probably don't lose any performance from using feed_dict. It seems that you interrupt execution for environment interaction anyway (thereby possibly underutilizing the GPU).
If you would like to be sure, time your performance while feeding the actual state with feed_dict, then replace the state with a constant tensor and compare the speeds.
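A rough sketch of that timing comparison, reusing the names from the question (self.action_from_constant_state is a hypothetical second op built on a constant state tensor instead of the fed placeholder):
import time

# Variant 1: feed the real state through feed_dict on every call.
start = time.time()
for _ in range(1000):
    sess.run(self.action, feed_dict={samples['state']: state})
print("feed_dict:", time.time() - start)

# Variant 2: the same action computation, but wired to a constant state
# tensor, so no feeding happens at all.
start = time.time()
for _ in range(1000):
    sess.run(self.action_from_constant_state)
print("constant state:", time.time() - start)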
1. Problem :
I have a tf.data.Dataset that I give to a Keras model (tf.python.keras) with train_on_batch.
My dataset looks like this :
Generate TFRecord path > tf.data.TFRecordDataset > Parse single example > Batch(2) > Map(merge) > Map(normalize) > Map(split to inputs,labels) > Batch(batch_size) > Prefetch(1)
I used RunMetadata to output a Timeline readable with Chrome.
It looks like IteratorGetNext is only run on the CPU and is eating a significant amount of time.
(I can't post images, IteratorGetNext took 617ms, MEMCPYHtoD took 58ms and training took 500ms)
I can't seem to find a way to get IteratorGetNext to run on the GPU, even partially. Currently, CPU is used at 100% and GPU at 40-60% at most.
I would expect something like :
Read from disk > Move from CPU to GPU > Preprocess.
I am currently using only one GPU, but I plan to use more GPUs later so a scalable solution would be perfect !
By the way, I am using tensorflow-gpu 1.13.1 on Windows 10 with CUDA 10.0 and python 3.6.7. I am not using eager mode.
I haven't tried on Ubuntu but it is a possibility.
2. What I tried :
I tried using prefetch_to_device and copy_to_device from tf.data.experimental, in several places in the pipeline.
When using copy_to_device, IteratorGetNext took twice as long. It looked like it was copying to the GPU only to copy back to the CPU, because MEMCPYHtoD was still present after IteratorGetNext.
I tried replacing Keras' train_on_batch with session.run(train_op), but it did not really improve things; the only change I noticed was that some prefetching actually happened, reducing IteratorGetNext time for a few samples (independent of the amount I put in prefetch).
By the way, prefetch(1) or prefetch(tf.data.experimental.AUTOTUNE) did not seem to have any impact.
I tried session.run both with and without copy_to_device.
I also tried to wrap the building of the dataset in with tf.device("/gpu:0").
3. Some code :
dataset = tf.data.Dataset.from_generator(self.random_shard_filepath_generator,
                                         output_types=tf.string,
                                         output_shapes=())
dataset = tf.data.TFRecordDataset(dataset)
dataset = dataset.map(lambda serialized_shard: self.parse_shard(serialized_shard, output_labels))
dataset = dataset.batch(self.shards_per_sample)
dataset = dataset.map(self.join_shards_randomly)
dataset = dataset.map(self.normalize_batch)
dataset = dataset.map(self.split_batch_io)
dataset = dataset.batch(batch_size).prefetch(1)
autoencoder.train_on_batch(dataset)
Finally, I would add that my model may just not be big enough and I could improve the ratio by just making it "bigger", but it does not feel like a great solution.
-- Edit :
I had :
...
dataset = dataset.batch(batch_size).prefetch(1)
autoencoder.train_on_batch(dataset)
Which I changed to :
...
dataset = dataset.batch(batch_size).prefetch(1)
dataset_iterator = dataset.make_initializable_iterator()
dataset_initializer = dataset_iterator.initializer
session.run(dataset_initializer)
x, y = dataset_iterator.get_next()
autoencoder.train_on_batch(x, y)
Thanks to EdoardoG for making me try MultiDeviceIterator, which led me to create an iterator outside of Keras' train_on_batch.
Now IteratorGetNext only takes about 0.05ms where it took previously about 600ms.
As far as I know, Dataset API operations are usually run on CPU, so it's actually normal that you cannot run your input pipeline on the GPU.
Someone has written an iterator which could solve your problem.
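For reference, tf.data.experimental.MultiDeviceIterator (available from TF 1.13) prefetches dataset elements onto one or more GPUs for you; a rough sketch, assuming the dataset, session and autoencoder from the question:
# Reuse the dataset built in the question (ending with .batch(batch_size).prefetch(1)).
multi_device_iterator = tf.data.experimental.MultiDeviceIterator(
    dataset, devices=["/gpu:0"], prefetch_buffer_size=2)
# With no argument, get_next() returns one element per device; take the first GPU's.
x, y = multi_device_iterator.get_next()[0]

session.run(multi_device_iterator.initializer)
autoencoder.train_on_batch(x, y)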
Wrap your NN code in with tf.device('/gpu:0'):, where gpu:0 is the first GPU in your system.
If you want to use multiple GPUs:
for d in ['/device:GPU:2', '/device:GPU:3']:
    with tf.device(d):
        <your code here>
Some helpful guidelines from tensorflow's website
So I'm reading a bit more about moving data from the CPU -> GPU in Tensorflow, and I see that feed_dict is still slow:
https://github.com/tensorflow/tensorflow/issues/2919
The immediate options I see for "moving" Python variables over to the GPU are:
#1. Tensorflow constant
a = tf.constant(data, name='a')
#2. Tensorflow Variable
b = tf.Variable(data, name='b')
#3. Tensorflow placeholder
c = tf.placeholder(dtype=dtype, shape=[x,y,z ...], name='c')
Options #1 and #2 aren't practical for very large dataset variables, since the data effectively gets embedded in the graph and we quickly exceed the 2GB graph limit. That currently makes #3 the better choice for getting large Python variables into Tensorflow, but then you're forced into using feed_dict.
Are there other options for moving Python variables to the GPU besides #1, #2, and #3? I'm referring to using...
with tf.device('/gpu:0'):
    # create tensorflow object(s), whether it's tf.Variable, tf.constant, etc.
If I'm understanding correctly, we can use the input pipeline features to work around this issue? I'm referring to these two here:
https://datascience.stackexchange.com/questions/17559/input-pipeline-for-tensorflow-on-gpu
https://stackoverflow.com/a/38956678/7093236
Is there anything I can do to further enhance the speed of putting everything on the Tensorflow side?
The best way is to use a tensorflow Queue to speed up data transfer.
You can do the following even if you don't have label files:
# data_files and label_files are lists: data file paths and label values.
filename_queue = tf.train.slice_input_producer([data_files, label_files], shuffle=True)
# filename_queue = tf.train.slice_input_producer(data_files, shuffle=True)

# Some steps to decode the files and process them
......
data, label = some_function(filename_queue)

# Define batch size and get batches for processing
image_batch, label_batch = tf.train.shuffle_batch([data, label], batch_size=batch_size,
                                                  capacity=capacity,
                                                  min_after_dequeue=min_after_dequeue,
                                                  num_threads=num_threads)

# Remember to start the queue runners inside your session before pulling batches:
# tf.train.start_queue_runners(sess)
The Dataset API is the future-proof way of moving data to the GPU. All reasonable optimizations, like those explained in the tensorflow performance guide, will eventually be available there.
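For the large-Python-variable case in the question, that usually means the placeholder-plus-initializable-iterator pattern sketched below (TF 1.x, with an illustrative toy array): the data is fed exactly once, at iterator initialization, so it never gets baked into the graph and the 2GB limit doesn't apply, while prefetch overlaps preparing the next batch with training.
import numpy as np
import tensorflow as tf

# Hypothetical large array living in host memory (too big to embed in the graph).
big_numpy_array = np.random.rand(100000, 256).astype(np.float32)

data_placeholder = tf.placeholder(tf.float32, big_numpy_array.shape)
dataset = tf.data.Dataset.from_tensor_slices(data_placeholder)
dataset = dataset.shuffle(10000).batch(128)
# The copy_to_device / prefetch_to_device transformations discussed earlier
# could be applied here as well, to stage batches directly on the GPU.
dataset = dataset.prefetch(2)

iterator = dataset.make_initializable_iterator()
next_batch = iterator.get_next()

with tf.Session() as sess:
    # The array is fed only once, at initialization, not on every training step.
    sess.run(iterator.initializer, feed_dict={data_placeholder: big_numpy_array})
    first_batch = sess.run(next_batch)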
I want to use Keras in a real-time training and prediction setting. In my scenario I get real-time data via MQTT that should be used to train an (LSTM) neural network and/or to apply the network to get a prediction.
I am using the Tensorflow backend with GPU support and a fairly powerful GPU, but in my scenario Keras does not really benefit from GPU acceleration. (I did some performance tests using the examples in the keras repository to make sure that GPU acceleration works in general.) In my first approach, I used the model.train_on_batch(...) method to train the network on each item coming in via MQTT:
model = load_model()

def on_message(msg):
    """
    Method called by MQTT client each time new data comes in
    """
    if msg.topic == 'my/topic':
        X, Y = prepare_data(msg.payload)
        prediction = model.predict(X)
        loss = model.train_on_batch(X, Y)
        send_to_visualization_tool(prediction, loss)
One training step in this setting takes about 200 ms. However, when I introduce a buffer, e.g. buffering 100 data points, the training time for the whole batch increases only slightly. This suggests that the setup time for each batch-training call has a huge overhead. I also noticed that with batches of size 1, CPU consumption is quite high, while the GPU is hardly used at all.
As an alternative, I introduced a synchronized queue that the MQTT client pushes data into whenever it comes in; the neural network then consumes, as one batch, all the data that arrived while the previous batch was being processed:
train_data_queue = Queue.Queue()

# MQTT client running in separate thread
def on_message(msg):
    train_data_queue.put(msg.payload)

model = load_model()

while True:
    train_data_batch = dequeue_all(train_data_queue)  # dequeue all items from queue,
                                                      # or block until at least one
                                                      # item is present
    X, Y = prepare_data(train_data_batch)
    predictions = model.predict_on_batch(X)
    losses = model.train_on_batch(X, Y)
    send_to_visualization_tool(predictions, losses)
This approach works okay, but it would be nice if I could get rid of the additional complexity of synchronized queues and multithreading, i.e. get the first approach to work.
My question therefore is: is there a way to reduce the overhead of training on single batches, e.g. by reimplementing the model in pure tensorflow?
Or can you think of a better way to do real-time training with Keras?
The performance of keras should be broadly similar to the performance of raw tensorflow, so I do not recommend rewriting your model.
Indeed, modern hardware usually takes about the same time to train on a single example as it does on a batch of examples, which is why we spend so much effort batching things up. You can get rid of the complexity of synchronized queues by using tf.contrib.batching.batch_function, but you'll still need to feed it from many threads if you want to get the extra throughput.
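To illustrate the idea only (not a drop-in for your model): batch_function is used as a decorator, and concurrent calls from different threads get fused into one batched execution. A rough sketch, assuming the TF 1.x contrib signature with num_batch_threads, max_batch_size and batch_timeout_micros, and with model standing in for your network:
# Concurrent calls from many client threads are transparently grouped into
# a single batched call of the decorated function.
@tf.contrib.batching.batch_function(num_batch_threads=1,
                                    max_batch_size=100,
                                    batch_timeout_micros=5000)
def predict(x):
    # Build the forward pass over the batched input tensor.
    return model(x)

# Each caller passes a single example; batch_function batches them under the hood.
prediction = predict(single_example)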