Tensorflow Dataset API not using GPU - python

1. Problem :
I have a tf.data.Dataset that I give to a Keras model (tf.python.keras) with train_on_batch.
My dataset looks like this :
Generate TFRecord path > tf.data.TFRecordDataset > Parse single example > Batch(2) > Map(merge) > Map(normalize) > Map(split to inputs,labels) > Batch(batch_size) > Prefetch(1)
I used RunMetadata to output a Timeline readable with Chrome.
Looks like IteratorGetNext is only run on the CPU and is eating a significant amount of time.
(I can't post images, IteratorGetNext took 617ms, MEMCPYHtoD took 58ms and training took 500ms)
I can't seem to find a way to get IteratorGetNext to run on the GPU, even partially. Currently, CPU is used at 100% and GPU at 40-60% at most.
I would expect something like :
Read from disk > Move from CPU to GPU > Preprocess.
I am currently using only one GPU, but I plan to use more GPUs later so a scalable solution would be perfect !
By the way, I am using tensorflow-gpu 1.13.1 on Windows 10 with CUDA 10.0 and python 3.6.7. I am not using eager mode.
I haven't tried on Ubuntu but it is a possibility.
2. What I tried :
I tried using prefetch_to_device and copy_to_device from tf.data.experimental, in several places in the pipeline.
When using copy_to_device, IteratorGetNext took twice as long. It looked like it was copying to the GPU only to copy back to the CPU, because the MEMCPYHtoD was still present after IteratorGetNext.
I tried replacing Keras' train_on_batch with session.run(train_op), but it did not really improve; the only change I noticed was that some prefetching actually happened, reducing IteratorGetNext time for a few samples (independently of the amount I put in prefetch).
By the way, prefetch(1) or prefetch(tf.data.experimental.AUTOTUNE) did not seem to have any impact.
I tried session.run both with and without copy_to_device.
I also tried to put the building of the dataset inside a with tf.device("/gpu:0") block.
3. Some code :
dataset = tf.data.Dataset.from_generator(self.random_shard_filepath_generator,
                                         output_types=tf.string,
                                         output_shapes=())
dataset = tf.data.TFRecordDataset(dataset)
dataset = dataset.map(lambda serialized_shard: self.parse_shard(serialized_shard, output_labels))
dataset = dataset.batch(self.shards_per_sample)
dataset = dataset.map(self.join_shards_randomly)
dataset = dataset.map(self.normalize_batch)
dataset = dataset.map(self.split_batch_io)
dataset = dataset.batch(batch_size).prefetch(1)
autoencoder.train_on_batch(dataset)
Finally, I would add that my model may just not be big enough and I could improve the ratio by just making it "bigger", but it does not feel like a great solution.
-- Edit :
I had :
...
dataset = dataset.batch(batch_size).prefetch(1)
autoencoder.train_on_batch(dataset)
Which I changed to :
...
dataset = dataset.batch(batch_size).prefetch(1)
dataset_iterator = dataset.make_initializable_iterator()
dataset_initializer = dataset_iterator.initializer
session.run(dataset_initializer)
x, y = dataset_iterator.get_next()
autoencoder.train_on_batch(x, y)
Thanks to EdoardoG for making me try MultiDeviceIterator, which led me to create an Iterator outside of Keras' train_on_batch.
Now IteratorGetNext only takes about 0.05ms, where it previously took about 600ms.

As far as I know, Dataset API operations are usually run on CPU, so it's actually normal that you cannot run your input pipeline on the GPU.
Someone has written an iterator which could solve your problem.
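The "iterator" referred to here is presumably tf.data.experimental.MultiDeviceIterator, which the asker mentions in their edit. A minimal sketch of how it is typically wired up in TF 1.x (the argument and property names below are from memory and should be double-checked against the 1.13 docs):
import tensorflow as tf

# `dataset` is assumed to be the already-built tf.data.Dataset from the question.
# MultiDeviceIterator prefetches elements from the host onto the listed devices.
multi_iterator = tf.data.experimental.MultiDeviceIterator(dataset, devices=['/gpu:0'])
next_elements = multi_iterator.get_next()  # one element per device in `devices`

with tf.Session() as sess:
    sess.run(multi_iterator.initializer)
    # In practice the tensors in `next_elements` would be wired into the model
    # rather than fetched back to the host like this.
    first_batch = sess.run(next_elements[0])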

Wrap your NN code in with tf.device('/gpu:0'):, where gpu:0 is the first GPU in your system.
If you want to use multiple GPUs:
for d in ['/device:GPU:2', '/device:GPU:3']:
    with tf.device(d):
        <your code here>
Some helpful guidelines from tensorflow's website
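For concreteness, a minimal hedged sketch (the layer sizes are made up) of building a small tf.keras model inside an explicit device scope. Note that in TF 1.x the heavy math ops are usually placed on the GPU automatically, while tf.data input-pipeline ops stay on the CPU:
import tensorflow as tf

with tf.device('/gpu:0'):
    # Illustrative model only; replace with your own architecture
    inputs = tf.keras.Input(shape=(128,))
    hidden = tf.keras.layers.Dense(64, activation='relu')(inputs)
    outputs = tf.keras.layers.Dense(10, activation='softmax')(hidden)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer='adam', loss='categorical_crossentropy')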

Related

How to enable Keras fit() multiprocessing properly?

When I run fit() with use_multiprocessing=True I always get a deadlock and the following warning:
WARNING:tensorflow:multiprocessing can interact badly with TensorFlow, causing nondeterministic deadlocks. For high performance data pipelines tf.data is recommended.
How do I run it properly?
Since it says "tf.data", I wonder if transforming my data into this format will make multiprocessing work. What specifically is meant, and how do I convert my data?
My dataset (reproducible):
import numpy as np

Input_shape, labels = (20, 4), 6
LEN_X, LEN_Y = 20000, 3000
train_X, train_Y = np.asarray([np.random.random(Input_shape) for x in range(LEN_X)]), np.random.random((LEN_X, labels))
validation_X, validation_Y = np.asarray([np.random.random(Input_shape) for x in range(LEN_Y)]), np.random.random((LEN_Y, labels))
sampleW = np.random.random((LEN_X, 1))
Multiprocessing doesn't accelerate the model itself; it only accelerates the data loading, and data-loading delay is not a problem when all your data is already in memory.
You could still use multiprocessing, but you must make sure that the underlying dataset is thread-safe and you have to craft the data pipeline carefully (a minimal Sequence sketch follows the list below); that is quite time-consuming. So, instead, I suggest you speed up the model itself.
For that, you should look into:
changing all activations except the last layer's to ReLU.
tweaking batch size. (optimal number depends on your hardware, and is almost always less than or equal to 32)
using Batch normalization to speed up convergence.
using higher learning rate (be careful not to overdo this step).
if you need faster convolutions, consider using Kaggle notebooks or vast.ai for GPU-enabled computations.
last but not least, try using a simpler, smaller model.
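If you still want to try use_multiprocessing, the usual prerequisite is to serve your data through a keras.utils.Sequence, which is safe for multi-worker loading. A minimal sketch, assuming tf.keras and the in-memory arrays from the question (keep in mind the point above: with everything already in memory this will not speed up the model itself):
import numpy as np
from tensorflow.keras.utils import Sequence

class ArraySequence(Sequence):
    """Batch provider that is safe to use with workers / use_multiprocessing."""
    def __init__(self, x, y, batch_size=32):
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        # Number of batches per epoch
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, idx):
        # Return one batch by index
        s = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.x[s], self.y[s]

# model.fit(ArraySequence(train_X, train_Y),
#           validation_data=ArraySequence(validation_X, validation_Y),
#           workers=4, use_multiprocessing=True)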
Comment down here if you have any additional questions.
Cheers.

Keras with Tensorflow: Use memory as it's needed [ResourceExhaustedError]

So I'm trying to train my CNN with multiple datasets, and it seems that when I add enough data (such as when I add multiple sets as one, or when I try to add the one that has over a million samples) it throws a ResourceExhaustedError.
As for the instructions here, I tried adding
from keras.backend.tensorflow_backend import set_session
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.3
set_session(tf.Session(config=config))
to my code, but this doesn't seem to make a difference.
I see 0.3 after printing out config.gpu_options.per_process_gpu_memory_fraction, so that part seems to be ok.
I even threw in config.gpu_options.allow_growth = True for good measure, but it doesn't seem to do anything except attempt to use all the memory at once, only to find that it isn't enough.
The computer I'm trying to use to train this CNN has 4 GTX 1080 Tis with 12 GB of dedicated memory each.
EDIT: I'm sorry for not specifying how I was loading the data; I honestly didn't realise there was more than one way. When I was learning, the examples always loaded datasets that were already built in, and it took me a while to realise how to load a self-supplied dataset.
The way I'm doing it is that I'm creating two numpy arrays. One has the path of each image and the other has the corresponding label. Here's the most basic example of this:
data_dir = "folder_name"
# There is a folder for every form and in that folder is every line of that form
for filename in glob.glob(os.path.join(data_dir, '*', '*')):
# the format for file names are: "{author id}-{form id}-{line number}.png"
# filename is a path to the file so .split('\\')[-1] get's the raw file name without the path and .split('-')[0] get's the author id
authors.append(filename.split('\\')[-1].split('-')[0])
files.append(filename)
#keras requires numpy arrays
img_files = np.asarray(files)
img_targets = np.asarray(authors)
Are you sure you're not using a giant batch_size?
"Adding data": honestly I don't know what that means and if you could please describe exactly, with code, what you're doing here, it would be of help.
The number of samples should not cause any problems with GPU memory at all. What does cause a problem is a big batch_size.
Loading a huge dataset could cause a CPU RAM problem, not related to keras/tensorflow: a problem with a numpy array that is too big. (You can test this by simply loading your data without creating any models.)
If that is your problem, you should use a generator to load batches gradually. Again, since there is absolutely no code in your question, we can't help much.
But these are two forms of simply creating a generator for images:
Use the existing ImageDataGenerator and its flow_from_directory() method, explained here
Create your own coded generator, which can be:
A loop with yield
A class derived from keras.utils.Sequence
A quick example of a loop generator:
def imageBatchGenerator(imageFiles, imageLabels, batch_size):
    while True:
        batches = len(imageFiles) // batch_size
        if len(imageFiles) % batch_size > 0:
            batches += 1

        for b in range(batches):
            start = b * batch_size
            end = (b + 1) * batch_size
            images = loadTheseImagesIntoNumpy(imageFiles[start:end])
            labels = imageLabels[start:end]
            yield images, labels
Warning: even with generators, you must make sure your batch size is not too big!
Using it:
model.fit_generator(imageBatchGenerator(files,labels,batchSize), steps_per_epoch = theNumberOfBatches, epochs= ....)
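The loadTheseImagesIntoNumpy call in the generator above is a placeholder; one possible, purely illustrative implementation (assuming fixed-size grayscale images and Pillow installed) could be:
import numpy as np
from PIL import Image

def loadTheseImagesIntoNumpy(imageFiles, target_size=(128, 128)):
    # Read each file, convert to grayscale, resize to a fixed shape and scale to [0, 1]
    images = [np.asarray(Image.open(f).convert('L').resize(target_size), dtype='float32') / 255.0
              for f in imageFiles]
    return np.stack(images, axis=0)[..., np.newaxis]  # add a channels dimension for Keras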
Dividing your model among GPUs
You should be able to decide which layers are processed by which GPU; this "could" probably optimize your RAM usage.
Example, when creating the model:
with tf.device('/gpu:0'):
    createLayersThatGoIntoGPU0

with tf.device('/gpu:1'):
    createLayersThatGoIntoGPU1

# you will probably need to go back to a previous GPU, as you must define your layers in a proper sequence
with tf.device('/gpu:0'):
    createMoreLayersForGPU0

# and so on
I'm not sure whether this would be better or not, but it's worth trying too.
See more details here: https://keras.io/getting-started/faq/#how-can-i-run-a-keras-model-on-multiple-gpus
The ResourceExhaustedError is raised because you're trying to allocate more memory than is available in your GPUs or main memory. The memory allocation is approximately equal to your network footprint (to estimate this, save a checkpoint and look at the file size) plus your batch size multiplied by the size of a single element of your dataset.
It's difficult to answer your question directly without more information about your setup, but one element of this question caught my attention: you said that you get the error when you "add enough data" or "use a big enough dataset." That's odd. Notice that the size of your dataset is not included in the calculation of the memory allocation footprint, so the size of the dataset shouldn't matter. Since it does, that seems to imply that you are somehow attempting to load your entire dataset into GPU memory or main memory. If you're doing this, that's the origin of your problem.
To fix it, use the TensorFlow Dataset API. Using a Dataset sidesteps the limited memory resources by hiding the data behind an Iterator that only yields batches when called. Alternatively, you could use the older feed_dict and QueueRunner data-feeding structure, but I don't recommend it. You can find some examples of this here.
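As an illustration of that suggestion, a rough sketch (TF 1.x style, assuming the img_files and img_targets arrays from the question; the image size and batch size are made up) of streaming images from disk with the Dataset API:
import tensorflow as tf

def _load_example(path, label):
    # Decode one image at a time instead of holding the whole dataset in memory
    image = tf.image.decode_png(tf.read_file(path), channels=1)
    image = tf.image.resize_images(image, [128, 128]) / 255.0
    # Note: `label` here is still the raw author-id string and would need encoding
    return image, label

dataset = (tf.data.Dataset.from_tensor_slices((img_files, img_targets))
           .shuffle(1000)
           .map(_load_example)
           .batch(32)
           .prefetch(1))
images, labels = dataset.make_one_shot_iterator().get_next()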
If you are already using the Dataset API, you'll need to post more of your code as an edit to your question for us to help you.
There is no setting that magically allows you more memory than your GPU has. It looks to me like your inputs are just too big to fit in the GPU RAM (along with all the required state and gradients).
You should use config.gpu_options.allow_growth = True, not in order to get more memory, but to get an idea of how much memory you need per input length. Start with a small length, check with nvidia-smi how much RAM your GPU takes, and then increase the length. Do that again and again until you understand what the maximal length of inputs (batch size) is that your GPU can hold.

Keras model predict iteration getting slower.

Hi, I have a problem with Keras on Python 3.6.
My environment is Keras with Python, CPU only.
The problem is that when I repeatedly call predict on the same Keras model for different inputs, it gets slower and slower.
My code is as simple as this:
for i in range(100):
    model.predict(x)
The first run is fast, maybe 2 seconds, but the second run takes 3 seconds and the third takes 5 seconds... it keeps getting slower and slower even if I use the same input.
How can I keep repeated predict calls on a Keras model fast? I don't want any slowdown; it will be very critical.
How can I fix it??
Try using the __call__ method directly. The documentation of the predict method states the following:
For small numbers of inputs that fit in one batch, directly use __call__() for faster execution, e.g., model(x).
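A minimal sketch of the difference (assuming a tf.keras model in TF 2.x and a single small batch x; the input shape is made up):
import numpy as np

x = np.random.random((1, 10)).astype('float32')  # one small input batch

for i in range(100):
    # y = model.predict(x)        # carries per-call overhead for tiny inputs
    y = model(x, training=False)  # call the model directly, as the docs suggest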
I see the performance is critical in this case. So, if that doesn't help, you could use OpenVINO, which is optimized for Intel hardware but should work with any CPU. Your performance should be much better than using Keras directly.
It's rather straightforward to convert the Keras model to OpenVINO. The full tutorial on how to do it can be found here. Some snippets below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Save your model as SavedModel
OpenVINO is not able to convert an HDF5 model, so you have to save it as a SavedModel first.
import tensorflow as tf
from custom_layer import CustomLayer
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})
tf.saved_model.save(model, 'model')
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes from OpenVINO Development Package. It converts the Tensorflow model to IR, which is a default format for OpenVINO. You can also try the precision of FP16, which should give you better performance without a significant accuracy drop (change data_type). Run in the command line:
mo --saved_model_dir "model" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g. CPU or GPU (integrated into your CPU, like Intel HD Graphics). If you don't know what the best choice for you is, use AUTO.
from openvino.runtime import Core

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.
If your model runs fit/predict in batches, different samples in the same batch can take slightly different times over the course of the iteration, and as you try again and again with more and more groups of inputs, the model's prediction time will get longer and longer.

GPU under utilization using tensorflow dataset

During training of my data, my GPU utilization is around 40%, and I clearly see that there is a data-copy operation that's using a lot of time, based on the TensorFlow profiler (see attached picture). I presume that the "MEMCPYHtoD" operation is copying the batch from CPU to GPU and is blocking the GPU from being used. Is there any way to prefetch data to the GPU? Or are there other problems that I am not seeing?
Here is the code for dataset:
X_placeholder = tf.placeholder(tf.float32, data.train.X.shape)
y_placeholder = tf.placeholder(tf.float32, data.train.y[label].shape)
dataset = tf.data.Dataset.from_tensor_slices({"X": X_placeholder,
                                              "y": y_placeholder})
dataset = dataset.repeat(1000)
dataset = dataset.batch(1000)
dataset = dataset.prefetch(2)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()
Prefetching to a single GPU:
Consider using a more flexible approach than prefetch_to_device, e.g. by explicitly copying to the GPU with tf.data.experimental.copy_to_device(...) and then prefetching. This avoids the restriction that prefetch_to_device must be the last transformation in a pipeline, and allows you to incorporate further tricks to optimize Dataset pipeline performance (e.g. by overriding the threadpool distribution).
Try out the experimental tf.contrib.data.AUTOTUNE option for prefetching, which allows the tf.data runtime to automatically tune the prefetch buffer sizes based on your system and environment.
At the end, you might end up doing something like this:
dataset = dataset.apply(tf.data.experimental.copy_to_device("/gpu:0"))
dataset = dataset.prefetch(tf.contrib.data.AUTOTUNE)
I believe you can now fix this problem by using prefetch_to_device. Instead of the line:
dataset = dataset.prefetch(2)
do
dataset = dataset.apply(tf.contrib.data.prefetch_to_device('/gpu:0', buffer_size=2))

How to train my neural network faster by running CPU and GPU in parallel

I'm trying to train a (pretty big) neural network using a GPU. The network is written in PyTorch. I use Python 3.6.3 running on Ubuntu 16.04. Currently, the code is running, but it's taking about twice as long as it should, because my data-grabbing process using the CPU runs in series with the training process using the GPU. Essentially, I grab a mini-batch from file using a mini-batch generator, send that mini-batch to the GPU and then train the network on it. I've timed the two processes (grabbing a mini-batch and training on that mini-batch), and they take a similar amount of time (both around 200ms).
I'd like to do something similar to Keras' fit_generator method, which runs the data-grabbing in parallel with the training (it creates a queue of mini-batches that can be sent to the GPU when the GPU wants to train on a mini-batch). What is the best way to do that? For concreteness, my data generator code and training code run something like this (pseudocode):
# This generator opens a file, grabs and yields a mini-batch
def data_gen(PATH, batch_size=32):
    with h5py.File(PATH, 'r') as f:
        for batch_slice in batch_slices:  # pseudocode: iterate over mini-batch index slices
            X = f['X'][batch_slice]
            Y = f['Y'][batch_slice]
            yield (X, Y)

for epoch in range(epochs):
    for data in data_gen(PATH):
        mini_X, mini_Y = data
        mini_X = autograd.Variable(torch.Tensor(mini_X))
        mini_Y = autograd.Variable(torch.Tensor(mini_Y))
        optimizer.zero_grad()  # reset gradients each step
        out = net(mini_X)
        loss = F.binary_cross_entropy(out, mini_Y)
        loss.backward()
        optimizer.step()
Something like that. As you can see, I use data_gen as an actual generator for the for-loop, so it's being run sequentially with the training. I would like to run it in parallel and have it generate a queue of mini-batches which I can then feed to my network.
Currently, it takes more than 5 hours to run one epoch; I think with a parallelized version of this, I could get that down to 3 hours or less. I looked into multiprocessing in Python, but the explanation in the official documentation was a bit dense for me, since I have only limited prior experience in parallel computing. If there are some resources I could take a look at, pointing me towards them would be very helpful too! Thanks.
You will need to use threads for the data generation. The idea is to let the CPU handle the data generation (usually loading) while your GPU does the training. That being said, it is not the CPU that will slow things down; it is the constant reading and writing of files. If you are using a dataset, make sure the files are copied or extracted into contiguous blocks on your file system. If your files are fragmented across your hard drive, loading them will be a bottleneck regardless of the multi-threading mechanism you are using. With SSD drives it is not noticeable.
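As a sketch of that idea, the question's data_gen can be wrapped in a background thread that fills a bounded queue while the GPU trains on the previous batch (in current PyTorch, torch.utils.data.DataLoader with num_workers achieves the same overlap out of the box):
import queue
import threading

def threaded_batches(generator, max_prefetch=4):
    # Run `generator` in a background thread and yield its items from a bounded queue
    q = queue.Queue(maxsize=max_prefetch)
    sentinel = object()  # marks the end of the generator

    def worker():
        for item in generator:
            q.put(item)  # blocks when the queue is full, bounding memory use
        q.put(sentinel)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            break
        yield item

# Usage: mini-batches are read from the HDF5 file while the GPU trains on the previous one.
# for mini_X, mini_Y in threaded_batches(data_gen(PATH)):
#     ...train exactly as before...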
