So I've been reading a bit more about moving data from the CPU to the GPU in TensorFlow, and I see that feed_dict is still slow:
https://github.com/tensorflow/tensorflow/issues/2919
The immediate options I see for "moving" Python variables over to the GPU are:
#1. Tensorflow constant
a = tf.constant(data, name='a')
#2. Tensorflow Variable
b = tf.Variable(data, name='b')
#3. Tensorflow placeholder
c = tf.placeholder(dtype=dtype, shape=[x,y,z ...], name='c')
Options #1 and #2 aren't practical for very large dataset variables (since the data is embedded in the graph definition), as we'll quickly exceed the 2GB graph limit. That currently makes #3 the better choice for getting large Python vars into TensorFlow, but then you're forced into using feed_dict.
Are there other options for moving Python variables to the GPU besides #1, #2, and #3? I'm referring to using...
with tf.device('/gpu:0'):
    # create TensorFlow object(s), whether it's tf.Variable, tf.constant, etc.
If I'm understanding correctly, can we use the input pipeline features to work around this issue? I'm referring to these two:
https://datascience.stackexchange.com/questions/17559/input-pipeline-for-tensorflow-on-gpu
https://stackoverflow.com/a/38956678/7093236
Is there anything I can do to further enhance the speed of putting everything on the Tensorflow side?
The best way is to use a TensorFlow queue to speed up data transfer.
You can follow the steps below even if you don't have label files:
# data_files and label_files are lists: data file paths and label values.
filename_queue = tf.train.slice_input_producer([data_files, label_files], shuffle=True)
# filename_queue = tf.train.slice_input_producer(data_files, shuffle=True)
# Some steps to decode the files and process
......
data, label = some_function(filename_queue)
# Define batch size and get batch for processing
# capacity and min_after_dequeue are required arguments of shuffle_batch
image_batch, label_batch = tf.train.shuffle_batch(
    [data, label], batch_size=batch_size, num_threads=num_threads,
    capacity=capacity, min_after_dequeue=min_after_dequeue)
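Note that a queue-based pipeline only starts producing data once the queue runners are launched inside a session. A minimal sketch of that session-side boilerplate (num_steps and train_op are placeholders for your own step count and training op):

import tensorflow as tf

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # launch the threads that fill the queues defined above
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        for step in range(num_steps):  # num_steps: placeholder
            # train_op is assumed to consume image_batch / label_batch directly,
            # so no feed_dict is needed here
            sess.run(train_op)
    finally:
        coord.request_stop()
        coord.join(threads)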
The Dataset API is the future-proof way of moving data to the GPU. All reasonable optimizations, like those explained in the tensorflow performance guide, will eventually be available there.
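As a minimal sketch of that approach for large in-memory NumPy arrays (assuming TF 1.x; the array shapes below are made up), you can feed the arrays once through placeholders when initializing the iterator, so they never end up inside the GraphDef:

import numpy as np
import tensorflow as tf

# hypothetical in-memory arrays; in practice these come from your own loader
features = np.random.rand(10000, 64).astype(np.float32)
labels = np.random.randint(0, 10, size=10000).astype(np.int32)

# placeholders keep the arrays out of the graph definition (avoids the 2GB limit)
features_ph = tf.placeholder(features.dtype, features.shape)
labels_ph = tf.placeholder(labels.dtype, labels.shape)

dataset = tf.data.Dataset.from_tensor_slices((features_ph, labels_ph))
dataset = dataset.shuffle(10000).repeat().batch(128).prefetch(1)

iterator = dataset.make_initializable_iterator()
next_batch = iterator.get_next()

with tf.Session() as sess:
    # the arrays are fed exactly once, at iterator initialization
    sess.run(iterator.initializer,
             feed_dict={features_ph: features, labels_ph: labels})
    x_batch, y_batch = sess.run(next_batch)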
Related
When I was training my model with data loaded by flow_from_directory with TensorFlow, I accidentally deleted a few images from my training set directory, and it soon gave me a warning that it could not find the files.
So it seems like it is actually reading the images during training. But since my dataset is not a large one and my memory is only 40% used, I would like to slightly increase my training speed. Is there a way to tell TensorFlow to prefetch more images into memory before training starts, instead of reading only the images that the current batch needs? Or is there an intentional reason that my memory is not used?
You can change some of the parameters, like batch_size in flow_from_directory, which defaults to 32.
Also, after creating the dataset you can increase the batch size and the number of prefetched batches, e.g. dataset.batch(batch_size).prefetch(1).
If your dataset is small you can cache it using dataset.cache() after loading and preprocessing the data but before shuffling, repeating, batching, and prefetching, so that each instance is read and preprocessed once instead of once per epoch.
You can also check this documentation on optimizing work with tf.data.
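As a rough sketch of that ordering (the file list, labels, and preprocessing function below are placeholders, not your actual pipeline):

import tensorflow as tf

# placeholder inputs: your own image paths and integer labels
file_paths = ["images/img_0001.png", "images/img_0002.png"]
labels = [0, 1]

def load_and_preprocess(path, label):
    image = tf.io.read_file(path)
    image = tf.image.decode_png(image, channels=3)  # assumes PNGs of identical size
    image = tf.image.convert_image_dtype(image, tf.float32)
    return image, label

dataset = tf.data.Dataset.from_tensor_slices((file_paths, labels))
dataset = dataset.map(load_and_preprocess)
dataset = dataset.cache()      # small dataset: keep decoded images in RAM after the first pass
dataset = dataset.shuffle(1000)
dataset = dataset.repeat()     # if you repeat, pass steps_per_epoch to model.fit
dataset = dataset.batch(64)    # larger than the flow_from_directory default of 32
dataset = dataset.prefetch(1)  # prepare the next batch while the current one trains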
So I'm trying to train my CNN with multiple datasets, and it seems that when I add enough data (such as when I combine multiple sets into one, or when I try to add the one that has over a million samples) it throws a ResourceExhaustedError.
As for the instructions here, I tried adding
from keras.backend.tensorflow_backend import set_session
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.3
set_session(tf.Session(config=config))
to my code, but this doesn't seem to make a difference.
I see 0.3 after printing out config.gpu_options.per_process_gpu_memory_fraction, so that part seems to be OK.
I even threw in config.gpu_options.allow_growth = True for good measure, but it doesn't seem to do anything except attempt to use all the memory at once, only to find that it isn't enough.
The computer I'm trying to use to train this CNN has four GTX 1080 Tis with 12 GB of dedicated memory each.
EDIT: I'm sorry for not specifying how I was loading the data; I honestly didn't realise there was more than one way. When I was learning, the examples always loaded datasets that were already built in, and it took me a while to work out how to load a self-supplied dataset.
The way I'm doing it is that I'm creating two numpy arrays. One has the path of each image and the other has the corresponding label. Here's the most basic example of this:
import glob
import os
import numpy as np

data_dir = "folder_name"
authors = []
files = []
# There is a folder for every form and in that folder is every line of that form
for filename in glob.glob(os.path.join(data_dir, '*', '*')):
    # the format for file names is: "{author id}-{form id}-{line number}.png"
    # filename is a path to the file, so .split('\\')[-1] gets the raw file name
    # without the path and .split('-')[0] gets the author id
    authors.append(filename.split('\\')[-1].split('-')[0])
    files.append(filename)

# keras requires numpy arrays
img_files = np.asarray(files)
img_targets = np.asarray(authors)
Are you sure you're not using a giant batch_size?
"Adding data": honestly I don't know what that means and if you could please describe exactly, with code, what you're doing here, it would be of help.
The number of samples should not cause any problems with GPU memory at all. What does cause a problem is a big batch_size.
Loading a huge dataset could cause a CPU RAM problem, not related to keras/tensorflow: a numpy array that is simply too big. (You can test this by loading your data without creating any models.)
If that is your problem, you should use a generator to load batches gradually. Again, since there is absolutely no code in your question, we can't help much.
But here are two ways of simply creating a generator for images:
Use the existing ImageDataGenerator and its flow_from_directory() method, explained here
Create your own coded generator, which can be:
A loop with yield
A class derived from keras.utils.Sequence (a sketch of this appears after the loop example below)
A quick example of a loop generator:
def imageBatchGenerator(imageFiles, imageLabels, batch_size):
    while True:
        batches = len(imageFiles) // batch_size
        if len(imageFiles) % batch_size > 0:
            batches += 1

        for b in range(batches):
            start = b * batch_size
            end = (b+1) * batch_size
            images = loadTheseImagesIntoNumpy(imageFiles[start:end])
            labels = imageLabels[start:end]
            yield images, labels
Warning: even with generators, you must make sure your batch size is not too big!
Using it:
model.fit_generator(imageBatchGenerator(files,labels,batchSize), steps_per_epoch = theNumberOfBatches, epochs= ....)
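If you'd rather go with the keras.utils.Sequence option from the list above, a minimal sketch of the same batching logic looks like this (loadTheseImagesIntoNumpy is the same hypothetical loader as in the loop generator):

import numpy as np
from keras.utils import Sequence

class ImageBatchSequence(Sequence):
    def __init__(self, imageFiles, imageLabels, batch_size):
        self.imageFiles = imageFiles
        self.imageLabels = imageLabels
        self.batch_size = batch_size

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(len(self.imageFiles) / float(self.batch_size)))

    def __getitem__(self, idx):
        start = idx * self.batch_size
        end = (idx + 1) * self.batch_size
        # loadTheseImagesIntoNumpy is the same hypothetical helper as above
        images = loadTheseImagesIntoNumpy(self.imageFiles[start:end])
        labels = self.imageLabels[start:end]
        return images, labels

# model.fit_generator(ImageBatchSequence(files, labels, batchSize), epochs=...)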
Dividing your model among GPUs
You should be able to decide which layers are processed by which GPU; this could help optimize your memory usage.
Example, when creating the model:
with tf.device('/gpu:0'):
    createLayersThatGoIntoGPU0

with tf.device('/gpu:1'):
    createLayersThatGoIntoGPU1

# you will probably need to go back to a previous GPU, as you must define your layers in a proper sequence
with tf.device('/gpu:0'):
    createMoreLayersForGPU0

# and so on
I'm not sure this would be better or not, but it's worth trying too.
See more details here: https://keras.io/getting-started/faq/#how-can-i-run-a-keras-model-on-multiple-gpus
The ResourceExhaustedError is raised because you're trying to allocate more memory than is available in your GPUs or main memory. The memory allocation is approximately equal to your network footprint (to estimate this, save a checkpoint and look at the file size) plus your batch size multiplied by the size of a single element of your dataset.
It's difficult to answer your question directly without some more information about your setup, but there is one element of this question that caught my attention: you said that you get the error when you "add enough data" or "use a big enough dataset." That's odd. Notice that the size of your dataset is not included in the calculation for the memory allocation footprint, so the size of the dataset shouldn't matter. Since it does, that seems to imply that you are somehow attempting to load your entire dataset into GPU memory or main memory. If you're doing this, that's the origin of your problem. To fix it, use the TensorFlow Dataset API. Using a Dataset sidesteps the limited memory resources by hiding the data behind an Iterator that only yields batches when called. Alternatively, you could use the older feed_dict and QueueRunner data feeding structure, but I don't recommend it. You can find some examples of this here.
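As a rough illustration of that idea (not your actual code), a Dataset built from the img_files / img_targets path arrays in your edit decodes only one batch of images at a time; the decode details below are assumptions:

import tensorflow as tf

def _parse(path, label):
    # read and decode a single image only when its batch is requested
    image = tf.read_file(path)
    image = tf.image.decode_png(image, channels=1)   # assumes PNG line images of identical size
    image = tf.image.convert_image_dtype(image, tf.float32)
    return image, label

# img_files / img_targets are the arrays from the question's edit; the author-id
# strings in img_targets would still need to be mapped to integer class ids
dataset = tf.data.Dataset.from_tensor_slices((img_files, img_targets))
dataset = dataset.shuffle(len(img_files)).map(_parse)
dataset = dataset.batch(32).prefetch(1)              # a modest batch size, not the whole set

iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()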
If you are already using the Dataset API, you'll need to post more of your code as an edit to your question for us to help you.
There is no setting that magically allows you more memory than your GPU has. It looks to me like your inputs are just too big to fit in the GPU RAM (along with all the required state and gradients).
You should use config.gpu_options.allow_growth = True, but not in order to get more memory; use it to get an idea of how much memory you need per input length. Start with a small length, check with nvidia-smi how much RAM your GPU uses, and then increase the length. Do that again and again until you understand the maximal input length (batch size) that your GPU can hold.
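For reference, the session setup with only allow_growth enabled (same imports as the snippet in the question) would look like:

import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

config = tf.ConfigProto()
# allocate GPU memory on demand rather than all at once, so nvidia-smi
# shows how much memory a given input length / batch size really needs
config.gpu_options.allow_growth = True
set_session(tf.Session(config=config))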
I've been studying mnist estimator code (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/layers/cnn_mnist.py)
and after training for 150,000 steps with this code, the logs produced by the estimator are 31 MB in size (13 MB for each weight checkpoint and 5 MB for the graph definition).
While tinkering with code I wrote my own train_input_fn using tf.data.Dataset.from_tensor_slices().
My code here:
def my_train_input_fn():
    mnist = tf.contrib.learn.datasets.load_dataset("mnist")
    images = mnist.train.images  # Returns np.array
    labels = np.asarray(mnist.train.labels, dtype=np.int32)
    dataset = tf.data.Dataset.from_tensor_slices(
        ({"x": images}, labels))
    dataset = dataset.shuffle(50000).repeat().batch(100)
    return dataset
And my logs, even before one step of training, right after graph initialization, were over 1.5 GB in size (165 MB for the checkpoint meta file, and around 600 MB each for the events.out.tfevents and graph.pbtxt files)!
After a little research I found out that the function from_tensor_slices() is not appropriate for larger datasets, because it creates constants in the execution graph.
Note that the above code snippet will embed the features and labels
arrays in your TensorFlow graph as tf.constant() operations. This
works well for a small dataset, but wastes memory---because the
contents of the array will be copied multiple times---and can run into
the 2GB limit for the tf.GraphDef protocol buffer.
source:
https://www.tensorflow.org/programmers_guide/datasets
But the MNIST dataset is only around 13 MB in size. So why is my graph definition 600 MB, and not just those additional 13 MB embedded as constants? And why is the events file so big?
The original dataset-producing code (https://github.com/tensorflow/tensorflow/blob/r1.8/tensorflow/python/estimator/inputs/numpy_io.py) doesn't produce such large log files. I guess that is because it uses queues. But queues are now deprecated and we should use tf.data.Dataset instead, right? What is the correct way of creating such a dataset from files containing images (not from TFRecords)? Should I use tf.data.FixedLengthRecordDataset?
I had a similar issue; I solved it using tf.data.Dataset.from_generator,
or tf.data.Dataset.range followed by dataset.map to get the particular value.
E.g. with a generator:
def generator():
    for sample in zip(*datasets_tuple):
        yield sample

dataset = tf.data.Dataset.from_generator(generator,
    output_types=output_types, output_shapes=output_shapes)
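Applied to the MNIST arrays from the question, that might look roughly like the following (the output types/shapes are assumptions based on the flattened 28x28 images):

import numpy as np
import tensorflow as tf

mnist = tf.contrib.learn.datasets.load_dataset("mnist")
images = mnist.train.images                              # np.array of flattened 28x28 images
labels = np.asarray(mnist.train.labels, dtype=np.int32)

def generator():
    # yield one sample at a time; nothing is embedded in the GraphDef
    for image, label in zip(images, labels):
        yield image, label

dataset = tf.data.Dataset.from_generator(
    generator,
    output_types=(tf.float32, tf.int32),
    output_shapes=(tf.TensorShape([784]), tf.TensorShape([])))
# restore the ({"x": features}, labels) structure the Estimator expects
dataset = dataset.map(lambda image, label: ({"x": image}, label))
dataset = dataset.shuffle(50000).repeat().batch(100)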
I am using TensorFlow V1.7 with the new high-level Estimator interface. I was able to create and train my own network with my own dataset.
However, the way I load images just doesn't seem right to me.
The approach I have used so far (largely inspired by the MNIST tutorial) is to load all the images into memory from the beginning
(here is a tiny code snippet to give you an idea):
import os
import random
import cv2

images, labels = [], []
for filename in os.listdir(folder):
    filepath = os.path.join(folder, filename)
    # using OpenCV to read the image
    images.append(cv2.imread(filepath, cv2.IMREAD_GRAYSCALE))
    labels.append(<corresponding label>)

# shuffle samples and labels in the same way
temp = list(zip(images, labels))
random.shuffle(temp)
images, labels = zip(*temp)
return images, labels
This means that I have to load into memory all my training set, containing something like 32k images, before training the net.
However since my batch size is 100 the net will not need more than 100 images at a time.
This approach seems quite weird to me. I understand that this way secondary memory is only accessed once, maximizing performance; however, if my dataset were really big, this could overload my RAM, couldn't it?
As a consequence, I would like to use a lazy approach, only loading images when they are needed (i.e. when they happen to be in a batch).
How can I do this? I have searched the TF documentation, but I have not found anything so far.
Is there something I'm missing?
It's advised to use the Dataset module, which gives you (among other things) the ability to use queues, prefetch a small number of examples into memory, control the number of threads, and much more.
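For instance, a rough sketch of the lazy variant for the Estimator setup above, assuming the file paths and integer labels are already collected into Python lists (filepaths, int_labels) and that all images share the same dimensions (the decode details are assumptions):

import tensorflow as tf

def _load_image(path, label):
    # read and decode one image lazily, per element, instead of up front
    image = tf.read_file(path)
    image = tf.image.decode_png(image, channels=1)  # grayscale, like cv2.IMREAD_GRAYSCALE
    image = tf.image.convert_image_dtype(image, tf.float32)
    return image, label

def train_input_fn():
    # only the paths and labels live in memory; pixels are read batch by batch
    dataset = tf.data.Dataset.from_tensor_slices((filepaths, int_labels))
    dataset = dataset.shuffle(buffer_size=len(filepaths))
    dataset = dataset.map(_load_image, num_parallel_calls=4)
    dataset = dataset.batch(100).prefetch(1)
    return dataset

# estimator.train(input_fn=train_input_fn, steps=...)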
During training of my data, my GPU utilization is around 40%, and I clearly see that there is a data-copy operation that's using a lot of time, based on the TensorFlow profiler (see attached picture). I presume the "MEMCPYHtoD" operation is copying the batch from CPU to GPU and blocking the GPU from being used. Is there any way to prefetch data to the GPU, or are there other problems that I am not seeing?
Here is the code for dataset:
X_placeholder = tf.placeholder(tf.float32, data.train.X.shape)
y_placeholder = tf.placeholder(tf.float32, data.train.y[label].shape)
dataset = tf.data.Dataset.from_tensor_slices({"X": X_placeholder,
"y": y_placeholder})
dataset = dataset.repeat(1000)
dataset = dataset.batch(1000)
dataset = dataset.prefetch(2)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()
Prefetching to a single GPU:
Consider using a more flexible approach than prefetch_to_device, e.g. by explicitly copying to the GPU with tf.data.experimental.copy_to_device(...) and then prefetching. This avoids the restriction that prefetch_to_device must be the last transformation in the pipeline, and lets you incorporate further tricks to optimize Dataset pipeline performance (e.g. by overriding the threadpool distribution).
Try out the experimental tf.contrib.data.AUTOTUNE option for prefetching, which allows the tf.data runtime to automatically tune the prefetch buffer sizes based on your system and environment.
At the end, you might end up doing something like this:
dataset = dataset.apply(tf.data.experimental.copy_to_device("/gpu:0"))
dataset = dataset.prefetch(tf.contrib.data.AUTOTUNE)
I believe you can now fix this problem by using prefetch_to_device. Instead of the line:
dataset = dataset.prefetch(2)
do
dataset = dataset.apply(tf.contrib.data.prefetch_to_device('/gpu:0', buffer_size=2))