I would like to know how data loading and data transfer between CPU RAM and GPU memory are handled when I have a tf.data.Dataset and use Keras's fit method.
Is one batch transferred at a time, with forward and backward propagation done on that batch before the next batch is sent from CPU RAM to GPU memory?
I know that Keras's fit method has a max_queue_size argument; however, the documentation says that
"Used for generator or keras.utils.Sequence input only"
How does tf.data data loading work under the hood? Does anything change if, instead of using the fit method, I write a custom training loop like here?
Are there links/guides where this is explained in enough detail?
The dataset will be sampled sequentially unless otherwise specified. Since you're in control of the dataset with tf.data.Dataset, you can overlap data preparation with training yourself, using tf.data.Dataset.prefetch.
Most dataset input pipelines should end with a call to prefetch. This allows later elements to be prepared while the current element is being processed. This often improves latency and throughput, at the cost of using additional memory to store prefetched elements.
>>> dataset = tf.data.Dataset.range(3)
>>> dataset = dataset.prefetch(2)
>>> list(dataset.as_numpy_iterator())
[0, 1, 2]
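In a fuller pipeline, prefetch is typically the last step, so the next batches are prepared on the CPU while the GPU runs the current training step. A minimal sketch, assuming a recent TF 2.x and placeholder data and batch sizes:
import tensorflow as tf

# toy in-memory data; stand-ins for your real features and labels
features = tf.random.uniform((1000, 32))
labels = tf.random.uniform((1000,), maxval=10, dtype=tf.int32)

dataset = (tf.data.Dataset.from_tensor_slices((features, labels))
           .shuffle(1000)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))   # prepare later batches while the current one is processed

# model.fit(dataset, epochs=10)   # fit consumes the dataset one batch at a time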
When I run fit() with use_multiprocessing=True, I always get a deadlock and the following warning:
WARNING:tensorflow:multiprocessing can interact badly with TensorFlow, causing nondeterministic deadlocks. For high performance data pipelines tf.data is recommended.
How do I run it properly?
Since the warning mentions tf.data, I wonder whether transforming my data into that format will make multiprocessing work. What specifically is meant, and how do I convert my data?
My dataset (reproducible):
import numpy as np

Input_shape, labels = (20, 4), 6
LEN_X, LEN_Y = 20000, 3000
train_X, train_Y = np.asarray([np.random.random(Input_shape) for x in range(LEN_X)]), np.random.random((LEN_X, labels))
validation_X, validation_Y = np.asarray([np.random.random(Input_shape) for x in range(LEN_Y)]), np.random.random((LEN_Y, labels))
sampleW = np.random.random((LEN_X, 1))
Multiprocessing doesn't accelerate the model itself; it only accelerates data loading, and data-loading delay is not a problem when all your data is already in memory.
You could still use multiprocessing, but you would have to make sure the underlying dataset is thread-safe and carefully craft the data pipeline, which is quite time-consuming. So instead, I suggest you speed up the model itself.
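Regarding the "how do I convert my data to tf.data" part of the question, a minimal sketch using the numpy arrays you posted (batch and shuffle sizes are arbitrary placeholders, and a recent TF 2.x is assumed):
import tensorflow as tf

# wrap the in-memory arrays in a tf.data pipeline; fit() can consume this directly
train_ds = (tf.data.Dataset.from_tensor_slices((train_X, train_Y, sampleW))
            .shuffle(1024)
            .batch(32)
            .prefetch(tf.data.AUTOTUNE))

val_ds = (tf.data.Dataset.from_tensor_slices((validation_X, validation_Y))
          .batch(32)
          .prefetch(tf.data.AUTOTUNE))

# model.fit(train_ds, validation_data=val_ds, epochs=10)   # no use_multiprocessing needed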
To speed up the model itself, you should look into:
changing all activations except the last layer's to ReLU.
tweaking batch size. (optimal number depends on your hardware, and is almost always less than or equal to 32)
using Batch normalization to speed up convergence.
using higher learning rate (be careful not to overdo this step).
if you need faster convolutions, consider using Kaggle notebooks or vast.ai for GPU-enabled computations.
last but not least, try using a simpler, smaller model.
Comment down here if you have any additional questions.
Cheers.
I am trying to improve the training speed/efficiency of a TensorFlow implementation of a point-cloud object-detection algorithm.
The input data is a [8000, 100, 9] float32 tensor, roughly 27 MB per sample. With a batch size of 5, data loading becomes a bottleneck in training: most of the time the GPU utilization rate is 0% until data arrives.
I have tried the following methods to increase data loading speed.
1. Use num_parallel_calls in the tf.data.Dataset.map API, with multiple threads for reading this big tensor. The problem is that .map wraps a py_func, which is subject to the Global Interpreter Lock, so multi-threading does not improve I/O efficiency.
2. Use the tf.data.Dataset.interleave API. Since it is also thread-based, it has the same problem as method 1.
3. Use the TFRecord format. This is even slower than methods 1 and 2. One possibility is that TFRecord converts the tensor to numpy, serializes the numpy array to bytes, then wraps those bytes in a TensorFlow structure and writes them to disk. Numpy-to-Tensor conversion takes a long time for my data, as measured with tf.convert_to_tensor().
Any suggestions how to move forward would be helpful. Thanks!
Follow-up on comments:
Am I using slow disks? Data is stored on a mounted disk, which could be a reason.
Can the data fit into GPU memory? Unfortunately no. There are ~70,000 samples. I tried caching a small dataset into RAM and the GPU utilization rate is 30%~40%, which is probably the highest to expect for this particular network.
Some ideas:
You should use a combination of methods 1, 2 and 3. If you save your files as TFRecords, you can read them in parallel; that's what they are designed for. Then you will be able to use num_parallel_calls and interleave, because that way you don't have to wrap a py_func.
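A minimal sketch of that combination, assuming the samples have already been written to TFRecord shards; the file pattern, feature key, and shapes are placeholders for your own data:
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE
filenames = tf.data.Dataset.list_files("data/shard-*.tfrecord")   # hypothetical shard names

def parse_example(serialized):
    # assumes each record stores one flattened [8000, 100, 9] float32 tensor under the key "points"
    features = tf.io.parse_single_example(
        serialized, {"points": tf.io.FixedLenFeature([8000 * 100 * 9], tf.float32)})
    return tf.reshape(features["points"], [8000, 100, 9])

dataset = (filenames
           .interleave(tf.data.TFRecordDataset,
                       cycle_length=4,                  # read several shards in parallel
                       num_parallel_calls=AUTOTUNE)
           .map(parse_example, num_parallel_calls=AUTOTUNE)   # pure TF ops, no py_func, no GIL
           .batch(5)
           .prefetch(AUTOTUNE))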
.map doesn't have to wrap a py_func; you could, for example, use tf.keras.utils.get_file. That way you also avoid py_func and can use num_parallel_calls efficiently. I still recommend TFRecords; they are designed for this use case.
Another option is to use an SSD to store your data instead of a Hard Disk.
You can also look into the .cache() method of the tf.data.Dataset API. Maybe you can try loading a random subset of the data, training multiple epochs on that, and in the meantime fetching another subset of the data (using prefetch), then training multiple epochs on that, and so on. This idea is more of a long shot as it might affect performance, but it just might work in your case.
I'm implementing an RL algorithm and using tf.data.Dataset (with prefetch) to feed data to the neural network. However, in order to interact with the environment, I have to explicitly feed data through feed_dict to take an action. I'm wondering whether using feed_dict together with Dataset would impair the speed.
Here's a simplified version of my code:
# code related to Dataset
ds = tf.data.Dataset.from_generator(buffer, sample_types, sample_shapes)
ds = ds.prefetch(5)
iterator = ds.make_one_shot_iterator()
samples = iterator.get_next(name='samples')
# pass samples to network
# network training, no feed_dict is needed because of Dataset
sess.run([self.opt_op])
# run the actor network to choose an action at the current state.
# manually feed the current state to samples
# will this impair the performance?
action = sess.run(self.action, feed_dict={samples['state']: state})
There is nothing wrong with mixing Dataset and feed_dict. If the state you provide through feed_dict is large, it might lead to an underutilized GPU depending on the size of the data, but that would in no way be related to the Dataset being used.
One of the reasons the Dataset API exists is to avoid model starvation and improve GPU utilization during training. Starvation might happen because data is being copied from one location to another: disk to memory, memory to GPU memory, you name it. Dataset tries to start executing bulky I/O operations early enough to avoid starving the model when the time comes to process the next batch. So, basically, Dataset tries to reduce the time between batches.
In your case you probably don't lose any performance from using feed_dict. It seems that you break the execution with the environment interaction anyhow (therefore possibly underutilizing the GPU).
If you would like to be sure, time your performance when you are feeding the actual state with feed_dict, then replace the state usage with a constant tensor and compare the speed.
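A self-contained sketch of that comparison in TF 1.x style (matching the graph/session code above); the placeholder shape and the tiny dense layer are made up just to have something to run:
import time
import numpy as np
import tensorflow as tf   # assumes TF 1.x, as in the snippet above

state_ph = tf.placeholder(tf.float32, shape=(1, 4), name='state')    # stand-in for samples['state']
action_from_feed = tf.layers.dense(state_ph, 2)                       # stand-in for self.action
const_state = tf.constant(np.random.random((1, 4)).astype(np.float32))
action_from_const = tf.layers.dense(const_state, 2)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    state = np.random.random((1, 4)).astype(np.float32)

    start = time.time()
    for _ in range(1000):
        sess.run(action_from_feed, feed_dict={state_ph: state})
    print('with feed_dict:', time.time() - start)

    start = time.time()
    for _ in range(1000):
        sess.run(action_from_const)
    print('with constant tensor:', time.time() - start)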
I have trained a 3D convnet using mxnet. I saved the network architecture and parameters with the intention of testing more data with it to check its performance. Since I am not training, I do not want to obtain batches of the dataset. How do I get the network to read in the entire dataset as input? Passing the dataset object to the network directly gives only a 4D tensor, whereas the network wants 5D. Right now I am using the DataLoader but setting the batch size to the entire dataset, and I feel like there is a more efficient way to do this.
DataLoader requires either a batch_size or a BatchSampler. In theory, you could write a BatchSampler that fetches the entire dataset as one batch, though I don't think you'll see a significant performance gain with such a large batch size. Additionally, using batches is beneficial if you have more than one worker; have you considered using num_workers > 0 to take advantage of parallel processing?
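A minimal sketch of both options with Gluon's DataLoader; the random dataset here is a stand-in for your real one:
import mxnet as mx
from mxnet.gluon.data import ArrayDataset, DataLoader

# stand-in dataset: 100 samples, each a 4D [channels, depth, height, width] array
data = mx.nd.random.uniform(shape=(100, 1, 16, 32, 32))
labels = mx.nd.random.uniform(shape=(100,))
dataset = ArrayDataset(data, labels)

# option 1: a single batch containing the whole dataset (adds the 5th, batch, dimension)
full_loader = DataLoader(dataset, batch_size=len(dataset))

# option 2: normal batches with parallel workers for faster loading
batched_loader = DataLoader(dataset, batch_size=32, num_workers=2)

for x, y in full_loader:
    print(x.shape)   # (batch, channels, depth, height, width): the 5D input the network expects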
So I'm trying to train my CNN with multiple datasets, and it seems that when I add enough data (such as when I add multiple sets as one, or when I try to add the one that has over a million samples) it throws a ResourceExhaustedError.
As for the instructions here, I tried adding
from keras.backend.tensorflow_backend import set_session
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.3
set_session(tf.Session(config=config))
to my code, but this doesn't seem to make a difference.
I see 0.3 after printing out config.gpu_options.per_process_gpu_memory_fraction, so that part seems to be OK.
I even threw in config.gpu_options.allow_growth = True for good measure, but it doesn't seem to do anything except attempt to use all the memory at once, only to find that it isn't enough.
The computer I'm trying to use to train this CNN has four GTX 1080 Tis with 12 GB of dedicated memory each.
EDIT: I'm sorry for not specifying how I was loading the data; I honestly didn't realise there was more than one way. When I was learning, the examples always loaded datasets that were already built in, and it took me a while to realise how to load a self-supplied dataset.
The way I'm doing it is that I create two numpy arrays. One has the path of each image and the other has the corresponding label. Here's the most basic example of this:
import os
import glob
import numpy as np

data_dir = "folder_name"
authors, files = [], []

# There is a folder for every form, and in that folder is every line of that form
for filename in glob.glob(os.path.join(data_dir, '*', '*')):
    # File names have the format "{author id}-{form id}-{line number}.png".
    # filename is a path to the file, so .split('\\')[-1] gets the raw file name without the path
    # and .split('-')[0] gets the author id
    authors.append(filename.split('\\')[-1].split('-')[0])
    files.append(filename)

# keras requires numpy arrays
img_files = np.asarray(files)
img_targets = np.asarray(authors)
Are you sure you're not using a giant batch_size?
"Adding data": honestly I don't know what that means and if you could please describe exactly, with code, what you're doing here, it would be of help.
The number of samples should not cause any problems with GPU memory at all. What does cause a problem is a big batch_size.
Loading a huge dataset could cause a CPU RAM problem, not related to Keras/TensorFlow: a numpy array that is simply too big. (You can test this by loading your data "without creating any models".)
If that is your problem, you should use a generator to load batches gradually. Again, since there is absolutely no code in your question, we can't help much.
But here are two simple ways of creating a generator for images:
Use the existing ImageDataGenerator and its flow_from_directory() method, explained here
Create your own coded generator, which can be:
A loop with yield
A class derived from keras.utils.Sequence
A quick example of a loop generator:
def imageBatchGenerator(imageFiles, imageLabels, batch_size):
    while True:
        batches = len(imageFiles) // batch_size
        if len(imageFiles) % batch_size > 0:
            batches += 1

        for b in range(batches):
            start = b * batch_size
            end = (b + 1) * batch_size
            images = loadTheseImagesIntoNumpy(imageFiles[start:end])
            labels = imageLabels[start:end]
            yield images, labels
Warning: even with generators, you must make sure your batch size is not too big!
Using it:
model.fit_generator(imageBatchGenerator(files,labels,batchSize), steps_per_epoch = theNumberOfBatches, epochs= ....)
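For the keras.utils.Sequence option mentioned above, a minimal sketch; loadTheseImagesIntoNumpy is the same hypothetical image-loading helper used in the loop generator:
import numpy as np
from keras.utils import Sequence

class ImageBatchSequence(Sequence):
    def __init__(self, imageFiles, imageLabels, batch_size):
        self.imageFiles = imageFiles
        self.imageLabels = imageLabels
        self.batch_size = batch_size

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(len(self.imageFiles) / self.batch_size))

    def __getitem__(self, idx):
        start = idx * self.batch_size
        end = (idx + 1) * self.batch_size
        images = loadTheseImagesIntoNumpy(self.imageFiles[start:end])
        labels = self.imageLabels[start:end]
        return images, labels

# model.fit_generator(ImageBatchSequence(files, labels, batchSize), epochs=...)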
Dividing your model among GPUs
You should be able to decide which layers are processed by which GPU; this "could" optimize your RAM usage.
Example, when creating the model:
with tf.device('/gpu:0'):
    createLayersThatGoIntoGPU0

with tf.device('/gpu:1'):
    createLayersThatGoIntoGPU1

# you will probably need to go back to a previous GPU, as you must define your layers in a proper sequence
with tf.device('/gpu:0'):
    createMoreLayersForGPU0

# and so on
I'm not sure whether this would be better or not, but it's worth trying too.
See more details here: https://keras.io/getting-started/faq/#how-can-i-run-a-keras-model-on-multiple-gpus
The ResourceExhaustedError is raised because you're trying to allocate more memory than is available in your GPUs or main memory. The memory allocation is approximately equal to your network footprint (to estimate this, save a checkpoint and look at the file size) plus your batch size multiplied by the size of a single element of your dataset.
It's difficult to answer your question directly without more information about your setup, but there is one element of this question that caught my attention: you said that you get the error when you "add enough data" or "use a big enough dataset." That's odd. Notice that the size of your dataset is not included in the calculation for the memory allocation footprint. Thus, the size of the dataset shouldn't matter. Since it does, that seems to imply that you are somehow attempting to load your entire dataset into GPU memory or main memory. If you're doing this, that's the origin of your problem. To fix it, use the TensorFlow Dataset API. Using a Dataset sidesteps the limited memory resources by hiding the data behind an Iterator that only yields batches when called. Alternatively, you could use the older feed_dict and QueueRunner data-feeding structure, but I don't recommend it. You can find some examples of this here.
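A minimal sketch of that approach, written in TF 2.x style and assuming image paths and labels like the img_files / img_targets arrays in the question (the decode function, target size, and batch size are placeholders):
import tensorflow as tf

def load_image(path, label):
    # read and decode one image lazily, only when its batch is actually requested
    image = tf.io.decode_png(tf.io.read_file(path))
    image = tf.image.convert_image_dtype(image, tf.float32)
    image = tf.image.resize(image, (128, 512))   # placeholder target size
    return image, label

dataset = (tf.data.Dataset.from_tensor_slices((img_files, img_targets))
           .map(load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .shuffle(1024)
           .batch(32)                  # only this many decoded images are in memory at once
           .prefetch(tf.data.experimental.AUTOTUNE))

# note: string author ids would still need to be encoded as integers/one-hot before training
# model.fit(dataset, epochs=...)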
If you are already using the Dataset API, you'll need to post more of your code as an edit to your question for us to help you.
There is no setting that magically gives you more memory than your GPU has. It looks to me like your inputs are just too big to fit in the GPU RAM (along with all the required state and gradients).
You should use config.gpu_options.allow_growth = True, not to get more memory, but to get an idea of how much memory you need per input length. Start with a small length, check with nvidia-smi how much RAM your GPU uses, and then increase the length. Do that again and again until you understand the maximal length of inputs (batch size) that your GPU can hold.