I'm running Tensorflow Object Detection API to train my own detector using the object_detection/train.py script, found here. The problem is that I'm getting CUDA_ERROR_OUT_OF_MEMORY constantly.
I found some suggestions to reduce the batch size so the trainer consumes less memory, but I reduced it from 16 to 4 and I'm still getting the same error. The difference is that with batch_size=16 the error was thrown at step ~18, and now it is being thrown at step ~70. EDIT: setting batch_size=1 didn't solve the problem either, as I still got the error at step ~2700.
What can I do to make it run smoothly until I stop the training process? I don't really need fast training.
EDIT:
I'm currently using a GTX 750 Ti 2GB for this. The GPU is not being used for anything other than training and driving the monitor. Currently, I'm using only 80 images for training and 20 images for evaluation.
I think it is not about batch_size, because you were able to start the training in the first place.
Open a terminal and run
nvidia-smi -l
to check whether other processes kick in when this error happens. If you set batch_size=16, you can find out pretty quickly.
I found the solution to my problem. The batch_size was not the problem, but a higher batch_size made the training's memory consumption grow faster, because I was using the config.gpu_options.allow_growth = True configuration. This setting allows Tensorflow to increase memory consumption when needed and to use up to 100% of the GPU memory.
The problem was that I was running the eval.py script at the same time (as recommended in their tutorial), and it was using part of the GPU memory. When the train.py script tried to use all 100%, the error was thrown.
I solved it by setting the maximum use percentage to 70% for the training process. This also solved the stuttering while training. 70% may not be the optimum value for my GPU, but it is configurable via the config.gpu_options.per_process_gpu_memory_fraction = 0.7 setting, for example.
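For reference, here is a minimal sketch of how those two options fit together, assuming a TF1-style tf.Session (where exactly the session config is built depends on your version of the training script):

import tensorflow as tf

# Sketch only: cap this process at ~70% of GPU memory instead of letting
# allow_growth expand all the way to 100% and collide with eval.py.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True                    # allocate memory incrementally
config.gpu_options.per_process_gpu_memory_fraction = 0.7  # but never beyond 70%
sess = tf.Session(config=config)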
Another option is to dedicate the GPU for training and use the CPU for evaluation.
Disadvantage: Evaluation will consume a large portion of your CPU, but only for a few seconds every time a training checkpoint is created, which is not often.
Advantage: 100% of your GPU is used for training all the time.
To target CPU, set this environment variable before you run the evaluation script:
export CUDA_VISIBLE_DEVICES=-1
You can also explicitly set the evaluation batch size to 1 in pipeline.config to consume less memory:
eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  batch_size: 1
}
If you're still having issues, TensorFlow may not be releasing GPU memory between training runs. Try restarting your terminal or IDE and try again. This answer has more details.
I am trying to train a ResNet50 model using Keras with the TensorFlow backend. I'm using a SageMaker GPU instance (ml.p3.2xlarge), but my training time is extremely long. I am using the conda_tensorflow_p36 kernel, and I have verified that I have tensorflow-gpu installed.
When inspecting the output of nvidia-smi I see the process is on the GPU, but the utilization is never above 0%.
Tensorflow also recognizes the GPU.
Screenshot of training time.
Is SageMaker in fact using the GPU even though the usage is 0%?
Could the long epoch training time be caused by another issue?
Looks like you've completed 8 steps and it just takes very long. What's your step time?
It might be due to data loading. Where is the data stored? Try to take data loading out of the picture by caching and feeding a single image to the DNN repeatedly, and see if that helps.
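A quick, hedged sketch of that experiment with tf.data (train_ds and model are hypothetical names for your existing dataset of batches and compiled Keras model):

import tensorflow as tf

# Cache a single batch and repeat it forever, so the network never waits on
# data loading. If the step time drops sharply, the input pipeline is the bottleneck.
debug_ds = train_ds.take(1).cache().repeat()

model.fit(debug_ds, steps_per_epoch=100, epochs=1)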
I am using tensorflow-2 gpu with tf.data.Dataset.
Training on small models works.
When training a bigger model, everything works at first: the GPU is used and the first epoch runs with no trouble (but I am using most of my GPU memory).
At validation time, I run into a CUDA_ERROR_OUT_OF_MEMORY, with various allocations failing for smaller and smaller amounts of bytes (ranging from 922MB down to 337MB).
I currently have no metrics and no callbacks and am using tf.keras.Model.fit.
If I remove the validation data, the training continues.
What is my issue? How can I debug this?
In TF1 I could use RunOptions(report_tensor_allocations_upon_oom=True); does any equivalent exist in TF2?
This occurs with tensorflow==2.1.0 .
This did not occur with the TensorFlow 2.0 alpha, but it does occur with 2.0:
pip install tensorflow-gpu==2.0.0: memory is leaked!
pip install tensorflow-gpu==2.0.0-alpha: it's all right!
Try it out.
We have an Ubuntu 18.04 machine with an RTX 2080 Ti GPU, used remotely by about 3-4 users. Is it possible to set a maximum GPU usage threshold per user (say 60%) so the others can use the rest?
We are running tensorflow deep learning models, if that helps to suggest an alternative.
My apologies for taking so long to come back here to answer the question, even after figuring out a way to do this.
It is indeed possible to threshold the GPU usage with tensorflow's per_process_gpu_memory_fraction. [Hence I edited the question]
The following snippet assigns 46% of the GPU memory to the user.
import tensorflow as tf

init = tf.global_variables_initializer()
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.46)
with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) as sess:
    sess.run(init)
    #######################
    # Training happens here
    #######################
Currently, we have 2 users using the same GPU simultaneously without any issues. We have assigned 46% to each. Make sure you don't make it 50-50% (an "aborted, core dumped" error occurs if you do so). Try to keep around 300MB of memory idle.
And as a matter of fact, this GPU division does not slow down the training process. Surprisingly, it offers about the same speed as if the full memory were used, at least in our experience, although this may change with high-dimensional data.
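For TF2-style code, a comparable (hedged) sketch uses a virtual device with a hard memory limit instead of per_process_gpu_memory_fraction; the 5000 MB cap below is just illustrative:

import tensorflow as tf

# Sketch only: give this process a logical GPU limited to ~5 GB of the card's memory.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=5000)])  # in MB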
I am following the instructions for TensorFlow Retraining for Poets. GPU utilization seemed low, so I instrumented the retrain.py script per the instructions in Using GPU. The log verifies that the TF graph is being built on the GPU. I am retraining for a large number of classes and images. Please help me tweak the parameters in TF and the retraining script to utilize the GPU.
I am aware from this question that I should decrease the batch size. It is not obvious what constitutes "batch size" for this script. I have 60 classes and 1MM training images. It starts by making 1MM bottleneck files. That part is CPU-bound and slow, and I understand that. Then it trains for 4,000 steps, taking 100 images at a time in each step. Is this the batch? Will GPU utilization go up if I reduce the number of images per step?
Your help would be really appreciated!
I usually do the things below (a combined sketch follows these steps).
Check if you are using the GPU.
tf.test.is_gpu_available()
Monitor GPU usage.
watch -n 0.1 nvidia-smi
If your CPU usage is low, write this after
train_batches = train.shuffle(SHUFFLE_BUFFER_SIZE).batch(BATCH_SIZE)
train_batches = train_batches.prefetch(1) # This will prefetch one batch
If your GPU usage is still low, try a larger batch size:
batch_size = 128
If your GPU usage is still low, it may be that:
Your graph is too simple to make full use of the GPU.
There is a bug in your code or in a package.
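A rough end-to-end sketch of the steps above, using hypothetical names (train is assumed to be a tf.data.Dataset and model a compiled Keras model):

import tensorflow as tf

print(tf.test.is_gpu_available())            # step 1: confirm TensorFlow sees the GPU

SHUFFLE_BUFFER_SIZE = 1024
BATCH_SIZE = 128                             # step 3: larger batches usually raise GPU utilization

train_batches = train.shuffle(SHUFFLE_BUFFER_SIZE).batch(BATCH_SIZE)
train_batches = train_batches.prefetch(1)    # step 2: load the next batch while the GPU trains

model.fit(train_batches, epochs=10)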
Let's go through your questions one by one:
Batch size is the number of images on which the training/testing/validation is done at a time. You can find the respective parameters and their default values defined in the script:
parser.add_argument(
    '--train_batch_size',
    type=int,
    default=100,
    help='How many images to train on at a time.'
)
parser.add_argument(
    '--test_batch_size',
    type=int,
    default=-1,
    help="""\
    How many images to test on. This test set is only used once, to evaluate
    the final accuracy of the model after training completes.
    A value of -1 causes the entire test set to be used, which leads to more
    stable results across runs.\
    """
)
parser.add_argument(
    '--validation_batch_size',
    type=int,
    default=100,
    help="""\
    How many images to use in an evaluation batch. This validation set is
    used much more often than the test set, and is an early indicator of how
    accurate the model is during training.
    A value of -1 causes the entire validation set to be used, which leads to
    more stable results across training iterations, but may be slower on large
    training sets.\
    """
)
So if you want to decrease training batch size, you should run the script with this parameter among others:
python -m retrain --train_batch_size=16
I also recommend specifying the batch size as a power of 2 (16, 32, 64, 128, ...). This number depends on the GPU you are using: the less memory the GPU has, the smaller the batch size you should use. With 8GB of GPU memory, you can try a batch size of 16.
To discover whether you are using GPUs at all, you can follow the steps in the Tensorflow documentation you mentioned - just put
tf.debugging.set_log_device_placement(True)
as the first statement of your script.
Device placement logging causes any Tensor allocations or operations to be printed.
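A minimal sketch of what that looks like in practice (TF2-style API; the constants are just placeholders):

import tensorflow as tf

tf.debugging.set_log_device_placement(True)   # must run before any ops are created

a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)   # the log for this op should end in .../device:GPU:0 if the GPU is used
print(c)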
I am currently training a recurrent net in Tensorflow for a text classification problem and am running into performance and out-of-memory issues. I am on an AWS g2.8xlarge with Ubuntu 14.04 and a recent nightly build of tensorflow (which I downloaded on Aug 25).
1) Performance issue:
On the surface, both the CPU and GPU are highly under-utilized. I've run multiple tests on this (and have used line_profiler and memory_profiler in the process). Train durations scale linearly with number of epochs, so I tested with 1 epoch. For RNN config = 1 layer, 20 nodes, training time = 146 seconds.
Incidentally, that number is about 20 seconds higher/slower than the same test run on a g2.2xlarge!
Here is a snapshot of System Monitor and nvidia-smi (updated every 2 seconds) about 20 seconds into the run:
SnapshotEarlyPartOfRun
As you can see, GPU utilization is at 19%. When I use nvprof, I find that the total GPU process time is about 27 seconds or so. Also, except for one vCPU, all others are very under-utilized. The numbers stay around this level, till the end of the epoch where I measure error across the entire training set, sending GPU utilization up to 45%.
Unless I am missing something, on the surface, each device is sitting around waiting for something to happen.
2) Out of memory issue:
If I increase the number of nodes to 200, I get an Out of Memory error on the GPU side. As you can see from the above snapshots, only one of the four GPUs is used. I've found that the way to get tensorflow to use a GPU has to do with how you assign the model: if you don't specify anything, tensorflow assigns it to one GPU, and if you specify a GPU, only that one is used. Tensorflow does not like it when I assign the model to multiple devices with a "for d in ['/gpu:0', ...]" loop; I run into an issue with re-using the embedding variable. I would like to use all 4 GPUs for this (without setting up distributed tensorflow). Here is the snapshot of the Out of Memory error:
OutofMemoryError
Any suggestions you may have for both these problems would be greatly appreciated!
Re (1), to improve GPU utilization did you try increasing the batch size and / or shortening the sequences you use for training?
Re (2), to use multiple GPUs you do need to manually assign the ops to GPU devices, I believe. The right way is to place ops on specific GPUs by doing
with g.Device("/gpu:0"):
...
with g.Device("/gpu:1"):
...
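For the embedding-variable reuse issue, here is a rough TF1-style sketch of the usual multi-tower pattern (model_fn and input_splits are hypothetical names, and model_fn is assumed to create its weights with tf.get_variable):

import tensorflow as tf

# Sketch only: one "tower" per GPU, sharing the embedding and other weights.
tower_losses = []
with tf.variable_scope(tf.get_variable_scope()):
    for i, inputs in enumerate(input_splits):          # input_splits: the batch split across GPUs
        with tf.device("/gpu:%d" % i):
            loss = model_fn(inputs)                    # builds the RNN and returns its loss
            tf.get_variable_scope().reuse_variables()  # later towers reuse the first tower's variables
            tower_losses.append(loss)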