I'm trying to train a Mask-RCNN (ResNet101) model on a single GPU using the TensorFlow Object Detection API.
This is the configuration I'm using for Mask-RCNN.
Immediately after training starts, it takes up ~24GB of CPU RAM. At the 10 minute mark, when the first round of evaluation begins, all 32GB of my CPU RAM fill up and the process gets killed.
My dataset consists of 1 sample in the training set (i.e. 1 image) and 1 sample in the eval set. The images are 775 x 522 pixels. Each sample (image + boxes + masks) amounts to no more than 2MB on disk. I've done this to ensure that the dataset's effect on memory consumption is minimised.
I've also attempted to train a Faster RCNN model on the same dataset, and it works as expected (using 2-3 GB of CPU RAM).
This is the configuration I'm using for Faster RCNN.
Why is the memory usage during the training and evaluation of Mask-RCNN so high (compared to Faster RCNN's), and what can I do to reduce it?
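To put rough numbers on it, below is a back-of-the-envelope I did, assuming the input pipeline keeps a queue of fully decoded examples whose instance masks are stored as float32; the queue depth and instance count are pure guesses on my part.

# Rough estimate of memory held by a queue of decoded examples with masks.
# Everything except the image size is an assumption on my part.
height, width = 522, 775
num_instances = 10         # assumed instances per image
queue_depth = 2000         # assumed capacity of the input queue
bytes_per_float = 4        # masks assumed to be float32

mask_bytes = num_instances * height * width * bytes_per_float
print(f"masks per example: {mask_bytes / 2**20:.1f} MiB")                # ~15.4 MiB
print(f"full queue:        {queue_depth * mask_bytes / 2**30:.1f} GiB")  # ~30 GiB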
I have access to a 2018 Lambda machine with 24 CPUs and 4 GPUs and a 4 TB SSD. I also have access to a 2022 Dell with 40 CPUs and 3 A6000 GPUs and a 1TB NVMe SSD in addition to a 1TB SSD and 2 8TB HDDs.
If I restrict my Python/Keras/TensorFlow code to use 1 CPU and 1 GPU, it takes about 9-10 minutes per epoch to run with about 40k parameters. If I increase the number of CPUs available or the number of GPUs available or both, it still takes 9-10 minutes per epoch. In the beginning, the disks are heavily taxed as the data is read into memory, but apparently, it all fits into memory because, after a while, it appears that there is no disk activity.
I used atop to monitor resource usage and that is how I found out about the disk activity.
I used htop to monitor CPU activity and I found that if only one CPU is used, it is heavily taxed. If 2 CPUs are used, they are each taxed at about 60% capacity. For 10 CPUs, each is used at about 13% capacity, and with 20 CPUs, each is used at about 7% capacity.
I used nvtop to monitor GPU activity and it appears, on the Dell, that one GPU is utilized at about 10-25% capacity while the other two are rarely used. In fact, it appears that no data is read into the memory of GPU1 or GPU2. Only GPU0 appears to be doing something real. Previously using nvidia-smi revealed similar behavior on the Lambda machine.
I am using the Keras model.fit with my own data generator function. I would think that using multiple CPUs would cause the data generator to be run in parallel on several CPUs to keep the queue full, but the CPUs seem mostly idle.
So, the CPUs appear to be underutilized, the GPUs appear to be underutilized, and the disks appear to be underutilized. Either there is some other source of bottlenecking or the main memory/PCI-bus can't keep up. Or maybe Keras just isn't running more than one copy of the data generator. At this point, I'm at a complete loss.
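For reference, this is roughly how I imagine wrapping my generator in tf.data so that batch preparation overlaps with training; the generator below is just a stand-in for my real one.

import numpy as np
import tensorflow as tf

# Stand-in for my real generator: yields two 69x69x3 inputs and a label.
def fake_generator():
    while True:
        a = np.random.rand(69, 69, 3).astype("float32")
        b = np.random.rand(69, 69, 3).astype("float32")
        y = np.float32(np.random.randint(0, 2))
        yield (a, b), y

signature = (
    (tf.TensorSpec((69, 69, 3), tf.float32),
     tf.TensorSpec((69, 69, 3), tf.float32)),
    tf.TensorSpec((), tf.float32),
)

ds = (tf.data.Dataset.from_generator(fake_generator, output_signature=signature)
      .batch(500)
      .prefetch(tf.data.AUTOTUNE))  # overlap input preparation with GPU work

# model.fit(ds, steps_per_epoch=600_000 // 500, epochs=...)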
BTW, I'm using Python 3.8 and TensorFlow 2.10.
Any help is appreciated.
Edit: Some more information:
I can't say exactly what the application is, but the network currently has two Siamese branches each having 17 convolutional layers performing 3x3 convolutions in a telescoping arrangement. The input images are 69x69 and the smallest layer is 5x5. The input layer has 3 channels and subsequent layers each have 16 channels. Following that are three dense layers with 25, 20, and 10 units. Following that, the two Siamese branches are combined in a product layer with 10 outputs and those outputs are combined by another dense layer with 1 output. There is also a custom loss function which is similar to the binary cross-entropy loss. The input dataset consists of about 600,000 images.
On the advice of a colleague, I turned off all the GPUs and am running purely on CPUs. Under that condition, it appears that all 40 CPUs on the Dell are being utilized at between 20% and 50% of capacity with one CPU sometimes using 95+% of capacity. It appears that under this condition, the time increases to about 12-13 minutes per epoch.
I should also mention that I'm using a batch size of 500. I've tried increasing or decreasing the batch size, but the run time does not appear to be greatly affected. If anything, it just gets slower.
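Also, regarding the idle GPUs: my understanding is that Keras places the model on a single GPU unless you opt into a distribution strategy, so something like the sketch below would be needed to use all of them (the model here is only a placeholder for mine).

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()        # uses all visible GPUs
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Placeholder model; my real Siamese network would be built here instead.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(1, activation="sigmoid", input_shape=(10,))
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

# model.fit(...) then splits each batch across the replicas.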
Thanks.
I am switching to training on GPUs and found that training crashes with an arbitrary, and not very big, batch size. With 256x256 RGB images in a UNET, a batch of 32 causes an out-of-memory crash, while 16 works successfully. The amount of memory consumed was surprising, as I never ran into an out-of-memory error on a 16 GB RAM system. Is TensorFlow free to use swap?
How can I check the amount of total memory available on a GPU? Many guides online only look at memory used.
How does one estimate the memory needs? Something like image size (pixels * channels * dtype size) * batch + parameter count * float size?
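To make the estimate concrete, this is the arithmetic I had in mind for the batch of 32; the parameter count is just a guess for a typical UNET, and activations/optimizer state are ignored.

# Back-of-the-envelope memory estimate; ignores activations and optimizer
# state, which in practice can dominate for a UNET.
batch, h, w, channels = 32, 256, 256, 3
bytes_per_float = 4
params = 31_000_000                      # assumed UNET parameter count

input_bytes = batch * h * w * channels * bytes_per_float
param_bytes = params * bytes_per_float
print(f"input batch: {input_bytes / 2**20:.1f} MiB")   # ~24 MiB
print(f"parameters:  {param_bytes / 2**20:.1f} MiB")   # ~118 MiB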
Many thanks,
Bogdan
For a project I'm working on, I am using an altered version of Mask RCNN to train a model that will find objects in an image. These images are relatively small, about 300 x 200 pixels, and I train them for a relatively long time, around 100 epochs.
However, my main question relates to the batch size and how TensorFlow allocates memory on the GPU for the validation stage per epoch. I want to increase my batch size to help smooth out the validation curve, as well as increase the accuracy of the overall model. However, if I increase my batch size too drastically, I get an OOM: GPU out-of-memory and keras_scratch_graph error. I'm currently working with two NVIDIA Quadro P5000s that have 16GB of VRAM each. Having about 3 images per GPU, I can have a max batch size of 6 before it errors out. I've looked around and most people either say to just decrease the batch size, which I would prefer not to do, or enable GPU memory growth, which I couldn't get to work either. I could decrease the complexity of my model to reduce the size of the tensors being evaluated, but I don't want to risk it, as it could cause my accuracy to decrease or my loss to increase.
Is there a way that I can offload some images onto my physical system's memory, or am I purely limited to the amount of RAM available on my GPU? Are there any more compact or robust methods out there that could solve this issue?
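For reference, this is roughly the memory-growth snippet I tried; as far as I know it has to run before any GPU is initialized.

import tensorflow as tf

# Ask TensorFlow to allocate GPU memory on demand instead of grabbing it all.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)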
I am following the instructions for TensorFlow Retraining for Poets. GPU utilization seemed low, so I instrumented the retrain.py script per the instructions in Using GPU. The log verifies that the TF graph is being built on the GPU. I am retraining for a large number of classes and images. Please help me tweak the parameters in TF and the retraining script to better utilize the GPU.
I am aware of this question suggesting that I should decrease the batch size, but it is not obvious what constitutes "batch size" for this script. I have 60 classes and 1MM training images. It starts by making 1MM bottleneck files. That part is CPU-bound and slow, and I understand that. Then it trains for 4,000 steps, taking 100 images at a time in each step. Is this the batch? Will GPU utilization go up if I reduce the number of images per step?
Your help would be really appreciated!
I usually do the things below.
Check whether you are using the GPU.
tf.test.is_gpu_available()
Monitor GPU usage.
watch -n 0.1 nvidia-smi
If your CPU usage is low, add this right after
train_batches = train.shuffle(SHUFFLE_BUFFER_SIZE).batch(BATCH_SIZE)
train_batches = train_batches.prefetch(1) # This will prefetch one batch
If your GPU usage is still low, try increasing the batch size, for example:
batch_size = 128
If your GPU usage is still low after that, it may be that:
Your graph is too simple to keep the GPU busy.
There is a bug in your code or in a package.
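Putting the input-pipeline tips above together, a minimal sketch (the dataset here is only a placeholder):

import tensorflow as tf

# Placeholder dataset standing in for the real `train` dataset above.
train = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform([1024, 32, 32, 3]),
     tf.random.uniform([1024], maxval=10, dtype=tf.int32)))

SHUFFLE_BUFFER_SIZE = 1024
BATCH_SIZE = 128                      # larger batches, while GPU memory allows

# tf.data.AUTOTUNE needs TF 2.4+; on older versions use
# tf.data.experimental.AUTOTUNE instead.
train_batches = (train
                 .shuffle(SHUFFLE_BUFFER_SIZE)
                 .batch(BATCH_SIZE)
                 .prefetch(tf.data.AUTOTUNE))   # keep the GPU fed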
Let's go through your questions one by one:
Batch size is the number of images on which the training/testing/validation is done at a time. You can find the respective parameters and their default values defined in the script:
parser.add_argument(
    '--train_batch_size',
    type=int,
    default=100,
    help='How many images to train on at a time.'
)
parser.add_argument(
    '--test_batch_size',
    type=int,
    default=-1,
    help="""\
    How many images to test on. This test set is only used once, to evaluate
    the final accuracy of the model after training completes.
    A value of -1 causes the entire test set to be used, which leads to more
    stable results across runs.\
    """
)
parser.add_argument(
    '--validation_batch_size',
    type=int,
    default=100,
    help="""\
    How many images to use in an evaluation batch. This validation set is
    used much more often than the test set, and is an early indicator of how
    accurate the model is during training.
    A value of -1 causes the entire validation set to be used, which leads to
    more stable results across training iterations, but may be slower on large
    training sets.\
    """
)
So if you want to decrease the training batch size, you should run the script with this parameter, among others:
python -m retrain --train_batch_size=16
I also recommend specifying the batch size as a power of 2 (16, 32, 64, 128, ...). The right value depends on the GPU you are using: the less memory the GPU has, the smaller the batch size you should use. With 8 GB of GPU memory, you can try a batch size of 16.
To discover whether you are using GPUs at all, you can follow the steps in the TensorFlow documentation you mentioned - just put
tf.debugging.set_log_device_placement(True)
as the first statement of your script.
Device placement logging causes any tensor allocations or operations to be printed.
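For example, a minimal script along these lines will log where each op is placed:

import tensorflow as tf

tf.debugging.set_log_device_placement(True)    # must run before ops are created

# TF 2.x eager execution: placement of these ops is logged as they run.
a = tf.constant([[1.0, 2.0, 3.0]])
b = tf.constant([[1.0], [2.0], [3.0]])
c = tf.matmul(a, b)                            # placement is logged to stderr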
I am currently training a recurrent net on Tensorflow for a text classification problem and am running into performance and out of memory issues. I am on AWS g2.8xlarge with Ubuntu 14.04, and a recent nightly build of tensorflow (which I downloaded on Aug 25).
1) Performance issue:
On the surface, both the CPU and GPU are highly under-utilized. I've run multiple tests on this (and have used line_profiler and memory_profiler in the process). Train durations scale linearly with number of epochs, so I tested with 1 epoch. For RNN config = 1 layer, 20 nodes, training time = 146 seconds.
Incidentally, that number is about 20 seconds higher/slower than the same test run on a g2.2xlarge!
Here is a snapshot of System Monitor and nvidia-smi (updated every 2 seconds) about 20 seconds into the run:
[Screenshot: System Monitor and nvidia-smi early in the run]
As you can see, GPU utilization is at 19%. When I use nvprof, I find that the total GPU process time is about 27 seconds or so. Also, except for one vCPU, all others are very under-utilized. The numbers stay around this level, till the end of the epoch where I measure error across the entire training set, sending GPU utilization up to 45%.
Unless I am missing something, on the surface, each device is sitting around waiting for something to happen.
2) Out of memory issue:
If I increase the number of nodes to 200, it gives me an out-of-memory error on the GPU side. As you can see from the above snapshots, only one of the four GPUs is used. I've found that the way to get TensorFlow to use a GPU has to do with how you assign the model. If you don't specify anything, TensorFlow will assign it to a GPU. If you specify a GPU, only that one will be used. TensorFlow does not like it when I assign the model to multiple devices with a "for d in ['/gpu:0',...]" loop; I get into an issue with re-using the embedding variable. I would like to use all 4 GPUs for this (without setting up distributed TensorFlow). Here is the snapshot of the out-of-memory error:
[Screenshot: the out-of-memory error]
Any suggestions you may have for both these problems would be greatly appreciated!
Re (1): to improve GPU utilization, did you try increasing the batch size and/or shortening the sequences you use for training?
Re (2): to use multiple GPUs you do need to manually assign the ops to GPU devices, I believe. The right way is to place ops on specific GPUs by doing something like
with g.device("/gpu:0"):
    ...  # build this tower's ops here
with g.device("/gpu:1"):
    ...  # build this tower's ops here
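And since the embedding variable is what trips up the multi-GPU loop, something along these lines (TF1-style graph code, with hypothetical names and shapes) is a common way to share it across the towers:

import tensorflow as tf  # TF1-style graph code, matching the snippet above

g = tf.Graph()
with g.as_default():
    with tf.variable_scope("model") as scope:
        for i, d in enumerate(["/gpu:0", "/gpu:1"]):
            with g.device(d):
                if i > 0:
                    scope.reuse_variables()        # reuse, don't re-create
                embedding = tf.get_variable("embedding", shape=[10000, 128])
                # ... build the rest of this tower's ops here ...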