GPU utilization 0% during TensorFlow retraining for poets - python

I am following the instructions for TensorFlow Retraining for Poets. GPU utilization seemed low, so I instrumented the retrain.py script per the instructions in Using GPU. The log verifies that the TF graph is being built on the GPU. I am retraining for a large number of classes and images. Please help me tweak the parameters in TF and the retraining script so that the GPU is actually utilized.
I am aware of this question suggesting that I should decrease the batch size, but it is not obvious what constitutes the "batch size" for this script. I have 60 classes and 1MM training images. The script starts by making 1MM bottleneck files; that part is CPU-bound and slow, which I understand. It then trains for 4,000 steps, taking 100 images per step. Is this the batch? Will GPU utilization go up if I reduce the number of images per step?
Your help would be really appreciated!

I usually do the checks below.
Check whether TensorFlow can see the GPU:
tf.test.is_gpu_available()
Monitor GPU usage.
watch -n 0.1 nvidia-smi
If your GPU usage is low, add this after
train_batches = train.shuffle(SHUFFLE_BUFFER_SIZE).batch(BATCH_SIZE)
train_batches = train_batches.prefetch(1) # This will prefetch one batch
If your GPU usage is still low, try a larger batch size:
batch_size = 128
If your GPU utilization is still low, it may be that:
your graph is too simple to use more of the GPU, or
there is a bug in your code or in a package.
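For context, here is a minimal, self-contained tf.data sketch of the shuffle/batch/prefetch pattern above; the synthetic dataset, SHUFFLE_BUFFER_SIZE and BATCH_SIZE are placeholder values, not something taken from the retrain script:
import tensorflow as tf

# Placeholder values; tune these for your own data and GPU memory.
SHUFFLE_BUFFER_SIZE = 1000
BATCH_SIZE = 128

# Tiny synthetic dataset standing in for real training images and labels.
images = tf.random.uniform([256, 64, 64, 3])
labels = tf.random.uniform([256], maxval=60, dtype=tf.int32)
train = tf.data.Dataset.from_tensor_slices((images, labels))

# Shuffle and batch, then prefetch so the CPU prepares the next batch
# while the GPU is still busy with the current one.
train_batches = train.shuffle(SHUFFLE_BUFFER_SIZE).batch(BATCH_SIZE)
train_batches = train_batches.prefetch(1)  # tf.data.AUTOTUNE also works here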

Let's go through your questions one by one:
Batch size is the number of images processed at a time during training/testing/validation. You can find the respective parameters and their default values defined in the script:
parser.add_argument(
    '--train_batch_size',
    type=int,
    default=100,
    help='How many images to train on at a time.'
)
parser.add_argument(
    '--test_batch_size',
    type=int,
    default=-1,
    help="""\
How many images to test on. This test set is only used once, to evaluate
the final accuracy of the model after training completes.
A value of -1 causes the entire test set to be used, which leads to more
stable results across runs.\
"""
)
parser.add_argument(
    '--validation_batch_size',
    type=int,
    default=100,
    help="""\
How many images to use in an evaluation batch. This validation set is
used much more often than the test set, and is an early indicator of how
accurate the model is during training.
A value of -1 causes the entire validation set to be used, which leads to
more stable results across training iterations, but may be slower on large
training sets.\
"""
)
So if you want to decrease the training batch size, run the script with this parameter (among others):
python -m retrain --train_batch_size=16
I also recommend specifying the batch size as a power of 2 (16, 32, 64, 128, ...). The right value depends on the GPU you are using: the less memory the GPU has, the smaller the batch size you should use. With 8 GB of GPU memory, you can try a batch size of 16.
To discover whether you are using GPUs at all, you can follow the steps in the TensorFlow documentation you mentioned - just put
tf.debugging.set_log_device_placement(True)
as the first statement of your script.
Device placement logging causes any tensor allocations or operations to be printed, along with the device they were assigned to.
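For example, here is a minimal TF 2.x sketch (a toy matmul, not the retrain script itself) showing what that logging looks like:
import tensorflow as tf

# Must run before any ops are created so placements get logged.
tf.debugging.set_log_device_placement(True)

# Each op below prints the device it was placed on, e.g. .../device:GPU:0.
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[1.0, 1.0], [1.0, 1.0]])
c = tf.matmul(a, b)
print(c)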

Related

Cleaning Google TPU memory (python)

My Python code has two steps. In each step, I train a neural network (primarily using from mesh_transformer.transformer_shard import CausalTransformer) and delete the network before the next step, in which I train another network with the same function. The problem is that in some cases I receive this error:
Resource exhausted: Failed to allocate request for 32.00MiB (33554432B) on device ordinal 0: while running replica 0 and partition 0 of a replicated computation (other replicas may have failed as well).
I think there is still something left in TPU memory besides that network that I need to remove. The two steps are independent and don't share any information or variables, but I have to run them sequentially to manage my storage on Google Cloud. When I run the two steps separately, everything works fine. Is there any way to clean TPU memory thoroughly before going to the next step of my code? Just deleting the network does not seem to be enough.
Unfortunately, you can't clean TPU memory directly, but you can reduce memory usage with the options below.
The most effective ways to reduce memory usage are to:
Reduce excessive tensor padding
Tensors in TPU memory are padded, that is, the TPU rounds up the sizes of tensors stored in memory to perform computations more efficiently. This padding happens transparently at the hardware level and does not affect results. However, in certain cases the padding can result in significantly increased memory use and execution time.
Reduce the batch size
Slowly reduce the batch size until it fits in memory, making sure that the total batch size is a multiple of 64 (the per-core batch size has to be a multiple of 8). Keep in mind that larger batch sizes are more efficient on the TPU. A total batch size of 1024 (128 per core) is generally a good starting point.
If the model cannot be run on the TPU even with a small batch size (for example, 64), try reducing the number of layers or the layer sizes.
You can read more about troubleshooting in this documentation.
You can also try to clean the TPU state after each training step with a tf.tpu.experimental.shutdown_tpu_system() call and see if that helps.
Another option is to restart the TPU to clean the memory. First install the client:
pip3 install cloud-tpu-client
Then, in Python:
import tensorflow as tf
from cloud_tpu_client import Client
print(tf.__version__)
Client().configure_tpu_version(tf.__version__, restart_type='always')
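As a rough sketch of the shutdown/re-initialize approach between your two steps (assuming a TF 2.x Cloud TPU setup; the tpu= argument and the two train_* functions are placeholders for your own code):
import tensorflow as tf

def train_first_network():
    pass  # placeholder: build and train your first model here

def train_second_network():
    pass  # placeholder: build and train your second model here

# Hypothetical resolver; adjust the tpu= argument for your own Cloud TPU.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

train_first_network()

# Tear the TPU system down and bring it back up before the second step,
# so allocations from the first network do not linger.
tf.tpu.experimental.shutdown_tpu_system(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

train_second_network()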

TensorFlow's Mirrored strategy, batch size and Back Propagation

I'm dealing with training a neural network on a multi-GPU server. I'm using the MirroredStrategy API from TensorFlow 2.1 and I'm getting a little confused.
I have 8 GPUs (Nvidia V100, 32 GB each).
I'm specifying a batch size of 32. How is it managed? Will each GPU get a batch of 32 samples, or should I specify 256 (32x8) as the batch size?
When and how is back-propagation applied? I've read that MirroredStrategy is synchronous: does that imply that after the forward step all batches are grouped into one batch of size 32x8 and back-propagation is applied once? Or is back-propagation applied once for each batch of size 32, in a sequential manner?
I really want to be sure about the experiments I submit to the server, since each training job is very time consuming, and having the batch size (and back-propagation) change based on the number of available GPUs affects the correctness of the results.
Thank you for any help provided.
When using MirroredStrategy, the batch size refers to the global batch size, as you can see in the docs here:
For instance, if using MirroredStrategy with 2 GPUs, each batch of size 10 will get divided among the 2 GPUs, with each receiving 5 input examples in each step.
So in your case if you want each GPU to process 32 samples per step, you can set the batch size as 32 * strategy.num_replicas_in_sync.
Each GPU will compute the forward and backward passes through the model on a different slice of the input data. The computed gradients from each of these slices are then aggregated across all of the devices and reduced (usually an average) in a process known as AllReduce. The optimizer then performs the parameter updates with these reduced gradients thereby keeping the devices in sync.
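Here is a minimal Keras sketch of that scaling (the toy model and random data are placeholders, not your actual network):
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # uses all visible GPUs by default

# 32 samples per replica; the global batch size scales with the GPU count.
PER_REPLICA_BATCH_SIZE = 32
global_batch_size = PER_REPLICA_BATCH_SIZE * strategy.num_replicas_in_sync

with strategy.scope():
    # Toy model standing in for the real network.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer='adam', loss='mse')

# Placeholder data; each step feeds global_batch_size samples, which
# MirroredStrategy splits evenly across replicas, then the per-replica
# gradients are AllReduced before the single synchronous update.
x = tf.random.uniform([1024, 10])
y = tf.random.uniform([1024, 1])
model.fit(x, y, batch_size=global_batch_size, epochs=1)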

Allowing Tensorflow to Use both GPU and Physical System Memory

For a project I'm working on, I am using an altered version of Mask RCNN to train a model that will find objects in an image. These images are relatively small, about 300 x 200 pixels, and I train them for a relatively long time, around 100 epochs.
However, my main question relates to the batch size and how TensorFlow allocates memory on the GPU for the validation stage of each epoch. I want to increase my batch size to help smooth out the validation curve and increase the accuracy of the overall model. However, if I increase my batch size too drastically, I get a GPU out-of-memory (OOM) and keras_scratch_graph error. I'm currently working with two NVIDIA Quadro P5000s that have 16 GB of VRAM each; at about 3 images per GPU, I can have a maximum batch size of 6 before it errors out. I've looked around, and most people either say to just decrease the batch size, which I would prefer not to do, or to enable GPU memory growth, which I couldn't get to work either. I could decrease the complexity of my model to reduce the size of the tensors being evaluated, but I don't want to risk it, as that could cause my accuracy to decrease or my loss to increase.
Is there a way that I can offload some images onto my physical system's memory, or am I purely limited to the amount of RAM available on my GPU? Are there any more compact or robust methods out there that could solve this issue?
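For reference, "enable GPU memory growth" usually refers to the setting below (a TF 2.x sketch; in TF 1.x the equivalent is a ConfigProto with allow_growth=True). It does not let TensorFlow spill tensors into system RAM, but it stops TensorFlow from reserving all of the VRAM up front:
import tensorflow as tf

# Ask TensorFlow to allocate GPU memory as needed rather than grabbing
# the whole card at startup. Must run before any GPUs are initialized.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)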

Training in TensorFlow CPU version too slow

I have installed the TensorFlow CPU version. I have only a few images as a dataset, and I am training on a machine with 4 GB RAM and a Core i5 3340M at 2.70 GHz with a batch size of 1, and it is still extremely slow. The size of all images is the same (200x185, I think). Will it train like this? Kindly tell me how I can speed up this process.
[Screenshot: training process]
If your network is deep, it could take a long time to train on a CPU, since a CPU is not optimized for this kind of computation the way a GPU is.
I would suggest you get a graphics card; even an old one can significantly improve performance (it could be around 100x faster).
Let's put some numbers here. You are dealing with images of size 200x185, so we are talking about 37,000 features if we deal with gray levels; if we deal with RGB, multiply that by 3. How many images are you using for training? Keep in mind that SGD (Stochastic Gradient Descent, mini-batch size = 1) tends to be very slow for big datasets. Give us some numbers: how many training images, and what does "slow" mean (how much time for one epoch)? Other details (programming language, library (TensorFlow, etc.), optimizer, etc.) would help us judge whether your code is "slow" and whether it can be made faster.
Batch size is another parameter that affects training time: a larger batch size reduces the time per epoch, but may require more epochs to reach the same result as batch size 1.
And if your network is deep (a CNN, etc.), you should run it on a GPU.
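As a rough illustration of the batch-size point above, here is a toy Keras sketch with a hypothetical small CNN on 200x185 grayscale inputs (not the asker's actual model); raising batch_size from 1 to, say, 32 typically shortens each epoch on the same hardware:
import tensorflow as tf

# Hypothetical small CNN, just to show where the batch size is passed.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation='relu', input_shape=(200, 185, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

x = tf.random.uniform([64, 200, 185, 1])                # placeholder images
y = tf.random.uniform([64], maxval=10, dtype=tf.int32)  # placeholder labels

# batch_size=1 (pure SGD) tends to be slow per epoch;
# batch_size=32 covers the same epoch in far fewer (larger) steps.
model.fit(x, y, batch_size=32, epochs=1)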

AWS g2.8xlarge performance and out of memory issues when using tensorflow

I am currently training a recurrent net on Tensorflow for a text classification problem and am running into performance and out of memory issues. I am on AWS g2.8xlarge with Ubuntu 14.04, and a recent nightly build of tensorflow (which I downloaded on Aug 25).
1) Performance issue:
On the surface, both the CPU and GPU are highly under-utilized. I've run multiple tests on this (and have used line_profiler and memory_profiler in the process). Train durations scale linearly with number of epochs, so I tested with 1 epoch. For RNN config = 1 layer, 20 nodes, training time = 146 seconds.
Incidentally, that number is about 20 seconds higher/slower than the same test run on a g2.2xlarge!
Here is a snapshot of System Monitor and nvidia-smi (updated every 2 seconds) about 20 seconds into the run:
[Screenshot: System Monitor and nvidia-smi early in the run]
As you can see, GPU utilization is at 19%. When I use nvprof, I find that the total GPU process time is about 27 seconds or so. Also, except for one vCPU, all others are very under-utilized. The numbers stay around this level, till the end of the epoch where I measure error across the entire training set, sending GPU utilization up to 45%.
Unless I am missing something, on the surface, each device is sitting around waiting for something to happen.
2) Out of memory issue:
If I increase the number of nodes to 200, I get an Out of Memory error on the GPU side. As you can see from the above snapshots, only one of the four GPUs is used. I've found that the way to get TensorFlow to use a GPU has to do with how you assign the model. If you don't specify anything, TensorFlow will assign it to a single GPU. If you specify a GPU, only that one will be used. TensorFlow does not like it when I assign the model to multiple devices with a "for d in ['/gpu:0',...]" loop; I run into an issue with re-using the embedding variable. I would like to use all 4 GPUs for this (without setting up distributed TensorFlow). Here is the snapshot of the out-of-memory error:
[Screenshot: out-of-memory error]
Any suggestions you may have for both these problems would be greatly appreciated!
Re (1), to improve GPU utilization did you try increasing the batch size and / or shortening the sequences you use for training?
Re (2), to use multiple GPUs you do need to manually assign the ops to GPU devices, I believe. The right way is to place ops on specific GPUs by doing
with g.device("/gpu:0"):
    ...
with g.device("/gpu:1"):
    ...
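A minimal TF 1.x-style sketch of that pattern (the toy matmul towers and the 'model' variable scope are placeholders, not the asker's RNN; the reuse flag is what avoids the embedding-variable reuse error mentioned above):
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

x = tf.compat.v1.placeholder(tf.float32, shape=[None, 4])
outputs = []
for i, device in enumerate(['/gpu:0', '/gpu:1']):
    with tf.device(device):
        # reuse=True on later towers shares the variables created on the
        # first tower (e.g. an embedding) instead of recreating them.
        with tf.compat.v1.variable_scope('model', reuse=(i > 0)):
            w = tf.compat.v1.get_variable('w', shape=[4, 2])
            outputs.append(tf.matmul(x, w))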
