We have a machine running Ubuntu 18.04 with an RTX 2080 Ti GPU that about 3-4 users access remotely. Is it possible to set a maximum GPU usage threshold per user (say 60%) so that the others can use the rest?
We are running TensorFlow deep learning models, if that helps in suggesting an alternative.
My apologies for taking so long to come back here to answer the question, even after figuring out a way to do this.
It is indeed possible to cap GPU memory usage with TensorFlow's per_process_gpu_memory_fraction. [Hence I edited the question]
The following snippet assigns 46% of GPU memory to the user.
import tensorflow as tf

# Allow this process to use at most 46% of the GPU memory.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.46)
init = tf.global_variables_initializer()

with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) as sess:
    sess.run(init)
    ############
    # Training happens here
    ############
Currently, we have 2 users using the same GPU simultaneously without any issues; we assigned 46% to each. Make sure you don't split it 50-50% (an "aborted, core dumped" error occurs if you do). Try to leave around 300 MB of memory idle.
As a matter of fact, this GPU division does not slow down training. Surprisingly, it offers about the same speed as if the full memory were used, at least in our experience, although this may change with high-dimensional data.
I'm using a GPU to train quite a lot of models. I want to tune the architecture of the network, so I train different models sequentially to compare their performances (I'm using keras-tuner).
The problem is that some models are very small, and others are very large. I don't want to allocate all the GPU memory to my trainings, only the quantity I need. I've set TF_FORCE_GPU_ALLOW_GROWTH to true, meaning that when a model requires a large amount of memory, the GPU allocates it. However, once that big model has been trained, the memory is not released, even if the subsequent trainings involve tiny models.
Is there a way to force the GPU to release unused memory? Something like TF_FORCE_GPU_ALLOW_SHRINK?
Automatic shrinking might be difficult to achieve. If so, I would be happy with a manual release that I could add to a callback run after each training.
You can try limiting GPU memory growth using this code:
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Allocate GPU memory on demand instead of reserving it all up front.
    tf.config.experimental.set_memory_growth(gpus[0], True)
The second method is to configure a virtual GPU device with tf.config.set_logical_device_configuration and set a hard limit on the total memory to allocate on the GPU.
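A minimal sketch of that second approach, assuming TF 2.x (the 2 GB memory_limit is just an example value):

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Cap this process at roughly 2 GB on the first GPU (adjust memory_limit, in MB, as needed).
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=2048)])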
Please check this link for more details.
The problem
I am currently working on a project that I sadly can't share with you. The project is about hyper-parameter optimization for neural networks, and it requires that I train multiple neural network models (more than I can store on my GPU) in parallel. The network architectures stay the same, but the network parameters and hyper-parameters are subject to change between each training interval. I am currently achieving this using PyTorch in a Linux environment so that my NVIDIA GTX 1660 (6 GB RAM) can use the multiprocessing feature that PyTorch provides.
Code (simplified):
def training_function(checkpoint):
    load(checkpoint)
    train(checkpoint)
    unload(checkpoint)
    return checkpoint  # the trained checkpoint is passed back through the pool

for step in range(training_steps):
    trained_checkpoints = list()
    for trained_checkpoint in pool.imap_unordered(training_function, checkpoints):
        trained_checkpoints.append(trained_checkpoint)
    for optimized_checkpoint in optimize(trained_checkpoints):
        checkpoints.update(optimized_checkpoint)
I currently test with a population of 30 neural networks (i.e. 30 checkpoints) on the MNIST and FashionMNIST datasets, each consisting of 70,000 28x28 single-channel images (50k training, 10k validation, 10k testing). The network I train is a simple LeNet-5 implementation.
I use a torch.multiprocessing pool and allow 7 processes to be spawned. Each process uses some of the available GPU memory just to initialize its CUDA environment. After training, the checkpoints are adapted with my hyper-parameter optimization technique.
The load function in training_function loads the model and optimizer states (which hold the network parameter tensors) from a local file into GPU memory using torch.load. The unload function saves the newly trained states back to file using torch.save and deletes them from memory. I do this because PyTorch frees GPU tensors only when no variable references them, and I have limited GPU memory.
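A rough sketch of what load and unload do under those assumptions (build_model, the checkpoint attributes, and checkpoint.path are hypothetical names, not the author's actual code):

import torch

def load(checkpoint, device='cuda'):
    # Rebuild the network and read the saved states from disk onto the GPU.
    state = torch.load(checkpoint.path, map_location=device)
    checkpoint.model = build_model().to(device)  # build_model is hypothetical
    checkpoint.model.load_state_dict(state['model'])
    checkpoint.optimizer = torch.optim.SGD(checkpoint.model.parameters(), lr=checkpoint.lr)
    checkpoint.optimizer.load_state_dict(state['optimizer'])

def unload(checkpoint):
    # Save the newly trained states back to file and drop the GPU references
    # so PyTorch can free the tensors.
    torch.save({'model': checkpoint.model.state_dict(),
                'optimizer': checkpoint.optimizer.state_dict()},
               checkpoint.path)
    del checkpoint.model, checkpoint.optimizer
    torch.cuda.empty_cache()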
The current setup works, but each CUDA initialization occupies over 700 MB of GPU RAM, so I am interested in other ways of doing this that may use less memory without a penalty to efficiency.
My attempts
I suspected I could use a thread pool to save some memory, and it did help. By spawning 7 threads instead of 7 processes, CUDA was only initialized once, which saved almost half of my memory. However, this led to a new problem: GPU utilization dropped to approximately 30% according to nvidia-smi, which I monitor in a separate Linux terminal. Without threads, I get around 85-90% utilization.
I also messed around with torch.multiprocessing.set_sharing_strategy which is currently set to 'file_descriptor', but with no luck.
My questions
Is there a better way to work with multiple model- and optimizer states without saving and loading them to files while training? I have tried to move the model to CPU using model.cpu() before saving the state_dict, but this did not work in my implementation (memory leaks).
Is there an efficient way I can train multiple neural networks at the same time that uses less GPU memory? When searching the web, I only find references to nn.DataParallel which trains the same model over multiple GPUs by copying it to each GPU. This does not work for my problem.
I will soon have access to multiple, more powerful GPUs with more memory, and I suspect this problem will be less annoying then, but I wouldn't be surprised if there is a better solution I am not getting.
Update (09.03.2020)
For any future readers: if you set out to do something similar to the pseudo code displayed above and you plan on using multiple GPUs, make sure to create one multiprocessing pool per GPU device. A pool does not assign tasks to its underlying processes in any fixed order, so the same process may end up initializing CUDA for several devices, wasting memory.
Another important note: even if you pass the device (e.g. 'cuda:1') to every torch.cuda function, you may discover that torch does something with the default CUDA device 'cuda:0' somewhere in the code, initializing CUDA on that device in every process and wasting memory on an unwanted and unneeded CUDA initialization. I fixed this issue by wrapping the entire training_function in with torch.cuda.device(device_id).
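A sketch of that fix, assuming the simplified training_function from above (load, train, and unload are the author's helpers; the device_id argument is an assumption):

import torch

def training_function(checkpoint, device_id):
    # Pin every CUDA call inside this function to the intended device so that
    # nothing falls back to 'cuda:0' and initializes an extra context there.
    with torch.cuda.device(device_id):
        load(checkpoint)
        train(checkpoint)
        unload(checkpoint)
        return checkpoint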
I ended up not using multiprocessing pools and instead defined my own custom process class that holds the device and the training function. This means I have to maintain an in-queue for each device process, but they all share the same out-queue, so I can retrieve results the moment they are available. I figured writing a custom process class was simpler than writing a custom pool class. I desperately tried to keep using pools, as they are easy to maintain, but I had to use multiple imap functions, so the results were not obtainable one at a time, which led to a less efficient training loop.
I am now successfully training on multiple GPUs, but my questions posted above still remain unanswered.
Update (10.03.2020)
I have implemented another way to store the model and optimizer state dicts outside of GPU RAM. I have written a function that replaces every tensor in the dicts with its .to('cpu') equivalent. This costs me some CPU memory, but it is more reliable than storing local files.
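A minimal sketch of that replacement function (the function name is hypothetical):

import torch

def state_dict_to_cpu(obj):
    # Recursively copy every tensor in a (possibly nested) state dict to CPU memory.
    if torch.is_tensor(obj):
        return obj.to('cpu')
    if isinstance(obj, dict):
        return {key: state_dict_to_cpu(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(state_dict_to_cpu(value) for value in obj)
    return obj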
Update (11.06.2020)
I have still not found a different approach that leads to fewer CUDA initializations while maintaining the same processing speed. From what I've read and come to understand, PyTorch does not interfere much with how CUDA operates and leaves that up to NVIDIA.
I have ended up using a pool of custom, device-specific processes, called Workers, that is maintained by my custom pool class (more about this above). In addition, I let each of these Workers take in one or more checkpoints, as well as the function that processes them (training, testing, hp optimization), via a Queue. These checkpoints are then processed concurrently via a Python multiprocessing ThreadPool in each Worker, and the results are returned one by one via the return Queue the moment they are ready.
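A rough sketch of that Worker arrangement (the class, queue, and process_fn names are hypothetical, not the author's actual code):

import torch
import torch.multiprocessing as mp
from multiprocessing.pool import ThreadPool

class Worker(mp.Process):
    def __init__(self, device_id, in_queue, out_queue, num_threads=7):
        super().__init__()
        self.device_id = device_id
        self.in_queue = in_queue      # device-specific task queue
        self.out_queue = out_queue    # result queue shared by all Workers
        self.num_threads = num_threads

    def run(self):
        # CUDA is initialized once per Worker, on its own device only.
        with torch.cuda.device(self.device_id):
            pool = ThreadPool(self.num_threads)
            while True:
                task = self.in_queue.get()
                if task is None:      # sentinel: shut the Worker down
                    break
                process_fn, checkpoints = task
                # Process the checkpoints concurrently and push each result to
                # the shared out-queue as soon as it is ready.
                for result in pool.imap_unordered(process_fn, checkpoints):
                    self.out_queue.put(result)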
This gives me the parallel procedure I needed, but the memory issue is still there. Due to time constraints, I have come to terms with it for now.
I am conducting research that requires me to know the memory used at run time when I run a deep learning (CNN) model in Google Colab. Is there any code I can use to find this out? Basically, I want to know how much memory has been used over the whole model run (after all epochs have completed). I am coding in Python.
As explained in this post, and from my own observations, Tensorflow always tries to allocate the entire GPU memory, no matter how small or big your model is, unlike, for example, MXNet, which only allocates enough memory to run the model.
Diving a little deeper, I learned that this is indeed the default behaviour in Tensorflow: use all available RAM to speed things up.
Fair enough :)
You might think more memory allocation means faster training, but that's not the case most of the time. You can restrict TF's memory usage, as shown in the following code:
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

config = tf.ConfigProto()
# Use at most 90% of the memory of the first GPU only.
config.gpu_options.per_process_gpu_memory_fraction = 0.9
config.gpu_options.visible_device_list = "0"
set_session(tf.Session(config=config))
Here is the Tensorflow documentation if you need more details on how to set restrictions on TF memory usage.
I'm running Tensorflow Object Detection API to train my own detector using the object_detection/train.py script, found here. The problem is that I'm getting CUDA_ERROR_OUT_OF_MEMORY constantly.
I found some suggestions to reduce the batch size so the trainer consumes less memory, but I reduced it from 16 to 4 and I'm still getting the same error. The difference is that with batch_size=16 the error was thrown around step ~18, and now it is being thrown around step ~70. EDIT: setting batch_size=1 didn't solve the problem either, as I still got the error at step ~2700.
What can I do to make it run smoothly until I stop the training process? I don't really need fast training.
EDIT:
I'm currently using a GTX 750 Ti 2GB for this. The GPU is not being used for anything else than training and providing monitor image. Currently, I'm using only 80 images for training and 20 images for evaluation.
I think it is not about batch_size, because you were able to start the training in the first place.
Open a terminal and run
nvidia-smi -l
to check whether other processes kick in when this error happens. If you set batch_size=16, you can find out pretty quickly.
Found the solution to my problem. The batch_size was not the problem, but a higher batch_size made the training memory consumption increase faster, because I was using the config.gpu_options.allow_growth = True configuration. This setting allows Tensorflow to increase memory consumption when needed and tries to use up to 100% of GPU memory.
The problem was that I was running the eval.py script at the same time (as recommended in their tutorial) and it was using part of the GPU memory. When the train.py script tried to use all 100%, the error was thrown.
I solved it by setting the maximum usage percentage to 70% for the training process. This also solved the problem of stuttering while training. It may not be the optimum value for my GPU, but it is configurable, e.g. with the config.gpu_options.per_process_gpu_memory_fraction = 0.7 setting.
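For reference, a generic TF 1.x sketch of that configuration; where exactly you apply it depends on your training script (the Object Detection API builds its own session config):

import tensorflow as tf

# Grow memory usage as needed, but never beyond 70% of the GPU.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.per_process_gpu_memory_fraction = 0.7
sess = tf.Session(config=config)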
Another option is to dedicate the GPU for training and use the CPU for evaluation.
Disadvantage: Evaluation will consume a large portion of your CPU, but only for a few seconds every time a training checkpoint is created, which is not often.
Advantage: 100% of your GPU is used for training all the time
To target CPU, set this environment variable before you run the evaluation script:
export CUDA_VISIBLE_DEVICES=-1
You can explicitly set the evaluation batch size to 1 in pipeline.config to consume less memory:
eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  batch_size: 1
}
If you're still having issues, TensorFlow may not be releasing GPU memory between training runs. Try restarting your terminal or IDE and running it again. This answer has more details.
I am currently training a recurrent net on Tensorflow for a text classification problem and am running into performance and out of memory issues. I am on AWS g2.8xlarge with Ubuntu 14.04, and a recent nightly build of tensorflow (which I downloaded on Aug 25).
1) Performance issue:
On the surface, both the CPU and GPU are highly under-utilized. I've run multiple tests on this (and have used line_profiler and memory_profiler in the process). Train durations scale linearly with number of epochs, so I tested with 1 epoch. For RNN config = 1 layer, 20 nodes, training time = 146 seconds.
Incidentally, that number is about 20 seconds higher/slower than the same test run on a g2.2xlarge!
Here is a snapshot of System Monitor and nvidia-smi (updated every 2 seconds) about 20 seconds into the run:
[Screenshot: System Monitor and nvidia-smi, early part of the run]
As you can see, GPU utilization is at 19%. When I use nvprof, I find that the total GPU process time is about 27 seconds or so. Also, except for one vCPU, all others are very under-utilized. The numbers stay around this level, till the end of the epoch where I measure error across the entire training set, sending GPU utilization up to 45%.
Unless I am missing something, on the surface, each device is sitting around waiting for something to happen.
2) Out of memory issue:
If I increase the number of nodes to 200, it gives me an Out of Memory error on the GPU side. As you can see from the above snapshots, only one of the four GPUs is used. I've found that the way to get tensorflow to use a GPU has to do with how you assign the model. If you don't specify anything, tensorflow will assign it to a GPU. If you specify a GPU, only that one will be used. Tensorflow does not like it when I assign the model to multiple devices with a "for d in ['/gpu:0',...]" loop; I run into an issue with re-using the embedding variable. I would like to use all 4 GPUs for this (without setting up distributed tensorflow). Here is the snapshot of the Out of memory error:
[Screenshot: Out of memory error]
Any suggestions you may have for both these problems would be greatly appreciated!
Re (1), to improve GPU utilization, did you try increasing the batch size and/or shortening the sequences you use for training?
Re (2), to use multiple GPUs you do need to manually assign the ops to GPU devices, I believe. The right way is to place ops on specific GPUs by doing
with g.device("/gpu:0"):
    ...  # ops built in this scope are placed on the first GPU
with g.device("/gpu:1"):
    ...  # ops built in this scope are placed on the second GPU
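For example, a minimal TF 1.x sketch of manual placement (the graph contents and layer shapes here are hypothetical, just to show the pattern):

import tensorflow as tf

g = tf.Graph()
with g.as_default():
    with g.device("/gpu:0"):
        # Ops built in this scope run on the first GPU.
        x0 = tf.random_normal([128, 256])
        h0 = tf.layers.dense(x0, 128, name="tower_0")
    with g.device("/gpu:1"):
        # Ops built in this scope run on the second GPU.
        x1 = tf.random_normal([128, 256])
        h1 = tf.layers.dense(x1, 128, name="tower_1")
    init = tf.global_variables_initializer()

# allow_soft_placement lets TF fall back to another device if a GPU is missing.
with tf.Session(graph=g, config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(init)
    sess.run([h0, h1])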