Cleaning Google TPU memory (python) - python

My python code has two steps. In each step I train a neural network (primarily using `from mesh_transformer.transformer_shard import CausalTransformer`) and delete the network before the next step, in which I train another network with the same function. The problem is that in some cases I receive this error:
Resource exhausted: Failed to allocate request for 32.00MiB (33554432B) on device ordinal 0: while running replica 0 and partition 0 of a replicated computation (other replicas may have failed as well).
I think there is still something left in TPU memory besides that network that I need to remove. The point here is that both steps are independent; they don't share any information or variables. But I have to run them sequentially to manage my storage on Google Cloud. Also, when I run the two steps separately, everything works fine. Is there any way to clean TPU memory thoroughly before going on to the next step of my code? I think just removing the network is not enough.

Unfortunately, you can't explicitly clean the TPU memory, but you can reduce memory usage with the options below.
The most effective ways to reduce memory usage are to:
Reduce excessive tensor padding
Tensors in TPU memory are padded, that is, the TPU rounds up the sizes of tensors stored in memory to perform computations more efficiently. This padding happens transparently at the hardware level and does not affect results. However, in certain cases the padding can result in significantly increased memory use and execution time.
Reduce the batch size
Slowly reduce the batch size until it fits in memory, making sure that the total batch size is a multiple of 64 (the per-core batch size has to be a multiple of 8). Keep in mind that larger batch sizes are more efficient on the TPU. A total batch size of 1024 (128 per core) is generally a good starting point.
If the model cannot be run on the TPU even with a small batch size (for example, 64), try reducing the number of layers or the layer sizes.
You could read more about troubleshooting in this documentation

You can try cleaning the TPU state after each training run with a tf.tpu.experimental.shutdown_tpu_system() call and see if that helps.
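As a rough sketch (assuming TF 2.x and a Cloud TPU reachable via TPUClusterResolver; train_first_network and train_second_network are placeholders for your own steps), the pattern looks like this:
import tensorflow as tf

# connect to the TPU and initialize it for the first step
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
train_first_network()    # placeholder for step 1

# tear the TPU system down and bring it back up before step 2,
# so nothing from the first network lingers in TPU memory
tf.tpu.experimental.shutdown_tpu_system(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
train_second_network()   # placeholder for step 2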
Another option is to restart the TPU to clean the memory using:
pip3 install cloud-tpu-client
import tensorflow as tf
from cloud_tpu_client import Client
print(tf.__version__)
Client().configure_tpu_version(tf.__version__, restart_type='always')

Related

TensorFlow's Mirrored strategy, batch size and Back Propagation

I'm dealing with training a neural network on a multi-GPU server. I'm using the MirroredStrategy API from TensorFlow 2.1 and I'm getting a little confused.
I have 8 GPUs (Nvidia V100 32GB)
I'm specifying a batch size of 32 (how is it managed? Will each GPU get a batch of 32 samples? Should I specify 256 as the batch size, i.e. 32x8?)
When and how is back-propagation applied? I've read that MirroredStrategy is synchronous: does that imply that after the forward step all batches are grouped into one batch of size 32x8 and back-propagation is applied then? Or is back-propagation applied once for each batch of size 32 in a sequential manner?
I really want to be sure about the experiments I submit to the server, since each training job is very time consuming, and having the batch size (and back-propagation) change with the number of available GPUs affects the correctness of the results.
Thank you for any help provided.
When using MirroredStrategy, the batch size refers to the global batch size. You can see this in the docs here:
For instance, if using MirroredStrategy with 2 GPUs, each batch of size 10 will get divided among the 2 GPUs, with each receiving 5 input examples in each step.
So in your case if you want each GPU to process 32 samples per step, you can set the batch size as 32 * strategy.num_replicas_in_sync.
Each GPU will compute the forward and backward passes through the model on a different slice of the input data. The computed gradients from each of these slices are then aggregated across all of the devices and reduced (usually an average) in a process known as AllReduce. The optimizer then performs the parameter updates with these reduced gradients thereby keeping the devices in sync.
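As an illustrative sketch (features, labels and build_model() are placeholders, not from the question):
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()            # one replica per visible GPU
per_replica_batch_size = 32
global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync

# each replica receives per_replica_batch_size examples from every global batch
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(global_batch_size)

with strategy.scope():
    model = build_model()                              # placeholder model factory
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

model.fit(dataset, epochs=10)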

Training multiple neural networks asynchronously in parallel

The problem
I am currently working on a project that I sadly can't share with you. The project is about hyper-parameter optimization for neural networks, and it requires that I train multiple neural network models (more than I can store on my GPU) in parallel. The network architectures stay the same, but the network parameters and hyper-parameters are subject to change between each training interval. I am currently achieving this using PyTorch in a Linux environment in order to allow my NVIDIA GTX 1660 (6 GB RAM) to use the multiprocessing feature that PyTorch provides.
Code (simplified):
def training_function(checkpoint):
    load(checkpoint)
    train(checkpoint)
    unload(checkpoint)

for step in range(training_steps):
    trained_checkpoints = list()
    for trained_checkpoint in pool.imap_unordered(training_function, checkpoints):
        trained_checkpoints.append(trained_checkpoint)
    for optimized_checkpoint in optimize(trained_checkpoints):
        checkpoints.update(optimized_checkpoint)
I currently test with a population of 30 neural networks (i.e. 30 checkpoints) on the MNIST and FashionMNIST datasets, each of which consists of 70,000 (50k training, 10k validation, 10k testing) 28x28 single-channel images. The network I train is a simple LeNet-5 implementation.
I use a torch.multiprocessing pool and allow 7 processes to be spawned. Each process uses some of the GPU memory available just to initialize the CUDA environment in each process. After training, the checkpoints are adapted with my hyper-parameter optimization technique.
The load function in the training_function loads the model and optimizer state (which holds the network parameter tensors) from a local file into GPU memory using torch.load. The unload function saves the newly trained states back to file using torch.save and deletes them from memory. I do this because PyTorch only frees GPU tensors when no variable references them, and I have limited GPU memory.
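A rough sketch of what such load/unload helpers might look like (the checkpoint attributes and file layout here are assumptions, not the author's actual code):
import torch

def load(checkpoint, device='cuda'):
    # pull the saved states from disk straight onto the GPU
    state = torch.load(checkpoint.path, map_location=device)
    checkpoint.model.load_state_dict(state['model'])
    checkpoint.optimizer.load_state_dict(state['optimizer'])

def unload(checkpoint):
    # persist the freshly trained states, then drop the GPU references
    torch.save({'model': checkpoint.model.state_dict(),
                'optimizer': checkpoint.optimizer.state_dict()},
               checkpoint.path)
    del checkpoint.model, checkpoint.optimizer
    torch.cuda.empty_cache()   # return cached blocks to the driver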
The current setup works, but each CUDA initialization occupies over 700MB of GPU RAM, and so I am interested if there are other ways I could do this that may use less memory without a penalty to efficiency.
My attempts
I suspected I could use a thread pool in order to save some memory, and it did. By spawning 7 threads instead of 7 processes, CUDA was only initialized once, which saved almost half of my memory. However, this led to a new problem: the GPU only reached approx. 30% utilization according to nvidia-smi, which I am monitoring in a separate Linux terminal. Without threads, I get around 85-90% utilization.
I also messed around with torch.multiprocessing.set_sharing_strategy which is currently set to 'file_descriptor', but with no luck.
My questions
Is there a better way to work with multiple model- and optimizer states without saving and loading them to files while training? I have tried to move the model to CPU using model.cpu() before saving the state_dict, but this did not work in my implementation (memory leaks).
Is there an efficient way I can train multiple neural networks at the same time that uses less GPU memory? When searching the web, I only find references to nn.DataParallel which trains the same model over multiple GPUs by copying it to each GPU. This does not work for my problem.
I will soon have access to multiple, more powerful GPUs with more memory, and I suspect this problem will be less annoying then, but I wouldn't be surprised if there is a better solution I am not getting.
Update (09.03.2020)
For any future readers, if you set out to do something similar to the pseudo code displayed above and you plan on using multiple GPUs, please make sure to create one multiprocessing pool for each GPU device. Pools do not assign tasks to their underlying processes in a fixed order, so the same process may end up initializing CUDA for multiple devices, wasting memory.
Another important note is that while you may be passing the device (e.g. 'cuda:1') to every torch.cuda function, torch may still do something with the default CUDA device 'cuda:0' somewhere in the code, initializing CUDA on that device for every process and wasting memory on an unwanted and unneeded CUDA initialization. I fixed this issue by using with torch.cuda.device(device_id) to wrap the entire training_function.
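A small sketch of that fix (device_id is the index of the worker's assigned GPU; the body is the same placeholder training function from above):
import torch

def training_function(device_id, checkpoint):
    # pin every CUDA call in this worker to its assigned device so nothing
    # silently initializes a context on the default device 'cuda:0'
    with torch.cuda.device(device_id):
        load(checkpoint)
        train(checkpoint)
        unload(checkpoint)
    return checkpoint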
I ended up not using multiprocessing pools and instead defined my own custom process class that holds the device and training function. This means I have to maintain an in-queue for each device-process, but they all share the same out-queue, meaning I can retrieve the results the moment they are available. I figured writing a custom process class was simpler than writing a custom pool class. I desperately tried to keep using pools as they are easily maintained, but I had to use multiple imap functions, and so the results were not obtainable one at a time, which led to a less efficient training loop.
I am now successfully training on multiple GPUs, but my questions posted above still remain unanswered.
Update (10.03.2020)
I have implemented another way to store model and optimizer state dicts outside of GPU RAM. I have written a function that replaces every tensor in the dicts with its .to('cpu') equivalent. This costs me some CPU memory, but it is more reliable than storing local files.
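A minimal sketch of that kind of helper (the function name is hypothetical, and model/optimizer are assumed to exist):
import torch

def state_to_cpu(state):
    # recursively replace every tensor in a (possibly nested) state dict
    # with a detached CPU copy, freeing the GPU memory it occupied
    if torch.is_tensor(state):
        return state.detach().cpu()
    if isinstance(state, dict):
        return {key: state_to_cpu(value) for key, value in state.items()}
    if isinstance(state, (list, tuple)):
        return type(state)(state_to_cpu(value) for value in state)
    return state

# keep the copies in host RAM between training intervals
cpu_model_state = state_to_cpu(model.state_dict())
cpu_optimizer_state = state_to_cpu(optimizer.state_dict())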
Update (11.06.2020)
I have still not found a different approach that leads to fewer CUDA initializations while maintaining the same processing speed. From what I've read and come to understand, PyTorch does not interfere much with how CUDA operates, leaving that up to NVIDIA.
I have ended up using a pool of custom, device-specific processes, called Workers, that is maintained by my custom pool class (more about this above). In addition, I let each of these Workers take in one or more checkpoints as well as the function that processes them (training, testing, hp optimization) via a Queue. These checkpoints are then processed simultaneously via a python multiprocessing ThreadPool in each Worker and the results are then returned one by one via the return Queue the moment they are ready.
This gives me the parallel procedure I was needing, but the memory issue is still there. Due to time constraints, I have come to terms with it for now.

Keras with Tensorflow: Use memory as it's needed [ResourceExhaustedError]

So I'm trying to train my CNN with multiple datasets and it seems that when I add enough data (such as when I add multiple sets as one or when I try to add the one that has over a million samples) it throws a ResourceExhaustedError.
Following the instructions here, I tried adding
from keras.backend.tensorflow_backend import set_session
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.3
set_session(tf.Session(config=config))
to my code, but this doesn't seem to make a difference.
I see 0.3 after printing out config.gpu_options.per_process_gpu_memory_fraction, so that part seems to be OK.
I even threw in a config.gpu_options.allow_growth = True for good measure, but it doesn't seem to do anything except attempt to use all the memory at once, only to find that it isn't enough.
The computer I'm trying to use to train this CNN has 4 GTX 1080 Tis with 12 GB of dedicated memory each.
EDIT: I'm sorry for not specifying how I was loading the data; I honestly didn't realise there was more than one way. When I was learning, the examples always loaded datasets that were already built in, and it took me a while to realise how to load a self-supplied dataset.
The way I'm doing it is that I'm creating two numpy arrays. One has the path of each image and the other has the corresponding label. Here's the most basic example of this:
data_dir = "folder_name"
# There is a folder for every form and in that folder is every line of that form
for filename in glob.glob(os.path.join(data_dir, '*', '*')):
# the format for file names are: "{author id}-{form id}-{line number}.png"
# filename is a path to the file so .split('\\')[-1] get's the raw file name without the path and .split('-')[0] get's the author id
authors.append(filename.split('\\')[-1].split('-')[0])
files.append(filename)
#keras requires numpy arrays
img_files = np.asarray(files)
img_targets = np.asarray(authors)
Are you sure you're not using a giant batch_size?
"Adding data": honestly I don't know what that means and if you could please describe exactly, with code, what you're doing here, it would be of help.
The number of samples should not cause any problems with GPU memory at all. What does cause a problem is a big batch_size.
Loading a huge dataset could cause a CPU RAM problem, not related with keras/tensorflow. A problem with a numpy array that is too big. (You can test this by simply loading your data "without creating any models")
If that is your problem, you should use a generator to load batches gradually. Again, since there is absolutely no code in your question, we can't help much.
But these are two forms of simply creating a generator for images:
Use the existing ImageDataGenerator and its flow_from_directory() method, explained here
Create your own coded generator, which can be:
A loop with yield
A class derived from keras.utils.Sequence (see the sketch after the usage example below)
A quick example of a loop generator:
def imageBatchGenerator(imageFiles, imageLabels, batch_size):
    while True:
        batches = len(imageFiles) // batch_size
        if len(imageFiles) % batch_size > 0:
            batches += 1
        for b in range(batches):
            start = b * batch_size
            end = (b + 1) * batch_size
            images = loadTheseImagesIntoNumpy(imageFiles[start:end])
            labels = imageLabels[start:end]
            yield images, labels
Warning: even with generators, you must make sure your batch size is not too big!
Using it:
model.fit_generator(imageBatchGenerator(files,labels,batchSize), steps_per_epoch = theNumberOfBatches, epochs= ....)
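And a hedged sketch of the keras.utils.Sequence option from the list above (loadTheseImagesIntoNumpy is the same placeholder as in the loop generator):
import numpy as np
from keras.utils import Sequence

class ImageBatchSequence(Sequence):
    def __init__(self, imageFiles, imageLabels, batch_size):
        self.imageFiles = imageFiles
        self.imageLabels = imageLabels
        self.batch_size = batch_size

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(len(self.imageFiles) / self.batch_size))

    def __getitem__(self, idx):
        start = idx * self.batch_size
        end = start + self.batch_size
        images = loadTheseImagesIntoNumpy(self.imageFiles[start:end])
        labels = self.imageLabels[start:end]
        return images, labels

# usage: model.fit_generator(ImageBatchSequence(files, labels, batchSize), epochs=...)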
Dividing your model among GPUs
You should be able to decide which layers are processed by which GPU, this "could" probably optimize your RAM usage.
Example, when creating the model:
with tf.device('/gpu:0'):
    createLayersThatGoIntoGPU0
with tf.device('/gpu:1'):
    createLayersThatGoIntoGPU1
# you will probably need to go back to a previous GPU, as you must define your layers in a proper sequence
with tf.device('/gpu:0'):
    createMoreLayersForGPU0
# and so on
I'm not sure this would be better or not, but it's worth trying too.
See more details here: https://keras.io/getting-started/faq/#how-can-i-run-a-keras-model-on-multiple-gpus
The ResourceExhaustedError is raised because you're trying to allocate more memory than is available in your GPUs or main memory. The memory allocation is approximately equal to your network footprint (to estimate this, save a checkpoint and look at the file size) plus your batch size multiplied by the size of a single element of your dataset.
It's difficult to answer your question directly without more information about your setup, but one element of this question caught my attention: you said that you get the error when you "add enough data" or "use a big enough dataset." That's odd. Notice that the size of your dataset is not included in the calculation of the memory allocation footprint, so the size of the dataset shouldn't matter. Since it does, that seems to imply that you are somehow attempting to load your entire dataset into GPU memory or main memory. If you're doing this, that's the origin of your problem. To fix it, use the TensorFlow Dataset API. Using a Dataset sidesteps the limited memory resources by hiding the data behind an Iterator that only yields batches when called. Alternatively, you could use the older feed_dict and QueueRunner data feeding structure, but I don't recommend it. You can find some examples of this here.
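For example, a rough sketch of that Dataset approach (TF 2.x style; img_files is the path array from the question, encoded_labels is assumed to be an integer-encoded version of img_targets, and the resize target is an arbitrary assumption so that batching works):
import tensorflow as tf

def load_image(path, label):
    # decode one image at a time instead of holding the whole dataset in memory
    image = tf.io.read_file(path)
    image = tf.image.decode_png(image, channels=1)
    image = tf.image.convert_image_dtype(image, tf.float32)
    image = tf.image.resize(image, [64, 256])   # assumed target size
    return image, label

dataset = (tf.data.Dataset.from_tensor_slices((img_files, encoded_labels))
           .shuffle(buffer_size=1024)
           .map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))

model.fit(dataset, epochs=10)   # `model` is assumed to be defined elsewhere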
If you are already using the Dataset API, you'll need to post more of your code as an edit to your question for us to help you.
There is no setting that magically allows you more memory than your GPU has. It looks to me like your inputs are just too big to fit in the GPU RAM (along with all the required state and gradients).
You should use config.gpu_options.allow_growth = True, not to get more memory, but to get an idea of how much memory you need per input length. Start with a small length, see with nvidia-smi how much RAM your GPU takes, and then increase the length. Do that again and again until you understand what the maximal length of inputs (batch size) is that your GPU can hold.

Training changing input size RNN on Tensorflow

I want to train an RNN with different input sizes of sentence X, without padding. The logic used for this is that I am using global variables and, for every step, I take an example, write the forward propagation (i.e. build the graph), run the optimizer, and then repeat the step with another example. The program is extremely slow compared to the numpy implementation of the same thing, where I have implemented forward and backward propagation using the same logic as above. The numpy implementation takes a few seconds, while Tensorflow is extremely slow. Would running the same thing on a GPU be useful, or am I making some logical mistake?
As a general guideline, a GPU boosts performance only if you have computation-intensive code and little data transfer. In other words, if you train your model one instance at a time (or with small batch sizes), the overhead of transferring data to/from the GPU can even make your code run slower! But if you feed in a good chunk of samples at once, then the GPU will definitely boost your code.

AWS g2.8xlarge performance and out of memory issues when using tensorflow

I am currently training a recurrent net on Tensorflow for a text classification problem and am running into performance and out of memory issues. I am on AWS g2.8xlarge with Ubuntu 14.04, and a recent nightly build of tensorflow (which I downloaded on Aug 25).
1) Performance issue:
On the surface, both the CPU and GPU are highly under-utilized. I've run multiple tests on this (and have used line_profiler and memory_profiler in the process). Train durations scale linearly with number of epochs, so I tested with 1 epoch. For RNN config = 1 layer, 20 nodes, training time = 146 seconds.
Incidentally, that number is about 20 seconds higher/slower than the same test run on a g2.2xlarge!
Here is a snapshot of System Monitor and nvidia-smi (updated every 2 seconds) about 20 seconds into the run:
[Screenshot: System Monitor and nvidia-smi snapshot early in the run]
As you can see, GPU utilization is at 19%. When I use nvprof, I find that the total GPU process time is about 27 seconds or so. Also, except for one vCPU, all others are very under-utilized. The numbers stay around this level, till the end of the epoch where I measure error across the entire training set, sending GPU utilization up to 45%.
Unless I am missing something, on the surface, each device is sitting around waiting for something to happen.
2) Out of memory issue:
If I increase the number of nodes to 200, it gives me an Out of Memory error on the GPU side. As you can see from the above snapshot, only one of the four GPUs is used. I've found that the way to get tensorflow to use a GPU has to do with how you assign the model: if you don't specify anything, tensorflow will assign it to a GPU; if you specify a GPU, only that one will be used. Tensorflow does not like it when I assign the model to multiple devices with a "for d in ['/gpu:0',...]"; I run into an issue with re-using the embedding variable. I would like to use all 4 GPUs for this (without setting up distributed tensorflow). Here is the snapshot of the Out of Memory error:
[Screenshot: Out of Memory error]
Any suggestions you may have for both these problems would be greatly appreciated!
Re (1), to improve GPU utilization, did you try increasing the batch size and/or shortening the sequences you use for training?
Re (2), to use multiple GPUs you do need to manually assign the ops to GPU devices, I believe. The right way is to place ops on specific GPUs by doing
with g.Device("/gpu:0"):
...
with g.Device("/gpu:1"):
...
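As a hedged, TF1-style sketch of how variable sharing across GPU towers is usually handled (this addresses the "re-using the embedding variable" issue mentioned in the question; build_tower and batch_slices are placeholders, not from the original code):
import tensorflow as tf

g = tf.Graph()
with g.as_default():
    with tf.variable_scope("model") as scope:
        tower_losses = []
        for i, device in enumerate(['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']):
            with tf.device(device):
                # placeholder: builds the RNN + embedding on this tower's slice of the batch
                tower_losses.append(build_tower(batch_slices[i]))
            # reuse the same variables (including the embedding) on the remaining towers
            scope.reuse_variables()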
