I am having some trouble with how TensorFlow handles memory. In the toy example below, I would like to remove tensors from memory after each iteration.
I am using TensorFlow eager execution. I have tried with Variables and with plain tensors, and tf.assign doesn't do the job: more and more memory is used. I guess that is normal so that gradients can still be computed, but even if I apply some dummy optimizer at the end of each iteration, the memory isn't released (more precisely, it is released sometimes, but the overall trend is that memory use keeps growing).
So is it possible to delete tensors manually?
import tensorflow as tf
import tensorflow.contrib.eager as tfe
import numpy as np
import time as ti

tf.enable_eager_execution()

for i in range(150):
    all_subject = tfe.Variable(np.random.rand(200, 500), dtype=tf.float32)
    tf.assign(all_subject, np.random.rand(200, 500))
    ti.sleep(1.0)
    del all_subject
    ti.sleep(0.5)
Here is what the allocation looks like: [memory profile screenshot]
According to the documentation on eager execution,
During eager execution the lifetime of state objects is determined by the lifetime of their corresponding Python object.
So you should not observe any memory leak in your code, even without an explicit del: the simple fact of reassigning your variable to something else should free the memory.
However, this is not what happens, and I observe the same memory leak as you do.
So this could be a (serious) bug, which you could report here.
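If you want to quantify the leak before filing a report, here is a minimal sketch that logs the process's resident memory each iteration while reusing a single Variable via assign, so you can check whether the growth tracks Variable creation. It relies on the third-party psutil package, which is an assumption on my part, not something from the question.

import os

import numpy as np
import psutil  # third-party; pip install psutil
import tensorflow as tf
import tensorflow.contrib.eager as tfe

tf.enable_eager_execution()
proc = psutil.Process(os.getpid())

# Reuse one Variable instead of creating a fresh one per iteration.
all_subject = tfe.Variable(np.random.rand(200, 500), dtype=tf.float32)
for i in range(150):
    all_subject.assign(np.random.rand(200, 500).astype(np.float32))
    if i % 10 == 0:
        print(i, proc.memory_info().rss // (1024 * 1024), "MiB resident")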
I am training a model in PyTorch that barely fits within the GPU RAM limits on Colab (I don't have hardware to run it locally). It fits in memory during training, but inference has higher memory requirements and the GPU RAM runs out. There are some large objects (another model) that my model needs for training but not for sampling. My idea is to create a context manager that moves them off the GPU, yields, and then brings them back, like so:
from contextlib import contextmanager

@contextmanager
def reduce_gpu_usage(large_cuda_objs):
    for lco in large_cuda_objs:
        lco = lco.cpu()
    yield
    for lco in large_cuda_objs:
        lco = lco.cuda()
This doesn't work, though. From my initial debugging, I think the problem is that even though the objects are passed "by reference", rebinding the local name that holds the reference does nothing outside the function's scope. I tried following this answer to reassign the variables in the caller, but that seems like bad practice and also didn't work:
mem = torch.cuda.memory_allocated
memr = torch.cuda.memory_reserved
print(mem())
print(memr())
with reduce_gpu_usage(large_cuda_obj_names=[*several object names as strings*]):
    print(mem())
    print(memr())
This gave me:
15040313344
15915286528
15040313344
15915286528
My question is this: is what I'm trying to do possible and advisable, and what is the best way to do it? I just need to temporarily move some objects to the CPU and back to the GPU while running inference. I'd like to do this with a context manager for clean use and easy reverting when training resumes, without copying and pasting several lines of code everywhere I run inference, and without creating a class if possible.
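What you describe is possible. One rough approach, assuming the large objects are torch.nn.Module instances (which they are if they make up another model): Module.cpu() and Module.cuda() move the module's parameters and buffers in place and return the module itself, so nothing in the caller needs to be rebound; plain tensors, by contrast, would still need to be reassigned by whoever owns them, since tensor.cpu() returns a new tensor. Calling torch.cuda.empty_cache() afterwards lets the reserved memory shrink as well. A minimal sketch, where aux_model and model.generate are hypothetical names, not from the question:

from contextlib import contextmanager

import torch


@contextmanager
def reduce_gpu_usage(large_modules):
    # Move the modules' parameters and buffers to the CPU in place.
    for m in large_modules:
        m.cpu()
    # Return cached blocks to the driver so reserved memory drops too.
    torch.cuda.empty_cache()
    try:
        yield
    finally:
        # Move everything back so training can resume unchanged.
        for m in large_modules:
            m.cuda()

# Usage (hypothetical names):
# with reduce_gpu_usage([aux_model]):
#     samples = model.generate(prompt)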
As I understand it, tf.reset_default_graph() only creates a new graph and sets it as the default graph, so the previously created tensors would just be lying around, occupying memory. I have also read that unreferenced tensors are not garbage collected (the way normal Python variables are).
If I am running cross-validation to search for a set of hyperparameters, and thus creating the same graph again and again, how do I get rid of the previously created tensors?
I had the same problem when designing experiments. After researching the issue, the only solution that worked for me is this one. As you can read in that link, it seems to be a design flaw, and the TF team does not seem interested in fixing it.
The solution is to create a new process for each cross-validation iteration; when the process finishes, the system reclaims its resources automatically.
import multiprocessing

def evaluate(...):
    import tensorflow as tf
    # Your logic

for ... in cross_validation_loop:
    process_eval = multiprocessing.Process(target=evaluate, args=(...))
    process_eval.start()
    process_eval.join()
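If you also need a result (e.g. a validation score) back from the child process, a multiprocessing.Queue works. Here is a rough sketch in the same spirit; hyperparameter_grid, the dummy score, and the body of evaluate are placeholders rather than anything from the original answer.

import multiprocessing

def evaluate(hparams, result_queue):
    # Importing TensorFlow inside the child keeps all graph/GPU state
    # confined to this process, so it is released when the process exits.
    import tensorflow as tf
    # ... build the graph with hparams, train, compute a validation score ...
    score = 0.0  # placeholder for the real metric
    result_queue.put(score)

if __name__ == "__main__":  # guard needed on platforms that spawn processes
    hyperparameter_grid = [{"lr": 1e-3}, {"lr": 1e-4}]  # placeholder grid
    results = []
    for hparams in hyperparameter_grid:
        queue = multiprocessing.Queue()
        p = multiprocessing.Process(target=evaluate, args=(hparams, queue))
        p.start()
        results.append(queue.get())  # read before join() so a full queue can't deadlock
        p.join()
    print(results)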
Using TensorFlow r0.9/r0.10, I get the following message, which makes me worried that I've set up my neural network model in the wrong way.
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 6206792 get requests, put_count=6206802 evicted_count=5000 eviction_rate=0.000805568 and unsatisfied allocation rate=0.000806536
The network I use is similar to AlexNet/VGG-M. I create the variables and the ops in a function that is called once, and then I just loop over multiple epochs, calling the same optimizer, loss, and prediction functions for each mini-batch iteration.
Another thing that worries me is that the network can be unstable with a large batch size: it runs fine for a few epochs, and then it runs out of memory (trying to allocate...).
Is there any way to check whether something is wrong and, if so, what it is?
This is an info-level log statement (the "I" prefix). It does not necessarily mean that anything is wrong: however, the pool allocator (a cache for allocations) is finding that it frequently has to fall back on the underlying allocator. This may indicate memory pressure.
For your instability problem: as you observe, large batches can lead to out-of-memory errors. There is some nondeterminism to operator scheduling, which is why you may not see it fail every time. Try lowering your batch size until you consistently no longer see out of memory errors.
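If you want to automate that search, a rough sketch along these lines can work. Here run_one_epoch is a hypothetical wrapper around your existing training loop, and tf.errors.ResourceExhaustedError is the exception newer TF releases raise on OOM (it may not be exposed under that name in r0.9/r0.10); note that after an OOM the session state can be unreliable, so in practice you may prefer to restart the process between attempts.

import tensorflow as tf

def find_stable_batch_size(run_one_epoch, start=256, floor=8):
    # run_one_epoch(batch_size) is assumed to wrap your existing
    # training loop for a single epoch at the given batch size.
    batch_size = start
    while batch_size >= floor:
        try:
            run_one_epoch(batch_size)
            return batch_size
        except tf.errors.ResourceExhaustedError:
            # Out of memory: halve the batch size and try again.
            batch_size //= 2
    raise RuntimeError("Ran out of memory even at the smallest batch size.")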
I have been using scipy.optimize.fmin_l_bfgs_b() to minimize functions for a while now, but recently I ran into behavior I had not noticed before. While optimizing a new function, memory usage keeps increasing as more iterations are executed. For example, by the 1500th iteration memory usage has increased roughly 100x, and in some cases I have to stop the optimization before it runs out of memory. For reference, I have previously run scipy.optimize.fmin_l_bfgs_b() to optimize other functions and never saw an increase in memory usage.
As I understand how this function works, it should perform a similar type of calculation at each iteration, so I don't understand why memory usage would increase.
Is this behavior expected, or is this probably some kind of memory leak (either in fmin_l_bfgs_b or in the function I supply)?
If memory is not deallocated in the function being minimized, each call to it will increase your program's memory allocation. Most pure-Python objects deallocate properly once they are no longer referenced.
I had this issue with an array returned from a C extension module: the module was responsible for allocating the memory but did not properly release it. Rewriting the module to release the memory correctly solved the issue; the solution is discussed here:
Memory leak in Python extension when array is created with PyArray_SimpleNewFromData() and returned
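If you want to check whether the leak is in the function you supply rather than in fmin_l_bfgs_b itself, one rough approach is to call the function repeatedly outside the optimizer and compare tracemalloc snapshots. In this sketch, my_objective and x0 are hypothetical stand-ins for your own objective and starting point; leaks inside C extensions may not show up in tracemalloc and need a process-level measurement (e.g. RSS) instead.

import tracemalloc

import numpy as np


def my_objective(x):
    # Hypothetical stand-in for the function passed to fmin_l_bfgs_b;
    # it returns (f(x), grad(x)).
    return float(np.sum(x ** 2)), 2 * x


x0 = np.zeros(1000)

tracemalloc.start()
before = tracemalloc.take_snapshot()
for _ in range(500):
    my_objective(x0)
after = tracemalloc.take_snapshot()

# Steadily growing totals here point at the objective (or something it
# calls); allocations made inside C extensions will not appear here.
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)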
I'm using Theano to do some computation involving a large (about 300,000 x 128) matrix.
The Theano function quit after raising a MemoryError, similar to this question.
I think it's probably because the large matrix gets processed step by step, and each step leaves a large GpuArray (of the same shape as the first one) in memory.
So my questions are:
1. In the case of temporary (non-output) variables with the same shape and dtype, does Theano make any effort to reuse already-allocated memory? For example, does a pool exist for each array shape?
2. If the answer to 1 is yes, can I inspect which nodes in the function graph reuse memory?
I know there's shared for explicit sharing, but I worry that using it would make the (already hard to follow) computation code even harder to understand.
UPDATE
A simplified example of such a situation:
import theano
from theano import tensor as T

a0 = T.matrix()  # the initial input
# op1, op2, op3 are valid, complicated operations
# whose outputs' shapes are identical to a0's
a1 = op1(a0)
a2 = op2(a1)
a3 = op3(a2)

f = theano.function([a0], a3)
If none of op1, op2, op3 can be optimized to run in-place, will Theano still try to reuse memory? For example, might a1 and a3 "share" the same address to reduce the memory footprint, since a1 is no longer used by the time op3 runs?
Thanks!
You need to use in-place operations as much as possible. For the most part this is not under user control (they are automatically used by the optimizer when circumstances allow) but there are a few things you can do to encourage their use.
Take a look at the documentation on this issue:
memory profiler
Information on the optimizations that enable in-place operation
How to create custom operations that can work in-place
As indicated in the documentation, do not set the inplace parameter of the inc_subtensor operation yourself. The docs also state that in-place operators are not supported for user specification; the optimizer will apply them automatically when possible.
You could use Theano's memory profiler to help track down which operation(s) are using up the memory.
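To make that concrete, here is a rough sketch of enabling the memory profiler through Theano's configuration flags. The ops below are placeholders standing in for op1/op2/op3, and this assumes the standard profile/profile_memory flags, which must be set before Theano is imported.

import os

# These flags must be set before Theano is imported; profile_memory adds
# per-node memory statistics to the profile printed at process exit.
os.environ["THEANO_FLAGS"] = "profile=True,profile_memory=True"

import numpy as np
import theano
from theano import tensor as T

a0 = T.matrix()
a1 = T.tanh(a0)          # placeholder for op1
a2 = T.nnet.sigmoid(a1)  # placeholder for op2
a3 = a2 * 2.0            # placeholder for op3
f = theano.function([a0], a3)

f(np.random.rand(10000, 128).astype(theano.config.floatX))
# The profile, including memory statistics per node, is printed when the
# interpreter exits.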
UPDATE
I'm not an expert in this area but I believe Theano uses a garbage collection mechanism to free up memory that is no longer needed. This is discussed in the documentation.
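The flag involved is allow_gc, which is on by default; with it enabled, Theano frees intermediate results as soon as they are no longer needed within a function call, trading some re-allocation overhead on later calls for a smaller memory footprint. A tiny illustration:

import os

# allow_gc is on by default. Leaving it enabled lets Theano free
# intermediate results during a function call; disabling it keeps them
# allocated to speed up repeated calls at the cost of more memory.
os.environ["THEANO_FLAGS"] = "allow_gc=True"

import theano
print(theano.config.allow_gc)  # -> True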