torch.cuda.empty_cache() replacement in case of a CPU-only environment - python

Currently, I am using PyTorch built with CPU-only support. When I run inference, information for each input file seems to be cached somewhere, and memory keeps increasing for every new unique file used for inference. On the other hand, memory usage does not increase if I use the same file again and again.
Is there a way to clear this cache, like torch.cuda.empty_cache(), when running on CPU only?
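For reference, a minimal sketch of the usual workarounds, assuming the growth comes from autograd history or lingering Python references (to my knowledge there is no CPU counterpart of torch.cuda.empty_cache(), and the model and tensor shapes below are placeholders):

import gc
import torch
import torch.nn as nn

model = nn.Linear(128, 10)   # placeholder model for illustration
model.eval()

def run_inference(batch):
    # torch.no_grad() keeps autograd from retaining intermediate activations,
    # a common cause of steadily growing memory during inference.
    with torch.no_grad():
        return model(batch)

output = run_inference(torch.randn(4, 128))
del output     # drop the last reference to large tensors...
gc.collect()   # ...and explicitly collect any reference cycles

CPU tensors are freed as soon as nothing references them, so holding outputs in a growing list or keeping them attached to the autograd graph is the usual culprit rather than a caching allocator.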

Related

GPU RAM Reducing Context Manager

I am training a model in PyTorch that barely fits within the GPU RAM restrictions in Colab (I don't have hardware to run it locally). It fits in memory during training, but inference has higher memory requirements and the GPU RAM runs out. There are some large objects (another model) that my model needs for training but not for sampling. My idea is that I can create a context manager that moves them off the GPU, yields, and brings them back, like so:
from contextlib import contextmanager

@contextmanager
def reduce_gpu_usage(large_cuda_objs):
    for lco in large_cuda_objs:
        lco = lco.cpu()    # only rebinds the local name lco
    yield
    for lco in large_cuda_objs:
        lco = lco.cuda()
This doesn't work, though. Based on my initial debugging, I guess the problem is that even though the objects are passed "by reference", reassigning the name that was passed only rebinds a local variable and does nothing outside the function's scope. I tried following this answer to reassign the variables in the caller, but that seems like bad practice and also didn't work:
mem = torch.cuda.memory_allocated
memr = torch.cuda.memory_reserved
print(mem())
print(memr())
with reduce_gpu_usage(large_cuda_obj_names=[*several object names as strings*]):
    print(mem())
    print(memr())
gave me
15040313344
15915286528
15040313344
15915286528
My question is this: is what I'm trying to do possible and advisable? What is the best way to do it? I just need to temporarily move some objects to CPU and back to GPU while running inference. I'd like to do this with a context manager for clean use and easy reverting to resume training. I'd also like to do this without copying and pasting several lines of code everywhere I run inference, and without creating a class if possible.
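One approach sometimes suggested, as a sketch rather than a definitive answer: if the large objects are nn.Module instances (an assumption; the helper name below is hypothetical), their .cpu()/.cuda() methods move parameters and buffers in place and return self, so no rebinding in the caller is needed:

from contextlib import contextmanager
import torch

@contextmanager
def offload_to_cpu(modules):
    # nn.Module.cpu()/.cuda() mutate the module's parameters and buffers in
    # place, so the caller's references stay valid without any rebinding.
    try:
        for m in modules:
            m.cpu()
        torch.cuda.empty_cache()   # let the caching allocator release the freed blocks
        yield
    finally:
        for m in modules:
            m.cuda()

Bare tensors would still need different handling, since tensor.cpu() returns a new tensor rather than moving the existing one in place.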

Access tensorflow/core/framework/cpu_allocator_impl.cc from within python

I have found plenty of questions regarding the tensorflow warning
tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of xxxxxxxxx exceeds 10% of system memory
I know that it is just a warning, not an error, and that its display can be suppressed. I also know how to (most likely) address the issue: reduce the batch size.
However, I have never found the issue addressed the other way around:
Given my network and the current state of the system on which it runs, what is the largest batch size it can fit without problems?
So is there a way to access what TensorFlow is doing internally from within the Python (3.7) interface?
I'm running tf 2.2.0 on my CPU. I know that there are ways to limit GPU memory usage (let it grow, don't use 100% of the available space, etc.), but I have not found an equivalent for the CPU. Creating a logical device with a memory limit is not supported for CPUs (https://www.tensorflow.org/api_docs/python/tf/config/set_logical_device_configuration).
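Absent a CPU-side memory limit API, one rough empirical approach is to probe batch sizes while watching the process's resident memory with psutil. This is a sketch under assumptions (a Keras-style model with a predict() method, psutil installed, and a hypothetical helper name), not a TensorFlow facility:

import numpy as np
import psutil

def probe_max_batch_size(model, input_shape, start=32, rss_frac=0.8):
    # Hypothetical helper: double the batch size until the process RSS
    # approaches rss_frac of total system memory, then report the last
    # batch size that stayed under the threshold. This only observes memory
    # after the fact, so treat the result as a rough upper bound.
    total = psutil.virtual_memory().total
    proc = psutil.Process()
    batch_size, last_ok = start, None
    while True:
        x = np.zeros((batch_size, *input_shape), dtype=np.float32)
        model.predict(x, batch_size=batch_size, verbose=0)
        if proc.memory_info().rss > rss_frac * total:
            return last_ok
        last_ok = batch_size
        batch_size *= 2

For example, probe_max_batch_size(my_model, (224, 224, 3)) would report the largest probed batch size that stayed under 80% of system RAM; since TensorFlow caches allocations internally, the number is only approximate.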

Excessive memory usage when reading parquet in Python

I have a parquet file of around 10+ GB whose columns are mainly strings. When loading it into memory, memory usage can peak at around 110 GB, while after loading finishes it drops back to around 40 GB.
I'm working on a high-performance computer with allocated memory, so I do have access to large amounts of it. However, it seems a waste to me that I have to request 128 GB just for loading the data, when 64 GB is sufficient afterwards. Also, the 128 GB nodes are more often out of order.
My naive conjecture is that the Python interpreter treats the HPC node's 512 GB of physical memory as the total available memory, so it does not garbage-collect as often as actually needed. For example, when I load the data with 64 GB of memory, it never throws a MemoryError; the kernel is simply killed and restarted.
I was wondering whether this excessive memory usage during loading is regular behavior for pyarrow, or whether it is due to the particular setup of my environment. If the latter, is it possible to somehow limit the available memory during loading?
We fixed a memory use bug that's present in 0.14.0/0.14.1 (which is probably what you're using right now).
https://issues.apache.org/jira/browse/ARROW-6060
We are also introducing an option to read string columns as categorical (aka DictionaryArray in Arrow parlance), which will also reduce memory usage. See https://issues.apache.org/jira/browse/ARROW-3325 and the discussion in
https://ursalabs.org/blog/2019-06-07-monthly-report/
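For illustration, a minimal sketch of the dictionary-read option mentioned above (assuming a reasonably recent pyarrow; the file path and column name are placeholders):

import pyarrow.parquet as pq

# Read string-heavy columns as dictionary-encoded (categorical-like) arrays,
# which can substantially reduce peak memory for repetitive string data.
table = pq.read_table(
    "data.parquet",
    read_dictionary=["some_string_column"],
)
df = table.to_pandas()   # dictionary columns become pandas Categorical

The savings tend to be largest for long, repetitive string columns, which is the case described in the question.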

How to know if a Python process is garbage collecting too often?

I am tuning the performance of a Python job (that uses PyTorch and Nvidia CUDA). My Python process runs on a shared cluster where the maximum RAM needed must be specified explicitly. Since lower is better, we tend to request the smallest amount of RAM that doesn't cause out-of-memory failures.
In particular, I noticed that the RAM limit can make quite a difference in performance. For instance, if I set RAM=6GB, my job takes 27 hours. If I set max RAM=10GB (keeping all other variables the same), the same job takes about 16 hours. Most of the work is done on the GPU using GPU RAM, so the CPU and CPU RAM are only used for housekeeping and moving tensors.
My suspicion is that the garbage collector runs too often when I set less RAM. I had observed this kind of behavior when dealing with the JVM, and there I had tools to inspect how much time the process spent in the garbage collector.
However, with Python, I am clueless.
Are there any ways to inspect the memory management, in particular the time spent in the garbage collector as a fraction of runtime?
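One way to get a JVM-style number is to time the cyclic collector via gc.callbacks; note that CPython frees most objects by reference counting, so a small figure here may mean the slowdown comes from elsewhere (for example, swapping under the tighter RAM cap). A rough sketch:

import gc
import time

class GCTimer:
    # Accumulates wall-clock time spent in CPython's cyclic garbage collector
    # using gc.callbacks (available since Python 3.3).
    def __init__(self):
        self.total = 0.0
        self._start = None
        gc.callbacks.append(self._callback)

    def _callback(self, phase, info):
        if phase == "start":
            self._start = time.perf_counter()
        elif phase == "stop" and self._start is not None:
            self.total += time.perf_counter() - self._start
            self._start = None

timer = GCTimer()
# ... run the training / inference job ...
print(f"time spent in gc: {timer.total:.2f}s")

Comparing timer.total against total wall-clock time under the 6 GB and 10 GB settings would show whether collection time, rather than paging, explains the gap.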

Run out of VRAM using Theano on Amazon cluster

I'm trying to execute the logistic_sgd.py code on an Amazon cluster running the ami-b141a2f5 (Theano - CUDA 7) image.
Instead of the included MNIST database I am using the SD19 database, which requires changing a few dimensional constants, but otherwise no code has been touched. The code runs fine locally on my CPU, but once I copy the code and data to the Amazon cluster over SSH and run it there, I get this output:
It looks to me like it is running out of VRAM, but it was my understanding that the code should already run on a GPU without any tinkering necessary on my part. After following the suggestion from the error message, the error persists.
There's nothing especially strange here. The error message is almost certainly accurate: there really isn't enough VRAM. Often, a script will run fine on CPU but then fail like this on GPU simply because there is usually much more system memory available than GPU memory, especially since the system memory is virtualized (and can page out to disk if required) while the GPU memory isn't.
For this script, there needs to be enough memory to store the training, validation, and testing data sets, the model parameters, and enough working space to store intermediate results of the computation. There are two options available:
Reduce the amount of memory needed for one or more of these three components. Reducing the amount of training data is usually easiest; reducing the size of the model comes next. Unfortunately, both of those options will often impair the quality of the result being sought. Reducing the amount of memory needed for intermediate results is usually beyond the developer's control -- it is managed by Theano, but there is sometimes scope for altering the computation to achieve this goal once a good understanding of Theano's internals is achieved.
If the model parameters and working memory can fit in GPU memory, then the most common solution is to change the code so that the data is no longer stored in GPU memory (i.e. just store it as numpy arrays, not as Theano shared variables) and to pass each batch of data in as inputs instead of givens; a minimal sketch of that pattern follows below. The LSTM sample code is an example of this approach.
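For illustration, a self-contained sketch of that pattern (the shapes, learning rate, and random data are placeholders, not the original logistic_sgd.py code): the full dataset stays in host RAM as numpy arrays, and each minibatch is passed through inputs, so only one batch at a time occupies GPU memory.

import numpy as np
import theano
import theano.tensor as T

n_in, n_out, batch_size = 784, 10, 128

x = T.matrix("x")
y = T.ivector("y")
W = theano.shared(np.zeros((n_in, n_out), dtype=theano.config.floatX), name="W")
b = theano.shared(np.zeros(n_out, dtype=theano.config.floatX), name="b")

p_y_given_x = T.nnet.softmax(T.dot(x, W) + b)
cost = -T.mean(T.log(p_y_given_x)[T.arange(y.shape[0]), y])
grads = T.grad(cost, [W, b])
updates = [(p, p - 0.1 * g) for p, g in zip([W, b], grads)]

# Batches arrive as function inputs instead of being sliced out of a
# GPU-resident shared variable via givens.
train_model = theano.function(inputs=[x, y], outputs=cost, updates=updates)

train_x = np.random.rand(60000, n_in).astype(theano.config.floatX)
train_y = np.random.randint(0, n_out, size=60000).astype("int32")

for start in range(0, len(train_x), batch_size):
    train_model(train_x[start:start + batch_size],
                train_y[start:start + batch_size])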
