Over-high memory usage during reading parquet in Python

Over-high memory usage during reading parquet in Python - python

I have a parquet file at around 10+GB, with columns are mainly strings. When loading it into the memory, the memory usage can peak to 110G, while after it's finished the memory usage is reduced back to around 40G.
I'm working on a high-performance computer with allocated memory so I do have access to large memory. However, it seems a waste to me that I have to apply for a 128G memory just for loading data, after that 64G is sufficient for me. Also, 128G memory is more often to be out of order.
My naive conjecture is that the Python interpreter mistreated the 512G physical memory on the HPC as the total available memory, so it does not do garbage collection as often as actually needed. For example, when I load the data with 64G memory, it never threw me a MemoryError but the kernel is directly killed and restarted.
I was wondering whether the over-high usage of memory when loading is a regular behavior of pyarrow, or it is due to the special setting of my environment. If the latter, then is it possible to somehow limit the available memory during loading?

We fixed a memory use bug that's present in 0.14.0/0.14.1 (which is probably what you're using right now).
https://issues.apache.org/jira/browse/ARROW-6060
We also are introducing an option to read string columns as categorical (aka DictionaryArray in Arrow parlance) which also will reduce memory usage. See https://issues.apache.org/jira/browse/ARROW-3325 and discussion in
https://ursalabs.org/blog/2019-06-07-monthly-report/

Related

How is memory handled once touched for the first time in numpy.zeros?

I recently saw that when creating a numpy array via np.empty or np.zeros, the memory of that numpy array is not actually allocated by the operating system as discussed in this answer (and this question), because numpy utilizes calloc to allocate the array's memory.
In fact, the OS isn't even "really" allocating that memory until you try to access it.
Therefore,
l = np.zeros(2**28)
does not increase the utilized memory the system reports, e.g., in htop.
Only once I touch the memory, for instance by executing
np.add(l, 0, out=l)
the utilized memory is increased.
Because of that behaviour I have got a couple of questions:
1. Is touched memory copied under the hood?
If I touch chunks of the memory only after a while, is the content of the numpy array copied under the hood by the operating system to guarantee that the memory is contiguous?
i = 100
f[:i] = 3
while True:
... # Do stuff
f[i] = ... # Once the memory "behind" the already allocated chunk of memory is filled
# with other stuff, does the operating system reallocate the memory and
# copy the already filled part of the array to the new location?
i = i + 1
2. Touching the last element
As the memory of the numpy array is continguous in memory, I tought
f[-1] = 3
might require the enitre block of memory to be allocated (without touching the entire memory).
However, it does not, the utilized memory in htop does not increase by the size of the array.
Why is that not the case?

OS isn't even "really" allocating that memory until you try to access it
This is dependent of the target platform (typically the OS and its configuration). Some platform directly allocates page in physical memory (eg. AFAIK the XBox does as well as some embedded platforms). However, mainstream platforms actually do that indeed.
1. Is touched memory copied under the hood?
If I touch chunks of the memory only after a while, is the content of the numpy array copied under the hood by the operating system to guarantee that the memory is contiguous?
Allocations are perform in virtual memory. When a first touch is done on a given memory page (chunk of fixed sized, eg. 4 KiB), the OS maps the virtual page to a physical one. So only one page will be physically map when you set only one item of the array (unless the item cross two pages which only happens in pathological cases).
The physical pages may not be contiguous for a contiguous set of virtual pages. However, this is not a problem and you should not care about it. This is mainly the job of the OS. That being said, modern processors have a dedicated unit called TLB to translate virtual address (the one you could see with a debugger) to physical ones (since this translation is relatively expensive and performance critical).
The content of the Numpy array is not reallocated nor copied thanks to paging (at least from the user point-of-view, ie. in virtual memory).
2. Touching the last element
I thought f[-1] = 3 might require the entire block of memory to be allocated (without touching the entire memory). However, it does not, the utilized memory in htop does not increase by the size of the array. Why is that not the case?
Only the last page in virtual memory associated to the Numpy array is mapped thanks to paging. This is why you do not see a big change in htop. However, you should see a slight change (the size of a page on your platform) if you look carefully. Otherwise, this should mean the page has been already mapped due to other previous recycled allocations. Indeed, the allocation library can preallocate memory area to speed up allocations (by reducing the number of slow requests to the OS). The library could also keep the memory mapped when it is freed by Numpy in order to speed the next allocations up (since the memory do not have to be unmapped to be then mapped again). This is unlikely to occur for huge arrays in practice because the impact on memory consumption would be too expensive.

How to know if python process is garbage collecting too often?

I am tuning the performance of a python job (that uses Pytorch and Nvidia Cuda). My python process runs on a shared cluster where the max RAM needed must be explicitly specified. Since the lower is better, we tend to set lower RAM that doesn't cause Out of Memory.
In particular, I noticed that the RAM could make quite a difference on performance. For instance, if I set RAM=6GB, my job takes 27Hours. If I set max RAM=10GB (keeping all other variables same), the same job takes about ~16Hrs. Most of the work is done on GPU using GPU RAM, so the CPU and CPU RAM is only for housekeeping and moving tensors.
My suspect is garbage collector runs too often when I set less RAM. I had observed this kind of behavior when I was dealing with JavaVM, and I had tools to inspect how much time my JVM process spends in the garbage collector.
However, with python, I am clueless.
Are there any ways to inspect the memory management, in particular, Garbage Collector time (fraction of runtime)?

What is python's strategy to manage allocation/freeing of large variables?

As a follow-up to this question, it appears that there are different allocation/deallocation strategies for little and big variables in (C)Python.
More precisely, there seems to be a boundary in the object size above which the memory used by the allocated object can be given back to the OS. Below this size, the memory is not given back to the OS.
To quote the answer taken from the Numpy policy for releasing memory:
The exception is that for large single allocations (e.g. if you create a multi-megabyte array), a different mechanism is used. Such large memory allocations can be released back to the OS. So it might specifically be the non-numpy parts of your program that are producing the issues you see.
Indeed, these two allocations strategies are easy to show. For example:
1st strategy: no memory is given back to the OS
import numpy as np
import psutil
import gc
# Allocate array
x = np.random.uniform(0,1, size=(10**4))
# gc
del x
gc.collect()
# We go from 41295.872 KB to 41295.872 KB
# using psutil.Process().memory_info().rss / 10**3; same behavior for VMS
=> No memory given back to the OS
2nd strategy: freed memory is given back to the OS
When doing the same experiment, but with a bigger array:
x = np.random.uniform(0,1, size=(10**5))
del x
gc.collect()
# We go from 41582.592 KB to 41017.344 KB
=> Memory is released to the OS
It seems that objects approximately bigger than 8*10**4 bytes get allocated using the 2nd strategy.
So:
Is this behavior documented? (And what is the exact boundary at which the allocation strategy changes?)
What are the internals of these strategies (more than assuming the use of an mmap/munmap to release the memory back to the OS)
Is this 100% done by the Python runtime or does Numpy have a specific way of handling this? (The numpy doc mentions the NPY_USE_PYMEM that switches between the memory allocator)

What you observe isn't CPython's strategy, but the strategy of the memory allocator which comes with the C-runtime your CPython-version is using.
When CPython allocates/deallocates memory via malloc/free, it doesn't not communicate directly with the underlying OS, but with a concrete implementation of memory allocator. In my case on Linux, it is the GNU Allocator.
The GNU Allocator has different so called arenas, where the memory isn't returned to OS, but kept so it can be reused without the need to comunicate with OS. However, if a large amout of memory is requested (whatever the definition of "large"), the allocator doesn't use the memory from arenas but requests the memory from OS and as consequence can give it directly back to OS, once free is called.
CPython has its own memory allocator - pymalloc, which is built atop of the C-runtime-allocator. It is optimized for small objects, which live in a special arena; there is less overhead when creating/freeing these objects as compared to the underlying C-runtime-allocator. However, objects bigger than 512 bytes don't use this arena, but are managed directly by the C-runtime-allocator.
The situation is even more complex with numpy's array, because different memory-allocators are used for the meta-data (like shape, datatype and other flags) and for the the actual data itself:
For meta-data PyArray_malloc, the CPython's memory allocator (i.e. pymalloc) is used.
For data itself, PyDataMem_NEW is used, which utilzes the underlying C-runtimme-functionality directly:
NPY_NO_EXPORT void *
PyDataMem_NEW(size_t size)
{
void *result;
result = malloc(size);
...
return result;
}
I'm not sure, what was the exact idea behind this design: obviously one would like to prifit from small object optimization of pymalloc, and for data this optimization would never work, but then one could use PyMem_RawMalloc instead of malloc. Maybe the goal was to be able to wrap numpy arrays around memory allocated by C-routines and take over the ownership of memory (but this will not work in some circumstances, see my comment at the end of this post).
This explains the behavior you are observing: For data (whose size is changing depending on the passed size-argument in) PyDataMem_NEW is used, which bypasses CPython's memory allocator and you see the original behavior of C-runtime's allocators.
One should try to avoid to mix different allocations/deallocations routines PyArray_malloc/PyDataMem_NEW'/mallocandPyArray_free/PyDataMem_FREE/free`: even if it works at OS+Python version at hand, it might fail for another combinations.
For example on Windows, when an extension is built with a different compiler version, one executable might have different memory allocators from different C-run-times and malloc/free might communicate with different C-memory-allocators, which could lead to hard to track down errors.

How large data can Python Ray handle?

Python Ray looks interesting for machine learning applications. However, I wonder how large Python Ray can handle. Is it limited by memory or can it actually handle data that exceeds memory?

It currently works best when the data fits in memory (if you're on a cluster, then that means the aggregate memory of the cluster). If the data exceeds the available memory, then Ray will evict the least recently used objects. If those objects are needed later on, they will be reconstructed by rerunning the tasks that created them.

.persist() line sometimes leads to Java Out of Heap Space error

As far as I know, when you use .persist(), writing the line persist sets only the persistence level, and then the next action in the script will cause the actual persistence work to be invoked.
However, sometimes, seemingly depending on the dataframe, persist() will lead to a Java out of heap space error.
What is the intended behavior of persist, and why could this simple line actually lead to this memory error?

The whole point of RDD Persistence is to store intermediate results in memory, allowing faster access on subsequent use. There are several different levels of persistence, ranging from MEMORY_ONLY (the default), over MEMORY_AND_DISK, up to DISK_ONLY. Persisting purely to memory means that there has to be enough heap space for the persist to work. If you run out of heap memory you can
go for a lower persistence level,
reduce the size of your partitions,
reduce the total number of persisted RDD stages, e.g., by unpersisting them.
Finding the right balance is one of the key challenges in Spark to achieve a good tradeoff between memory and CPU usage.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.