Python: minimizing memory usage with functions

I am writing code in which, at some point, I need to solve several generalized eigenvalue problems for large sparse matrices. Because these operations are essentially similar (only the names of the matrices involved change), I made a function:
import scipy.io
from scipy.sparse.linalg import eigsh

def eig_prob(myvariables):
    # this is just a simplified example
    name = 'iteration_' + myvariables["i"]
    A = myvariables["A"]
    B = myvariables["B"]
    N = myvariables["nb_eig"]
    Z, V = eigsh(A, N, B, sigma=1)
    # save in MATLAB format
    scipy.io.savemat(files["exec"] + name + ".mat", {"Z": Z, "V": V})
Since the function does not return anything to my main function, I would expect the amount of RAM in use to be the same before and after the call to eig_prob.
In fact, I observe that RAM consumption increases by about 800 MB during the call to eig_prob, which is expected, but this memory is not freed after the call, which surprises me.
Is there any explanation for this behavior? Can it be avoided? Do I need to run my function as a subprocess to avoid this overconsumption of memory?
edit: a colleague of mine pointed out that gc.collect() [1] may help, and it does! When called after the function, gc.collect() frees the 800 MB.
[1] https://docs.python.org/2/library/gc.html
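(For reference, the measurement described in the edit can be reproduced roughly as follows; psutil is an extra dependency not mentioned above, and eig_prob/myvariables are the names from the snippet.)

import gc
import psutil

def rss_mb():
    # resident set size of the current process, in MB
    return psutil.Process().memory_info().rss / 1e6

before = rss_mb()
eig_prob(myvariables)                      # the function from the question
print("retained after call:", rss_mb() - before, "MB")
gc.collect()                               # the suggestion from the edit above
print("retained after gc.collect():", rss_mb() - before, "MB")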

When a Python object is allocated, it is placed on the program's heap.
If it is a fairly large object, its memory will be allocated via mmap() for as long as it is needed and freed again afterwards (I am not sure whether that happens immediately).
For smaller objects, the brk() boundary of the process is shifted. In this case memory is allocated; if other objects are added afterwards and the former objects are freed, their memory is free on the heap but cannot be returned to the OS. Only after the end-most object on the heap is freed can part of the free area be returned to the OS.
You talk about 800 MB, which is clearly large enough that the mmap() method should be used, but if the data consists of thousands of smaller objects, chances are that they land on the brk() heap.
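If the memory absolutely has to go back to the OS regardless of which heap it ended up on, running the solver in a short-lived worker process is a reliable (if heavier) option. A minimal sketch using the standard multiprocessing module, reusing eig_prob and myvariables from the question:

from multiprocessing import Process

if __name__ == "__main__":
    # All memory of the child process is returned to the OS when it exits,
    # no matter how its allocator behaved internally; results are already
    # written to disk by eig_prob, so nothing needs to be returned.
    p = Process(target=eig_prob, args=(myvariables,))
    p.start()
    p.join()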

Related

How is memory handled once touched for the first time in numpy.zeros?

I recently saw that when creating a numpy array via np.empty or np.zeros, the memory of that numpy array is not actually allocated by the operating system as discussed in this answer (and this question), because numpy utilizes calloc to allocate the array's memory.
In fact, the OS isn't even "really" allocating that memory until you try to access it.
Therefore,
l = np.zeros(2**28)
does not increase the utilized memory the system reports, e.g., in htop.
Only once I touch the memory, for instance by executing
np.add(l, 0, out=l)
the utilized memory is increased.
Because of that behaviour I have got a couple of questions:
1. Is touched memory copied under the hood?
If I touch chunks of the memory only after a while, is the content of the numpy array copied under the hood by the operating system to guarantee that the memory is contiguous?
i = 100
f[:i] = 3
while True:
    ...        # Do stuff
    f[i] = ... # Once the memory "behind" the already allocated chunk of memory is filled
               # with other stuff, does the operating system reallocate the memory and
               # copy the already filled part of the array to the new location?
    i = i + 1
2. Touching the last element
As the memory of the numpy array is contiguous in memory, I thought
f[-1] = 3
might require the entire block of memory to be allocated (without touching the entire memory).
However, it does not: the utilized memory reported in htop does not increase by the size of the array.
Why is that not the case?
OS isn't even "really" allocating that memory until you try to access it
This depends on the target platform (typically the OS and its configuration). Some platforms allocate pages directly in physical memory (e.g., AFAIK the Xbox does, as do some embedded platforms). However, mainstream platforms do indeed defer the physical allocation.
1. Is touched memory copied under the hood?
If I touch chunks of the memory only after a while, is the content of the numpy array copied under the hood by the operating system to guarantee that the memory is contiguous?
Allocations are performed in virtual memory. When a given memory page (a fixed-size chunk, e.g. 4 KiB) is touched for the first time, the OS maps the virtual page to a physical one. So only one page will be physically mapped when you set only one item of the array (unless the item crosses two pages, which only happens in pathological cases).
The physical pages may not be contiguous for a contiguous set of virtual pages. However, this is not a problem and you should not care about it; it is mainly the job of the OS. That being said, modern processors have a dedicated unit called the TLB to translate virtual addresses (the ones you can see with a debugger) to physical ones, since this translation is relatively expensive and performance critical.
The content of the Numpy array is not reallocated or copied, thanks to paging (at least from the user's point of view, i.e. in virtual memory).
2. Touching the last element
I thought f[-1] = 3 might require the entire block of memory to be allocated (without touching the entire memory). However, it does not: the utilized memory in htop does not increase by the size of the array. Why is that not the case?
Only the last virtual-memory page associated with the Numpy array is mapped, again thanks to paging. This is why you do not see a big change in htop; however, you should see a slight change (the size of one page on your platform) if you look carefully. Otherwise, it means the page had already been mapped due to previously recycled allocations. Indeed, the allocation library can preallocate memory areas to speed up allocations (by reducing the number of slow requests to the OS). The library could also keep memory mapped after Numpy frees it in order to speed up the next allocations (since the memory does not have to be unmapped and then mapped again). This is unlikely to happen for huge arrays in practice, because the impact on memory consumption would be too expensive.
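The page-by-page behaviour described above can be observed directly; a rough sketch (psutil is an assumption, and the numbers are for float64 with a 4 KiB page size, so they will vary between platforms):

import numpy as np
import psutil

def rss_mb():
    return psutil.Process().memory_info().rss / 1e6

base = rss_mb()
a = np.zeros(2**28)                 # 2 GiB of float64, virtual memory only
print("after np.zeros:", round(rss_mb() - base), "MB")   # ~0 MB

a[-1] = 3                           # first touch of the last page only
print("after a[-1]=3 :", round(rss_mb() - base), "MB")   # still ~0 MB (one extra page)

a += 1                              # touches every page of the array
print("after a += 1  :", round(rss_mb() - base), "MB")   # roughly the full 2 GiB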

CuPy random - how to generate new random set in same memory?

I am generating a large array of random numbers, totaling more than half the available memory on a GPU. I am doing this in a loop.
When I call cupy.random the second time (or third time...), assigning to the same variable name, it does not free the memory of the first array. It tries to allocate more memory, which causes an out-of-memory error.
Explicitly freeing the memory before generating a new random array is very slow, and seems inefficient.
Is there a way to generate a new set of numbers, but in the same memory space?
Edit: cupy.random.shuffle() is letting me work around the problem, but I wonder if there is a better way?
Edit 2: on further review, shuffle() does not address the problem, and appears to need even more memory than allocating a second block of memory (before freeing the first)... I am back to restricting the ndarray size to less than half the remaining memory, so two ndarrays can be allocated alternately.
As user2357112 suggests, cupy.random.random() does not appear to support "re-randomizing" an existing ndarray, even though cuRAND does. Writing C to modify an existing CuPy array somewhat defeats the point of using Python / CuPy in the first place.
Curiously, having an array about one third the size of available memory, while increasing the number of loops, is faster in total execution time (versus larger arrays and fewer loops). I was not able to determine when CuPy (or Python, or CUDA?) garbage-collects the disused array, but it seems to happen asynchronously.
If GPU garbage collection uses CUDA cores (I presume it does?), it does not appear to materially affect my code's execution time. nvidia-smi reports the "P2" performance state while my calculations are running, suggesting there are still cores available for CuPy / CUDA to free memory outside of my code?
I don’t like answering my own question... just sharing what I found in case it helps someone else
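One pattern worth trying, if the cost of releasing memory between iterations is acceptable, is to drop the old array and return its blocks from CuPy's default memory pool before drawing the next one; a rough sketch (the array size is a placeholder):

import cupy

pool = cupy.get_default_memory_pool()
size = 2**28          # placeholder: choose something that fits on your GPU

arr = None
for _ in range(10):
    del arr                    # drop the reference to the previous draw
    pool.free_all_blocks()     # return the pooled blocks before reallocating
    arr = cupy.random.random(size)
    # ... use arr ...

This trades some speed for headroom, which matches the observation above that explicit freeing is slow but smaller arrays with more loop iterations can still win overall.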

At what point am I using too much memory on a Mac?

I've tried really hard to figure out why my Python process is using 8 GB of memory. I've even used gc.get_objects() and measured the size of each object, and only one of them was larger than 10 MB. Still, all of the objects (there were about 100,000 of them) added up to 5.5 GB. On the other hand, my computer is working fine, and the program is running at a reasonable speed. So is the fact that I'm using so much memory cause for concern?
As @bnaecker said, this doesn't have a simple (i.e., yes/no) answer. It's only a problem if the combined RSS (resident set size) of all running processes exceeds the available memory, thus causing excessive demand paging.
You didn't say how you calculated the size of each object. Hopefully it was by using sys.getsizeof(), which should accurately include the overhead associated with each object. If you used some other method (such as calling the __sizeof__() method directly), then your answer will be far lower than the correct value. However, even sys.getsizeof() won't account for wasted space due to memory alignment. For example, consider this experiment (using Python 3.6 on macOS):
In [25]: x='x'*8193
In [26]: sys.getsizeof(x)
Out[26]: 8242
In [28]: 8242/4
Out[28]: 2060.5
Notice that last value. It implies that the object is using 2060 and a half words of memory, which is wrong since all allocations consume a multiple of the word size. In fact, it looks to me like sys.getsizeof() does not correctly account for word alignment and padding of either the underlying object or the data structure that describes the object, which means the value is smaller than the amount of memory actually used by the object. Multiplied across 100,000 objects, that could represent a substantial amount of memory.
Also, many memory allocators will round up large allocations to a page size (typically a multiple of 4 KiB), which results in "wasted" space that is probably not going to be included in the sys.getsizeof() return value.
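To see how far apart the two views can be, one can compare the interpreter-level sum against what the OS charges the process; a sketch (psutil is an assumption, and gc.get_objects() only sees container objects tracked by the collector, so simple objects such as ints and strings are not counted at all):

import gc
import sys
import psutil

tracked = gc.get_objects()
py_total = sum(sys.getsizeof(o) for o in tracked)
rss = psutil.Process().memory_info().rss

print(f"{len(tracked)} tracked objects, sys.getsizeof() total: {py_total / 1e6:.1f} MB")
print(f"resident set size reported by the OS: {rss / 1e6:.1f} MB")
# The RSS is usually noticeably larger: alignment, allocator overhead,
# freelists and page rounding are all invisible to sys.getsizeof().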

What is python's strategy to manage allocation/freeing of large variables?

As a follow-up to this question, it appears that there are different allocation/deallocation strategies for small and large variables in (C)Python.
More precisely, there seems to be a boundary in the object size above which the memory used by the allocated object can be given back to the OS. Below this size, the memory is not given back to the OS.
To quote the answer taken from the Numpy policy for releasing memory:
The exception is that for large single allocations (e.g. if you create a multi-megabyte array), a different mechanism is used. Such large memory allocations can be released back to the OS. So it might specifically be the non-numpy parts of your program that are producing the issues you see.
Indeed, these two allocation strategies are easy to show. For example:
1st strategy: no memory is given back to the OS
import numpy as np
import psutil
import gc
# Allocate array
x = np.random.uniform(0,1, size=(10**4))
# gc
del x
gc.collect()
# We go from 41295.872 KB to 41295.872 KB
# using psutil.Process().memory_info().rss / 10**3; same behavior for VMS
=> No memory given back to the OS
2nd strategy: freed memory is given back to the OS
When doing the same experiment, but with a bigger array:
x = np.random.uniform(0,1, size=(10**5))
del x
gc.collect()
# We go from 41582.592 KB to 41017.344 KB
=> Memory is released to the OS
It seems that objects approximately bigger than 8*10**4 bytes get allocated using the 2nd strategy.
So:
Is this behavior documented? (And what is the exact boundary at which the allocation strategy changes?)
What are the internals of these strategies (beyond assuming the use of mmap/munmap to release the memory back to the OS)?
Is this 100% done by the Python runtime, or does Numpy have a specific way of handling this? (The numpy docs mention NPY_USE_PYMEM, which switches between memory allocators.)
What you observe isn't CPython's strategy, but the strategy of the memory allocator which comes with the C-runtime your CPython-version is using.
When CPython allocates/deallocates memory via malloc/free, it doesn't communicate directly with the underlying OS, but with a concrete memory-allocator implementation. In my case, on Linux, it is the GNU Allocator.
The GNU Allocator has different so-called arenas, where memory isn't returned to the OS but kept so it can be reused without the need to communicate with the OS. However, if a large amount of memory is requested (whatever the definition of "large"), the allocator doesn't use memory from the arenas but requests it from the OS, and as a consequence can give it back to the OS directly once free is called.
CPython has its own memory allocator, pymalloc, which is built on top of the C runtime allocator. It is optimized for small objects, which live in a special arena; there is less overhead when creating/freeing these objects compared to the underlying C runtime allocator. However, objects bigger than 512 bytes don't use this arena, but are managed directly by the C runtime allocator.
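The 512-byte cutoff can be poked at from Python itself: sys._debugmallocstats() (present since Python 3.3, though a private, CPython-specific helper) dumps pymalloc's state to stderr, and its size classes stop at 512 bytes; requests above that go straight to the C runtime allocator. A small illustration:

import sys

small = b"x" * 100       # sys.getsizeof() well under 512 bytes -> served by pymalloc
large = b"x" * 10_000    # well over 512 bytes -> handed to the C allocator (malloc)
print(sys.getsizeof(small), sys.getsizeof(large))

# Dump pymalloc's arenas, pools and size classes to stderr;
# note that the listed block sizes end at 512 bytes.
sys._debugmallocstats()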
The situation is even more complex with numpy's arrays, because different memory allocators are used for the metadata (like shape, datatype and other flags) and for the actual data itself:
For the metadata, PyArray_malloc, i.e. CPython's memory allocator (pymalloc), is used.
For the data itself, PyDataMem_NEW is used, which utilizes the underlying C runtime functionality directly:
NPY_NO_EXPORT void *
PyDataMem_NEW(size_t size)
{
    void *result;

    result = malloc(size);
    ...
    return result;
}
I'm not sure what the exact idea behind this design was: obviously one would like to profit from pymalloc's small-object optimization, and for the data this optimization would never work; but then one could have used PyMem_RawMalloc instead of malloc. Maybe the goal was to be able to wrap numpy arrays around memory allocated by C routines and take over ownership of that memory (but this will not work in some circumstances; see my comment at the end of this post).
This explains the behavior you are observing: for the data (whose size changes depending on the size argument passed in), PyDataMem_NEW is used, which bypasses CPython's memory allocator, so you see the original behavior of the C runtime's allocator.
One should try to avoid mixing the different allocation/deallocation routines (PyArray_malloc/PyDataMem_NEW/malloc and PyArray_free/PyDataMem_FREE/free): even if it works with the OS and Python version at hand, it might fail for another combination.
For example, on Windows, when an extension is built with a different compiler version, one executable might end up with different memory allocators from different C runtimes, and malloc/free might communicate with different C memory allocators, which can lead to hard-to-track-down errors.
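Since the threshold belongs to the C runtime allocator rather than to CPython or NumPy (for glibc, the default mmap threshold is 128 KiB, which is consistent with the 8*10**4-byte estimate above), it can be probed empirically with the same RSS measurement used in the question. A rough sketch, noisy for small sizes:

import gc
import numpy as np
import psutil

proc = psutil.Process()

def growth_and_release(n_elements):
    """Allocate a float64 array, free it, and report RSS growth / RSS released (bytes)."""
    gc.collect()
    before = proc.memory_info().rss
    x = np.random.uniform(0, 1, size=n_elements)
    grown = proc.memory_info().rss - before
    del x
    gc.collect()
    released = before + grown - proc.memory_info().rss
    return grown, released

for n in (10**3, 10**4, 10**5, 10**6):
    grown, released = growth_and_release(n)
    print(f"{n:>8} elements: grew {grown / 1e3:7.0f} KB, released {released / 1e3:7.0f} KB")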

CPython memory allocation

This is a post inspired by this comment about how memory is allocated for objects in CPython. Originally, this was in the context of creating a list and appending to it in a for loop vis-à-vis a list comprehension.
So here are my questions:
how many different allocators are there in CPython?
what is the function of each?
when is malloc actually called? (a list comprehension may not result in a call to malloc, based on what's said in this comment)
How much memory does python allocate for itself at startup?
are there rules governing which data structures get first "dibs" on this memory?
What happens to the memory used by an object when it is deleted (does python still hold on to the memory to allocate to another object in the future, or does the GC free up the memory for another process, say Google Chrome, to use)?
When is a GC triggered?
lists are dynamic arrays, which means they need a contiguous piece of memory. This means that if I try to append an object into a list, whose underlying-C-data-structure array cannot be extended, the array is copied over onto a different part of memory, where a larger contiguous block is available. So how much space is allocated to this array when I initialize a list?
how much extra space is allocated to the new array, which now holds the old list and the appended object?
EDIT: From the comments, I gather that there are far too many questions here. I only did this because these questions are all pretty related. Still, I'd be happy to split this up into several posts if that is the case (please let me know to do so in the comments)
Much of this is answered in the Memory Management chapter of the C API documentation.
Some of the documentation is vaguer than you're asking for. For further details, you'd have to turn to the source code. And nobody's going to be willing to do that unless you pick a specific version. (At least 2.7.5, pre-2.7.6, 3.3.2, pre-3.3.3, and pre-3.4 would be interesting to different people.)
The source to the obmalloc.c file is a good starting place for many of your questions, and the comments at the top have a nice little ASCII-art graph:
    Object-specific allocators
    _____   ______   ______       ________
   [ int ] [ dict ] [ list ] ... [ string ]       Python core         |
+3 | <----- Object-specific memory -----> | <-- Non-object memory --> |
    _______________________________       |                           |
   [   Python's object allocator   ]      |                           |
+2 | ####### Object memory ####### | <------ Internal buffers ------> |
    ______________________________________________________________    |
   [          Python's raw memory allocator (PyMem_ API)          ]   |
+1 | <----- Python memory (under PyMem manager's control) ------> |   |
    __________________________________________________________________
   [    Underlying general-purpose allocator (ex: C library malloc)   ]
 0 | <------ Virtual memory allocated for the python process -------> |

   =========================================================================
    _______________________________________________________________________
   [                OS-specific Virtual Memory Manager (VMM)               ]
-1 | <--- Kernel dynamic storage allocation & management (page-based) ---> |
    __________________________________   __________________________________
   [                                  ] [                                  ]
-2 | <-- Physical memory: ROM/RAM --> | | <-- Secondary storage (swap) --> |
how many different allocators are there in CPython?
According to the docs, "several". You could count up the ones in the builtin and stdlib types, then add the handful of generic ones, if you really wanted. But I'm not sure what it would tell you. (And it would be pretty version-specific. IIRC, the exact number even changed within the 3.3 tree, as there was an experiment with whether the new-style strings should use three different allocators or one.)
what is the function of each?
The object-specific allocators at level +3 are for specific use cases that are worth optimizing. As the docs say:
For example, integer objects are managed differently within the heap than strings, tuples or dictionaries because integers imply different storage requirements and speed/space tradeoffs.
Below that, there are various generic supporting allocators at level +2 (and +1.5 and maybe +2.5)—at least an object allocator, an arena allocator, and a small-block allocator, etc.—but all but the first are private implementation details (meaning private even to the C-API; obviously all of it is private to Python code).
And below that, there's the raw allocator, whose function is to ask the OS for more memory when the higher-level allocators need it.
when is malloc actually called?
The raw memory allocator (or its heap manager) should be the only thing that ever calls malloc. (In fact, it might not even necessarily call malloc; it might use functions like mmap or VirtualAlloc instead. But the point is that it's the only thing that ever asks the OS for memory.) There are a few exceptions within the core of Python, but they'll rarely be relevant.
The docs explicitly say that higher-level code should never try to operate on Python objects in memory obtained from malloc.
However, there are plenty of stdlib and extension modules that use malloc for purposes besides Python objects.
For example, a numpy array of 1000x1000 int32 values doesn't allocate 1 million Python ints, so it doesn't have to go through the int allocator. Instead, it just mallocs an array of 1 million C ints, and wraps them up in Python objects as needed when you access them.
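A rough illustration of that difference (the exact numbers depend on platform and Python version):

import sys
import numpy as np

n = 1_000_000
as_array = np.arange(n, dtype=np.int32)   # one malloc'ed buffer of C ints
as_list = list(range(n))                  # one list object plus a million int objects

print("numpy buffer:      ", as_array.nbytes, "bytes")                            # ~4 MB
print("list + Python ints:",
      sys.getsizeof(as_list) + sum(sys.getsizeof(i) for i in as_list), "bytes")   # tens of MB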
How much memory does python allocate for itself at startup?
This is platform-specific, and a bit hard to figure out from the code. However, when I launch a new python3.3 interpreter on my 64-bit Mac, it starts off with 13.1 MB of virtual memory and almost immediately expands to 201 MB. So, that should be a rough ballpark guide.
are there rules governing which data structures get first "dibs" on this memory?
Not really, no. A malicious or buggy object-specific allocator could immediately grab all of the pre-allocated memory and more, and there's nothing to stop it.
What happens to the memory used by an object when it is deleted (does python still hold on to the memory to allocate to another object in the future, or does the GC free up the memory for another process, say Google Chrome, to use)?
It goes back to the object-specific allocator, which may keep it on a freelist, or release it to the raw allocator, which keeps its own freelist. The raw allocator almost never releases memory back to the OS.
This is because there's usually no good reason to release memory back to a modern OS. If you have a ton of unused pages lying around, the OS's VM will just page them out if another process needs it. And when there is a good reason, it's almost always application-specific, and the simplest solution is to use multiple processes to manage your huge short-lived memory requirements.
When is a GC triggered?
It depends on what you mean by "a GC".
CPython uses refcounting; every time you release a reference to an object (by rebinding a variable or a slot in a collection, letting a variable go out of scope, etc.), if it was the last reference, it will be cleaned up immediately. This is explained in the Reference Counting section in the docs.
However, there's a problem with refcounting: if two objects reference each other, even when all outside references go away, they still won't get cleaned up. So, CPython has always had a cycle collector that periodically walks objects looking for cycles of objects that reference each other, but have no outside references. (It's a little more complicated, but that's the basic idea.) This is fully explained in the docs for the gc module. The collector can run when you ask it to explicitly, when the freelists are getting low, or when it hasn't run in a long time; this is dynamic and to some extent configurable, so it's hard to give a specific answer to "when".
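A minimal illustration of the two mechanisms (the Node class is made up for the example):

import gc

class Node:
    def __init__(self):
        self.other = None

# Refcounting alone: the object is freed the moment its last reference goes away.
a = Node()
a = None                # deallocated immediately, no collector involved

# A reference cycle: each object keeps the other alive with no outside references.
a, b = Node(), Node()
a.other, b.other = b, a
a = b = None            # refcounts never reach zero...
print(gc.collect())     # ...until the cycle collector runs; prints how many unreachable objects it found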
lists are dynamic arrays, which means they need a contiguous piece of memory. This means that if I try to append an object into a list, whose underlying-C-data-structure array cannot be extended, the array is copied over onto a different part of memory, where a larger contiguous block is available. So how much space is allocated to this array when I initialize a list?
The code for this is mostly inside listobject.c. It's complicated; there are a bunch of special cases, like the code used by timsort for creating temporary intermediate lists and for non-in-place sorting. But ultimately, some piece of code decides it needs room for N pointers.
It's also not particularly interesting. Most lists are either never expanded or expanded far beyond the original size, so doing extra allocation at the start wastes memory for static lists and doesn't help much for most growing lists. So, Python plays it conservative. I believe it starts by looking through its internal freelist for a block that's not too much bigger than N pointers (it might also consolidate adjacent freed list storage; I don't know whether it does), so it might overallocate a little bit occasionally, but generally it doesn't. The exact code should be in PyList_New.
At any rate, if there's no space in the list allocator's freelist, it drops down to the object allocator, and so on through the levels; it may end up hitting level 0, but usually it doesn't.
how much extra space is allocated to the new array, which now holds the old list and the appended object?
This is handled in list_resize, and this is the interesting part.
The only way to avoid list.append being quadratic is to overallocate geometrically. Overallocating by too small a factor (like 1.2) wastes way too much time for the first few expansions; using too large a factor (like 1.6) wastes way too much space for very large arrays. Python handles this by using a sequence that starts off at 2.0 but quickly converges toward somewhere around 1.25. According to the 3.3 source:
The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
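That pattern can be watched from Python by noting when sys.getsizeof() of a growing list jumps (the exact capacities vary by CPython version):

import sys

lst = []
last = sys.getsizeof(lst)
for i in range(100):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last:
        # A jump means list_resize() just overallocated a new, larger block.
        print(f"len {len(lst):>3}: {size} bytes")
        last = size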
You didn't ask specifically about sorted, but I know that's what prompted you.
Remember that timsort is primarily a merge sort, with an insertion sort for small sublists that aren't already sorted. So, most of its operations involve allocating a new list of about size 2N and freeing two lists of about size N. So, it can be almost as space- and allocation-efficient when copying as it would be in-place. There is up to O(log N) waste, but this usually isn't the factor that makes a copying sort slower.
