CPython internal structures - python

GAE has various limitations, one of which is that the largest allocatable block of memory is 1 MB (now ten times more, but that doesn't change the question). The limitation means that one cannot put more than some number of items in a list(), as CPython would try to allocate a contiguous memory block for the element pointers. Having huge list()s can be considered bad programming practice, but even if no huge structure is created in the program itself, CPython maintains some behind the scenes.
It appears that CPython maintains a single global list of objects or something similar. That is, an application with many small objects tends to allocate bigger and bigger single blocks of memory.
My first idea was gc, and disabling it changes the application's behavior a bit, but some structures are still maintained.
The simplest short application that experiences the issue is:
a = b = []
number_of_lists = 8000000
for i in xrange(number_of_lists):
    b.append([])
    b = b[0]
Can anyone enlighten me how to prevent CPython from allocating huge internal structures when having many objects in application?

On a 32-bit system, each of the 8000000 lists you create will allocate 20 bytes for the list object itself, plus 16 bytes for a vector of list elements. So you are trying to allocate at least (20+16) * 8000000 = 288000000 bytes, about 275 MB. And that's in the best case, if the system malloc only allocates exactly as much memory as requested.
I calculated the size of the list object as follows:
2 Pointers in the PyListObject structure itself (see listobject.h)
1 Pointer and one Py_ssize_t for the PyObject_HEAD part of the list object (see object.h)
one Py_ssize_t for the PyObject_VAR_HEAD (also in object.h)
The vector of list elements is slightly overallocated to avoid having to resize it at each append - see list_resize in listobject.c. The sizes are 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ... Thus, your one-element lists will allocate room for 4 elements.
Your data structure is a somewhat pathological example, paying the price of a variable-sized list object without utilizing it - all your lists have only a single element. You could avoid the 12 bytes overallocation by using tuples instead of lists, but to further reduce the memory consumption, you will have to use a different data structure that uses fewer objects. It's hard to be more specific, as I don't know what you are trying to accomplish.
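The per-object costs described above can be measured directly with sys.getsizeof. A small sketch (sizes are CPython- and version-specific; the 20- and 16-byte figures above are for a 32-bit build, and a 64-bit interpreter will report larger numbers):

```python
import sys

# Per-object sizes on this interpreter. Exact numbers vary by CPython
# version and pointer width, but the tuple-vs-list gap is consistent.
empty_list = []
one_elem_list = [None]     # list displays allocate exactly as many slots as needed
empty_tuple = ()
one_elem_tuple = (None,)

print(sys.getsizeof(empty_list))     # list header alone
print(sys.getsizeof(one_elem_list))  # header plus one pointer slot
print(sys.getsizeof(empty_tuple))    # no resize machinery, so smaller
print(sys.getsizeof(one_elem_tuple))
```

A one-element tuple is smaller than a one-element list, which is why the answer suggests tuples when the structure never grows.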

I'm a bit confused as to what you're asking. In that code example, nothing should be garbage collected, as you're never actually killing off any references. You're holding a reference to the top level list in a and you're adding nested lists (held in b at each iteration) inside of that. If you remove the 'a =', then you've got unreferenced objects.
Edit: In response to the first part, yes, Python holds a list of objects so it can know what to cull. Is that the whole question? If not, comment/edit your question and I'll do my best to help fill in the gaps.

What are you trying to accomplish with the
a = b = []
and
b = b[0]
statements? It's certainly odd to see statements like that in Python, because they don't do what you might naively expect: in that example, a and b are two names for the same list (think pointers in C). If you're doing a lot of manipulation like that, it's easy to confuse the garbage collector (and yourself!) because you've got a lot of strange references floating around that haven't been properly cleared.
It's hard to diagnose what's wrong with that code without knowing why you want to do what it appears to be doing. Sure, it exposes a bit of interpreter weirdness... but I'm guessing you're approaching your problem in an odd way, and a more Pythonic approach might yield better results.
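To make the aliasing concrete, here is a minimal sketch of what a = b = [] actually does — two names bound to one list object, so rebinding one name never affects the other:

```python
a = b = []       # one list object, two names (think two pointers in C)
b.append(1)
assert a is b    # still the same object
assert a == [1]  # a mutation through b is visible through a

b = b[0]         # rebinds the *name* b; the list a refers to is unchanged
assert a == [1]
assert b == 1
```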

So that you're aware of it, Python has its own allocator. You can disable it using --without-pymalloc during the configure step.
However, the largest arena is 256KB so that shouldn't be the problem. You can also compile Python with debugging enabled, using --with-pydebug. This would give you more information about memory use.
I suspect your hunch is right, and I'm sure oefe's diagnosis is correct. A list uses contiguous memory, so if your list gets too large for a system arena then you're out of luck. If you're really adventurous you can reimplement PyList to use multiple blocks, but that's going to be a lot of work since various bits of Python expect contiguous data.

Related

How to know if you have a non-contiguous list in Python?

For arbitrarily large N, it isn't possible to store all the data in a list contiguously in memory.
For instance, in Python, if I do arr = [0] * N, for large enough N, this can't be contiguous.
What does Python do for this? I assume it gets stored non-contiguously. How does that work?
CPython lists are always contiguous, at least in virtual memory. (There's little they could do to reasonably control physical contiguity and little reason to try.) CPython makes no attempt to break lists into noncontiguous segments in the face of memory fragmentation or anything like that.
If you want to see for yourself, take a look at Include/listobject.h and Objects/listobject.c. There's nothing in there about noncontiguous lists.
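One indirect way to see the contiguous pointer array from pure Python (a CPython-specific sketch, not a proof): the size of a list object grows by exactly one pointer per element, as you would expect of a single flat array.

```python
import sys

ptr = 8 if sys.maxsize > 2**32 else 4  # pointer width on this build

# Lists built with * are allocated at exactly n slots in CPython,
# so the object size should grow by exactly one pointer per element.
small = sys.getsizeof([None] * 1000)
large = sys.getsizeof([None] * 2000)
assert large - small == 1000 * ptr
```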

Python list memory reallocation issue

If I am using CPython or Jython (in Python 2.7), will the list ([]) data structure suffer from a memory-reallocation issue as I keep adding new elements, the way Java's ArrayList does? (Since ArrayList requires a contiguous memory block, once the current pre-allocated space is full it must allocate a new, larger contiguous block and move the existing elements into it.)
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/ArrayList.java#ArrayList.ensureCapacity%28int%29
regards,
Lin
The basic story, at least for the main Python, is that a list contains pointers to objects stored elsewhere in memory. The list is created with a certain amount of free space (e.g. room for 8 pointers). When that fills up, it allocates more memory, and so on. Whether it moves the pointers from one block of memory to another is a detail most users ignore. In practice we just append/extend a list as needed and don't worry about memory use.
Why does creating a list from a list make it larger?
I assume jython uses the same approach, but you'd have to dig into its code to see how that translates to Java.
I mostly answer numpy questions. This is a numerical package that creates fixed sized multidimensional arrays. If a user needs to build such an array incrementally, we often recommend that they start with a list and append values. At the end they create the array. Appending to a list is much cheaper than rebuilding an array multiple times.
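The grow-on-demand behaviour described above can be watched from Python with sys.getsizeof (a CPython-specific sketch; exact sizes differ between versions): the reported size stays flat while appends land in spare slots, and jumps only when the pointer array is reallocated.

```python
import sys

sizes = []
lst = []
for _ in range(64):
    lst.append(None)
    sizes.append(sys.getsizeof(lst))

# Most appends reuse spare, already-allocated slots, so there are far
# fewer distinct sizes (i.e. reallocations) than appends.
print(sorted(set(sizes)))
assert len(set(sizes)) < len(sizes) // 4
```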
Internally, Python lists are arrays of pointers, as mentioned by hpaulj.
The next question, then, is how you can extend an array in C, which the linked answer explains can be done with the realloc function.
That led me to look into the behavior of realloc, whose documentation mentions:
The function may move the memory block to a new location (whose address is returned by the function).
From this, my understanding is that the array is extended in place if contiguous memory is available; otherwise the memory block (containing the pointer array, not the list object itself) is copied to a newly allocated, larger block.
This is my understanding; corrections are welcome if I am wrong.

Why is ''.join() faster than += in Python?

I'm able to find a bevy of information online (on Stack Overflow and otherwise) about how it's a very inefficient and bad practice to use + or += for concatenation in Python.
I can't seem to find WHY += is so inefficient. Outside of a mention here that "it's been optimized for 20% improvement in certain cases" (still not clear what those cases are), I can't find any additional information.
What is happening on a more technical level that makes ''.join() superior to other Python concatenation methods?
Let's say you have this code to build up a string from three strings:
x = 'foo'
x += 'bar' # 'foobar'
x += 'baz' # 'foobarbaz'
In this case, Python first needs to allocate and create 'foobar' before it can allocate and create 'foobarbaz'.
So for each += that gets called, the entire contents of the string and whatever is getting added to it need to be copied into an entirely new memory buffer. In other words, if you have N strings to be joined, you need to allocate approximately N temporary strings and the first substring gets copied ~N times. The last substring only gets copied once, but on average, each substring gets copied ~N/2 times.
With .join, Python can play a number of tricks since the intermediate strings do not need to be created. CPython figures out how much memory it needs up front and then allocates a correctly-sized buffer. Finally, it then copies each piece into the new buffer which means that each piece is only copied once.
There are other viable approaches which could lead to better performance for += in some cases. E.g. if the internal string representation is actually a rope or if the runtime is actually smart enough to somehow figure out that the temporary strings are of no use to the program and optimize them away.
However, CPython certainly does not do these optimizations reliably (though it may for a few corner cases) and since it is the most common implementation in use, many best-practices are based on what works well for CPython. Having a standardized set of norms also makes it easier for other implementations to focus their optimization efforts as well.
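A quick way to compare the two approaches (a sketch; absolute timings depend on the build, and CPython can sometimes resize a string in place when it holds the only reference, which softens the worst case for +=):

```python
import timeit

pieces = ['x'] * 10000

def concat_plus(parts):
    s = ''
    for p in parts:
        s = s + p          # each iteration copies everything built so far
    return s

def concat_join(parts):
    return ''.join(parts)  # total size computed up front, one copy per piece

assert concat_plus(pieces) == concat_join(pieces)

t_plus = timeit.timeit(lambda: concat_plus(pieces), number=20)
t_join = timeit.timeit(lambda: concat_join(pieces), number=20)
print('+=: %.3fs  join: %.3fs' % (t_plus, t_join))
```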
I think this behaviour is best explained in Lua's string buffer chapter.
To rewrite that explanation in context of Python, let's start with an innocent code snippet (a derivative of the one at Lua's docs):
s = ""
for l in some_list:
    s += l
Assume that each l is 20 bytes and s has already grown to 50 KB. When Python concatenates s + l, it creates a new string of 50,020 bytes and copies the 50 KB of s into this new string. That is, for each new element, the program moves 50 KB of memory, and growing. After appending 100 new elements (only 2 KB), the snippet has already moved more than 5 MB of memory. To make things worse, after the assignment
s += l
the old string is now garbage. After two loop cycles, there are two old strings making a total of more than 100 KB of garbage. So, the language runtime decides to run its garbage collector and frees those 100 KB. The problem is that this will happen every two cycles, and the program will run its garbage collector two thousand times before reading the whole list. Even with all this work, its memory usage will be a large multiple of the list's size.
And, at the end:
This problem is not peculiar to Lua: Other languages with true garbage
collection, and where strings are immutable objects, present a similar
behavior, Java being the most famous example. (Java offers the
structure StringBuffer to ameliorate the problem.)
Python strings are also immutable objects.
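A two-line sketch of that immutability — any attempt to modify a string in place fails, and concatenation always produces a new object:

```python
s = 'foo'
try:
    s[0] = 'F'       # strings do not support item assignment
except TypeError:
    print('strings are immutable')

t = s + 'bar'        # builds a brand-new string object
assert s == 'foo'    # the original is untouched
assert t == 'foobar'
```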

CPython memory allocation

This is a post inspired from this comment about how memory is allocated for objects in CPython. Originally, this was in the context of creating a list and appending to it in a for loop vis a vis a list comprehension.
So here are my questions:
how many different allocators are there in CPython?
what is the function of each?
when is malloc actually called? (a list comprehension may not result in a call to malloc, based on what's said in this comment)
How much memory does python allocate for itself at startup?
are there rules governing which data structures get first "dibs" on this memory?
What happens to the memory used by an object when it is deleted (does python still hold on to the memory to allocate to another object in the future, or does the GC free up the memory for another process, say Google Chrome, to use)?
When is a GC triggered?
lists are dynamic arrays, which means they need a contiguous piece of memory. This means that if I try to append an object into a list, whose underlying-C-data-structure array cannot be extended, the array is copied over onto a different part of memory, where a larger contiguous block is available. So how much space is allocated to this array when I initialize a list?
how much extra space is allocated to the new array, which now holds the old list and the appended object?
EDIT: From the comments, I gather that there are far too many questions here. I only did this because these questions are all pretty related. Still, I'd be happy to split this up into several posts if that is the case (please let me know to do so in the comments)
Much of this is answered in the Memory Management chapter of the C API documentation.
Some of the documentation is vaguer than you're asking for. For further details, you'd have to turn to the source code. And nobody's going to be willing to do that unless you pick a specific version. (At least 2.7.5, pre-2.7.6, 3.3.2, pre-3.3.3, and pre-3.4 would be interesting to different people.)
The source to the obmalloc.c file is a good starting place for many of your questions, and the comments at the top have a nice little ASCII-art graph:
    Object-specific allocators
        _____   ______   ______       ________
       [ int ] [ dict ] [ list ] ... [ string ]       Python core         |
    +3 | <----- Object-specific memory -----> | <-- Non-object memory --> |
        _______________________________       |                           |
       [   Python's object allocator   ]      |                           |
    +2 | ####### Object memory ####### | <------ Internal buffers ------> |
        ______________________________________________________________    |
       [          Python's raw memory allocator (PyMem_ API)          ]   |
    +1 | <----- Python memory (under PyMem manager's control) ------> |   |
        __________________________________________________________________
       [    Underlying general-purpose allocator (ex: C library malloc)   ]
     0 | <------ Virtual memory allocated for the python process -------> |

       =========================================================================
        _______________________________________________________________________
       [                OS-specific Virtual Memory Manager (VMM)               ]
    -1 | <--- Kernel dynamic storage allocation & management (page-based) ---> |
        __________________________________   __________________________________
       [                                  ] [                                  ]
    -2 | <-- Physical memory: ROM/RAM --> | | <-- Secondary storage (swap) --> |
how many different allocators are there in CPython?
According to the docs, "several". You could count up the ones in the builtin and stdlib types, then add the handful of generic ones, if you really wanted. But I'm not sure what it would tell you. (And it would be pretty version-specific. IIRC, the exact number even changed within the 3.3 tree, as there was an experiment with whether the new-style strings should use three different allocators or one.)
what is the function of each?
The object-specific allocators at level +3 are for specific use cases that are worth optimizing. As the docs say:
For example, integer objects are managed differently within the heap than strings, tuples or dictionaries because integers imply different storage requirements and speed/space tradeoffs.
Below that, there are various generic supporting allocators at level +2 (and +1.5 and maybe +2.5)—at least an object allocator, an arena allocator, and a small-block allocator, etc.—but all but the first are private implementation details (meaning private even to the C-API; obviously all of it is private to Python code).
And below that, there's the raw allocator, whose function is to ask the OS for more memory when the higher-level allocators need it.
when is malloc actually called?
The raw memory allocator (or its heap manager) should be the only thing that ever calls malloc. (In fact, it might not even necessarily call malloc; it might use functions like mmap or VirtualAlloc instead. But the point is that it's the only thing that ever asks the OS for memory.) There are a few exceptions within the core of Python, but they'll rarely be relevant.
The docs explicitly say that higher-level code should never try to operate on Python objects in memory obtained from malloc.
However, there are plenty of stdlib and extension modules that use malloc for purposes besides Python objects.
For example, a numpy array of 1000x1000 int32 values doesn't allocate 1 million Python ints, so it doesn't have to go through the int allocator. Instead, it just mallocs an array of 1 million C ints, and wraps them up in Python objects as needed when you access them.
How much memory does python allocate for itself at startup?
This is platform-specific, and a bit hard to figure out from the code. However, when I launch a new python3.3 interpreter on my 64-bit Mac, it starts off with 13.1MB of virtual memory, and almost immediately expands to 201MB. So, that should be a rough ballpark guide.
are there rules governing which data structures get first "dibs" on this memory?
Not really, no. A malicious or buggy object-specific allocator could immediately grab all of the pre-allocated memory and more, and there's nothing to stop it.
What happens to the memory used by an object when it is deleted (does python still hold on to the memory to allocate to another object in the future, or does the GC free up the memory for another process, say Google Chrome, to use)?
It goes back to the object-specific allocator, which may keep it on a freelist, or release it to the raw allocator, which keeps its own freelist. The raw allocator almost never releases memory back to the OS.
This is because there's usually no good reason to release memory back to a modern OS. If you have a ton of unused pages lying around, the OS's VM will just page them out if another process needs it. And when there is a good reason, it's almost always application-specific, and the simplest solution is to use multiple processes to manage your huge short-lived memory requirements.
When is a GC triggered?
It depends on what you mean by "a GC".
CPython uses refcounting; every time you release a reference to an object (by rebinding a variable or a slot in a collection, letting a variable go out of scope, etc.), if it was the last reference, it will be cleaned up immediately. This is explained in the Reference Counting section in the docs.
However, there's a problem with refcounting: if two objects reference each other, even when all outside references go away, they still won't get cleaned up. So, CPython has always had a cycle collector that periodically walks objects looking for cycles of objects that reference each other, but have no outside references. (It's a little more complicated, but that's the basic idea.) This is fully explained in the docs for the gc module. The collector can run when you ask it to explicitly, when the freelists are getting low, or when it hasn't run in a long time; this is dynamic and to some extent configurable, so it's hard to give a specific answer to "when".
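A minimal sketch of the cycle collector at work — refcounting alone can never free these two objects, but gc.collect() can:

```python
import gc

class Node(object):
    def __init__(self):
        self.other = None

a, b = Node(), Node()
a.other = b
b.other = a     # reference cycle: each keeps the other's refcount above zero

del a, b        # no outside references remain, yet refcounting alone
                # frees nothing

collected = gc.collect()   # a full collection walks objects and finds the cycle
print('unreachable objects found:', collected)
assert collected >= 2      # at least the two Node instances
```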
lists are dynamic arrays, which means they need a contiguous piece of memory. This means that if I try to append an object into a list, whose underlying-C-data-structure array cannot be extended, the array is copied over onto a different part of memory, where a larger contiguous block is available. So how much space is allocated to this array when I initialize a list?
The code for this is mostly inside listobject.c. It's complicated; there are a bunch of special cases, like the code used by timsort for creating temporary intermediate lists and for non-in-place sorting. But ultimately, some piece of code decides it needs room for N pointers.
It's also not particularly interesting. Most lists are either never expanded, or expanded far beyond the original size, so doing extra allocation at the start wastes memory for static lists and doesn't help much for most growing lists. So, Python plays it conservative. I believe it starts by looking through its internal freelist that's not too much bigger than N pointers (it might also consolidate adjacent freed list storage; I don't know if it does), so it might overallocate a little bit occasionally, but generally it doesn't. The exact code should be in PyList_New.
At any rate, if there's no space in the list allocator's freelist, it drops down to the object allocator, and so on through the levels; it may end up hitting level 0, but usually it doesn't.
how much extra space is allocated to the new array, which now holds the old list and the appended object?
This is handled in list_resize, and this is the interesting part.
The only way to avoid list.append being quadratic is to overallocate geometrically. Overallocating by too small a factor (like 1.2) wastes too much time for the first few expansions; too large a factor (like 1.6) wastes too much space for very large arrays. Python handles this by using a sequence that starts off at 2.0 but quickly converges toward somewhere around 1.25. According to the 3.3 source:
The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
You didn't ask specifically about sorted, but I know that's what prompted you.
Remember that timsort is primarily a merge sort, with an insertion sort for small sublists that aren't already sorted. So, most of its operations involve allocating a new list of about size 2N and freeing two lists of about size N. So, it can be almost as space- and allocation-efficient when copying as it would be in-place. There is up to O(log N) waste, but this usually isn't the factor that makes a copying sort slower.

Unexpectedly high memory usage in Google App Engine

I have a Python GAE app that stores data in each instance, and the memory usage is much higher than I’d expected. As an illustration, consider this test code which I’ve added to my app:
from google.appengine.ext import webapp

bucket = []

class Memory(webapp.RequestHandler):
    def get(self):
        global bucket
        n = int(self.request.get('n'))
        size = 0
        for i in range(n):
            text = '%10d' % i
            bucket.append(text)
            size += len(text)
        self.response.out.write('Total number of characters = %d' % size)
A call to this handler with a value for query variable n will cause the instance to add n strings to its list, each 10 characters long.
If I call this with n=1 (to get everything loaded) and then check the instance memory usage on the production server, I see a figure of 29.4MB. If I then call it with n=100000 and check again, memory usage has jumped to 38.9MB. That is, my memory footprint has increased by 9.5MB to store only one million characters, nearly ten times what I’d expect. I believe that characters consume only one byte each, but even if that’s wrong there’s still a long way to go. Overhead of the list structure surely can’t explain it. I tried adding an explicit garbage collection call, but the figures didn’t change. What am I missing, and is there a way to reduce the footprint?
(Incidentally, I tried using a set instead of a list and found that after calling with n=100000 the memory usage increased by 13MB. That suggests that the set overhead for 100000 strings is 3.5MB more than that of lists, which is also much greater than expected.)
I know that I'm really late to the party here, but this isn't surprising at all...
Consider a string of length 1:
s = '1'
That's pretty small, right? Maybe somewhere on the order of 1 byte? Nope.
>>> import sys
>>> sys.getsizeof('1')
38
So there are approximately 37 bytes of overhead associated with each string that you create (all of those string methods need to be stored somewhere).
Additionally, it's usually most efficient for your CPU to store items based on "word size" rather than byte size (on lots of systems, a "word" is 4 bytes). I don't know for certain, but I wouldn't be surprised if Python's memory allocator plays tricks there too to keep it running fairly quickly.
Also, don't forget that lists are represented as over-allocated arrays (to prevent huge performance problems each time you .append). It is possible that, when you make a list of 100k elements, python actually allocates pointers for 110k or more.
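That over-allocation is easy to observe (a CPython-specific sketch): compare a list allocated at its final size with one grown by repeated .append, which keeps spare capacity around:

```python
import sys

n = 100000
exact = [None] * n       # allocated at exactly n slots in CPython
grown = []
for _ in range(n):
    grown.append(None)   # each resize leaves spare slots for future appends

print(sys.getsizeof(exact), sys.getsizeof(grown))
assert sys.getsizeof(grown) >= sys.getsizeof(exact)
```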
Finally, regarding set -- that's probably fairly easily explained by the fact that sets are even more over-allocated than lists (they need to avoid all those hash collisions, after all). They end up having large jumps in memory usage as the set size grows, in order to have enough free slots in the array to avoid hash collisions:
>>> sys.getsizeof(set([1]))
232
>>> sys.getsizeof(set([1, 2]))
232
>>> sys.getsizeof(set([1, 2, 3]))
232
>>> sys.getsizeof(set([1, 2, 3, 4]))
232
>>> sys.getsizeof(set([1, 2, 3, 4, 5]))
232
>>> sys.getsizeof(set([1, 2, 3, 4, 5, 6])) # resize!
744
The overhead of the list structure doesn't explain what you're seeing directly, but memory fragmentation does. And strings have a non-zero overhead in terms of underlying memory, so counting string lengths is going to undercount significantly.
I'm not an expert, but this is an interesting question. It seems like it's more of a python memory management issue than a GAE issue. Have you tried running it locally and comparing the memory usage on your local dev_appserver vs deployed on GAE? That should indicate whether it's the GAE platform, or just python.
Secondly, the Python code you used is simple but not very efficient; a list comprehension should be more efficient than the for loop. This should reduce the memory usage a bit:
''.join(['%10d' % i for i in range(n)])
Under the covers your growing string must be constantly reallocated. Every time through the for loop, there's a discarded string left lying around. I would have expected that triggering the garbage collector after your for loop should have cleaned up the extra strings though.
Try triggering the garbage collector before you check the memory usage.
import gc
gc.collect()
return len(gc.get_objects())
That should give you an idea if the garbage collector hasn't cleaned out some of the extra strings.
This is largely a response to dragonx.
The sample code exists only to illustrate the problem, so I wasn't concerned with small efficiencies. I am instead concerned about why the application consumes around ten times as much memory as there is actual data. I can understand there being some memory overhead, but this much?
Nonetheless, I tried using a list comprehension (without the join, to match my original) and the memory usage increases slightly, from 9.5MB to 9.6MB. Perhaps that's within the margin of error. Or perhaps the large range() expression sucks it up; it's released, no doubt, but better to use xrange(), I think. With the join the instance variable is set to one very long string, and the memory footprint unsurprisingly drops to a sensible 1.1MB, but this isn't the same case at all. You get the same 1.1MB just setting the instance variable to one million characters without using a list comprehension.
I'm not sure I agree that with my loop "there's a discarded string left lying around." I believe that the string is added to the list (by reference, if that's proper to say) and that no strings are discarded.
I had already tried explicit garbage collection, as my original question states. No help there.
Here's a telling result. Changing the length of the strings from 10 to some other number causes a proportional change in memory usage, but there's a constant in there as well. My experiments show that for every string added to the list there's an 85 byte overhead, no matter what the string length. Is this the cost for strings or for putting the strings into a list? I lean toward the latter. Creating a list of 100,000 None’s consumes 4.5MB, or around 45 bytes per None. This isn't as bad as for strings, but it's still pretty bad. And as I mentioned before, it's worse for sets than it is for lists.
I wish I understood why the overhead (or fragmentation) was this bad, but the inescapable conclusion seems to be that large collections of small objects are extremely expensive. You're probably right that this is more of a Python issue than a GAE issue.
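The ~85-bytes-per-entry figure can be roughly reconstructed with sys.getsizeof (a sketch; exact numbers depend on the interpreter version and pointer width): each list slot costs one pointer, and each 10-character string is a complete object with its own header.

```python
import sys

n = 100000
bucket = ['%10d' % i for i in range(n)]

list_bytes = sys.getsizeof(bucket)                    # header + pointer array
string_bytes = sum(sys.getsizeof(s) for s in bucket)  # a full object per string
per_entry = (list_bytes + string_bytes) / float(n)

print('approx bytes per entry:', per_entry)
assert per_entry > 20   # far more than the 10 payload characters
```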
