I have a Python GAE app that stores data in each instance, and the memory usage is much higher than I’d expected. As an illustration, consider this test code which I’ve added to my app:
from google.appengine.ext import webapp
bucket = []
class Memory(webapp.RequestHandler):
    def get(self):
        global bucket
        n = int(self.request.get('n'))
        size = 0
        for i in range(n):
            text = '%10d' % i
            bucket.append(text)
            size += len(text)
        self.response.out.write('Total number of characters = %d' % size)
A call to this handler with a value for query variable n will cause the instance to add n strings to its list, each 10 characters long.
If I call this with n=1 (to get everything loaded) and then check the instance memory usage on the production server, I see a figure of 29.4MB. If I then call it with n=100000 and check again, memory usage has jumped to 38.9MB. That is, my memory footprint has increased by 9.5MB to store only one million characters, nearly ten times what I’d expect. I believe that characters consume only one byte each, but even if that’s wrong there’s still a long way to go. Overhead of the list structure surely can’t explain it. I tried adding an explicit garbage collection call, but the figures didn’t change. What am I missing, and is there a way to reduce the footprint?
(Incidentally, I tried using a set instead of a list and found that after calling with n=100000 the memory usage increased by 13MB. That suggests that the set overhead for 100000 strings is 3.5MB more than that of lists, which is also much greater than expected.)
I know that I'm really late to the party here, but this isn't surprising at all...
Consider a string of length 1:
s = '1'
That's pretty small, right? Maybe somewhere on the order of 1 byte? Nope.
>>> import sys
>>> sys.getsizeof('1')
38
So there are approximately 37 bytes of overhead associated with each string that you create (all of those string methods need to be stored somewhere).
Additionally, it's usually most efficient for your CPU to store items based on "word size" rather than byte size (on lots of systems, a "word" is 4 bytes). I don't know for certain, but I wouldn't be surprised if Python's memory allocator plays tricks there too to keep it running fairly quickly.
Also, don't forget that lists are represented as over-allocated arrays (to prevent huge performance problems each time you .append). It is possible that, when you make a list of 100k elements, python actually allocates pointers for 110k or more.
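The over-allocation is easy to observe with sys.getsizeof (a rough sketch; the exact step sizes vary between Python versions and builds):

```python
import sys

# Append one item at a time and record the list's reported size.  The
# size stays flat while appends fit in the over-allocated array, then
# jumps when CPython resizes it.
lst = []
sizes = []
for _ in range(20):
    lst.append(None)
    sizes.append(sys.getsizeof(lst))

# Far fewer distinct sizes than appends: most appends cost nothing.
print(sorted(set(sizes)))
```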
Finally, regarding set -- that's probably fairly easily explained by the fact that sets are even more over-allocated than lists (they need to avoid all those hash collisions, after all). They end up having large jumps in memory usage as the set size grows, in order to have enough free slots in the array to avoid hash collisions:
>>> sys.getsizeof(set([1]))
232
>>> sys.getsizeof(set([1, 2]))
232
>>> sys.getsizeof(set([1, 2, 3]))
232
>>> sys.getsizeof(set([1, 2, 3, 4]))
232
>>> sys.getsizeof(set([1, 2, 3, 4, 5]))
232
>>> sys.getsizeof(set([1, 2, 3, 4, 5, 6])) # resize!
744
The overhead of the list structure doesn't explain what you're seeing directly, but memory fragmentation does. And strings have a non-zero overhead in terms of underlying memory, so counting string lengths is going to undercount significantly.
I'm not an expert, but this is an interesting question. It seems like it's more of a python memory management issue than a GAE issue. Have you tried running it locally and comparing the memory usage on your local dev_appserver vs deployed on GAE? That should indicate whether it's the GAE platform, or just python.
Secondly, the Python code you used is simple but not very efficient; a list comprehension instead of the for loop should be more efficient. This should reduce the memory usage a bit:
''.join(['%10d' % i for i in range(n)])
Under the covers your growing string must be constantly reallocated. Every time through the for loop, there's a discarded string left lying around. I would have expected that triggering the garbage collector after your for loop should have cleaned up the extra strings though.
Try triggering the garbage collector before you check the memory usage.
import gc
gc.collect()
print(len(gc.get_objects()))
That should give you an idea if the garbage collector hasn't cleaned out some of the extra strings.
This is largely a response to dragonx.
The sample code exists only to illustrate the problem, so I wasn't concerned with small efficiencies. I am instead concerned about why the application consumes around ten times as much memory as there is actual data. I can understand there being some memory overhead, but this much?
Nonetheless, I tried using a list comprehension (without the join, to match my original) and the memory usage increases slightly, from 9.5MB to 9.6MB. Perhaps that's within the margin of error. Or perhaps the large range() expression sucks it up; it's released, no doubt, but better to use xrange(), I think. With the join the instance variable is set to one very long string, and the memory footprint unsurprisingly drops to a sensible 1.1MB, but this isn't the same case at all. You get the same 1.1MB just setting the instance variable to one million characters without using a list comprehension.
I'm not sure I agree that with my loop "there's a discarded string left lying around." I believe that the string is added to the list (by reference, if that's proper to say) and that no strings are discarded.
I had already tried explicit garbage collection, as my original question states. No help there.
Here's a telling result. Changing the length of the strings from 10 to some other number causes a proportional change in memory usage, but there's a constant in there as well. My experiments show that for every string added to the list there's an 85 byte overhead, no matter what the string length. Is this the cost for strings or for putting the strings into a list? I lean toward the latter. Creating a list of 100,000 None’s consumes 4.5MB, or around 45 bytes per None. This isn't as bad as for strings, but it's still pretty bad. And as I mentioned before, it's worse for sets than it is for lists.
I wish I understood why the overhead (or fragmentation) was this bad, but the inescapable conclusion seems to be that large collections of small objects are extremely expensive. You're probably right that this is more of a Python issue than a GAE issue.
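For what it's worth, the per-string figure is easy to reproduce locally with sys.getsizeof (a sketch; the exact numbers depend on Python version and pointer width, and GAE's Python 2 runtime won't match a local interpreter exactly):

```python
import sys

n = 100000
bucket = ['%10d' % i for i in range(n)]

payload = sum(len(s) for s in bucket)                 # the "real" data
list_bytes = sys.getsizeof(bucket)                    # header + pointer array
string_bytes = sum(sys.getsizeof(s) for s in bucket)  # per-object headers

overhead = (list_bytes + string_bytes - payload) // n
print('payload=%d bytes, per-item overhead=~%d bytes' % (payload, overhead))
```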
Related
I've tried really hard to figure out why my Python program is using 8 gigs of memory. I've even used gc.get_objects() and measured the size of each object, and only one of them was larger than 10 megs. Still, all of the objects, and there were about 100,000 of them, added up to 5.5 gigs. On the other hand, my computer is working fine, and the program is running at a reasonable speed. So is the fact that I'm using so much memory cause for concern?
As @bnaecker said, this doesn't have a simple (i.e., yes/no) answer. It's only a problem if the combined RSS (resident set size) of all running processes exceeds the available memory, thus causing excessive demand paging.
You didn't say how you calculated the size of each object. Hopefully it was by using sys.getsizeof() which should accurately include the overhead associated with each object. If you used some other method (such as calling the __sizeof__() method directly) then your answer will be far lower than the correct value. However, even sys.getsizeof() won't account for wasted space due to memory alignment. For example, consider this experiment (using python 3.6 on macOS):
In [25]: x='x'*8193
In [26]: sys.getsizeof(x)
Out[26]: 8242
In [28]: 8242/4
Out[28]: 2060.5
Notice that last value. It implies that the object is using 2060 and 1/2 words of memory. Which is wrong since all allocations consume a multiple of a word. In fact, it looks to me like sys.getsizeof() does not correctly account for word alignment and padding of either the underlying object or the data structure that describes the object. Which means the value is smaller than the amount of memory actually used by the object. Multiplied by 100,000 objects that could represent a substantial amount of memory.
Also, many memory allocators will round up large allocations to a page size (typically a multiple of 4 KiB). Which results in "wasted" space that is probably not going to be included in the sys.getsizeof() return value.
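One way to see how much this matters is to compare a getsizeof-based total against the process RSS. The sketch below only totals GC-tracked objects (strings, ints and the like aren't tracked by the cycle collector), so treat it strictly as a lower bound:

```python
import gc
import sys

def tracked_bytes():
    # Sum of getsizeof over GC-tracked objects only.  This misses
    # allocator rounding, arena overhead, and untracked objects, so the
    # real footprint is always larger.
    return sum(sys.getsizeof(o) for o in gc.get_objects())

baseline = tracked_bytes()
blob = [[i] for i in range(10000)]   # 10001 tracked list objects
delta = tracked_bytes() - baseline
print('lower bound on new memory: %d bytes' % delta)
```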
I need to find out an optimal selection of media, based on certain constraints. I am doing it in FOUR nested for loop and since it would take about O(n^4) iterations, it is slow. I had been trying to make it faster but it is still damn slow. My variables can be as high as couple of thousands.
Here is a small example of what I am trying to do:
max_disks = 5
max_ssds = 5
max_tapes = 1
max_BR = 1
allocations = []
for i in range(max_disks):
    for j in range(max_ssds):
        for k in range(max_tapes):
            for l in range(max_BR):
                # This is just for example. In the actual program, I do
                # processing here, like checking bandwidth and cost
                # constraints, and choosing the allocation based on that.
                allocations.append((i, j, k, l))
It wasn't slow for up to hundreds of each media type but would slow down for thousands.
The other way I tried is:
max_disks = 5
max_ssds = 5
max_tapes = 1
max_BR = 1
allocations = [(i,j,k,l) for i in range(max_disks) for j in range(max_ssds) for k in range(max_tapes) for l in range(max_BR)]
This way it is slow even for such small numbers.
Two questions:
Why is the second one slow even for small numbers?
How can I make my program work for big numbers (in thousands)?
Here is the version with itertools.product
max_disks = 500
max_ssds = 100
max_tapes = 100
max_BR = 100
import itertools

# allocations = []
for i, j, k, l in itertools.product(range(max_disks), range(max_ssds),
                                    range(max_tapes), range(max_BR)):
    pass
It takes 19.8 seconds to finish with these numbers.
From the comments, I got that you're working on a problem that can be rewritten as an ILP. You have several constraints, and need to find a (near) optimal solution.
Now, ILPs are quite difficult to solve, and brute-forcing them quickly becomes intractable (as you've already witnessed). This is why there are several really clever algorithms used in the industry that truly work magic.
For Python, there are quite a few interfaces that hook up to modern solvers; for more details, see e.g. this SO post. You could also consider using an optimizer like SciPy optimize, but those generally don't do integer programming.
Doing any operation in Python a trillion times is going to be slow. However, that's not all you're doing. By attempting to store all the trillion items in a single list you are storing lots of data in memory and manipulating it in a way that creates a lot of work for the computer to swap memory in and out once it no longer fits in RAM.
The way that Python lists work is that they allocate some amount of memory to store the items in the list. When you fill up the list and it needs to allocate more, Python will allocate twice as much memory and copy all the old entries into the new storage space. This is fine so long as it fits in memory - even though it has to copy all the contents of the list each time it expands the storage, it has to do so less frequently as it keeps doubling the size. The problem comes when it runs out of memory and has to swap unused memory out to disk. The next time it tries to resize the list, it has to reload from disk all the entries that are now swapped out to disk, then swap them all back out again to get space to write the new entries. So this creates lots of slow disk operations that will get in the way of your task and slow it down even more.
Do you really need to store every item in a list? What are you going to do with them when you're done? You could perhaps write them out to disk as you're going instead of accumulating them in a giant list, though if you have a trillion of them, that's still a very large amount of data! Or perhaps you're filtering most of them out? That will help.
All that said, without seeing the actual program itself, it's hard to know if you have a hope of completing this work by an exhaustive search. Can all the variables be on the thousands scale at once? Do you really need to consider every combination of these variables? When max_disks==2000, do you really need to distinguish the results for i=1731 from i=1732? For example, perhaps you could consider values of i 1,2,3,4,5,10,20,30,40,50,100,200,300,500,1000,2000? Or perhaps there's a mathematical solution instead? Are you just counting items?
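To sketch the coarsening idea (coarse_range and its step schedule are just an illustration, not part of the original program; in practice you'd tune the steps to your cost and bandwidth model):

```python
import itertools

def coarse_range(limit):
    # Yield a thinned-out sample of 0..limit-1: every value up to 10,
    # then strides of 10, 100, ... instead of every single value.
    step, i = 1, 0
    while i < limit:
        yield i
        if i >= 10 * step:
            step *= 10
        i += step

max_disks, max_ssds, max_tapes, max_BR = 2000, 2000, 100, 100
candidates = itertools.product(
    coarse_range(max_disks), coarse_range(max_ssds),
    coarse_range(max_tapes), coarse_range(max_BR))

# Stand-in objective: pick the allocation minimizing total units.  The
# real program would check bandwidth/cost constraints here instead.
best = min(candidates, key=sum)
print(best)
```

With this schedule, coarse_range(2000) yields only a few dozen values instead of 2000, shrinking the product from trillions of tuples to a few hundred thousand.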
I'm able to find a bevy of information online (on Stack Overflow and otherwise) about how it's a very inefficient and bad practice to use + or += for concatenation in Python.
I can't seem to find WHY += is so inefficient. Outside of a mention here that "it's been optimized for 20% improvement in certain cases" (still not clear what those cases are), I can't find any additional information.
What is happening on a more technical level that makes ''.join() superior to other Python concatenation methods?
Let's say you have this code to build up a string from three strings:
x = 'foo'
x += 'bar' # 'foobar'
x += 'baz' # 'foobarbaz'
In this case, Python first needs to allocate and create 'foobar' before it can allocate and create 'foobarbaz'.
So for each += that gets called, the entire contents of the string and whatever is getting added to it need to be copied into an entirely new memory buffer. In other words, if you have N strings to be joined, you need to allocate approximately N temporary strings and the first substring gets copied ~N times. The last substring only gets copied once, but on average, each substring gets copied ~N/2 times.
With .join, Python can play a number of tricks since the intermediate strings do not need to be created. CPython figures out how much memory it needs up front and then allocates a correctly-sized buffer. Finally, it then copies each piece into the new buffer which means that each piece is only copied once.
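A quick, unscientific way to see the difference yourself; note that CPython does have an in-place fast path for building onto a uniquely-referenced string, so relative timings vary by version and platform:

```python
import timeit

pieces = ['%10d' % i for i in range(1000)]

def concat_plus():
    s = ''
    for p in pieces:
        s = s + p           # each step copies the whole accumulated string
    return s

def concat_join():
    return ''.join(pieces)  # sizes everything up front, copies each piece once

assert concat_plus() == concat_join()
t_plus = timeit.timeit(concat_plus, number=100)
t_join = timeit.timeit(concat_join, number=100)
print('+ : %.4fs   join: %.4fs' % (t_plus, t_join))
```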
There are other viable approaches which could lead to better performance for += in some cases. E.g. if the internal string representation is actually a rope or if the runtime is actually smart enough to somehow figure out that the temporary strings are of no use to the program and optimize them away.
However, CPython certainly does not do these optimizations reliably (though it may for a few corner cases) and since it is the most common implementation in use, many best-practices are based on what works well for CPython. Having a standardized set of norms also makes it easier for other implementations to focus their optimization efforts as well.
I think this behaviour is best explained in Lua's string buffer chapter.
To rewrite that explanation in context of Python, let's start with an innocent code snippet (a derivative of the one at Lua's docs):
s = ""
for l in some_list:
    s += l
Assume that each l is 20 bytes and s has already grown to a size of 50 KB. When Python concatenates s + l it creates a new string with 50,020 bytes and copies 50 KB from s into this new string. That is, for each new line, the program moves 50 KB of memory, and growing. After reading 100 new lines (only 2 KB), the snippet has already moved more than 5 MB of memory. To make things worse, after the assignment
s += l
the old string is now garbage. After two loop cycles, there are two old strings making a total of more than 100 KB of garbage. So, the language compiler decides to run its garbage collector and frees those 100 KB. The problem is that this will happen every two cycles and the program will run its garbage collector two thousand times before reading the whole list. Even with all this work, its memory usage will be a large multiple of the list's size.
And, at the end:
This problem is not peculiar to Lua: Other languages with true garbage
collection, and where strings are immutable objects, present a similar
behavior, Java being the most famous example. (Java offers the
structure StringBuffer to ameliorate the problem.)
Python strings are also immutable objects.
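The quadratic copying can be made concrete by simply counting the bytes each strategy moves, using the 20-byte piece and 50 KB figures from the example above (and assuming no in-place optimization kicks in):

```python
piece_len, pieces = 20, 2500   # 2500 * 20 bytes = the 50 KB example

# Naive concatenation copies the whole accumulated string at every step:
# 20 + 40 + 60 + ... bytes in total.
naive_copied = sum(piece_len * k for k in range(1, pieces + 1))

# join sizes the result first, then copies each piece exactly once.
join_copied = piece_len * pieces

print(naive_copied, join_copied)   # quadratic vs. linear
```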
I was testing the speeds of a few different ways to do complex iterations over some of my data, and I found something weird. It seems that having a large list local to some function slows that function down considerably, even if it is not touching that list. For example, creating 2 independent lists via 2 instances of the same generator function is about 2.5x slower for the second list. If the first list is removed prior to creating the second, both loops run at the same speed.
def f():
    l1, l2 = [], []
    for c1, c2 in generatorFxn():
        l1.append((c1, c2))
    # destroying l1 here fixes the problem
    for c3, c4 in generatorFxn():
        l2.append((c3, c4))
The lists end up about 3.1 million items long each, but I saw the same effect with smaller lists too. The first for loop takes about 4.5 seconds to run, the second takes 10.5. If I insert l1 = [] or l1 = len(l1) at the comment position, both for loops take 4.5 seconds.
Why does the speed of local memory allocation in a function have anything to do with the current size of that function's variables?
EDIT:
Disabling the garbage collector fixes everything, so it must be due to the collector running constantly. Case closed!
When you create that many new objects (3 million tuples), the garbage collector gets bogged down. If you turn off garbage collection with gc.disable(), the issue goes away (and the program runs 4x faster to boot).
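A minimal sketch of that experiment (fake_generator is a stand-in for the original generatorFxn): with the collector disabled around the allocation burst, the second build is no slower than the first.

```python
import gc
import time

def fake_generator(n=300000):
    # Stand-in for generatorFxn: yields pairs, forcing tuple allocation.
    for i in range(n):
        yield i, i + 1

def build():
    return [(a, b) for a, b in fake_generator()]

gc.disable()
try:
    start = time.time()
    l1 = build()
    l2 = build()   # the collector never rescans l1's 300k tuples
    elapsed = time.time() - start
finally:
    gc.enable()
print('built %d + %d tuples in %.2fs' % (len(l1), len(l2), elapsed))
```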
It's impossible to say without more detailed instrumentation.
As a very, very preliminary step, check your main memory usage. If your RAM is all filled up and your OS is paging to disk, your performance will be quite dreadful. In such a case, you may be best off taking your intermediate products and putting them somewhere other than in memory. If you only need sequential reads of your data, consider writing to a plain file; if your data follows a strict structure, consider persisting into a relational database.
My guess is that when the first list is made, there is more memory available, meaning less chance that the list needs to be reallocated as it grows.
After you take up a decent chunk of memory with the first list, your second list has a higher
chance of needing to be reallocated as it grows since python lists are dynamically sized.
The memory used by the data local to the function isn't going to be garbage-collected until the function returns. Unless you have a need to do slicing, using lists for large collections of data is not a great idea.
From your example it's not entirely clear what the purpose of creating these lists are. You might want to consider using generators instead of lists, especially if the lists are just going to be iterated. If you need to do slicing on the return data, cast the generators to lists at that time.
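For instance, if the pairs are only ever iterated once, something like this keeps memory flat (generator_fxn here is a stand-in for the original generatorFxn):

```python
def generator_fxn(n=1000):
    # Stand-in for the original generatorFxn.
    for i in range(n):
        yield i, i * 2

def pairs():
    # Lazily re-yields what f() used to store in l1/l2; nothing beyond
    # the current pair is kept in memory.
    for c1, c2 in generator_fxn():
        yield c1, c2

total = sum(c1 + c2 for c1, c2 in pairs())
print(total)
```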
GAE has various limitations, one of which is that the biggest allocatable block of memory amounts to 1 MB (now 10 times more, but that doesn't change the question). The limitation means that one cannot put more than some number of items in a list(), as CPython would try to allocate a contiguous memory block for the element pointers. Having huge list()s can be considered bad programming practice, but even if no huge structure is created in the program itself, CPython maintains some behind the scenes.
It appears that CPython is maintaining a single global list of objects or something. I.e. an application that has many small objects tends to allocate bigger and bigger single blocks of memory.
My first idea was the gc, and disabling it changes the application's behavior a bit, but some structures are still maintained.
The simplest short application that experiences the issue is:
a = b = []
number_of_lists = 8000000
for i in xrange(number_of_lists):
    b.append([])
    b = b[0]
Can anyone enlighten me how to prevent CPython from allocating huge internal structures when having many objects in application?
On a 32-bit system, each of the 8000000 lists you create will allocate 20 bytes for the list object itself, plus 16 bytes for a vector of list elements. So you are trying to allocate at least (20+16) * 8000000 = 288000000 bytes, about 288 MB. And that's in the best case, if the system malloc only allocates exactly as much memory as requested.
I calculated the size of the list object as follows:
2 Pointers in the PyListObject structure itself (see listobject.h)
1 Pointer and one Py_ssize_t for the PyObject_HEAD part of the list object (see object.h)
one Py_ssize_t for the PyObject_VAR_HEAD (also in object.h)
The vector of list elements is slightly overallocated to avoid having to resize it at each append - see list_resize in listobject.c. The sizes are 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ... Thus, your one-element lists will allocate room for 4 elements.
Your data structure is a somewhat pathological example, paying the price of a variable-sized list object without utilizing it - all your lists have only a single element. You could avoid the 12 bytes overallocation by using tuples instead of lists, but to further reduce the memory consumption, you will have to use a different data structure that uses fewer objects. It's hard to be more specific, as I don't know what you are trying to accomplish.
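The list-vs-tuple saving is visible directly with sys.getsizeof (exact numbers depend on the build and pointer width, but the tuple is consistently smaller):

```python
import sys

# A one-element list pays for a separately allocated, over-allocated
# pointer vector; a one-element tuple stores its single pointer inline
# in the object itself.
lst = [None]
tpl = (None,)
print(sys.getsizeof(lst), sys.getsizeof(tpl))
```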
I'm a bit confused as to what you're asking. In that code example, nothing should be garbage collected, as you're never actually killing off any references. You're holding a reference to the top level list in a and you're adding nested lists (held in b at each iteration) inside of that. If you remove the 'a =', then you've got unreferenced objects.
Edit: In response to the first part, yes, Python holds a list of objects so it can know what to cull. Is that the whole question? If not, comment/edit your question and I'll do my best to help fill in the gaps.
What are you trying to accomplish with the
a = b = []
and
b = b[0]
statements? It's certainly odd to see statements like that in Python, because they don't do what you might naively expect: in that example, a and b are two names for the same list (think pointers in C). If you're doing a lot of manipulation like that, it's easy to confuse the garbage collector (and yourself!) because you've got a lot of strange references floating around that haven't been properly cleared.
It's hard to diagnose what's wrong with that code without knowing why you want to do what it appears to be doing. Sure, it exposes a bit of interpreter weirdness... but I'm guessing you're approaching your problem in an odd way, and a more Pythonic approach might yield better results.
So that you're aware of it, Python has its own allocator. You can disable it using --without-pymalloc during the configure step.
However, the largest arena is 256KB so that shouldn't be the problem. You can also compile Python with debugging enabled, using --with-pydebug. This would give you more information about memory use.
I suspect your hunch is right, and I am sure that oefe's diagnosis is correct. A list uses contiguous memory, so if your list gets too large for a system arena then you're out of luck. If you're really adventurous you can reimplement PyList to use multiple blocks, but that's going to be a lot of work since various bits of Python expect contiguous data.