Python list memory reallocation issue - python

If I am using CPython or Jython (Python 2.7), will the list ([]) data structure run into memory reallocation issues as I keep adding new elements, the way Java's ArrayList does? (Since ArrayList requires a contiguous block of memory, when its pre-allocated space fills up it has to allocate a new, larger contiguous block and move the existing elements over.)
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/ArrayList.java#ArrayList.ensureCapacity%28int%29
regards,
Lin

The basic story, at least for the main implementation (CPython), is that a list contains pointers to objects stored elsewhere in memory. The list is created with a certain amount of free space (e.g. room for 8 pointers). When that fills up, it allocates more memory, and so on. Whether it moves the pointers from one block of memory to another is a detail that most users ignore. In practice we just append/extend a list as needed and don't worry about memory use.
Why does creating a list from a list make it larger?
I assume Jython uses the same approach, but you'd have to dig into its code to see how that translates to Java.
I mostly answer numpy questions. This is a numerical package that creates fixed sized multidimensional arrays. If a user needs to build such an array incrementally, we often recommend that they start with a list and append values. At the end they create the array. Appending to a list is much cheaper than rebuilding an array multiple times.
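For instance, something along these lines (a sketch with made-up sizes; the point is the pattern, not the numbers):
import numpy as np

# cheap: append to a list, convert to an array once at the end
values = []
for i in range(10000):
    values.append(i * 0.5)
arr = np.array(values)

# expensive: "growing" the array itself -- np.append builds a brand new
# array and copies all existing data on every call
arr2 = np.array([])
for i in range(10000):
    arr2 = np.append(arr2, i * 0.5)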

Internally, Python lists are arrays of pointers, as hpaulj mentioned.
The next question, then, is how an array can be extended in C. As explained in the answer, this can be done using the realloc function.
That led me to look into the behavior of realloc, whose documentation mentions:
The function may move the memory block to a new location (whose address is returned by the function).
From this, my understanding is that the array is extended in place if contiguous memory is available; otherwise the memory block (the one holding the pointer array, not the list object itself) is copied to a newly allocated, larger block.
This is my understanding, corrections are welcome if I am wrong.

Related

Different lists in Python in size and content still share the id, does memory matter?

I have read the answer to this question as well as the related questions about different objects sharing the same id (which can be explained by the Python docs about id). However, in those questions the contents of the objects are the same (and thus the memory sizes are the same, too). I experimented with lists of different sizes and contents, in both the IPython shell and a .py file with CPython, and got the "same id" result, too:
print(id([1]), id([1,2,3]), id([1,2,3,4,1,1,1,1,1,1,1,1,1,1,1,1]))
# Result: 2067494928320 2067494928320 2067494928320
The result doesn't change no matter how many elements I add to the list or how big the numbers are.
So I have a question here: when an id is given, does the list size have any effect on whether the id can be reused or not? I thought that it could because according to the docs above,
CPython implementation detail: This is the address of the object in memory.
and if the address does not have enough space for the list, then a new id should be given. But I'm quite surprised about the result above.
Make a list, and add some items to it. The id remains the same:
In [21]: alist = []
In [22]: id(alist)
Out[22]: 139981602327808
In [23]: for i in range(29): alist.append(i)
In [24]: id(alist)
Out[24]: 139981602327808
But the memory use for this list comes in several parts. There's some sort of storage for the list instance itself (that's what the id references). Python is written in C, but all items are objects (as in C++).
The list also has a data buffer; think of it as a C array with a fixed size. It holds pointers to objects elsewhere in memory. That buffer has space for the current references plus some growth space. As you add items to the list, their references are inserted in the growth space. When that fills up, the list gets a new buffer, with more growth space. List append is relatively fast, with periodic slowdowns as it copies references to the new buffer. All that occurs under the covers so that the Python programmer doesn't notice.
I suspect that in my example alist got a new buffer several times, but I don't think there's any direct way to track whether (or where) the buffer moved.
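sys.getsizeof gives at least an indirect hint, though: the size it reports for the list jumps each time CPython overallocates a new, larger pointer buffer. A rough sketch (the exact lengths and byte counts vary by Python version and platform):
import sys

alist = []
last = sys.getsizeof(alist)
for i in range(29):
    alist.append(i)
    size = sys.getsizeof(alist)
    if size != last:
        # each jump corresponds to the pointer buffer being reallocated
        print(len(alist), last, '->', size)
        last = size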
Storage for the objects referenced by the list is another matter. CPython creates small integer objects (from -5 up to 256) at startup, so my list (and yours) will hold references to those unique integer objects. It also maintains a cache of some strings. But other things, such as larger numbers, other lists, dicts, and custom class objects, are created as needed. Identically valued objects may well have different ids.
So while the data buffer of the list is contiguous, the items referenced in the buffer are not.
By and large, that memory management is unimportant to programmers. It's only when data structures get very large that we need to worry. And that seems to occur more with object classes like numpy arrays, which have a very different memory use.
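As for the original observation that id([1]), id([1,2,3]) and the long list all printed the same number: the id() docs the question links to also note that two objects with non-overlapping lifetimes may end up with the same id. Each temporary list in that print call is freed before the next one is created, so CPython can hand the next list the block it just released. A quick, CPython-specific check (results may vary from run to run):
print(id([1]) == id([1, 2, 3]))   # often True: the first list is gone before the second exists

a = [1]
b = [1, 2, 3]
print(id(a) == id(b))             # False: both lists are alive at the same time
So the deciding factor is whether the two objects are alive at the same time, not how big they are (though the allocator is of course free to hand out a different address).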

Why is filling an array from the front slow?

In the chapter on Arrays in the book Elements of Programming Interviews in Python, it is mentioned that "Filling an array from the front is slow, so see if it's possible to write values from the back."
What could be the possible reason for that?
Python lists, at least in CPython (the standard Python implementation), are actually implemented, from a data structure perspective, as arrays rather than linked lists.
However, these arrays are dynamically allocated and resized, so appending to the end of a Python list is possible. It takes a somewhat variable amount of time: CPython allocates more space than is immediately necessary when items are appended, so that it doesn't need to reallocate on every single append. When spare space is already available, an append is O(1), and since it is an array, indexing is also O(1).
What does take a long time, however, is adding something to the beginning of a list, as this requires shifting all the array values over, and is O(n), just as popping the first element is.
Python's language designers decided to call these arrays "lists" rather than "arrays", contradicting standard terminology, in part, I assume, because the dynamic resizing makes them different from standard, fixed-size arrays.
Unless I'm mistaken, collections.deque is implemented as a doubly linked list (of fixed-size blocks, in CPython), with the corresponding O(1) appends/pops on either side, and so on.
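For a rough feel of the difference when filling from the front (timings are machine-dependent and the size here is arbitrary):
from collections import deque
from timeit import timeit

N = 50000

def fill_list_front():
    lst = []
    for i in range(N):
        lst.insert(0, i)      # shifts every existing pointer: O(n) per insert

def fill_deque_front():
    dq = deque()
    for i in range(N):
        dq.appendleft(i)      # O(1) at either end

print('list.insert(0, ...):', timeit(fill_list_front, number=1))
print('deque.appendleft:   ', timeit(fill_deque_front, number=1))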

Are numpy arrays and Python lists optimized to grow dynamically?

Over time I have done many things that require using the list .append() method, as well as numpy.append() for numpy arrays. I noticed that both grow really slowly when the arrays get big.
I need an array that grows dynamically to about 1 million elements. I could implement this myself, the way std::vector is implemented in C++, by keeping extra buffer capacity (a reserve) that is not accessible from the outside. But do I have to reinvent the wheel? I imagine it is implemented somewhere already. So my question is: does such a thing already exist in Python?
What I mean: is there an array type in Python that can grow dynamically with (amortized) O(1) appends most of the time?
The memory layout of numpy arrays is well described in its docs, and has been discussed here a lot. List memory layout has also been discussed, though usually just in contrast to numpy.
A numpy array has a fixed size data buffer. 'growing' it requires creating a new array, and copying data to it. np.concatenate does that in compiled code. np.append as well as all the stack functions use concatenate.
A list has, as I understand it, a contiguous data buffer that contains pointers to objects elsewhere in memory. Python maintains some free space in that buffer, so additions with list.append are relatively fast and easy. But when the free space fills up, it has to create a new buffer and copy the pointers. I can see how that could get expensive with large lists.
So a list stores a pointer for each element, plus the element itself (e.g. a float) somewhere else in memory. In contrast, an array of floats stores the floats themselves as contiguous bytes in its data buffer. (Object-dtype arrays are more like lists.)
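A small illustration of that contiguous buffer (assuming the default 8-byte float64):
import numpy as np

a = np.arange(5, dtype=np.float64)
print(a.itemsize)                 # 8 bytes per float
print(a.nbytes)                   # 40: all five floats sit in one buffer
print(a.flags['C_CONTIGUOUS'])    # True: that buffer is contiguous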
The recommended way to create an array iteratively is to build the list with append, and create the array once at the end. Repeated np.append or np.concatenate is relatively expensive.
deque was mentioned. I don't know much about how it stores its data. The docs say it can add elements at the start just as easily as at the end, but random access is slower than for a list. That implies it stores data in some sort of linked structure (in CPython, a doubly linked list of fixed-size blocks), so finding the nth element requires walking through the blocks before it. So there's a trade-off between growth ease and access speed.
Adding an element to the start of a regular list requires shifting every existing pointer in the buffer up by one slot (and occasionally reallocating the buffer) so the new one fits at the front. So adding and removing elements at the start of a list is much more expensive than doing it at the end.
Recommending software is outside of the core SO purpose. Others may make suggestions, but don't be surprised if this gets closed.
There are file formats like HDF5 that are designed for large data sets. They accommodate growth with features like 'chunking'. And there are all kinds of database packages.
Both use an underlying array. Instead, you can use collections.deque, which is made specifically for adding and removing elements at both ends with O(1) complexity.

Static Arrays in Python

I am learning how to program in Python and am also learning theory as part of a computer science course. In programming I know that I can add additional elements to an array just by using the .append function; however, in my theory classes we are told that arrays can neither be increased nor decreased in size.
How does this work in python?
Python uses resizable vectors under the hood. They keep track of how many elements are in the list as well as the current total capacity. When you try to add another element beyond that capacity, it allocates a new array with more capacity and populates it with the pointers to the items from the original backing array. This is similar to Java's ArrayList type, except that there's no way to specify the capacity in Python.
A detailed post on the implementation is here: http://www.laurentluce.com/posts/python-list-implementation/
They are not linked lists; there's no linked list type built into python, and the performance patterns are different.
It's not Python-specific, but at some point you will see this in other languages as well. Another common way this is solved, without using a built-in vector type or a linked list, is with dynamic arrays.
Essentially you create an array with a finite size. If the user calls append and you have no more room in your array, you create a new array that is 2x larger than the old one, copy all the elements over, and then append the new element.
The 2x factor is actually important because it keeps the insert time amortized constant. (That's more advanced algorithm analysis, though.)
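A toy sketch of that doubling strategy (purely illustrative; the class name and fields are made up, and a Python list is used only as a stand-in for a fixed-size backing array):
class DynamicArray:
    """Toy dynamic array with 2x growth, for illustration only."""

    def __init__(self):
        self._capacity = 1
        self._size = 0
        self._items = [None] * self._capacity   # pretend this is a fixed-size array

    def append(self, value):
        if self._size == self._capacity:
            self._resize(2 * self._capacity)     # double when full
        self._items[self._size] = value
        self._size += 1

    def _resize(self, new_capacity):
        new_items = [None] * new_capacity        # allocate a bigger backing array
        for i in range(self._size):              # copy the existing elements over
            new_items[i] = self._items[i]
        self._items = new_items
        self._capacity = new_capacity

    def __len__(self):
        return self._size

    def __getitem__(self, index):
        if not 0 <= index < self._size:
            raise IndexError(index)
        return self._items[index]

arr = DynamicArray()
for i in range(10):
    arr.append(i)     # triggers a handful of doublings along the way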
A list in Python can grow dynamically and each element can point to anything, but it is not a linked list: under the hood it's an array of pointers.
If you're curious about what is dynamic and what isn't in Python, then you should read about mutability vs immutability:
https://codehabitude.com/2013/12/24/python-objects-mutable-vs-immutable/
In the theory class, you learned about static arrays. We usually see these types of arrays in C. But in Python we have dynamic arrays, which are extensible. Search for "dynamic array" and you will find more detail on how they grow.

CPython internal structures

GAE has various limitations, one of which is that the biggest allocatable block of memory is 1 MB (now 10 times more, but that doesn't change the question). The limitation means that one cannot put more than some number of items in a list(), as CPython would try to allocate a contiguous memory block for the element pointers. Having huge list()s can be considered bad programming practice, but even if no huge structure is created in the program itself, CPython maintains some behind the scenes.
It appears that CPython maintains a single global list of objects or something like it; i.e. an application that has many small objects tends to allocate bigger and bigger single blocks of memory.
My first idea was the gc, and disabling it changes the application's behavior a bit, but some structures are still maintained.
The simplest short application that experiences the issue is:
a = b = []
number_of_lists = 8000000
for i in xrange(number_of_lists):
    b.append([])
    b = b[0]
Can anyone enlighten me as to how to prevent CPython from allocating huge internal structures when the application has many objects?
On a 32-bit system, each of the 8000000 lists you create will allocate 20 bytes for the list object itself, plus 16 bytes for a vector of list elements. So you are trying to allocate at least (20+16) * 8000000 = 288000000 bytes, about 288 MB. And that's the best case, if the system malloc allocates exactly as much memory as requested.
I calculated the size of the list object as follows:
2 Pointers in the PyListObject structure itself (see listobject.h)
1 Pointer and one Py_ssize_t for the PyObject_HEAD part of the list object (see object.h)
one Py_ssize_t for the PyObject_VAR_HEAD (also in object.h)
The vector of list elements is slightly overallocated to avoid having to resize it at each append - see list_resize in listobject.c. The sizes are 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ... Thus, your one-element lists will allocate room for 4 elements.
Your data structure is a somewhat pathological example, paying the price of a variable-sized list object without utilizing it: all your lists have only a single element. You could avoid the 12 bytes of overallocation by using tuples instead of lists, but to reduce memory consumption further you will have to use a different data structure with fewer objects. It's hard to be more specific, as I don't know what you are trying to accomplish.
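To see the effect (exact byte counts depend on the Python version and pointer size, so treat these as illustrative):
import sys

lst = []
lst.append(None)             # grown via append, so the buffer is overallocated
tup = (None,)                # a tuple allocates exactly one slot

print(sys.getsizeof(lst))    # larger: list header plus room for several pointers
print(sys.getsizeof(tup))    # smaller: tuple header plus one pointer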
I'm a bit confused as to what you're asking. In that code example, nothing should be garbage collected, as you're never actually killing off any references. You're holding a reference to the top level list in a and you're adding nested lists (held in b at each iteration) inside of that. If you remove the 'a =', then you've got unreferenced objects.
Edit: In response to the first part, yes, Python holds a list of objects so it can know what to cull. Is that the whole question? If not, comment/edit your question and I'll do my best to help fill in the gaps.
What are you trying to accomplish with the
a = b = []
and
b = b[0]
statements? It's certainly odd to see statements like that in Python, because they don't do what you might naively expect: in that example, a and b are two names for the same list (think pointers in C). If you're doing a lot of manipulation like that, it's easy to confuse the garbage collector (and yourself!) because you've got a lot of strange references floating around that haven't been properly cleared.
It's hard to diagnose what's wrong with that code without knowing why you want to do what it appears to be doing. Sure, it exposes a bit of interpreter weirdness... but I'm guessing you're approaching your problem in an odd way, and a more Pythonic approach might yield better results.
So that you're aware of it, Python has its own small-object allocator (pymalloc). You can disable it by configuring with --without-pymalloc.
However, the largest arena is 256KB so that shouldn't be the problem. You can also compile Python with debugging enabled, using --with-pydebug. This would give you more information about memory use.
I suspect your hunch is right, and I'm sure oefe's diagnosis is correct. A list uses contiguous memory, so if your list gets too large for a system arena, you're out of luck. If you're really adventurous you could reimplement PyList to use multiple blocks, but that would be a lot of work, since various bits of Python expect contiguous data.
