How are lists implemented to be heterogeneous? I know that a list is a dynamic array of pointers that point to the memory location of the required element but how does this work when the elements the indexes point to are of different sizes.
Does the location the pointer points to contain information about what type this location holds and if so, how is that kind of information formatted, used, returned?
I understand how [1,2,3] should look in memory but not how [1,2,"abcdefg"] would look in memory.
Everything is an object in python: As you mention, lists are collections of 'pointers' to the memory location on the objects contained in the list.
As objects, they each know their type, attributes, properties, etc.
They therefore cannot exceed their own memory boundary, and a tally/accounting of the number of pointers pointed at them is invisibly kept by the python interpreter.
From a technical perspective (i.e. how to implement something like this), look up the Abstract Datatype (ADT) concept in C, e.g. here.
Related
I have read the answer from this question as well as the related questions about the issue of having different objects sharing the same id (which can be answered by this Python docs about id). However, in these questions, I notice that the contents of the objects are the same (thus the memory sizes are the same, too). I experiment with the list of different sizes and contents on both the IPython shell and .py file with CPython, and get the "same id" result, too:
print(id([1]), id([1,2,3]), id([1,2,3,4,1,1,1,1,1,1,1,1,1,1,1,1]))
# Result: 2067494928320 2067494928320 2067494928320
The result doesn't change despite how many elements or the size of the number (big or small) I add to the list
So I have a question here: when an id is given, does the list size have any effect on whether the id can be reused or not? I thought that it could because according to the docs above,
CPython implementation detail: This is the address of the object in memory.
and if the address does not have enough space for the list, then a new id should be given. But I'm quite surprised about the result above.
Make a list, and some items to it. the id remains the same:
In [21]: alist = []
In [22]: id(alist)
Out[22]: 139981602327808
In [23]: for i in range(29): alist.append(i)
In [24]: id(alist)
Out[24]: 139981602327808
But the memory use for this list occurs in several parts. There's some sort storage for the list instance itself (that's that the id references). Python is written in C, but all items are objects (as in C++).
The list also has a data buffer, think of it as a C array with fix size. It holds pointers to objects elsewhere in memory. That buffer has space for the current references plus some sort of growth space. As you add items to list, their references are inserted in the growth space. When that fills up, the list gets a new buffer, with more growth space. List append is relatively fast, with periodic slow downs as it copies references to the new buffer. All that occurs under the covers so that the Python programmer doesn't notice.
I suspect that in my example alist got a new buffer several times, but I don't there's any way to track or measure that.
Storage for the objects referenced by the list is another matter. cython creates small integer objects (up to 256) at the start, so my list (and yours) will have references to those unique integer objects. It also maintains some sort of cache of strings. But other things, such as larger numbers, other lists, dicts, custom class objects, are created as needed. Identically valued objects might well have different id.
So while the data buffer of the list is contiguous, the items referenced in the buffer are not.
By and large, that memory management is unimportant to programmers. It's only when data structures get very large that we need to worry. And that seems to occur more with object classes like numpy arrays, which have a very different memory use.
I am no expert in how Python lists are implemented but from what I understand, they are implemented as dynamic arrays rather than linked lists. My question is therefore, if python lists are implemented as arrays, why are they called 'lists' and not 'arrays'.
Is this just a semantic issue or is there some deeper technical reason behind this. Is the dynamic array implementation in Python close to a list implementation? Or is it because the dynamic array implementation makes its behaviour closer to a list's behaviour than an array? Or some other reason I do not understand?
To be clear, I am not asking specifically how or why Python lists are implemented as dynamic arrays, although that might be relevant to the answer.
They're named after the list abstract data type, not linked lists. This is similar to the naming of Java's List interface and C#'s List<T>.
To further elaborate on user2357112's answer, as pointed out in the wikipedia article:
In computer science, a list or sequence is an abstract data type that
represents a countable number of ordered values, where the same value
may occur more than once.
Further,
List data types are often implemented using array data structures or linked lists of some sort, but other data structures may be more appropriate for some applications.
In CPython, lists are implemented as dynamic arrays of pointers, and their behaviour is much closer to the List abstract data type than the Array abstract data type. From this perspective, the naming of 'List' is accurate.
At the end of the day when implementing a list what you want is constant(O(1)) access(a[i]), insert(a.append(i)) and delete(a.remove(i)) times. With a linked list some of this operations could be as slow as O(n), i.e. deleting the last element of linked lists if you don't have a pointer to the tail.
With dynamic arrays you get constant delete and access times but what about deleting? Here we get amortized constant time. What is that? If the array is full of N elements, the insert will take O(N) and you'll end up with an array of size 2N. This is a rare event, thus we say we have amortized O(1).
Hope it helps.
Sources:
https://docs.python.org/2/faq/design.html
I am learning how to program in python and am also learning theory as part of a computer science course. In programming i know that i can add additional variables to an array just by using the .append function, however in my theory classes we are told that arrays can neither be increase nor decreased in size.
How does this work in python?
Python uses resizable vectors under the hood. They maintain knowledge of how many elements are in the list as well as what the current total capacity is. When you try to add another element beyond the size of the collection, it allocates a new array with more capacity and populates it with the pointers to items from the original backing array. This is similar to java's ArrayList type, except that there's no way to specify the capacity in python
A detailed post on the implementation is here: http://www.laurentluce.com/posts/python-list-implementation/
They are not linked lists; there's no linked list type built into python, and the performance patterns are different.
Its not python but at one point in your future you will see this in other languages as well. Another common way this is solved that doesn't involve using a vector or a linked list is with dynamic arrays.
Essentially you create an array with a finite size. If the user calls append and you have no more room in your array. You create a new array that is 2x larger than the old array. Then copy all the elements over and append the new element.
The 2x is actually important because it keeps the insert time amortized constant. (That is more advanced algorithms though)
A list in python is akin to a linked list. They can grow dynamically and each element can point to anything.
If you're curious about what id dynamic and what isn't in Python then you should read about mutability vs immutability:
https://codehabitude.com/2013/12/24/python-objects-mutable-vs-immutable/
In the theory class, you learned about static arrays. We see these types of arrays in C usually. But in python, we have dynamic arrays which are extensible. Search for Linked List in google and you will gain further knowledge
Languages such as C++ require that an array hold elements of a single type. As I understand it, knowing the size of each element allows for pointer arithmetic, making access of a particular element O(1) time.
What about Python lists?
Python lists allow for mixing element types. Surely the implementation doesn't involve a slow-access data structure, such as a linked lists – right? Is accessing an element even constant time? If so, how does Python achieve it with variable element types?
Its a simple indexed lookup. Python stores references to objects in its lists, not the objects themselves. Consider a C++ list of (void*) pointers. Each pointer is a known size and array lookup is fast, but the things it points to can vary in size.
In Python, everything is an "object" (you can intuitively confirm that by something like (1).__add__(2)). So, roughly speaking, Python's list just contain references to the actual objects stored somewhere in memory. And if you look up an object via the list index - this is very, very simplified - it will redirect you to the actual object.
Here is a nice table that shows you the complexity (Big-Oh) of the different operations on lists.
I was wondering whether someone might know the answer to the following.
I'm using Python to build a character-based suffix tree. There are over 11 million nodes in the tree which fits in to approximately 3GB of memory. This was down from 7GB by using the slot class method rather than the Dict method.
When I serialise the tree (using the highest protocol) the resulting file is more than a hundred times smaller.
When I load the pickled file back in, it again consumes 3GB of memory. Where does this extra overhead come from, is it something to do with Pythons handling of memory references to class instances?
Update
Thank you larsmans and Gurgeh for your very helpful explanations and advice. I'm using the tree as part of an information retrieval interface over a corpus of texts.
I originally stored the children (max of 30) as a Numpy array, then tried the hardware version (ctypes.py_object*30), the Python array (ArrayType), as well as the dictionary and Set types.
Lists seemed to do better (using guppy to profile the memory, and __slots__['variable',...]), but I'm still trying to squash it down a bit more if I can. The only problem I had with arrays is having to specify their size in advance, which causes a bit of redundancy in terms of nodes with only one child, and I have quite a lot of them. ;-)
After the tree is constructed I intend to convert it to a probabilistic tree with a second pass, but may be I can do this as the tree is constructed. As construction time is not too important in my case, the array.array() sounds like something that would be useful to try, thanks for the tip, really appreciated.
I'll let you know how it goes.
If you try to pickle an empty list, you get:
>>> s = StringIO()
>>> pickle.dump([], s)
>>> s.getvalue()
'(l.'
and similarly '(d.' for an empty dict. That's three bytes. The in-memory representation of a list, however, contains
a reference count
a type ID, in turn containing a pointer to the type name and bookkeeping info for memory allocation
a pointer to a vector of pointers to actual elements
and yet more bookkeeping info.
On my machine, which has 64-bit pointers, the sizeof a Python list header object is 40 bytes, so that's one order of magnitude. I assume an empty dict will have similar size.
Then, both list and dict use an overallocation strategy to obtain amortized O(1) performance for their main operations, malloc introduces overhead, there's alignment, member attributes that you may or may not even be aware of and various other factors that get you the second order of magnitude.
Summing up: pickle is a pretty good compression algorithm for Python objects :)
Do you construct your tree once and then use it without modifying it further? In that case you might want to consider using separate structures for the dynamic construction and the static usage.
Dicts and objects are very good for dynamic modification, but they are not very space efficient in a read-only scenario. I don't know exactly what you are using your suffix tree for, but you could let each node be represented by a 2-tuple of a sorted array.array('c') and an equally long tuple of subnodes (a tuple instead of a vector to avoid overallocation). You traverse the tree using the bisect-module for lookup in the array. The index of a character in the array will correspond to a subnode in the subnode-tuple. This way you avoid dicts, objects and vector.
You could do something similar during the construction process, perhaps using a subnode-vector instead of subnode-tuple. But this will of course make construction slower, since inserting new nodes in a sorted vector is O(N).