Is it OK to create very large tuples in Python?

I have a quite large list (>1K elements) of objects of the same type in my Python program. The list is never modified - no elements are added, removed or changed. Is there any downside to putting the objects into a tuple instead of a list?
On the one hand, tuples are immutable, so that matches my requirements. On the other hand, using such a large tuple just feels wrong. In my mind, tuples have always been for small collections. It's a double, a triple, a quadruple... Not a two-thousand-and-fifty-seven-duple.
Is my fear of large tuples somehow justified? Is it bad for performance, unpythonic, or otherwise bad practice?

In CPython, go ahead. Under the covers, the only real difference between the storage of lists and tuples is that the C-level array holding the tuple elements is allocated in the tuple object, while a list object contains a pointer to a C-level array holding the list elements, which is allocated separately from the list object. The list implementation needs to do that because the list may grow, and so the memory containing the C-level vector may need to change its base address. A tuple can't change size, so the memory for it is allocated directly in the tuple object.
I've created tuples with millions of elements, and yet I lived to type about it ;-)
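As a rough illustration of the storage difference described above, sys.getsizeof can compare the shallow size of a list and a tuple that hold the same elements. This is only a sketch and assumes CPython; the exact byte counts depend on the version and platform, so none are shown here.

```python
import sys

# Build a list and a tuple that reference the same 2057 objects.
items_list = [object() for _ in range(2057)]
items_tuple = tuple(items_list)

# getsizeof reports the container only, not the objects it refers to.
list_size = sys.getsizeof(items_list)
tuple_size = sys.getsizeof(items_tuple)

print(f"list:  {list_size} bytes")
print(f"tuple: {tuple_size} bytes")
print(f"tuple is smaller by {list_size - tuple_size} bytes")
```

The difference is small compared with the element array itself, which is consistent with the point that the two types store their elements in essentially the same way.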
Obscure
In CPython, there can even be "a reason" to prefer giant tuples: the cyclic garbage collection scheme exempts a tuple from periodic scanning if it only contains immutable objects. Then the tuple can never be part of a cycle, so cyclic gc can ignore it. The same optimization cannot be used for lists; just because a list contains only immutable objects during one run of cyclic gc says nothing about whether that will still be the case during the next run.
This is almost never highly significant, but it can save a percent or so in a long-running program, and the benefit of exempting giant tuples grows the bigger they are.
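The gc module exposes gc.is_tracked, which can be used to peek at this behavior. The sketch below assumes CPython; whether and when a tuple is untracked is an implementation detail, so the tuple's result is printed rather than promised. On the CPython versions I have seen, the tuple typically reports False after a collection, while the list stays tracked.

```python
import gc

big_tuple = tuple(range(100_000))   # contains only ints
big_list = list(big_tuple)          # same contents, but mutable

gc.collect()  # give the collector a chance to examine both containers

tuple_tracked = gc.is_tracked(big_tuple)
list_tracked = gc.is_tracked(big_list)

print("tuple tracked by cyclic gc:", tuple_tracked)
print("list tracked by cyclic gc: ", list_tracked)
```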

Yes, it is OK.
However, depending on the operations you're doing, you might want to consider using Python's built-in set. Calling set() on your input iterable (tuple, list, or other) converts it to a set. Sets are nice for a few reasons, but especially because you get a collection of unique items with constant-time membership lookup. Note that a set does not preserve order and requires its items to be hashable.
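A minimal sketch of that suggestion, using made-up string items (the names items and lookup are only for illustration):

```python
# Hypothetical data: a large, never-modified collection of hashable items.
items = tuple(f"item-{i}" for i in range(2057))

# Build the set once; membership tests then take constant time on average,
# instead of scanning the tuple element by element.
lookup = set(items)

found = "item-2000" in lookup
missing = "item-9999" in lookup
print(found, missing)
```

If the order of the elements or duplicate entries matter, keep the tuple as well and use the set only for membership tests.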
There's nothing "un-pythonic" about holding large data sets in memory, though.

Related

Different lists in Python in size and content still share the id, does memory matter?

I have read the answer to this question as well as the related questions about different objects sharing the same id (which is explained by the Python docs on id). However, in those questions, I notice that the contents of the objects are the same (thus the memory sizes are the same, too). I experimented with lists of different sizes and contents, both in the IPython shell and in a .py file with CPython, and got the "same id" result, too:
print(id([1]), id([1,2,3]), id([1,2,3,4,1,1,1,1,1,1,1,1,1,1,1,1]))
# Result: 2067494928320 2067494928320 2067494928320
The result doesn't change no matter how many elements I add to the list or how big or small the numbers are.
So I have a question here: when an id is given, does the list size have any effect on whether the id can be reused or not? I thought that it could because according to the docs above,
CPython implementation detail: This is the address of the object in memory.
and if the address does not have enough space for the list, then a new id should be given. But I'm quite surprised about the result above.
Make a list and add some items to it; the id remains the same:
In [21]: alist = []
In [22]: id(alist)
Out[22]: 139981602327808
In [23]: for i in range(29): alist.append(i)
In [24]: id(alist)
Out[24]: 139981602327808
But the memory use for this list occurs in several parts. There's some sort of storage for the list instance itself (that's what the id references). CPython is written in C, but all items are objects (as in C++).
The list also has a data buffer; think of it as a C array with a fixed size. It holds pointers to objects elsewhere in memory. That buffer has space for the current references plus some growth space. As you add items to the list, their references are inserted in the growth space. When that fills up, the list gets a new buffer, with more growth space. List append is relatively fast, with periodic slowdowns as it copies references to the new buffer. All that occurs under the covers so that the Python programmer doesn't notice.
I suspect that in my example alist got a new buffer several times. The list doesn't report this directly, though in CPython sys.getsizeof(alist) changes whenever the buffer is resized.
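One rough way to observe those reallocations, assuming CPython, is to watch sys.getsizeof while appending: the reported size only changes when the list gets a new, larger buffer. The exact sizes and growth steps are implementation details, so the sketch just counts the changes.

```python
import sys

alist = []
id_before = id(alist)

sizes = [sys.getsizeof(alist)]      # distinct container sizes seen so far
for i in range(29):
    alist.append(i)
    size = sys.getsizeof(alist)
    if size != sizes[-1]:           # the buffer was resized on this append
        sizes.append(size)
        print(f"len={len(alist):2d}  size={size} bytes")

resize_count = len(sizes) - 1
id_unchanged = id(alist) == id_before
print("resizes:", resize_count, " id unchanged:", id_unchanged)
```

The id never changes because it refers to the list object itself, not to the separately allocated buffer.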
Storage for the objects referenced by the list is another matter. CPython creates small integer objects (up to 256) at startup, so my list (and yours) will have references to those unique integer objects. It also maintains some sort of cache of strings. But other things, such as larger numbers, other lists, dicts, custom class objects, are created as needed. Identically valued objects might well have different ids.
So while the data buffer of the list is contiguous, the items referenced in the buffer are not.
By and large, that memory management is unimportant to programmers. It's only when data structures get very large that we need to worry. And that seems to occur more with object classes like numpy arrays, which have a very different memory use.
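To connect this back to the question, here is a small sketch of why the three temporary lists printed the same id. Since the id refers to the list object itself and the element buffer lives elsewhere, the list's length plays no role. Reuse of an address for short-lived objects is common in CPython but is not guaranteed, so the second result is printed rather than assumed.

```python
# Two lists that are alive at the same time must have different ids.
a = [1]
b = [1, 2, 3]
ids_differ_while_alive = id(a) != id(b)

# Each temporary list below is freed as soon as id() returns, so the next
# temporary may be placed at the same address.
temp_ids = (id([1]), id([1, 2, 3]))
temp_ids_match = temp_ids[0] == temp_ids[1]

print("distinct while both alive:", ids_differ_while_alive)
print("temporaries shared an id: ", temp_ids_match)
```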

Since Python lists can hold elements of different types, is accessing an element worse than constant time?

Languages such as C++ require that an array hold elements of a single type. As I understand it, knowing the size of each element allows for pointer arithmetic, making access of a particular element O(1) time.
What about Python lists?
Python lists allow for mixing element types. Surely the implementation doesn't involve a slow-access data structure, such as a linked list – right? Is accessing an element even constant time? If so, how does Python achieve it with variable element types?
It's a simple indexed lookup. Python stores references to objects in its lists, not the objects themselves. Consider a C++ array of void* pointers. Each pointer is a known size and array lookup is fast, but the things it points to can vary in size.
In Python, everything is an "object" (you can intuitively confirm that by something like (1).__add__(2)). So, roughly speaking, Python's lists just contain references to the actual objects stored somewhere in memory. And if you look up an object via the list index - this is very, very simplified - it will redirect you to the actual object.
Here is a nice table that shows you the complexity (Big-Oh) of the different operations on lists.
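A short sketch of the "list of references" idea; config is a made-up example object:

```python
config = {"retries": 3}                    # hypothetical example object
mixed = [1, "two", 3.0, config, [5, 6]]    # five references to five different types

# Indexing fetches the reference stored in slot 3; it does not copy the dict.
same_object = mixed[3] is config

# A change made through the list is visible through the original name.
mixed[3]["retries"] = 5
print(same_object, config["retries"])
```

Because every slot holds a reference of the same size, the element types have no effect on how an index is turned into a slot.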

How is tuple implemented in CPython?

I've been trying to learn how CPython is implemented behind the scenes. It's great that Python is high level, but I don't like treating it like a black box.
With that in mind, how are tuples implemented? I've had a look at the source (tupleobject.c), but it's going over my head.
I see that PyTuple_MAXSAVESIZE = 20 and PyTuple_MAXFREELIST = 2000, what is saving and the "free list"? (Will there be a performance difference between tuples of length 20/21 or 2000/2001? What enforces the maximum tuple length?)
As a caveat, everything in this answer is based on what I've gleaned from looking over the implementation you linked.
It seems that the standard implementation of a tuple is simply as an array. However, there are a bunch of optimizations in place to speed things up.
First, if you try to make an empty tuple, CPython instead will hand back a canonical object representing the empty tuple. As a result, it can save on a bunch of allocations that are just allocating a single object.
Next, to avoid allocating a bunch of small objects, CPython recycles memory for many small tuples. There is a fixed constant (PyTuple_MAXSAVESIZE) such that all tuples shorter than this length are eligible to have their space reclaimed. Whenever a tuple of length less than this constant is deallocated, there is a chance that the memory associated with it will not be freed and instead will be stored in a "free list" (more on that in the next paragraph) based on its size. That way, if you ever need to allocate a tuple of size n and one has previously been allocated and is no longer in use, CPython can just recycle the old array.
The free list itself is implemented as an array of size PyTuple_MAXSAVESIZE storing pointers to unused tuples, where the nth element of the array points either to NULL (if no extra tuples of size n are available) or to a reclaimed tuple of size n. If there are multiple different tuples of size n that could be reused, they are chained together in a sort of linked list by having each tuple's zeroth entry point to the next tuple that can be reused. (Since there is only one tuple of length zero ever allocated, there is never a risk of reading a nonexistent zeroth element). In this way, the allocator can store some number of tuples of each size for reuse. To ensure that this doesn't use too much memory, there is a second constant PyTuple_MAXFREELIST that controls the maximum length of any of these linked lists within any bucket. There is then a secondary array of length PyTuple_MAXSAVESIZE that stores the length of the linked lists for tuples of each given length so that this upper limit isn't exceeded.
All in all, it's a very clever implementation!
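The scheme described above can be sketched as a toy model in pure Python. This is not CPython's C code: the class and method names are hypothetical, plain Python lists stand in for the tuples' storage, and the two constants simply mirror the values quoted in the question.

```python
MAXSAVESIZE = 20     # mirrors PyTuple_MAXSAVESIZE from the question
MAXFREELIST = 2000   # mirrors PyTuple_MAXFREELIST from the question


class FreeListModel:
    """Toy model of per-size free lists chained through slot 0."""

    def __init__(self):
        self.free_list = [None] * MAXSAVESIZE   # head of the chain for each size
        self.numfree = [0] * MAXSAVESIZE        # length of the chain for each size

    def allocate(self, size):
        """Return (block, reused) for a block with `size` slots."""
        if 0 < size < MAXSAVESIZE and self.free_list[size] is not None:
            block = self.free_list[size]
            self.free_list[size] = block[0]     # slot 0 held the link to the next block
            self.numfree[size] -= 1
            block[:] = [None] * size            # clear stale contents in place
            return block, True
        return [None] * size, False             # fall back to a fresh allocation

    def release(self, block):
        """Keep the block for reuse if it is small and the chain has room."""
        size = len(block)
        if 0 < size < MAXSAVESIZE and self.numfree[size] < MAXFREELIST:
            block[0] = self.free_list[size]     # link to the previous head
            self.free_list[size] = block
            self.numfree[size] += 1
            return True
        return False                            # too big or chain full: really freed


model = FreeListModel()
first, reused_first = model.allocate(3)
model.release(first)
second, reused_second = model.allocate(3)       # gets the recycled block back

big, _ = model.allocate(25)
kept_big = model.release(big)                   # 25 >= MAXSAVESIZE, so not cached

print(reused_first, reused_second, second is first, kept_big)
```

Size 0 is excluded from the chains because the empty tuple is a single shared object, which is also why slot 0 is always safe to use as the link.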
Because in the course of normal operations Python will create and destroy a lot of small tuples, Python keeps an internal cache of small tuples for that purpose. This helps cut down on a lot of memory allocation and deallocation churn. For the same reasons small integers from -5 to 256 are interned (made into singletons).
The PyTuple_MAXSAVESIZE definition controls the maximum size of tuples that qualify for this optimization, and the PyTuple_MAXFREELIST definition controls how many of these tuples Python keeps around in memory. When a tuple of length < PyTuple_MAXSAVESIZE is discarded, it is added to the free list if there is still room for one (in tupledealloc), to be re-used when Python creates a new small tuple (in PyTuple_New).
Python is being a little clever about how it stores these; for each tuple of length > 0, it'll reuse the first element of each cached tuple to chain up to PyTuple_MAXFREELIST tuples together into a linked list. So each element in the free_list array is a linked list of Python tuple objects, and all tuples in such a linked list are of the same size. The only exception is the empty tuple (length 0); only one is ever needed of these, it is a singleton.
So, yes, for tuples of length PyTuple_MAXSAVESIZE or more, Python is guaranteed to have to allocate memory separately for a new C structure, and that could affect performance if you create and discard such tuples a lot.
If you want to understand Python C internals, I do recommend you study the Python C API; it'll make it easier to understand the various structures Python uses to define objects, functions and methods in C.

Why can tuples contain mutable items?

If a tuple is immutable then why can it contain mutable items?
It seems like a contradiction that when a mutable item such as a list gets modified, the tuple it belongs to remains immutable.
That's an excellent question.
The key insight is that tuples have no way of knowing whether the objects inside them are mutable. The only thing that makes an object mutable is to have a method that alters its data. In general, there is no way to detect this.
Another insight is that Python's containers don't actually contain anything. Instead, they keep references to other objects. Likewise, Python's variables aren't like variables in compiled languages; instead the variable names are just keys in a namespace dictionary where they are associated with a corresponding object. Ned Batchelder explains this nicely in his blog post. Either way, objects only know their reference count; they don't know what those references are (variables, containers, or the Python internals).
Together, these two insights explain your mystery (why an immutable tuple "containing" a list seems to change when the underlying list changes). In fact, the tuple did not change (it still has the same references to other objects that it did before). The tuple could not change (because it did not have mutating methods). When the list changed, the tuple didn't get notified of the change (the list doesn't know whether it is referred to by a variable, a tuple, or another list).
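The explanation above can be checked with a short sketch (the variable names are only for illustration):

```python
colors = [10, 20, 30]
t = ("red", colors)
id_before = id(t[1])

colors.append(40)                # mutate the list through its own name

# The tuple still holds the very same reference; only the list's state changed.
same_reference = id(t[1]) == id_before

try:
    t[1] = []                    # rebinding a slot of the tuple is not allowed
    rebinding_failed = False
except TypeError:
    rebinding_failed = True

print(t, same_reference, rebinding_failed)
```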
While we're on the topic, here are a few other thoughts to help complete your mental model of what tuples are, how they work, and their intended use:
Tuples are characterized less by their immutability and more by their intended purpose.
Tuples are Python's way of collecting heterogeneous pieces of information under one roof. For example,
s = ('www.python.org', 80)
brings together a string and a number so that the host/port pair can be passed around as a socket, a composite object. Viewed in that light, it is perfectly reasonable to have mutable components.
Immutability goes hand-in-hand with another property, hashability. But hashability isn't an absolute property. If one of the tuple's components isn't hashable, then the overall tuple isn't hashable either. For example, t = ('red', [10, 20, 30]) isn't hashable.
The last example shows a 2-tuple that contains a string and a list. The tuple itself isn't mutable (i.e. it doesn't have any methods for changing its contents). Likewise, the string is immutable because strings don't have any mutating methods. The list object does have mutating methods, so it can be changed. This shows that mutability is a property of an object type -- some objects have mutating methods and some don't. This doesn't change just because the objects are nested.
Remember two things. First, immutability is not magic -- it is merely the absence of mutating methods. Second, objects don't know what variables or containers refer to them -- they only know the reference count.
Hope this was useful to you :-)
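The point about hashability made in this answer can be tried out directly. The helper is_hashable below is a hypothetical name introduced for this sketch; it simply wraps the built-in hash.

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


socket_address = ("www.python.org", 80)   # all components are hashable
t = ("red", [10, 20, 30])                 # contains an unhashable list

address_ok = is_hashable(socket_address)
t_ok = is_hashable(t)
print(address_ok, t_ok)

# Only the hashable tuple can serve as a dictionary key.
connections = {socket_address: "open"}
```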
That's because tuples don't contain lists, strings or numbers. They contain references to other objects. [1] The inability to change the sequence of references a tuple contains doesn't mean that you can't mutate the objects associated with those references. [2]
[1] Objects, values and types (see: second to last paragraph)
[2] The standard type hierarchy (see: "Immutable sequences")
As I understand it, this question needs to be rephrased as a question about design decisions: Why did the designers of Python choose to create an immutable sequence type that can contain mutable objects?
To answer this question, we have to think about the purpose tuples serve: they serve as fast, general-purpose sequences. With that in mind, it becomes quite obvious why tuples are immutable but can contain mutable objects. To wit:
Tuples are fast and memory efficient: Tuples are faster to create than lists because they are immutable. Immutability means that tuples can be created as constants and loaded as such, using constant folding. It also means they're faster and more memory efficient to create because there's no need for overallocation, etc. They're a bit slower than lists for random item access, but faster again for unpacking (at least on my machine). If tuples were mutable, then they wouldn't be as fast for purposes such as these.
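The "created as constants" point can be observed in CPython by inspecting a function's code object. This sketch relies on CPython compiler behavior, which is an implementation detail; the function names are made up for the example.

```python
def make_tuple():
    return (1, 2, 3)      # a tuple of constants can itself be stored as a constant


def make_list():
    return [1, 2, 3]      # a list must be built anew on every call


# The whole tuple sits in the code object's constants table.
tuple_is_constant = (1, 2, 3) in make_tuple.__code__.co_consts

# Each call therefore hands back the same tuple object, but a fresh list.
same_tuple_each_call = make_tuple() is make_tuple()
new_list_each_call = make_list() is not make_list()

print(tuple_is_constant, same_tuple_each_call, new_list_each_call)
```

Sharing one tuple object between calls is only safe because the tuple cannot be changed; the same trick applied to a list would let one caller's modifications leak into another's.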
Tuples are general-purpose: Tuples need to be able to contain any kind of object. They're used to (quickly) do things like variable-length argument lists (via the * operator in function definitions). If tuples couldn't hold mutable objects, they would be useless for things like this. Python would have to use lists, which would probably slow things down, and would certainly be less memory efficient.
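A small sketch of the variable-length argument case mentioned above (collect and shared are names made up for the example):

```python
def collect(*args):
    # Extra positional arguments arrive packed into a tuple.
    return args


shared = []
packed = collect(1, "two", shared)

is_tuple = type(packed) is tuple

# The tuple holds a reference to the caller's list, so the list stays mutable.
packed[2].append("changed")
print(is_tuple, shared)
```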
So you see, in order to fulfill their purpose, tuples must be immutable, but also must be able to contain mutable objects. If the designers of Python wanted to create an immutable object that guarantees that all the objects it "contains" are also immutable, they would have to create a third sequence type. The gain is not worth the extra complexity.
First of all, the word "immutable" can mean many different things to different people. I particularly like how Eric Lippert categorized immutability in his blog post [archive 2012-03-12]. There, he lists these kinds of immutability:
Realio-trulio immutability
Write-once immutability
Popsicle immutability
Shallow vs deep immutability
Immutable facades
Observational immutability
These can be combined in various ways to make even more kinds of immutability, and I'm sure more exist. The kind of immutability you seem interested in is deep (also known as transitive) immutability, in which immutable objects can only contain other immutable objects.
The key point of this is that deep immutability is only one of many, many kinds of immutability. You can adopt whichever kind you prefer, as long as you are aware that your notion of "immutable" probably differs from someone else's notion of "immutable".
You cannot change the id of its items. So it will always contain the same items.
$ python
>>> t = (1, [2, 3])
>>> id(t[1])
12371368
>>> t[1].append(4)
>>> id(t[1])
12371368
I'll go out on a limb here and say that the relevant part here is that while you can change the contents of a list, or the state of an object, contained within a tuple, what you can't change is that the object or list is there. If you had something that depended on thing[3] being a list, even if empty, then I could see this being useful.
One reason is that there is no general way in Python to convert a mutable type into an immutable one (see the rejected PEP 351, and the linked discussion for why it was rejected). Thus, it would be impossible to put various types of objects in tuples if it had this restriction, including just about any user-created non-hashable object.
The only reason that dictionaries and sets have this restriction is that they require the objects to be hashable, since they are internally implemented as hash tables. But note that, ironically, dictionaries and sets themselves are not immutable (or hashable). A tuple does not need an object's hash in order to store it, so the object's mutability does not matter.
A tuple is immutable in the sense that the tuple itself cannot expand or shrink, not in the sense that all the items it contains are themselves immutable. Otherwise tuples would be dull.
