[Python 3.1]
I'm following up on the design concept that tuples should be of known length (see this comment), and unknown length tuples should be replaced with lists in most circumstances. My question is under what circumstances should I deviate from that rule?
For example, I understand that tuples are faster to create from string and numeric literals than lists (see another comment). So, if I have performance-critical code where there are numerous calculations such as sumproduct(tuple1, tuple2), should I redefine them to work on lists despite a performance hit? (sumproduct((x, y, z), (a, b, c)) is defined as x * a + y * b + z * c, and its arguments have unspecified but equal lengths).
And what about the tuple that is automatically built by Python when using def f(*x)? I assume it's not something I should coerce to list every time I use it.
Btw, is (x, y, z) faster to create than [x, y, z] (for variables rather than literals)?
In my mind, the only interesting distinction between tuples and lists is that lists are mutable and tuples are not. The other distinctions that people mention seem completely artificial to me: tuples are like structs and lists are like arrays (this is where the "tuples should be a known length" comes from). But how is struct-ness aligned with immutability? It isn't.
The only distinction that matters is the distinction the language makes: mutability. If you need to modify the object, definitely use a list. If you need to hash the object (as a key in a dict, or an element of a set), then you need it to be immutable, so use a tuple. That's it.
I always use the most the appropriate data structure for the job and do not really worry about if a tuple would save me half a millisecond here or there. Pre-obfuscating your code does not usually pay off in the end. If the code runs too slow you can always profile it later and change the .01% of code where it really matters.
All the things you are talking about are tied in to the implementation of the python version and the hardware it is running on. You can always time those things your self to see what they would be on your machine.
A common example of this is the 'old immutable strings are slow to concatenate' in python. This was true about 10 years ago, and then they changed the implementation in 2.4 or 2.5. If you do your own tests they now run faster than lists, but people are convinced of this still today and use silly constructs that actually ran slower!
under what circumstances should I deviate from that [tuples should be of known length] rule?
None.
It's a matter of meaning. If an object has meaning based on a fixed number of elements, then it's a tuple. (x,y) coordinates, (c,m,y,k) colors, (lat, lon) position, etc., etc.
A tuple has a fixed number of elements based on the problem domain in general and the specifics of the problem at hand.
Designing a tuple with an indefinite number of elements makes little sense. When do we switch from (x,y) to (x,y,z) and then to (x,y,z,w) coordinates? Not by simply concatenating a value as if it's a list? If we're moving from 2-d to 3-d coordinates there's usually some pretty fancy math to map the coordinate systems. Not appending an element to a list.
What does it mean to move from (r,g,b) colors to something else? What is the 4th color in the rgb system? For that matter, what's the fifth color in the cmyk ststem?
Tuples do not change size.
*args is a tuple because it is immutable. Yes, it has an indefinite number of arguments, but it's a rare counter-exmaple to tuples of known, defined sizes.
What to do about an indefinite length tuple. This counter-example is so profound that we have two choices.
Reject the very idea that tuples are fixed-length, and constrained by the problem,. The very idea of (x,y) coordinates and (r,g,b) colors is utterly worthless and wrong because of this counter-example. Fixed-length tuples? Never.
Always convert all *args to lists to always have a fussy level of unthinking conformance to a design principle. Covert to lists? Always.
I love all or nothing choices, since they make software engineering so simplistic and unthinking.
Perhaps, in these corner cases, there's a tiny scrap of "this requires thinking" here. A tiny scrap.
Yes, *args is a tuple. Yes, it's of indefinite length. Yes, it's a counter-example where "fixed by the problem domain" is trumped by "simply immutable".
This leads us to the third choice in the case where a sequence is immutable for a different reason. You'll never mutate it, so it's okay to be a tuple of indefinite size. In the even-more-rare case where you're popping values of *args because you're treating it like a stack or a queue, then you might want to make a list out of it. But we can't pre-solve all possible problems.
Sometimes Thinking Is Required.
When you're doing design, you design a tuple for a reason. To impose a meaningful structure on your data. Fixed-length number of elements? Tuple. Variable number of elements (i.e., mutable)? List.
In this case, you should probably consider using numpy and numpy arrays.
There is some overhead converting to and from numpy arrays, but if you are doing a bunch of calculation it will be much faster
Related
I am no expert in how Python lists are implemented but from what I understand, they are implemented as dynamic arrays rather than linked lists. My question is therefore, if python lists are implemented as arrays, why are they called 'lists' and not 'arrays'.
Is this just a semantic issue or is there some deeper technical reason behind this. Is the dynamic array implementation in Python close to a list implementation? Or is it because the dynamic array implementation makes its behaviour closer to a list's behaviour than an array? Or some other reason I do not understand?
To be clear, I am not asking specifically how or why Python lists are implemented as dynamic arrays, although that might be relevant to the answer.
They're named after the list abstract data type, not linked lists. This is similar to the naming of Java's List interface and C#'s List<T>.
To further elaborate on user2357112's answer, as pointed out in the wikipedia article:
In computer science, a list or sequence is an abstract data type that
represents a countable number of ordered values, where the same value
may occur more than once.
Further,
List data types are often implemented using array data structures or linked lists of some sort, but other data structures may be more appropriate for some applications.
In CPython, lists are implemented as dynamic arrays of pointers, and their behaviour is much closer to the List abstract data type than the Array abstract data type. From this perspective, the naming of 'List' is accurate.
At the end of the day when implementing a list what you want is constant(O(1)) access(a[i]), insert(a.append(i)) and delete(a.remove(i)) times. With a linked list some of this operations could be as slow as O(n), i.e. deleting the last element of linked lists if you don't have a pointer to the tail.
With dynamic arrays you get constant delete and access times but what about deleting? Here we get amortized constant time. What is that? If the array is full of N elements, the insert will take O(N) and you'll end up with an array of size 2N. This is a rare event, thus we say we have amortized O(1).
Hope it helps.
Sources:
https://docs.python.org/2/faq/design.html
I have a quite large list (>1K elements) of objects of the same type in my Python program. The list is never modified - no elements are added, removed or changed. Are there any downside to putting the objects into a tuple instead of a list?
On the one hand, tuples are immutable so that matches my requirements. On the other hand, using such a large tuple just feels wrong. In my mind, tuples has always been for small collections. It's a double, a tripple, a quadruple... Not a two-thousand-and-fiftyseven-duple.
Is my fear of large tuples somehow justified? Is it bad for performance, unpythonic, or otherwise bad practice?
In CPython, go ahead. Under the covers, the only real difference between the storage of lists and tuples is that the C-level array holding the tuple elements is allocated in the tuple object, while a list object contains a pointer to a C-level array holding the list elements, which is allocated separately from the list object. The list implementation needs to do that because the list may grow, and so the memory containing the C-level vector may need to change its base address. A tuple can't change size, so the memory for it is allocated directly in the tuple object.
I've created tuples with millions of elements, and yet I lived to type about it ;-)
Obscure
In CPython, there can even be "a reason" to prefer giant tuples: the cyclic garbage collection scheme exempts a tuple from periodic scanning if it only contains immutable objects. Then the tuple can never be part of a cycle, so cyclic gc can ignore it. The same optimization cannot be used for lists; just because a list contains only immutable objects during one run of cyclic gc says nothing about whether that will still be the case during the next run.
This is almost never highly significant, but it can save a percent or so in a long-running program, and the benefit of exempting giant tuples grows the bigger they are.
Yes, it is OK.
However, depending on the operations you're doing, you might want to consider using the set function in Python. This will convert your input iterable (tuple, list, or other) to a set. Sets are nice for a few reasons, but especially because you get a unique list of items that has constant time lookup for items.
There's nothing "un-pythonic" about holding large data sets in memory, though.
I have a really big, like huge, Dictionary (it isn't really but pretend because it is easier and not relevant) that contains the same strings over and over again. I have verified that I can store a lot more in memory if I do poor man's compression on the system and instead store INTs that correspond to the string.
animals = ['ape','butterfly,'cat','dog']
exists in a list and therefore has an index value such that animals.index('cat') returns 2
This allows me to store in my object BobsPets = set(2,3)
rather than Cat and Dog
For the number of items the memory savings are astronomical. (Really Don't try and dissuade me that is well tested.
Currently I then convert the INTs back to Strings with a FOR loop
tempWordList = set()
for IntegOfIndex in TempSet:
tempWordList.add(animals[IntegOfIndex])
return tempWordList
This code works. It feels "Pythonic," but it feels like there should be a better way. I am in Python 2.7 on AppEngine if that matters. It may since I wonder if Numpy has something I missed.
I have about 2.5 Million things in my object, and each has an average of 3 of these "pets" and there 7500-ish INTs that represent the pets. (no they aren't really pets)
I have considered using a Dictionary with the position instead of using Index. This doesn't seem faster, but am interested if anyone thinks it should be. (it took more memory and seemed to be the same speed or really close)
I am considering running a bunch of tests with Numpy and its array's rather than lists, but before I do, I thought I'd ask the audience and see if I would be wasting time on something that I have already reached the best solution for.
Last thing, The solution should be pickable since I do that for loading and transferring data.
It turns out that since my list of strings is fixed, and I just wish the index of the string, I am building what is essentially an index array that is immutable. Which is in short a Tuple.
Moving to a Tuple rather than a list gains about 30% improvement in speed. Far more than I would have anticipated.
The bonus is largest on very large lists. It seems that each time you cross a bit threshold the bonus increases, so in sub 1024 lists their is basically no bonus and at a million there is pretty significant.
The Tuple also uses very slightly less memory for the same data.
An aside, playing with the lists of integers, you can make these significantly smaller by using a NUMPY array, but the advantage doesn't extend to pickling. The Pickles will be about 15% larger. I think this is because of the object description being stored in the pickle, but I didn't spend much time looking.
So in short the only change was to make the Animals list a Tuple. I really was hoping the answer was something more exotic.
I feel like this must have been asked before (probably more than once), so potential apologies in advance, but I can't find it anywhere (here or through Google).
Anyway, when explaining the difference between lists and tuples in Python, the second thing mentioned, after tuples being immutable, is that lists are best for homogeneous data and tuples are best for heterogeneous data. But nobody seems to think to explain why that's the case. So why is that the case?
First of all, that guideline is only sort of true. You're free to use tuples for homogenous data and lists for heterogenous data, and there may be cases where that's a fine thing to do. One important case is if you need the collection to the hashable so you can use it as a dictionary key; in this case you must use a tuple, even if all the elements are homogenous in nature.
Also note that the homogenous/heterogenous distinction is really about the semantics of the data, not just the types. A sequence of a name, occupation, and address would probably be considered heterogenous, even though all three might be represented as strings. So it's more important to think about what you're going to do with the data (i.e., will you actually treat the elements the same) than about what types they are.
That said, I think one reason lists are preferred for homogenous data is because they're mutable. If you have a list of several things of the same kind, it may make sense to add another one to the list, or take one away; when you do that, you're still left with a list of things of the same kind.
By contrast, if you have a collection of things of heterogenous kinds, it's usually because you have a fixed structure or "schema" to them (e.g., the first one is an ID number, the second is a name, the third is an address, or whatever). In this case, it doesn't make sense to add or remove an element from the collection, because the collection is an integrated whole with specified roles for each element. You can't add an element without changing your whole schema for what the elements represent.
In short, changes in size are more natural for homogenous collections than for heterogenous collections, so mutable types are more natural for homogenous collections.
The difference is philosophical more than anything.
A tuple is meant to be a shorthand for fixed and predetermined data meanings. For example:
person = ("John", "Doe")
So, this example is a person, who has a first name and last name. The fixed nature of this is the critical factor. Not the data type. Both "John" and "Doe" are strings, but that is not the point. The benefit of this is unchangeable nature:
You are never surprised to find a value missing. person always has two values. Always.
You are never surprised to find something added. Unlike a dictionary, another bit of code can't "add a new key" or attribute
This predictability is called immutability
It is just a fancy way of saying it has a fixed structure.
One of the direct benefits is that it can be used as a dictionary key. So:
some_dict = {person: "blah blah"}
works. But:
da_list = ["Larry", "Smith"]
some_dict = {da_list: "blah blah"}
does not work.
Don't let the fact that element reference is similar (person[0] vs da_list[0]) throw you off. person[0] is a first name. da_list[0] is simply the first item in a list at this moment in time.
It's not a rule, it's just a tradition.
In many languages, lists must be homogenous and tuples must be fixed-length. This is true of C++, C#, Haskell, Rust, etc. Tuples are used as anonymous structures. It is the same way in mathematics.
Python's type system, however, does not allow you to make these distinctions: you can make tuples of dynamic length, and you can make lists with heterogeneous data. So you are allowed to do whatever you want with lists and tuples in Python, it just might be surprising to other people reading your code. This is especially true if the people reading your code have a background in mathematics or are more familiar with other languages.
Lists are often used for iterating over them, and performing the same operation to every element in the list. Lots of list operations are based on that. For that reason, it's best to have every element be the same type, lest you get an exception because an item was the wrong type.
Tuples are more structured data; they're immutable so if you handle them correctly you won't run into type errors. That's the data structure you'd use if you specifically want to combine multiple types (like an on-the-fly struct).
In the article linked below, the author compares the efficiency of different string concatenation methodologies in python:
http://www.skymind.com/~ocrow/python_string/
One thing that I did not understand is, why does method 3 (Mutable Character Arrays) result in a significantly slower performance than method 4 (joining a list of strings)
both of them are mutable and I would think that they should have comparable performance.
"Both of them are mutable" is misleading you a bit.
It's true that in the list-append method, the list is mutable. But building up the list isn't the slow part. If you have 1000 strings of average length 1000, you're doing 1000000 mutations to the array, but only 1000 mutations to the list (plus 1000 increfs to string objects).
In particular, that means the array will have to spend 1000x as much time expanding (allocating new storage and copying the whole thing so far).
The slow part for the list method is the str.join call at the end. But that isn't mutable, and doesn't require any expanding. It uses two passes, to first calculate the size needed, then copy everything into it.
Also, that code inside str.join has had (and has continued to have since that article was written 9 years ago) a lot of work to optimize it, because it's a very common, and recommended, idiom that many real programs depend on every day; array has barely been touched since it was first added to the language.
But if you really want to understand the differences, you have to look at the source. In 2.7, the main work for the array method is in array_fromstring, while the main work for the list method is in string_join. You can see how the latter takes advantage of the fact that we already know all of the strings we're going to be joining up at the start, while the former can't.