Const List of Homogeneous Data - python

How do I define a constant list of namedtuples? I wish to define something like this in a module and use it throughout my code. I will then use filters on it when I'm only interested in ADULTS for example.
However, I dont want other coders to be able to corrupt this data, which they can do if I use a list. I realise that I could use a tuple but my understanding is that there is a de facto standard that tuples are used for heterogeneous data while I actually have homogeneous data:
http://jtauber.com/blog/2006/04/15/python_tuples_are_not_just_constant_lists/
If I have understood this article correctly.
human_map = [
Activity(ADULT, EATING, OFTEN, 5),
Activity(ADULT, SLEEPING, PERIODIC, 24),
Activity(BABY, CRYING, OFTEN, 1)]

Lists and tuples may be used for heterogeneous data, but in Python there's nothing wrong with using them for homogeneous data.
Other languages provide basic collections (eg arrays in C/C++) that can only hold homogeneous data; more complicated collections must be used for heterogeneous data, and in those languages it is best practice to use the more efficient arrays for homogeneous data, rather than incurring the overheads of the fancier types of collection. But that logic does not apply to Python.

Lists and tuples are both for arbitrary Python objects. You can use one or another: the only useful difference is that tuples are immutable.
There are only 3 types of immutable sequences in Python: strings, unicode and tuples (in Python 2), or strings, tuples and bytes (in Python 3).
So, if you need an immutable list of objects, use tuples.
Reference: https://docs.python.org/2/reference/datamodel.html#the-standard-type-hierarchy

No matter which method you use, there's very little to stop a developer from making your name human_map refer to an entirely different immutable container. As an example:
human_map = (some, immutable, elements,)
And then the malicious developer does something like the following:
import humans
humans.human_map = (something, else, entirely,)
At some point in Python, you just need to stop worrying and trust the clients of your module a little. If people want to shoot themselves in the foot, let them.

Related

Why does Python allow mixed type lists?

Is there a reason that Python allows lists to contain multiple types? If I had a mixed collection of objects I would think the safe data type to use would be a tuple. Also, I find it strange that list methods (like sort) can be called on mixed lists so I assume there must be a good reason for allowing this. It appears at first glance that this would make writing type safe functions much more difficult.
The difference between a tuple and a list is that lists are mutable and tuples are not. So its not about data type safety, its about whether or not you want the elements to be able to be changed

Lists are for homogeneous data and tuples are for heterogeneous data... why?

I feel like this must have been asked before (probably more than once), so potential apologies in advance, but I can't find it anywhere (here or through Google).
Anyway, when explaining the difference between lists and tuples in Python, the second thing mentioned, after tuples being immutable, is that lists are best for homogeneous data and tuples are best for heterogeneous data. But nobody seems to think to explain why that's the case. So why is that the case?
First of all, that guideline is only sort of true. You're free to use tuples for homogenous data and lists for heterogenous data, and there may be cases where that's a fine thing to do. One important case is if you need the collection to the hashable so you can use it as a dictionary key; in this case you must use a tuple, even if all the elements are homogenous in nature.
Also note that the homogenous/heterogenous distinction is really about the semantics of the data, not just the types. A sequence of a name, occupation, and address would probably be considered heterogenous, even though all three might be represented as strings. So it's more important to think about what you're going to do with the data (i.e., will you actually treat the elements the same) than about what types they are.
That said, I think one reason lists are preferred for homogenous data is because they're mutable. If you have a list of several things of the same kind, it may make sense to add another one to the list, or take one away; when you do that, you're still left with a list of things of the same kind.
By contrast, if you have a collection of things of heterogenous kinds, it's usually because you have a fixed structure or "schema" to them (e.g., the first one is an ID number, the second is a name, the third is an address, or whatever). In this case, it doesn't make sense to add or remove an element from the collection, because the collection is an integrated whole with specified roles for each element. You can't add an element without changing your whole schema for what the elements represent.
In short, changes in size are more natural for homogenous collections than for heterogenous collections, so mutable types are more natural for homogenous collections.
The difference is philosophical more than anything.
A tuple is meant to be a shorthand for fixed and predetermined data meanings. For example:
person = ("John", "Doe")
So, this example is a person, who has a first name and last name. The fixed nature of this is the critical factor. Not the data type. Both "John" and "Doe" are strings, but that is not the point. The benefit of this is unchangeable nature:
You are never surprised to find a value missing. person always has two values. Always.
You are never surprised to find something added. Unlike a dictionary, another bit of code can't "add a new key" or attribute
This predictability is called immutability
It is just a fancy way of saying it has a fixed structure.
One of the direct benefits is that it can be used as a dictionary key. So:
some_dict = {person: "blah blah"}
works. But:
da_list = ["Larry", "Smith"]
some_dict = {da_list: "blah blah"}
does not work.
Don't let the fact that element reference is similar (person[0] vs da_list[0]) throw you off. person[0] is a first name. da_list[0] is simply the first item in a list at this moment in time.
It's not a rule, it's just a tradition.
In many languages, lists must be homogenous and tuples must be fixed-length. This is true of C++, C#, Haskell, Rust, etc. Tuples are used as anonymous structures. It is the same way in mathematics.
Python's type system, however, does not allow you to make these distinctions: you can make tuples of dynamic length, and you can make lists with heterogeneous data. So you are allowed to do whatever you want with lists and tuples in Python, it just might be surprising to other people reading your code. This is especially true if the people reading your code have a background in mathematics or are more familiar with other languages.
Lists are often used for iterating over them, and performing the same operation to every element in the list. Lots of list operations are based on that. For that reason, it's best to have every element be the same type, lest you get an exception because an item was the wrong type.
Tuples are more structured data; they're immutable so if you handle them correctly you won't run into type errors. That's the data structure you'd use if you specifically want to combine multiple types (like an on-the-fly struct).

how do you distinguish numpy arrays from Python's built-in objects

PEP8 has naming conventions for e.g. functions (lowercase), classes (CamelCase) and constants (uppercase).
It seems to me that distinguishing between numpy arrays and built-ins such as lists is probably more important as the same operators such as "+" actually mean something totally different.
Does anyone have any naming conventions to help with this?
You may use a prefix np_ for numpy arrays, thus distinguishing them from other variables.
numpy arrays and lists should occupy similar syntactic roles in your code and as such I wouldn't try to distinguish between them by naming conventions. Since everything in python is an object the usual naming conventions are there not to help distinguish type so much as usage. Data, whether represented in a list or a numpy.ndarray has the same usage.
I agree that it's awkward that eg. + means different things for lists and arrays. I implicitly deal with this by never putting anything like numerical data in a list but rather always in an array. That way I know if I want to concatenate blocks of data I should be using numpy.hstack. That said, there are definitely cases where I want to build up a list through concatenation and turn it into a numpy array when I'm done. In those cases the code block is usually short enough that it's clear what's going on. Some comments in the code never hurt.

Why can tuples contain mutable items?

If a tuple is immutable then why can it contain mutable items?
It is seemingly a contradiction that when a mutable item such as a list does get modified, the tuple it belongs to maintains being immutable.
That's an excellent question.
The key insight is that tuples have no way of knowing whether the objects inside them are mutable. The only thing that makes an object mutable is to have a method that alters its data. In general, there is no way to detect this.
Another insight is that Python's containers don't actually contain anything. Instead, they keep references to other objects. Likewise, Python's variables aren't like variables in compiled languages; instead the variable names are just keys in a namespace dictionary where they are associated with a corresponding object. Ned Batchhelder explains this nicely in his blog post. Either way, objects only know their reference count; they don't know what those references are (variables, containers, or the Python internals).
Together, these two insights explain your mystery (why an immutable tuple "containing" a list seems to change when the underlying list changes). In fact, the tuple did not change (it still has the same references to other objects that it did before). The tuple could not change (because it did not have mutating methods). When the list changed, the tuple didn't get notified of the change (the list doesn't know whether it is referred to by a variable, a tuple, or another list).
While we're on the topic, here are a few other thoughts to help complete your mental model of what tuples are, how they work, and their intended use:
Tuples are characterized less by their immutability and more by their intended purpose.
Tuples are Python's way of collecting heterogeneous pieces of information under one roof. For example,
s = ('www.python.org', 80)
brings together a string and a number so that the host/port pair can be passed around as a socket, a composite object. Viewed in that light, it is perfectly reasonable to have mutable components.
Immutability goes hand-in-hand with another property, hashability. But hashability isn't an absolute property. If one of the tuple's components isn't hashable, then the overall tuple isn't hashable either. For example, t = ('red', [10, 20, 30]) isn't hashable.
The last example shows a 2-tuple that contains a string and a list. The tuple itself isn't mutable (i.e. it doesn't have any methods that for changing its contents). Likewise, the string is immutable because strings don't have any mutating methods. The list object does have mutating methods, so it can be changed. This shows that mutability is a property of an object type -- some objects have mutating methods and some don't. This doesn't change just because the objects are nested.
Remember two things. First, immutability is not magic -- it is merely the absence of mutating methods. Second, objects don't know what variables or containers refer to them -- they only know the reference count.
Hope, this was useful to you :-)
That's because tuples don't contain lists, strings or numbers. They contain references to other objects.1 The inability to change the sequence of references a tuple contains doesn't mean that you can't mutate the objects associated with those references.2
1. Objects, values and types (see: second to last paragraph)
2. The standard type hierarchy (see: "Immutable sequences")
As I understand it, this question needs to be rephrased as a question about design decisions: Why did the designers of Python choose to create an immutable sequence type that can contain mutable objects?
To answer this question, we have to think about the purpose tuples serve: they serve as fast, general-purpose sequences. With that in mind, it becomes quite obvious why tuples are immutable but can contain mutable objects. To wit:
Tuples are fast and memory efficient: Tuples are faster to create than lists because they are immutable. Immutability means that tuples can be created as constants and loaded as such, using constant folding. It also means they're faster and more memory efficient to create because there's no need for overallocation, etc. They're a bit slower than lists for random item access, but faster again for unpacking (at least on my machine). If tuples were mutable, then they wouldn't be as fast for purposes such as these.
Tuples are general-purpose: Tuples need to be able to contain any kind of object. They're used to (quickly) do things like variable-length argument lists (via the * operator in function definitions). If tuples couldn't hold mutable objects, they would be useless for things like this. Python would have to use lists, which would probably slow things down, and would certainly be less memory efficient.
So you see, in order to fulfill their purpose, tuples must be immutable, but also must be able to contain mutable objects. If the designers of Python wanted to create an immutable object that guarantees that all the objects it "contains" are also immutable, they would have to create a third sequence type. The gain is not worth the extra complexity.
First of all, the word "immutable" can mean many different things to different people. I particularly like how Eric Lippert categorized immutability in his blog post [archive 2012-03-12]. There, he lists these kinds of immutability:
Realio-trulio immutability
Write-once immutability
Popsicle immutability
Shallow vs deep immutability
Immutable facades
Observational immutability
These can be combined in various ways to make even more kinds of immutability, and I'm sure more exist. The kind of immutability you seems interested in deep (also known as transitive) immutability, in which immutable objects can only contain other immutable objects.
The key point of this is that deep immutability is only one of many, many kinds of immutability. You can adopt whichever kind you prefer, as long as you are aware that your notion of "immutable" probably differs from someone else's notion of "immutable".
You cannot change the id of its items. So it will always contain the same items.
$ python
>>> t = (1, [2, 3])
>>> id(t[1])
12371368
>>> t[1].append(4)
>>> id(t[1])
12371368
I'll go out on a limb here and say that the relevant part here is that while you can change the contents of a list, or the state of an object, contained within a tuple, what you can't change is that the object or list is there. If you had something that depended on thing[3] being a list, even if empty, then I could see this being useful.
One reason is that there is no general way in Python to convert a mutable type into an immutable one (see the rejected PEP 351, and the linked discussion for why it was rejected). Thus, it would be impossible to put various types of objects in tuples if it had this restriction, including just about any user-created non-hashable object.
The only reason that dictionaries and sets have this restriction is that they require the objects to be hashable, since they are internally implemented as hash tables. But note that, ironically, dictionaries and sets themselves are not immutable (or hashable). Tuples do not use an object's hash, so its mutability does not matter.
A tuple is immutable in the sense that the tuple itself can not expand or shrink, not that all the items contained themselves are immutable. Otherwise tuples are dull.

Python: variable-length tuples

[Python 3.1]
I'm following up on the design concept that tuples should be of known length (see this comment), and unknown length tuples should be replaced with lists in most circumstances. My question is under what circumstances should I deviate from that rule?
For example, I understand that tuples are faster to create from string and numeric literals than lists (see another comment). So, if I have performance-critical code where there are numerous calculations such as sumproduct(tuple1, tuple2), should I redefine them to work on lists despite a performance hit? (sumproduct((x, y, z), (a, b, c)) is defined as x * a + y * b + z * c, and its arguments have unspecified but equal lengths).
And what about the tuple that is automatically built by Python when using def f(*x)? I assume it's not something I should coerce to list every time I use it.
Btw, is (x, y, z) faster to create than [x, y, z] (for variables rather than literals)?
In my mind, the only interesting distinction between tuples and lists is that lists are mutable and tuples are not. The other distinctions that people mention seem completely artificial to me: tuples are like structs and lists are like arrays (this is where the "tuples should be a known length" comes from). But how is struct-ness aligned with immutability? It isn't.
The only distinction that matters is the distinction the language makes: mutability. If you need to modify the object, definitely use a list. If you need to hash the object (as a key in a dict, or an element of a set), then you need it to be immutable, so use a tuple. That's it.
I always use the most the appropriate data structure for the job and do not really worry about if a tuple would save me half a millisecond here or there. Pre-obfuscating your code does not usually pay off in the end. If the code runs too slow you can always profile it later and change the .01% of code where it really matters.
All the things you are talking about are tied in to the implementation of the python version and the hardware it is running on. You can always time those things your self to see what they would be on your machine.
A common example of this is the 'old immutable strings are slow to concatenate' in python. This was true about 10 years ago, and then they changed the implementation in 2.4 or 2.5. If you do your own tests they now run faster than lists, but people are convinced of this still today and use silly constructs that actually ran slower!
under what circumstances should I deviate from that [tuples should be of known length] rule?
None.
It's a matter of meaning. If an object has meaning based on a fixed number of elements, then it's a tuple. (x,y) coordinates, (c,m,y,k) colors, (lat, lon) position, etc., etc.
A tuple has a fixed number of elements based on the problem domain in general and the specifics of the problem at hand.
Designing a tuple with an indefinite number of elements makes little sense. When do we switch from (x,y) to (x,y,z) and then to (x,y,z,w) coordinates? Not by simply concatenating a value as if it's a list? If we're moving from 2-d to 3-d coordinates there's usually some pretty fancy math to map the coordinate systems. Not appending an element to a list.
What does it mean to move from (r,g,b) colors to something else? What is the 4th color in the rgb system? For that matter, what's the fifth color in the cmyk ststem?
Tuples do not change size.
*args is a tuple because it is immutable. Yes, it has an indefinite number of arguments, but it's a rare counter-exmaple to tuples of known, defined sizes.
What to do about an indefinite length tuple. This counter-example is so profound that we have two choices.
Reject the very idea that tuples are fixed-length, and constrained by the problem,. The very idea of (x,y) coordinates and (r,g,b) colors is utterly worthless and wrong because of this counter-example. Fixed-length tuples? Never.
Always convert all *args to lists to always have a fussy level of unthinking conformance to a design principle. Covert to lists? Always.
I love all or nothing choices, since they make software engineering so simplistic and unthinking.
Perhaps, in these corner cases, there's a tiny scrap of "this requires thinking" here. A tiny scrap.
Yes, *args is a tuple. Yes, it's of indefinite length. Yes, it's a counter-example where "fixed by the problem domain" is trumped by "simply immutable".
This leads us to the third choice in the case where a sequence is immutable for a different reason. You'll never mutate it, so it's okay to be a tuple of indefinite size. In the even-more-rare case where you're popping values of *args because you're treating it like a stack or a queue, then you might want to make a list out of it. But we can't pre-solve all possible problems.
Sometimes Thinking Is Required.
When you're doing design, you design a tuple for a reason. To impose a meaningful structure on your data. Fixed-length number of elements? Tuple. Variable number of elements (i.e., mutable)? List.
In this case, you should probably consider using numpy and numpy arrays.
There is some overhead converting to and from numpy arrays, but if you are doing a bunch of calculation it will be much faster

Categories