Does `bytearray` not contradict the Zen of Python? - python

PEP 3137 introduced bytearray as a mutable 8-bit array type. However, a list of the immutable bytes type accomplishes the same goal, and actually has better performance, albeit perhaps with a clunkier syntax. This new type contradicts the Zen of Python:
There should be one-- and preferably only one --obvious way to do it.
So my question is: is there any documented major advantage or design consideration for using a bytearray over a list of bytes?
So far, I have not found a piece of motivation documented in the PEP or in the documentation pages. In fact, the documentation treats them as near-equals:
The bytearray type is a mutable sequence of integers in the range 0 <= x < 256. It has most of the usual methods of mutable sequences...
And then,
List and bytearray objects support additional operations that allow in-place modification of the object. Other mutable sequence types (when added to the language) should also support these operations.
As bytearrays are statically typed (as 8-bit unsigned integers) one might expect a performance increase, but as mentioned above the inverse is probably true. Also, there should be no memory advantage for a bytearray over a list of bytes. I could imagine that there was a need for an itertools.chain-style mutable type, but this is not mentioned anywhere and does not seem to be the design goal.

First, a list-style container for sequences of objects that can seamlessly iterate through them (something that itertools.chain provides, actually) definitely seems convenient. As PEP 3137 mentions, the array module might have served for this purpose, but was "far from ideal".
The authors wished for "both a mutable and an immutable bytes type", which might mean that the design goal was having the same interface for both a mutable (perhaps using non-contiguous memory for fast insertions and deletions) and an immutable (definitely contiguous memory) implementation. This created, as far as I can appreciate, a jumble of sequence types, including array.array, bytes, bytearray, str, unicode, list, tuple and generator. In essence, they provide interfaces to statically typed or duck-typed, mutable or immutable sequences which are stored in memory or evaluated on the fly (generators). Most of these do follow the rationale of the Abstract Base Classes, but I do think that there is more design to be done before this can be deemed Pythonic.
Note: I'm leaving this answer open for editing hoping that people might contribute insights or corrections.
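For anyone who wants to check the memory and syntax claims on their own machine, here is a minimal sketch. It assumes CPython (sys.getsizeof numbers differ between versions and platforms, so only relative sizes are compared), and the names payload, as_bytearray and as_list are made up for illustration.

```python
import sys

payload = bytes(range(256)) * 4              # 1024 arbitrary byte values

as_bytearray = bytearray(payload)            # one contiguous buffer, one byte per item
as_list = [bytes([b]) for b in payload]      # a list of 1-byte bytes objects

# getsizeof on a list counts only its pointer array, not the bytes objects
# it refers to, so this comparison already favours the list.
bytearray_size = sys.getsizeof(as_bytearray)
list_size = sys.getsizeof(as_list)

# In-place modification with slice assignment works on the bytearray...
as_bytearray[0:4] = b"HEAD"
# ...and the bytes-style methods are available directly.
starts_with_head = as_bytearray.startswith(b"HEAD")

# The list has to be joined back into bytes before it can be used as one.
joined = b"".join(as_list)
```

On a typical 64-bit CPython build the list's pointer array alone is several times larger than the whole bytearray, though the exact ratio is an observation about one implementation, not a language guarantee.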

Related

Does type(a_subclass) return a singleton in Python?

I'm new to Python, and I need to compare two responses to type(a_subclass) for equality (i.e., I need to be sure the two subclasses of a parent class are exactly the same.) Can I safely assume that type returns singletons and use the is operator? Or is == necessary? Is there something else I should be worried about, like comparing package and module names?
Note the similar question here, to which the answers are confusing at best. I'm kinda looking for a 'yes' or a 'no.'
I wouldn’t use the word “singleton” for type objects (in this question or the near-duplicate): a singleton is the only object of its type, and type(1) and type(None) are two objects of the same type (namely, type). The proper way to phrase this question would seem to be “Can two type objects represent the same type?”.
If so, it might be reasonable for them to compare ==, as suggested in the question. (Conversely, one could imagine different types comparing == based on something like their sizes or names, but we can ignore such useless possibilities.) However, types are mutable in Python, the simplest example being that you can assign static variables as Class.var=2. Plainly it must be possible to read that value with type(Class()).var, which implies that they are the same object. While it would be possible to implement such syntax through a proxy object, it is correct to assume that no such needless complexity is introduced (especially as it would involve additional memory allocation and reference counting overhead).
Seen from this perspective, the answer is almost a tautology: Python types are defined by the identity of their representative objects, so of course the same object must be used everywhere.
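A short sketch of the argument above; the class names Parent and Child are invented for the example.

```python
class Parent:
    pass


class Child(Parent):
    pass


a = Child()
b = Child()

# Both instances report the very same type object, so `is` is safe here.
same_type_object = type(a) is type(b)

# Setting a class attribute is visible through type(instance),
# which only works because it is the same object.
Child.var = 2
seen_through_instance_type = type(a).var

# A subclass is a different type object from its parent.
child_is_parent = type(a) is Parent
```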

Why do we need str type? Why not just byte-strings?

Python3 has unicode strings (str) and bytes. We already have bytestring literals and methods. Why do we need two different types, instead of just byte strings of various encodings?
The answer to your question depends on the meaning of the word "need."
We certainly don't need the str type in the sense that everything we can compute with the type we can also compute without it (as you know quite well from your well-worded question).
But we can also understand "need" from the point of view of convenience. Isn't it nice to have a sqrt function? Or log or exp or sin? You could write these yourself, but why bother? A standard library designer will add functions that are useful and convenient.
It is the same for the language itself. Do we "need" a while loop? Not really, we can use tail-recursive functions. Do we "need" list comprehensions? Tons of things in Python are not primitive. For that matter, do we "need" high-level languages? John von Neumann himself once asked "why would you want more than machine language?"
It is the same with str and bytes. The type str, while not necessary, is a nice, time-saving, convenient thing to have. It gives us an interface as a sequence of characters, so that we can manipulate text character-by-character without:
us having to write all the encoding and decoding logic ourselves, or
bloating the string interface with multiple sets of iterators, like each_byte and each_char.
As you suspect, we could have one type which exposes the byte sequence and the character sequence (as Ruby's String class does). The Python designers wanted to separate those usages into two separate types. You can convert an object of one type into the other very easily. By having two types, they are saying that separation of concerns (and usages) is more important than having fewer built-in types. Ruby makes a different choice.
TL;DR It's a matter of preference in language design: separation of concerns by distinct type rather than by different methods on the same type.
Because bytes should not be considered strings, and strings should not be considered bytes. Python3 gets this right, no matter how jarring this feels to the brand new developer.
In Python 2.6, if I read data from a file, and I passed the "r" flag, the text would be read in the current locale by default, which would be a string, while passing the "rb" flag would create a series of bytes. Indexing the data is entirely different, and methods that take a str may be unsure of whether I am using bytes or a str. This gets worse since for ASCII data the two are often synonymous, meaning that code which works in simple test cases or English locales will fail upon encountering non-ASCII characters.
There was therefore a conscious effort to ensure bytes and strings were not identical: that one was a sequence of "dumb bytes", and the other was a Unicode string with the optimal encoding for the data to preserve O(1) indexing (ASCII, UCS-2, or UTF-32, depending on the data used, I believe).
In Python 2, the Unicode string was used to disambiguate text from "dumb bytes", however, str was treated as text by many users.
Or, to quote the Benevolent Dictator:
Python's current string objects are overloaded. They serve to hold both sequences of characters and sequences of bytes. This overloading of purpose leads to confusion and bugs. In future versions of Python, string objects will be used for holding character data. The bytes object will fulfil the role of a byte container. Eventually the unicode type will be renamed to str and the old str type will be removed.
tl;dr version
Forcing the separation of bytes and str forces coders to be conscious of their difference, to short-term dissatisfaction, but better code long-term. It's a conscious choice after years of experience: that forcing you to be conscious of the difference immediately will save you days in a debugger later.
Byte strings with different encodings are incompatible with each other, but until Python 3 there was nothing in the language to remind you of this fact. It turns out that mixing different character encodings is a surprisingly common problem in today's world, leading to far too many bugs.
Also it's often just easier to work with whole characters, without having to worry that you just modified a byte that accidentally rendered your 4-byte character into an invalid sequence.
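To make the incompatibility concrete, here is a small sketch in Python 3. The sample word and variable names are arbitrary choices for the example.

```python
text = "caf\u00e9"                     # 'café', 4 characters

utf8 = text.encode("utf-8")            # b'caf\xc3\xa9' (5 bytes)
latin1 = text.encode("latin-1")        # b'caf\xe9'     (4 bytes)

# Same text, different byte strings.
encodings_differ = utf8 != latin1

# Python 3 refuses to mix text and bytes silently.
try:
    text + utf8
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

# Decoding with the wrong codec does not fail here; it yields wrong text.
mojibake = utf8.decode("latin-1")      # 'cafÃ©'
```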
There are at least two reasons:
the str type has an important property "one element = one character".
the str type does not depend on encoding.
Just imagine how you would implement a simple operation like reversing a string (rword = word[::-1]) if word were a bytestring with some encoding.
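As a rough illustration of the reversal problem (the sample word is arbitrary, and UTF-8 is assumed as the encoding):

```python
word = "na\u00efve"                    # 'naïve', with a precomposed ï

# Reversing the str reverses characters.
rword = word[::-1]                     # 'evïan'

# Reversing the encoded bytes also swaps the two bytes that make up 'ï',
# which leaves an invalid UTF-8 sequence.
encoded = word.encode("utf-8")
reversed_bytes = encoded[::-1]
try:
    reversed_bytes.decode("utf-8")
    bytes_reversal_valid = True
except UnicodeDecodeError:
    bytes_reversal_valid = False
```

Note that even str reversal is only per code point; text built from combining characters would need more care.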

Use cases for bytearray

Can you point out a scenario in which Python's bytearray is useful? Is it simply a non-unicode string that supports list methods which can be easily achieved anyway with str objects?
I understand some think of it as "the mutable string". However, when would I need such a thing? And: unlike strings or lists, I am strictly limited to ascii, so in any case I'd prefer the others, is that not true?
Bach,
bytearray is a mutable block of memory. To that end, you can push arbitrary bytes into it and manage your memory allocations carefully. While many Python programmers never worry about memory use, there are those of us that do. I manage buffer allocations in high load network operations. There are numerous applications of a mutable array -- too many to count.
It isn't correct to think of the bytearray as a mutable string. It isn't. It is an array of bytes. Strings have a specific structure. bytearrays do not. You can interpret the bytes however you wish. A string is one way. Be careful though, there are many corner cases when using Unicode strings. ASCII strings are comparatively trivial. Modern code runs across borders. Hence, Python's rather complete Unicode-based string classes.
A bytearray exposes its memory through the buffer protocol. This very rich protocol is the foundation of Python's interoperation with C and enables numpy and other direct-access memory technologies. In particular, it allows memory to be easily structured into multi-dimensional arrays in either C or FORTRAN order.
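A minimal sketch of what "exposing memory" means in practice, using only the standard library. The 8-byte layout (a 4-byte tag followed by a little-endian 32-bit integer) is an assumption made up for this example, not a real protocol.

```python
import struct

buf = bytearray(8)                     # one preallocated, mutable block of memory
view = memoryview(buf)                 # a zero-copy window onto that block

# Writing through the view changes the underlying bytearray.
view[0:4] = b"\xde\xad\xbe\xef"

# struct can pack values straight into the buffer at a given offset.
struct.pack_into("<I", buf, 4, 1)

# Copy out just the tag when an immutable snapshot is needed.
tag = bytes(view[0:4])
```

Functions that accept writable buffers can fill such a block directly, which is how a long-lived buffer avoids a fresh allocation per read.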
You may never have to use a bytearray.
Anon, Andrew
deque seems to need much more space than bytearray.
>>> import collections, sys
>>> sys.getsizeof(collections.deque(range(256)))
1336
>>> sys.getsizeof(bytearray(range(256)))
293
I guess this is because of the layout.
If you need samples using bytearray I suggest searching the online code with nullege.
bytearray also has one more advantage: you do not need to import anything for that. This means that people will use it - whether it makes sense or not.
Further reading about bytearray: the bytes type in python 2.7 and PEP-358

Why can tuples contain mutable items?

If a tuple is immutable then why can it contain mutable items?
It is seemingly a contradiction that when a mutable item such as a list does get modified, the tuple it belongs to maintains being immutable.
That's an excellent question.
The key insight is that tuples have no way of knowing whether the objects inside them are mutable. The only thing that makes an object mutable is to have a method that alters its data. In general, there is no way to detect this.
Another insight is that Python's containers don't actually contain anything. Instead, they keep references to other objects. Likewise, Python's variables aren't like variables in compiled languages; instead the variable names are just keys in a namespace dictionary where they are associated with a corresponding object. Ned Batchelder explains this nicely in his blog post. Either way, objects only know their reference count; they don't know what those references are (variables, containers, or the Python internals).
Together, these two insights explain your mystery (why an immutable tuple "containing" a list seems to change when the underlying list changes). In fact, the tuple did not change (it still has the same references to other objects that it did before). The tuple could not change (because it did not have mutating methods). When the list changed, the tuple didn't get notified of the change (the list doesn't know whether it is referred to by a variable, a tuple, or another list).
While we're on the topic, here are a few other thoughts to help complete your mental model of what tuples are, how they work, and their intended use:
Tuples are characterized less by their immutability and more by their intended purpose.
Tuples are Python's way of collecting heterogeneous pieces of information under one roof. For example,
s = ('www.python.org', 80)
brings together a string and a number so that the host/port pair can be passed around as a socket, a composite object. Viewed in that light, it is perfectly reasonable to have mutable components.
Immutability goes hand-in-hand with another property, hashability. But hashability isn't an absolute property. If one of the tuple's components isn't hashable, then the overall tuple isn't hashable either. For example, t = ('red', [10, 20, 30]) isn't hashable.
The last example shows a 2-tuple that contains a string and a list. The tuple itself isn't mutable (i.e. it doesn't have any methods for changing its contents). Likewise, the string is immutable because strings don't have any mutating methods. The list object does have mutating methods, so it can be changed. This shows that mutability is a property of an object type -- some objects have mutating methods and some don't. This doesn't change just because the objects are nested.
Remember two things. First, immutability is not magic -- it is merely the absence of mutating methods. Second, objects don't know what variables or containers refer to them -- they only know the reference count.
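The points above can be tried out with the same 2-tuple; the helper variable names are only for illustration.

```python
t = ('red', [10, 20, 30])

# The tuple has no way to rebind one of its slots.
try:
    t[1] = [1, 2, 3]
    rebinding_allowed = True
except TypeError:
    rebinding_allowed = False

# The list it refers to can still be mutated; the reference is unchanged.
inner_id_before = id(t[1])
t[1].append(40)
same_object = id(t[1]) == inner_id_before

# Hashing the tuple hashes its items, so the list makes it unhashable.
try:
    hash(t)
    hashable = True
except TypeError:
    hashable = False

# A tuple whose items are all hashable can be hashed and used as a dict key.
lookup = {('red', (10, 20, 30)): "ok"}
```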
Hope this was useful to you :-)
That's because tuples don't contain lists, strings or numbers. They contain references to other objects. [1] The inability to change the sequence of references a tuple contains doesn't mean that you can't mutate the objects associated with those references. [2]
[1] Objects, values and types (see: second to last paragraph)
[2] The standard type hierarchy (see: "Immutable sequences")
As I understand it, this question needs to be rephrased as a question about design decisions: Why did the designers of Python choose to create an immutable sequence type that can contain mutable objects?
To answer this question, we have to think about the purpose tuples serve: they serve as fast, general-purpose sequences. With that in mind, it becomes quite obvious why tuples are immutable but can contain mutable objects. To wit:
Tuples are fast and memory efficient: Tuples are faster to create than lists because they are immutable. Immutability means that tuples can be created as constants and loaded as such, using constant folding. It also means they're faster and more memory efficient to create because there's no need for overallocation, etc. They're a bit slower than lists for random item access, but faster again for unpacking (at least on my machine). If tuples were mutable, then they wouldn't be as fast for purposes such as these.
Tuples are general-purpose: Tuples need to be able to contain any kind of object. They're used to (quickly) do things like variable-length argument lists (via the * operator in function definitions). If tuples couldn't hold mutable objects, they would be useless for things like this. Python would have to use lists, which would probably slow things down, and would certainly be less memory efficient.
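A small sketch of the variable-length argument case mentioned above; the function name collect and the variable names are hypothetical.

```python
def collect(*args):
    # Python packs the extra positional arguments into a tuple.
    return args


log = []                               # a mutable object passed as an argument
packed = collect("event", log)

# The tuple holds a reference to the caller's list, not a copy,
# so mutating through the tuple is visible to the caller.
packed[1].append("started")
```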
So you see, in order to fulfill their purpose, tuples must be immutable, but also must be able to contain mutable objects. If the designers of Python wanted to create an immutable object that guarantees that all the objects it "contains" are also immutable, they would have to create a third sequence type. The gain is not worth the extra complexity.
First of all, the word "immutable" can mean many different things to different people. I particularly like how Eric Lippert categorized immutability in his blog post [archive 2012-03-12]. There, he lists these kinds of immutability:
Realio-trulio immutability
Write-once immutability
Popsicle immutability
Shallow vs deep immutability
Immutable facades
Observational immutability
These can be combined in various ways to make even more kinds of immutability, and I'm sure more exist. The kind of immutability you seem interested in is deep (also known as transitive) immutability, in which immutable objects can only contain other immutable objects.
The key point of this is that deep immutability is only one of many, many kinds of immutability. You can adopt whichever kind you prefer, as long as you are aware that your notion of "immutable" probably differs from someone else's notion of "immutable".
You cannot change the id of its items. So it will always contain the same items.
$ python
>>> t = (1, [2, 3])
>>> id(t[1])
12371368
>>> t[1].append(4)
>>> id(t[1])
12371368
I'll go out on a limb here and say that the relevant part here is that while you can change the contents of a list, or the state of an object, contained within a tuple, what you can't change is that the object or list is there. If you had something that depended on thing[3] being a list, even if empty, then I could see this being useful.
One reason is that there is no general way in Python to convert a mutable type into an immutable one (see the rejected PEP 351, and the linked discussion for why it was rejected). Thus, it would be impossible to put various types of objects in tuples if it had this restriction, including just about any user-created non-hashable object.
The only reason that dictionaries and sets have this restriction is that they require the objects to be hashable, since they are internally implemented as hash tables. But note that, ironically, dictionaries and sets themselves are not immutable (or hashable). Tuples do not use an object's hash, so its mutability does not matter.
A tuple is immutable in the sense that the tuple itself can not expand or shrink, not that all the items contained themselves are immutable. Otherwise tuples are dull.

What is the theory behind mutable and immutable types?

One of the things that I admire about Python is its distinction between mutable and immutable types. Having spent a while programming in C before coming to Python, I was astonished at how easily Python does away with all the complexities of pointer dereferencing that drive me mad in C. In Python everything just works the way I expect, and I quickly realized that the mutable/immutable distinction plays an important part in that.
There are still a few wrinkles, of course (mutable function argument defaults being a notable example) but overall, I feel that the mutable/immutable distinction greatly clarifies the question of what variables and their values are and how they ought to behave.
But where does it come from? I have to assume that GvR was not the first person to conceive of this distinction, and that Python was not the first language to use it. I'm interested in hearing about earlier languages that used this concept, as well as any early theoretical discussions of it.
If you like the idea of immutability, you should check out a pure functional language. In Haskell all (pure) variables are immutable (yet still referred to as "variables", but there you go). It is a great idea - you and the compiler both know that passing something into a function cannot change it in any way.
In C there is no explicit concept of immutability because the language is based on copy semantics. In Python, by contrast, values are always passed by reference, and immutability plays an important role in keeping the language manageable.
In Python you don't have pointers because everything is indeed a pointer! Imagine what it could mean for a Python program that even number objects could change value over time... you would be forced to play tricks like
class Shape:
    def __init__(self, points):
        self.points = points[:]  # make a copy of the list
not only for lists and dicts, but also for numeric values and strings.
So basically in Python some types have immutable instances because they normally are used as "values" and you don't care about identity. In the rare cases in which you need for example a mutable object with a numeric value you need to explicitly wrap it up in say a class instance.
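Here is a sketch of both situations: the defensive copy from the Shape example, and a hand-rolled mutable wrapper around a number. The Box class is a hypothetical name invented for this illustration.

```python
class Shape:
    def __init__(self, points):
        self.points = points[:]        # copy, so later changes by the caller don't leak in


pts = [(0, 0), (1, 0)]
shape = Shape(pts)
pts.append((1, 1))                     # the caller keeps mutating its own list
shape_point_count = len(shape.points)  # still 2, thanks to the copy


class Box:
    """Hypothetical wrapper giving a number a mutable identity."""

    def __init__(self, value):
        self.value = value


counter = Box(0)
alias = counter                        # two names, one object
alias.value += 1                       # visible through both names

n = 0
m = n
m += 1                                 # ints are immutable: m is rebound, n is untouched
```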
In other languages substantially based on reference semantics, like LISP, immutability is a choice left to the programmer and not a constraint, even if many functions and idioms support it (a large share of LISP standard library functions are non-destructive, and it's indeed sort of a shame that destructive ones aren't always clearly distinguishable by name).
So, for example, strings are mutable in LISP, but probably not many LISP programs actually modify strings in place, because that would mean giving up the nice possibility of sharing and would require explicitly copying strings in many places (in most cases strings are just values, not objects whose identity you care about). Are strings immutable in LISP? No. Are programs mutating them? Almost never.
Not leaving a choice to the programmer is in the spirit of the Python language. It feels great when the choices you are forced into are in line with your ideas... fewer decisions, and things are exactly as you wanted them to be.
There are however in my opinion two different dangers with this approach:
Problems arise when those pre-made choices are NOT in line with what you would like or need to do. Python is a wonderful language, but with a different dictator it could become pure hell.
Understanding and making choices forces you to think and expands your mind. Not making choices instead just puts a cage on your mind, and after a while you may not even realize that what you are using is just ONE possibility, not the only possibility. After a while you may begin to think that "things MUST be done this way": you don't even feel you are forced, because your mind has already been "mutilated".
Objective C is loaded with mutable/immutable distinctions (to the point where there are both NSString and NSMutableString, for example); it predates Python by about 8 years. Smalltalk, from which Objective C inherited much of its OO design, uses the concept to a lesser extent (notably, strings are not immutable; the trend these days is towards immutable strings as in Python, Ruby, etc.).
I'd recommend reading the following entries from Wikipedia:
Immutable object
Functional Programming
