What is Python's sequence protocol?

Python does a lot with magic methods and most of these are part of some protocol. I am familiar with the "iterator protocol" and the "number protocol" but recently stumbled over the term "sequence protocol". But even after some research I'm not exactly sure what the "sequence protocol" is.
For example the C API function PySequence_Check checks (according to the documentation) if some object implements the "sequence protocol". The source code indicates that this is a class that's not a dict but implements a __getitem__ method which is roughly identical to what the documentation on iter also states:
[...]must support the sequence protocol (the __getitem__() method with integer arguments starting at 0).[...]
But the requirement to start with 0 isn't something that's "implemented" in PySequence_Check.
Then there is also the collections.abc.Sequence type, which basically says the instance has to implement __reversed__, __contains__, __iter__ and __len__.
But by that definition a class implementing the "sequence protocol" isn't necessarily a Sequence, for example the "data model" and the abstract class guarantee that a sequence has a length. But a class just implementing __getitem__ (passing the PySequence_Check) throws an exception when using len(an_instance_of_that_class).
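That gap is easy to demonstrate in pure Python. Here is a minimal sketch with a hypothetical class (`Squares` is made up for illustration) that implements only `__getitem__`: `iter()` accepts it via the old-style sequence protocol, but `len()` fails:

```python
class Squares:
    # Hypothetical example: only __getitem__, integer indexes starting at 0
    def __getitem__(self, index):
        if index >= 5:
            raise IndexError(index)
        return index ** 2

s = Squares()
print(list(iter(s)))        # [0, 1, 4, 9, 16] -- iter() falls back to __getitem__
try:
    len(s)
except TypeError as exc:
    print("no length:", exc)  # len() fails: the class has no __len__
```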
Could someone please clarify for me the difference between a sequence and the sequence protocol (if there's a definition for the protocol besides reading the source code) and when to use which definition?

It's not really consistent.
Here's PySequence_Check:
int
PySequence_Check(PyObject *s)
{
    if (PyDict_Check(s))
        return 0;
    return s != NULL && s->ob_type->tp_as_sequence &&
           s->ob_type->tp_as_sequence->sq_item != NULL;
}
PySequence_Check checks if an object provides the C sequence protocol, implemented through a tp_as_sequence member in the PyTypeObject representing the object's type. This tp_as_sequence member is a pointer to a struct containing a bunch of functions for sequence behavior, such as sq_item for item retrieval by numeric index and sq_ass_item for item assignment.
Specifically, PySequence_Check requires that its argument is not a dict, and that it provides sq_item.
Types with a __getitem__ written in Python will provide sq_item regardless of whether they're conceptually sequences or mappings, so a mapping written in Python that doesn't inherit from dict will pass PySequence_Check.
On the other hand, collections.abc.Sequence only checks whether an object concretely inherits from collections.abc.Sequence or whether its class (or a superclass) is explicitly registered with collections.abc.Sequence. If you just implement a sequence yourself without doing either of those things, it won't pass isinstance(your_sequence, Sequence). Also, most classes registered with collections.abc.Sequence don't support all of collections.abc.Sequence's methods. Overall, collections.abc.Sequence is a lot less reliable than people commonly expect it to be.
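You can see this from the Python side with a quick sketch (the `MySeq` class here is hypothetical): a hand-rolled sequence fails the `isinstance` check until it is explicitly registered.

```python
from collections.abc import Sequence

class MySeq:
    # Hypothetical class that walks and quacks like a sequence
    def __init__(self, data):
        self._data = list(data)
    def __len__(self):
        return len(self._data)
    def __getitem__(self, index):
        return self._data[index]

print(isinstance(MySeq("abc"), Sequence))  # False: neither inherited nor registered
Sequence.register(MySeq)
print(isinstance(MySeq("abc"), Sequence))  # True, after explicit registration
```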
As for what counts as a sequence in practice, it's usually anything that supports __len__ and __getitem__ with integer indexes starting at 0 and isn't a mapping. If the docs for a function say it takes any sequence, that's almost always all it needs. Unfortunately, "isn't a mapping" is hard to test for, for reasons similar to how "is a sequence" is hard to pin down.

For a type to conform to the sequence protocol, these four conditions must be met:

1. Retrieve elements by index: item = seq[index]
2. Find items by value: index = seq.index(item)
3. Count items: num = seq.count(item)
4. Produce a reversed sequence: r = reversed(seq)
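If you inherit from collections.abc.Sequence and supply just __len__ and __getitem__, the ABC fills in the other operations as mixin methods. A minimal sketch (the `Ring` class is hypothetical):

```python
from collections.abc import Sequence

class Ring(Sequence):
    # Hypothetical read-only wrapper around a list
    def __init__(self, items):
        self._items = list(items)
    def __len__(self):
        return len(self._items)
    def __getitem__(self, index):
        return self._items[index]

r = Ring("abca")
print(r[1])               # 'b' -- our own __getitem__
print(r.index("c"))       # 2   -- mixin method from Sequence
print(r.count("a"))       # 2   -- mixin method from Sequence
print(list(reversed(r)))  # ['a', 'c', 'b', 'a'] -- mixin __reversed__
```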


Difference between collections.abc.Sequence and typing.Sequence [duplicate]

I was reading an article about collections.abc and the typing classes in the Python standard library and discovered that both appear to have the same features.
I tried both options using the code below and got the same results:
from collections.abc import Sequence

def average(sequence: Sequence):
    return sum(sequence) / len(sequence)

print(average([1, 2, 3, 4, 5]))  # result is 3.0

from typing import Sequence

def average(sequence: Sequence):
    return sum(sequence) / len(sequence)

print(average([1, 2, 3, 4, 5]))  # result is 3.0
Under what conditions is collections.abc the better option over typing? Are there benefits of using one over the other?
Good on you for using type annotations! As the documentation says, if you are on Python 3.9+, you should most likely never use typing.Sequence due to its deprecation. Since the introduction of generic alias types in 3.9, the collections.abc classes all support subscripting and should be recognized correctly by static type checkers of all flavors.
So the benefit of using collections.abc.T over typing.T is mainly that the latter is deprecated and should not be used.
As mentioned by jsbueno in his answer, annotations will never have runtime implications either way, unless of course they are explicitly picked up by a piece of code. They are just an essential part of good coding style. But your function would still work, i.e. your script would execute without error, even if you annotated your function with something absurd like def average(sequence: 4%3): ....
Proper annotations are still extremely valuable. Thus, I would recommend you get used to some of the best practices as soon as possible. (A more-or-less strict static type checker like mypy is very helpful for that.) For one thing, when you are using generic types like Sequence, you should always provide the appropriate type arguments. Those may be type variables, if your function is also generic or they may be concrete types, but you should always include them.
In your case, assuming you expect the contents of your sequence to be something that can be added with the same type and divided by an integer, you might want to e.g. annotate it as Sequence[float]. (In the Python type system, float is considered a supertype of int, even though there is no nominal inheritance.)
Another recommendation is to try and be as broad as possible in the parameter types. (This echoes the Python paradigm of dynamic typing.) The idea is that you just specify that the object you expect must be able to "quack", but you don't say it must be a duck.
In your example, since you are reliant on the argument being compatible with sum as well as with len, you should consider what types those functions expect. The len function is simple, since it basically just calls the __len__ method of the object you pass to it. The sum function is more nuanced, but in your case the relevant part is that it expects an iterable of elements that can be added (e.g. float).
If you take a look at the collections ABCs, you'll notice that Sequence actually offers much more than you need, being that it is a reversible collection. A Collection is the broadest built-in type that fulfills your requirements because it has __iter__ (from Iterable) and __len__ (from Sized). So you could do this instead:
from collections.abc import Collection

def average(numbers: Collection[float]) -> float:
    return sum(numbers) / len(numbers)
(By the way, the parameter name should not reflect its type.)
Lastly, if you wanted to go all out and be as broad as possible, you could define your own protocol that is even broader than Collection (by getting rid of the Container inheritance):
from collections.abc import Iterable, Sized
from typing import Protocol, TypeVar

T = TypeVar("T", covariant=True)

class SizedIterable(Sized, Iterable[T], Protocol[T]):
    ...

def average(numbers: SizedIterable[float]) -> float:
    return sum(numbers) / len(numbers)
This has the advantage of supporting very broad structural subtyping, but is most likely overkill.
(For the basics of Python typing, PEP 483 and PEP 484 are a must-read.)
Actually, in your code you need neither of those:
Typing with annotations, which is what you are doing with your imported Sequence class, is an optional feature, meant for (1) quick documentation and (2) checking of the code before it is run, by static code analysers such as Mypy.
The fact is that some IDEs use the result of static checking by default in their recommended configurations, and they can make it look like code without annotations is "faulty": it is not - this is an optional feature.
As long as the object you pass into your function respects the parts of the Sequence interface that the function actually uses, it will work (as written, it needs __len__ and __getitem__).
Just run your code without annotations and see it work:
def average(myvariable):
    return sum(myvariable) / len(myvariable)
That said, here is what is happening: list is "the sequence" par excellence in Python, and implements everything a sequence needs.
typing.Sequence is just an indicator for static-checker tools that the data marked with it should respect the Sequence protocol, and does nothing at run time. You can't instantiate it. You can inherit from it (probably), but just to specialize other markers for typing, not for anything that will have any effect during actual program execution.
On the other hand, collections.abc.Sequence predates the optional typing recommendations in PEP 484. It works as a "virtual superclass" which can indicate, at runtime (through the use of isinstance), everything that works as a sequence (*). AND it can be used as a solid base class to implement fully functional custom Sequence classes of your own: just inherit from collections.abc.Sequence and implement functional __getitem__ and __len__ methods as indicated in the docs here: https://docs.python.org/3/library/collections.abc.html (that is for read-only sequences - for mutable sequences, check collections.abc.MutableSequence, of course).
(*) For your custom sequence implementation to be recognized as a Sequence proper, it has to be "registered" at runtime with a call to collections.abc.Sequence.register. However, AFAIK, most tools for static type checking do not recognize this and will error in their static analysis.

Can some operators in Python not be overloaded properly?

I am studying Scott Meyers' More Effective C++. Item 7 advises to never overload && and ||, because their short-circuit behavior cannot be replicated when the operators are turned into function calls (or is this no longer the case?).
As operators can also be overloaded in Python, I am curious whether this situation exists there as well. Is there any operator in Python (2.x, 3.x) that, when overridden, cannot be given its original meaning?
Here is an example of 'original meaning'
class MyInt {
public:
    int val;
    MyInt(int v) : val(v) {}
    MyInt operator+(const MyInt &m) {
        return MyInt(this->val + m.val);
    }
};
Exactly the same rationale applies to Python. You shouldn't (and can't) overload and and or, because their short-circuiting behavior cannot be expressed in terms of functions. not isn't permitted either - I guess this is because there's no guarantee that it will be invoked at all.
As pointed out in the comments, the proposal to allow the overloading of logical and and or was officially rejected.
The assignment operator can also not be overloaded.
class Thing: ...
thing = Thing()
thing = 'something else'
There is nothing you can override in Thing to change the behavior of the = operator.
(You can overload property assignment though.)
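A minimal sketch of intercepting attribute assignment via __setattr__ (the `Logged` class is hypothetical):

```python
class Logged:
    # Hypothetical class that logs every attribute assignment
    def __setattr__(self, name, value):
        print(f"assigning {name} = {value!r}")
        super().__setattr__(name, value)  # perform the actual assignment

obj = Logged()
obj.x = 1      # prints: assigning x = 1
print(obj.x)   # 1
```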
In Python, all object methods that represent operators are treated "equal": their precedences are described in the language model, and there is no conflict with overriding any.
But both C++'s "&&" and "||" - "and" and "or" in Python - are not available as object methods to start with. They check the objects' truthiness instead, which is defined by __bool__. If __bool__ is not implemented, Python checks for a __len__ method and checks whether its output is zero, in which case the object's truth value is False. In all other cases its truth value is True. That does away with any semantic problems that would arise from combining overloading with the short-circuiting behavior.
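In other words, and and or never call an operator method on your object; they only ask for its truth value and then return one of the operands unchanged. A sketch with a hypothetical container class:

```python
class Box:
    # Hypothetical container whose truthiness tracks its contents
    def __init__(self, items):
        self.items = items
    def __bool__(self):
        return bool(self.items)

print(Box([]) or "fallback")    # 'fallback': empty Box is falsy, `or` returns the right operand
print(Box([1]) and "present")   # 'present': non-empty Box is truthy, `and` returns the right operand
```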
Note one can override & and | by implementing __and__ and __or__ with no problems.
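For example, a sketch of overloading the bitwise operators on a hypothetical bit-flag wrapper:

```python
class Flags:
    # Hypothetical bit-flag wrapper overloading & and |
    def __init__(self, bits):
        self.bits = bits
    def __and__(self, other):
        return Flags(self.bits & other.bits)
    def __or__(self, other):
        return Flags(self.bits | other.bits)

a, b = Flags(0b1100), Flags(0b1010)
print(bin((a & b).bits))  # 0b1000
print(bin((a | b).bits))  # 0b1110
```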
As for the other operators, although not directly related, one should just take care with __getattribute__ - the method called when retrieving any attribute from an object (we normally don't think of it as an operator) - including calls from within itself. __getattr__ also exists, and is invoked only at the end of the attribute search chain, when an attribute is not found.

How to write __getitem__ cleanly?

In Python, when implementing a sequence type, I often (relatively speaking) find myself writing code like this:
class FooSequence(collections.abc.Sequence):
    # Snip other methods
    def __getitem__(self, key):
        if isinstance(key, int):
            # Get a single item
        elif isinstance(key, slice):
            # Get a whole slice
        else:
            raise TypeError('Index must be int, not {}'.format(type(key).__name__))
The code checks the type of its argument explicitly with isinstance(). This is regarded as an antipattern within the Python community. How do I avoid it?
I cannot use functools.singledispatch, because that's quite deliberately incompatible with methods (it will attempt to dispatch on self, which is entirely useless since we're already dispatching on self via OOP polymorphism). It works with @staticmethod, but what if I need to get stuff out of self?
Casting to int() and then catching the TypeError, checking for a slice, and possibly re-raising is still ugly, though perhaps slightly less so.
It might be cleaner to convert integers into one-element slices and handle both situations with the same code, but that has its own problems (return 0 or [0]?).
As much as it seems odd, I suspect that the way you have it is the best way to go about things. Patterns generally exist to encompass common use cases, but that doesn't mean that they should be taken as gospel when following them makes life more difficult. The main reason that PEP 443 gives for balking at explicit typechecking is that it is "brittle and closed to extension". However, that mainly applies to custom functions that take a number of different types at any time. From the Python docs on __getitem__:
For sequence types, the accepted keys should be integers and slice objects. Note that the special interpretation of negative indexes (if the class wishes to emulate a sequence type) is up to the __getitem__() method. If key is of an inappropriate type, TypeError may be raised; if of a value outside the set of indexes for the sequence (after any special interpretation of negative values), IndexError should be raised. For mapping types, if key is missing (not in the container), KeyError should be raised.
The Python documentation explicitly states the two types that should be accepted, and what to do if an item that is not of those two types is provided. Given that the types are provided by the documentation itself, it's unlikely to change (doing so would break far more implementations than just yours), so it's likely not worth the trouble to go out of your way to code against Python itself potentially changing.
If you're set on avoiding explicit typechecking, I would point you toward this SO answer. It contains a concise implementation of a @methdispatch decorator (not my name, but I'll roll with it) that lets @singledispatch work with methods by forcing it to check args[1] (arg) rather than args[0] (self). Using that should allow you to use custom single dispatch with your __getitem__ method.
Whether or not you consider either of these "pythonic" is up to you, but remember that while The Zen of Python notes that "Special cases aren't special enough to break the rules", it then immediately notes that "practicality beats purity". In this case, just checking for the two types that the documentation explicitly states are the only things __getitem__ should support seems like the practical way to me.
The antipattern is for code to do explicit type checking, which means using the type() function. Why? Because then a subclass of the target type will no longer work. For instance, __getitem__ can use an int, but using type() to check for it means an int-subclass, which would work, will fail only because type() does not return int.
When a type-check is necessary, isinstance is the appropriate way to do it as it does not exclude subclasses.
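The difference is easy to demonstrate with bool, which is a subclass of int:

```python
x = True  # bool is a subclass of int

print(type(x) is int)      # False: type() reports the exact class, bool
print(isinstance(x, int))  # True: isinstance() accepts subclasses too
```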
When writing __dunder__ methods, type checking is necessary and expected -- using isinstance().
In other words, your code is perfectly Pythonic, and its only problem is the error message (it doesn't mention slices).
I'm not aware of a way to avoid doing it once. That's just the tradeoff of using a dynamically-typed language in this way. However, that doesn't mean you have to do it over and over again. I would solve it once by creating an abstract class with split out method names, then inherit from that class instead of directly from Sequence, like:
class UnannoyingSequence(collections.abc.Sequence):
    def __getitem__(self, key):
        if isinstance(key, int):
            return self.getitem(key)
        elif isinstance(key, slice):
            return self.getslice(key)
        else:
            raise TypeError('Index must be int, not {}'.format(type(key).__name__))

    # default implementation in terms of getitem
    def getslice(self, key):
        # Get a whole slice

class FooSequence(UnannoyingSequence):
    def getitem(self, key):
        # Get a single item

    # optional efficient, type-specific implementation not in terms of getitem
    def getslice(self, key):
        # Get a whole slice
This cleans up FooSequence enough that I might even do it this way if I only had the one derived class. I'm sort of surprised the standard library doesn't already work that way.
To stay pythonic, you have to work with the semantics rather than the type of the objects. So if you have some parameter as an accessor to a sequence, just use it like that. Use the abstraction for a parameter as long as possible. If you expect a set of user identifiers, do not expect a set, but rather some data structure with an add method. If you expect some text, do not expect a unicode object, but rather some container for characters featuring encode and decode methods.
I assume in general you want to do something like "use the behavior of the base implementation unless some special value is provided". If you want to implement __getitem__, you can use a case distinction where something different happens if the special value is provided. I'd use the following pattern:
class FooSequence(collections.abc.Sequence):
    # Snip other methods
    def __getitem__(self, key):
        try:
            if key == SPECIAL_VALUE:
                return SOMETHING_SPECIAL
            else:
                return self.our_baseclass_instance[key]
        except AttributeError:
            raise TypeError('Wrong type: {}'.format(type(key).__name__))
If you want to distinguish between a single value (in perl terminology "scalar") and a sequence (in Java terminology "collection"), then it is pythonically fine to determine whether an iterator is implemented. You can either use a try-catch pattern or hasattr as I do now:
>>> a = 42
>>> b = [1, 3, 5, 7]
>>> c = slice(1, 42)
>>> hasattr(a, "__iter__")
False
>>> hasattr(b, "__iter__")
True
>>> hasattr(c, "__iter__")
False
Applied to our example:
class FooSequence(collections.abc.Sequence):
    # Snip other methods
    def __getitem__(self, key):
        try:
            if hasattr(key, "__iter__"):
                return map(lambda x: WHATEVER(x), key)
            else:
                return self.our_baseclass_instance[key]
        except AttributeError:
            raise TypeError('Wrong type: {}'.format(type(key).__name__))
Dynamic programming languages like Python and Ruby use duck typing. And a duck is an animal that walks like a duck, swims like a duck, and quacks like a duck - not because somebody calls it a "duck".

Python C-API: Using `PySequence_Length` with dictionaries

I'm trying to use PySequence_Length to get the length of a Python dictionary in C. I realize I can use PyDict_Size, but I'm interested in using a more generic function in certain contexts.
PyObject* d = PyDict_New();
Py_ssize_t res = PySequence_Length(d);
printf("Result : %ld\n", res);
if (res == -1) PyErr_Print();
This fails, and prints the error:
TypeError: object of type 'dict' has no len()
My question is: why does this fail? Although Python dictionary objects don't support the Sequence protocol, the documentation for PySequence_Length says:
Py_ssize_t PySequence_Length(PyObject *o)
Returns the number of objects in sequence o on success, and -1 on
failure. For objects that do not provide sequence protocol, this is
equivalent to the Python expression len(o).
Since a dictionary type does have a __len__ attribute, and since the expression len(d) (where d is a dictionary) properly returns the length in Python, I don't understand why PySequence_Length should fail in C.
Any explanation? Is the documentation incorrect here?
The documentation is misleading, yes. A dict is not a sequence, even though it does implement some parts of the sequence protocol (for containment tests, which are part of the sequence protocol.) This particular distinction in the Python/C types API is unfortunate, but it's an artifact of a design choice made decades ago. The documentation reflects that distinction, albeit in an equally awkward way. What it tries to say is that for Python classes it's the same thing as len(o), regardless of what the Python class pretends to be. For C types, if the type does not implement the sequence version of the sizefunc, PySequence_Length() will raise an exception without even considering whether the type has the mapping version of the sizefunc.
If you are not entirely sure whether you have a sequence or not, you should use PyObject_Size() instead. In fact, there's very little reason to call PySequence_Length(); normally you either know the type (because you just created it, and you can call a type-specific length function or macro like PyList_GET_SIZE()) or you don't even know if it'll be a sequence.
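At the Python level, the generic protocol-agnostic size check is simply len() (which is what PyObject_Size() backs), so a dict poses no problem there. A quick sketch:

```python
import operator

d = {"a": 1, "b": 2}
print(len(d))                   # 2: len() works for mappings and sequences alike
print(operator.length_hint(d))  # 2: also generic; it tries len() first
```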

What is the difference between the __int__ and __index__ methods in Python 3?

The Data Model section of the Python 3.2 documentation provides the following descriptions for the __int__ and __index__ methods:
object.__int__(self)
Called to implement the built-in function int(). Should return an integer.
object.__index__(self)
Called to implement operator.index(). Also called whenever Python needs an integer object (such as in slicing, or in the built-in bin(), hex() and oct() functions). Must return an integer.
I understand that they're used for different purposes, but I've been unable to figure out why two different methods are necessary. What is the difference between these methods? Is it safe to just alias __index__ = __int__ in my classes?
See PEP 357: Allowing Any Object to be Used for Slicing.
The nb_int method is used for coercion and so means something fundamentally different than what is requested here. This PEP proposes a method for something that can already be thought of as an integer to communicate that information to Python when it needs an integer. The biggest example of why using nb_int would be a bad thing is that float objects already define the nb_int method, but float objects should not be used as indexes in a sequence.
Edit: It seems that it was implemented in Python 2.5.
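A short sketch of the distinction, using a hypothetical file-descriptor wrapper: defining __index__ makes an object usable anywhere a true integer is required, while float shows why __int__ alone must not qualify.

```python
class Handle:
    # Hypothetical wrapper that conceptually *is* an integer (a descriptor)
    def __init__(self, fd):
        self.fd = fd
    def __index__(self):
        return self.fd

h = Handle(2)
print("abcdef"[h])  # 'c': sequence indexing goes through __index__
print(hex(h))       # '0x2': so do bin(), hex() and oct()
# float defines __int__ (int(3.9) == 3) but deliberately not __index__,
# so "abcdef"[3.9] raises TypeError instead of silently truncating.
```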
I believe you'll find the answer in PEP 357, which has this abstract:
This PEP proposes adding an nb_index
slot in PyNumberMethods and an
__index__ special method so that arbitrary objects can be used
whenever integers are explicitly needed in Python, such as in slice
syntax (from which the slot gets its name).
