Python - "in" statement search slow for list of objects

I'm hoping someone can explain why searching a list of object references is so much slower than searching a normal list. This is using the python "in" keyword to search which I thought runs at "C compiler" speed. I thought a list is just an array of object references (pointers) so the search should be extremely fast. Both lists are exactly 412236 bytes in memory.
Normal list (takes 0.000 seconds to search):
alist = ['a' for x in range(100000)]
if 'b' in alist:
    print("Found")
List of object references (takes 0.469 !! seconds to search):
class Spam:
    pass
spamlist = [Spam() for x in range(100000)]
if Spam() in spamlist:
    print("Found")
Edit: So apparently this has something to do with old-style classes having way more overhead than new-style classes. My script that was bogging down with only 400 objects can now easily handle up to 10000 objects simply by making all my classes inherit from the "object" class. Just when I thought I knew Python!
I've read about new-style vs old-style before but it was never mentioned that old-style classes can be up to 100x slower than new style ones. What is the best way to search a list of object instances for a particular instance?
1. Keep using the "in" statement but make sure all classes are new style.
2. Perform some other type of search using the "is" statement like:
[obj for obj in spamlist if obj is target]
3. Some other more Pythonic way?

This is mostly due to the different special method lookup mechanics of old-style classes.
>>> timeit.timeit("Spam() in l", """
... # Old-style
... class Spam: pass
... l = [Spam() for i in xrange(100000)]""", number=10)
3.0454677856675403
>>> timeit.timeit("Spam() in l", """
... # New-style
... class Spam(object): pass
... l = [Spam() for i in xrange(100000)]""", number=10)
0.05137817007346257
>>> timeit.timeit("'a' in l", 'l = ["b" for i in xrange(100000)]', number=10)
0.03013876870841159
As you can see, the version where Spam inherits from object runs much faster, almost as fast as the case with strings.
The in operator for lists uses == to compare items for equality. == is defined to try the objects' __eq__ methods, their __cmp__ methods, and pointer comparison, in that order.
For old-style classes, this is implemented in a straightforward but slow manner. Python has to actually look for the __eq__ and __cmp__ methods in each instance's dict and the dicts of each instance's class and superclasses. __coerce__ gets looked up too, as part of the 3-way compare process. When none of these methods actually exist, that's something like 12 dict lookups just to get to the pointer comparison. There's a bunch of other overhead besides the dict lookups, and I'm not actually sure which aspects of the process are the most time-consuming, but suffice it to say that the procedure is more expensive than it could be.
For built-in types and new-style classes, things are better. First, Python doesn't look for special methods on the instance's dict. This saves some dict lookups and enables the next part. Second, type objects have C-level function pointers corresponding to the Python-level special methods. When a special method is implemented in C or doesn't exist, the corresponding function pointer allows Python to skip the method lookup procedure entirely. This means that in the new-style case, Python can quickly detect that it should skip straight to the pointer comparison.
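The first point above (special methods are not looked up on the instance) can be observed directly. This is a minimal sketch for Python 3, where every class is new-style; the names `Spam` and `always_equal` are only illustrative.

```python
class Spam(object):
    pass


def always_equal(other):
    """Hypothetical replacement comparison that claims everything is equal."""
    return True


a, b = Spam(), Spam()

# Attach __eq__ to the instance only; the class itself does not define it.
a.__eq__ = always_equal

explicit_call = a.__eq__(b)  # ordinary attribute lookup finds the instance attribute
operator_result = (a == b)   # == looks at type(a), so the instance attribute is ignored
in_list = a in [b]           # list containment uses the same type-level comparison

print(explicit_call, operator_result, in_list)
```

Because `==` and `in` never consult the instance dict here, they fall through to the identity comparison and report that `a` and `b` differ.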
As for what you should do, I'd recommend using in and new-style classes. If you find that this operation is becoming a bottleneck, but you need old-style classes for backward compatibility, any(x is y for y in l) runs about 20 times faster than x in l:
>>> timeit.timeit('x in l', '''
... class Foo: pass
... x = Foo(); l = [Foo()] * 100000''', number=10)
2.8618816054721936
>>> timeit.timeit('any(x is y for y in l)', '''
... class Foo: pass
... x = Foo(); l = [Foo()] * 100000''', number=10)
0.12331640524583776
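If you go the identity route, it may be worth wrapping it in a small helper. The sketch below is written for Python 3, and `contains_identity` and `Point` are names made up for the illustration. It also shows that `in` and an identity search can disagree once a class defines a value-based `__eq__`.

```python
def contains_identity(items, target):
    """Return True only if `target` itself (the same object) is in `items`."""
    return any(item is target for item in items)


class Point(object):
    """Illustrative class whose equality compares values, not identity."""

    def __init__(self, x):
        self.x = x

    def __eq__(self, other):
        return isinstance(other, Point) and self.x == other.x


points = [Point(1), Point(2)]
probe = Point(1)  # equal to points[0], but a different object

equal_hit = probe in points                         # equality-based search
identity_miss = contains_identity(points, probe)    # identity-based search
identity_hit = contains_identity(points, points[0])

print(equal_hit, identity_miss, identity_hit)
```

For classes that do not override `__eq__`, both searches give the same answer, so the helper only matters when you specifically want "this exact instance".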

This does not directly answer your question, but it is useful background for anyone who wants to understand how the 'in' keyword works under the hood:
ceval.c source code
abstract.c source code
A mailing-list thread about the 'in' keyword
Explanation from the mail thread:
I'm curious enough about this (OK, I admit it, I like to be right, too
;) to dig in to the details, if anyone is interested...one of the
benefits of Python being open-source is you can find out how it works...
First step, look at the bytecodes:
>>> import dis
>>> def f(x, y):
...     return x in y
...
>>> dis.dis(f)
  2           0 LOAD_FAST                0 (x)
              3 LOAD_FAST                1 (y)
              6 COMPARE_OP               6 (in)
              9 RETURN_VALUE
So in is implemented as a COMPARE_OP. Looking in ceval.c for
COMPARE_OP, it has some optimizations for a few fast compares, then
calls cmp_outcome() which, for 'in', calls PySequence_Contains().
PySequence_Contains() is implemented in abstract.c. If the container
implements __contains__, that is called, otherwise
_PySequence_IterSearch() is used.
_PySequence_IterSearch() calls PyObject_GetIter() to construct an
iterator on the sequence, then goes into an infinite loop (for (;;))
calling PyIter_Next() on the iterator until the item is found or the
call to PyIter_Next() returns an error.
PyObject_GetIter() is also in abstract.c. If the object has an
__iter__() method, that is called, otherwise PySeqIter_New() is called
to construct an iterator.
PySeqIter_New() is implemented in iterobject.c. Its next() method is in
iter_iternext(). This method calls __getitem__() on its wrapped object
and increments an index for next time.
So, though the details are complex, I think it is pretty fair to say
that the implementation uses a while loop (in _PySequence_IterSearch())
and a counter (wrapped in PySeqIter_Type) to implement 'in' on a
container that defines __getitem__ but not __iter__.
By the way the implementation of 'for' also calls PyObject_GetIter(), so
it uses the same mechanism to generate an iterator for a sequence that
defines __getitem__().
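The fallback described in the thread can be tried from Python itself. This is a small sketch, with `Squares` being a made-up class that defines only `__getitem__`, so both `in` and `for` have to go through the index-based sequence iterator.

```python
class Squares(object):
    """Hypothetical container: no __iter__, no __contains__, only __getitem__."""

    def __init__(self, n):
        self.n = n

    def __getitem__(self, index):
        if index < 0 or index >= self.n:
            # IndexError tells the sequence iterator that the items are exhausted.
            raise IndexError(index)
        return index * index


squares = Squares(5)

found = 9 in squares       # indexes 0, 1, 2, 3 are tried until 9 is produced
missing = 7 in squares     # all indexes are tried, then IndexError ends the search
collected = [value for value in squares]  # 'for' uses the same mechanism

print(found, missing, collected)
```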

Python creates one immutable 'a' object and each element in a list points to the same object. Since Spam() is mutable, each instance is a different object, and dereferencing the pointers in spamlist will access many areas in RAM. The performance difference may have something to do with hardware cache hits/misses.
Obviously the performance difference would be even greater if you are including list creation time in your results (instead of just Spam() in spamlist). Also try x = Spam(); x in spamlist to see if that makes a difference.
I am curious how any(imap(equalsFunc, spamlist)) compares.

Using the test of alist = ['a' for x in range(100000)] can be very misleading because of string interning. It turns out that Python will intern (in most cases) short immutables, especially strings, so that they are all the same object.
Demo:
>>> alist=['a' for x in range(100000)]
>>> len(alist)
100000
>>> len({id(x) for x in alist})
1
You can see that while a list of 100000 strings is created it is only comprised of one interned object.
A more fair case would be to use a call to object to guarantee that each is a unique Python object:
>>> olist=[object() for x in range(100000)]
>>> len(olist)
100000
>>> len({id(x) for x in olist})
100000
If you compare the in operator with olist you will find the timing to be similar.
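If you want to check this yourself, something along these lines should do. It is only a rough sketch for Python 3 (where all classes are new-style); the absolute numbers depend on the machine, so none are quoted here.

```python
import timeit

N = 100000
alist = ['a' for _ in range(N)]        # N references to one interned string
olist = [object() for _ in range(N)]   # N distinct objects

# Both searches fail, so every element of each list is compared.
str_time = timeit.timeit(lambda: 'b' in alist, number=10)
obj_time = timeit.timeit(lambda: object() in olist, number=10)

print("str list:    %.4f s" % str_time)
print("object list: %.4f s" % obj_time)
```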


Function and method calls in Python [duplicate]

I know that python has a len() function that is used to determine the size of a string, but I was wondering why it's not a method of the string object?
Strings do have a length method: __len__()
The protocol in Python is to implement this method on objects which have a length and use the built-in len() function, which calls it for you, similar to the way you would implement __iter__() and use the built-in iter() function (or have the method called behind the scenes for you) on objects which are iterable.
See Emulating container types for more information.
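As a minimal sketch of that protocol (the `Playlist` class is invented for the example), a user-defined class only has to implement the special methods, and the built-ins do the calling:

```python
class Playlist(object):
    """Hypothetical container that opts into len() and iter()."""

    def __init__(self, tracks):
        self._tracks = list(tracks)

    def __len__(self):
        # Called by the built-in len().
        return len(self._tracks)

    def __iter__(self):
        # Called by the built-in iter() and by for loops.
        return iter(self._tracks)


playlist = Playlist(["intro", "verse", "outro"])

size = len(playlist)
names = list(playlist)

print(size, names)
```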
Here's a good read on the subject of protocols in Python: Python and the Principle of Least Astonishment
Jim's answer to this question may help; I copy it here. Quoting Guido van Rossum:
First of all, I chose len(x) over x.len() for HCI reasons (def __len__() came much later). There are two intertwined reasons actually, both HCI:
(a) For some operations, prefix notation just reads better than postfix — prefix (and infix!) operations have a long tradition in mathematics which likes notations where the visuals help the mathematician thinking about a problem. Compare the easy with which we rewrite a formula like x*(a+b) into x*a + x*b to the clumsiness of doing the same thing using a raw OO notation.
(b) When I read code that says len(x) I know that it is asking for the length of something. This tells me two things: the result is an integer, and the argument is some kind of container. To the contrary, when I read x.len(), I have to already know that x is some kind of container implementing an interface or inheriting from a class that has a standard len(). Witness the confusion we occasionally have when a class that is not implementing a mapping has a get() or keys() method, or something that isn’t a file has a write() method.
Saying the same thing in another way, I see ‘len‘ as a built-in operation. I’d hate to lose that. /…/
Python is a pragmatic programming language, and the reasons for len() being a function and not a method of str, list, dict etc. are pragmatic.
The len() built-in function deals directly with built-in types: the CPython implementation of len() actually returns the value of the ob_size field in the PyVarObject C struct that represents any variable-sized built-in object in memory. This is much faster than calling a method -- no attribute lookup needs to happen. Getting the number of items in a collection is a common operation and must work efficiently for such basic and diverse types as str, list, array.array etc.
However, to promote consistency, when applying len(o) to a user-defined type, Python calls o.__len__() as a fallback. __len__, __abs__ and all the other special methods documented in the Python Data Model make it easy to create objects that behave like the built-ins, enabling the expressive and highly consistent APIs we call "Pythonic".
By implementing special methods your objects can support iteration, overload infix operators, manage contexts in with blocks etc. You can think of the Data Model as a way of using the Python language itself as a framework where the objects you create can be integrated seamlessly.
A second reason, supported by quotes from Guido van Rossum like this one, is that it is easier to read and write len(s) than s.len().
The notation len(s) is consistent with unary operators with prefix notation, like abs(n). len() is used way more often than abs(), and it deserves to be as easy to write.
There may also be a historical reason: in the ABC language which preceded Python (and was very influential in its design), there was a unary operator written as #s which meant len(s).
There is a len method:
>>> a = 'a string of some length'
>>> a.__len__()
23
>>> a.__len__
<method-wrapper '__len__' of str object at 0x02005650>
met% python -c 'import this' | grep 'only one'
There should be one-- and preferably only one --obvious way to do it.
There are some great answers here, and so before I give my own I'd like to highlight a few of the gems (no ruby pun intended) I've read here.
Python is not a pure OOP language -- it's a general purpose, multi-paradigm language that allows the programmer to use the paradigm they are most comfortable with and/or the paradigm that is best suited for their solution.
Python has first-class functions, so len is actually an object with its own methods, which you can inspect by running dir(len). Ruby, on the other hand, doesn't have first-class functions.
If you don't like the way this works in your own code, it's trivial for you to re-implement the containers using your preferred method (see example below).
>>> class List(list):
...     def len(self):
...         return len(self)
...
>>> class Dict(dict):
...     def len(self):
...         return len(self)
...
>>> class Tuple(tuple):
...     def len(self):
...         return len(self)
...
>>> class Set(set):
...     def len(self):
...         return len(self)
...
>>> my_list = List([1,2,3,4,5,6,7,8,9,'A','B','C','D','E','F'])
>>> my_dict = Dict({'key': 'value', 'site': 'stackoverflow'})
>>> my_set = Set({1,2,3,4,5,6,7,8,9,'A','B','C','D','E','F'})
>>> my_tuple = Tuple((1,2,3,4,5,6,7,8,9,'A','B','C','D','E','F'))
>>> my_containers = Tuple((my_list, my_dict, my_set, my_tuple))
>>>
>>> for container in my_containers:
...     print container.len()
...
15
2
15
15
Something missing from the rest of the answers here: the len function checks that the __len__ method returns a non-negative int. The fact that len is a function means that classes cannot override this behaviour to avoid the check. As such, len(obj) gives a level of safety that obj.len() cannot.
Example:
>>> class A:
...     def __len__(self):
...         return 'foo'
...
>>> len(A())
Traceback (most recent call last):
  File "<pyshell#8>", line 1, in <module>
    len(A())
TypeError: 'str' object cannot be interpreted as an integer
>>> class B:
...     def __len__(self):
...         return -1
...
>>> len(B())
Traceback (most recent call last):
  File "<pyshell#13>", line 1, in <module>
    len(B())
ValueError: __len__() should return >= 0
Of course, it is possible to "override" the len function by reassigning it as a global variable, but code which does this is much more obviously suspicious than code which overrides a method in a class.

Why are reversed and sorted of different types in Python?

reversed's type is "type":
>>> type(reversed)
<class 'type'>
sorted's type is "builtin function or method":
>>> type(sorted)
<class 'builtin_function_or_method'>
However, they seem the same in nature. Excluding the obvious difference in functionality (reversing vs. sorting sequences), what's the reason for this difference in implementation?
The difference is that reversed is an iterator (it's also lazy-evaluating) and sorted is a function that works "eagerly".
All built-in iterators (at least in python-3.x) like map, zip, filter, reversed, ... are implemented as classes. While the eager-operating built-ins are functions, e.g. min, max, any, all and sorted.
>>> a = [1,2,3,4]
>>> r = reversed(a)
>>> r
<list_reverseiterator at 0x2187afa0240>
You actually need to "consume" the iterator to get the values (e.g. list):
>>> list(r)
[4, 3, 2, 1]
On the other hand this "consuming" part isn't needed for functions like sorted:
>>> s = sorted(a)
>>> s
[1, 2, 3, 4]
In the comments it was asked why these are implemented as classes instead of functions. That's not really easy to answer but I'll try my best:
Using lazy-evaluating operations has one huge benefit: They are very memory efficient when chained. They don't need to create intermediate lists unless they are explicitly "requested". That was the reason why map, zip and filter were changed from eager-operating functions (python-2.x) to lazy-operating classes (python-3.x).
Generally there are two ways in Python to create iterators:
classes that return self in their __iter__ method
generator functions - functions that contain a yield
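To make the two approaches concrete, here is the same countdown written both ways. This is just an illustration for Python 3; `Countdown` and `countdown` are made-up names.

```python
class Countdown(object):
    """Iterator class: __iter__ returns self, __next__ produces the values."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


def countdown(start):
    """Generator function: the state lives in the suspended frame."""
    while start > 0:
        yield start
        start -= 1


from_class = list(Countdown(3))
from_generator = list(countdown(3))

print(from_class, from_generator)
```

Both are lazy: no value is computed until something (here `list`) asks for it.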
However, CPython (at least) implements all its built-ins (and several standard library modules) in C. It's very easy to create iterator classes in C but I haven't found any sensible way to create generator functions based on the Python-C-API. So the reason why these iterators are implemented as classes (in CPython) might just be convenience or the lack of (fast or implementable) alternatives.
There is an additional reason to use classes instead of generators: You can implement special methods for classes but you can't implement them on generator functions. That might not sound impressive but it has definite advantages. For example most iterators can be pickled (at least on Python-3.x) using the __reduce__ and __setstate__ methods. That means you can store them on the disk, and allows copying them. Since Python-3.4 some iterators also implement __length_hint__ which makes consuming these iterators with list (and similar) much faster.
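A quick way to see these special methods at work, assuming a recent CPython 3 (other implementations may behave differently): `operator.length_hint` reads `__length_hint__`, and `pickle` relies on the iterator's reduce support. The helper `gen` is only there for contrast.

```python
import operator
import pickle

r = reversed([1, 2, 3, 4])
next(r)  # consume one item so the iterator has some state worth saving

hint = operator.length_hint(r)            # how many items are still to come
restored = pickle.loads(pickle.dumps(r))  # copy of the iterator, state included
remaining = list(restored)


def gen():
    """Plain generator function, used to show the contrast."""
    yield 1


try:
    pickle.dumps(gen())
    generator_pickles = True
except TypeError:
    # Generator objects do not support pickling.
    generator_pickles = False

print(hint, remaining, generator_pickles)
```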
Note that reversed could easily be implemented as factory-function (like iter) but unlike iter, which can return two unique classes, reversed can only return one unique class.
To illustrate the possible (and unique) classes you have to consider a class that has no __iter__ and no __reversed__ method but is iterable and reverse-iterable (by implementing __getitem__ and __len__):
class A(object):
    def __init__(self, vals):
        self.vals = vals
    def __len__(self):
        return len(self.vals)
    def __getitem__(self, idx):
        return self.vals[idx]
And while it makes sense to add an abstraction layer (a factory function) in case of iter - because the returned class is depending on the number of input arguments:
>>> iter(A([1,2,3]))
<iterator at 0x2187afaed68>
>>> iter(min, 0) # actually this is a useless example, just here to see what it returns
<callable_iterator at 0x1333879bdd8>
That reasoning doesn't apply to reversed:
>>> reversed(A([1,2,3]))
<reversed at 0x2187afaec50>
What's the difference between reversed and sorted?
Interestingly, reversed is not a function, while sorted is.
Open a REPL session and type help(reversed):
class reversed(object)
| reversed(sequence) -> reverse iterator over values of the sequence
|
| Return a reverse iterator
It is indeed a class which is used to return a reverse iterator.
Okay, so reversed isn't a function. But why not?
This is a bit hard to answer. One explanation is that iterators have lazy evaluation. This requires some sort of container to store information about the current state of the iterator at any given time. This is best done through an object, and hence, a class.

Python: Testing equivalence of sets of custom classes when all instances are unique by definition?

Using Python 2.6, with the set() builtin, not sets.set.
I have defined some custom data abstraction classes, which will be made members of some sets using the builtin set() object.
The classes are already being stored in a separate structure, before being divided up into sets. All instances of the classes are declared first. No class instances are created or deleted after the first set is declared. No two class instances are ever considered to be "equal" to each other. (Two instances of the class, containing identical data, are considered not the same. A == B is False for all A,B where B is not A.)
Given the above, will there be any reasonable difference between these strategies for testing set_a == set_b?:
Option 1: Store integers in the sets that uniquely identify instances of my class.
Option 2: Store instances of my class, and implement __hash__() and __eq__() to compare id(self) == id(other). (This may not be necessary? Do default implementations of these functions in object just do the same thing but faster?) Possibly use an instance variable that increments every time a new instance calls __init__(). (Not thread safe?)
or,
Option 3: The instances are already stored and looked up in dictionaries keyed by rather long strings. The strings are what most directly represents what the instances are, and are kept unique. I thought storing these strings in the sets would be a RAM overhead and/or create a bunch of extra runtime by calling __eq__() and __hash__(). If this is not the case, I should store the strings directly. (But I think what I've read so far tells me it is the case.)
I'm somewhat new to sets in Python. I've figured out some of what I need to know already, just want to make sure I'm not overlooking something tricky or drawing a false conclusion somewhere.
I might be misunderstanding the question, but this is how Python behaves by default:
class Foo(object):
    pass
a = Foo()
b = Foo()
c = Foo()
x = set([a, b])
y = set([a, b])
z = set([a, c])
print x == y # True
print x == z # False
Do default implementations of these functions in object just do the same thing but faster?
Yes. User-defined classes have __cmp__() and __hash__() methods by default; with them, all objects compare unequal (except with themselves) and x.__hash__() returns id(x) (see the Python documentation).
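The question targets Python 2.6, but the same default behaviour can be checked in Python 3, where the default `__eq__` compares identity. A minimal sketch, with `Record` standing in for the custom class:

```python
class Record(object):
    """Hypothetical data class that relies on the default __eq__ and __hash__."""

    def __init__(self, data):
        self.data = data


a = Record("same")
b = Record("same")
c = Record("same")

identical_data_equal = (a == b)   # different instances, identical data
self_equal = (a == a)

set_x = {a, b}
set_y = {b, a}
set_z = {a, c}

same_members = (set_x == set_y)
different_members = (set_x == set_z)

print(identical_data_equal, self_equal, same_members, different_members)
```

So storing the instances themselves in the sets already gives the "every instance is unique" semantics without writing any `__eq__` or `__hash__`.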

What objects are guaranteed to have different identity?

ORIGINAL QUESTION:
(My question applies to Python 3.2+, but I doubt this has changed since Python 2.7.)
Suppose I use an expression that we usually expect to create an object. Examples: [1,2,3]; 42; 'abc'; range(10); True; open('readme.txt'); MyClass(); lambda x : 2 * x; etc.
Suppose two such expressions are executed at different times and "evaluate to the same value" (i.e., have the same type, and compare as equal). Under what conditions does Python provide what I call a distinct object guarantee that the two expressions actually create two distinct objects (i.e., x is y evaluates as False, assuming the two objects are bound to x and y, and both are in scope at the same time)?
I understand that for objects of any mutable type, the "distinct object guarantee" holds:
x = [1,2]
y = [1,2]
assert x is not y # guaranteed to pass
I also know for certain immutable types (str, int) the guarantee does not hold; and for certain other immutable types (bool, NoneType), the opposite guarantee holds:
x = True
y = not not x
assert x is not y # guaranteed to fail
x = 2
y = 3 - 1
assert x is not y # implementation-dependent; likely to fail in CPython
x = 1234567890
y = x + 1 - 1
assert x is not y # implementation-dependent; likely to pass in CPython
But what about all the other immutable types?
In particular, can two tuples created at different times have the same identity?
The reason I'm interested in this is that I represent nodes in my graph as tuples of int, and the domain model is such that any two nodes are distinct (even if they are represented by tuples with the same values). I need to create sets of nodes. If Python guarantees that tuples created at different times are distinct objects, I could simply subclass tuple to redefine equality to mean identity:
class DistinctTuple(tuple):
    __hash__ = tuple.__hash__
    def __eq__(self, other):
        return self is other
x = (1,2)
y = (1,2)
s = set([x, y])
assert len(s) == 1 # pass; but not what I want
x = DistinctTuple(x)
y = DistinctTuple(y)
s = set([x, y])
assert len(s) == 2 # pass; as desired
But if tuples created at different times are not guaranteed to be distinct, then the above is a terrible technique, which hides a dormant bug that may appear at random and may be very hard to replicate and find. In that case, subclassing won't help; I will actually need to add to each tuple, as an extra element, a unique id. Alternatively, I can convert my tuples to lists. Either way, I'd use more memory. Obviously, I'd prefer not to use these alternatives unless my original subclassing solution is unsafe.
My guess is that Python does not offer the "distinct object guarantee" for immutable types, either built-in or user-defined. But I haven't found a clear statement about it in the documentation.
UPDATE 1:
@LuperRouch @larsmans Thank you for the discussion and the answer so far. Here's the last issue I'm still unclear about:
Is there any chance that the creation of an object of a user-defined
type results in a reuse of an existing object?
If this is possible, I'd like to know how I can verify for any class I work with whether it might exhibit such a behavior.
Here's my understanding. Any time an object of a user-defined class is created, the class' __new__() method is called first. If this method is overridden, nothing in the language would prevent the programmer from returning a reference to an existing object, thus violating my "distinct object guarantee". Obviously, I can observe it by examining the class definition.
I am not sure what happens if a user-defined class does not override __new__() (or explicitly relies on __new__() from the base class). If I write
class MyInt(int):
    pass
the object creation is handled by int.__new__(). I would expect that this means I may sometimes see the following assertion fail:
x = MyInt(1)
y = MyInt(1)
assert x is not y # may fail, since int.__new__() might return the same object twice?
But in my experimentation with CPython I could not achieve such behavior. Does this mean the language provides "distinct object guarantee" for user-defined classes that don't override __new__, or is it just an arbitrary implementation behavior?
UPDATE 2:
While my DistinctTuple turned out to be a perfectly safe implementation, I now understand that my design idea of using DistinctTuple to model nodes is very bad.
The identity operator is already available in the language; making == behave in the same way as is is logically superfluous.
Worse, if == could have done something useful, I made it unavailable. For instance, it's quite likely that somewhere in my program I'll want to see if two nodes are represented by the same pair of integers; == would have been perfect for that - and in fact, that's what it does by default...
Worse yet, most people actually do expect == to compare some "value" rather than identity - even for a user-defined class. They would be caught unawares with my override that only looks at identity.
Finally... the only reason I had to redefine == was to allow multiple nodes with the same tuple representation to be part of a set. This is the wrong way to go about it! It's not == behavior that needs to change, it's the container type! I simply needed to use multisets instead of sets.
In short, while my question may have some value for other situations, I am absolutely convinced that creating class DistinctTuple is a terrible idea for my use case (and I strongly suspect it has no valid use case at all).
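The multiset idea from UPDATE 2 could look roughly like this, using `collections.Counter` from the standard library as the multiset. This is a sketch under the assumption that nodes only need to be counted per representation, not tracked individually.

```python
from collections import Counter

# Each node is a plain tuple of ints; equal tuples are counted, not merged away.
nodes = Counter()
nodes[(1, 2)] += 1
nodes[(1, 2)] += 1   # a second node with the same representation
nodes[(3, 4)] += 1

total_nodes = sum(nodes.values())        # number of nodes in the multiset
distinct_representations = len(nodes)    # number of different tuples
count_of_1_2 = nodes[(1, 2)]

print(total_nodes, distinct_representations, count_of_1_2)
```

With this, == on the tuples keeps its usual value-comparing meaning.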
Python reference, section 3, Data model:
for immutable types, operations that compute new values may actually return a reference to any existing object with the same type and value, while for mutable objects this is not allowed.
(Emphasis added.)
In practice, it seems CPython only caches the empty tuple:
>>> 1 is 1
True
>>> (1,) is (1,)
False
>>> () is ()
True
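To separate what is guaranteed from what merely happens in one implementation, a check like the following can help. It is a sketch: the helper functions are made up, and only the result for lists is guaranteed by the language, so the tuple results are printed rather than asserted.

```python
def make_list():
    """Build a new mutable object on every call."""
    return [1, 2]


def make_tuple(n):
    """Build a tuple at run time, so no compile-time constant is involved."""
    return tuple(range(n))


x, y = make_list(), make_list()
lists_distinct = x is not y   # guaranteed: mutable objects are never reused

e1, e2 = make_tuple(0), make_tuple(0)
t1, t2 = make_tuple(2), make_tuple(2)

print("lists distinct:       ", lists_distinct)
print("empty tuples identical:", e1 is e2)   # implementation-dependent
print("(0, 1) tuples identical:", t1 is t2)  # implementation-dependent
```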
Is there any chance that the creation of an object of a user-defined type results in a reuse of an existing object?
This will happen if, and only if, the user-defined type is explicitly designed to do that. With __new__() or some metaclass.
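For illustration, this is roughly what "explicitly designed to do that" looks like. `Interned` is a hypothetical class whose `__new__` hands back a cached instance, while `Plain` uses the default behaviour.

```python
class Interned(object):
    """Hypothetical class that reuses one instance per key via __new__."""

    _cache = {}

    def __new__(cls, key):
        try:
            return cls._cache[key]          # reuse an existing object
        except KeyError:
            obj = super(Interned, cls).__new__(cls)
            cls._cache[key] = obj
            return obj

    def __init__(self, key):
        self.key = key


class Plain(object):
    """No __new__ override, so every call creates a new object."""

    def __init__(self, key):
        self.key = key


i1, i2 = Interned("a"), Interned("a")
p1, p2 = Plain("a"), Plain("a")

interned_same = i1 is i2
plain_same = p1 is p2

print(interned_same, plain_same)
```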
I'd like to know how I can verify for any class I work with whether it might exhibit such a behavior.
Use the source, Luke.
When it comes to int, small integers are pre-allocated, and these pre-allocated integers are used wherever you create or calculate with integers. You can't get this working when you do MyInt(1) is MyInt(1), because what you have there are not integers. However:
>>> MyInt(1) + MyInt(1) is 2
True
This is because of course MyInt(1) + MyInt(1) does not return a MyInt. It returns an int, because that's what the __add__ of an integer returns (and that's where the check for pre-allocated integers occurs as well). This if anything just shows that subclassing int in general isn't particularly useful. :-)
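The type of the result is easy to confirm without relying on `is` against a literal. A small sketch; whether two `MyInt(1)` objects are ever identical is CPython behaviour rather than a language guarantee, so that part is only printed.

```python
class MyInt(int):
    pass


total = MyInt(1) + MyInt(1)
result_type = type(total)   # int.__add__ returns a plain int, not a MyInt

x = MyInt(1)
y = MyInt(1)

print("type of sum:", result_type.__name__)
print("x is y:", x is y)    # implementation-dependent, shown for information only
```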
Does this mean the language provides "distinct object guarantee" for user-defined classes that don't override __new__, or is it just an arbitrary implementation behavior?
It doesn't guarantee it, because there is no need to do so. The default behavior is to create a new object. You have to override it if you don't want that to happen. Having a guarantee makes no sense.
If Python guarantees that tuples created at different times are distinct objects, I could simply subclass tuple to redefine equality to mean identity.
You seem to be confused about how subclassing works: If B subclasses A, then B gets to use all of A's methods[1] -- but the A methods will be working on instances of B, not of A. This holds even for __new__:
--> class Node(tuple):
...     def __new__(cls):
...         obj = tuple.__new__(cls)
...         print(type(obj))
...         return obj
...
--> n = Node()
<class '__main__.Node'>
As @larsmans pointed out in the Python reference:
for immutable types, operations that compute new values may actually return a reference to any existing object with the same type and value, while for mutable objects this is not allowed
However, keep in mind this passage is talking about Python's built-in types, not user-defined types (which can go crazy pretty much any way they like).
I understand the above excerpt to guarantee that Python will not return a new mutable object that is the same as an existing object, and classes that are user-defined and created in Python code are inherently mutable (again, see note above about crazy user-defined classes).
A more complete Node class (note that under Python 2 you don't need to explicitly reference tuple.__hash__; under Python 3 you do, see [1]):
class Node(tuple):
    __slots__ = tuple()
    __hash__ = tuple.__hash__
    def __eq__(self, other):
        return self is other
    def __ne__(self, other):
        return self is not other
--> n1 = Node()
--> n2 = Node()
--> n1 is n2
False
--> n1 == n2
False
--> n1 != n2
True
--> n1 <= n2
True
--> n1 < n2
False
As you can see from the last two comparisons, you may want to also override the __le__ and __ge__ methods.
[1] The only exception I am aware of is __hash__ -- if __eq__ is defined on the subclass but the subclass wants the parent class' __hash__ it has to explicitly say so (this is a Python 3 change).

