Let's say I have a referentially transparent function. It is very easy to memoize it; for example:
import functools

def memoize(obj):
    memo = {}
    kwd_mark = object()  # sentinel separating positional from keyword arguments

    @functools.wraps(obj)
    def memoizer(*args, **kwargs):
        combined_args = args + (kwd_mark,) + tuple(sorted(kwargs.items()))
        if combined_args not in memo:
            memo[combined_args] = obj(*args, **kwargs)
        return memo[combined_args]
    return memoizer

@memoize
def my_function(data, alpha, beta):
    # ...
Now suppose that the data argument to my_function is huge; say, it's a frozenset with millions of elements. In this case, the cost of memoization is prohibitive: every time, we'd have to calculate hash(data) as part of the dictionary lookup.
I can make the memo dictionary an attribute of data instead of an object inside the memoize decorator. This way I can skip the data argument entirely when doing the cache lookup, since the chance that another huge frozenset will be equal to it is negligible. However, this approach ends up polluting an argument passed to my_function. Worse, if I have two or more large arguments, this won't help at all (I can only attach memo to one argument).
Is there anything else that can be done?
Well, you can use hash there without fear. A frozenset's hash is not calculated more than once by Python - it is computed the first time it is needed and then cached - check the timings:
>>> timeit("frozenset(a)", "a=range(100)")
3.26825213432312
>>> timeit("hash(a)", "a=frozenset(range(100))")
0.08160710334777832
>>> timeit("(lambda x:x)(a)", "a=hash(frozenset(range(100)))")
0.1994171142578125
Don't forget that Python's hash builtin calls the object's __hash__ method, and for frozensets (and strings) the result is cached after the first computation. Above you can see that calling an identity lambda function is more than twice as slow as calling hash(a).
So, if all your arguments are hashable, just add their hash when creating combined_args - otherwise, write the key construction so that it uses hash for frozenset (and maybe other) types, with a conditional.
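For instance, a rough sketch of that key construction (make_key and the kwd_mark sentinel are names introduced here for illustration, not part of the question's code):

kwd_mark = object()  # sentinel separating positional from keyword arguments

def make_key(args, kwargs):
    # Replace frozenset arguments by their hash (cheap after the first call,
    # since the value is cached); everything else goes into the key unchanged.
    # Trade-off: keying on the hash alone means a (very unlikely) collision
    # would silently return a wrong cached result.
    key_args = tuple(hash(a) if isinstance(a, frozenset) else a for a in args)
    return key_args + (kwd_mark,) + tuple(sorted(kwargs.items()))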
It turns out that the built-in __hash__ is not that bad, since it caches its own value after the first calculation. The real performance hit comes from the built-in __eq__, since it doesn't short-circuit on identical objects, and actually goes through the full comparison every time, making it very costly.
One approach I thought of is to subclass the built-in class for all large arguments:
class MyFrozenSet(frozenset):
    __eq__ = lambda self, other: id(self) == id(other)
    __hash__ = lambda self: id(self)
This way, dictionary lookup would be instantaneous. But the equality for the new class will be broken.
A better solution is probably this: only when the dictionary lookup is performed, the large arguments can be wrapped inside a special class that redefines __eq__ and __hash__ to return the wrapped object's id(). The obvious implementation of the wrapper is a bit annoying, since it requires copying all the standard frozenset methods. Perhaps deriving it from the relevant ABC class may make it easier.
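A rough sketch of that wrapper (the class name is made up here, and it only covers what the cache lookup needs, not the full frozenset API):

class _IdKey:
    """Wraps a large object purely for use inside a cache key."""
    __slots__ = ('obj',)

    def __init__(self, obj):
        self.obj = obj

    def __hash__(self):
        return id(self.obj)   # constant time, no per-element hashing

    def __eq__(self, other):
        # identity-based equality: only the very same wrapped object matches
        return isinstance(other, _IdKey) and self.obj is other.obj

The memoizer would then build its key with _IdKey(data) in place of data, while still passing the original data through to the wrapped function.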
I noticed that when I use user-defined objects (that override the __hash__ method) as keys of my dicts in Python, lookup time increases by at least a factor of 5.
This behaviour is observed even when I use very basic hash methods such as in the following example:
class A:
    def __init__(self, a):
        self.a = a

    def __hash__(self):
        return hash(self.a)

    def __eq__(self, other):
        if not isinstance(other, A):
            return NotImplemented
        return (self.a == other.a and
                self.__class__ == other.__class__)

# get an instance of class A
mya = A(42)

# define dict
d1 = {mya: [1, 2], 'foo': [3, 4]}
If I time the access through the two different keys I observe a significant difference in performance
%timeit d1['foo']
results in ~ 100 ns. Whereas
%timeit d1[mya]
results in ~ 600 ns.
If I remove the overriding of the __hash__ and __eq__ methods, performance is back at the same level as for a default object.
Is there a way to avoid this loss in performance and still implement a customised hash calculation?
The default CPython __hash__ implementation for a custom class is written in C and uses the memory address of the object. Therefore, it does not have to access anything from the object at all and can be done very quickly, as it is essentially just a single integer operation.
The "very basic" __hash__ from the example is not as simple as it may seem:
def __hash__(self):
    return hash(self.a)
This has to read the attribute a of self, which I'd say in this case will call object.__getattribute__(self, 'a'), and that will look for the value of 'a' in __dict__. This already involves calculating hash('a') and looking it up. Then, the returned value will be passed to hash.
To answer the additional question:
Is there a way to implement a faster __hash__ method that returns predictable values, I mean that are not randomly computed at each run as in the case of the memory address of the object?
Anything accessing attributes of objects will be slower than the implementation which does not need to access attributes, but you could make attribute access faster by using __slots__, or implementing a highly optimized C extension for the class.
There is, however, another question: is this really a problem? I cannot really believe that an application is becoming slow because of slow __hash__. __hash__ should still be pretty fast unless the dictionary has trillions of entries, but then, everything else would become slow and ask for bigger changes...
I did some testing and have to make a correction. Using __slots__ is not going to help in this case at all. My tests actually showed that in CPython 3.7 the above class becomes slightly slower when using __slots__.
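For reference, a __slots__ variant of the class from the question would look roughly like this (a sketch of the kind of class being discussed, not a recommendation):

class A:
    __slots__ = ('a',)   # no per-instance __dict__; the attribute lives in a slot descriptor

    def __init__(self, a):
        self.a = a

    def __hash__(self):
        return hash(self.a)

    def __eq__(self, other):
        if not isinstance(other, A):
            return NotImplemented
        return self.a == other.a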
I am trying to have a list returned when I call list() on a class. What's the best way to do this?
class Test():
    def __init__(self):
        self.data = [1,2,3]

    def aslist(self):
        return self.data
a = Test()
list(a)
[1,2,3]
I want list(a) to run the aslist function when it is called, and ideally I'd like to implement an asdict that works the same way when dict() is called.
I'd like to be able to do this with dict, int, and all other type casts.
Unlike many other languages you might be used to (e.g., C++), Python doesn't have any notion of "type casts" or "conversion operators" or anything like that.
Instead, Python types' constructors are generally written to some more generic (duck-typed) protocol.
The first thing to do is to go to the documentation for whichever constructor you care about and see what it wants. Start in Builtin Functions, even if most of them will link you to an entry in Builtin Types.
Many of them will link to an entry for the relevant special method in the Data Model chapter.
For example, int says:
… If x defines __int__(), int(x) returns x.__int__(). If x defines __trunc__(), it returns x.__trunc__() …
You can then follow the link to __int__, although in this case there's not much extra information:
Called to implement the built-in functions complex(), int() and float(). Should return a value of the appropriate type.
So, you want to define an __int__ method, and it should return an int:
class MySpecialZero:
    def __int__(self):
        return 0
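With that in place, int() on an instance goes through __int__:
>>> int(MySpecialZero())
0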
The sequence and set types (like list, tuple, set, frozenset) are a bit more complicated. They all want an iterable:
An object capable of returning its members one at a time. Examples of iterables include all sequence types (such as list, str, and tuple) and some non-sequence types like dict, file objects, and objects of any classes you define with an __iter__() method or with a __getitem__() method that implements Sequence semantics.
This is explained a bit better under the iter function, which may not be the most obvious place to look:
… object must be a collection object which supports the iteration protocol (the __iter__() method), or it must support the sequence protocol (the __getitem__() method with integer arguments starting at 0) …
And under __iter__ in the Data Model:
This method is called when an iterator is required for a container. This method should return a new iterator object that can iterate over all the objects in the container. For mappings, it should iterate over the keys of the container.
Iterator objects also need to implement this method; they are required to return themselves. For more information on iterator objects, see Iterator Types.
So, for your example, you want to be an object that iterates over the elements of self.data, which means you want an __iter__ method that returns an iterator over those elements. The easiest way to do that is to just call iter on self.data—or, if you want that aslist method for other reasons, maybe call iter on what that method returns:
class Test():
    def __init__(self):
        self.data = [1,2,3]

    def aslist(self):
        return self.data

    def __iter__(self):
        return iter(self.aslist())
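With that, the question's example works as desired:
>>> a = Test()
>>> list(a)
[1, 2, 3]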
Notice that, as Edward Minnix explained, Iterator and Iterable are separate things. An Iterable is something that can produce an Iterator when you call its __iter__ method. All Iterators are Iterables (they produce themselves), but many Iterables are not Iterators (Sequences like list, for example).
dict (and OrderedDict, etc.) is also a bit complicated. Check the docs, and you'll see that it wants either a mapping (that is, something like a dict) or an iterable of key-value pairs (those pairs themselves being iterables). In this case, unless you're implementing a full mapping, you probably want the fallback:
class Dictable:
    def __init__(self):
        self.names, self.values = ['a', 'b', 'c'], [1, 2, 3]

    def __iter__(self):
        return zip(self.names, self.values)
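Then dict() consumes those key-value pairs directly (and list() would give the pairs themselves):
>>> dict(Dictable())
{'a': 1, 'b': 2, 'c': 3}
>>> list(Dictable())
[('a', 1), ('b', 2), ('c', 3)]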
Almost everything else is easy, like int—but notice that str, bytes, and bytearray are sequences.
Meanwhile, if you want your object to be convertible to an int or to a list or to a set, you might want it to also act a lot like one in other ways. If that's the case, look at collections.abc and numbers, which provide helpers that are not only abstract base classes (used if you need to check whether some type meets some protocol), but also mixins (used to help you implement the protocol).
For example, a full Sequence is expected to provide most of the same methods as a tuple—about 7 of them—but if you use the mixin, you only need to define 2 yourself:
import collections.abc

class MySeq(collections.abc.Sequence):
    def __init__(self, iterable):
        self.data = tuple(iterable)

    def __getitem__(self, idx):
        return self.data[idx]

    def __len__(self):
        return len(self.data)
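For example, a quick check, assuming the class above:
>>> s = MySeq(range(3))
>>> list(s)
[0, 1, 2]
>>> s[1], len(s), 2 in s, list(reversed(s))
(1, 3, True, [2, 1, 0])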
Now you can use a MySeq almost anywhere you could use a tuple—including constructing a list from it, of course.
For some types, like MutableSequence, the shortcuts help even more—you get 17 methods for the price of 5.
If you want the same object to be list-able and dict-able… well, then you run into a limitation of the design. list wants an iterable. dict wants an iterable of pairs, or a mapping—which is a kind of iterable. So, rather than infinite choices, you only really have two:
Iterate keys and implement __getitem__ with those keys for dict, so list gives a list of those keys.
Iterate key-value pairs for dict, so list gives a list of those key-value pairs.
Obviously if you want to actually act like a Mapping, you only have one choice, the first one.
The fact that the sequence and mapping protocols overlap has been part of Python from the beginning, inherent in the fact that you can use the [] operator on both of them, and has been retained with every major change since, even though it's made other features (like the whole ABC model) more complicated. I don't know if anyone's ever given a reason, but presumably it's similar to the reason for the extended-slicing design. In other words, making dicts and other mappings a lot easier and more readable to use is worth the cost of making them a little more complicated and less flexible to implement.
This can be done by overloading special methods. You will need to define the __iter__ method for your class, making it iterable. This means anything expecting an iterable (such as most collection constructors: list, set, etc.) will then work with your object.
class Test:
    ...

    def __iter__(self):
        return iter(self.data)
Note: You will need to wrap the returned object with iter() so that it is an iterator (there is a difference between an iterable and an iterator). A list is iterable (it can be iterated over), but it is not an iterator (an iterator supports __next__ and raises StopIteration when it is exhausted).
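A quick illustration of that distinction (a REPL sketch):
>>> lst = [1, 2, 3]
>>> it = iter(lst)        # lst is iterable; iter() returns an iterator over it
>>> next(it)
1
>>> hasattr(lst, '__next__'), hasattr(it, '__next__')
(False, True)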
Reading How to implement a good __hash__ function in python - can I not write __eq__ as
def __eq__(self, other):
    return isinstance(other, self.__class__) and hash(other) == hash(self)

def __ne__(self, other):
    return not self.__eq__(other)

def __hash__(self):
    return hash((self.firstfield, self.secondfield, totuple(self.thirdfield)))
? Of course I am going to implement __hash__(self) as well. I have rather clearly defined class members. I am going to turn them all into tuples and make a total tuple out of those and hash that.
Generally speaking, a hash function will have collisions. If you define equality in terms of a hash, you're running the risk of entirely dissimilar items comparing as equal, simply because they ended up with the same hash code. The only way to avoid this would be if your class only had a small, fixed number of possible instances, and you somehow ensured that each one of those had a distinct hash. If your class is simple enough for that to be practical, then it is almost certainly simple enough for you to just compare instance variables to determine equality directly. Your __hash__() implementation would have to examine all the instance variables anyway, in order to calculate a meaningful hash.
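As a hypothetical illustration (the class and field here are made up): CPython happens to give -1 and -2 the same hash, so a hash-based __eq__ would declare two plainly different objects equal:

class P:
    def __init__(self, x):
        self.x = x
    def __hash__(self):
        return hash(self.x)
    def __eq__(self, other):
        # equality defined via the hash, as in the question
        return isinstance(other, P) and hash(other) == hash(self)

>>> hash(-1) == hash(-2)      # both are -2 in CPython
True
>>> P(-1) == P(-2)            # the collision makes unequal objects "equal"
True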
Of course you can define __eq__ and __ne__ the way you have in your example, but unless you also explicitly define __hash__, you will get a TypeError: unhashable type exception any time you try to use objects of that type as dictionary keys or set members (in Python 3, defining __eq__ without __hash__ makes the class unhashable).
Since you're defining some semblance of value for your object by defining an __eq__ method, a more important question is: what do you consider the value of your object to be? By looking at your code, your answer to that is, "the value of this object is its hash". Without knowing the contents of your __hash__ method, it's impossible to evaluate the quality or validity of your __eq__ method.
Pardon the incompetence of style from a Python novice here.
I have a class that takes one parameter for establishing the initial data. There are two ways how the initial data can come in: either a list of strings, or a dictionary with string keys and integer values.
Right now I implement only one version of the constructor, the one that takes a dictionary as its parameter, with {} as the default value. The list-based init is implemented as a method, i.e.
myClass = MyClass()
myClass.initList(listVar)
I can surely live with this, but it certainly is not perfect. So, I decided to turn here for some Pythonic wisdom: how should such polymorphic constructors be implemented? Should I try (and fail) to read initData.keys() in order to sniff whether this is a dictionary? Or is sniffing parameter types to implement lousy polymorphism, where it's not welcome by design, considered non-pythonic?
In an ideal world you'd write one constructor that could take either a list or dict without knowing the difference (i.e. duck typed). Of course, this isn't very realistic since these are pretty different ducks.
Understandably, too, you have a little heartburn about the idea of checking the actual instance types, because it breaks with the idea of duck typing. But, in python 2.6 an interesting module called abc was introduced which allows the definition of "abstract base classes". To be considered an instance of an abstract base class, one doesn't actually have to inherit from it, but rather only has to implement all its abstract methods.
The collections module includes some abstract base classes that would be of interest here, namely collections.Sequence and collections.Mapping. Thus you could write your __init__ functions like:
def __init__(self, somedata):
    if isinstance(somedata, collections.Sequence):
        # somedata is a list or some other Sequence
        ...
    elif isinstance(somedata, collections.Mapping):
        # somedata is a dict or some other Mapping
        ...
http://docs.python.org/2/library/collections.html#collections-abstract-base-classes contains the specifics of which methods are provided by each ABC. If you stick to these, then your code can now accept any object which fits one of these abstract base classes. And, as far as taking the builtin dict and list types, you can see that:
>>> isinstance([], collections.Sequence)
True
>>> isinstance([], collections.Mapping)
False
>>> isinstance({}, collections.Sequence)
False
>>> isinstance({}, collections.Mapping)
True
And, almost by accident, you just made it work for tuple too. You probably didn't care if it was really a list, just that you can read the elements out of it. But, if you had checked isinstance(somedata, list) you would have ruled out tuple. This is what using an ABC buys you.
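On Python 3 these ABCs live in collections.abc (the bare collections.Sequence / collections.Mapping aliases were removed in 3.10), so the same dispatch would be spelled roughly like this (the function name is made up for illustration):

from collections.abc import Mapping, Sequence

def init_from(somedata):
    # same dispatch as above, with the Python 3 import locations
    if isinstance(somedata, Mapping):
        return dict(somedata)            # dict or some other Mapping
    if isinstance(somedata, Sequence):
        return dict.fromkeys(somedata)   # list, tuple, or some other Sequence
    raise TypeError("expected a mapping or a sequence")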
As @Jan-PhilipGehrcke notes, pythonic can be hard to quantify. To me it means:
easy to read
easy to maintain
simple is better than complex is better than complicated
etcetera, etcetera, and so forth (see the Zen of Python for the complete list, which you get by typing import this in the interpreter)
So, the most pythonic solution depends on what you have to do for each supported initializer, and how many of them you have. I would say if you have only a handful, and each one can be handled by only a few lines of code, then use isinstance and __init__:
class MyClass(object):
    def __init__(self, initializer):
        """
        initialize internal data structures with 'initializer'
        """
        if isinstance(initializer, dict):
            for k, v in initializer.items():
                # do something with k & v
                setattr(self, k, v)
        elif isinstance(initializer, (list, tuple)):
            for item in initializer:
                setattr(self, item, None)
On the other hand, if you have many possible initializers, or if any one of them requires a lot of code to handle, then you'll want to have one classmethod constructor for each possible init type, with the most common usage being in __init__:
class MyClass(object):
    def __init__(self, init_dict={}):
        """
        initialize internal data structures with 'init_dict'
        """
        for k, v in init_dict.items():
            # do something with k & v
            setattr(self, k, v)

    @classmethod
    def from_sequence(cls, init_list):
        """
        initialize internal data structures with 'init_list'
        """
        result = cls()
        for item in init_list:
            setattr(result, item, None)
        return result
This keeps each possible constructor simple, clean, and easy to understand.
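For example, usage might look like this (the attribute names here are made up):

# from a mapping: keys become attributes with the given values
obj1 = MyClass({'alpha': 1, 'beta': 2})
print(obj1.alpha, obj1.beta)              # 1 2

# from a sequence of names: attributes are created and set to None
obj2 = MyClass.from_sequence(['gamma', 'delta'])
print(obj2.gamma, obj2.delta)             # None None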
As a side note: using mutable objects as defaults (like I do in the above __init__) needs to be done with care; the reason is that defaults are only evaluated once, and then whatever the result is will be used for every subsequent invocation. This is only a problem when you modify that object in your function, because those modifications will then be seen by every subsequent invocation -- and unless you wanted to create a cache that's probably not the behavior you were looking for. This is not a problem with my example because I am not modifying init_dict, just iterating over it (which is a no-op if the caller hasn't replaced it as it's empty).
There is no function overloading in Python. Using if isinstance(...) in your __init__() method would be very simple to read and understand:
class Foo(object):
    def __init__(self, arg):
        if isinstance(arg, dict):
            ...
        elif isinstance(arg, list):
            ...
        else:
            raise Exception("big bang")
You can use *args and **kwargs to do it.
But if you want to know what type a parameter is, you should use type() or isinstance().
It seems a common and quick way to create a stock __hash__() for any given Python object is to return hash(str(self)), if that object implements __str__(). Is this efficient, though? Per this SO answer, a hash of a tuple of the object's attributes is "good", but it doesn't seem to indicate whether that's the most efficient approach for Python. Or would it be better to implement a __hash__() for each object and use a real hashing algorithm from this page, mixing up the values of the individual attributes into the final value returned by __hash__()?
Pretend I've implemented the Jenkins hash routines from this SO question. Which __hash__() would be better to use?:
# hash str(self)
def __hash__(self):
    return hash(str(self))

# hash of tuple of attributes
def __hash__(self):
    return hash((self.attr1, self.attr2, self.attr3,
                 self.attr4, self.attr5, self.attr6))

# jenkins hash
def __hash__(self):
    from jenkins import mix, final
    a = self.attr1
    b = self.attr2
    c = self.attr3
    a, b, c = mix(a, b, c)
    a += self.attr4
    b += self.attr5
    c += self.attr6
    a, b, c = final(a, b, c)
    return c
Assume the attrs in the sample object are all integers for simplicity. Also assume that all objects derive from a base class and that each object implements its own __str__(). The tradeoff in using the first hash is that I could implement it in the base class as well and not add additional code to each of the derived objects. But if the second or third __hash__() implementations are better in some way, does that offset the cost of the added code in each derived object (because each may have different attributes)?
Edit: the import in the third __hash__() implementation is there only because I didn't want to draft out an entire example module + objects. Assume that import really happens at the top of the module, not on each invocation of the function.
Conclusion: Per the answer and comments on this closed SO question, it looks like I really want the tuple hash implementation, not for speed or efficiency, but because of the underlying duality of __hash__ and __eq__. Since a hash value is going to have a limited range of some form (be it 32 or 64 bits, for example), in the event you do have a hash collision, object equality is then checked. So since I do implement __eq__() for each object by using tuple comparison of self/other's attributes, I also want to implement __hash__() using an attribute tuple so that I respect the hash/equality nature of things.
Your third one has an important performance pessimization: it's importing two names each time the function is called. Of course, how performant it is relative to the string-hash version depends on how the string is generated.
That said, when you have attributes that define equality for the object, and those attributes are themselves hashable types, the simplest (and almost certainly best-performing) approach is going to be to hash a tuple containing those attribute values.
def __hash__(self):
    return hash((self.attr1, self.attr2, self.attr3))