Consider a function f(*x) which takes a lot of arguments *x. Based on these arguments (objects), the function f composes a rather complex object o and returns it. o implements __call__, so o itself serves as a function. Since the composition of o is pretty time consuming and in my scenario there is no point in having multiple instances of o based on the same arguments *x, they are to be cached.
The question is now: How to efficiently compute a hash based on multiple arguments *x? Currently I am using a python dictionary, and i concatenate the str() representations of each x to build each key. It works in my scenario, but it feels rather awkward. I need to call the resulting objects o in a very high frequency, so I suspect the repeated call of str() and the string concatenations waste a lot of computation time.
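A minimal sketch of the setup described above (compose_object is a hypothetical stand-in for the expensive composition that builds o):
def compose_object(*x):
    """Hypothetical stand-in for the time-consuming composition of o."""
    return lambda: x   # the real o is a callable object
_cache = {}
def f(*x):
    # current approach: the key is the concatenation of str() of each argument
    key = "".join(str(arg) for arg in x)
    if key not in _cache:
        _cache[key] = compose_object(*x)
    return _cache[key]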
You can use the hash built-in function, combining the hashes of the items in x together. The typical way to do this (see e.g. the documentation) would be an xor across all the hashes of the individual objects:
it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects
To implement that in a functional way, using operator and reduce:
from functools import reduce # only required in Python 3.x
from operator import xor
def hashed(*x):
    return reduce(xor, map(hash, x))
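For example (note that because xor is commutative, the combined hash ignores argument order, so hashed(1, 2) and hashed(2, 1) collide; if order matters for your keys, hash(x) on the whole argument tuple avoids that):
>>> hashed(1, "a", (2, 3)) == hashed(1, "a", (2, 3))
True
>>> hashed(1, 2) == hashed(2, 1)
True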
See also this question.
Since version 3.2, Python already contains an implementation of an LRU cache that you can use
to cache your functions' results based on their arguments:
functools.lru_cache
Example:
from functools import lru_cache
@lru_cache(maxsize=32)
def f(*args):
    """Expensive function"""
    print("f(%s) has been called." % (args, ))
    return sum(args)
print(f(1, 2, 3))
print(f(1, 2, 3, 4))
print(f(1, 2, 3))
print(f.cache_info())
Output:
f((1, 2, 3)) has been called.
6
f((1, 2, 3, 4)) has been called.
10
6
CacheInfo(hits=1, misses=2, maxsize=32, currsize=2)
(Notice how f(1, 2, 3) only got called once)
As suggested in the comments, it's probably best to simply use the hash()es of your arguments to build the cache-key for your arguments - that's what lru_cache already does for you.
If you're still on Python 2.7, Raymond Hettinger has posted some recipes with LRU caches that you could use in your own code.
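If you'd rather not pull in a recipe, a minimal (unbounded, non-LRU) memoizing decorator keyed on the positional arguments is only a few lines. This is just a sketch, not one of Hettinger's recipes, and it assumes all arguments are hashable:
import functools
def memoized(func):
    """Hypothetical helper: unbounded cache keyed on the positional arguments."""
    cache = {}
    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper
@memoized
def f(*args):
    print("f(%s) has been called." % (args, ))
    return sum(args)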
Related
The problem is related to lru_cache: when I pass the same argument to the same cached function but in a different manner, lru_cache is not able to leverage the cache. I would like to know if there is a better implementation that tackles this problem.
from functools import lru_cache
@lru_cache(maxsize=32)
def fn(x, y):
    print('no cache')
    return 1
fn(1,1)
>>>>no cache
fn(1,y=1)
>>>>no cache
fn(x=1,y=1)
>>>>no cache
tl;dr: Looking at the lru_cache source code, it doesn't do what you want, because that would sacrifice speed and reliability for a very limited use case. (As an aside, it is worth reading through the source code to see all the nice optimizations they made.)
Long-winded response: specifically, look at _make_key and how it is built for speed.
The purpose of the function is to take the arguments and keyword arguments you've passed in and make them into a unique key as quickly and reliably as possible. Since the cache is being stored in a dictionary, it further optimizes this by making the keys a special object that hashes very efficiently for even faster lookup in the dictionary. All this makes the lru_cache really efficient and ensures it incurs an absolute minimum of overhead.
If you have a function like
def my_function(x, y=1, z=3):
    return x + (y*2) + (z**2)
if you call my_function(1, 2), it looks up args=(1, 2) in the cache. It doesn't have to know that 1 is bound to x and 2 is bound to y. If you call my_function(2, z=1) it cannot assume that z is the next positional argument (which it isn't) because the cache is ignorant of the actual function signature. It will look up the cache key of args=(2,), kwargs={'z': 1}.
Imagine if the cache had to "know" the signature of each function. In this case it would have to know that all of these evaluate equal:
args=(5, 1, 3) # currently stored in the cache as _HashedSeq((5, 1, 3))
args=(5,), kwargs={'z': 3} # currently stored as _HashedSeq((5, kwd_mark, ('z', 3)))
args=(5,), kwargs={'y': 1}
args=(), kwargs={'z': 3, 'y': 1, 'x': 5}
This would be opening up a terrible can of worms that would involve using inspect to get the signature and tons of extra overhead and evaluating each kind of signature to see if it equaled one already in the cache. Considering the efforts that the authors made to streamline this function, it would make lru_cache stop being good at what it's supposed to do, which is quickly and efficiently call up answers that were already computed.
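For completeness: if you really do need keyword-insensitive caching for one specific function and are willing to pay the overhead described above, you can normalize the call yourself before it reaches lru_cache. This is only a sketch with a made-up name (normalized_lru_cache), and it assumes the wrapped function has plain positional-or-keyword parameters (no *args or **kwargs):
import functools
import inspect
def normalized_lru_cache(maxsize=128):
    """Hypothetical decorator: bind arguments to the signature, then cache on a
    canonical (name, value) representation. Slower than a plain lru_cache."""
    def decorator(func):
        sig = inspect.signature(func)
        @functools.lru_cache(maxsize=maxsize)
        def cached(*items):
            return func(**dict(items))
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            bound = sig.bind(*args, **kwargs)
            bound.apply_defaults()
            return cached(*sorted(bound.arguments.items()))
        return wrapper
    return decorator
@normalized_lru_cache(maxsize=32)
def fn(x, y):
    print('no cache')
    return 1
fn(1, 1)      # no cache
fn(1, y=1)    # cache hit, prints nothing
fn(x=1, y=1)  # cache hit, prints nothing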
Ring resolves the key regardless of the order in which the arguments are provided, so in the example below the second and third calls hit the cache and print nothing. As noted in other answers, it is slower than lru_cache. If you are interested in the implementation, it merges positional and keyword arguments with a loop. See: https://github.com/youknowone/ring/blob/0.7.3/ring/callable.py#L23
import ring
@ring.lru(maxsize=32)
def fn(x, y):
    print('no cache')
    return 1
fn(1,1)
>>>>no cache
fn(1,y=1)
fn(x=1,y=1)
reversed's type is "type":
>>> type(reversed)
<class 'type'>
sorted's type is "builtin function or method":
>>> type(sorted)
<class 'builtin_function_or_method'>
However, they seem the same in nature. Excluding the obvious difference in functionality (reversing vs. sorting sequences), what's the reason for this difference in implementation?
The difference is that reversed is an iterator (it's also lazy-evaluating) and sorted is a function that works "eagerly".
All built-in iterators (at least in python-3.x) like map, zip, filter, reversed, ... are implemented as classes. While the eager-operating built-ins are functions, e.g. min, max, any, all and sorted.
>>> a = [1,2,3,4]
>>> r = reversed(a)
>>> r
<list_reverseiterator at 0x2187afa0240>
You actually need to "consume" the iterator to get the values (e.g. list):
>>> list(r)
[4, 3, 2, 1]
On the other hand this "consuming" part isn't needed for functions like sorted:
>>> s = sorted(a)
>>> s
[1, 2, 3, 4]
In the comments it was asked why these are implemented as classes instead of functions. That's not really easy to answer but I'll try my best:
Using lazy-evaluating operations has one huge benefit: They are very memory efficient when chained. They don't need to create intermediate lists unless they are explicitly "requested". That was the reason why map, zip and filter were changed from eager-operating functions (python-2.x) to lazy-operating classes (python-3.x).
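For example (a quick sketch), chaining the lazy built-ins does no work and builds no intermediate list until the result is actually consumed:
>>> lazy = map(str, filter(lambda x: x % 2, range(10**6)))   # nothing computed yet
>>> next(lazy)                                               # work happens on demand
'1'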
Generally there are two ways in Python to create iterators:
classes that return self in their __iter__ method
generator functions - functions that contain a yield
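To make both approaches concrete, here is a minimal sketch of each (the CountDown / count_down names are just illustrative); both produce n, n-1, ..., 1:
class CountDown(object):
    """Iterator class: __iter__ returns self, __next__ produces the values."""
    def __init__(self, n):
        self.n = n
    def __iter__(self):
        return self
    def __next__(self):
        if self.n <= 0:
            raise StopIteration
        self.n -= 1
        return self.n + 1
def count_down(n):
    """Generator function: the yield keyword makes Python build the iterator."""
    while n > 0:
        yield n
        n -= 1
# list(CountDown(3)) == list(count_down(3)) == [3, 2, 1]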
However, CPython (at least) implements all of its built-ins (and several standard library modules) in C. It's very easy to create iterator classes in C, but I haven't found any sensible way to create generator functions based on the Python C API. So the reason these iterators are implemented as classes (in CPython) might just be convenience or the lack of (fast or implementable) alternatives.
There is an additional reason to use classes instead of generators: you can implement special methods for classes, but you can't implement them on generator functions. That might not sound impressive, but it has definite advantages. For example, most iterators can be pickled (at least in Python 3.x) using the __reduce__ and __setstate__ methods, which means you can store them on disk and copy them. Since Python 3.4 some iterators also implement __length_hint__, which makes consuming these iterators with list (and similar) much faster.
Note that reversed could easily be implemented as a factory function (like iter), but unlike iter, which can return two different classes, reversed only ever returns one class.
To illustrate the possible (and unique) classes, consider a class that has no __iter__ and no __reversed__ method but is iterable and reverse-iterable (by implementing __getitem__ and __len__):
class A(object):
    def __init__(self, vals):
        self.vals = vals

    def __len__(self):
        return len(self.vals)

    def __getitem__(self, idx):
        return self.vals[idx]
And while it makes sense to add an abstraction layer (a factory function) in the case of iter - because the returned class depends on the number of input arguments:
>>> iter(A([1,2,3]))
<iterator at 0x2187afaed68>
>>> iter(min, 0) # actually this is a useless example, just here to see what it returns
<callable_iterator at 0x1333879bdd8>
That reasoning doesn't apply to reversed:
>>> reversed(A([1,2,3]))
<reversed at 0x2187afaec50>
What's the difference between reversed and sorted?
Interestingly, reversed is not a function, while sorted is.
Open a REPL session and type help(reversed):
class reversed(object)
| reversed(sequence) -> reverse iterator over values of the sequence
|
| Return a reverse iterator
It is indeed a class which is used to return a reverse iterator.
Okay, so reversed isn't a function. But why not?
This is a bit hard to answer. One explanation is that iterators have lazy evaluation. This requires some sort of container to store information about the current state of the iterator at any given time. This is best done through an object, and hence, a class.
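You can see that state directly; a reversed object remembers its position between calls to next():
>>> r = reversed([1, 2, 3])
>>> next(r)
3
>>> next(r)
2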
I have an assignment in a mooc where I have to code a function that returns the cumulative sum, cumulative product, max and min of an input list.
This part of the course was about functional programming, so I wanted to go all out on this, even though I can use other ways.
So I tried this:
from operator import mul
from itertools import repeat
from functools import reduce
def reduce2(l):
    print(l)
    return reduce(*l)

def numbers(l):
    return tuple(map(reduce2, zip([sum, mul, min, max], repeat(l, 4))))

l = [1, 2, 3, 4, 5]
numbers(l)
My problem is that it doesn't work. zip passes only one object to the mapped function, and each item of the zip is a tuple of (function, argument list l), so I defined reduce2 to unpack that tuple inside it, but it did not work.
Python raises a TypeError: 'int' object is not iterable
I thought I could use return reduce(l[0], l[1]) in reduce2, but I still get the same error.
I don't understand Python's behavior here.
If I merely use return reduce(l), it raises a different TypeError: reduce expected at least 2 arguments, got 1
What's happening here? How could I make it work?
Thanks for your help.
Effectively, you are trying to execute code like this:
xs = [1, 2, 3, 4, 5]
reduce(sum, xs)
But sum takes an iterable and isn't really compatible with direct use via reduce. Instead, you need a function that takes 2 arguments and returns their sum -- a function analogous to mul. You can get that from operator:
from operator import mul, add
Then just change sum to add in your program.
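Putting that together, a working version of the code from the question might look like this (only sum has been swapped for operator.add; the structure is unchanged):
from operator import mul, add
from itertools import repeat
from functools import reduce
def reduce2(xs):
    return reduce(*xs)
def numbers(xs):
    return tuple(map(reduce2, zip([add, mul, min, max], repeat(xs, 4))))
print(numbers([1, 2, 3, 4, 5]))   # (15, 120, 1, 5)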
BTW, functional programming has a variable naming convention that is really cool: x for one thing, and xs for a list of them. It's much better than the hard-to-read l variable name. Also it uses singular/plural to tell you whether you are dealing with a scalar value or a collection.
FMc's answer correctly diagnoses the error in your code. I just want to add a couple of alternatives to your map + zip approach.
For one, instead of defining a special version of reduce, you can use itertools.starmap instead of map, which is designed specifically for this purpose:
from itertools import starmap
def numbers(xs):
    return tuple(starmap(reduce, zip([add, mul, min, max], repeat(xs))))
However, even better would be to use the often ignored variadic version of map instead of manually zipping the arguments:
def numbers(xs):
    return tuple(map(reduce, [add, mul, min, max], repeat(xs)))
It essentially does the zip + starmap for you. In terms of functional programming, this version of map is analogous to Haskell's zipWith function.
Why was Python's filter designed such that if I run filter(my_predicate, some_set), I get back a list object rather than a set object?
Are there practical cases where you would not want the result to be a set...?
You can do a set comprehension.
{my_predicate(x) for x in some_set} # mapping
{x for x in some_set if my_predicate(x)} # filtering
such as
In [1]: s = set([1,2,3])
In [2]: {x%2 for x in s}
Out[2]: {0, 1}
Many of the "functional" functions in Python 2 are standardized on having list as the output type. This was just an API choice long ago. In itertools many of the same "functional" functions standardize on providing a generator from which you could populate whatever data structure you'd like. And in Python 3 they are standardized on providing an iterator.
But do also note that "filtering" in Python is not like it is in some other languages, like, say Haskell. It's not considered to be a transformation within the context of the data structure, and you don't choose to "endow" your data structures with "filterability" by making them an instance of Functor (or whatever other similar ideas exist in other languages).
As a result, it's a common use case in Python to say something like: "Here's a set, but I just want back all of the values less than 5. I don't care about their 'set-ness' after that point cause I'm just going to do some other work on them, so just give me a ____." No need to get all crazy about preserving the context within which the values originally lived.
In a dynamic typing culture this is very reasonable. But in a static typing culture where preserving the type during transformations might matter, this would be a bit frustrating. It's really just sort of a heuristic from Python's particular perspective.
If it was really just in a very narrow context of a set or tuple then I might just write a helper function:
def type_preserving_filter(predicate, data):
    return type(data)(filter(predicate, data))
such as
>>> type_preserving_filter(lambda x: x > 3, set([1,2,3,4,5,6,7,7]))
{4, 5, 6, 7}
>>> type_preserving_filter(lambda x: x > 3, list([1,2,3,4,5,6,7,7]))
[4, 5, 6, 7, 7]
>>> type_preserving_filter(lambda x: x > 3, tuple([1,2,3,4,5,6,7,7]))
(4, 5, 6, 7, 7)
which works in both Python 2.7 and Python 3.4. In Python 2 this feels a bit wasteful; constructing from the iterator in Python 3 is better.
This is not limited to filter(). But the API has changed in Python 3, where filter() now returns an iterator instead of a list. Quoting the python documentation:
Views And Iterators Instead Of Lists
Some well-known APIs no longer return lists:
...
map() and
filter()
return iterators. If you really need a list, a quick fix is e.g.
list(map(...)), but a better fix is often to use a list
comprehension (especially when the original code uses lambda), or
rewriting the code so it doesn’t need a list at all. Particularly
tricky is map() invoked for the side effects of the function; the
correct transformation is to use a regular for loop (since creating a
list would just be wasteful).
This article, written by the creator of Python, goes into the reasons for dropping filter() in Python 3 (which did not happen in the end, as you can see above, although the reasoning is still relevant):
The fate of reduce() in Python 3000
...
I think dropping filter() and map() is pretty uncontroversial;
filter(P, S) is almost always written clearer as [x for x in S if P(x)], and this has the huge advantage that the most common usages
involve predicates that are comparisons, e.g. x==42, and defining a
lambda for that just requires much more effort for the reader (plus
the lambda is slower than the list comprehension). Even more so for
map(F, S) which becomes [F(x) for x in S]. Of course, in many
cases you'd be able to use generator expressions instead.
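To make the comparison in the quote concrete, here is the same filtering written both ways (the list(...) call is needed in Python 3 because filter returns an iterator):
>>> xs = [41, 42, 43, 42]
>>> [x for x in xs if x == 42]
[42, 42]
>>> list(filter(lambda x: x == 42, xs))
[42, 42]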
What is considered to be the better programming practice when dealing with multiple objects at a time (but with the option to process just one object)?
A: LOOP INSIDE FUNCTION
The function can be called with one or more objects and iterates inside the function:
class Object:
    def __init__(self, a, b):
        self.var_a = a
        self.var_b = b

    var_a = ""
    var_b = ""

def func(obj_list):
    if type(obj_list) != list:
        obj_list = [obj_list]

    for obj in obj_list:
        # do whatever with an object
        print(obj.var_a, obj.var_b)

obj_list = [Object("a1", "a2"), Object("b1", "b2")]
obj_alone = Object("c1", "c2")

func(obj_list)
func(obj_alone)
B: LOOP OUTSIDE FUNCTION
The function deals with one object only; when dealing with multiple objects it must be called multiple times.
class Object:
    def __init__(self, a, b):
        self.var_a = a
        self.var_b = b

    var_a = ""
    var_b = ""

def func(obj):
    # do whatever with an object
    print(obj.var_a, obj.var_b)

obj_list = [Object("a1", "a2"), Object("b1", "b2")]
obj_alone = Object("c1", "c2")

for obj in obj_list:
    func(obj)

func(obj_alone)
I personally like the first one (A) more, because for me it makes for cleaner code when calling the function, but maybe it's not the right approach. Is one method generally better than the other? And if not, what are the pros and cons of each?
A function should have a defined input and output and follow the single responsibility principle. You need to be able to clearly define your function in terms of "I put foo in, I get bar back". The more qualifiers you need to make in this statement to properly describe your function probably means your function is doing too much. "I put foo in and get bar back, unless I put baz in then I also get bar back, unless I put a foo-baz in then it'll error".
In this particular case, you can pass an object or a list of objects. Try to generalise that to a value or a list of values. What if you want to pass a list as a value? Now your function behaviour is ambiguous. You want the single list object to be your value, but the function treats it as multiple arguments instead.
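A quick sketch of that ambiguity, using a simplified hypothetical func that just prints its input:
def func(values):
    # option A style: accept either a single value or a list of values
    if type(values) != list:
        values = [values]
    for value in values:
        print(value)
func("a")          # one value: prints "a"
func(["a", "b"])   # one list value, or two separate values? the function decides for you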
On the other hand, it's trivial to adapt a function which takes one argument to work on multiple values in practice. There's no reason to complicate the function's design by making it adaptable to multiple arguments. Write the function as simply and clearly as possible, and if you need it to work through a list of things then you can loop it through that list of things outside the function.
This might become clearer if you try to give an actual useful name to your function which describes what it does. Do you need to use plural or singular terms? foo_the_bar(bar) does something else than foo_the_bars(bars).
Move loops outside functions (when possible)
Generally speaking, keep loops that do nothing but iterate over the parameter outside of functions. This gives the caller maximum control and assumes the least about how the client will use the function.
The rule of thumb is to use the most minimal parameter complexity that the function needs to do its job.
For example, let's say you have a function that processes one item. You've anticipated that a client might conceivably want to process multiple items, so you changed the parameter to an iterable, baked a loop into the function, and are now returning a list. Why not? It could save the client from writing an ugly loop in the caller, you figure, and the basic functionality is still available -- and then some!
But this turns out to be a serious constraint. Now the caller needs to pack (and possibly unpack, if the function returns a list of results in addition to a list of arguments) that single item into a list just to use the function. This is confusing and potentially expensive on heap memory:
>>> def square(it): return [x ** 2 for x in it]
...
>>> square(range(6)) # you're thinking ...
[0, 1, 4, 9, 16, 25]
>>> result, = square([3]) # ... but the client just wants to square 1 number
>>> result
9
Here's a much better design for this particular function, intuitive and flexible:
>>> def square(x): return x ** 2
...
>>> square(3)
9
>>> [square(x) for x in range(6)]
[0, 1, 4, 9, 16, 25]
>>> list(map(square, range(6)))
[0, 1, 4, 9, 16, 25]
>>> (square(x) for x in range(6))
<generator object <genexpr> at 0x00000166D122CBA0>
>>> all(square(x) % 2 for x in range(6))
False
This brings me to a second problem with the functions in your code: they have a side-effect, print. I realize these functions are just for demonstration, but designing functions like this makes the example somewhat contrived. Functions typically return values rather than simply produce side-effects, and the parameters and return values are often related, as in the above example -- changing the parameter type bound us to a different return type.
When does it make sense to use an iterable argument? A good example is sort -- the smallest unit of operation for a sorting function is an iterable, so the problem of packing and unpacking in the square example above is a non-issue.
Following this logic a step further, would it make sense for a sort function to accept a list (or variable arguments) of lists? No -- if the caller wants to sort multiple lists, they should loop over them explicitly and call sort on each one, as in the second square example.
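For example, sorting several lists stays an explicit loop (or comprehension) in the caller:
>>> lists = [[3, 1, 2], [9, 7, 8]]
>>> [sorted(xs) for xs in lists]
[[1, 2, 3], [7, 8, 9]]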
Consider variable arguments
A nice feature that bridges the gap between iterables and single arguments is support for variable arguments, which many languages offer. This sometimes gives you the best of both worlds, and some functions go so far as to accept either args or an iterable:
>>> max([1, 3, 2])
3
>>> max(1, 3, 2)
3
One reason max is nice as a variable-argument function is that it's a reduction function, so you'll always get a single value as output. If it were a mapping or filtering function, the output would always be a list (or generator), so the input should be as well.
To take another example, a sort routine wouldn't make much sense with varargs because it's a classically in-place algorithm that works on lists, so you'd need to unpack the list into the arguments with the * operator pretty much every time you invoke the function -- not cool.
There's no real need for a call like sort(1, 3, 4, 2) as there is with max, where the parameters are just as likely to be loose variables as they are a packed iterable. Varargs are usually used when you have a small number of arguments, or the thing you're unpacking is a small pair or tuple-type element, as often the case with zip.
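For instance, zip is typically fed a sequence of small tuples that get unpacked with the * operator:
>>> pairs = [(1, 'a'), (2, 'b'), (3, 'c')]
>>> numbers, letters = zip(*pairs)
>>> numbers
(1, 2, 3)
>>> letters
('a', 'b', 'c')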
There's definitely a "feel" to when to offer parameters as varargs, an iterable, or a single value (i.e. let the caller handle looping), but as long as you follow the rule of avoiding iterables unless they're essential to the function, it's hard to go wrong.
As a final tip, try to write your functions with similar contracts to the library functions in your language or the tools you use frequently. These are pretty much always designed well; mimic good design.
If you implement B then you will make it harder for yourself to achieve A.
If you implement A then it isn't too difficult to achieve B. You also have many tools already available to apply this function to a list of arguments (the loop method you described, using something like map, or even a multiprocessing approach if needed)
Therefore I would choose to implement A, and if it makes things neater or easier in a given case you can think about also implementing B (using A) so that you have both.
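A minimal sketch of that last suggestion, reusing the option A function (the names are just illustrative):
def func(obj_list):
    # option A: accepts a single object or a list of objects
    if type(obj_list) != list:
        obj_list = [obj_list]
    for obj in obj_list:
        print(obj.var_a, obj.var_b)
def func_single(obj):
    # option B, implemented in terms of A
    return func(obj)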