What is the quickest way to hash a large arbitrary object? - python

I am writing a method to generate cache keys for caching function results; the key is based on a combination of the function name and a hash of the parameters.
Currently I am using hashlib to hash a serialized version of the parameters, but serializing large objects is very expensive. What is the alternative?
#get the cache key for storage
def cache_get_key(*args):
    import hashlib
    serialise = []
    for arg in args:
        serialise.append(str(arg))
    key = hashlib.md5("".join(serialise)).hexdigest()
    return key
UPDATE:
I tried using hash(str(args)), but if args contains relatively large data, it still takes a long time to compute the hash value. Is there a better way to do it?
Actually, str(args) alone takes forever on large data...

Assuming you made the object, and it is composed of smaller components (it is not a binary blob), you can precompute the hash when you build the object by using the hashes of its subcomponents.
For example, rather than serialize(repr(arg)), do arg.precomputedHash if isinstance(arg, ...) else serialize(repr(arg))
If you neither make your own objects nor use hashable objects, you can perhaps keep a memoization table of object references -> hashes, assuming you don't mutate the objects. Worst case, you can use a functional language which allows for memoization, since all objects in such a language are probably immutable and hence hashable.
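A minimal sketch of such a memo table (my own illustration; it assumes objects are never mutated after they are first hashed, and that id() reuse after garbage collection is not a concern):

# Minimal sketch: memoize expensive hashes per object identity.
# Assumes objects are not mutated after the first call; id() values
# can be reused once an object is garbage-collected, so this is illustrative only.
_hash_memo = {}

def memo_hash(obj):
    key = id(obj)
    if key not in _hash_memo:
        _hash_memo[key] = hash(str(obj))  # or any other expensive hash
    return _hash_memo[key]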

def cache_get_key(*args):
    return hash(str(args))
or (if you really want to use the hashlib library)
def cache_get_key(*args):
    return hashlib.md5(str(args)).hexdigest()
I wouldn't bother rewriting code to make arrays into strings. Use the inbuilt one.
Alternative solution
Below is the solution #8bitwide suggested. No explicit hashing is required at all with this approach (note that the argument keys must be tuples, since lists are not hashable):
def foo(x, y):
    return x + y + 1

result1 = foo(1, 1)
result2 = foo(2, 3)

results = {}
results[foo] = {}
results[foo][(1, 1)] = result1  # argument tuples (not lists) can be dict keys
results[foo][(2, 3)] = result2

Have you tried just using the hash function? It works perfectly well on tuples.
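For example (illustrative only), tuples of hashable values hash directly, while a tuple containing a list or dict does not:

# Tuples of hashable values work directly with the built-in hash():
print(hash((1, "a", (2, 3))))

# ...but a tuple containing an unhashable value (list, dict, set) raises TypeError:
try:
    hash((1, [2, 3]))
except TypeError as e:
    print(e)  # unhashable type: 'list'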

I know this question is old, but I just want to add my 2 cents:
You don't have to create a list and then join it, especially if the list is going to be discarded anyway. Use the hash object's .update() method.
Consider using a much faster non-cryptographic hash algorithm, especially if this is not meant to be a cryptographically secure implementation.
Having said that, this is my suggested improvement:
import xxhash

# get the cache key for storage
def cache_get_key(*args):
    hasher = xxhash.xxh3_64()
    for arg in args:
        hasher.update(str(arg).encode("utf-8"))  # feed the bytes of each argument's string form
    return hasher.hexdigest()
This uses the (claimed to be) extremely fast xxHash NCHF*.
* NCHF = Non-Cryptographic Hash Function

I've seen people feed an arbitrary python object to random.seed(), and then use the first value back from random.random() as the "hash" value. It doesn't give a terrific distribution of values (can be skewed), but it seems to work for arbitrary objects.
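A rough sketch of that trick (my own illustration; seeding with str(obj) so the result depends on the object's contents rather than its identity, with the caveat that str() of a large object is itself expensive):

import random

def seed_hash(obj):
    # Illustrative only: seed the PRNG from a string representation of the
    # object and use the first random value as a pseudo-hash. The distribution
    # can be skewed, and this is not a cryptographic hash.
    random.seed(str(obj))
    return random.random()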
If you don't need cryptographic-strength hashes, I came up with a pair of hash functions for a list of integers that I use in a bloom filter. They appear below. The bloom filter actually uses linear combinations of these two hash functions to obtain an arbitrarily large number of hash functions, but they should work fine in other contexts that just need a bit of scattering with a decent distribution. They're inspired by Knuth's writing on Linear Congruential Random Number Generation. They take a list of integers as input, which I believe could just be the ord()'s of your serialized characters.
MERSENNES1 = [2 ** x - 1 for x in [17, 31, 127]]
MERSENNES2 = [2 ** x - 1 for x in [19, 67, 257]]

def simple_hash(int_list, prime1, prime2, prime3):
    '''Compute a hash value from a list of integers and 3 primes'''
    result = 0
    for integer in int_list:
        result += ((result + integer + prime1) * prime2) % prime3
    return result

def hash1(int_list):
    '''Basic hash function #1'''
    return simple_hash(int_list, MERSENNES1[0], MERSENNES1[1], MERSENNES1[2])

def hash2(int_list):
    '''Basic hash function #2'''
    return simple_hash(int_list, MERSENNES2[0], MERSENNES2[1], MERSENNES2[2])
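For example, a caller might feed in the ord() values of a serialized object (my own usage sketch, assuming the hash1/hash2 definitions above):

# Usage sketch: hash the ord() values of a serialized object.
serialized = str({'a': 1, 'b': [2, 3]})
int_list = [ord(c) for c in serialized]
print(hash1(int_list), hash2(int_list))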

Related

Is there an efficient way to fill a numba `Dict` in parallel?

I'm having some trouble quickly filling a numba Dict object with key-value pairs (around 63 million of them). Is there an efficient way to do this in parallel?
The documentation (https://numba.pydata.org/numba-doc/dev/reference/pysupported.html#typed-dict) is clear that numba.typed.Dict is not thread-safe, so I think using prange with a single Dict object would be a bad idea. I've tried using a numba List of Dicts, populating them in parallel and then stitching them together using update, but I think this last step is also inefficient.
Note that one thing (which may be important) is that all the keys are unique, i.e. once assigned, a key will not be reassigned a value. I think this property makes the problem amenable to an efficient parallelised solution.
Below is an example of the serial approach, which is slow with a large number of key-value pairs.
d = typed.Dict.empty(
    key_type=types.UnicodeCharSeq(128), value_type=types.int64
)

@njit
def fill_dict(keys_list, values_list, d):
    n = len(keys_list)
    for i in range(n):
        d[keys_list[i]] = values_list[i]

fill_dict(keys_list, values_list, d)
Can anybody help me?
Many thanks.
You don't have to stitch them together if you preprocess each key into an integer and use that integer modulo num_shards to pick a shard.

# assuming hash() returns an arbitrary integer computed from the key
# lookup
shard = hash(key) % num_shards
selected_dictionary = dictionary[shard]
value = selected_dictionary[key]

# inserting (lock only the selected shard)
shard = hash(key) % num_shards
selected_dictionary = dictionary[shard]
selected_dictionary[key] = value

The hash could be something as simple as the sum of the ASCII codes of the characters in the key. The modulo-based indexing separates the keys into blocks that can be worked on independently, with no extra processing beyond the hashing.
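A minimal pure-Python sketch of that sharding idea (my own illustration, not numba-specific; a numba version might keep one typed.Dict per shard and fill each shard inside its own prange iteration):

# Pure-Python illustration of sharded dictionaries.
num_shards = 8
shards = [dict() for _ in range(num_shards)]

def fill_sharded(keys_list, values_list):
    for key, value in zip(keys_list, values_list):
        shards[hash(key) % num_shards][key] = value

def sharded_lookup(key):
    return shards[hash(key) % num_shards][key]

fill_sharded(["a", "b", "c"], [1, 2, 3])
print(sharded_lookup("b"))  # 2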

How to choose between a dictionary with integer keys in range(n) and a list of length n?

Short version of the question:
When comparing a dictionary whose keys are integers in range(n) with a list of length n, what are the key factors that should drive the choice between one or the other? Things like "if you are doing a lot of this thing with your object, then a dictionary is better".
Long version of the question
I'm not sure if the following details of my implementation matter for the question... So here it is.
In trying to make my code a bit more pythonic, I implemented a subclass of UserList that accepts as index both an integer and a list that represents an integer in base l.
from collections import UserList

class MyList(UserList):
    """
    A list that can be accessed both by a g-tuple of coefficients in range(l)
    or the corresponding integer.
    """
    def __init__(self, data=None, l=2, g=None):
        self.l = l
        if data is None:
            if g is None:
                raise ValueError
            self.data = [0] * (l ** g)
        else:
            self.data = data

    def __setitem__(self, key, value):
        if isinstance(key, int):
            self.data[key] = value
        else:
            self.data[self.idx(key)] = value

    def __getitem__(self, key):
        if isinstance(key, int):
            return self.data[key]
        return self.data[self.idx(key)]

    def idx(self, key):
        l = self.l
        idx = 0
        for i, value in enumerate(key):
            idx += value * l ** i
        return idx
Which can be used like this:
L = MyList(l=4, g=2) #creates a list of length 4**2 initialized at zero
L[9] = 'Hello World'
L[9] == L[1,2]
I have generalized this class to also accept l to be a tuple of bases (let's call this generalized class MyListTuple); the code is in SageMath, so I don't really want to translate it to pure Python here, but it works great.
It would look something like this:
L = MyListTuple(l=[2,4], g=2) #creates a list of length 2^2*4^2 initialized at zero
L[0,9] = 'Hello World'
L[0,9] == L[[0,0],[1,2]]
The next part I want to improve: I currently use a dictionary whose keys are tuples of integers (so you would access it as d[9,13,0]), but I also want to be able to use, as (equivalent) keys, lists representing the integers in base l as above (so for l=4 that would be d[[1,2], [1,3], [0,0]]).
This is very similar to what I have done in MyListTuple, but in this case a lot of the keys are never used.
So my question is: how should I choose between creating a subclass of UserDict that handles keys the same way MyListTuple does, and just using MyListTuple even though in most cases most entries will never be used?
Or, as I phrased it above, what are the usage details of this structure that I should look at to choose between the two? (Things like "if you are doing a lot of this thing with your object, then a dictionary is better".)
(I will only try to address the general "list vs dict" part.
Take this with a grain of salt; this comes from a user, not an implementer.
This is not a real answer, more of a big comment.)
A list is a dynamic array of object references (not a linked list), so indexing is O(1) and appending or popping at the end is amortized O(1); inserting or deleting anywhere else is O(n), since the following elements have to be shifted.
Searching by value is inefficient (O(n), check all items, plus cache misses* from bad locality of reference).
(*vs items stored contiguously in memory (e.g. numpy.array)).
A dict (a hash map) should theoretically provide efficient searches, insertions and deletions (amortized O(1));
but that may depend on the quality of the hash function, the load factor, the usage patterns, etc. (I don't know enough).
Iterating through all items sequentially will be inefficient for both, due to cache misses / bad locality of reference (following pointers, instead of accessing memory sequentially).
As far as I know:
You would use lists as mutable sequences (when you need to iterate over all items or index by position) in Python, for lack of a better alternative (C arrays, C++ std::array/std::vector, etc.).
You would use dicts for quick lookup/search based on keys, when searching matters more or happens more often than insertion/deletion.
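To make the sparse-usage trade-off from the question concrete, here is a small sketch (my own illustration; sys.getsizeof counts only the container itself, not the stored objects):

import sys

n = 4 ** 8                                   # size of the dense index space
dense = [0] * n                              # list: one slot for every possible key
sparse = {i: 0 for i in range(0, n, 1000)}   # dict: only the keys actually used

print(sys.getsizeof(dense))   # pays for every slot
print(sys.getsizeof(sparse))  # pays only for the entries present (plus overhead)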

Python heapq vs sorted speed for pre-sorted lists

I have a reasonably large number n=10000 of sorted lists of length k=100 each. Since merging two sorted lists takes linear time, I would imagine it's cheaper to recursively merge the sorted lists (total length O(nk)) with heapq.merge(), in a tree of depth log(n), than to sort the entire thing at once with sorted() in O(nk log(nk)) time.
However, the sorted() approach seems to be 17-44x faster on my machine. Is the implementation of sorted() so much faster than heapq.merge() that it outstrips the asymptotic advantage of the classic merge?
import itertools
import heapq

data = [range(n*8000, n*8000+10000, 100) for n in range(10000)]

# Approach 1
for val in heapq.merge(*data):
    test = val

# Approach 2
for val in sorted(itertools.chain(*data)):
    test = val
CPython's list.sort() uses an adaptive merge sort, which identifies natural runs in the input, and then merges them "intelligently". It's very effective at exploiting many kinds of pre-existing order. For example, try sorting range(N)*2 (in Python 2) for increasing values of N, and you'll find the time needed grows linearly in N.
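A rough Python 3 version of that experiment (my own sketch; absolute times will vary by machine, but the growth should look roughly linear):

# Timing sketch: data consisting of two ascending runs back to back.
# list.sort() detects the natural runs, so the time grows roughly linearly in N.
import time

for N in (10**5, 10**6, 2 * 10**6):
    data = list(range(N)) * 2
    t0 = time.perf_counter()
    data.sort()
    print(N, round(time.perf_counter() - t0, 4), "s")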
So the only real advantage of heapq.merge() in this application is lower peak memory use if you iterate over the results (instead of materializing an ordered list containing all the results).
In fact, list.sort() is taking more advantage of the structure in your specific data than the heapq.merge() approach. I have some insight into this, because I wrote Python's list.sort() ;-)
(BTW, I see you already accepted an answer, and that's fine by me - it's a good answer. I just wanted to give a bit more info.)
ABOUT THAT "more advantage"
As discussed a bit in comments, list.sort() plays lots of engineering tricks that may cut the number of comparisons needed over what heapq.merge() needs. It depends on the data. Here's a quick account of what happens for the specific data in your question. First define a class that counts the number of comparisons performed (note that I'm using Python 3, so have to account for all possible comparisons):
class V(object):
    def __init__(self, val):
        self.val = val
    def __lt__(a, b):
        global ncmp
        ncmp += 1
        return a.val < b.val
    def __eq__(a, b):
        global ncmp
        ncmp += 1
        return a.val == b.val
    def __le__(a, b):
        raise ValueError("unexpected comparison")
    __ne__ = __gt__ = __ge__ = __le__
sort() was deliberately written to use only < (__lt__). It's more of an accident in heapq (and, as I recall, even varies across Python versions), but it turns out .merge() only required < and ==. So those are the only comparisons the class defines in a useful way.
Then changing your data to use instances of that class:
data = [[V(i) for i in range(n*8000, n*8000+10000, 100)]
        for n in range(10000)]
Then run both methods:
ncmp = 0
for val in heapq.merge(*data):
    test = val
print(format(ncmp, ","))

ncmp = 0
for val in sorted(itertools.chain(*data)):
    test = val
print(format(ncmp, ","))
The output is kinda remarkable:
43,207,638
1,639,884
So sorted() required far fewer comparisons than merge(), for this specific data. And that's the primary reason it's much faster.
LONG STORY SHORT
Those comparison counts looked too remarkable to me ;-) The count for heapq.merge() looked about twice as large as I thought reasonable.
Took a while to track this down. In short, it's an artifact of the way heapq.merge() is implemented: it maintains a heap of 3-element list objects, each containing the current next value from an iterable, the 0-based index of that iterable among all the iterables (to break comparison ties), and that iterable's __next__ method. The heapq functions all compare these little lists (instead of just the iterables' values), and list comparison always goes thru the lists first looking for the first corresponding items that are not ==.
So, e.g., asking whether [0] < [1] first asks whether 0 == 1. It's not, so then it goes on to ask whether 0 < 1.
Because of this, each < comparison done during the execution of heapq.merge() actually does two object comparisons (one ==, the other <). The == comparisons are "wasted" work, in the sense that they're not logically necessary to solve the problem - they're just "an optimization" (which happens not to pay in this context!) used internally by list comparison.
So in some sense it would be fairer to cut the report of heapq.merge() comparisons in half. But it's still way more than sorted() needed, so I'll let it drop now ;-)
sorted uses an adaptive mergesort that detects sorted runs and merges them efficiently, so it gets to take advantage of all the same structure in the input that heapq.merge gets to use. Also, sorted has a really nice C implementation with a lot more optimization effort put into it than heapq.merge.
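For completeness, a minimal way to measure the two approaches side by side (my own sketch with the sizes scaled down; exact numbers depend on the machine and Python version):

import heapq
import itertools
import timeit

# Smaller version of the question's data: 1000 sorted runs of 100 ints each.
data = [list(range(n * 8000, n * 8000 + 10000, 100)) for n in range(1000)]

t_merge = timeit.timeit(lambda: list(heapq.merge(*data)), number=5)
t_sorted = timeit.timeit(lambda: sorted(itertools.chain(*data)), number=5)
print("heapq.merge:", round(t_merge, 3), "s   sorted:", round(t_sorted, 3), "s")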

Python: time/platform independent fast hash for big sets of pairs of ints

I would like a time- and platform-independent hash function for big sets of pairs of integers in Python, which is also fast and has (almost certainly) no collisions. (Hm, what else would you want a hash to be, but anyway...)
What I have so far is to use hashlib.md5 on the string representation of the sorted list:
my_set = set([(1,2), (0,3), (1,3)])  # the input set, size 1...10^6

import hashlib

def MyHash(my_set):
    my_lst = sorted(my_set)
    my_str = str(my_lst)
    return hashlib.md5(my_str).hexdigest()
my_set contains between 1 and 10^5 pairs, and each int is between 0 and 10^6. In total, I have about 10^8 such sets on which the hash should be almost certainly unique.
Does this sound reasonable, or is there a better way of doing it?
On my example set with 10^6 pairs in the list, this takes about 2.5 sec, so improvements in time would be welcome, if possible. Almost all of the time is spent computing the string of the sorted list, so a big part of the question is:
Is the string of a sorted list of tuples of integers in python stable among versions and platforms? Is there a better/faster way of obtaining a stable string representation?
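One way to avoid building the big string entirely is to pack the sorted pairs into a fixed binary layout before hashing; the byte layout is defined by the struct format string, so it does not depend on how any particular Python version formats str(list). A sketch (my own, assuming every integer fits in an unsigned 32-bit field, which holds for values up to 10^6):

import hashlib
import struct

def my_hash(pair_set):
    # Pack each (a, b) pair as two little-endian unsigned 32-bit ints.
    # The byte layout is fixed by the format string "<II", so it is stable
    # across platforms and Python versions (assuming 0 <= a, b < 2**32).
    buf = b"".join(struct.pack("<II", a, b) for a, b in sorted(pair_set))
    return hashlib.md5(buf).hexdigest()

print(my_hash({(1, 2), (0, 3), (1, 3)}))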

Efficient functional list iteration in Python

So suppose I have an array of some elements, and each element has some number of properties.
I need to filter out of this list some subsets of values determined by predicates. These subsets can of course intersect.
I also need to determine the number of values in each such subset.
So using an imperative approach I could write code like the following, with a running time of about 2*n: one iteration to copy the array and another to filter it and count the subset sizes.
from itertools import groupby

a = [{'some_number': i, 'some_time': str(i) + '0:00:00'} for i in range(10)]

# imperative style
wrong_number_count = 0
wrong_time_count = 0
for item in a[:]:
    if predicate1(item):
        delete_original(item, a)
        wrong_number_count += 1
    if predicate2(item):
        delete_original(item, a)
        wrong_time_count += 1
    update_some_data(item)
do_something_with_filtered(a, wrong_number_count, wrong_time_count)

def do_something_with_filtered(a, c1, c2):
    print('filtered a {}'.format(a))
    print('{} items had wrong number'.format(c1))
    print('{} items had wrong time'.format(c2))

def predicate1(x):
    return x['some_number'] < 3

def predicate2(x):
    return x['some_time'] < '50:00:00'
Somehow I can't think of a way to do this in Python in a functional style with the same running time.
In a functional style I could probably use groupby multiple times, or write a comprehension for each predicate, but that would obviously be slower than the imperative approach.
I think such a thing is possible in Haskell using stream fusion (am I right?)
But how do that in Python?
Python has strong support for "stream processing" in the form of its iterators, and what you ask seems trivial to do. You just have to have a way to group your predicates and associate counters with them; it could be a dictionary where the predicate itself is the key.
That said, a simple iterator function that takes in your predicate data structure, along with the data to be processed, could do what you want. The iterator would have the side effect of changing your data structure with the predicate information. If you want "pure functions" you'd just have to duplicate the predicate information beforehand, and maybe pass and retrieve all the predicate and counter values to the iterator (through the send method) for each element; I don't think it would be worth that level of purism.
That said, you could have your code look something like:
from collections import OrderedDict

def predicate1(...):
    ...

...

def predicateN(...):
    ...

def do_something_with_filtered(item):
    ...

def multifilter(data, predicates):
    for item in data:
        for predicate in predicates:
            if predicate(item):
                predicates[predicate] += 1
                break
        else:
            yield item

def do_it(data):
    predicates = OrderedDict([(predicate1, 0), ..., (predicateN, 0)])
    for item in multifilter(data, predicates):
        do_something_with_filtered(item)
    for predicate, value in predicates.items():
        print("{} filtered out {} items".format(predicate.__name__, value))

a = ...
do_it(a)
(If you have to count an item for all predicates that it fails, then an obvious change from the "break" statement to a state flag variable is enough)
Yes, fusion in Haskell will often turn something written as two passes into a single pass. Though in the case of lists, it's actually foldr/build fusion rather than stream fusion.
That's not generally possible in languages that don't enforce purity, though. When side effects are involved, it's no longer correct to fuse multiple passes into one. What if each pass performed output? Unfused, you get all the output from each pass separately. Fused, you get the output from both passes interleaved.
It's possible to write a fusion-style framework in Python that will work correctly if you promise to only ever use it with pure functions. But I'm doubtful such a thing exists at the moment. (I'd love to be proven wrong, though.)
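A tiny Python illustration of that interleaving point (my own sketch): because generators are lazy, chaining two generator "passes" already interleaves their side effects, which is exactly what fusing two eager passes would do.

def double(xs):
    for x in xs:
        print("pass 1 saw", x)
        yield 2 * x

def add_one(xs):
    for x in xs:
        print("pass 2 saw", x)
        yield x + 1

# Two eager passes would print all "pass 1" lines, then all "pass 2" lines.
# The chained (effectively fused) generators interleave them instead:
list(add_one(double([1, 2])))
# pass 1 saw 1
# pass 2 saw 2
# pass 1 saw 2
# pass 2 saw 4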
