Is there an efficient way to fill a numba `Dict` in parallel? - python

I'm having some trouble quickly filling a numba Dict object with key-value pairs (around 63 million of them). Is there an efficient way to do this in parallel?
The documentation (https://numba.pydata.org/numba-doc/dev/reference/pysupported.html#typed-dict) is clear that numba.typed.Dict is not thread-safe, so I think using prange with a single Dict object would be a bad idea. I've tried using a numba List of Dicts, populating them in parallel and then stitching them together with update, but I think this last step is also inefficient.
Note that one thing (which may be important) is that all the keys are unique, i.e. once assigned, a key will not be reassigned a value. I think this property makes the problem amenable to an efficient parallelised solution.
Below is an example of the serial approach, which is slow with a large number of key-value pairs.
from numba import njit, typed, types

d = typed.Dict.empty(
    key_type=types.UnicodeCharSeq(128), value_type=types.int64
)

@njit
def fill_dict(keys_list, values_list, d):
    n = len(keys_list)
    for i in range(n):
        d[keys_list[i]] = values_list[i]

fill_dict(keys_list, values_list, d)
Can anybody help me?
Many thanks.

You don't have to stitch them together if you preprocess each key into an integer and use that integer modulo num_shard to decide which dict (shard) the key belongs to.
# assuming hash() returns an arbitrary integer computed from the key

# lookup
shard = hash(key) % num_shard
selected_dictionary = dictionary[shard]
value = selected_dictionary[key]

# inserting: only the selected shard's dict is touched,
# so shards never interfere with each other
shard = hash(key) % num_shard
selected_dictionary = dictionary[shard]
selected_dictionary[key] = value
The hash could be something as simple as the sum of the ASCII codes of the characters in the key. The modulo-based indexing partitions the keys into independent shards, so each shard can be filled without coordinating with the others; the only extra work is the hashing.
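Here is a minimal sketch of that sharding idea with numba's prange (not the poster's code; it assumes unicode_type keys rather than UnicodeCharSeq(128), and relies on numba's built-in hash() for strings). Each prange iteration owns exactly one shard Dict, so no Dict is ever shared between threads:

import numpy as np
from numba import njit, prange, typed, types

@njit(parallel=True)
def fill_sharded(keys_list, values_list, shards, num_shards):
    for s in prange(num_shards):
        d = shards[s]
        for i in range(len(keys_list)):
            # Python-style % keeps the shard index non-negative
            # even when hash() returns a negative value
            if hash(keys_list[i]) % num_shards == s:
                d[keys_list[i]] = values_list[i]

num_shards = 8
shards = typed.List()
for _ in range(num_shards):
    shards.append(
        typed.Dict.empty(key_type=types.unicode_type, value_type=types.int64)
    )

keys_list = typed.List(["key_" + str(i) for i in range(1000)])
values_list = np.arange(1000, dtype=np.int64)
fill_sharded(keys_list, values_list, shards, num_shards)

# lookups pick the shard the same way (numba's string hashing follows CPython's)
key = "key_42"
print(shards[hash(key) % num_shards][key])  # -> 42

The trade-off is that every shard scans the full key list, so the total work grows with the number of shards even though the insertions themselves run in parallel.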

Related

Most efficient way to get first value that startswith of large list

I have a very large list with over 100M strings. An example of that list looks as follows:
l = ['1,1,5.8067',
'1,2,4.9700',
'2,2,3.9623',
'2,3,1.9438',
'2,7,1.0645',
'3,3,8.9331',
'3,5,2.6772',
'3,7,3.8107',
'3,9,7.1008']
I would like to get the first string that starts with e.g. '3'.
To do so, I have used a lambda iterator followed by next() to get the first item:
next(filter(lambda i: i.startswith('3,'), l))
Out[1]: '3,3,8.9331'
Considering the size of the list, this strategy unfortunately still takes a relatively long time for a process I have to repeat over and over again. I was wondering if someone could come up with an even faster, more efficient approach. I am open to alternative strategies.
I have no way of testing it myself, but it may be faster to join all the strings with a character that does not appear in any of them:
concat_list = '$'.join(l)
and then use a simple .find('$3,'). This can be faster when the strings are relatively short, since the whole text then sits in one contiguous block of memory.
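A small sketch of that join-and-find idea on the example list (the leading '$' makes every entry, including the first, start with the separator):

l = ['1,1,5.8067', '1,2,4.9700', '2,2,3.9623', '3,3,8.9331', '3,5,2.6772']
concat_list = '$' + '$'.join(l)

start = concat_list.find('$3,')             # first entry starting with '3,'
if start != -1:
    end = concat_list.find('$', start + 1)  # end of that entry (-1 if it is the last one)
    match = concat_list[start + 1:end if end != -1 else None]
    print(match)                            # -> '3,3,8.9331'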
If the number of unique letters in the text is small you can use the Abrahamson-Kosaraju method and get a time complexity of practically O(n).
Another approach is to use joblib: create n threads, where the i-th thread checks the elements at indices i + k * n; when one thread finds the pattern it stops the others. The running time is then roughly that of the naive algorithm divided by n.
Since your actual strings consist of relatively short tokens (such as 301) after splitting the strings by tabs, you can build a dict keyed on each possible prefix of the first token, so that subsequent lookups take only O(1) average time.
Build the dict with values of the list in reverse order so that the first value in the list that start with each distinct character will be retained in the final dict:
d = {s[:i + 1]: s for s in reversed(l) for i in range(len(s.split('\t')[0]))}
so that given:
l = ['301\t301\t51.806763\n', '301\t302\t46.970094\n',
'301\t303\t39.962393\n', '301\t304\t18.943836\n',
'301\t305\t11.064584\n', '301\t306\t4.751911\n']
d['3'] will return '301\t301\t51.806763\n'.
If you only need to test each of the first tokens as a whole, rather than prefixes, you can simply make the first tokens as the keys instead:
d = {s.split('\t')[0]: s for s in reversed(l)}
so that d['301'] will return '301\t301\t51.806763\n'.

How to calculate the Euclidean of a dictionary with a tuple as key

I have created a matrix by using a dictionary with a tuple as the key (e.g. {(user, place) : 1 } )
I need to calculate the Euclidean for each place in the matrix.
I've created a method to do this, but it is extremely inefficient because it iterates through the entire matrix for each place.
def calculateEuclidian(self, place):
    count = 0
    for key, value in self.matrix.items():
        if key[1] == place and value == 1:
            count += 1
    euclidian = math.sqrt(count)
    return euclidian
Is there a way to do this more efficiently?
I need the result to be in a dictionary with the place as the key and the Euclidean as the value.
You can sum the results of the conditionals (each True/False counts as 1/0) in a single generator expression, which is typically faster than an explicit for loop, and return the square root of that count in a dictionary keyed by the place:
def calculateEuclidian(self, place):
    count = sum(p == place and val == 1 for (_, p), val in self.matrix.items())
    return {place: math.sqrt(count)}
With your current data structure, I doubt there is any way you can avoid iterating through the entire dictionary.
If you cannot use another way (or an auxiliary way) of representing your data, iterating through every element of the dict is as efficient as you can get (asymptotically), since there is no way to ask a dict with tuple keys to give you all elements with keys matching (_, place) (where _ denotes "any value"). There are other, and more succinct, ways of writing the iteration code, but you cannot escape the asymptotic efficiency limitation.
If this is your most common operation, and you can in fact use another way of representing your data, you can use a dict[Place, list[User]] instead. That way, you can, in O(1) time, get the list of all users at a certain place, and all you would need to do is count the items in the list using the len(...) function which is also O(1). Obviously, you'll still need to take the sqrt in the end.
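A minimal sketch of that restructuring (the toy data and names are illustrative, assuming the same {(user, place): value} layout as in the question):

import math
from collections import defaultdict

matrix = {('alice', 'paris'): 1, ('bob', 'paris'): 1, ('alice', 'rome'): 0}

# build the auxiliary place -> [users] mapping once
users_at_place = defaultdict(list)
for (user, place), value in matrix.items():
    if value == 1:
        users_at_place[place].append(user)

def calculate_euclidean(place):
    # len() is O(1), so a query no longer scans the whole matrix
    return math.sqrt(len(users_at_place[place]))

print(calculate_euclidean('paris'))  # -> 1.414...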
There may be ways to make it more Pythonic, but I do not think you can change the overall complexity since you are making a query based off both key and value. I think you have to search the whole matrix for your instances.
You may want to build a second dictionary from your current one (which isn't suited to this kind of search), with place as the key and a list of (user, value) tuples as the value.
Get the tuple list under the place key (that'll be fast), then count the entries where the value is 1 (linear, but over a small set of data).
Keep the original dictionary for the Euclidean computation. This works best if you don't change the data too often in the program, because you'd need to keep both dicts in sync.

Python heapq vs sorted speed for pre-sorted lists

I have a reasonably large number n=10000 of sorted lists of length k=100 each. Since merging two sorted lists takes linear time, I would imagine it's cheaper to recursively merge the n sorted lists (total length nk) with heapq.merge() in a tree of depth log(n), for roughly O(nk*log(n)) work, than to sort the entire thing at once with sorted() in O(nk*log(nk)) time.
However, the sorted() approach seems to be 17-44x faster on my machine. Is the implementation of sorted() that much faster than heapq.merge() that it outstrips the asymptotic time advantage of the classic merge?
import itertools
import heapq

data = [range(n * 8000, n * 8000 + 10000, 100) for n in range(10000)]

# Approach 1
for val in heapq.merge(*data):
    test = val

# Approach 2
for val in sorted(itertools.chain(*data)):
    test = val
CPython's list.sort() uses an adaptive merge sort, which identifies natural runs in the input, and then merges them "intelligently". It's very effective at exploiting many kinds of pre-existing order. For example, try sorting range(N)*2 (in Python 2) for increasing values of N, and you'll find the time needed grows linearly in N.
So the only real advantage of heapq.merge() in this application is lower peak memory use if you iterate over the results (instead of materializing an ordered list containing all the results).
In fact, list.sort() is taking more advantage of the structure in your specific data than the heapq.merge() approach. I have some insight into this, because I wrote Python's list.sort() ;-)
(BTW, I see you already accepted an answer, and that's fine by me - it's a good answer. I just wanted to give a bit more info.)
ABOUT THAT "more advantage"
As discussed a bit in comments, list.sort() plays lots of engineering tricks that may cut the number of comparisons needed over what heapq.merge() needs. It depends on the data. Here's a quick account of what happens for the specific data in your question. First define a class that counts the number of comparisons performed (note that I'm using Python 3, so have to account for all possible comparisons):
class V(object):
    def __init__(self, val):
        self.val = val

    def __lt__(a, b):
        global ncmp
        ncmp += 1
        return a.val < b.val

    def __eq__(a, b):
        global ncmp
        ncmp += 1
        return a.val == b.val

    def __le__(a, b):
        raise ValueError("unexpected comparison")

    __ne__ = __gt__ = __ge__ = __le__
sort() was deliberately written to use only < (__lt__). It's more of an accident in heapq (and, as I recall, even varies across Python versions), but it turns out .merge() only required < and ==. So those are the only comparisons the class defines in a useful way.
Then changing your data to use instances of that class:
data = [[V(i) for i in range(n * 8000, n * 8000 + 10000, 100)]
        for n in range(10000)]
Then run both methods:
ncmp = 0
for val in heapq.merge(*data):
    test = val
print(format(ncmp, ","))

ncmp = 0
for val in sorted(itertools.chain(*data)):
    test = val
print(format(ncmp, ","))
The output is kinda remarkable:
43,207,638
1,639,884
So sorted() required far fewer comparisons than merge(), for this specific data. And that's the primary reason it's much faster.
LONG STORY SHORT
Those comparison counts looked too remarkable to me ;-) The count for heapq.merge() looked about twice as large as I thought reasonable.
Took a while to track this down. In short, it's an artifact of the way heapq.merge() is implemented: it maintains a heap of 3-element list objects, each containing the current next value from an iterable, the 0-based index of that iterable among all the iterables (to break comparison ties), and that iterable's __next__ method. The heapq functions all compare these little lists (instead of just the iterables' values), and list comparison always goes thru the lists first looking for the first corresponding items that are not ==.
So, e.g., asking whether [0] < [1] first asks whether 0 == 1. It's not, so then it goes on to ask whether 0 < 1.
Because of this, each < comparison done during the execution of heapq.merge() actually does two object comparisons (one ==, the other <). The == comparisons are "wasted" work, in the sense that they're not logically necessary to solve the problem - they're just "an optimization" (which happens not to pay in this context!) used internally by list comparison.
So in some sense it would be fairer to cut the report of heapq.merge() comparisons in half. But it's still way more than sorted() needed, so I'll let it drop now ;-)
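If you want to see that doubling directly, here is a tiny check reusing the V counter class from above: a single-element list comparison performs one == and then one <.

ncmp = 0
_ = [V(0)] < [V(1)]  # list comparison tests == first, then falls back to <
print(ncmp)          # -> 2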
sorted uses an adaptive mergesort that detects sorted runs and merges them efficiently, so it gets to take advantage of all the same structure in the input that heapq.merge gets to use. Also, sorted has a really nice C implementation with a lot more optimization effort put into it than heapq.merge.

Python dict key delete if pattern match with other dict key

Delete a key from a Python dict if the key matches a pattern formed by a key of another dict.
e.g.
a = {'a.b.c.test': 1, 'b.x.d.pqr': 2, 'c.e.f.dummy': 3, 'd.x.y.temp': 4}
b = {'a.b.c': 1, 'b.p.q': 20}
result:
a = {'b.x.d.pqr': 2, 'c.e.f.dummy': 3, 'd.x.y.temp': 4}
If "pattern match with other dict key" means "starts with any key in the other dict", the most direct way to write that would be like this:
a = {k: v for (k, v) in a.items() if not any(k.startswith(k2) for k2 in b)}
If that's hard to follow at first glance, it's basically the equivalent of this:
def matches(key1, d2):
    for key2 in d2:
        if key1.startswith(key2):
            return True
    return False

c = {}
for key in a:
    if not matches(key, b):
        c[key] = a[key]
a = c
This is going to be slower than necessary. If a has N keys, and b has M keys, the time taken is O(NM). While you can check "does key k exist in dict b" in constant time, there's no way to check "does b contain a key that is a prefix of k" without iterating over the whole dict. So, if b is potentially large, you probably want to search sorted(b.keys()) with a binary search, which gets the time down to O(N log M). But if this isn't a bottleneck, you may be better off sticking with the simple version, just because it's simple.
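Since the keys here are short dotted paths, there is also a simple way to get below O(NM) without a binary search (a variation on the same idea, not part of the original answer): put b's keys in a set and test every prefix of each key of a, which costs O(1) per prefix lookup:

a = {'a.b.c.test': 1, 'b.x.d.pqr': 2, 'c.e.f.dummy': 3, 'd.x.y.temp': 4}
b = {'a.b.c': 1, 'b.p.q': 20}

b_keys = set(b)
a = {k: v for k, v in a.items()
     if not any(k[:i] in b_keys for i in range(1, len(k) + 1))}
print(a)  # {'b.x.d.pqr': 2, 'c.e.f.dummy': 3, 'd.x.y.temp': 4}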
Note that I'm generating a new a with the matches filtered out, rather than deleting the matches. This is almost always a better solution than deleting in-place, for multiple reasons:
* It's much easier to reason about. Treating objects as immutable and doing pure operations on them means you don't need to think about how states change over time. For example, the naive way to delete in place would run into the problem that you're changing the dictionary while iterating over it, which will raise an exception. Issues like that never come up without mutable operations.
* It's easier to read, and (once you get the hang of it) even to write.
* It's almost always faster. (One reason is that it takes a lot more memory allocations and deallocations to repeatedly modify a dictionary than to build one with a comprehension.)
The one tradeoff is memory usage. The delete-in-place implementation has to make a copy of all of the keys; the build-a-new-dict implementation has to hold both the filtered dict and the original dict in memory. If you're keeping 99% of the values, and the values are much larger than the keys, this could hurt you. (On the other hand, if you're keeping 10% of the values, and the values are about the same size as the keys, you'll actually save space.) That's why it's "almost always" a better solution, rather than "always".
for key in list(a.keys()):
    if any(key.startswith(k) for k in b):
        del a[key]
Replace key.startswith(k) with an appropriate condition for "matching".
c = {}  # result in dict c
for key in b.keys():
    # the key in b should not be a substring of any of the keys in a
    if all(z.count(key) == 0 for z in a.keys()):
        c[key] = b[key]

What is the quickest way to hash a large arbitrary object?

I am writing a method to generate cache keys for caching function results, the key is based on a combination of function name and hash value of parameters.
Currently I am using hashlib to hash the serialized version of the parameters; however, serializing large objects is very expensive. What's the alternative?
# get the cache key for storage
def cache_get_key(*args):
    import hashlib
    serialise = []
    for arg in args:
        serialise.append(str(arg))
    key = hashlib.md5("".join(serialise)).hexdigest()
    return key
UPDATE:
I tried using hash(str(args)), but if args contains relatively large data it still takes a long time to compute the hash value. Any better way to do it?
Actually, even building str(args) with large data takes forever...
Assuming you made the object, and it is composed of smaller components (it is not a binary blob), you can precompute the hash when you build the object by using the hashes of its subcomponents.
For example, rather than serialize(repr(arg)), do arg.precomputedHash if isinstance(arg, ...) else serialize(repr(arg))
If you neither make your own objects nor use hashable objects, you can perhaps keep a memoization table of object references -> hashes, assuming you don't mutate the objects. Worst case, you can use a functional language that supports memoization, since all objects in such a language are probably immutable and hence hashable.
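A small sketch of the precompute-at-construction idea (the class and function names here are illustrative, not from the question):

import hashlib

class Record:
    def __init__(self, name, payload):
        self.name = name
        self.payload = payload
        # hash the components once, when the object is built
        h = hashlib.md5()
        h.update(name.encode())
        h.update(payload)  # payload assumed to be bytes
        self.precomputed_hash = h.hexdigest()

def cache_key_from(*args):
    parts = []
    for arg in args:
        if hasattr(arg, "precomputed_hash"):
            parts.append(arg.precomputed_hash)  # reuse the cheap, precomputed digest
        else:
            parts.append(str(arg))              # fall back to serialising
    return hashlib.md5("".join(parts).encode()).hexdigest()

r = Record("user-profile", b"lots of bytes here")
print(cache_key_from("my_function", r, 42))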
def cache_get_key(*args):
    return hash(str(args))
or (if you really want to use the hashlib library)
def cache_get_key(*args):
    return hashlib.md5(str(args).encode()).hexdigest()  # .encode() needed on Python 3
I wouldn't bother writing code to convert the arguments into strings one by one; use the built-in str()/hash() as above.
Alternative solution:
Below is the solution @8bitwide suggested. No hashing is required at all with this approach!
def foo(x, y):
    return x + y + 1

result1 = foo(1, 1)
result2 = foo(2, 3)

results = {}
results[foo] = {}
results[foo][(1, 1)] = result1  # use tuples as keys; lists are not hashable
results[foo][(2, 3)] = result2
Have you tried just using the hash function? It works perfectly well on tuples.
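For example, assuming all the arguments are hashable (numbers, strings, tuples of immutables), something as simple as this gives a usable in-process cache key; note that string hashes are randomized per process, so the key is only stable within one run:

def cache_get_key(func_name, *args):
    return hash((func_name,) + args)

print(cache_get_key("foo", 1, 2))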
I know this question is old, but I just want to add my 2 cents:
You don't have to create a list and then join it, especially if the list is going to be discarded anyway; use the hash object's .update() method instead.
Consider using a much faster non-cryptographic hash algo, especially if this is not meant to be a cryptographically-secure implementation.
Having said that, this is my suggested improvement:
import xxhash

# get the cache key for storage
def cache_get_key(*args):
    hasher = xxhash.xxh3_64()
    for arg in args:
        hasher.update(str(arg))
    return hasher.hexdigest()
This uses the (claimed to be) extremely fast xxHash NCHF*.
* NCHF = Non-Cryptographic Hash Function
I've seen people feed an arbitrary python object to random.seed(), and then use the first value back from random.random() as the "hash" value. It doesn't give a terrific distribution of values (can be skewed), but it seems to work for arbitrary objects.
If you don't need cryptographic-strength hashes, I came up with a pair of hash functions for a list of integers that I use in a bloom filter. They appear below. The bloom filter actually uses linear combinations of these two hash functions to obtain an arbitrarily large number of hash functions, but they should work fine in other contexts that just need a bit of scattering with a decent distribution. They're inspired by Knuth's writing on Linear Congruential Random Number Generation. They take a list of integers as input, which I believe could just be the ord()'s of your serialized characters.
MERSENNES1 = [2 ** x - 1 for x in [17, 31, 127]]
MERSENNES2 = [2 ** x - 1 for x in [19, 67, 257]]

def simple_hash(int_list, prime1, prime2, prime3):
    '''Compute a hash value from a list of integers and 3 primes'''
    result = 0
    for integer in int_list:
        result += ((result + integer + prime1) * prime2) % prime3
    return result

def hash1(int_list):
    '''Basic hash function #1'''
    return simple_hash(int_list, MERSENNES1[0], MERSENNES1[1], MERSENNES1[2])

def hash2(int_list):
    '''Basic hash function #2'''
    return simple_hash(int_list, MERSENNES2[0], MERSENNES2[1], MERSENNES2[2])
