I couldn't find the details of this anywhere online. When comparing two frozensets for equality, does Python iterate through the elements of one of the sets, or does it just compare the hash values, since frozensets are hashable?
Since the reference docs don't say anything about this, it's implementation-dependent, so there is no answer short of looking at the source code for the version of Python you're using (in your CPython distribution's Objects/setobject.c). Looking at the source for Python 3.7.0, the answer is "maybe" ;-)
Equality first checks whether the frozensets have the same size (len()). If not, they can't be equal, so False is returned at once.
Otherwise, if both hash codes have already been computed and they aren't equal, False is returned at once. Failing that, element-by-element code is invoked to check whether one set is a subset of the other.
A hash code for a frozenset isn't computed just for the heck of it - that would be an expense that may not pay off. So something has to force it. The primary use case for frozensets at the start was to allow sets of sets, and in that context hash codes will be computed as a normal part of adding a frozenset to a containing set. The C-level set implementation contains a slot to record the hash if and when it's computed, which is initialized to -1 (a reserved value that means "no hash code known" internally).
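Here is a Python-level sketch of that decision sequence (purely illustrative - the real code is C in Objects/setobject.c, and the function name and explicit cached-hash parameters here are made up for the sketch):

def frozenset_eq(a, b, cached_hash_a=-1, cached_hash_b=-1):
    # Different sizes can never be equal.
    if len(a) != len(b):
        return False
    # -1 plays the role of "no hash code known", as in CPython.
    if cached_hash_a != -1 and cached_hash_b != -1 and cached_hash_a != cached_hash_b:
        return False
    # Fall back to the element-by-element subset check.
    return all(elem in b for elem in a)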
hash(x) == hash(y) does not imply that x == y:
>>> help(hash)
hash(...)
    hash(object) -> integer

    Return a hash value for the object. Two objects with the same value have
    the same hash value. The reverse is not necessarily true, but likely.
So to compare two frozenset values for equality, you still need to check that both sets have the same size, and then check whether every element in one is also in the other.
I leave it as an exercise for the reader with lots of spare time to find two different frozensets with the same hash value.
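Actually, in CPython you don't even need spare time: hash(-1) and hash(-2) are both -2 (-1 is reserved internally as an error value), and a frozenset's hash is derived purely from its elements' hashes, so:

>>> hash(frozenset([-1])) == hash(frozenset([-2]))
True
>>> frozenset([-1]) == frozenset([-2])
False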
After reading that interning strings can help with performance: do I just store the return value from the sys.intern() call in the dictionary as the key, and that is it?
import sys

t = {}
t[sys.intern('key')] = 'val'
Thanks
Yes, that's how you would use it.
To be more specific on the performance, the doc states that:
Interning strings is useful to gain a little performance on dictionary lookup – if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare.
There are two steps in a (classic) dict lookup: 1. hash the key into a number that determines an index into the array that stores the entries; 2. iterate over the entries at that index to find a (key, value) pair whose key equals the one you're looking for.
Usually, the second step is reasonably fast, because the hash function is chosen to ensure very few collisions (different objects, same hash). But the lookup key still has to be checked against every stored key having the same hash. This is the step that gets faster: string identity is tested before the expensive character-by-character test of string equality.
Step 1 is harder to accelerate: you could store the hash along with the interned string... but you have to compute the hash to find the interned string in the first place.
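You can see the identity effect directly (a quick illustration; the runtime concatenation is just there to stop CPython from interning the literals at compile time):

import sys

prefix = "dynamic"
a = prefix + "_key"               # built at runtime: a fresh string object
b = prefix + "_key"               # another fresh object with the same value
print(a is b)                     # False: equal values, distinct objects
print(sys.intern(a) is sys.intern(b))  # True: one shared object, so the dict
                                       # key comparison is a pointer compare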
This was theory! If you really need to improve performance, first do some benchmarks.
Then think of the specificity of the domain. You are storing IPv4 addresses as keys. An IPv4 address is a number between 0 and 256^4 - 1. If you replace the human-friendly representation of an address by an integer, you'll get a faster hash (hashing small numbers in CPython is almost costless: https://github.com/python/cpython/blob/master/Python/pyhash.c) and a faster lookup. The ipaddress module might be the best choice in your case.
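For instance (a small sketch; the address and value are made up):

import ipaddress

d = {}
key = int(ipaddress.IPv4Address('192.168.0.1'))  # 3232235521: a plain int key
d[key] = 'val'
print(d[int(ipaddress.IPv4Address('192.168.0.1'))])  # 'val'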
If you are sure that addresses are between boundaries (e.g. 172.16.0.0 – 172.31.255.255) you can try to use an array instead of a dict. It should be faster unless your array is huge (disk swap).
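A sketch of the array idea, assuming the 172.16.0.0 - 172.31.255.255 range above (about a million slots; a plain list stands in for a real array here):

import ipaddress

BASE = int(ipaddress.IPv4Address('172.16.0.0'))
TOP = int(ipaddress.IPv4Address('172.31.255.255'))
table = [None] * (TOP - BASE + 1)   # 1,048,576 slots

table[int(ipaddress.IPv4Address('172.16.0.1')) - BASE] = 'val'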
Finally, if this is not fast enough, be ready to use a faster language.
I am using Python 3.5 and the documentation for it at
https://docs.python.org/3.5/library/stdtypes.html#sequence-types-list-tuple-range
says:
list([iterable])
(...)
The constructor builds a list whose items are the same and in the same order as iterable’s items.
OK, for the following script:
#!/usr/bin/python3
import random

def rand6():
    return random.randrange(63)

random.seed(0)
check_dict = {}
check_dict[rand6()] = 1
check_dict[rand6()] = 1
check_dict[rand6()] = 1
print(list(check_dict))
I always get
[24, 48, 54]
But, if I change the function to:
def rand6():
    return bytes([random.randrange(63)])
then the order returned is not always the same:
>./foobar.py
[b'\x18', b'6', b'0']
>./foobar.py
[b'6', b'0', b'\x18']
Why?
Python dictionaries are implemented as hash tables. In most Python versions (more on this later), the order you get the keys when you iterate over a dictionary is the arbitrary order of the values in the table, which has only very little to do with the order in which they were added (when hash collisions occur, the order of insertions can matter a little bit). This order is implementation dependent. The Python language does not offer any guarantee about the order other than that it will remain the same for several iterations over a dictionary if no keys are added or removed in between.
For your dictionary with integer keys, the hash table doesn't do anything fancy. Integers hash to themselves (except -1), so with the same numbers getting put in the dict, you get a consistent order in the hash table.
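You can check this at the REPL (the only exception is -1, which CPython reserves internally as an error value):

>>> hash(24), hash(48), hash(54)
(24, 48, 54)
>>> hash(-1)   # -1 is reserved, so it hashes to -2
-2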
For the dictionary with bytes keys however, you're seeing different behavior due to Hash Randomization. To prevent a kind of dictionary collision attack (where a webapp implemented in Python could be DoSed by sending it data with thousands of keys that hash to the same value leading to lots of collisions and very bad (O(N**2)) performance), Python picks a random seed every time it starts up and uses it to randomize the hash function for Unicode and byte strings as well as datetime types.
You can disable the hash randomization by setting the environment variable PYTHONHASHSEED to 0 (or you can pick your own seed by setting it to any positive integer up to 2**32-1).
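For example, in a shell (the exact order you get under seed 0 may differ across platforms and Python versions, but repeated runs with the same seed will agree with each other):

>PYTHONHASHSEED=0 ./foobar.py
[b'\x18', b'6', b'0']
>PYTHONHASHSEED=0 ./foobar.py
[b'\x18', b'6', b'0']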
It's worth noting that this behavior has changed in Python 3.6. Hash randomization still happens, but a dictionary's iteration order is no longer based on the hash values of the keys. While the official language policy is still that the order is arbitrary, the implementation of dict in CPython now preserves the order in which its values were added. You shouldn't rely upon this behavior when using regular dicts yet, as it's possible (though it appears unlikely at this point) that the developers will decide it was a mistake and change the implementation again. If you want to guarantee that iteration occurs in a specific order, use the collections.OrderedDict class instead of a normal dict. (Update: as of Python 3.7, insertion-order preservation is an official part of the language specification.)
Consider two dictionaries as follows:
d1={"Name":"John","Age":47}
d2={"Name":"Margaret","Age":35}
On executing the following statement:
>>> cmp(d1, d2)
1
That implies that since the keys are identical, it compares the values and gives priority to the value associated with the "Age" key (perhaps because, lexicographically, it comes first). This is supported by the fact that when I alter the dictionaries:
d1={"Name":"John","Age":47}
d2={"Name":"Jack","Age":47}
The statement returns 1, since the sum of the ASCII values is greater for d1.
But consider this pair of dictionaries:
d1={"Name":"John","Age":47}
d2={"Name":"Jzan","Age":47}
Now the statement returns -1.
Why is that? Is it that instead of comparing the sum of the ASCII values, it compares each character's value, one by one?
Also, if the keys themselves are different, on what basis does the function compare?
Most programming languages implement string comparison according to dictionary order (the way that words are ordered in a dictionary): the characters' values are compared one by one, and the first difference determines the result. The sum of the character values plays no role.
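A quick demonstration that the ASCII sums are irrelevant:

>>> "az" < "b"      # True: 'a' < 'b' settles it at the first character
True
>>> sum(map(ord, "az")) < sum(map(ord, "b"))   # the sums point the other way
False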
If the keys themselves are different, the return value actually depends on the implementation. You can find more information here: Is there a description of how __cmp__ works for dict objects in Python 2?. However, it is not recommended to rely on this behavior in your code.
I'm taking a first look at the python language from Python wikibook.
For sets the following is mentioned:
We can also have a loop move over each of the items in a set. However, since sets are unordered, it is undefined which order the iteration will follow.
and the code example given is :
s = set("blerg")
for letter in s:
    print letter
Output:
r
b
e
l
g
When I run the program I get the results in the same order, no matter how many times I run. If sets are unordered and order of iteration is undefined, why is it returning the set in the same order? And what is the basis of the order?
They are not randomly ordered; they are arbitrarily ordered. That means you should not count on insertion order being maintained, because the actual internal implementation details determine the order instead.
The order depends on the insertion and deletion history of the set.
In CPython, sets use a hash table, where inserted values are slotted into a sparse table based on the value returned from the hash() function, modulo the table size and a collision handling algorithm. Listing the set contents then returns the values as ordered in this table.
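You can see the table ordering with small integers, which hash to themselves, so their slots in the initial 8-slot table are easy to predict (CPython-specific; the exact output can vary between versions):

>>> {3, 1, 2}       # 1, 2 and 3 land in slots 1, 2 and 3
{1, 2, 3}
>>> {32, 1, 2}      # 32 % 8 == 0, so 32 lands in slot 0 and is listed first
{32, 1, 2}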
If you want to go into the nitty-gritty technical details then look at Why is the order in dictionaries and sets arbitrary?; sets are, at their core, dictionaries where the keys are the set values and there are no associated dictionary values. The actual implementation is a little more complicated, as always, but that answer will suffice to get you most of the way there. Then look at the C source code for set for the rest of those details.
Compare this to lists, which do have a fixed order that you can influence; you can move items around in the list and the new ordering would be maintained for you.
I'm storing millions, possibly billions, of 4-byte values in a hashtable and I don't want to store any of the keys. I expect that only the hashes of the keys and the values will have to be stored. This has to be fast and all kept in RAM. The entries would still be looked up by key, unlike with a set().
What is an implementation of this for Python? Is there a name for this?
Yes, collisions are allowed and can be ignored.
(I can make an exception for collisions, the key can be stored for those. Alternatively, collisions can just overwrite the previously stored value.)
Bloomier filters - space-efficient associative array
From Wikipedia:
Chazelle et al. (2004) designed a generalization of Bloom filters that could associate a value with each element that had been inserted, implementing an associative array. Like Bloom filters, these structures achieve a small space overhead by accepting a small probability of false positives. In the case of "Bloomier filters", a false positive is defined as returning a result when the key is not in the map. The map will never return the wrong value for a key that is in the map.
How about using an ordinary dictionary and instead of doing:
d[x]=y
use:
d[hash(x)]=y
To look up:
d[hash(foo)]
Of course, if there is a hash collision, you may get the wrong value back.
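A tiny sketch of that scheme, including the collision hazard (in CPython, -1 and -2 happen to share a hash, which makes the hazard easy to demonstrate):

d = {}

def put(key, value):
    d[hash(key)] = value     # only the hash is stored as the dict key

def get(key):
    return d[hash(key)]      # a colliding key silently returns the wrong value

put(-1, 'a')                 # hash(-1) == -2 in CPython
put(-2, 'b')                 # hash(-2) == -2 as well: this overwrites 'a'
print(get(-1))               # prints 'b', not 'a'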
It's the good old space vs. runtime tradeoff: you can have constant lookup time with linear space usage for the keys in a hash table. Or you can store the key implicitly and use O(log n) time with a binary tree. The (binary) hash of a value gives you the path in the tree where it will be stored.
Build your own binary search tree in RAM.
Memory use:
(4 bytes) comparison hash value
(4 bytes) index of the next node if hash <= comparison, or, if negative, index of the value
(4 bytes) index of the next node if hash > comparison, or, if negative, index of the value
12 bytes per tree node. More overhead for the values (see below).
How do you structure this in Python? There are "native arrays" of 32-bit integers supported with almost no extra memory overhead: the standard library's array module (e.g. array('i')). Use those.
Keep a separate ordered array of subarrays, each containing one or more values. The "indexes of value" above are indexes into this big array, allowing retrieval of all values matching the hash.
This assumes a 32-bit hash. You will need more bytes per tree node if you have more than 2^31-1 entries or a larger hash.
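A minimal layout sketch of that 12-bytes-per-node scheme using the array module (the field names and the ~index encoding for value references are assumptions made for illustration, not a full implementation):

from array import array

node_hash = array('i')   # (4 bytes) comparison hash value
node_le = array('i')     # (4 bytes) next node if hash <= comparison; ~value index if negative
node_gt = array('i')     # (4 bytes) next node if hash > comparison; ~value index if negative
values = []              # the separate array of value subarrays

# A one-node tree whose both branches resolve to values[0]:
values.append([0x1234])
node_hash.append(42)
node_le.append(~0)       # ~0 == -1: negative, so it refers to values[0]
node_gt.append(~0)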
BUT, a possible spanner in the works: note that if you are not storing the keys, you cannot verify that a looked-up hash value corresponds only to your key, unless through some algorithmic or organisational mechanism you have guaranteed that no two keys will have the same hash. Quite a serious issue here. Have you considered it? :)
Although Python dictionaries are very efficient, I think that if you're going to store billions of items, you may want to create your own C extension with data structures optimized for the way you are actually using them (sequential access? completely random? etc.).
In order to create a C extension, you may want to use SWIG, or something like Pyrex (which I've never used).
A hash table has to store keys, unless you provide a hash function that gives absolutely no collisions, which is nearly impossible.
If your keys are string-like, however, there is a very space-efficient data structure: the directed acyclic word graph (DAWG). I don't know of any Python implementation, though.
It's not what you asked for, but why not consider Tokyo Cabinet or BerkeleyDB for this job? It won't be in memory, but you are trading performance for greater storage capacity. You could still keep your list in memory and use the database only to check existence.
Would you please tell us more about the keys? I'm wondering if there is any regularity in the keys that we could exploit.
If the keys are strings in a small alphabet (example: strings of digits, like phone numbers) you could use a trie data structure:
http://en.wikipedia.org/wiki/Trie
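A minimal trie sketch using nested dicts (the '$' end-of-key sentinel and the helper names are made up for illustration):

def trie_insert(root, key, value):
    node = root
    for ch in key:
        node = node.setdefault(ch, {})   # one level per character
    node['$'] = value                    # '$' marks end of key

def trie_get(root, key):
    node = root
    for ch in key:
        node = node[ch]                  # raises KeyError if key is absent
    return node['$']

root = {}
trie_insert(root, '5551234', 'Alice')
print(trie_get(root, '5551234'))         # Alice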
If you're actually storing millions of unique values, why not use a dictionary?
Store: d[hash(key) // 32] |= 1 << (hash(key) % 32)
Check: d[hash(key) // 32] & (1 << (hash(key) % 32))
(with d a collections.defaultdict(int), so missing slots start out as zero)
If you have billions of entries, use a numpy array of size (2**32)/32, instead. (Because, after all, you only have 4 billion possible values to store, anyway).
Why not a dictionary + hashlib?
>>> import hashlib
>>> hashtable = {}
>>> def myHash(obj):
...     return hashlib.sha224(obj.encode()).hexdigest()  # sha224 needs bytes, not str
...
>>> hashtable[myHash("foo")] = 'bar'
>>> hashtable
{'0808f64e60d58979fcb676c96ec938270dea42445aeefcd3a4e6f8db': 'bar'}