how to intern dictionary string keys in python? - python

After reading that interning strings can help with performance: do I just store the return value from the sys.intern call in the dictionary as the key, and that's it?
import sys

t = {}
t[sys.intern('key')] = 'val'
Thanks

Yes, that's how you would use it.
To be more specific on the performance, the doc states that:
Interning strings is useful to gain a little performance on dictionary lookup – if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare.
There are two steps in a (classic) dict lookup: 1. hash the object into a number that is the index into the array that stores the data; 2. iterate over the array cell at this index to find a (key, value) pair with the correct key.
Usually, the second step is reasonably fast because we choose a hash function that ensures very few collisions (different objects, same hash). But it still has to check the key you are looking for against every stored key having the same hash. This is the step that gets faster: string identity is tested before the expensive char-by-char test of string equality.
Step 1 is harder to accelerate: you could store the hash along with the interned string, but you have to compute the hash to find the interned string in the first place.
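For example, here is a minimal sketch of what the quoted paragraph means in practice (the key name is just illustrative):

import sys

table = {}
table[sys.intern('user-agent')] = 'val'

# The lookup key must be interned as well; otherwise the pointer-equality
# shortcut never applies and CPython falls back to comparing characters.
probe = sys.intern('user-agent')
print(table[probe])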
This was theory! If you really need to improve performance, first do some benchmarks.
Then think of the specifics of the domain. You are storing IPv4 addresses as keys. An IPv4 address is a number between 0 and 256^4. If you replace the human-friendly representation of an address by an integer, you'll get a faster hash (hashing small numbers in CPython is almost costless: https://github.com/python/cpython/blob/master/Python/pyhash.c) and a faster lookup. The ipaddress module might be the best choice in your case.
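A rough sketch of that integer-key idea, using the standard ipaddress module (the dict and variable names are made up for illustration):

import ipaddress

stats = {}
key = int(ipaddress.IPv4Address("172.16.254.1"))   # 2886794753, a plain int
stats[key] = "some value"

# Lookups hash the small integer instead of the dotted string.
print(stats[int(ipaddress.IPv4Address("172.16.254.1"))])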
If you are sure that the addresses fall within known boundaries (e.g. 172.16.0.0 – 172.31.255.255), you can try using an array instead of a dict. It should be faster unless the array is huge (disk swap).
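And a sketch of the array variant, assuming the keys really are confined to 172.16.0.0 – 172.31.255.255 (a /12 block of 1,048,576 addresses):

import ipaddress

BASE = int(ipaddress.IPv4Address("172.16.0.0"))
SIZE = int(ipaddress.IPv4Address("172.31.255.255")) - BASE + 1   # 1,048,576 slots

table = [None] * SIZE                                            # index = address - BASE
table[int(ipaddress.IPv4Address("172.16.254.1")) - BASE] = "some value"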
Finally, if this is not fast enough, be ready to use a faster language.

Related

What is the fastest key type for the dictionaries in python? tuple, frozenset...?

Context: I am trying to speed up the execution time of k-means. For that, I pre-compute the means before the k-means execution. These means are stored in a dictionary called means_dict whose key is a sequence of point ids, sorted in ascending order and joined by an underscore, and whose value is the mean of those points.
When I want to access the mean of a given set of points in the means_dict dictionary during the k-means execution, I have to generate the key for that set, i.e. sort the point ids in ascending order and join them with an underscore.
The key generation takes a long time because the key may contain thousands of integers.
Each key is a sequence of integers joined by underscores. I have to sort the sequence of integers before joining them in order to make the key unique, and I finally obtain a string key. The problem is that this process is so slow. I want to use another key type that avoids sorting the sequence and that is faster than a string in terms of access, comparison and search.
# means_dict is the dictionary whose keys are strings: sequences of
# integer ids sorted and joined by an underscore, e.g. key = "3_45_76_78_344"
# points is a dictionary mapping each id to a list of integers
for k in keys:
    # this joining instruction is the slow part
    key = "_".join(str(c) for c in sorted(points[k]))
    if key in means_dict:
        newmu.append(means_dict[key])
Computing the means is cheap.
Did you profile your program? How much of the time is spent recomputing the means? With proper numpy arrays instead of Python's boxed lists, this should be extremely cheap - definitely cheaper than constructing any such key!
The reason why computing the key is expensive is simple: it means constructing an object of varying size. And based on your description, it seems you will first be building a list of boxed integers, then a tuple of boxed integers, then serializing this into a string, and then copying the string again to append the underscores. There is no way this is going to be faster than the simple, vectorizable aggregation done when computing the actual mean...
You could even use MacQueen's approach to update the means rather than recomputing them. But even that is often slower than recomputing them.
I wouldn't be surprised if your approach ends up being 10x slower than regular k-means... and probably 1000x slower than clever k-means algorithms such as Hartigan and Wong's.
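As a side note on the original question (fastest key type): if you do keep a dictionary keyed by point sets, a frozenset key avoids both the sort and the string building, since its hash does not depend on element order. A small sketch reusing the question's (hypothetical) names:

# means_dict must be built with frozenset keys as well for this to work
for k in keys:
    key = frozenset(points[k])         # no sort, no join
    if key in means_dict:
        newmu.append(means_dict[key])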

Dictionary Python space

I use a dictionary to store unique keys mapped to unique values.
At first the keys were str and the values int. I used sys.getsizeof(dict) to get the size, and I also printed the dict into a file.
import csv

debugggg = csv.writer(open("debug1.txt", 'w'))
for key, val in diction.items():
    debugggg.writerow([key, val])
I got 296 MB from the file and 805306648 from sys.getsizeof()
Then I stored the same values, but this time I hashed the keys before mapping:
diction[hash(mykey_1)] = value_1
And I was expecting this to be a bit more compressed than the previous approach.
I ran the same functions to get the size. From the file I got 362 MB (!) and from sys.getsizeof() I got the same result as before (805306648).
Processing time was the same, as expected with O(1) lookups. But I am a bit confused about the sizes.
sys.getsizeof(some_dict) accounts for the size of the internal hash table, which is roughly proportional to the number of keys. But it does not account for the size of the keys and values, partly because that would be tricky to do correctly, and partly because there can be many other references to those objects so it's a bit misleading to include them (their size might be amortized over many different dicts, for instance).
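A quick way to see this for yourself (a small demonstration, not from the original post):

import sys

small_values = {i: "x" for i in range(1000)}
big_values = {i: "x" * 10_000 for i in range(1000)}

# Same reported size: only the internal table is measured,
# not the strings the dict merely references.
print(sys.getsizeof(small_values) == sys.getsizeof(big_values))   # True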
As for the file size: Aside from the fact that it does include the sizes of keys and values, the values (integers) are encoded differently, possibly less efficiently. Overall this might be balanced by the fact that an int object contains considerable metadata and rounds the actual data up to 4 or 8 bytes. Other factors: a CSV file includes the commas and line breaks, and a hash is often a large number, larger than many short strings (hash("a") is -392375501 on my machine).
Side note: It's likely that the dict you built using diction[hash(mykey_1)] = ... is wrong. You're doing the hashing outside of the eyes of the dictionary, so it can't protect you from hash collisions. You might lose some values because their keys hash to the same integer. Since the internal hash table always has a power-of-two size and is only resized at certain thresholds, having a few entries fewer doesn't necessarily show up in sys.getsizeof.
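A concrete illustration of that collision risk, using a collision that is guaranteed in CPython (hash(-1) == hash(-2) == -2):

d = {}
d[hash(-1)] = "value for -1"
d[hash(-2)] = "value for -2"
print(len(d))      # 1 -- the second insert silently overwrote the first

safe = {-1: "value for -1", -2: "value for -2"}
print(len(safe))   # 2 -- the dict resolves the collision itself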
First, O(1) is the average time for lookups in a dict, but it can be as bad as O(n). Here's the source for that information.
Second, the hash function returns an integer. sys.getsizeof(hash("Some value")) returns 12 on my platform. But if you read up on how Python dictionaries are implemented, the size of the key doesn't really have much to do with how many bytes the entire dictionary takes up. That has more to do with how many total items you're storing, not how you access them.

Python Efficiency of the in statement

Just a quick question, I know that when looking up entries in a dictionary there's a fast efficient way of doing it:
(Assuming the dictionary is ordered in some way using collections.OrderedDict())
You start at the middle of the dictionary, and find whether the desired key is off to one half or another, such as when testing the position of a name in an alphabetically ordered dictionary (or in rare cases dead on). You then check the next half, and continue this pattern until the item is found (meaning that with a dictionary of 1000000 keys you could effectively find any key within 20 iterations of this algorithm).
So I was wondering, if I were to use an in statement (i.e. if a in somedict:), would it use this same method of checking for the desired key? Does it use a faster/slower algorithm?
Nope. Python's dictionaries basically use a hash table (actually a modified hash table to improve speed; I won't bother to explain hash tables here, the linked Wikipedia article describes them well), which is a neat structure that allows ~O(1) (very fast) access. in looks up the object (the same thing that dict[object] does), except it doesn't return the object, which makes it the most direct way of checking membership.
The code for in for dictionaries contains this line (dk_lookup() returns a hash table entry if it exists, otherwise NULL (the equivalent of None in C, often indicating an error)):
ep = (mp->ma_keys->dk_lookup)(mp, key, hash, &value_addr);
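A rough (non-rigorous) way to see the difference this makes, comparing dict membership with the linear scan a list has to do:

import timeit

d = {i: None for i in range(1_000_000)}
lst = list(range(1_000_000))

# Dict membership goes through the hash table; list membership scans items one by one.
print(timeit.timeit("999_999 in d", globals=globals(), number=100))
print(timeit.timeit("999_999 in lst", globals=globals(), number=100))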

Python unhash value

I am a newbie to Python. Can I unhash, or rather, how can I unhash a value? I am using the standard hash() function. What I would like to do is to first hash a value, send it somewhere, and then unhash it, as such:
# process X
hashedVal = hash(someVal)
# send and receive in process Y
someVal = unhash(hashedVal)
# for example, print it
print(someVal)
Thx in advance
It can't be done.
A hash is not a compressed version of the original value; it is a number (or something similar) derived from the original value. The nature of hash implementations is that it is possible (but statistically unlikely, if the hash algorithm is a good one) that two different objects produce the same hash value.
This is known as the Pigeonhole Principle which basically states that if you have N different items, and want to place them into M different categories, where the N number is larger than M (ie. more items than categories), you're going to end up with some categories containing multiple items. Since a hash value is typically much smaller in size than the data it hashes, it follows the same principles.
As such, it is impossible to go back once you have the hash value. You need a different way of transporting data than this.
For instance, an example (but not a very good one) hash algorithm would be to calculate the number modulo 3 (i.e. the remainder after dividing by 3). Then you would get the following hash values from numbers:
1 --> 1  <--+-- same hash value, but different original values
2 --> 2     |
3 --> 0     |
4 --> 1  <--+
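Written out as code, the toy hash makes the collision obvious:

def toy_hash(n):
    # the "not very good" example hash: remainder after dividing by 3
    return n % 3

for n in (1, 2, 3, 4):
    print(n, "-->", toy_hash(n))   # 1 and 4 both map to 1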
Are you trying to use the hash function in this way in order to:
Save space (you have observed that the hash value is much smaller in size than the original data)
Secure transportation (you have observed that the hash value is difficult to reverse)
Transport data (you have observed that the hash number/string is easier to transport than a complex object hierarchy)
... ?
Knowing why you want to do this might give you a better answer than just "it can't be done".
For instance, for the above 3 different observations, here's a way to do each of them properly:
Compression/Decompression, for instance using gzip or zlib (the two typically available in most programming languages/runtimes)
Encryption/Decryption, for instance using RSA, AES or a similar secure encryption algorithm
Serialization/Deserialization, which is code built to take a complex object hierarchy and produce either a binary or textual representation that later on can be deserialized back into new objects
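For the compression and serialization cases, the standard library already covers the round trip (encryption needs a third-party library, so it is left out of this sketch):

import pickle
import zlib

data = {"complex": ["object", "hierarchy", 1, 2, 3]}

# Compression / decompression
blob = zlib.compress(pickle.dumps(data))
restored = pickle.loads(zlib.decompress(blob))

# Serialization / deserialization on its own
payload = pickle.dumps(data)
assert pickle.loads(payload) == data == restored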
Even if I'm almost 8 years late with an answer, I want to say it is possible to unhash data (not with the std hash() function though).
The previous answers all describe cryptographic hash functions, which by design compute hashes that are impossible (or at least very hard) to reverse.
However, this is not the case with all hash functions.
Solution
You can use basehash python lib (pip install basehash) to achieve what you want.
There is an important thing to keep in mind though: in order to be able to unhash the data, you need to hash it without loss of data. This generally means that the bigger the pool of data types and values you would like to hash, the bigger the hash length has to be, so that you won't get hash collisions.
Anyway, here's a simple example of how to hash/unhash data:
import basehash
hash_fn = basehash.base36() # you can initialize a 36, 52, 56, 58, 62 and 94 base fn
hash_value = hash_fn.hash(1) # returns 'M8YZRZ'
unhashed = hash_fn.unhash('M8YZRZ') # returns 1
You can define the hash length on hash function initialization and hash other data types as well.
I leave out the explanation of the necessity for various bases and hash lengths to the readers who would like to find out more about hashing.
You can't "unhash" data; hash functions are irreversible due to the pigeonhole principle:
http://en.wikipedia.org/wiki/Hash_function
http://en.wikipedia.org/wiki/Pigeonhole_principle
I think what you are looking for is encryption/decryption. (Or compression or serialization, as mentioned in other answers/comments.)
This is not possible in general. A hash function necessarily loses information, and python's hash is no exception.

What is a hashtable/dictionary implementation for Python that doesn't store the keys?

I'm storing millions, possibly billions of 4 byte values in a hashtable and I don't want to store any of the keys. I expect that only the hashes of the keys and the values will have to be stored. This has to be fast and all kept in RAM. The entries would still be looked up with the key, unlike set()'s.
What is an implementation of this for Python? Is there a name for this?
Yes, collisions are allowed and can be ignored.
(I can make an exception for collisions, the key can be stored for those. Alternatively, collisions can just overwrite the previously stored value.)
Bloomier filters - space-efficient associative array
From the Wikipedia:
Chazelle et al. (2004) designed a generalization of Bloom filters that could associate a value with each element that had been inserted, implementing an associative array. Like Bloom filters, these structures achieve a small space overhead by accepting a small probability of false positives. In the case of "Bloomier filters", a false positive is defined as returning a result when the key is not in the map. The map will never return the wrong value for a key that is in the map.
How about using an ordinary dictionary and instead of doing:
d[x]=y
use:
d[hash(x)]=y
To look up:
d[hash(foo)]
Of course, if there is a hash collision, you may get the wrong value back.
It's the good old space vs. runtime tradeoff: you can have constant-time lookups with linear space usage for the keys in a hashtable. Or you can store the keys implicitly and use O(log n) time with a binary tree. The (binary) hash of a value gives you the path in the tree where it will be stored.
Build your own b-tree in RAM.
Memory use:
(4 bytes) comparison hash value
(4 bytes) index of next leaf if hash <= comparison, OR, if negative, index of the value
(4 bytes) index of next leaf if hash > comparison, OR, if negative, index of the value
12 bytes per b-tree node. More overhead for the values (see below).
How you structure this in Python - aren't there "native arrays" of 32-bit integers supported with almost no extra memory overhead...? What are they called... anyway, those.
Separate ordered array of subarrays each containing one or more values. The "indexes of value" above are indexes into this big array, allowing retrieval of all values matching the hash.
This assumes a 32-bit hash. You will need more bytes per b-tree node if you have more than 2^31-1 entries or a larger hash.
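The "native arrays" hinted at above are presumably the standard array module, which stores machine integers compactly (a small sketch, not part of the original answer):

from array import array

# One flat array of C ints: three 4-byte slots per node
# (comparison hash, left index, right index).
nodes = array('i', [0] * (3 * 1000))
print(nodes.itemsize)       # typically 4 bytes per entry
print(nodes.buffer_info())  # (address, length) -- no per-item object overhead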
BUT, a spanner in the works perhaps: note that if you are not storing the keys, you will not be able to verify that a looked-up hash value corresponds only to your key, unless some algorithmic or organisational mechanism guarantees that no two keys will have the same hash. Quite a serious issue here. Have you considered it? :)
Although Python dictionaries are very efficient, I think that if you're going to store billions of items, you may want to create your own C extension with data structures optimized for the way you are actually using them (sequential access? completely random? etc.).
In order to create a C extension, you may want to use SWIG, or something like Pyrex (which I've never used).
A hash table has to store its keys, unless you provide a hash function that gives absolutely no collisions, which is nearly impossible.
If your keys are string-like, however, there is a very space-efficient data structure: the directed acyclic word graph (DAWG). I don't know of any Python implementation, though.
It's not what you asked for, but why not consider Tokyo Cabinet or BerkeleyDB for this job? It won't be in memory, but you are trading performance for greater storage capacity. You could still keep your list in memory and use the database only to check existence.
Would you please tell us more about the keys? I'm wondering if there is any regularity in the keys that we could exploit.
If the keys are strings in a small alphabet (example: strings of digits, like phone numbers) you could use a trie data structure:
http://en.wikipedia.org/wiki/Trie
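A dict-of-dicts trie is the quickest way to sketch the idea (a real space-saving trie would use a more compact node layout, but the lookup structure is the same; the "$" end-marker is just a convention chosen here):

def trie_insert(root, key, value):
    node = root
    for ch in key:
        node = node.setdefault(ch, {})
    node["$"] = value                  # "$" marks end-of-key and holds the value

def trie_get(root, key):
    node = root
    for ch in key:
        node = node.get(ch)
        if node is None:
            return None
    return node.get("$")

root = {}
trie_insert(root, "5551234", 42)
print(trie_get(root, "5551234"))       # 42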
If you're actually storing millions of unique values, why not use a dictionary?
Store: d[hash(key) // 32] |= 2 ** (hash(key) % 32)
Check: d[hash(key) // 32] & 2 ** (hash(key) % 32)
If you have billions of entries, use a numpy array of size (2**32)/32, instead. (Because, after all, you only have 4 billion possible values to store, anyway).
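A sketch of that numpy bitset, assuming a 32-bit hash (Python's own hash() is wider, so it is truncated here); note this records membership only, not the 4-byte values:

import numpy as np

bits = np.zeros((2 ** 32) // 32, dtype=np.uint32)   # 512 MB of flags

def add(key):
    h = hash(key) & 0xFFFFFFFF
    bits[h // 32] |= np.uint32(1 << (h % 32))

def contains(key):
    h = hash(key) & 0xFFFFFFFF
    return bool(bits[h // 32] & np.uint32(1 << (h % 32)))

add("10.0.0.1")
print(contains("10.0.0.1"), contains("10.0.0.2"))   # True False (barring a collision)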
Why not a dictionary + hashlib?
>>> import hashlib
>>> hashtable = {}
>>> def myHash(obj):
...     return hashlib.sha224(obj.encode()).hexdigest()
...
>>> hashtable[myHash("foo")] = 'bar'
>>> hashtable
{'0808f64e60d58979fcb676c96ec938270dea42445aeefcd3a4e6f8db': 'bar'}
