My teacher wants us to recreate the dict class in Python using tuples and linked lists (for collisions). One of the methods has to return a value given a key. I know how to do this with a tuple (find the key at location[0] and return location[1]), but I have no idea how I would do it in the case of a collision. Any suggestions? If more info is needed, please let me know.
It sounds like you have some sort of hash to get a shortlist of possibilities: you hash your key to a small-ish number, e.g. 0-255 (as an example, it might hash to 63), and you can then go directly to your data at index 63. Because more than one key might hash to 63, the entry for 63 will contain a list of (key, value) pairs that you have to search one by one; effectively, you've reduced your search area to 1/256th of the full list. Optionally, when the collisions for a particular index exceed a threshold, you could repeat the process, so you get mydict[63][92], again reducing the problem size by the same factor. You could repeat this indefinitely.
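A minimal sketch of that lookup, assuming the table is a plain list of buckets and each bucket is a list of (key, value) pairs (a linked list would be walked the same way; all the names here are hypothetical):

def lookup(table, key):
    # Hash to a bucket index, then scan that bucket's (key, value)
    # pairs until a stored key matches the one we were given.
    bucket = table[hash(key) % len(table)]
    for stored_key, value in bucket:
        if stored_key == key:
            return value
    raise KeyError(key)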
I am currently going through Learn Python the Hard Way, and I find myself stuck on this dictionary exercise: http://learnpythonthehardway.org/book/ex39.html
What gives me problems is the following code:
def hash_key(aMap, key):
    """Given a key this will create a number and then convert it to
    an index for the aMap's buckets."""
    return hash(key) % len(aMap)
How can I be sure that I won't get duplicate values from the hash_key function? Since modulo is being used, what prevents hash() from returning values such that, after modulo is applied, they produce the same hash_key?
Ex. len(aMap)=10, hash(key1) = 20, hash(key2) = 30, therefore hash_key for both dict keys is 0, even though they are obviously not equal.
I'm having trouble grasping the concepts behind hashing, so if you have any reading material suitable for my skill level, please share. I'm not afraid to work hard.
Thank you for your help.
The hashmap, as proposed in the linked exercise, is meant to produce key collisions; they are expected and handled, not avoided.
The data structure is a list of lists, where the key's hash-modulus value determines the 2nd-level list where your data goes.
Imagine the structure as an array of n buckets. If you put something into this data structure, the hash_key() function finds the appropriate bucket and appends your new data to that bucket's contents. Which bucket receives your data is effectively arbitrary, but because hash() is deterministic, it will always be the same bucket for the same key.
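To make that concrete, here is the collision from the question's own example, assuming ten buckets (hash_key is the exercise's function; in CPython, small integers hash to themselves):

aMap = [[] for i in range(10)]   # ten empty buckets

def hash_key(aMap, key):
    return hash(key) % len(aMap)

# hash(20) % 10 == 0 and hash(30) % 10 == 0, so both land in bucket 0;
# a get() would then compare the stored keys inside that one bucket.
aMap[hash_key(aMap, 20)].append((20, 'first'))
aMap[hash_key(aMap, 30)].append((30, 'second'))
print aMap[0]   # [(20, 'first'), (30, 'second')]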
Consider two dictionaries as follows:
d1={"Name":"John","Age":47}
d2={"Name":"Margaret","Age":35}
On executing the following statement:
>>> cmp(d1, d2)
1
That implies that, since the keys are identical, it compares the values and gives priority to the value associated with the "Age" key (perhaps because it comes first lexicographically). This is supported by the fact that when I alter the dictionaries:
d1={"Name":"John","Age":47}
d2={"Name":"Jack","Age":47}
The statement returns 1, since the sum of the ASCII values is greater for d1.
But consider this pair of dictionaries:
d1={"Name":"John","Age":47}
d2={"Name":"Jzan","Age":47}
Now the statement returns -1.
Why is that? Is it that instead of comparing the sum of the ASCII values, it compares each character's value, one by one?
Also, if the keys themselves are different, on what basis does the function compare?
Most programming languages implement string comparison according to dictionary order (the way that words are ordered in a dictionary), i.e. they compare the characters' values one by one and return at the first difference.
If the keys themselves are different, the return value actually depends on the implementation. You can find more information here: Is there a description of how __cmp__ works for dict objects in Python 2?. However, it is not recommended to rely on this behaviour in your code.
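A quick way to see the character-by-character behaviour for yourself (Python 2, since cmp() was removed in Python 3):

# Comparison stops at the first differing character; summed
# ASCII values never enter into it.
print cmp("John", "Jack")   # 1:  'o' (111) > 'a' (97) at index 1
print cmp("John", "Jzan")   # -1: 'o' (111) < 'z' (122) at index 1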
I'm taking a first look at the Python language from the Python wikibook.
For sets the following is mentioned:
We can also have a loop move over each of the items in a set. However, since sets are unordered, it is undefined which order the iteration will follow.
and the code example given is:
s = set("blerg")
for letter in s:
    print letter
Output:
r b e l g
When I run the program I get the results in the same order, no matter how many times I run it. If sets are unordered and the order of iteration is undefined, why is it returning the set in the same order? And what is the basis of the order?
They are not randomly ordered; they are arbitrarily ordered. It means you should not count on the insertion order being maintained, because the internal implementation details determine the order instead.
The order depends on the insertion and deletion history of the set.
In CPython, sets use a hash table: inserted values are slotted into a sparse table based on the value returned by the hash() function, modulo the table size, with a collision-handling algorithm on top. Listing the set contents then returns the values in the order they sit in this table.
If you want to go into the nitty-gritty technical details then look at Why is the order in dictionaries and sets arbitrary?; sets are, at their core, dictionaries where the keys are the set values and there are no associated dictionary values. The actual implementation is a little more complicated, as always, but that answer will suffice to get you most of the way there. Then look at the C source code for set for the rest of those details.
Compare this to lists, which do have a fixed order that you can influence; you can move items around in the list and the new ordering would be maintained for you.
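A CPython-specific illustration of that table-driven order (small integers hash to themselves, so with the default 8-slot table they come out in slot order rather than insertion order):

s = set()
for n in [3, 1, 2]:
    s.add(n)
print s   # set([1, 2, 3]): slot order, not the 3, 1, 2 insertion order

The letters of "blerg" behave the same way, except their slots come from string hashes, which is why the output looks stable on one machine yet is still nothing to rely on.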
Just a quick question: I know that when looking up entries in a dictionary there's a fast, efficient way of doing it:
(Assuming the dictionary is ordered in some way using collections.OrderedDict())
You start at the middle of the dictionary, and find whether the desired key is off to one half or another, such as when testing the position of a name in an alphabetically ordered dictionary (or in rare cases dead on). You then check the next half, and continue this pattern until the item is found (meaning that with a dictionary of 1000000 keys you could effectively find any key within 20 iterations of this algorithm).
So I was wondering, if I were to use an in statement (i.e. if a in somedict:), would it use this same method of checking for the desired key? Does it use a faster/slower algorithm?
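For reference, what the question describes is a binary search; a minimal sketch over a sorted list of keys using the standard bisect module (the names are hypothetical):

import bisect

def contains(sorted_keys, key):
    # Halve the search range on each probe: ~20 probes for 1,000,000 keys.
    i = bisect.bisect_left(sorted_keys, key)
    return i < len(sorted_keys) and sorted_keys[i] == key

print contains(['alice', 'bob', 'carol'], 'bob')   # True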
Nope. Python's dictionaries basically use a hash table (actually a modified hash table, tuned for speed; I won't bother to explain hash tables here, as the linked Wikipedia article describes them well), which is a neat structure that allows ~O(1) (very fast) access. in performs the same lookup as dict[object], except that it reports whether the key was found instead of returning the associated value, so it is about as fast as a lookup can get.
The code for in for dictionaries contains this line, where dk_lookup() returns a hash table entry if it exists, and otherwise NULL (C's rough equivalent of None, often used to signal an error):
ep = (mp->ma_keys->dk_lookup)(mp, key, hash, &value_addr);
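At the Python level, the equivalence looks like this (a toy illustration, not CPython's actual code path):

d = {'a': 1}
print 'a' in d   # True: hash 'a', probe the table, report whether it's there
print d['a']     # 1: the same lookup, but the stored value is returned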
I'm implementing something like a cache, which works like this:
If a new value for the given key arrives from some external process, store that value, and remember the time when this value arrived.
If we are idle, find the oldest entry in the cache, fetch a new value for that key from the external source, and update the cache.
Return the value for the given key when asked.
I need a data structure to store key-value pairs that would allow me to perform the following operations as fast as possible (in order of speed priority):
Find the key with the lowest (unknown) value.
Update a value for the given key or add a new key-value pair if the key does not exist.
Other regular hash-table operations, like delete a key, check if a key exists, etc.
Are there any data structures that allow this? The problem here is that to perform the first query quickly I need something value-ordered, and to update the values for a given key quickly I need something key-ordered. The best solution I have so far is something like this:
Store the values in a regular hashtable, and the (value, key) pairs in a value-ordered heap. Finding the key for the lowest value then goes like this:
Find the key for the lowest value on the heap.
Find the value for that key from the hashtable.
If the values don't match pop the value from the heap and repeat from step 1.
Updating the values goes like this:
Store the value in the hashtable.
Push the new (value, key) pair to the heap.
Deleting a key is trickier and requires searching for the value in the heap. This gives something like O(log n) performance, but the solution seems cumbersome to me.
Are there any data structures which combine the properties of a hashtable for keys and a heap for the associated values? I'm programming in Python, so if there are existing implementations in Python, it is a big plus.
Most heap implementations will get you the lowest key in your collection in O(1) time, but there are no guarantees regarding the speed of random lookups or removals. I'd recommend pairing up two data structures: any simple heap implementation and any out-of-the-box hashtable.
Of course, any balanced binary search tree can be used in place of a heap, since the smallest and largest values sit at the left-most and right-most positions respectively. A red-black tree or an AVL tree should give you O(lg n) heap and dictionary operations.
I'd try:
import heapq

myheap = []   # (value, key) pairs, ordered by value
mydict = {}   # key -> current value

def push(key, val):
    heapq.heappush(myheap, (val, key))
    mydict[key] = val

def pop():
    # Lazy deletion, as the question describes: skip heap entries
    # whose value no longer matches the hashtable's current value.
    while myheap:
        val, key = heapq.heappop(myheap)
        if mydict.get(key) == val:
            del mydict[key]
            return (key, val)
    raise KeyError('pop from an empty structure')
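A quick check of the lazy deletion (hypothetical keys, assuming the sketch above):

push('a', 3)
push('b', 1)
push('a', 2)    # 'a' updated, so the (3, 'a') heap entry is now stale
print pop()     # ('b', 1)
print pop()     # ('a', 2); the stale (3, 'a') will be skipped by a later pop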
You're looking for a Map, or an associative array. To get more specific, we'd need to know what language you're trying to use.