Python Efficiency of the in statement

Just a quick question, I know that when looking up entries in a dictionary there's a fast efficient way of doing it:
(Assuming the dictionary is ordered in some way using collections.OrderedDict())
You start at the middle of the dictionary and determine whether the desired key falls in one half or the other (or, in rare cases, lands dead on), as when testing the position of a name in an alphabetically ordered dictionary. You then check the appropriate half and repeat the pattern until the item is found, meaning that with a dictionary of 1,000,000 keys you could find any key within about 20 iterations of this algorithm.
So I was wondering, if I were to use an in statement (i.e. if a in somedict:), would it use this same method of checking for the desired key? Does it use a faster/slower algorithm?

Nope. Python's dictionaries are built on a hash table (actually a modified hash table tuned for speed; I won't bother to explain hash tables here, since the linked Wikipedia article describes them well), a neat structure that allows roughly O(1) (very fast) access. in performs the same lookup that dict[object] does, except that it doesn't return the object, so it is already the optimal way of testing membership.
The CPython code for in on dictionaries contains this line; dk_lookup() returns a hash table entry if it exists, and otherwise NULL (C's rough equivalent of None, often used to signal an error or absence):
ep = (mp->ma_keys->dk_lookup)(mp, key, hash, &value_addr);
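As a rough, machine-dependent sketch (the sizes and iteration counts below are arbitrary choices, not anything from CPython), you can see the consequence of the hash lookup with timeit: an in test on a dict stays essentially flat as the dict grows, while the same test on a list grows with its length:

import timeit

for n in (1_000, 100_000):
    d = {i: None for i in range(n)}
    lst = list(range(n))
    # Probe a key that is present but far from the front of the list.
    dict_time = timeit.timeit(lambda: (n - 1) in d, number=100_000)
    list_time = timeit.timeit(lambda: (n - 1) in lst, number=100_000)
    print(f"n={n}: dict {dict_time:.4f}s  list {list_time:.4f}s")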

Related

Using tuples in a linked list in python

My teacher wants us to recreate the dict class in Python using tuples and linked lists (for collisions). One of the methods is supposed to return a value given a key. I know how to do this with a tuple (find the key at location[0] and return location[1]), but I have no idea how I would do this in the case of a collision. Any suggestions? If more info is needed, please let me know.
It sounds like you have some sort of hash to get a shortlist of possibilities: you hash your key to a small-ish number, e.g. 0-255 (as an example, a given key might hash to 63). You can then go directly to your data at index 63. Because more than one key might hash to 63, your entry for 63 will contain a list of (key, value) pairs that you have to search one by one - effectively, you've reduced your search area by 255/256ths of the full list. Optionally, when the number of collisions in a particular slot exceeds a threshold, you could repeat the process - so you get mydict[63][92], again reducing the problem size by the same factor. You could repeat this indefinitely.
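For illustration only (the class and method names below are invented, and plain Python lists stand in for the linked lists), a minimal sketch of that single-level scheme might look like this:

class TupleDict:
    """Toy hash map: a fixed number of buckets, each a list of (key, value) tuples."""

    def __init__(self, num_buckets=256):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # hash() narrowed down to a bucket index, as described above.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: replace its value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # new key (possibly a collision): append

    def get(self, key):
        for k, v in self._bucket(key):    # walk the short bucket list on collision
            if k == key:
                return v
        raise KeyError(key)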

Why does a set display in the same order if sets are unordered?

I'm taking a first look at the python language from Python wikibook.
For sets the following is mentioned:
We can also have a loop move over each of the items in a set. However, since sets are unordered, it is undefined which order the iteration will follow.
and the code example given is :
s = set("blerg")
for letter in s:
    print letter
Output:
r b e l g
When I run the program I get the results in the same order, no matter how many times I run. If sets are unordered and order of iteration is undefined, why is it returning the set in the same order? And what is the basis of the order?
They are not randomly ordered; they are arbitrarily ordered. That means you should not count on the insertion order being maintained, because the internal implementation details determine the order instead.
The order depends on the insertion and deletion history of the set.
In CPython, sets use a hash table, where inserted values are slotted into a sparse table based on the value returned from the hash() function, modulo the table size and a collision handling algorithm. Listing the set contents then returns the values as ordered in this table.
If you want to go into the nitty-gritty technical details then look at Why is the order in dictionaries and sets arbitrary?; sets are, at their core, dictionaries where the keys are the set values and there are no associated dictionary values. The actual implementation is a little more complicated, as always, but that answer will suffice to get you most of the way there. Then look at the C source code for set for the rest of those details.
Compare this to lists, which do have a fixed order that you can influence; you can move items around in the list and the new ordering would be maintained for you.
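As a CPython-specific illustration (the exact orders can differ between versions, builds and runs, so treat this only as a sketch), small integers hash to themselves, which makes the table order and the effect of insertion history visible:

# CPython detail: small ints hash to themselves, and a new set's table has
# 8 slots, so 1, 9 and 17 all want slot 1 and collide.
a = {1, 9, 17}
print(list(a))            # order reflects table slots plus collision probing

b = set()
for x in (17, 9, 1):      # same elements, different insertion history
    b.add(x)
print(list(b))            # may iterate in a different order than `a`

# Strings are hash-randomized per process (Python 3.3+), so this order can
# even change between runs of the same program:
print(list(set("blerg")))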

Look up python dict value by expression

I have a dict that has unix epoch timestamps for keys, like so:
lookup_dict = {
    1357899: {},  # some dict of data
    1357910: {},  # some other dict of data
}
Except, you know, millions and millions and millions of entries. I'd like to subset this dict, over and over again. Ideally, I'd love to be able to write something like I can in R, like:
lookup_value = 1357900
dict_subset = lookup_dict[key >= lookup_value]
# dict_subset now contains {1357910: {}}
But I confess, I can't find any actual proof that this is something Python can do without having, one way or the other, to iterate over every row. If I understand Python correctly (and I might not), key lookup of the form key in dict uses binary search, and is thus very fast; any way to do a binary search, on dict keys?
To do this without iterating, you're going to need the keys in sorted order. Then you just need to do a binary search for the first one >= lookup_value, instead of checking each one for >= lookup_value.
If you're willing to use a third-party library, there are plenty out there. The first two that spring to mind are bintrees (which uses a red-black tree, like C++'s std::map or Java's TreeMap) and blist (which uses a B+Tree). For example, with bintrees, it's as simple as this:
dict_subset = lookup_dict[lookup_value:]
And this will be as efficient as you'd hope—basically, it adds a single O(log N) search on top of whatever the cost of using that subset. (Of course usually what you want to do with that subset is iterate the whole thing, which ends up being O(N) anyway… but maybe you're doing something different, or maybe the subset is only 10 keys out of 1000000.)
Of course there is a tradeoff. Random access to a tree-based mapping is O(log N) instead of "usually O(1)". Also, your keys obviously need to be fully ordered, instead of hashable (and that's a lot harder to detect automatically and raise nice error messages on).
If you want to build this yourself, you can. You don't even necessarily need a tree; just a sorted list of keys alongside a dict. You can maintain the list with the bisect module in the stdlib, as JonClements suggested. You may want to wrap up bisect to make a sorted list object—or, better, get one of the recipes on ActiveState or PyPI to do it for you. You can then wrap the sorted list and the dict together into a single object, so you don't accidentally update one without updating the other. And then you can extend the interface to be as nice as bintrees, if you want.
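As a hedged sketch of that do-it-yourself route (the class name and methods are invented here for illustration, not taken from any library), you keep a sorted list of keys next to the dict and let bisect do the O(log N) work of finding the cut point:

import bisect

class SortedKeyDict:
    """Toy wrapper: a dict plus a sorted list of its keys, kept in sync."""

    def __init__(self):
        self._data = {}
        self._keys = []                      # always kept sorted

    def __setitem__(self, key, value):
        if key not in self._data:
            bisect.insort(self._keys, key)   # O(log N) search, O(N) insert
        self._data[key] = value

    def __getitem__(self, key):
        return self._data[key]

    def subset_from(self, lookup_value):
        """All items with key >= lookup_value, found via one binary search."""
        start = bisect.bisect_left(self._keys, lookup_value)
        return {k: self._data[k] for k in self._keys[start:]}

lookup_dict = SortedKeyDict()
lookup_dict[1357899] = {}
lookup_dict[1357910] = {}
print(lookup_dict.subset_from(1357900))      # {1357910: {}}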
The following code will also do the job:
some_time_to_filter_for = 1357900  # e.g. the lookup_value from the question

# Create a new sub-dictionary
sub_dict = {key: val for key, val in lookup_dict.items()
            if key >= some_time_to_filter_for}
Basically we just iterate through all the keys in your dictionary and, given a time to filter on, keep every key that is greater than or equal to that value in the new dictionary. Note that this does visit every entry, so each subset costs O(N).

array vs hash key search

So I'm a longtime perl scripter who's been getting used to python since I changed jobs a few months back. Often in perl, if I had a list of values that I needed to check a variable against (simply to see if there is a match in the list), I found it easier to generate hashes to check against, instead of putting the values into an array, like so:
$checklist{'val1'} = undef;
$checklist{'val2'} = undef;
...
if (exists $checklist{$value_to_check}) { ... }
Obviously this wastes some memory because of the need for a useless right-hand value, but IMO it is more efficient and easier to code than looping through an array.
Now in python, the code for this is exactly the same whether you're searching a list or a dictionary:
if value_to_check in checklist_which_can_be_list_or_dict:
    <code>
So my real question here is: in perl, the hash method was preferred for speed of processing vs. iterating through an array, but is this true in python? Given the code is the same, I'm wondering if python does list iteration better? Should I still use the dictionary method for larger lists?
Dictionaries are hashes. An in test on a list has to walk through every element to check it against, while an in test on a dictionary uses hashing to see if the key exists. Python just doesn't make you explicitly loop through the list.
Python also has a set datatype. It's basically a hash/dictionary without the right-hand values. If what you want is to be able to build up a collection of things, then test whether something is already in that collection, and you don't care about the order of the things or whether a thing is in the collection multiple times, then a set is exactly what you want!
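A hedged sketch of the direct translation of the Perl idiom (the variable names here are invented): build the collection as a set once, then test membership in roughly constant time:

# `in` on a set is a hash lookup, not a scan.
checklist = {"val1", "val2", "val3"}

value_to_check = "val2"
if value_to_check in checklist:
    print("found it")

# Contrast: `in` on a list works too, but walks the elements one by one.
checklist_as_list = ["val1", "val2", "val3"]
if value_to_check in checklist_as_list:
    print("found it (eventually)")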

What is a hashtable/dictionary implementation for Python that doesn't store the keys?

I'm storing millions, possibly billions of 4-byte values in a hashtable and I don't want to store any of the keys. I expect that only the hashes of the keys and the values will have to be stored. This has to be fast and all kept in RAM. The entries would still be looked up with the key, unlike with a set().
What is an implementation of this for Python? Is there a name for this?
Yes, collisions are allowed and can be ignored.
(I can make an exception for collisions, the key can be stored for those. Alternatively, collisions can just overwrite the previously stored value.)
Bloomier filters - space-efficient associative array
From Wikipedia:
Chazelle et al. (2004) designed a generalization of Bloom filters that could associate a value with each element that had been inserted, implementing an associative array. Like Bloom filters, these structures achieve a small space overhead by accepting a small probability of false positives. In the case of "Bloomier filters", a false positive is defined as returning a result when the key is not in the map. The map will never return the wrong value for a key that is in the map.
How about using an ordinary dictionary and instead of doing:
d[x]=y
use:
d[hash(x)]=y
To look up:
d[hash(foo)]
Of course, if there is a hash collision, you may get the wrong value back.
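A hedged sketch of that idea as a tiny wrapper class (the name HashOnlyDict is made up; on a hash collision the later value silently overwrites the earlier one, which the question allows):

class HashOnlyDict:
    """Stores hash(key) -> value; the keys themselves are never kept."""

    def __init__(self):
        self._d = {}

    def __setitem__(self, key, value):
        # Two keys with the same hash() overwrite each other (accepted tradeoff).
        self._d[hash(key)] = value

    def __getitem__(self, key):
        return self._d[hash(key)]          # may return a colliding key's value

    def __contains__(self, key):
        return hash(key) in self._d

d = HashOnlyDict()
d["some very long key string"] = 42
print(d["some very long key string"])      # 42

Note that each hash is itself stored as a Python int, so this only saves memory to the extent that the original keys are larger than that.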
It's the good old space vs. runtime tradeoff: you can have constant-time access with linear space usage for the keys in a hash table, or you can store the key implicitly and pay O(log n) time by using a binary tree, where the (binary) hash of a value gives you the path in the tree where it will be stored.
Build your own b-tree in RAM.
Memory use per node:
(4 bytes) comparison hash value
(4 bytes) index of the next node if hash <= comparison, or (if negative) the index of a value
(4 bytes) index of the next node if hash > comparison, or (if negative) the index of a value
That is 12 bytes per b-tree node, plus more overhead for the values themselves (see below).
How you structure this in Python: aren't there "native arrays" of 32-bit integers supported with almost no extra memory overhead? What are they called... anyway, those.
Keep a separate ordered array of subarrays, each containing one or more values. The "indexes of value" above are indexes into this big array, allowing retrieval of all values matching a given hash.
This assumes a 32-bit hash. You will need more bytes per b-tree node if you have more than 2^31 - 1 entries or a larger hash.
BUT, a possible spanner in the works: note that if you are not storing the keys, you will not be able to verify that a looked-up hash value corresponds only to your key, unless through some algorithmic or organisational mechanism you have guaranteed that no two keys will have the same hash. Quite a serious issue here. Have you considered it? :)
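Not the tree itself, but a hedged sketch of the same store-only-hashes idea in flat form (all names here are invented for illustration), using the stdlib array module for compact integer storage and bisect for the lookups:

import bisect
from array import array

class HashIndex:
    """Sorted array of truncated hashes with a parallel list of values; keys are not stored."""

    def __init__(self, items):
        # Build once from (key, value) pairs, sorted by 32-bit-truncated hash.
        pairs = sorted(((hash(k) & 0xFFFFFFFF, v) for k, v in items),
                       key=lambda p: p[0])
        self._hashes = array("L", (h for h, _ in pairs))   # 'L' = unsigned long
        self._values = [v for _, v in pairs]

    def get(self, key):
        h = hash(key) & 0xFFFFFFFF
        i = bisect.bisect_left(self._hashes, h)
        if i < len(self._hashes) and self._hashes[i] == h:
            return self._values[i]          # may be a colliding key's value!
        raise KeyError(key)

idx = HashIndex([("alpha", 1), ("beta", 2), ("gamma", 3)])
print(idx.get("beta"))   # 2 (assuming no 32-bit hash collision)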
Although python dictionaries are very efficient, I think that if you're going to store billions of items, you may want to create your own C extension with data structures, optimized for the way you are actually using it (sequential access? completely random? etc).
In order to create a C extension, you may want to use SWIG, or something like Pyrex (which I've never used).
A hash table has to store keys, unless you provide a hash function that gives absolutely no collisions, which is nearly impossible.
If your keys are string-like, however, there is a very space-efficient data structure: the directed acyclic word graph (DAWG). I don't know of any Python implementation, though.
It's not what you asked for, but why not consider Tokyo Cabinet or BerkeleyDB for this job? It won't be in memory, but you are trading performance for greater storage capacity. You could still keep your list in memory and use the database only to check existence.
Would you please tell us more about the keys? I'm wondering if there is any regularity in the keys that we could exploit.
If the keys are strings over a small alphabet (for example, strings of digits, like phone numbers) you could use a trie data structure:
http://en.wikipedia.org/wiki/Trie
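As a hedged sketch (nested plain dicts stand in for trie nodes, and the sentinel name is arbitrary), a minimal digit-string trie might look like this:

_END = object()   # sentinel marking "a key ends here"

def trie_insert(root, key, value):
    """Insert a string key into a trie of nested dicts."""
    node = root
    for ch in key:
        node = node.setdefault(ch, {})
    node[_END] = value

def trie_get(root, key):
    node = root
    for ch in key:
        if ch not in node:
            raise KeyError(key)
        node = node[ch]
    if _END not in node:
        raise KeyError(key)
    return node[_END]

root = {}
trie_insert(root, "5551234", "Alice")
trie_insert(root, "5559876", "Bob")
print(trie_get(root, "5559876"))   # Bob; the shared prefix "555" is stored only once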
If you're actually storing millions of unique values, why not use a dictionary as a bitmap?
Store: d[hash(key) // 32] |= 2 ** (hash(key) % 32)
Check: d[hash(key) // 32] & 2 ** (hash(key) % 32)
If you have billions of entries, use a numpy array of size (2**32) / 32, instead. (Because, after all, you only have 4 billion possible hash values to store, anyway.)
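A hedged, runnable version of that bitmap idea (defaultdict is one way to make missing slots read as zero; note this records only the presence of hashes, so collisions show up as false positives):

from collections import defaultdict

bitmap = defaultdict(int)   # slot index -> 32-bit chunk of presence bits

def add(key):
    h = hash(key) & 0xFFFFFFFF               # truncate the hash to 32 bits
    bitmap[h // 32] |= 1 << (h % 32)

def probably_contains(key):
    h = hash(key) & 0xFFFFFFFF
    return bool(bitmap.get(h // 32, 0) & (1 << (h % 32)))

add("foo")
print(probably_contains("foo"))   # True
print(probably_contains("bar"))   # almost certainly False (unless the hashes collide)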
Why not a dictionary + hashlib?
>>> import hashlib
>>> hashtable = {}
>>> def myHash(obj):
...     return hashlib.sha224(obj).hexdigest()
...
>>> hashtable[myHash("foo")] = 'bar'
>>> hashtable
{'0808f64e60d58979fcb676c96ec938270dea42445aeefcd3a4e6f8db': 'bar'}
