array vs hash key search - python

So I'm a longtime perl scripter who's been getting used to python since I changed jobs a few months back. Often in perl, if I had a list of values that I needed to check a variable against (simply to see if there is a match in the list), I found it easier to generate hashes to check against, instead of putting the values into an array, like so:
$checklist{'val1'} = undef;
$checklist{'val2'} = undef;
...
if (exists $checklist{$value_to_check}) { ... }
Obviously this wastes some memory because of the need for a useless right-hand value, but IMO it is more efficient and easier to code than looping through an array.
Now in python, the code for this is exactly the same whether you're searching a list or a dictionary:
if value_to_check in checklist_which_can_be_list_or_dict:
    # do something
So my real question here is: in perl, the hash method was preferred for speed of processing vs. iterating through an array, but is this true in python? Given that the code is the same, I'm wondering whether python does list iteration better. Should I still use the dictionary method for larger lists?

Dictionaries are hashes. An in test on a list has to walk through every element to check it against, while an in test on a dictionary uses hashing to see if the key exists. Python just doesn't make you explicitly loop through the list.
Python also has a set datatype. It's basically a hash/dictionary without the right-hand values. If what you want is to be able to build up a collection of things, then test whether something is already in that collection, and you don't care about the order of the things or whether a thing is in the collection multiple times, then a set is exactly what you want!
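For example, a minimal sketch of the set approach (the names here are illustrative, not from the question):
import timeit  # only needed if you want to time the difference yourself

# A set gives ~O(1) average-time membership tests; a list would be
# scanned element by element on every "in" check.
checklist = {"val1", "val2", "val3"}

value_to_check = "val2"
if value_to_check in checklist:   # hash lookup, no explicit loop needed
    print("found it")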

Related

Solutions for a Dynamic Infinite Tree Structure in Python

I am trying to build a Tree Structure, starting at a point 1, which can branch into infinite directions. Every point can path into infinite other points ( 1.1, 1.2, 1.3, ... ) and each of those points can also path into infinite points (1.1.1, 1.2.1, 1.2.2, ...).
My plan was to store an Object at every point and be able to refer to them by a position 1.1.1 etc. Also I decided to generate every point dynamically, so the Tree starts at 1 and only branches when an Object is created.
Since I tend to overcomplicate things I used a nested Dictionary, so I could refer to an object by using dict[1][1]["data"], but I'm struggling with the use of an infinitely nested Dictionary:
How do I use a Dictionary if the number of "[1]" lookups varies? (think dict[1][1][1]....[1]["data"]).
I can simply loop through the key parts to find the data, like
point = tree
for i in [1, 1, 1]:
    point = point[i]
But I can't find a way to open new dictionary branches, or store data, when the number of "[1]" lookups is unknown.
Basically, I want to know if a simpler solution exists and how to deal with too many nested "[]" brackets.
You might want a different way of retrieving values than using [], since as you said it's hard to do when you don't know how deep something is.
Instead you can use a simple recursive function, and use a list for your key instead of a string:
def fetch_field(subtree, key_list):
    if not key_list:
        return subtree["data"]
    return fetch_field(subtree[key_list[0]], key_list[1:])
key = "1.2.1.3"
# Instead of using a string, split it into a list:
key = key.split(".")
fetch_field(tree, key)
You can tweak the function to accept a string instead of an array if you like, I personally prefer working with a list instead of messing around with strings.
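If you also need to create branches and store data at an arbitrary depth (the other half of the question), a similar recursive helper works. This is only a rough sketch along the same lines as fetch_field; the name store_field is illustrative, not part of the original answer:
def store_field(subtree, key_list, value):
    # Walk (and create) one level per key, then store the value at the end.
    if not key_list:
        subtree["data"] = value
        return
    branch = subtree.setdefault(key_list[0], {})  # create the branch if missing
    store_field(branch, key_list[1:], value)

tree = {}
store_field(tree, "1.2.1.3".split("."), "some object")
print(fetch_field(tree, "1.2.1.3".split(".")))  # -> 'some object'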

Am I using `all` correctly?

A user asked (Keyerror while using pandas in PYTHON 2.7) why he was having a KeyError while looking in a dictionary and how he could avoid this exception.
As an answer, I suggested that he check for the keys in the dictionary first. So, if he needed all the keys ['key_a', 'key_b', 'key_c'] in the dictionary, he could test it with:
if not all([x in dictionary for x in ['key_a', 'key_b', 'key_c']]):
    continue
This way he could ignore dictionaries that didn't have the expected keys (the list of dictionaries is created out of JSON formatted lines loaded from a file). *Refer to the original question for more details, if relevant to this question.
A user more experienced in Python and SO, whom I would consider an authority on the matter given his career and gold badges, told me I was using all incorrectly. I was wondering if this is really the case (from what I can tell, it works as expected) and why, or if there is a better way to check whether a couple of keys are all in a dictionary.
Yes, that will work fine, but you don't even need the list comprehension:
if not all(x in dictionary for x in ['key_a', 'key_b', 'key_c']):
    continue
If you have the surrounding [], it will evaluate all the elements before calling all. If you remove them, the inner expression is a generator, and will short-circuit upon the first False.
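You can see the short-circuiting yourself with a small sketch (the noisy check helper below is only for illustration):
def check(x):
    print("checking", x)
    return x in {"key_a"}

keys = ["key_a", "key_b", "key_c"]

# List comprehension: every element is evaluated before all() runs.
all([check(x) for x in keys])   # prints key_a, key_b, key_c

# Generator expression: all() stops at the first False.
all(check(x) for x in keys)     # prints key_a, key_b only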

Python Efficiency of the in statement

Just a quick question, I know that when looking up entries in a dictionary there's a fast efficient way of doing it:
(Assuming the dictionary is ordered in some way using collections.OrderedDict())
You start at the middle of the dictionary, and find whether the desired key is off to one half or another, such as when testing the position of a name in an alphabetically ordered dictionary (or in rare cases dead on). You then check the next half, and continue this pattern until the item is found (meaning that with a dictionary of 1000000 keys you could effectively find any key within 20 iterations of this algorithm).
So I was wondering, if I were to use an in statement (i.e. if a in somedict:), would it use this same method of checking for the desired key? Does it use a faster/slower algorithm?
Nope. Python's dictionaries basically use a hash table (it actually uses a modified hash table to improve speed) (I won't bother to explain a hash table; the linked Wikipedia article describes it well), which is a neat structure that allows ~O(1) (very fast) access. in looks up the object (the same thing that dict[object] does) except it doesn't return the object, which is about as optimal as it gets.
The code for in for dictionaries contains this line (dk_lookup() returns a hash table entry if it exists, otherwise NULL (the equivalent of None in C, often indicating an error)):
ep = (mp->ma_keys->dk_lookup)(mp, key, hash, &value_addr);
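If you just want to see the practical effect from Python rather than reading the C source, a rough timing sketch like this (sizes and names are illustrative) shows the near-constant-time dict lookup against the linear scan of a list:
import timeit

n = 1000000
somedict = {i: None for i in range(n)}
somelist = list(range(n))

# Worst case for the list: the element we look for is at the very end.
print(timeit.timeit(lambda: n - 1 in somedict, number=100))  # ~O(1) per lookup
print(timeit.timeit(lambda: n - 1 in somelist, number=100))  # ~O(n) per lookup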

Look up python dict value by expression

I have a dict that has unix epoch timestamps for keys, like so:
lookup_dict = {
    1357899: {},  # some dict of data
    1357910: {},  # some other dict of data
}
Except, you know, millions and millions and millions of entries. I'd like to subset this dict, over and over again. Ideally, I'd love to be able to write something like I can in R, like:
lookup_value = 1357900
dict_subset = lookup_dict[key >= lookup_value]
# dict_subset now contains {1357910: {}}
But I confess, I can't find any actual proof that this is something Python can do without having, one way or the other, to iterate over every row. If I understand Python correctly (and I might not), key lookup of the form key in dict uses binary search, and is thus very fast; any way to do a binary search, on dict keys?
To do this without iterating, you're going to need the keys in sorted order. Then you just need to do a binary search for the first one >= lookup_value, instead of checking each one for >= lookup_value.
If you're willing to use a third-party library, there are plenty out there. The first two that spring to mind are bintrees (which uses a red-black tree, like C++, Java, etc.) and blist (which uses a B+Tree). For example, with bintrees, it's as simple as this:
dict_subset = lookup_dict[lookup_value:]
And this will be as efficient as you'd hope—basically, it adds a single O(log N) search on top of whatever the cost of using that subset. (Of course usually what you want to do with that subset is iterate the whole thing, which ends up being O(N) anyway… but maybe you're doing something different, or maybe the subset is only 10 keys out of 1000000.)
Of course there is a tradeoff. Random access to a tree-based mapping is O(log N) instead of "usually O(1)". Also, your keys obviously need to be fully ordered, instead of hashable (and that's a lot harder to detect automatically and raise nice error messages on).
If you want to build this yourself, you can. You don't even necessarily need a tree; just a sorted list of keys alongside a dict. You can maintain the list with the bisect module in the stdlib, as JonClements suggested. You may want to wrap up bisect to make a sorted list object—or, better, get one of the recipes on ActiveState or PyPI to do it for you. You can then wrap the sorted list and the dict together into a single object, so you don't accidentally update one without updating the other. And then you can extend the interface to be as nice as bintrees, if you want.
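As a minimal sketch of that do-it-yourself approach (assuming the lookup_dict from the question and keys that don't change often; the helper name subset_from is only for illustration):
import bisect

# Keep a sorted list of keys next to the dict; both must be kept in sync.
sorted_keys = sorted(lookup_dict)

def subset_from(lookup_value):
    # Binary search for the first key >= lookup_value: O(log N).
    start = bisect.bisect_left(sorted_keys, lookup_value)
    return {k: lookup_dict[k] for k in sorted_keys[start:]}

dict_subset = subset_from(1357900)   # -> {1357910: {}}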
The following code will also work:
some_time_to_filter_for = 1357900  # some unix time
# Create a new sub-dictionary
sub_dict = {key: val for key, val in lookup_dict.items()
            if key >= some_time_to_filter_for}
Basically we just iterate through all the keys in your dictionary and, given a time to filter for, take all the keys that are greater than or equal to that value and place them into the new dictionary.

Python: Nested for loops or "next" statement

I'm a rookie hobbyist and I nest for loops when I write python, like so:
dict = {
    key1: {subkey/value1: value2},
    ...
    keyn: {subkeyn/valuen: valuen+1},
}
for key in dict:
    for subkey/value in key:
        do it to it
I'm aware of a "next" keyword that would accomplish the same goal in one line (I asked a question about how to use it but I didn't quite understand it).
So to me, a nested for loop is much more readable. Why, then, do people use "next"? I read somewhere that Python is a dynamically-typed and interpreted language, and because + both concatenates strings and sums numbers, it must check variable types for each loop iteration in order to know what the operators do, etc. Does using "next" prevent this in some way, speeding up the execution, or is it just a matter of style/preference?
next is precious to advance an iterator when necessary, without that advancement controlling an explicit for loop. For example, if you want "the first item in S that's greater than 100", next(x for x in S if x > 100) will give it to you, no muss, no fuss, no unneeded work (as everything terminates as soon as a suitable x is located) -- and you get an exception (StopIteration) if unexpectedly no x matches the condition. If a no-match is expected and you want None in that case, next((x for x in S if x > 100), None) will deliver that. For this specific purpose, it might be clearer to you if next was actually named first, but that would betray its much more general use.
Consider, for example, the task of merging multiple sequences (e.g., a union or intersection of sorted sequences -- say, sorted files, where the items are lines). Again, next is just what the doctor ordered, because none of the sequences can dominate over the others by controlling a "main" for loop. So, assuming for simplicity no duplicates can exist (a condition that's not hard to relax if needed), you keep pairs (currentitem, itsfile) in a list controlled by heapq, and the merging becomes easy... but only thanks to the magic of next to advance the correct file once its item has been used, and that file only.
import heapq
def merge(*theopentextfiles):
    theheap = []
    for afile in theopentextfiles:
        theitem = next(afile, '')
        if theitem: theheap.append((theitem, afile))
    heapq.heapify(theheap)
    while theheap:
        theitem, afile = heapq.heappop(theheap)
        yield theitem
        theitem = next(afile, '')
        if theitem: heapq.heappush(theheap, (theitem, afile))
Just try to do anything anywhere this elegant without next...!-)
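For instance, a usage sketch of the merge generator above (the file names are made up; it assumes each file's lines are already sorted):
# Merge two sorted text files line by line.
with open('a.txt') as f1, open('b.txt') as f2:
    for line in merge(f1, f2):
        print(line, end='')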
One could go on for a long time, but the two use cases "advance an iterator by one place (without letting it control a whole for loop)" and "get just the first item from an iterator" account for most important uses of next.
