Python: identity in sets of objects, and hashing

How are __hash__ and __eq__ used to identify objects in sets?
For example, here is some code that should help solve a domino puzzle:
class foo(object):
    def __init__(self, one, two):
        self.one = one
        self.two = two
    def __eq__(self, other):
        if (self.one == other.one) and (self.two == other.two): return True
        if (self.two == other.one) and (self.one == other.two): return True
        return False
    def __hash__(self):
        return hash(self.one + self.two)

s = set()
for i in range(7):
    for j in range(7):
        s.add(foo(i, j))

len(s)  # returns 28. Why?
If I use only __eq__(), len(s) equals 49. That's OK because, as I understand it, the objects (1-2 and 2-1, for example) are not the same, but they represent the same domino. So I added a hash function.
Now it works the way I want, but there's one thing I don't understand: the hashes of 1-3 and 2-2 are the same, so they should be counted as the same object and not both be added to the set. But they both are! I'm stuck.

Equality for dict/set purposes is determined by __eq__. However, objects that compare equal are required to have the same hash value, and that is why you need __hash__. See this question for some similar discussion.
The hash itself does not determine whether two objects count as the same in dictionaries. The hash is like a "shortcut" that only works one way: if two objects have different hashes, they are definitely not equal; but if they have the same hash, they still might not be equal.
In your example, you defined __hash__ and __eq__ to do different things. The hash depends only on the sum of the numbers on the domino, but the equality depends on both individual numbers (in order). This is legal, since it is still the case that equal dominoes have equal hashes. However, like I said above, it doesn't mean that equal-sum dominoes will be considered equal. Some unequal dominoes will still have equal hashes. But equality is still determined by __eq__, and __eq__ still looks at both numbers, in order, so that's what determines whether they are equal.
It seems to me that the appropriate thing to do in your case is to define both __hash__ and __eq__ to depend on the ordered pair --- that is, first compare the greater of the two numbers, then compare the lesser. This will mean that 2-1 and 1-2 will be considered the same.
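For instance, here is a minimal sketch of that approach (my naming, not the original poster's code): base both methods on the normalized pair.
class Domino(object):
    def __init__(self, one, two):
        self.one = one
        self.two = two
    def _key(self):
        # normalize so 1-2 and 2-1 produce the same (smaller, larger) pair
        return (min(self.one, self.two), max(self.one, self.two))
    def __eq__(self, other):
        return self._key() == other._key()
    def __hash__(self):
        return hash(self._key())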

The hash is only a hint to help Python arrange the objects. When looking for some object foo in a set, it still has to check each object in the set with the same hash as foo.
It's like having a bookshelf for every letter of the alphabet. Say you want to add a new book to your collection, only if you don't have a copy of it already; you'd first go to the shelf for the appropriate letter. But then you have to look at each book on the shelf and compare it to the one in your hand, to see if it's the same. You wouldn't discard the new book just because there's something already on the shelf.
If you want to use some other value to filter out "duplicates", then use a dict that maps the domino's total value to the first domino you saw. Don't subvert builtin Python behavior to mean something entirely different. (As you've discovered, it doesn't work in this case, anyway.)
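A minimal sketch of that dict idea (my illustration; dominoes stands in for any iterable of the foo objects above):
first_by_total = {}
for d in dominoes:
    total = d.one + d.two
    if total not in first_by_total:
        first_by_total[total] = d  # keep only the first domino seen per total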

The requirement for hash functions is that if x == y for two values, then hash(x) == hash(y). The reverse need not be true.
You can easily see why this is the case by considering the hashing of strings. Let's say that hash(str) returns a 32-bit number, and we are hashing strings longer than 4 characters (i.e. containing more than 32 bits). There are more possible strings than there are possible hash values, so some non-equal strings must share the same hash (this is an application of the pigeonhole principle).
Python sets are implemented as hash tables. When checking whether an object is a member of the set, it will call its hash function and use the result to pick a bucket, and then use the equality operator to see if it matches any of the items in the bucket.
With your implementation, the 2-2 and 1-3 dominoes will end up in the same hash bucket, but they don't compare equal. Therefore, both can be added to the set.
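A simplified sketch of what a membership test does (my illustration, not CPython's actual implementation; buckets is a hypothetical list of lists):
def bucket_contains(buckets, obj):
    bucket = buckets[hash(obj) % len(buckets)]  # the hash only picks the bucket
    return any(item == obj for item in bucket)  # __eq__ makes the final decision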

You can read about this in the Python data model documentation, but the short answer is that you can rewrite your hash function as:
def __hash__(self):
    return hash(tuple(sorted((self.one, self.two))))

I like the sound of the answer provided by Eevee, but I had difficulty imagining an implementation. Here's my interpretation, explanation and implementation of that answer:
Use the sum of the two domino values as the dictionary key.
Store either of the domino values as the dictionary value.
For example, given the domino '12', the sum is 3, and therefore the dictionary key will be 3. We can then pick either value (1 or 2) to store in that position (we'll pick the first value, 1).
domino_pairs = {}
pair = '12'
pair_key = sum(map(int, pair))
domino_pairs[pair_key] = int(pair[0]) # Store the first pair's first value.
print domino_pairs
Outputs:
{3: 1}
Although we're only storing a single value from the domino pair, the other value can easily be calculated from the dictionary key and value:
pair = '12'
pair_key = sum(map(int, pair))
domino_pairs[pair_key] = int(pair[0]) # Store the first pair's first value.
# Retrieve the other value from the dictionary.
print pair_key - domino_pairs[pair_key]  # 3 - 1 = 2
Outputs:
2
But, since two different pairs may have the same total, we need to store multiple values against a single key. So, we store a list of values against each key (the sum of a pair's two values). Putting this into a function:
def add_pair(dct, pair):
    pair_key = sum(map(int, pair))
    if pair_key not in dct:
        dct[pair_key] = []
    dct[pair_key].append(int(pair[0]))
domino_pairs = {}
add_pair(domino_pairs, '22')
add_pair(domino_pairs, '04')
print domino_pairs
Outputs:
{4: [2, 0]}
This makes sense. Both pairs sum to 4, yet the first value in each pair differs, so we store both. However, the implementation so far allows duplicates:
domino_pairs = {}
add_pair(domino_pairs, '40')
add_pair(domino_pairs, '04')
print domino_pairs
Outputs:
{4: [4, 0]}
'40' and '04' are the same domino, so we don't need to store both. We need a way of checking for duplicates. To do this we'll define a new function, has_pair:
def has_pair(dct, pair):
    pair_key = sum(map(int, pair))
    if pair_key not in dct:
        return False
    return (int(pair[0]) in dct[pair_key] or
            int(pair[1]) in dct[pair_key])
As before, we compute the sum (our dictionary key). If it's not in the dictionary, then the pair cannot exist. If it is in the dictionary, we check whether either value of our pair exists in the dictionary 'bucket'; for a fixed sum, one matching value implies the other matches too. Let's insert this check into add_pair, so we don't add duplicate domino pairs:
def add_pair(dct, pair):
    pair_key = sum(map(int, pair))
    if has_pair(dct, pair):
        return
    if pair_key not in dct:
        dct[pair_key] = []
    dct[pair_key].append(int(pair[0]))
Now duplicate domino pairs are handled correctly:
domino_pairs = {}
add_pair(domino_pairs, '40')
add_pair(domino_pairs, '04')
print domino_pairs
Outputs:
{4: [4]}
Lastly, a print function shows that storing only the sum of a domino pair plus a single value from that pair is equivalent to storing the pair itself:
def print_pairs(dct):
    for total in dct:
        for a in dct[total]:
            b = total - a
            print '(%d, %d)' % (a, b)
Testing:
domino_pairs = {}
add_pair(domino_pairs, '40')
add_pair(domino_pairs, '04')
add_pair(domino_pairs, '23')
add_pair(domino_pairs, '50')
print_pairs(domino_pairs)
Outputs:
(4, 0)
(2, 3)
(5, 0)
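As an aside, the same bookkeeping can be written more compactly with a defaultdict. This is my sketch, intended to behave the same as has_pair plus add_pair above:
from collections import defaultdict

domino_pairs = defaultdict(list)

def add_pair(dct, pair):
    a, b = int(pair[0]), int(pair[1])
    key = a + b
    # for a fixed sum, one matching value implies the same domino
    if a in dct[key] or b in dct[key]:
        return
    dct[key].append(a)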

Related

Time Complexity of getting value when key is very long

Let's assume that the keys of a dictionary are very long, with length around M, where M is a very large number.
Does that mean that, in terms of M, the time complexity of operations like
x = dic[key]
print(dic[key])
is O(M), not O(1)?
How does it work?
If you're talking about string keys with M characters, yes, it can be O(M), and on two counts:
Computing the hash code can take O(M) time.
If the hash code of the key passed in matches the hash code of a key in the table, the implementation goes on to compute whether they're equal (what __eq__() returns). If they are equal, that takes exactly M+1 comparisons: M for each character pair, plus one comparison at the start to verify that the (integer) string lengths are the same.
In rare cases it can be constant-time (independent of string length): those where the passed-in key is the same object as a key in the table. For example, in
k = "a" * 10000000
d = {k : 1}
print(k in d)
Time it, and compare to the timing when, e.g., you add this line before the print:
k = k[:-1] + "a"
That change builds another key equal to the original k, but it is not the same object. So the internal pointer-equality test doesn't succeed, and a full-blown character-by-character comparison is needed.
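A rough way to time the difference (my sketch; absolute numbers will vary by machine):
import timeit

setup = 'k = "a" * 10000000; d = {k: 1}'
# same object: the pointer-equality test short-circuits the comparison
print(timeit.timeit('k in d', setup=setup, number=100))
# equal but distinct object: a full O(M) character comparison on every lookup
print(timeit.timeit('k in d', setup=setup + '; k = k[:-1] + "a"', number=100))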

Custom Sort Complicated Strings in Python

I have a list of filenames conforming to the pattern: s[num][alpha1][alpha2].ext
I need to sort, first by the number, then by alpha1, then by alpha2. The last two aren't alphabetical, however, but rather should reflect a custom ordering.
I've created two lists representing the ordering for alpha1 and alpha2, like so:
alpha1Order = ["Fizz", "Buzz", "Ipsum", "Dolor", "Lorem"]
alpha2Order = ["Sit", "Amet", "Test"]
What's the best way to proceed? My first thought was to tokenize (somehow) such that I split each filename into its component parts (s, num, alpha1, alpha2), then sort, but I wasn't quite sure how to perform such a complicated sort. Using a key function seemed clunky, as this sort didn't seem to lend itself to a simple ordering.
Once tokenized, your data is perfectly orderable with a key function: just return the index of each value in the alpha1Order and alpha2Order lists. Replace the lists with dictionaries to make the lookup easier:
alpha1Order = {token: i for i, token in enumerate(alpha1Order)}
alpha2Order = {token: i for i, token in enumerate(alpha2Order)}
def keyfunction(filename):
    num, alpha1, alpha2 = tokenize(filename)
    return int(num), alpha1Order[alpha1], alpha2Order[alpha2]
This returns a tuple to sort on; Python uses the first value first, orders anything that has the same int(num) value by the second entry, and uses the third to break any ties on the first two entries.
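The answer assumes a tokenize helper exists. Here is one possible implementation (my sketch, hard-coding the token alternatives from the question and assuming filenames like 's12FizzAmet.ext'; filenames is your list):
import re

def tokenize(filename):
    m = re.match(r's(\d+)(Fizz|Buzz|Ipsum|Dolor|Lorem)(Sit|Amet|Test)\.\w+$',
                 filename)
    return m.group(1), m.group(2), m.group(3)

filenames.sort(key=keyfunction)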

Iterating over an array and dictionary and storing the values in another array

Say I have a dictionary of pattern types:
patternDict = {1:[0],5:[0,3]}
And I have an array:
A = [[1,3,4,5],[6,7,8,9]]
I also have two empty arrays to store the value of each pattern type:
pattern1=[]
pattern5=[]
I am iterating over each row in A and each pattern type in patternDict:
for row in A:
    for key, value in patternDict.iteritems():
        currentPattern = row[value] for value in patternDict[key]
        # append either pattern1 or pattern5 with currentPattern based on the key
And this is where I am having trouble. How do I append to either the pattern1 array or the pattern5 array based on the key in patternDict? The output should look like:
pattern1=[1,6]
pattern5=[1,5,6,9]
What's the best way to do this?
Use a dict instead of separate variables:
>>> p = {k: [x[y] for x in A for y in v] for k, v in patternDict.iteritems()}
>>> p[1]
[1, 6]
>>> p[5]
[1, 5, 6, 9]
Given the constraints you've specified, the solution is actually pretty simple.
for row in A:
    for key, value in patternDict.iteritems():
        currentPattern = [row[value] for value in patternDict[key]]
        if key == 1:
            pattern1.extend(currentPattern)
        elif key == 5:
            pattern5.extend(currentPattern)
But I suspect that your real problem is not so closely related to this simplified one. If you can provide a clearer description, someone can provide an appropriate generalization.
I'd strongly recommend using a dict too, as in hcwhsa's answer, and if you have just a few such variables, you could also use an if-else condition, as in Rob's answer.
However, if there's absolutely no way around using separate variables for the different lists instead of one dict, you could access those variables using the globals() function. This function returns a dictionary mapping the names of the variables and functions defined in the module's global namespace to their respective values, i.e., globals()["foo"] gets you the content of the global variable foo.
for row in A:
    for key, value in patternDict.iteritems():
        currentPattern = [row[value] for value in patternDict[key]]
        globals()["pattern%d" % key].extend(currentPattern)

print pattern1, pattern5
But be warned that this can lead to all sorts of nasty, hard to detect bugs and should generally be avoided!

Dictionary use instead of dynamic variable names in Python

I have a long text file containing truck configurations. In each line, some properties of a truck are listed as a string. Each property has its own fixed-width space in the string, such as:
2 characters = number of axles
2 characters = weight of the first axle
2 characters = weight of the second axle
...
2 characters = weight of the last axle
2 characters = length of the first axle spacing (spacing means distance between axles)
2 characters = length of the second axle spacing
...
2 characters = length of the last axle spacing
As an example:
031028331004
refers to:
number of axles = 3
first axle weight = 10
second axle weight = 28
third axle weight = 33
first spacing = 10
second spacing = 4
Now you have an idea of my file structure; here is my problem: I would like to group these trucks in separate lists, and name the lists in terms of axle spacings. Let's say I am using a boolean type of approach: if the spacing is less than 6, the boolean is 1; if it is greater than 6, the boolean is 0. To clarify, the possible outcomes for a three-axle truck become:
00 #Both spacings > 6
10 #First spacing < 6, second > 6
01 #First spacing > 6, second < 6
11 #Both spacings < 6
Now, as you see, there are not too many outcomes for a 3-axle truck. However, if I have a 12-axle truck, the number of "possible" combinations goes haywire. The thing is, in reality you would not see all "possible" combinations of axle spacings in a 12-axle truck. There are certain combinations (I don't know which ones; figuring that out is my aim) whose number is much smaller than the number of "possible" combinations.
I would like the code to create lists and fill them with the strings that define the properties I mentioned above, but only if such a combination exists. I thought maybe I should create lists with variable names such as:
truck_0300[]
truck_0301[]
truck_0310[]
truck_0311[]
on the fly. However, from what I read on SF and other sources, this is strongly discouraged. How would you do it using the dictionary concept? I understand that dictionaries are like two-dimensional arrays, with a key (in my case the keys would be something like truck_0300, truck_0301 etc.) and value pair (again in my case, the values would probably be lists that hold the actual strings that belong to the corresponding truck type). However, I could not figure out how to create that dictionary and populate it with variable keys and values.
Any insight would be welcome!
Thanks a bunch!
You are definitely correct that it is almost always a bad idea to try to create "dynamic variables" in a scope. Dictionaries are usually the answer for building up a collection of objects over time and referring back to them.
I don't fully understand your application and format, but in general to define and use your dictionary it would look like this:
trucks = {}
trucks['0300'] = ['a']
trucks['0300'].append('c')
trucks['0300'].extend(['c','d'])
aTruck = trucks['0300']
Now, since every one of these should be a list of your strings, you might just want to use a defaultdict and tell it to use a list as the default value for nonexistent keys:
from collections import defaultdict
trucks = defaultdict(list)
trucks['0300']
# []
Note that even though it was a brand new dict that contained no entries, the '0300' key still returned a new list. This means you don't have to check for the key. Just append:
trucks = defaultdict(list)
trucks['0300'].append('a')
A defaultdict is probably what you want, since you do not have to pre-define keys at all. It is there when you are ready for it.
Getting key for the max value
From your comments, here is an example of how to get the key with the max value in a dictionary. It is pretty easy: you just use max and define how it should determine the comparison key:
d = {'a':10, 'b':5, 'c':50}
print max(d.iteritems(), key=lambda (k,v): v)
# ('c', 50)
d['c'] = 1
print max(d.iteritems(), key=lambda (k,v): v)
# ('a', 10)
All you have to do is define how to produce a comparison key. In this case I just tell it to take the value. For really simple key functions like this, where you just pull an index or attribute from the object, you can make it more efficient with the operator module, so that the key function runs in C instead of as a Python lambda:
from operator import itemgetter
...
print max(d.iteritems(), key=itemgetter(1))
#('c', 50)
itemgetter(1) creates a callable that pulls the second item from each tuple that max passes to it.
Now assume each value is actually a list (similar to your structure). We will make it a list of numbers, and you want to find the key which has the list with the largest total:
d = {'a': range(1,5), 'b': range(2,4), 'c': range(5,7)}
print max(d.iteritems(), key=lambda (k,v): sum(v))
# ('c', [5, 6])
If the number of keys is more than 10,000, then this method is not viable. Otherwise define a dictionary d = {} and do a loop over your lines:
key = line[:4]
if key not in d:
    d[key] = []
d[key] += [somevalue]
I hope this helps.
Here's a complete solution from string to output:
from collections import namedtuple, defaultdict
# lightweight class
Truck = namedtuple('Truck', 'weights spacings')
def parse_truck(s):
# convert to array of numbers
numbers = [int(''.join(t)) for t in zip(s[::2], s[1::2])]
# check length
n = numbers[0]
assert n * 2 == len(numbers)
numbers = numbers[1:]
return Truck(numbers[:n], numbers[n:])
trucks = [
    parse_truck("031028331004"),
    ...
]
# dictionary where every key contains a list by default
trucks_by_spacing = defaultdict(list)

for truck in trucks:
    # (True, False) instead of '10'
    key = tuple(space > 6 for space in truck.spacings)
    trucks_by_spacing[key].append(truck)

print trucks_by_spacing
print trucks_by_spacing[True, False]
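For the sample line from the question, a quick check of the parser (my addition):
t = parse_truck("031028331004")
print t.weights   # [10, 28, 33]
print t.spacings  # [10, 4]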

data structure that can do a "select distinct X where W=w and Y=y and Z=z and ..." type lookup

I have a set of unique vectors (10k's worth), and I need to, for any chosen column, extract the set of values seen in that column, in rows where all the other columns have given values.
I'm hoping for a solution that is sub-linear (w.r.t. the item count) in time and at most linear (w.r.t. the total size of all the items) in space, preferably with sub-linear extra space over just storing the items.
Can I get that or better?
BTW: it's going to be accessed from Python and needs to be simple to program, or be part of an existing, commonly used library.
edit: the costs are for the lookup and do not include the time to build the structures. All the data that will ever be indexed is available before the first query is made.
It seems I'm doing a bad job of describing what I'm looking for, so here is a solution that gets close:
class Index:
    def __init__(self, stuff):  # don't care about this O() time
        self.all = set(stuff)
        self.index = {}
        for item in stuff:
            for i, v in enumerate(item):
                self.index.setdefault(i, set()).add(v)

    def Get(self, col, have):  # this O() matters
        ret = []
        t = list(have)  # make a copy
        for v in self.index[col]:
            t[col] = v
            if tuple(t) in self.all:
                ret.append(v)
        return ret
The problem is that this gives really bad (O(n)) worst-case performance.
Since you are asking for a SQL-like query, how about using a relational database? SQLite is part of the standard library, and can be used either on-disk or fully in memory.
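A minimal in-memory sketch of that suggestion (my illustration, assuming three integer columns):
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE vectors (x INTEGER, y INTEGER, z INTEGER)')
conn.executemany('INSERT INTO vectors VALUES (?, ?, ?)',
                 [(1, 2, 3), (1, 2, 4), (1, 5, 4)])
conn.execute('CREATE INDEX idx_yz ON vectors (y, z)')  # index the WHERE columns
for (x,) in conn.execute('SELECT DISTINCT x FROM vectors WHERE y = ? AND z = ?',
                         (2, 3)):
    print x  # prints 1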
If you have a Python set (no ordering) there is no way to select all relevant items without at least looking at all items -- so it's impossible for any solution to be "sub linear" (wrt the number of items) as you require.
If you have a list, instead of a set, then it can be ordered -- but that cannot be achieved in linear time in the general case (O(N log N) is provably the best you can do for a general-case sorting -- and building sorted indices would be similar -- unless there are constraints that let you use "bucket-sort-like" approaches). You can spread around the time it takes to keep indices over all insertions in the data structure -- but that won't reduce the total time needed to build such indices, just, as I said, spread them around.
Hash-based (not sorted) indices can be faster for your special case (where you only need to select by equality, not by "less than" &c) -- but the time to construct such indices is linear in the number of items: obviously you can't construct such an index without at least looking once at each item. Sublinear time requires some magic that lets you completely ignore certain items, and that can't happen without supporting "structure" (such as sortedness), which in turn requires time to achieve (and though it can be achieved "incrementally" ahead of time, such an approach doesn't reduce the total time required).
So, taken to the letter, your requirements appear overconstrained: neither Python, nor any other language, nor any database engine, etc., can possibly achieve them if they are interpreted literally exactly as you state them. If "incremental work done ahead of time" doesn't count (as breaching your demands of linearity and sublinearity), if you talk about expected/typical rather than worst-case behavior (and your items have friendly probability distributions), etc., then it might be possible to come close to achieving your very demanding requests.
For example, consider maintaining for each of the vectors' D dimensions a dictionary mapping the value an item has in that dimension, to a set of indices of such items; then, selecting the items that meet the D-1 requirements of equality for every dimension but the ith one can be done by set intersections. Does this meet your requirements? Not by taking the latter strictly to the letter, as I've explained above -- but maybe, depending on how much each requirement can perhaps be taken in a more "relaxed" sense.
BTW, I don't understand what a "group by" implies here since all the vectors in each group would be identical (since you said all dimensions but one are specified by equality), so it may be that you've skipped a COUNT(*) in your SQL-equivalent -- i.e., you need a count of how many such vectors have a given value in the i-th dimension. In that case, it would be achievable by the above approach.
Edit: as the OP has clarified somewhat in comments and in an edit to his Q, I can propose an approach in more detail:
import collections

class Searchable(object):

    def __init__(self, toindex=None):
        self.toindex = toindex
        self.data = []
        self.indices = None

    def makeindices(self):
        if self.indices is not None:
            return
        self.indices = dict((i, collections.defaultdict(set))
                            for i in self.toindex)

    def add(self, record):
        if self.toindex is None:
            self.toindex = range(len(record))
        self.makeindices()
        where = len(self.data)
        self.data.append(record)
        for i in self.toindex:
            self.indices[i][record[i]].add(where)

    def get(self, indices_and_values, indices_to_get):
        ok = set(range(len(self.data)))
        for i, v in indices_and_values:
            ok.intersection_update(self.indices[i][v])
        result = set()
        for rec in (self.data[i] for i in ok):
            t = tuple(rec[i] for i in indices_to_get)
            result.add(t)
        return result

def main():
    c = Searchable()
    for r in ((1, 2, 3), (1, 2, 4), (1, 5, 4)):
        c.add(r)
    print c.get([(0, 1), (1, 2)], [2])

main()
This prints
set([(3,), (4,)])
and of course could be easily specialized to return results in other formats, accept indices (to index and/or to return) in different ways, etc. I believe it meets the requirements as edited / clarified since the extra storage is, for each indexed dimension/value, a set of the indices at which said value occurs on that dimension, and the search time is one set intersection per indexed dimension plus a loop on the number of items to be returned.
I'm assuming that you've tried the dictionary and you need something more flexible. Basically, what you need to do is index the values of x, y and z:
def build_index(vectors):
    index = {'x': {}, 'y': {}, 'z': {}}
    for position, vector in enumerate(vectors):
        if vector.x in index['x']:
            index['x'][vector.x].append(position)
        else:
            index['x'][vector.x] = [position]
        if vector.y in index['y']:
            index['y'][vector.y].append(position)
        else:
            index['y'][vector.y] = [position]
        if vector.z in index['z']:
            index['z'][vector.z].append(position)
        else:
            index['z'][vector.z] = [position]
    return index
What you have in index is a lookup table. You can say, for example, select x,y,z from vectors where x=42 by doing this:
def query_by(vectors, index, property, value):
    results = []
    for i in index[property][value]:
        results.append(vectors[i])
    return results

vecs_x_42 = query_by(vectors, index, 'x', 42)
# now vecs_x_42 is a list of all vectors where x is 42
Now, to do a logical conjunction, say select x,y,z from vectors where x=42 and y=3, you can use Python's sets to accomplish this:
def query_by(vectors, index, criteria):
    sets = []
    for k, v in criteria.iteritems():
        if v not in index[k]:
            return []
        sets.append(set(index[k][v]))  # wrap in set() so set.intersection works
    results = []
    for i in set.intersection(*sets):
        results.append(vectors[i])
    return results

vecs_x_42_y_3 = query_by(vectors, index, {'x': 42, 'y': 3})
The intersection operation on sets produces only the values that appear in every set, so you iterate over just the positions that satisfy all the conditions.
Now for the last part of your question, to group by x:
def group_by(vectors, property):
    result = {}
    for v in vectors:
        value = getattr(v, property)
        if value in result:
            result[value].append(v)
        else:
            result[value] = [v]
    return result
So let's bring it all together:
vectors = [...] # your vectors, as objects such that v.x, v.y produces the x and y values
index = build_index(vectors)
my_vectors = group_by(query_by(vectors, index, {'y':42, 'z': 3}), 'x')
# now you have in my_vectors a dictionary of vectors grouped by x value, where y=42 and z=3
Update
I updated the code above and fixed a few obvious errors. It works now and does what it claims to do. On my laptop, a 2GHz Core 2 Duo with 4GB RAM, build_index takes less than 1s. Lookups are very quick, even when the dataset has 100k vectors. If I have some time, I'll do some formal comparisons against MySQL.
You can see the full code at this Codepad, if you time it or improve it, let me know.
Suppose you have a 'tuple' class with fields x, y, and z, and you have a bunch of such tuples saved in an enumerable variable named myTuples. Then:
A) Pre-population:
dct = {}
for tpl in myTuples:
    tmp = (tpl.y, tpl.z)
    if tmp in dct:
        dct[tmp].append(tpl.x)
    else:
        dct[tmp] = [tpl.x]
B) Query:
def findAll(y, z):
    tmp = (y, z)
    if tmp not in dct:
        return ()
    return [(x, y, z) for x in dct[tmp]]
I am sure there is a way to optimize the code for readability, save a few cycles, etc. But essentially you want to pre-populate a dict, using a 2-tuple as the key. If I had not seen the request for sub-linear lookups, I would not have thought of this :)
A) The pre-population is linear, sorry.
B) A query should be as slow as the number of items returned: most of the time sub-linear, except for weird edge cases.
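To make it concrete, here is a quick self-contained check (my example data and a hypothetical stand-in class; findAll is compressed using dict.get):
class Vec(object):  # hypothetical stand-in for the 'tuple' class
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

myTuples = [Vec(1, 2, 3), Vec(4, 2, 3), Vec(5, 6, 7)]

dct = {}
for tpl in myTuples:
    dct.setdefault((tpl.y, tpl.z), []).append(tpl.x)

def findAll(y, z):
    return [(x, y, z) for x in dct.get((y, z), [])]

print findAll(2, 3)  # [(1, 2, 3), (4, 2, 3)]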
So you have 3 coordinates and one value for the start and end of a vector (x, y, z)?
How is it possible to know the seven known values? Do many coordinate triples occur multiple times?
You must be calling this in a very tight loop to be so concerned about lookup time, considering the small size of the data (10K).
Could you give an example of real input for the class you posted?
