More consistent hashing in dictionary with python objects?

So, I saw Hashing a dictionary?, and I was trying to figure out a way to handle native Python objects better and produce stable results.
After looking at all the answers and comments, this is what I came up with, and everything seems to work properly. But am I maybe missing something that would make my hashing inconsistent (besides hash algorithm collisions)?
md5(repr(nested_dict).encode()).hexdigest()
tl;dr: it creates a string with the repr and then hashes the string.
Generated my testing nested dict with this:
nested_dict = {}
for i in range(100):
    for j in range(100):
        if not nested_dict.get(i, None):
            nested_dict[i] = {}
        nested_dict[i][j] = ''
I'd imagine repr should be able to support any Python object, since most objects have __repr__ support in general, but I'm still pretty new to Python programming. One caveat I've heard of: if you use from reprlib import repr instead of the builtin, it will truncate large sequences. So that's one potential downfall, but the builtin repr of the native list and set types doesn't do that.
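That reprlib caveat is easy to check; a minimal sketch showing why a truncated repr would break hashing of large structures:

```python
import reprlib

long_list = list(range(100))

full = repr(long_list)           # builtin repr: renders every element
short = reprlib.repr(long_list)  # reprlib: truncates long sequences with '...'

print(short)
assert len(short) < len(full)
assert short.endswith('...]')
```

Two long lists differing only past the truncation point would get the same reprlib string, and therefore the same hash.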
other notes:
I'm not able to use https://stackoverflow.com/a/5884123, because I'm going to have nested dictionaries.
I used python 3.9.7 when testing this out.
Not able to use https://stackoverflow.com/a/22003440, because at the time of hashing it still has IPv4 address objects as keys. (json.dumps didn't like that too much 😅)

Python dicts are insertion-ordered, and repr respects that. Your hexdigest of {"A":1,"B":2} will differ from that of {"B":2,"A":1}, even though ==-wise those dicts are the same.
Yours won't work out:
from hashlib import md5

def yourHash(d):
    return md5(repr(d).encode()).hexdigest()

a = {"A": 1, "B": 2}
b = {"B": 2, "A": 1}

print(repr(a))
print(repr(b))
print(a == b)
print(yourHash(a) == yourHash(b))
gives
{'A': 1, 'B': 2}   # repr(a)
{'B': 2, 'A': 1}   # repr(b)
True               # a == b
False              # your hashes compare equal?
I really do not see the "sense" in hashing dicts at all ... and those here are not even "nested".
You could use JSON to sort keys down to the last nested level and hash the json.dumps() of the whole structure - but still, I don't see the sense, and it will give you plenty of computational overhead:
import json

a = {"A": 1, "B": 2, "C": {2: 1, 3: 2}}
b = {"B": 2, "A": 1, "C": {3: 2, 2: 1}}

for di in (a, b):
    print(json.dumps(di, sort_keys=True))
gives
{"A": 1, "B": 2, "C": {"2": 1, "3": 2}} # thoroughly sorted
{"A": 1, "B": 2, "C": {"2": 1, "3": 2}} # recursively...
which is exactly what this answer in Hashing a dictionary? proposes ... why stray from it?
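For completeness, a minimal sketch of that approach rolled into one function (stable_hash is my name for it, not the linked answer's):

```python
import json
from hashlib import md5

def stable_hash(d):
    # sort_keys=True sorts keys at every nesting level, so two dicts that
    # compare equal serialize to the same string regardless of insertion order.
    return md5(json.dumps(d, sort_keys=True).encode()).hexdigest()

a = {"A": 1, "B": 2, "C": {2: 1, 3: 2}}
b = {"B": 2, "A": 1, "C": {3: 2, 2: 1}}
assert stable_hash(a) == stable_hash(b)  # insertion order no longer matters
```

Note that json.dumps coerces the int keys to strings here; keys json cannot serialize (like the IPv4 address objects mentioned above) would still need converting first.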

Related

Python set dictionary nested key with dot delineated string

If I have a dictionary that is nested, and I pass in a string like "key1.key2.key3" which would translate to:
myDict["key1"]["key2"]["key3"]
What would be an elegant way to have a method where I could pass in that string and it would translate to that key assignment? Something like
myDict.set_nested('key1.key2.key3', someValue)
Using only builtin stuff:
def set_nested(my_dict, key_string, value):
    """Given `foo`, 'key1.key2.key3', 'something', set foo['key1']['key2']['key3'] = 'something'."""
    # Start off pointing at the original dictionary that was passed in.
    here = my_dict
    # Turn the string of key names into a list of strings.
    keys = key_string.split(".")
    # For every key *before* the last one, we concentrate on navigating through the dictionary.
    for key in keys[:-1]:
        # Try to find here[key]. If it doesn't exist, create it with an empty dictionary. Then,
        # update our `here` pointer to refer to the thing we just found (or created).
        here = here.setdefault(key, {})
    # Finally, set the final key to the given value.
    here[keys[-1]] = value

myDict = {}
set_nested(myDict, "key1.key2.key3", "some_value")
assert myDict == {"key1": {"key2": {"key3": "some_value"}}}
This traverses myDict one key at a time, ensuring that each sub-key refers to a nested dictionary.
You could also solve this recursively, but then you risk RecursionError exceptions without any real benefit.
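For symmetry, a hypothetical getter along the same lines (get_nested is my name for it, not part of the answer above):

```python
def get_nested(my_dict, key_string, default=None):
    # Walk the dictionary one key at a time, returning `default`
    # as soon as any level is missing.
    here = my_dict
    for key in key_string.split("."):
        if not isinstance(here, dict) or key not in here:
            return default
        here = here[key]
    return here

d = {"key1": {"key2": {"key3": "some_value"}}}
assert get_nested(d, "key1.key2.key3") == "some_value"
assert get_nested(d, "key1.missing.key3") is None
```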
There are a number of existing modules that will already do this, or something very much like it. For example, the jmespath module will resolve jmespath expressions, so given:
>>> mydict={'key1': {'key2': {'key3': 'value'}}}
You can run:
>>> import jmespath
>>> jmespath.search('key1.key2.key3', mydict)
'value'
The jsonpointer module does something similar, although it likes / for a separator instead of ..
Given the number of pre-existing modules I would avoid trying to write your own code to do this.
EDIT: OP's clarification makes it clear that this answer isn't what he's looking for. I'm leaving it up here for people who find it by title.
I implemented a class that did this a while back... it should serve your purposes.
I achieved this by overriding the default getattr/setattr functions for an object.
Check it out! AndroxxTraxxon/cfgutils
This lets you do some code like the following...
from cfgutils import obj

a = obj({
    "b": 123,
    "c": "apple",
    "d": {
        "e": "nested dictionary value"
    }
})

print(a.d.e)
# nested dictionary value

How to remove a key/value pair in python dictionary?

Say I have a dictionary like this :
d = {'ben' : 10, 'kim' : 20, 'bob' : 9}
Is there a way to remove a pair like ('bob',9) from the dictionary?
I already know about d.pop('bob') but that will remove the pair even if the value was something other than 9.
Right now the only way I can think of is something like this :
if d.get('bob', None) == 9:
    d.pop('bob')
but is there an easier way? possibly not using if at all
pop also returns the value, so performance-wise (negligible as it may be) and readability-wise it might be better to use del.
Other than that I don't think there's something easier/better you can do.
from timeit import Timer

def _del():
    d = {'a': 1}
    del d['a']

def _pop():
    d = {'a': 1}
    d.pop('a')

print(min(Timer(_del).repeat(5000, 5000)))
# 0.0005624240000000613
print(min(Timer(_pop).repeat(5000, 5000)))
# 0.0007729860000003086
You want to perform two operations here:
1) Test the condition d['bob'] == 9.
2) If it is true, remove the key along with its value.
So we cannot omit the testing part, which requires the use of if, altogether. But we can certainly do it in one line:
d.pop('bob') if d.get('bob')==9 else None
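If this comes up often, you could wrap the test-then-remove in a tiny helper (pop_if is a hypothetical name, not a dict method):

```python
def pop_if(d, key, expected):
    # Remove `key` only when its current value equals `expected`.
    # Returns True if a pair was actually removed.
    if d.get(key) == expected:
        del d[key]
        return True
    return False

d = {'ben': 10, 'kim': 20, 'bob': 9}
assert pop_if(d, 'bob', 9)           # value matched: removed
assert d == {'ben': 10, 'kim': 20}
assert not pop_if(d, 'kim', 99)      # value mismatch: left untouched
```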

Is there something simple like a set for un-hashable objects?

For hashable objects inside a dict, I could easily pare down duplicate values stored in a dict using a set. For example:
a = {'test': 1, 'key': 1, 'other': 2}
b = set(a.values())
print(b)
Would display {1, 2}.
The problem I have is that I am using a dict to store the mapping between variable keys in __dict__ and the corresponding processing functions, which will be passed to an engine that orders and processes those functions; some of these functions may be fast, and some may be slower due to accessing an API. The catch is that each function may use multiple variables and therefore needs multiple mappings in the dict. I'm wondering if there is a way to do this, or if I am stuck writing my own solution?
Ended up building a callable class, since caching could speed things up for me:
from collections.abc import Callable

class RemoveDuplicates(Callable):
    input_cache = []
    output_cache = []

    def __call__(self, in_list):
        # Return the cached result if we've seen this exact list before.
        if in_list in self.input_cache:
            idx = self.input_cache.index(in_list)
            return self.output_cache[idx]
        else:
            self.input_cache.append(in_list)
            out_list = self._remove_duplicates(in_list)
            self.output_cache.append(out_list)
            return out_list

    def _remove_duplicates(self, src_list):
        result = []
        for item in src_list:
            if item not in result:
                result.append(item)
        return result
If the objects can be ordered, you can use itertools.groupby to eliminate the duplicates:
>>> import itertools
>>> a = {'test': 1, 'key': 1, 'other': 2}
>>> b = [k for k, it in itertools.groupby(sorted(a.values()))]
>>> print(b)
[1, 2]
Is there something simple like a set for un-hashable objects
Not in the standard library, but you can look beyond it and search for a BTree implementation of a dictionary. I googled and found a few hits, of which the first (BTrees) seems promising and interesting.
Quoting from the wiki:
The BTree-based data structures differ from Python dicts in several
fundamental ways. One of the most important is that while dicts
require that keys support hash codes and equality comparison, the
BTree-based structures don’t use hash codes and require a total
ordering on keys.
Of course, it's a trivial fact that a set can be implemented as a dictionary where the value is unused.
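Building on that trivial fact, here is a minimal sketch that emulates a set for un-hashable values by keying a dict on a hashable surrogate (using repr, which assumes distinct values have distinct, stable reprs):

```python
def unique_unhashable(items):
    # A dict keyed on repr(item) stands in for a set: the first item
    # with a given repr wins, later duplicates are ignored.
    seen = {}
    for item in items:
        seen.setdefault(repr(item), item)
    return list(seen.values())

values = [{1}, {1}, {2}]  # sets are un-hashable, so set(values) would fail
assert unique_unhashable(values) == [{1}, {2}]
```

The repr-as-key assumption is the weak point: two logically different objects with identical reprs would collapse into one.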
You could (indirectly) use the bisect module to create a sorted collection of your values, which would greatly speed up the insertion of new values and value-membership testing in general — which together can be used to ensure that only unique values get put into it.
In the code below, I've used un-hashable set values for the sake of illustration.
# see http://code.activestate.com/recipes/577197-sortedcollection
from sortedcollection import SortedCollection

a = {'test': {1}, 'key': {1}, 'other': {2}}
sc = SortedCollection()
for value in a.values():
    if value not in sc:
        sc.insert(value)
print(list(sc))  # --> [{1}, {2}]

Declaring a dictionary using another value within the same dictionary?

I'm using python trying to basically do this:
myDict = {"key1" : 1, "key2" : myDict["key1"]+1}
...if you catch my drift. Possible without using multiple statements?
EDIT: Also, if anyone could tell me a better way to state this question more clearly that would be cool. I don't really know how to word what I'm asking.
EDIT2: Seems to be some confusion - yes, it's more complex than just "key2":1+1, and what I'm doing is mostly for code readability as it will get messy if I have to 2-line it.
Here's a bit more accurate code sample of what I'm trying to do...though it's still not nearly as complex as it gets :P
lvls={easy: {mapsize:(10,10), winPos:(mapsize[0]-1,mapsize[1]-1)},
medium:{mapsize:(15,15), winPos:(mapsize[0]-RANDOMINT,mapsize[1]-1)},
hard: {mapsize:(20,20), winPos:(mapsize[0]-RANDOMINT,mapsize[1]-RANDOMINT)}
}
No, this isn't possible in general without using multiple statements.
In this particular case, you could get around it in a hacky way. For example:
import itertools
myDict = dict(zip(("key1", "key2"), itertools.count(1)))
However, that will only work when you want to specify a single start value and everything else will be sequential, and presumably that's not general enough for what you want.
If you're doing this kind of thing a lot, you could wrap those multiple statements up in some suitably-general function, so that each particular instance is just a single expression. For example:
def make_funky_dict(*args):
    myDict = {}
    # Pair up the flat argument list: (key1, value1, key2, value2, ...)
    for key, value in zip(*[iter(args)] * 2):
        # If the "value" names an existing key, use that key's value plus one.
        if value in myDict:
            value = myDict[value] + 1
        myDict[key] = value
    return myDict

myDict = make_funky_dict("key1", 1, "key2", "key1")
But really, there's no good reason not to use multiple statements here, and it will probably be a lot clearer, so… I'd just do it that way.
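For comparison, the plain multiple-statement version that last sentence recommends:

```python
# Two statements, but the intent is immediately obvious:
myDict = {"key1": 1}
myDict["key2"] = myDict["key1"] + 1

assert myDict == {"key1": 1, "key2": 2}
```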
It's not possible without using multiple statements, at least not using some of the methods from your problem statement. But here's something, using dict comprehension:
>>> myDict = {"key" + str(key): value for (key, value) in enumerate(range(7))}
>>> myDict
{'key0': 0,
'key1': 1,
'key2': 2,
'key3': 3,
'key4': 4,
'key5': 5,
'key6': 6}
Of course, in older Python versions those keys weren't guaranteed to come out in order (modern dicts preserve insertion order), but they're all there.
The only variable you are trying to use is an integer. How about a nice function:
def makelevel(size, jitterfunc=lambda: 0):
    return {'mapsize': (size, size),
            'winPos': (size - 1 + jitterfunc(), size - 1 + jitterfunc())}

lvls = {hardness: makelevel(size) for hardness, size in [('easy', 10), ('medium', 15), ('hard', 20)]}
Of course, this function looks a bit like a constructor. Maybe you should be using objects?
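A sketch of what that object-based version might look like (the Level class and its attribute names are illustrative, not from the question):

```python
class Level:
    def __init__(self, size, jitter=0):
        self.mapsize = (size, size)
        # `jitter` stands in for the question's RANDOMINT offset.
        self.winPos = (size - 1 - jitter, size - 1 - jitter)

lvls = {name: Level(size)
        for name, size in [('easy', 10), ('medium', 15), ('hard', 20)]}

assert lvls['easy'].winPos == (9, 9)
assert lvls['hard'].mapsize == (20, 20)
```

The derived value (winPos) is computed inside the constructor, so each level's attributes can freely reference one another without any self-referencing dict literal.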
If you want a dict that will allow you to to have values that are evaluated on demand you can do something like this:
class myDictType(dict):
    def __getitem__(self, key):
        retval = dict.__getitem__(self, key)
        # If the stored value is a lambda, call it and return the result.
        if type(retval) == type(lambda: 1):
            return retval()
        return retval

myDict = myDictType()
myDict['bar'] = lambda: myDict['foo'] + 1
myDict['foo'] = 1
print(myDict['bar'])  # This'll print a 2
myDict['foo'] = 2
print(myDict['bar'])  # This'll print a 3
This overrides __getitem__ in the dictionary to return whatever is stored in it (like a normal dictionary,) unless what is stored there is a lambda. If the value is a lambda, it instead evaluates it and returns the result.

Joining large dictionaries by identical keys

I have around 10 huge files that contain python dictionaries like so:
dict1:
{
    'PRO-HIS-MET': {
        'A': ([1,2,3], [4,5,6], [7,8,9]),
        'B': ([5,2], [6], [8,9]),
        'C': ([3], [4], [7,8])},
    'TRP-MET-GLN': {
        'F': ([-5,-4,1123], [-7,-11,2], [-636,-405])}
}
dict2:
{
    'PRO-HIS-MET': {
        'J': ([-657], [7,-20,3], [-8,-85,15])},
    'TRP-MET-GLN': {
        'K': ([1,2,3], [4,50,6], [7,80,9]),
        'L': ([5,20], [60,80], [8,9])}
}
Basically they are all dictionaries of dictionaries. Each file is around 1 GB in size (the above is just an example of the data). Anyway, what I would like to do is join the 10 dictionaries together:
final:
{
    'PRO-HIS-MET': {
        'A': ([1,2,3], [4,5,6], [7,8,9]),
        'B': ([5,2], [6], [8,9]),
        'C': ([3], [4], [7,8]),
        'J': ([-657], [7,-20,3], [-8,-85,15])},
    'TRP-MET-GLN': {
        'F': ([-5,-4,1123], [-7,-11,2], [-636,-405]),
        'K': ([1,2,3], [4,50,6], [7,80,9]),
        'L': ([5,20], [60,80], [8,9])}
}
I have tried the following code on small files and it works fine:
import csv
import collections

d1 = {}
d2 = {}
final = collections.defaultdict(dict)

for key, val in csv.reader(open('filehere.txt')):
    d1[key] = eval(val)
for key, val in csv.reader(open('filehere2.txt')):
    d2[key] = eval(val)

for key in d1:
    final[key].update(d1[key])
for key in d2:
    final[key].update(d2[key])

out = csv.writer(open('out.txt', 'w'))
for k, v in final.items():
    out.writerow([k, v])
However if I try that on my 1 GB files I quickly run out of memory by keeping d1 and d2 as well as the final dictionary in memory.
I have a couple ideas:
Is there a way where I can just load the keys from the segmented dictionaries, compare those, and if the same ones are found in multiple dictionaries just combine the values?
Instead of merging the dictionaries into one huge file (which will probably give me memory headaches in the future), how can I make many separate files that contain all the values for one key after merging data? For example, for the above data I would just have:
pro-his-met.txt:
'PRO-HIS-MET': {
    'A': ([1,2,3], [4,5,6], [7,8,9]),
    'B': ([5,2], [6], [8,9]),
    'C': ([3], [4], [7,8]),
    'J': ([-657], [7,-20,3], [-8,-85,15])}
trp-met-gln.txt:
'TRP-MET-GLN': {
    'F': ([-5,-4,1123], [-7,-11,2], [-636,-405]),
    'K': ([1,2,3], [4,50,6], [7,80,9]),
    'L': ([5,20], [60,80], [8,9])}
I don't have too much programming experience as a biologist (you may have guessed the above data represents a bioinformatics problem) so any help would be much appreciated!
The shelve module is a very easy-to-use database for Python. It's nowhere near as powerful as a real database (for that, see @Voo's answer), but it will do the trick for manipulating large dictionaries.
First, create shelves from your dictionaries:
import csv
import shelve

s = shelve.open('filehere.db', flag='n', protocol=-1, writeback=False)
for key, val in csv.reader(open('filehere.txt')):
    s[key] = eval(val)
s.close()
Now that you've shelved everything neatly, you can operate on the dictionaries efficiently:
import shelve
import itertools

s = shelve.open('final.db', flag='c', protocol=-1, writeback=False)
s1 = shelve.open('file1.db', flag='r')
s2 = shelve.open('file2.db', flag='r')

for key, val in itertools.chain(s1.items(), s2.items()):
    d = s.get(key, {})
    d.update(val)
    s[key] = d  # force write

s.close()
Personally, this sounds like the archetype of a problem databases were invented to solve. Yes, you can solve it yourself by keeping files around and, as a performance optimization, memory-mapping them and letting the OS handle the swapping, etc., but this is really complicated and hard to do well.
Why go through all that effort if you can let a DB - into which millions of man-hours have been put - handle it? That will be more efficient, and as an added benefit, much easier to query for information.
I've seen Oracle DBs store much more than 10 GB of data without any problems, and I'm sure Postgres will handle this just as well. The nice thing is that if you use an ORM, you can abstract those nitty-gritty details away and worry about them later if it becomes necessary.
Also while bioinformatics isn't my speciality I'm pretty sure there are specific solutions tailored to bioinformatics around - maybe one of them would be the perfect fit?
This concept should work.
I would consider doing multiple passes over the files, where each pass handles a portion of the keys, and saving that result.
E.g. in one pass you could collect the unique first characters of all the keys, and then process each of those partitions into a new output file. If it is simple alphabetic data, the logical choice would be a loop over each letter of the alphabet.
E.g. in the "p" pass you would process 'PRO-HIS-MET'.
Then you would combine all the results from all the files at the end.
If you were a developer, the database idea in the previous answer is probably the best approach if you can handle that kind of interaction. That idea entails creating a two-level structure where you insert and update the records, then query the result with an SQL statement.
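A minimal sketch of that multi-pass partitioning idea (merge_partition and the in-memory "files" are illustrative; eval assumes you trust the data, as the question's own code does):

```python
import csv
import io
from collections import defaultdict

def merge_partition(files, prefix):
    # Merge only the keys whose name starts with `prefix`, so each pass
    # holds just one partition of the data in memory at a time.
    merged = defaultdict(dict)
    for f in files:
        for key, val in csv.reader(f):
            if key.lower().startswith(prefix):
                merged[key].update(eval(val))
    return merged

# In-memory stand-ins for the 1 GB csv files:
f1 = io.StringIO("PRO-HIS-MET,\"{'A': 1}\"\nTRP-MET-GLN,\"{'F': 2}\"\n")
f2 = io.StringIO("PRO-HIS-MET,\"{'J': 3}\"\n")

result = merge_partition([f1, f2], 'p')  # the "p" pass
assert dict(result) == {'PRO-HIS-MET': {'A': 1, 'J': 3}}
```

Each pass's result can then be written to its own per-key output file, as suggested in the question.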
