Python: "Hash" a nested List - python

I have a dictionary master which contains around 50000 to 100000 unique lists, which can be flat lists or lists of lists. Every list is assigned to a specific ID (the key of the dictionary):
master = {12: [1, 2, 4], 21: [[1, 2, 3], [5, 6, 7, 9]], ...} # len(master) is several ten thousands
Now I have a few hundred dictionaries which again contain around 10000 lists each (same as above: they can be nested). Example of one of those dicts:
a = {'key1': [6, 9, 3, 1], 'key2': [[1, 2, 3], [5, 6, 7, 9]], 'key3': [7], ...}
I want to cross-reference every one of these dictionaries against my master, i.e. instead of storing every list inside a, I only want to store the master ID whenever the list is present in the master.
=> a = {'key1': [6, 9, 3, 1], 'key2': 21, 'key3': [7], ...}
I could do that by looping over all values in a and all values of master and trying to match the lists (by sorting them), but that would take ages.
Now I'm wondering how would you solve this?
I thought of "hashing" every list in master to a unique string and store it as a key of a new master_inverse reference dict, e.g.:
master_inverse = {hash([1,2,4]): 12, hash([[1, 2, 3], [5, 6, 7, 9]]): 21}
Then it would be very simple to look it up later on:
for k, v in a.items():
    h = hash(v)
    if h in master_inverse:
        a[k] = master_inverse[h]
Do you have a better idea?
What could such a hash look like? Is there already a built-in method which is fast and unique?
EDIT:
Dunno why I didn't come up instantly with this approach:
What do you think of using an MD5 hash of either the pickle or the repr() of each single list?
Something like this:
import hashlib

def myHash(obj):
    # repr() gives a deterministic string for the (nested) list; encode it
    # to bytes so hashlib accepts it on Python 3 as well
    return hashlib.md5(repr(obj).encode()).hexdigest()

master_inverse = {myHash(v): k for k, v in master.items()}

for k, v in a.items():
    h = myHash(v)
    if h in master_inverse:
        a[k] = master_inverse[h]
EDIT2:
I benchmarked it: checking one of the hundred dicts (in my example a, which contains around 20k values in my benchmark) against my master_inverse is very fast, which I didn't expect: 0.08 s. So I guess I can live with that well enough.

The MD5 approach will work, but you need to be cautious about the very small possibility of hash collisions (see "How many random elements before MD5 produces collisions?" for more details) when using an MD5 hash.
If you need to be absolutely sure that the program works correctly, you can convert the lists to tuples and create a dictionary whose keys are those tuples and whose values are the keys from your master dictionary (the same as master_inverse, but keyed by the full values instead of MD5 hashes).
More info on how to use tuples as dictionary keys: http://www.developer.com/lang/other/article.php/630941/Learn-to-Program-using-Python-Using-Tuples-as-Keys.htm.
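Since the lists can themselves contain lists, the conversion has to be recursive so that every level becomes hashable. A minimal sketch of that approach, using the sample data from the question (the helper name to_tuple is made up here):
def to_tuple(value):
    # Recursively turn a (possibly nested) list into a tuple so it can
    # be used as a dictionary key
    if isinstance(value, list):
        return tuple(to_tuple(item) for item in value)
    return value

master = {12: [1, 2, 4], 21: [[1, 2, 3], [5, 6, 7, 9]]}
master_inverse = {to_tuple(v): k for k, v in master.items()}

a = {'key1': [6, 9, 3, 1], 'key2': [[1, 2, 3], [5, 6, 7, 9]], 'key3': [7]}
for k, v in a.items():
    t = to_tuple(v)
    if t in master_inverse:
        a[k] = master_inverse[t]

print(a)  # {'key1': [6, 9, 3, 1], 'key2': 21, 'key3': [7]}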

Related

Alternative way to use setdefault() using dictionary comprehension?

I have a nested dictionary that was created from a nested list, where the first item of each inner list becomes the outer key and the outer value is a dictionary built from the next two items. The following code works great using the two setdefault() calls because it just adds to the nested dictionary when it sees a duplicate outer key. I was just wondering how you could do the same logic using a dictionary comprehension?
dict1 = {}
list1 = [[1, 2, 6],
         [1, 3, 7],
         [2, 5, 8],
         [2, 8, 9]]
for i in list1:
    dict1.setdefault(i[0], {}).setdefault(i[1], i[2])
OUTPUT:
{1: {2: 6, 3: 7}, 2: {5: 8, 8: 9}}
Use the loop because it's very readable and efficient. Not all code has to be a one-liner.
Having said that, it's possible. It abuses the syntax, is extremely unreadable and inefficient, and is generally just plain bad code (don't do it!):
out = {k: next(gg for gg in [{}] if all(gg.setdefault(a, b) for a,b in v)) for k, v in next(g for g in [{}] if not any(g.setdefault(key, []).append(v) for key, *v in list1)).items()}
Output:
{1: {2: 6, 3: 7}, 2: {5: 8, 8: 9}}
I actually tried to achieve that result and failed.
The comprehension overwrites the new entries.
After giving this idea a look, I found a similar post in which it is stated that it is not possible: https://stackoverflow.com/questions/11276473/append-to-a-dict-of-lists-with-a-dict-comprehension
I believe Amber's answer there best summarizes the conclusion of my failed attempt with dict comprehensions:
No - dict comprehensions are designed to generate non-overlapping keys with each iteration; they don't support aggregation. For this particular use case, a loop is the proper way to accomplish the task efficiently (in linear time)
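If the goal is mainly to avoid the chained setdefault() calls rather than to force everything into a comprehension, a loop over collections.defaultdict also reads well. A small sketch using the list1 from the question:
from collections import defaultdict

list1 = [[1, 2, 6],
         [1, 3, 7],
         [2, 5, 8],
         [2, 8, 9]]

dict1 = defaultdict(dict)
for outer, inner, value in list1:
    # keep the first value seen for each (outer, inner) pair, matching
    # the behaviour of the chained setdefault() version
    dict1[outer].setdefault(inner, value)

print(dict(dict1))  # {1: {2: 6, 3: 7}, 2: {5: 8, 8: 9}}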

Populate Python dictionaries with pre-assigned keys and values of a particular length

Say for example I have the following dictionary in Python:
memory_map = {'data': [1,2,3], 'theta': [4,5,6,7]}
I would like to create another random dictionary that is identical to memory_map (same keys, and lists of the same lengths as values), except that the list elements are populated with random values using the np.random.default_rng(seed).uniform(low, high, size) function.
An example would be: random_dict = {'data': [5,3,1], 'theta': [7,3,4,8]}.
Moreover, if the names of the keys or the lengths of the lists in values change, this should automatically be reflected in the random_dict that is created.
If I add a new key or remove a key from the memory_map this should also be reflected in the random_dict.
So far, I have random_dict = {item: [] for item in list(memory_map.keys())} but am unsure of how to populate the empty lists with random values of the same length.
Thanks.
Looks like you want something like this.
import random
import itertools

def shuffle_map(input_map):
    # Grab a list of all values
    all_values = list(itertools.chain(*input_map.values()))
    random.shuffle(all_values)  # Shuffle it
    new_map = {}  # Initialize a new map
    i = 0  # Keep track of how many items we've consumed
    for key, value in input_map.items():
        n = len(value)  # How many values per this key
        new_map[key] = all_values[i : i + n]  # Assign a slice
        i += n
    return new_map

memory_map = {"data": [1, 2, 3], "theta": [4, 5, 6, 7]}
print(shuffle_map(memory_map))
This prints out, for example, on consecutive runs:
{'data': [1, 5, 7], 'theta': [2, 4, 3, 6]}
{'data': [6, 7, 1], 'theta': [5, 4, 3, 2]}
{'data': [5, 2, 3], 'theta': [7, 6, 4, 1]}
For the random lists you should take a look at the random module in the standard library, specifically at random.sample or random.choices, depending on your needs.
For the second request, automatically updating the dictionary based on changes to the first, the easiest way is to create a wrapper around the first dict that inherits from the collections.UserDict class.
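As a rough sketch of the rng-based version the question asks about (the function name make_random_dict and the low/high/seed defaults are placeholders, not part of the question):
import numpy as np

def make_random_dict(memory_map, low=0.0, high=10.0, seed=None):
    # Same keys and list lengths as memory_map, values drawn uniformly
    # from [low, high)
    rng = np.random.default_rng(seed)
    return {key: rng.uniform(low, high, len(values)).tolist()
            for key, values in memory_map.items()}

memory_map = {'data': [1, 2, 3], 'theta': [4, 5, 6, 7]}
print(make_random_dict(memory_map, seed=42))
# {'data': [<3 random floats>], 'theta': [<4 random floats>]}
Calling make_random_dict again after memory_map gains or loses keys keeps the two in sync; a collections.UserDict wrapper, as suggested above, is one way to trigger that automatically.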

Quickest way to merge dictionaries based on key match

I have two dictionaries:
dic_1={'1234567890': 1, '1234567891': 2, '1234567880': 3, '1234567881': 4}
dic_2={'1234567890': 5, '1234567891': 6}
Now I want to merge them based on key values such that the merged dictionary looks like the following:
merged_dic = {'1234567890': 1, '1234567891': 2, '1234567880': 3, '1234567881': 4}
We only want to keep unique keys and only one distinct value associated with them. What's the best way to do that?
This should be what you need. It iterates through all dictionaries adding key/values only if the key is not already in the merged dictionary.
from itertools import chain

merged_dic = {}
for k, v in chain(dic_1.items(), dic_2.items()):
    if k not in merged_dic:
        merged_dic[k] = v

print(merged_dic)
# {'1234567890': 1, '1234567891': 2, '1234567880': 3, '1234567881': 4}
If, for example, you were wanting to keep all values for a key you could use:
from collections import defaultdict
from itertools import chain

merged_dic = defaultdict(list)
for k, v in chain(dic_1.items(), dic_2.items()):
    merged_dic[k].append(v)

print(dict(merged_dic))
# {'1234567890': [1, 5], '1234567891': [2, 6], '1234567880': [3], '1234567881': [4]}
Using chain() can allow you to iterate over many dictionaries. In the question you showed 2 dictionaries, but if you had 4 you could easily merge them all. E.g.
for k, v in chain(dic_1.items(), dic_2.items(), dic_3.items(), dic_4.items()):
All you're really trying to do is update dic_2 with any values in dic_1, so you can just do:
merged_dic = {**dic_2,**dic_1}
This will merge the two dictionaries, taking all the values from dic_2, overriding any shared keys with the values that exist in dic_1, and then adding any keys that only exist in dic_1.
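On Python 3.9 and later the same merge can also be written with the dict union operator, which follows the same rule (the right-hand operand wins on shared keys):
merged_dic = dic_2 | dic_1  # Python 3.9+, equivalent to {**dic_2, **dic_1}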
The sample data doesn't exactly illustrate the question. If dic_2 has a key in common with dic_1, then the item from dic_1 is retained; if a new item is found in dic_2, it is put into the merged dictionary.
import copy
dic_1={'1234567890': 1, '1234567891': 2, '1234567880': 3, '1234567881': 4}
dic_2={'1234567890': 5, '8234567890': 6}
merged_d = copy.copy(dic_1)
diff = set(dic_2)-set(dic_1)
merged_d.update({k: dic_2[k] for k in diff})
print(merged_d)
Result:
{'1234567890': 1, '1234567891': 2, '1234567880': 3, '1234567881': 4, '8234567890': 6}
If you want the first dict to override the keys in the second dict then:
dic_2.update(dic_1)

Get dictionary mapping values to reference ID

I have a list of numpy arrays like the following:
list_list = [np.array([53, 5, 2, 5, 5, 2, 1, 5, 9]), np.array([6, 4, 1, 2, 53, 23, 1, 4])]
and a list of IDs for each array above:
ID = [6, 2]
How can I get a dictionary that for each unique value in list_list, I get a list of the IDs which contain it?
For example, for this very simple example, I want something like:
{53: [6, 2], 5: [6], 2: [6, 2], 1: [6, 2], etc}
My actual list_list is over 1000 lists long, with each numpy array containing around 10 million values, so efficiency in the solution is key.
I know that dict(zip(ID, list_list)) will give me a dictionary corresponding an ID with all of its values, but it won't give me a value corresponding to IDs, which is what I want.
Thanks!
The best way to approach a problem like this is to break it into smaller steps. Describe these in a combination of English and pseudo-python as seems appropriate. You seem to have the right idea to get started with zip(ID, list_list). This creates the association between the two lists as we discussed in the comments.
So what next? Well, we want to build a dictionary with the values in list_list as keys. To do so, we need to iterate over the list returned by zip():
for id, list in zip(ID, list_list):
    pass
And then we need to iterate over the elements of list to determine the keys of the dictionary:
for id, list in zip(ID, list_list):
    for x in list:
        pass
Now we need an empty dictionary to add things to:
d = {}
for id, list in zip(ID, list_list):
    for x in list:
        pass
Next, we need to get a list for the dictionary if it exists. If it doesn't exist, we can use an empty list instead. Then append the id to the list and put it into the dictionary:
d = {}
for id, list in zip(ID, list_list):
    for x in list:
        l = d.get(x, [])
        l.append(id)
        d[x] = l
Notice how I describe in words what to do at each step and then translate it into Python. Breaking a problem into small steps like this is a key part of developing your skills as a programmer.
We iterate over zip(ID, list_list) and keep only the unique elements of each array by creating a set from it.
Then we iterate through this set: if an element is not already present in the dictionary we add it with the current id, and if it is already present we append the id.
import numpy as np

list_list = [np.array([53, 5, 2, 5, 5, 2, 1, 5, 9]), np.array([6, 4, 1, 2, 53, 23, 1, 4])]
ID = [6, 2]

dic = {}
for id, lis in zip(ID, list_list):
    lis = set(lis)
    for ele in lis:
        if ele not in dic:
            dic[ele] = [id]
        else:
            dic[ele].append(id)
print(dic)
{1: [6, 2], 2: [6, 2], 5: [6], 9: [6], 53: [6, 2], 4: [2], 6: [2], 23: [2]}
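Since the real arrays contain around 10 million values each, it may pay off to deduplicate inside NumPy before dropping to a Python-level loop. A rough, unbenchmarked sketch along those lines (the int() call just keeps the dictionary keys as plain Python ints):
import numpy as np

list_list = [np.array([53, 5, 2, 5, 5, 2, 1, 5, 9]), np.array([6, 4, 1, 2, 53, 23, 1, 4])]
ID = [6, 2]

dic = {}
for id_, arr in zip(ID, list_list):
    for ele in np.unique(arr):  # unique values of the array, computed in C
        dic.setdefault(int(ele), []).append(id_)
print(dic)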

Flatten a dictionary of dictionaries (2 levels deep) of lists

I'm trying to wrap my brain around this but it's not flexible enough.
In my Python script I have a dictionary of dictionaries of lists. (Actually it gets a little deeper but that level is not involved in this question.) I want to flatten all this into one long list, throwing away all the dictionary keys.
Thus I want to transform
{1: {'a': [1, 2, 3], 'b': [0]},
 2: {'c': [4, 5, 1], 'd': [3, 8]}}
to
[1, 2, 3, 0, 4, 5, 1, 3, 8]
I could probably set up a map-reduce to iterate over items of the outer dictionary to build a sublist from each subdictionary and then concatenate all the sublists together.
But that seems inefficient for large data sets, because of the intermediate data structures (sublists) that will get thrown away. Is there a way to do it in one pass?
Barring that, I would be happy to accept a two-level implementation that works... my map-reduce is rusty!
Update:
For those who are interested, below is the code I ended up using.
Note that although I asked above for a list as output, what I really needed was a sorted list; i.e. the output of the flattening could be any iterable that can be sorted.
def genSessions(d):
    """Given the ipDict, return an iterator that provides all the sessions,
    one by one, converted to tuples."""
    for uaDict in d.itervalues():
        for sessions in uaDict.itervalues():
            for session in sessions:
                yield tuple(session)
...
# Flatten dict of dicts of lists of sessions into a list of sessions.
# Sort that list by start time
sessionsByStartTime = sorted(genSessions(ipDict), key=operator.itemgetter(0))
# Then make another copy sorted by end time.
sessionsByEndTime = sorted(sessionsByStartTime, key=operator.itemgetter(1))
Thanks again to all who helped.
[Update: replaced nthGetter() with operator.itemgetter(), thanks to @intuited.]
I hope you realize that any order you see in a dict is accidental -- it's there only because, when shown on screen, some order has to be picked, but there's absolutely no guarantee.
Net of ordering issues among the various sublists getting catenated,
[x for d in thedict.itervalues()
   for alist in d.itervalues()
   for x in alist]
does what you want without any inefficiency nor intermediate lists.
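On Python 3, where itervalues() no longer exists, the same comprehension can be written with values(); a minimal sketch (the name flat_values is made up here):
flat_values = [x for d in thedict.values()
                 for alist in d.values()
                 for x in alist]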
edit: re-read the original question and reworked answer to assume that all non-dictionaries are lists to be flattened.
In cases where you're not sure how far down the dictionaries go, you would want to use a recursive function. @Arrieta has already posted a function that recursively builds a list of non-dictionary values.
This one is a generator that yields successive non-dictionary values in the dictionary tree:
def flatten(d):
    """Recursively flatten dictionary values in `d`.

    >>> hat = {'cat': ['images/cat-in-the-hat.png'],
    ...        'fish': {'colours': {'red': [0xFF0000], 'blue': [0x0000FF]},
    ...                 'numbers': {'one': [1], 'two': [2]}},
    ...        'food': {'eggs': {'green': [0x00FF00]},
    ...                 'ham': ['lean', 'medium', 'fat']}}
    >>> set_of_values = set(flatten(hat))
    >>> sorted(set_of_values)
    [1, 2, 255, 65280, 16711680, 'fat', 'images/cat-in-the-hat.png', 'lean', 'medium']
    """
    try:
        for v in d.itervalues():
            for nested_v in flatten(v):
                yield nested_v
    except AttributeError:
        for list_v in d:
            yield list_v
The doctest passes the resulting iterator to the set function. This is likely to be what you want, since, as Mr. Martelli points out, there's no intrinsic order to the values of a dictionary, and therefore no reason to keep track of the order in which they were found.
You may want to keep track of the number of occurrences of each value; this information will be lost if you pass the iterator to set. If you want to track that, just pass the result of flatten(hat) to some other function instead of set. Under Python 2.7, that other function could be collections.Counter. For compatibility with less-evolved pythons, you can write your own function or (with some loss of efficiency) combine sorted with itertools.groupby.
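For example, a minimal sketch of the counting variant, assuming the flatten() generator and the hat dictionary from the doctest above:
from collections import Counter

value_counts = Counter(flatten(hat))  # occurrences of each leaf value
print(value_counts['lean'])           # 1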
A recursive function may work:
def flat(d, out=[]):
    for val in d.values():
        if isinstance(val, dict):
            flat(val, out)  # recurse into the nested dict
        else:
            out += val
If you try it with:
>>> d = {1: {'a': [1, 2, 3], 'b': [0]}, 2: {'c': [4, 5, 6], 'd': [3, 8]}}
>>> out = []
>>> flat(d, out)
>>> print out
[1, 2, 3, 0, 4, 5, 6, 3, 8]
Notice that dictionaries have no order, so the list is in random order.
You can also return out (at the end of the loop), so that you don't have to call the function with a list argument.
def flat(d, out=None):
    if out is None:  # avoid sharing a mutable default list across calls
        out = []
    for val in d.values():
        if isinstance(val, dict):
            flat(val, out)  # recurse into the nested dict
        else:
            out += val
    return out
call as:
my_list = flat(d)
