Combine python dictionaries that share values and keys - python

I am doing some entity matching based on string edit distance and my results are a dictionary with keys (query string) and values [list of similar strings] based on some scoring criteria.
for example:
results = {
'ben' : ['benj', 'benjamin', 'benyamin'],
'benj': ['ben', 'beny', 'benjamin'],
'benjamin': ['benyamin'],
'benyamin': ['benjamin'],
'carl': ['karl'],
'karl': ['carl'],
}
Each value also has a corresponding dictionary item, for which it is the key (e.g. 'carl' and 'karl').
I need to combine the elements that have shared values, choosing one value as the new key (let's say the longest string). In the above example I would hope to get:
results = {
'benjamin': ['ben', 'benj', 'benyamin', 'beny', 'benjamin', 'benyamin'],
'carl': ['carl','karl']
}
I have tried iterating through the dictionary using the keys, but I can't wrap my head around how to iterate and compare through each dictionary item and its list of values (or single value).

This is one solution using collections.defaultdict and sets.
The desired output is very similar to what you have, and can be easily manipulated to align.
from collections import defaultdict
results = {
'ben' : ['benj', 'benjamin', 'benyamin'],
'benj': ['ben', 'beny', 'benjamin'],
'benjamin': 'benyamin',
'benyamin': 'benjamin',
'carl': 'karl',
'karl': 'carl',
}
d = defaultdict(set)
for i, (k, v) in enumerate(results.items()):
    w = {k} | (set(v) if isinstance(v, list) else {v})
    for m, n in d.items():
        if not n.isdisjoint(w):
            d[m].update(w)
            break
    else:
        d[i] = w
result = {max(v, key=len): v for k, v in d.items()}
# {'benjamin': {'ben', 'benj', 'benjamin', 'beny', 'benyamin'},
# 'carl': {'carl', 'karl'}}
Credit to @IMCoins for the idea of manipulating v to w in the second loop.
Explanation
There are 3 main steps:
Convert values into a consistent set format, including keys and values from original dictionary.
Cycle through this dictionary and add values to a new dictionary. If there is an intersection with some key [i.e. sets are not disjoint], then use that key. Otherwise, add to new key determined via enumeration.
Create result dictionary in a final transformation by mapping max length key to values.
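One caveat with the steps above: the inner loop stops at the first overlapping group, so an entry that bridges two existing groups only joins one of them. The following is a minimal sketch of a variant that absorbs every overlapping group instead. The function name merge_groups and the alphabetical tie-break for equally long strings are assumptions made for illustration, not part of the original answer.

```python
def merge_groups(mapping):
    """Group every key with its values, merging groups that share a string."""
    groups = []  # disjoint sets built so far
    for key, values in mapping.items():
        # Accept either a list of strings or a single string as the value.
        current = {key} | (set(values) if isinstance(values, list) else {values})
        untouched = []
        for group in groups:
            if group.isdisjoint(current):
                untouched.append(group)
            else:
                current |= group  # absorb every overlapping group, not just the first
        groups = untouched + [current]
    # Longest string becomes the key; ties are broken alphabetically (an assumption).
    return {min(g, key=lambda s: (-len(s), s)): g for g in groups}


results = {
    'ben': ['benj', 'benjamin', 'benyamin'],
    'benj': ['ben', 'beny', 'benjamin'],
    'benjamin': ['benyamin'],
    'benyamin': ['benjamin'],
    'carl': ['karl'],
    'karl': ['carl'],
}
merged = merge_groups(results)

# 'b' links the first two groups, so all four strings end up together.
bridged = merge_groups({'a': ['b'], 'c': ['d'], 'b': ['c']})
```

For the question's data this gives the same two groups as the answer above; the difference only shows up in cases like the bridged example.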

EDIT : Even though performance was not the question here, I took the liberty to perform some tests between jpp's answer, and mine... here is the full script. My script performs the tests in 17.79 seconds, and his in 23.5 seconds.
import timeit
results = {
'ben' : ['benj', 'benjamin', 'benyamin'],
'benj': ['ben', 'beny', 'benjamin'],
'benjamin': ['benyamin'],
'benyamin': ['benjamin'],
'carl': ['karl'],
'karl': ['carl'],
}
def imcoins(results):
    new_dict = {}
    # .items() for python3x
    for k, v in results.iteritems():
        flag = False
        # Checking if key exists...
        if k not in new_dict.keys():
            # But then, we also need to check its values.
            for item in v:
                if item in new_dict.keys():
                    # If we update, set the flag to True, so we don't create a new value.
                    new_dict[item].update(v)
                    flag = True
            if flag == False:
                new_dict[k] = set(v)
    # Now, to sort our newly created dict...
    sorted_dict = {}
    for k, v in new_dict.iteritems():
        # Longest string in the group (a plain max(v) would compare alphabetically).
        max_string = max(v, key=len)
        if len(max_string) > len(k):
            sorted_dict[max_string] = set(v)
        else:
            sorted_dict[k] = v
    return sorted_dict
def jpp(results):
    from collections import defaultdict
    res = {i: {k} | (set(v) if isinstance(v, list) else {v})
           for i, (k, v) in enumerate(results.items())}
    d = defaultdict(set)
    for i, (k, v) in enumerate(res.items()):
        for m, n in d.items():
            if n & v:
                d[m].update(v)
                break
        else:
            d[i] = v
    result = {max(v, key=len): v for k, v in d.items()}
    return result
iterations = 1000000
time1 = timeit.timeit(stmt='imcoins(results)', setup='from __main__ import imcoins, results', number=iterations)
time2 = timeit.timeit(stmt='jpp(results)', setup='from __main__ import jpp, results', number=iterations)
print time1 # Outputs : 17.7903265883
print time2 # Outputs : 23.5605850732
If I move the import from his function to global scope, it gives...
imcoins : 13.4129249463 seconds
jpp : 21.8191823393 seconds

Related

Extract differences of values of 2 given dictionaries - values are tuples of strings

I have two dictionaries as follows, I need to extract which strings in the tuple values are in one dictionary but not in other:
dict_a = {"s": ("mmmm", "iiiii", "p11"), "yyzz": ("oo", "i9")}
dict_b = {"s": ("mmmm",), "h": ("pp",), "g": ("rr",)}
The desired output:
{"s": ("iiiii", "p11"), "yyzz": ("oo", "i9")}
The order of the strings in the output doesn't matter.
One way that I tried to solve, but it doesn't produce the expected result:
>>> [item for item in dict_a.values() if item not in dict_b.values()]
[('mmmm', 'iiiii', 'p11'), ('oo', 'i9')]
If order doesn't matter, convert your dictionary values to sets, and subtract these:
{k: set(v) - set(dict_b.get(k, ())) for k, v in dict_a.items()}
The above takes all key-value pairs from dict_a, and for each such pair, outputs a new dictionary with those keys and a new value that's the set difference between the original value and the corresponding value from dict_b, if there is one:
>>> dict_a = {"s": ("mmmm", "iiiii", "p11"), "yyzz": ("oo", "i9")}
>>> dict_b = {"s": ("mmmm",), "h": ("pp",), "g": ("rr",)}
>>> {k: set(v) - set(dict_b.get(k, ())) for k, v in dict_a.items()}
{'s': {'p11', 'iiiii'}, 'yyzz': {'oo', 'i9'}}
The output will have sets, but these can be converted back to tuples if necessary:
{k: tuple(set(v) - set(dict_b.get(k, ()))) for k, v in dict_a.items()}
The dict_b.get(k, ()) call ensures there is always a tuple to give to set().
If you use the set.difference() method you don't even need to turn the dict_b value to a set:
{k: tuple(set(v).difference(dict_b.get(k, ()))) for k, v in dict_a.items()}
Demo of the latter two options:
>>> {k: tuple(set(v) - set(dict_b.get(k, ()))) for k, v in dict_a.items()}
{'s': ('p11', 'iiiii'), 'yyzz': ('oo', 'i9')}
>>> {k: tuple(set(v).difference(dict_b.get(k, ()))) for k, v in dict_a.items()}
{'s': ('p11', 'iiiii'), 'yyzz': ('oo', 'i9')}
{k: [v for v in vs if v not in dict_b.get(k, [])] for k,vs in dict_a.items()}
If you want tuples instead (or sets; just replace the cast):
{k: tuple(v for v in vs if v not in dict_b.get(k, [])) for k,vs in dict_a.items()}
Try this (see comments for explanations):
>>> out = {} # Initialise output dictionary
>>> for k, v in dict_a.items(): # Iterate through items of dict_a
...     if k not in dict_b: # Check if the key is not in dict_b
...         out[k] = v # If it isn't, add to out
...     else: # Otherwise
...         out[k] = tuple(set(v) - set(dict_b[k])) # Subtract sets to find the difference
...
>>> out
{'s': ('iiiii', 'p11'), 'yyzz': ('oo', 'i9')}
This can then be simplified using a dictionary comprehension:
>>> out = {k: tuple(set(v) - set(dict_b.get(k, ()))) for k, v in dict_a.items()}
>>> out
{'s': ('iiiii', 'p11'), 'yyzz': ('oo', 'i9')}
See this solution:
First, iterate through all keys in dict_a.
Check whether the key is present in dict_b.
If it is present, take both tuples and iterate through dict_a's tuple (as per your question). For each element, check whether it is present in dict_b's tuple. If it is, skip it. If it is not, add that element to tup_res.
After the for loop, add that key and the tup_res value to dict_res.
If the key is not present in dict_b, simply add it to dict_res with its original value.
dict_a = {"s": ("mmmm", "iiiii", "p11"), "yyzz": ("oo", "i9")}
# ("mmmm",) with a trailing comma is a tuple; ("mmmm") would be a plain string
dict_b = {"s": ("mmmm",), "h": ("pp",), "g": ("rr",)}
dict_res = {}
for key in dict_a:
    if key in dict_b:
        tup_a = dict_a[key]
        tup_b = dict_b[key]
        tup_res = ()
        for tup_ele in tup_a:
            if tup_ele in tup_b:
                pass
            else:
                tup_res = tup_res + (tup_ele,)
        dict_res[key] = tup_res
    else:
        dict_res[key] = dict_a[key]
print(dict_res)
It is giving correct output :)

inverting a dictionary in Python

This code inverts a dictionary, but I am having trouble understanding the role of each element in the function invert_dict(dic). It would be great if someone could break it down and explain what each part does.
Thank you.
animals = {'Lion':["meet", 1.2 ,'yellow'],'Cat':["milk", 0.3,'white'],'dog':["Dog", 1,'black']}
def invert_dict(dic):
    return {v: d.setdefault(v, []).append(k) or d[v] for d in ({},) for k in dic for v in dic[k]}
print(invert_dict(animals))
The output:
{'meet': ['Lion'], 1.2: ['Lion'], 'yellow': ['Lion'], 'milk': ['Cat'], 0.3: ['Cat'], 'white': ['Cat'], 'Dog': ['dog'], 1: ['dog'], 'black': ['dog']}
Sounds like a classic use for defaultdict
animals = {'Lion':["meet", 1.2 ,'yellow'],'Cat':["milk", 0.3,'white'],'dog':["Dog", 1,'black']}
from collections import defaultdict
def invert_dict(d):
    ret = defaultdict(list)
    # Flatten into (value, key) pairs; this is a list, so iterate it directly
    pairs = [(v, k) for k, lst in d.items() for v in lst]
    for k, v in pairs:
        ret[k].append(v)
    return dict(ret)
{v: d.setdefault(v, []).append(k) or d[v] for d in ({},) for k in dic for v in dic[k]}
Let's break this down from the right side
for k in dic for v in dic[k]
This iterates over a dictionary whose values are lists, { k: [v1, v2] }: k takes each key in turn, and v takes each element of that key's list (v1, then v2).
v: d.setdefault(v, []).append(k) or d[v] for d in ({},)
The for v in dic[k] clause loops over the values for each key, i.e. [v1, v2], and d.setdefault(v, []) checks whether the value already exists in the helper dictionary d. If it does, the key from the original dictionary is appended to the existing list. If not, an empty list is created for that value first and the key is then appended to it.
For a sample { k: [v1, v2, v3], z: [v1] }, after each run, the state of the inverted dictionary is as follows:
{}
{ v1 : [k] }
{ v1 : [k], v2 : [k] }
{ v1 : [k], v2 : [k], v3: [k] }
{ v1 : [k, z], v2 : [k], v3: [k] } # because v1 already exists, z is appended
This is the equivalent of:
from collections import defaultdict
def invert_dict_exploded(dic):
    d = defaultdict(list)
    for key in dic:
        for value in dic[key]:
            d[value].append(key)
    return dict(d)
We need to go in order, from the outside to the inside:
Braces and colon notation
For loops
Append or value
Braces
Braces are used in Python to create a dictionary or a set. In this case they are used to create a dictionary since we have the {key: value} notation. In this discussion let's call it "the returned dictionary".
For loops
These for loops are the equivalent of:
for d in ({},):
    for k in dic:
        for v in dic[k]:
            ...
This is a shorthand notation that comes in handy to create collections quickly. It's normally quite readable when it's not, like in this case, abused.
The for d in ({},) is only used to declare a dictionary that will sit there while iterating through the keys and values (because as we see above, it's in the outer-most loop). The dictionary is named d.
Append or value
For each key/value pair, we create a new key in the returned dictionary by using the colon notation, where the key is actually one of the input values, and then we do two separate things:
Append an item in our d dictionary, which is always a list
Assign the list to the given key
The d.setdefault(v, []) bit will set a default value of [] if no key v is found in the dictionary d and then will return that list (the newly created empty list or the list already found at that key), which is then used by the .append(k) bit to append the key as a value to that list. This takes care of cases where several of your input lists contain the same value, collecting all the keys together for that value, like in:
animals = {'Lion':["meet", 1.2 ,'yellow'],'Cat':["milk", 0.3,'black'],'dog':["Dog", 1,'black']}
Where you can see the multiple lists containing the "black" item and will output the following:
{'meet': ['Lion'], 1.2: ['Lion'], 'yellow': ['Lion'], 'milk': ['Cat'], 0.3: ['Cat'], 'black': ['Cat', 'dog'], 'Dog': ['dog'], 1: ['dog']}
Notice both "Cat" and "dog" keys are added to the "black" list in the result.
Finally, the or part. The list.append() method always returns None, since any function that doesn't return explicitly returns None in Python.
The or operator is used to short-circuit the expression. It's written as A or B and is to be read as "If A evaluates to a true value, the expression evaluates to A; if A evaluates to a false value, the expression evaluates to B". None always evaluates to false in boolean terms, so the expression d.setdefault(v, []).append(k) or d[v] always evaluates to d[v] but only after executing the setdefault() and append().
v: d.setdefault(v, []).append(k) or d[v] can therefore be read as:
Create a key v in our returned dictionary; if v is not a key of d, set d[v] = []; append to d[v] the value k and set d[v] as the value of v.
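To see the append-then-or trick in isolation, here is a minimal sketch using a throwaway dictionary d and the "black" value from the example above. The variable names first and second are chosen only for illustration.

```python
d = {}

# append() returns None, so `or` falls through to the right-hand side, d["black"].
first = d.setdefault("black", []).append("Cat") or d["black"]
print(first)   # ['Cat']

# The key now exists, so setdefault returns the same list and "dog" is appended to it.
second = d.setdefault("black", []).append("dog") or d["black"]
print(second)  # ['Cat', 'dog']

print(first is second)  # True: both names refer to the one list stored in d
print([].append(1))     # None
```

Because every entry in the returned dictionary points at the list held in d, later appends for the same value show up in the result as well.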

Check if value exists in a dictionary of dictionaries and get the key(s)?

I have a dictionary of dictionaries:
x = {'NIFTY': {11382018: 'NIFTY19SEPFUT', 13177346: 'NIFTY19OCTFUT', 12335874: 'NIFTY19NOVFUT'}}
The dictionary has a lot of other dictionaries inside.
I want to check whether a key, for example:
y = 11382018
exists in one of the nested dictionaries and, if it does, get the master key (NIFTY in this case) and the value stored under that key, i.e. 'NIFTY19SEPFUT'.
I can do this in the following way I assume:
for key in x.keys():
    di = x[key]
    if y in di.keys():
        inst = key
        cont = di[y]
Just wondering if there is a better way.
I was thinking along the lines of not having to loop over the entire dictionary master keys
A more compact way to retrieve both values of interest would be using a nested dictionary comprehension:
[(k, sv) for k,v in x.items() for sk,sv in v.items() if sk == y]
# [('NIFTY', 'NIFTY19SEPFUT')]
More compact version (generic):
[(k, v[y]) for k, v in d.items() if y in v]
Or, if you can guarantee the key is found in only one nested dictionary:
next(((k, v[y]) for k, v in d.items() if y in v), 'not found')
(Note that I have used d as the dictionary name here, simply because that feels more meaningful.)
Code:
d = {'NIFTY': {11382018: 'NIFTY19SEPFUT', 13177346: 'NIFTY19OCTFUT', 12335874: 'NIFTY19NOVFUT'}}
y = 11382018
print([(k, v[y]) for k, v in d.items() if y in v])
# or:
# print(*next(((k, v[y]) for k, v in d.items() if y in v), ('not found',)))
Straightforwardly (for only 2 levels of nesting):
x = {'NIFTY': {11382018: 'NIFTY19SEPFUT', 13177346: 'NIFTY19OCTFUT', 12335874: 'NIFTY19NOVFUT'}}
search_key = 11382018
parent_key, value = None, None
for k, inner_d in x.items():
    if search_key in inner_d:
        parent_key, value = k, inner_d[search_key]
        break
print(parent_key, value) # NIFTY NIFTY19SEPFUT
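All of the approaches above still visit the master keys on every search. If many lookups are expected, one possible alternative (not taken from the answers, and assuming each inner key appears under only one master key) is to build a reverse index once and then look keys up directly. The name reverse_index is hypothetical.

```python
x = {'NIFTY': {11382018: 'NIFTY19SEPFUT', 13177346: 'NIFTY19OCTFUT', 12335874: 'NIFTY19NOVFUT'}}

# Built once: inner key -> (master key, value).
# Assumes inner keys are unique across all nested dictionaries;
# with duplicates, the last master key seen would win.
reverse_index = {
    inner_key: (master_key, value)
    for master_key, inner in x.items()
    for inner_key, value in inner.items()
}

found = reverse_index.get(11382018)  # ('NIFTY', 'NIFTY19SEPFUT')
missing = reverse_index.get(42)      # None
```

The index costs one full pass and extra memory up front, and it has to be rebuilt or updated whenever x changes, so it only pays off when lookups are frequent.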

Create dictionary from dict and list

I have a dictionary :
dicocategory = {}
dicocategory["a"] = ["crapow", "Killian", "pauk", "victor"]
dicocategory["b"] = ["graton", "fred"]
dicocategory["c"] = ["babar", "poca", "german", "Georges", "nowak"]
dicocategory["d"] = ["crado", "cradi", "hibou", "distopia", "fiboul"]
dicocategory["e"] = ["makenkosapo"]
and a list :
my_list = ['makenkosapo', 'Killian', 'Georges', 'poca', 'nowak']
I want to create a new dictionary with my dicocategory's keys as new keys and items of my list as values.
To get the keys of my new dict (removing duplicate content and adapted to my list) I made :
def tablemain(my_list):
    tableheaders = list()
    for value in my_list:
        tableheaders.append([k for k, v in dicocategory.items() if value in v])
    convertlist = [j for i in tableheaders for j in i]
    headerstablefinal = list(set(convertlist))
    return headerstablefinal
giving me:
['e', 'a', 'c']
My problem is: I don't know how to put the items of my list in the corresponding keys.
EDIT :
Below is the output I want:
{"a" : ['Killian'], 'c' : ['Georges', 'poca', 'nowak'], 'e' : ['makenkosapo']}
The list my_list can change, so I want something that can create the new dictionary no matter what the list contains.
If my new list is :
my_list = ['crapow', 'german', 'pauk']
My output will be :
{'a':['crapow', 'pauk'], 'c':['german']}
Do you have any idea?
Thank you
You can use a couple of dictionary comprehensions. Calculate the intersection in the first, and in the second remove instances where the intersection is empty:
my_set = set(my_list)
# calculate intersection
res = {k: set(v) & my_set for k, v in dicocategory.items()}
# remove zero intersection values
res = {k: v for k, v in res.items() if v}
print(res)
{'a': {'Killian'},
'c': {'Georges', 'nowak', 'poca'},
'e': {'makenkosapo'}}
More efficiently, you can use a generator expression to avoid an intermediary dictionary:
# generate intersection
gen = ((k, set(v) & my_set) for k, v in dicocategory.items())
# remove zero intersection values
res = {k: v for k, v in gen if v}
You can get a dictionary containing only keys with values that match your list like this:
{k:v for k,v in dicocategory.items() if set(v).intersection(set(my_list))}
You won't be able to put that directly into a DataFrame though as the lists differ in length.
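The answers above return sets, or the full original lists. If the values should be lists in the order the items appear in my_list, as in the question's expected output, a minimal sketch could look like the following. The function name group_by_category is hypothetical, and the sketch assumes each item belongs to at most one category.

```python
def group_by_category(categories, items):
    """Return {category: [items found in it]}, keeping the order of `items`."""
    # Reverse lookup: member -> category (assumes a member appears in one category only).
    owner = {member: key for key, members in categories.items() for member in members}
    result = {}
    for item in items:
        key = owner.get(item)
        if key is not None:  # items that match no category are skipped
            result.setdefault(key, []).append(item)
    return result


dicocategory = {
    "a": ["crapow", "Killian", "pauk", "victor"],
    "b": ["graton", "fred"],
    "c": ["babar", "poca", "german", "Georges", "nowak"],
    "d": ["crado", "cradi", "hibou", "distopia", "fiboul"],
    "e": ["makenkosapo"],
}

first = group_by_category(dicocategory, ['makenkosapo', 'Killian', 'Georges', 'poca', 'nowak'])
second = group_by_category(dicocategory, ['crapow', 'german', 'pauk'])
```

Categories with no matching item never get a key, so no separate filtering step is needed.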

Python : Match a dictionary value with another dictionary key

I have two dictionaries created this way :
tr = defaultdict(list)
tr = { 'critic' : '2_critic',
'major' : '3_major',
'all' : ['2_critic','3_major']
}
And the second one :
scnd_dict = defaultdict(list)
And contains values like this :
scnd_dict = {'severity': ['all']}
I want to have a third dict that will contain the key of scnd_dict and its corresponding value from tr.
This way, I will have :
third_dict = {'severity' : ['2_critic','3_major']}
I tried this, but it didn't work :
for (k,v) in scnd_dict.iteritems() :
    if v in tr:
        third_dict[k].append(tr[v])
Any help would be appreciated. Thanks.
Well...
from collections import defaultdict
tr = {'critic' : '2_critic',
'major' : '3_major',
'all' : ['2_critic','3_major']}
scnd_dict = {'severity': ['all']}
third_dict = {}
for k, v in scnd_dict.iteritems():
    vals = []
    if isinstance(v, list):
        for i in v:
            vals.append(tr.get(i))
    else:
        vals.append(tr.get(v))
    if not vals:
        continue
    third_dict[k] = vals
print third_dict
Results:
>>>
{'severity': [['2_critic', '3_major']]}
Will do what you want. But I question the logic of using defaultdicts here, or of having your index be part of a list...
If you use non-lists for scnd_dict then you can do the whole thing much easier. Assuming scnd_dict looks like this: scnd_dict = {'severity': 'all'}:
d = dict((k, tr.get(v)) for k, v in scnd_dict.items())
# {'severity': ['2_critic', '3_major']}
Your problem is that v is a list, not an item of a list. So, the if v in tr: will be false. Change your code so that you iterate over the items in v
third_dict = {k: [t for m in ks for t in tr[m]] for k,ks in scnd_dict.iteritems()}
The second dict's value is a list, not a str, so the code below will work:
for (k, v) in scnd_dict.iteritems():
    if v[0] in tr.keys():
        third_dict[k] = tr[v[0]]
The problem is that the third dictionary does not know that the value is a list:
for k in scnd_dict:
    for v in scnd_dict[k]:
        print v
        for k2 in tr:
            if v==k2:
                if k not in third_dict:
                    third_dict[k]=tr[k2]
                else:
                    third_dict[k]+=tr[k2]
third_dict = {k: tr[v[0]] for k, v in scnd_dict.iteritems() if v[0] in tr}
This
tr = defaultdict(list)
is a waste of time if you are just rebinding tr on the next line. Likewise for scnd_dict.
It's a better idea to make all the values of tr lists, even if they only have one item. It will mean fewer special cases to worry about later on.
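As a minimal sketch of that suggestion, written for Python 3 and assuming every value in tr has been rewritten as a list (a change to the question's data, not something the asker had), the lookup becomes a single comprehension:

```python
# Every value in tr is a list, even single-item ones (an assumed change to the data).
tr = {
    'critic': ['2_critic'],
    'major': ['3_major'],
    'all': ['2_critic', '3_major'],
}
scnd_dict = {'severity': ['all']}

# For each key, concatenate the tr lists of every name it refers to.
# Names missing from tr contribute nothing.
third_dict = {
    k: [t for name in names for t in tr.get(name, [])]
    for k, names in scnd_dict.items()
}
print(third_dict)  # {'severity': ['2_critic', '3_major']}
```

With uniform list values there is no need for isinstance checks or for indexing v[0].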
