How to efficiently perform a dictionary merge? - python

For a problem I am solving, I have a list of dictionaries. The problem involves multiple queries of the form merge(a, b, c). Merging means, in the result, the count for the common keys is added/subtracted and uncommon keys (and their values) are appended as is.
I am currently using Python's collections.Counter to represent the dictionaries and perform the merge as follows:
def merge(a, b, c):
    counter_a, counter_b, counter_c = DICTLIST[a], DICTLIST[b], DICTLIST[c]
    total = counter_a + counter_b - counter_c  # Type collections.Counter
    return total
Although this is a convenient solution, in the problem, there can be up to 10**5 such queries. On such a scale, using this approach is too slow. Is there a better approach to solving this?
NOTE: Pre-computation of the merge queries is not practical as the number of possible inputs is very large.
Example:
DICTLIST[a] = Counter({1:5,2:10})
DICTLIST[b] = Counter({2:10,3:20})
DICTLIST[c] = Counter({1:2})
merge(a,b,c) # Expected Output: {1:3, 2:20, 3:20}

Try this -
def mergeDict(dict1, dict2):
    dict3 = {**dict1, **dict2}
    for key, value in dict3.items():
        if key in dict1 and key in dict2:
            dict3[key] = value + dict1[key]
    return dict3
Then you can call like this -
# Create first dictionary
dict1 = {1:5,2:10}
# Create second dictionary
dict2 = {2:10,3:20}
# Create third dictionary
dict3 = {1:-2}
dict4 = mergeDict(dict3, mergeDict(dict1, dict2))
Please note I have "-2" in the third dict for the subtraction logic.
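If the inputs are already Counter objects, the same add-then-subtract merge can also be done in place on a single copy instead of building two intermediate results. This is only a minimal sketch, assuming all stored counts are positive; merge_counts is a hypothetical name, not something from the question:

```python
from collections import Counter

def merge_counts(a, b, c):
    """Return a + b - c while building only one working Counter."""
    total = Counter(a)   # copy, so the inputs stay untouched
    total.update(b)      # add the counts from b in place
    total.subtract(c)    # subtract the counts from c in place
    return +total        # unary plus drops zero and negative counts

result = merge_counts(Counter({1: 5, 2: 10}),
                      Counter({2: 10, 3: 20}),
                      Counter({1: 2}))
print(dict(result))
```

Whether this is fast enough for 10**5 queries depends on how large the dictionaries are, so it is worth timing against the original version.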

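You can use dictionary unpacking (**) here. Note that for a key present in both dictionaries the value from the later dictionary wins; the values are not added:
x = {1: 5, 2: 10}
y = {2: 10, 3: 20}
z = {**x, **y}
If you want to optimize further because there are many queries, cache the results in a dictionary keyed by the query arguments: a lookup is usually cheaper than recomputing the merge.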
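One way to read the caching suggestion is to memoize the merge on its index arguments. The sketch below uses functools.lru_cache; the DICTLIST contents and the name merge_cached are stand-ins for illustration, and it assumes the stored counters never change after the first query:

```python
from collections import Counter
from functools import lru_cache

# Hypothetical data standing in for the question's DICTLIST.
DICTLIST = [Counter({1: 5, 2: 10}), Counter({2: 10, 3: 20}), Counter({1: 2})]

@lru_cache(maxsize=None)
def merge_cached(a, b, c):
    """Merge by index; a repeated (a, b, c) query is answered from the cache."""
    return DICTLIST[a] + DICTLIST[b] - DICTLIST[c]

first = merge_cached(0, 1, 2)   # computed on the first call
second = merge_cached(0, 1, 2)  # returned from the cache
```

This only pays off when the same (a, b, c) triple is queried more than once, and the cached Counter is shared between callers, so it should be treated as read-only.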

My first instinct is to look for something like the JavaScript "spread" operator for Python:
https://mlpipes.com/object-spread-operator-python/
Example here:
old_dict = {'hello': 'world', 'foo': 'bar'}
new_dict = {**old_dict, 'foo': 'baz'}
For your code, where a, b and c are indices into DICTLIST, you should try something like:
DICTLIST[d] = {**DICTLIST[a], **DICTLIST[b], **DICTLIST[c]}
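Before relying on unpacking for this problem, it may help to compare it with Counter addition on a key that both inputs share. The values below are made up for illustration:

```python
from collections import Counter

x = {1: 5, 2: 10}
y = {2: 7, 3: 20}   # illustrative values; key 2 is shared with x

overwritten = {**x, **y}            # key 2 keeps only y's value
summed = Counter(x) + Counter(y)    # key 2 holds the sum of both values

print(overwritten)   # {1: 5, 2: 7, 3: 20}
print(dict(summed))  # {1: 5, 2: 17, 3: 20}
```

So unpacking is a union in which later values replace earlier ones, which is not the additive merge the question asks for.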

Related

Splitting a dictionary by key suffixes

I have a dictionary like so
d = {"key_a":1, "anotherkey_a":2, "key_b":3, "anotherkey_b":4}
So the values and key names are not important here. The key thing (no pun intended) is that related keys share the same suffix; in my example above, the suffixes are _a and _b.
These suffixes are not known beforehand (they are not always _a and _b, for example), and there is an unknown number of different suffixes.
What I would like to do is extract related keys into their own dictionaries, and collect all generated dictionaries in a list.
The output from above would be
output = [{"key_a":1, "anotherkey_a":2},{"key_b":3, "anotherkey_b":4}]
My current approach is to first get all the suffixes, and then generate the sub-dicts one at a time and append to the new list
output = list()
# Generate a set of suffixes
suffixes = set([k.split("_")[-1] for k in d.keys()])
# Create the subdict and append to output
for suffix in suffixes:
    output.append({k:v for k,v in d.items() if k.endswith(suffix)})
This works (and is not prohibitively slow or anything), but I am simply wondering if there is a more elegant way to do it with a list or dict comprehension? Just out of interest...
Make your output a defaultdict rather than a list, with suffixes as keys:
from collections import defaultdict
output = defaultdict(dict)
for k, v in d.items():
    prefix, suffix = k.rsplit('_', 1)
    output[suffix][k] = v
This will split your dict in a single pass and result in something like:
output = {"a" : {"key_a":1, "anotherkey_a":2}, "b": {"key_b":3, "anotherkey_b":4}}
and if you insist on converting it to a list, you can simply use:
output = list(output.values())
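The single-pass grouping can be wrapped in a small function. This is a sketch rather than the answer's exact code: split_by_suffix and its sep parameter are hypothetical names, and it uses str.rpartition so that a key without a separator does not raise an unpacking error:

```python
from collections import defaultdict

def split_by_suffix(d, sep="_"):
    """Group items by the text after the last separator in each key."""
    groups = defaultdict(dict)
    for key, value in d.items():
        # rpartition never raises; a key without sep is grouped under its full name
        _, _, suffix = key.rpartition(sep)
        groups[suffix][key] = value
    return list(groups.values())

d = {"key_a": 1, "anotherkey_a": 2, "key_b": 3, "anotherkey_b": 4}
print(split_by_suffix(d))
# [{'key_a': 1, 'anotherkey_a': 2}, {'key_b': 3, 'anotherkey_b': 4}]
```

The order of the resulting list follows the first appearance of each suffix, which relies on dictionaries preserving insertion order (Python 3.7+).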
You could condense the lines
output = list()
for suffix in suffixes:
    output.append({k:v for k,v in d.items() if k.endswith(suffix)})
to a list comprehension, like this
[{k:v for k,v in d.items() if k.endswith(suffix)} for suffix in suffixes]
Whether it is more elegant is probably in the eyes of the beholder.
The approach suggested by @Błotosmętek will probably be faster though, given a large dictionary, since it results in less looping.
def sub_dictionary_by_suffix(dictionary, suffix):
    sub_dictionary = {k: v for k, v in dictionary.items() if k.endswith(suffix)}
    return sub_dictionary
I hope it helps
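One caveat with the endswith-based versions: if the suffix is passed without its underscore (which is what k.split("_")[-1] produces), it also matches keys that merely end in the same letters. A small illustration with made-up keys:

```python
d = {"key_a": 1, "data": 2, "key_b": 3}

# Bare suffix: "data" is matched too, because it ends in "a".
loose = {k: v for k, v in d.items() if k.endswith("a")}

# Including the separator restricts the match to real "_a" keys.
strict = {k: v for k, v in d.items() if k.endswith("_a")}

print(loose)   # {'key_a': 1, 'data': 2}
print(strict)  # {'key_a': 1}
```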

Finding a key-value pair present only in the first dictionary

Two dictionaries:
dict1 = {'firstvalue':1, 'secondvalue':2, 'fourthvalue':4}
dict2 = {'firstvalue':1, 'thirdvalue':3, 'fourthvalue':5}
I get set(['secondvalue']) as a result upon doing:
dict1.viewkeys() - dict2
I need {'secondvalue':2} as a result.
When I use set and then apply the - operation, it does not give the desired result, as the result contains {'fourthvalue': 4} as well.
How could I do it?
The problem with - is that (in this context) it is an operation on dict_keys, so the result has no values. Using - with viewitems() does not work either, as the items are (key, value) tuples, i.e. both keys and values are compared.
Instead, you can use a conditional dictionary comprehension, keeping only those keys that do not appear in the second dictionary. Unlike Counter, this also works in the more general case where the values are not integers; with integer values, it just checks whether a key is present, irrespective of the value that is associated with it.
>>> dict1 = {'firstvalue':1, 'secondvalue':2, 'fourthvalue':4}
>>> dict2 = {'firstvalue':1, 'thirdvalue':3, 'fourthvalue':5}
>>> {k: v for k, v in dict1.items() if k not in dict2}
{'secondvalue': 2}
If I understand correctly, and taking the title "Finding a key-value pair present only in the first dictionary" literally, you could build a set from the key/value pairs (as tuples) of each dictionary, subtract one set from the other, and construct a dictionary from the result:
dict(set(dict1.items()) - set(dict2.items()))
# {'fourthvalue': 4, 'secondvalue': 2}
Another simple variation with set difference:
res = {k: dict1[k] for k in dict1.keys() - dict2.keys()}
Python 2.x:
dict1 = {'firstvalue':1, 'secondvalue':2, 'fourthvalue':4}
dict2 = {'firstvalue':1, 'thirdvalue':3, 'fourthvalue':5}
keys = dict1.viewkeys() - dict2.viewkeys()
print ({key:dict1[key] for key in keys})
output:
{'secondvalue': 2}
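The answers above differ in what they compare. A short Python 3 side-by-side, as a sketch using the question's data (the item-based variant assumes all values are hashable):

```python
dict1 = {'firstvalue': 1, 'secondvalue': 2, 'fourthvalue': 4}
dict2 = {'firstvalue': 1, 'thirdvalue': 3, 'fourthvalue': 5}

# Key-based: keep entries whose key is absent from dict2.
only_keys = {k: v for k, v in dict1.items() if k not in dict2}

# Item-based: keep entries whose (key, value) pair is absent from dict2.
only_items = dict(dict1.items() - dict2.items())

print(only_keys)                   # {'secondvalue': 2}
print(sorted(only_items.items()))  # [('fourthvalue', 4), ('secondvalue', 2)]
```

Only the key-based version gives the {'secondvalue': 2} result the question asks for.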

How to reference a combination of dict() and list() in Python3?

This is an example of a complex data structure. The depth of the structure is not fixed. To reference a specific datum in the structure I need an unknown number of indices (for list()) and keys (for dict()).
>>> x = [{'child': [{'text': 'ass'}, {'group': 'wef'}]}]
>>> x[0]['child'][0]['text']
'ass'
Now I want to have single keys for the values like this.
keys = {'ID01': [0]['child'][0]['text'],
'ID02': [1]['group']}
But this is not possible. Is there another pythonic way?
I think you need a couple of things here. First is a custom lookup function:
def lookup(obj, keys):
    for k in keys:
        obj = obj[k]
    return obj
Then a dictionary of keys to key list tuples:
keys = {'ID01': (0,'child',0,'text'),
'ID02': (1,'group')}
then you can do this:
lookup(x, keys['ID01']) # returns 'ass'
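The same path-walking idea can be written with functools.reduce, with a fallback for paths that do not exist. This is a sketch, not the answer's code: the default parameter, the sample values, and the ID02 path are assumptions made for illustration:

```python
from functools import reduce
from operator import getitem

def lookup(obj, path, default=None):
    """Follow a sequence of indices/keys through nested lists and dicts."""
    try:
        return reduce(getitem, path, obj)
    except (KeyError, IndexError, TypeError):
        return default  # the path does not exist in this structure

x = [{'child': [{'text': 'abc'}, {'group': 'wef'}]}]
paths = {'ID01': (0, 'child', 0, 'text'),
         'ID02': (0, 'child', 1, 'group')}

print(lookup(x, paths['ID01']))  # abc
print(lookup(x, paths['ID02']))  # wef
print(lookup(x, (1, 'group')))   # None
```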

Returning unique elements from values in a dictionary

I have a dictionary like this :
d = {'v03':["elem_A","elem_B","elem_C"],'v02':["elem_A","elem_D","elem_C"],'v01':["elem_A","elem_E"]}
How would you return a new dictionary with the elements that are not contained in the list stored under the highest key?
In this case :
d2 = {'v02':['elem_D'],'v01':["elem_E"]}
Thank you,
I prefer to do differences with the builtin data type designed for it: sets.
It is also preferable to write loops rather than elaborate comprehensions. One-liners are clever, but understandable code that you can return to and understand is even better.
d = {'v03':["elem_A","elem_B","elem_C"],'v02':["elem_A","elem_D","elem_C"],'v01':["elem_A","elem_E"]}
last = None
d2 = {}
for key in sorted(d.keys()):
    if last:
        # Keep the elements of the previous key that are missing from this one
        if set(d[last]) - set(d[key]):
            d2[last] = sorted(set(d[last]) - set(d[key]))
    last = key
print(d2)
{'v01': ['elem_E'], 'v02': ['elem_D']}
from collections import defaultdict
myNewDict = defaultdict(list)
all_keys = sorted(d)  # works on Python 2 and 3, unlike d.keys().sort()
max_value = all_keys[-1]
for key in d:
    if key != max_value:
        for value in d[key]:
            if value not in d[max_value]:
                myNewDict[key].append(value)
You can get fancier with set operations by taking the set difference between the values in d[max_value] and each of the other keys but first I think you should get comfortable working with dictionaries and lists.
defaultdict(<type 'list'>, {'v01': ['elem_E'], 'v02': ['elem_D']})
One reason not to use sets is that the solution does not generalize well, because sets can only contain hashable objects. If your values are lists of lists, the members (sublists) are not hashable, so you can't use a set operation.
Depending on your python version, you may be able to get this done with only one line, using dict comprehension:
>>> d2 = {k:[v for v in values if not v in d.get(max(d.keys()))] for k, values in d.items()}
>>> d2
{'v01': ['elem_E'], 'v02': ['elem_D'], 'v03': []}
This builds a copy of dict d in which every list is stripped of all items stored under the max key. The resulting dict looks more or less like what you are going for.
If you don't want the empty list at key v03, wrap the result itself in another dict:
>>> {k:v for k,v in d2.items() if len(v) > 0}
{'v01': ['elem_E'], 'v02': ['elem_D']}
EDIT:
In case your original dict has a very large key set (or the operation is run frequently), you may also want to replace the expression d.get(max(d.keys())) with a previously assigned list variable for performance, since the expression is otherwise re-evaluated for every element that is checked. On my machine this roughly halves the runtime: the following runs 100,000 times in 1.5 seconds, whereas the unsubstituted expression takes more than 3 seconds.
>>> bl = d.get(max(d.keys()))
>>> d2 = {k:v for k,v in {k:[v for v in values if not v in bl] for k, values in d.items()}.items() if len(v) > 0}
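For readers who prefer a loop over nested comprehensions, the same idea can be sketched as a function that computes the reference list once. The name missing_from_latest is hypothetical, and the set lookup assumes the list elements are hashable:

```python
def missing_from_latest(d):
    """For every key except the highest, keep elements absent from the highest key's list."""
    latest = max(d)          # highest key, found once
    seen = set(d[latest])    # set for fast membership tests
    result = {}
    for key, values in d.items():
        if key == latest:
            continue
        extra = [v for v in values if v not in seen]
        if extra:            # skip keys that would map to an empty list
            result[key] = extra
    return result

d = {'v03': ["elem_A", "elem_B", "elem_C"],
     'v02': ["elem_A", "elem_D", "elem_C"],
     'v01': ["elem_A", "elem_E"]}
print(missing_from_latest(d))
# {'v02': ['elem_D'], 'v01': ['elem_E']}
```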

Dict comprehension, tuples and lazy evaluation

I am trying to see if I can pull off something quite lazy in Python.
I have a dict comprehension, where the value is a tuple. I want to be able to create the second entry of the tuple by using the first entry of the tuple.
An example should help.
dictA = {'a': 1, 'b': 3, 'c': 42}
{key: (a = someComplexFunction(value), moreComplexFunction(a)) for key, value in dictA.items()}
Is it possible that the moreComplexFunction uses the calculation in the first entry of the tuple?
You could add a second loop over a one-element tuple:
{key: (a, moreComplexFunction(a)) for key, value in dictA.items()
      for a in (someComplexFunction(value),)}
This gives you access to the output of someComplexFunction(value) in the value expression, but that's rather ugly.
Personally, I'd move to a regular loop in such cases:
dictB = {}
for key, value in dictA.items():
    a = someComplexFunction(value)
    dictB[key] = (a, moreComplexFunction(a))
and be done with it.
or, you could just write a function to return the tuple:
def kv_tuple(value):
    tmp = someComplexFunction(value)
    return (tmp, moreComplexFunction(tmp))
{key: kv_tuple(value) for key, value in dictA.items()}
This also gives you the option to use things like namedtuple to get names for the tuple items, etc. I don't know how much faster or slower this would be, though; the regular loop is likely to be faster (fewer function calls).
Alongside Martijn's answer, using a generator expression and a dict comprehension is also quite semantic and lazy:
dictA = { ... } # Your original dict
partially_computed = ((key, someComplexFunction(value))
                      for key, value in dictA.items())
dictB = {key: (a, moreComplexFunction(a)) for key, a in partially_computed}
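On Python 3.8 and later, an assignment expression (:=) comes close to the syntax the question was reaching for. A minimal sketch, where the two functions are placeholder implementations standing in for the question's someComplexFunction and moreComplexFunction:

```python
def some_complex_function(value):
    return value * 2   # placeholder for the real computation

def more_complex_function(a):
    return a + 1       # placeholder for the real computation

dict_a = {'a': 1, 'b': 3, 'c': 42}

# Tuple elements are evaluated left to right, so `a` is bound before it is reused.
dict_b = {key: ((a := some_complex_function(value)), more_complex_function(a))
          for key, value in dict_a.items()}

print(dict_b)  # {'a': (2, 3), 'b': (6, 7), 'c': (84, 85)}
```

Note that a bound this way remains visible in the enclosing scope after the comprehension finishes.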
