How to find all differences between two dictionaries efficiently in Python

I have two dictionaries. I need to find the keys that are missing from each one and, for the keys they share, check whether the values are the same or different.
dict1 = {..}
dict2 = {..}
#key values in a list that are missing in each
missing_in_dict1_but_in_dict2 = []
missing_in_dict2_but_in_dict1 = []
#key values in a list that are mismatched between the 2 dictionaries
mismatch = []
What's the most efficient way to do this?

You can use dictionary view objects, which act as sets. Subtract sets to get the difference:
missing_in_dict1_but_in_dict2 = dict2.keys() - dict1
missing_in_dict2_but_in_dict1 = dict1.keys() - dict2
For the keys that are the same, use the intersection, with the & operator:
mismatch = {key for key in dict1.keys() & dict2 if dict1[key] != dict2[key]}
If you are still using Python 2, use dict.viewkeys().
Using dictionary views to produce intersections and differences is very efficient. The view objects themselves are very lightweight, and the algorithms that create the new sets from the set operations can make direct use of the O(1) lookup behaviour of the underlying dictionaries.
Demo:
>>> dict1 = {'foo': 42, 'bar': 81}
>>> dict2 = {'bar': 117, 'spam': 'ham'}
>>> dict2.keys() - dict1
{'spam'}
>>> dict1.keys() - dict2
{'foo'}
>>> {key for key in dict1.keys() & dict2 if dict1[key] != dict2[key]}
{'bar'}
and a performance comparison with creating separate set() objects:
>>> import timeit
>>> import random
>>> def difference_views(d1, d2):
...     missing1 = d2.keys() - d1
...     missing2 = d1.keys() - d2
...     mismatch = {k for k in d1.keys() & d2 if d1[k] != d2[k]}
...     return missing1, missing2, mismatch
...
>>> def difference_sets(d1, d2):
...     missing1 = set(d2) - set(d1)
...     missing2 = set(d1) - set(d2)
...     mismatch = {k for k in set(d1) & set(d2) if d1[k] != d2[k]}
...     return missing1, missing2, mismatch
...
>>> testd1 = {random.randrange(1000000): random.randrange(1000000) for _ in range(10000)}
>>> testd2 = {random.randrange(1000000): random.randrange(1000000) for _ in range(10000)}
>>> timeit.timeit('d(d1, d2)', 'from __main__ import testd1 as d1, testd2 as d2, difference_views as d', number=1000)
1.8643521590274759
>>> timeit.timeit('d(d1, d2)', 'from __main__ import testd1 as d1, testd2 as d2, difference_sets as d', number=1000)
2.811345119960606
Using set() objects is slower, especially when your input dictionaries get larger.
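The three results from the view-based approach can be bundled into one helper. This is a minimal Python 3 sketch; the function name `dict_diff` and the order of its return values are assumptions made for illustration, not part of the answer above.

```python
def dict_diff(d1, d2):
    """Return (only_in_d2, only_in_d1, mismatched) as sets of keys."""
    # Key views support set operations directly; no set() copies are needed.
    only_in_d2 = d2.keys() - d1.keys()
    only_in_d1 = d1.keys() - d2.keys()
    # For keys present in both, keep the ones whose values differ.
    mismatched = {k for k in d1.keys() & d2.keys() if d1[k] != d2[k]}
    return only_in_d2, only_in_d1, mismatched


dict1 = {'foo': 42, 'bar': 81}
dict2 = {'bar': 117, 'spam': 'ham'}
missing_in_dict1, missing_in_dict2, mismatch = dict_diff(dict1, dict2)
print(missing_in_dict1, missing_in_dict2, mismatch)
```

Returning plain sets keeps the result independent of later changes to either dictionary.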

One easy way is to create sets from the dict keys and subtract them:
>>> dict1 = { 'a': 1, 'b': 1 }
>>> dict2 = { 'b': 1, 'c': 1 }
>>> missing_in_dict1_but_in_dict2 = set(dict2) - set(dict1)
>>> missing_in_dict1_but_in_dict2
set(['c'])
>>> missing_in_dict2_but_in_dict1 = set(dict1) - set(dict2)
>>> missing_in_dict2_but_in_dict1
set(['a'])
Or you can avoid casting the second dict to a set by using .difference():
>>> set(dict1).difference(dict2)
set(['a'])
>>> set(dict2).difference(dict1)
set(['c'])
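The reason .difference() works without a second cast is that the named set methods accept any iterable, while the operator forms need a set (or set-like view) on both sides. A small Python 3 sketch of that distinction; the variable names are illustrative only.

```python
dict1 = {'a': 1, 'b': 1}
dict2 = {'b': 1, 'c': 1}

# Named methods accept any iterable, so a dict (iterated by key) works as-is.
only_in_dict1 = set(dict1).difference(dict2)
shared = set(dict1).intersection(dict2)

# The operator form rejects a plain dict on the right-hand side.
try:
    set(dict1) - dict2
except TypeError:
    operator_accepts_dict = False
else:
    operator_accepts_dict = True
```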

Related

get original key set from defaultdict

Is there a way to get the original/consistent list of keys from defaultdict even when non existing keys were requested?
>>> from collections import defaultdict
>>> d = defaultdict(lambda: 'default', {'key1': 'value1', 'key2' :'value2'})
>>>
>>> d.keys()
['key2', 'key1']
>>> d['bla']
'default'
>>> d.keys() # how to get the same: ['key2', 'key1']
['key2', 'key1', 'bla']
You have to exclude the keys that have the default value:
>>> [i for i in d if d[i]!=d.default_factory()]
['key2', 'key1']
Time comparison with the method suggested by Jean:
>>> import time
>>> def funct(a=None,b=None,c=None):
...     s=time.time()
...     eval(a)
...     print time.time()-s
...
>>> funct("[i for i in d if d[i]!=d.default_factory()]")
9.29832458496e-05
>>> funct("[k for k,v in d.items() if v!=d.default_factory()]")
0.000100135803223
>>> # Storing the default value in a variable and reusing it in the list comprehension reduces the time slightly.
>>> defa=d.default_factory()
>>> funct("[i for i in d if d[i]!=defa]")
8.82148742676e-05
>>> funct("[k for k,v in d.items() if v!=defa]")
9.79900360107e-05
[key for key in d if d[key] != 'default']
default_factory() is a callable and need not return the same value each time!
>>> from collections import defaultdict
>>> from random import random
>>> d = defaultdict(lambda: random())
>>> d[1]
0.7411252345322932
>>> d[2]
0.09672701444816645
>>> d.keys()
dict_keys([1, 2])
>>> d.default_factory()
0.06277993247659297
>>> d.default_factory()
0.4388136209046052
>>> d.keys()
dict_keys([1, 2])
>>> [k for k in d.keys() if d[k] != d.default_factory()]
[1, 2]
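Since comparing against the default value is unreliable, another option is to avoid inserting keys in the first place. The sketch below (Python 3) relies on the fact that `.get()` and the `in` operator do not call `default_factory`; taking a snapshot of the keys up front is an additional assumption about what "original keys" means here.

```python
from collections import defaultdict

d = defaultdict(lambda: 'default', {'key1': 'value1', 'key2': 'value2'})
original_keys = set(d)            # snapshot taken before any lookups

# Neither .get() nor `in` triggers default_factory, so no key is inserted.
value = d.get('bla', 'default')
present = 'bla' in d
keys_after_safe_lookups = set(d)

# Indexing a missing key does insert it.
d['bla']
keys_after_indexing = set(d)
```

If the code that performs lookups is outside your control, the snapshot is the only part of this that still helps.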

Compare two large dictionaries and create lists of values for keys they have in common

I have two dictionaries like:
dict1 = { (1,2) : 2, (2,3): 3, (1,3): 3}
dict2 = { (1,2) : 1, (1,3): 2}
What I want as output is two lists of values for the items which exist in both dictionaries:
[2,3]
[1,2]
What I am doing right now is something like this:
list1 = []
list2 = []
for key in dict1.keys():
    if key in dict2.keys():
        list1.append(dict1.get(key))
        list2.append(dict2.get(key))
This code takes too long to run. Is there a more efficient way of doing it?
commons = set(dict1).intersection(set(dict2))
list1 = [dict1[k] for k in commons]
list2 = [dict2[k] for k in commons]
Don't use dict.keys. On Python 2.x, it creates a new list every time it is called (an O(N) operation), and list.__contains__ is another O(N) operation on average. Just rely on the fact that dictionaries are directly iterable containers with O(1) lookup:
list1 = []
list2 = []
for key in dict1:
    if key in dict2:
        list1.append(dict1.get(key))
        list2.append(dict2.get(key))
Note that on python2.7, you can use viewkeys to get the intersection directly:
>>> a = {'foo': 'bar', 'baz': 'qux'}
>>> b = {'foo': 'bar'}
>>> a.viewkeys() & b
set(['foo'])
(on python3.x, you can use keys here instead of viewkeys)
for key in dict1.viewkeys() & dict2:
    list1.append(dict1[key])
    list2.append(dict2[key])
You can use a list comprehension within zip() function:
>>> vals1, vals2 = zip(*[(dict1[k], v) for k, v in dict2.items() if k in dict1])
>>>
>>> vals1
(2, 3)
>>> vals2
(1, 2)
Or as a more functional approach using view object and operator.itemgetter() you can do:
>>> from operator import itemgetter
>>> intersect = dict1.viewkeys() & dict2.viewkeys()
>>> itemgetter(*intersect)(dict1)
(2, 3)
>>> itemgetter(*intersect)(dict2)
(1, 2)
Benchmark with accepted answer:
from timeit import timeit
inp1 = """
commons = set(dict1).intersection(set(dict2))
list1 = [dict1[k] for k in commons]
list2 = [dict2[k] for k in commons]
"""
inp2 = """
zip(*[(dict1[k], v) for k, v in dict2.items() if k in dict1])
"""
inp3 = """
intersect = dict1.viewkeys() & dict2.viewkeys()
itemgetter(*intersect)(dict1)
itemgetter(*intersect)(dict2)
"""
dict1 = {(1, 2): 2, (2, 3): 3, (1, 3): 3}
dict2 = {(1, 2): 1, (1, 3): 2}
print 'inp1 ->', timeit(stmt=inp1,
                        number=1000000,
                        setup="dict1 = {}; dict2 = {}".format(dict1, dict2))
print 'inp2 ->', timeit(stmt=inp2,
                        number=1000000,
                        setup="dict1 = {}; dict2 = {}".format(dict1, dict2))
print 'inp3 ->', timeit(stmt=inp3,
                        number=1000000,
                        setup="dict1 = {}; dict2 = {};from operator import itemgetter".format(dict1, dict2))
Output:
inp1 -> 0.000132083892822
inp2 -> 0.000128984451294
inp3 -> 0.000160932540894
For dictionaries of length 10000 with randomly generated items, over 100 loops:
inp1 -> 1.18336105347
inp2 -> 1.00519990921
inp3 -> 1.52266311646
Edit:
As @Davidmh mentioned in a comment, the second approach raises an exception when the dictionaries have no keys in common, because itemgetter() needs at least one argument. To avoid that, you can wrap the code in a try-except statement:
try:
    intersect = dict1.viewkeys() & dict2.viewkeys()
    vals1 = itemgetter(*intersect)(dict1)
    vals2 = itemgetter(*intersect)(dict2)
except TypeError:
    vals1 = vals2 = []
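Besides the empty case, itemgetter has a second edge case: with exactly one key it returns the bare value rather than a one-element tuple. A comprehension-based helper sidesteps both. This is a Python 3 sketch, and `common_values` is a hypothetical name introduced here for illustration.

```python
from operator import itemgetter


def common_values(d1, d2):
    """Return two aligned lists of values for keys present in both dicts."""
    common = d1.keys() & d2.keys()
    # Comprehensions behave the same for zero, one, or many shared keys.
    return [d1[k] for k in common], [d2[k] for k in common]


dict1 = {(1, 2): 2, (2, 3): 3, (1, 3): 3}
dict2 = {(1, 2): 1, (1, 3): 2}
list1, list2 = common_values(dict1, dict2)

# With a single key, itemgetter returns the value itself, not a tuple.
single = itemgetter(*[(1, 2)])(dict1)

# No shared keys: two empty lists, no exception.
empty1, empty2 = common_values({'a': 1}, {'b': 2})
```

The order of the values follows the iteration order of the intersection set, so it is not guaranteed, but both lists stay aligned with each other.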
This should be done with keys in Python 3 and viewkeys in Python 2. These are view objects that behave like sets and cost almost nothing to construct; they are just "views" of the underlying dictionary keys. This way you avoid constructing set objects.
common = dict1.viewkeys() & dict2.viewkeys()
list1 = [dict1[k] for k in common]
list2 = [dict2[k] for k in common]
dict_views objects can be intersected directly with dictionaries, thus the following code works as well. I would prefer the previous sample though.
common = dict1.viewkeys() & dict2
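To illustrate what "view" means in practice, the following Python 3 sketch shows that a key view tracks later changes to its dictionary, while the result of an intersection is an independent set. The variable names are illustrative only.

```python
dict1 = {'a': 1, 'b': 2}
dict2 = {'b': 20, 'c': 30}

keys_view = dict1.keys()                  # no copy of the keys is made here
common_before = keys_view & dict2.keys()  # result is a new, independent set

dict1['c'] = 3                            # the existing view sees the new key
common_after = keys_view & dict2.keys()
```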

Python for loop indexing

I am taking an example from Mark Lutz's book, Learning Python.
keys = ['spam','eggs','toast']
vals=[1,4,7]
D2={}
for (v,k) in zip(keys, vals): D2[k] = v
D2
{1: 'spam', 4: 'eggs', 7: 'toast'}
My example:
D1={}
for (k,v) in zip(keys, vals): D1[k] = v
D1
{'toast': 7, 'eggs': 4, 'spam': 1}
So I still do not understand the indexing: why is it for (v,k)?
It is unpacking the key and value from each tuple in the zipped list of keys and values, then assigning key/value pairs. The parens are unnecessary; for v, k in zip(keys, vals) will also work. The difference between your code and the book's is the order of v, k: you use the keys list as keys and the book does it in reverse.
You could also create the dict in one step by calling dict on the zipped items. If you reverse the order of the lists passed to zip, you get exactly the same behaviour as the book's code:
D2 = dict(zip(keys, vals))
print(D2)
D2 = dict(zip(vals, keys))
print(D2)
{'toast': 7, 'eggs': 4, 'spam': 1}
{1: 'spam', 4: 'eggs', 7: 'toast'}
The only difference is the order. The fact that the lists are named keys and vals is probably a bit confusing, because the values end up being keys and vice versa, but the main thing to understand is that your loop assigns k to each element from the keys list and the book's code does the opposite.
In Python 2, zip returns a list of tuples (in Python 3 it returns an iterator):
Demo:
>>> keys = ['spam','eggs','toast']
>>> vals=[1,4,7]
>>> zip(keys, vals)
[('spam', 1), ('eggs', 4), ('toast', 7)]
Unpacking
Demo:
>>> t = (1,2,3)
>>> t
(1, 2, 3)
>>> a,b,c = t
>>> a
1
>>> b
2
>>> c
3
>>> a,b = t
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: too many values to unpack
>>>
We iterate over the list of tuples, so each iteration unpacks the first item of the tuple into v and the second into k, then creates a new key/value pair in the D2 dictionary.
Code:
>>> D2={}
>>> for (v,k) in zip(keys, vals):
...     print "v:", v
...     print "k", k
...     D2[k] = v
...     #  ^    ^
...     # Key  Value
...
v: spam
k 1
v: eggs
k 4
v: toast
k 7
>>> D2
{1: 'spam', 4: 'eggs', 7: 'toast'}
>>>
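The demos above use Python 2 syntax. As a rough Python 3 equivalent, the sketch below shows both loop orders side by side; the only assumption is that `list()` is used to materialize zip, which is lazy in Python 3.

```python
keys = ['spam', 'eggs', 'toast']
vals = [1, 4, 7]

pairs = list(zip(keys, vals))     # list() materializes the lazy zip object

# The first name in the loop target receives the first element of each tuple.
D1 = {}
for k, v in zip(keys, vals):
    D1[k] = v                     # elements of `keys` become the dictionary keys

D2 = {}
for v, k in zip(keys, vals):
    D2[k] = v                     # here elements of `vals` become the keys
```

The names k and v carry no meaning to Python; only their position in the loop target decides which tuple element each one receives.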

Differentiating two dictionaries key sets to generate result dictionary

I have,
dict1={'a':1, 'b':2, 'c':3}
dict2={'a':3, 'c':7}
I want to find out what keys I have in dict1 that I don't have in dict2. So I do
diff_as_set = set(dict1.keys()) - set (dict2.keys())
This gives me: b
However, I want a dictionary containing all the key/value mappings from dict1 for the keys that are not in dict2, so I then do:
diff_as_dict = {k:v for k,v in dict1 if k in diff_as_set}
I get:
diff_as_dict = {k:v for k, v in dict1 if k in diff_as_set}
ValueError: too many values to unpack (expected 2)
Any ideas?
Looping over a dict only yields its keys. You need to use:
diff_as_dict = {k:v for k, v in dict1.iteritems() if k in diff_as_set}
                                      ^^^^^^^^^^^
Or use .items() on Python 3.x.
Instead of going through the entire dict to pick out the ones that match your set, just iterate the set.
diff_as_dict = {k:dict1[k] for k in diff_as_set}
Example:
>>> dict1={'a':1, 'b':2, 'c':3}
>>> dict2={'a':3, 'c':7}
>>> diff_as_set = set(dict1.keys()) - set (dict2.keys())
>>> diff_as_set
set(['b'])
>>> diff_as_dict = {k:dict1[k] for k in diff_as_set}
>>> diff_as_dict
{'b': 2}
You're missing the .iteritems() part:
dict1 = {'a':1, 'b':2, 'c':3}
dict2 = {'a':3, 'c':7}
newdict = {k : v for k,v in dict1.iteritems() if k not in dict2}
After this, newdict is equal to {'b': 2}. This does everything in one go.
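On Python 3, where .iteritems() no longer exists, the same result can be written with key views. A minimal sketch using the example data from the question:

```python
dict1 = {'a': 1, 'b': 2, 'c': 3}
dict2 = {'a': 3, 'c': 7}

# Key views subtract like sets, so no explicit set() calls are needed.
diff_as_dict = {k: dict1[k] for k in dict1.keys() - dict2.keys()}

# Equivalent single pass over dict1; this form preserves dict1's key order.
diff_in_order = {k: v for k, v in dict1.items() if k not in dict2}
```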

Update dict without adding new keys?

What is a good 1-liner (other than putting this into a function) to achieve this:
# like dict1.update(dict2) but does not add any keys from dict2
# that are not already in dict1
for k in dict1:
    if k in dict2:
        dict1[k]=dict2[k]
I guess that this would work, but uses a "protected" function:
[dict1.__setitem__(k, dict2[k]) for k in dict1 if k in dict2]
dict1.update((k, dict2[k]) for k in set(dict2).intersection(dict1))
is how I'd do it in Python 2.6 or below (see further on how to do this in later versions).
Besides another mapping, dict.update() can also take an iterable of (key, value) tuples, which we generate from the set intersection of the two dictionaries (that is, all the keys they have in common).
Demo:
>>> dict1 = {'foo':'bar', 'ham': 'eggs'}
>>> dict2 = {'ham': 'spam', 'bar': 'baz'}
>>> dict1.update((k, dict2[k]) for k in set(dict2).intersection(dict1))
>>> dict1
{'foo': 'bar', 'ham': 'spam'}
In Python 2.7 you can use the new dict views to achieve the same without casting to sets:
dict1.update((k, dict2[k]) for k in dict1.viewkeys() & dict2.viewkeys())
In Python 3, dict views are the default, so you can instead spell this as:
dict1.update((k, dict2[k]) for k in dict1.keys() & dict2.keys())
Non-destructively:
dict((k, dict2.get(k, v)) for k, v in dict1.items())
Modifying dict1:
dict1.update((k, v) for k, v in dict2.items() if k in dict1)
Using map/zip
As two lines for readability:
match_keys = dict1.keys() & dict2.keys()
dict1.update(zip(match_keys, map(dict2.get, match_keys)))
Or as a one-liner:
dict1.update(zip(dict1.keys() & dict2.keys(), map(dict2.get, dict1.keys() & dict2.keys())))
Non-destructively:
new_dict = {**dict1, **dict(zip(dict1.keys() & dict2.keys(), map(dict2.get, dict1.keys() & dict2.keys())))}
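Although the question asks for a one-liner, wrapping the update in a small function makes the behaviour easy to check. This is a Python 3 sketch; `update_existing` is a hypothetical helper name, not something from the answers above.

```python
def update_existing(target, source):
    """Update target in place from source, only for keys target already has."""
    # The intersection is computed once, before any value is overwritten.
    target.update((k, source[k]) for k in target.keys() & source.keys())
    return target


dict1 = {'foo': 'bar', 'ham': 'eggs'}
dict2 = {'ham': 'spam', 'bar': 'baz'}
update_existing(dict1, dict2)

# Non-destructive variant: build a new dict and leave the inputs untouched.
original = {'foo': 'bar', 'ham': 'eggs'}
merged = {k: dict2.get(k, v) for k, v in original.items()}
```

Unlike passing the pairs as keyword arguments, this works for keys of any hashable type, not only strings.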
