Efficiently filtering a dictionary in-place - python

We have a dictionary d1 and a condition cond. We want d1 to contain only the values that satisfy the condition cond. One way to do it is:
d1 = {k:v for k,v in d1.items() if cond(v)}
However, this creates a new dictionary, which may be very memory-inefficient if d1 is large.
Another option is:
for k,v in d1.items():
    if not cond(v):
        d1.pop(k)
However, this modifies the dictionary while it is being iterated over, and raises an error: "RuntimeError: dictionary changed size during iteration".
What is the correct way in Python 3 to filter a dictionary in-place?

If only a few keys have values that fail the condition, you can first collect those keys and then prune the dictionary:
for k in [k for k,v in d1.items() if not cond(v)]:
    del d1[k]
If the list [k for k,v in d1.items() if not cond(v)] would be too large, you can process the dictionary in chunks: collect keys until their count reaches a threshold, prune the dictionary, and repeat until no remaining key fails the condition:
from itertools import islice
def prune(d, cond, chunk_size=1000):
    change = True
    while change:
        change = False
        # Collect at most chunk_size keys whose values fail the condition.
        keys = list(islice((k for k,v in d.items() if not cond(v)), chunk_size))
        for k in keys:
            change = True
            del d[k]
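When a list of all keys is affordable but a full copy of the dictionary is not, a common alternative is to snapshot only the keys and delete while walking that snapshot. A minimal sketch, assuming Python 3; the name filter_in_place and the sample data are hypothetical, not part of the answer above:

```python
def filter_in_place(d, cond):
    """Remove from d every item whose value fails cond; d itself is mutated."""
    # list(d) copies only the keys, so deleting from d during the loop is safe.
    for k in list(d):
        if not cond(d[k]):
            del d[k]


d1 = {"a": 1, "b": -2, "c": 3, "d": -4}
alias = d1                      # second reference to the same dict object
filter_in_place(d1, lambda v: v > 0)
print(d1)                       # {'a': 1, 'c': 3}
```

The key snapshot costs memory proportional to the number of keys, while the chunked prune bounds the extra memory by chunk_size at the price of rescanning the dictionary once per chunk.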

Related

Corroborating 2 large Python dictionaries

Say I have 2 dictionaries, each with around 100000 entries (each can be of different length):
dict1 = {"a": ["w", "x"], "b":["y"], "c":["z"] ...}
dict2 = {"x": ["a", "b"], "y":["b", "d"], "z":["d"] ...}
I need to perform an operation using these two dictionaries:
Treat each dict item as a set of mappings (i.e. the list of all mappings in dict1 would be "a"->"w", "a"->"x", "b"->"y" and "c"->"z").
Only keep mappings in dict1 if the reverse mapping exists in dict2.
The resulting dictionary would be:
{"a": ["x"], "b": ["y"]}
My current solution uses 2 m*n all zeros dataframes where m and n are the lengths of dict1 and dict2 respectively and the index labels are the keys in dict1 and the column labels are the keys in dict2.
For the first dataframe, I insert a 1 at each value where the index label -> column label represent a mapping in dict1. For the second dataframe, I insert a 1 at each value where the column label -> index label represent a mapping in dict2.
I then perform an element-wise product between the two dataframes, which only leaves values that have a mapping "a1"->"x1" in dict1 and "x1"->"a1" in dict2.
However, this takes up way too much memory and is very expensive. Is there an alternative algorithm I can use?
How about using the same idea, but replacing the sparse matrix with a set of key pairs? Something like:
import collections
def fn(dict1, dict2):
    mapping_set = set()
    for k, vv in dict2.items():
        for v in vv:
            mapping_set.add((k, v))
    result_dict = collections.defaultdict(list)
    for k, vv in dict1.items():
        for v in vv:
            if (v, k) in mapping_set: # Note reverse order of k and v
                result_dict[k].append(v)
    return result_dict
Update: It will use O(total number of values in dict2) memory and O(total number of values in dict1) + O(total number of values in dict2) time; both are linear. It's not possible to solve the problem asymptotically faster, as every value in every dict has to be visited at least once.
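The set-of-pairs idea can be checked against the sample data from the question. The sketch below is a self-contained Python 3 variant; the name keep_reciprocal is hypothetical, and it returns a plain dict rather than a defaultdict:

```python
from collections import defaultdict


def keep_reciprocal(dict1, dict2):
    """Keep k -> v from dict1 only when dict2 contains v -> k."""
    # Every (key, value) pair of dict2, stored for O(1) average membership tests.
    pairs = {(k, v) for k, vv in dict2.items() for v in vv}
    result = defaultdict(list)
    for k, vv in dict1.items():
        for v in vv:
            if (v, k) in pairs:   # reversed order: look for v -> k in dict2
                result[k].append(v)
    return dict(result)


dict1 = {"a": ["w", "x"], "b": ["y"], "c": ["z"]}
dict2 = {"x": ["a", "b"], "y": ["b", "d"], "z": ["d"]}
print(keep_reciprocal(dict1, dict2))   # {'a': ['x'], 'b': ['y']}
```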
Given that you have Python objects to begin with, you may want to stay in the Python domain. If you need to iterate through the entire dict to create your matrix anyway, you may find that filtering the dictionary directly doesn't take much longer.
default = ()
# v is hashable here; dict1[k] itself is a list and cannot be used as a dict2 key.
result = {k: [v for v in vv if k in dict2.get(v, default)] for k, vv in dict1.items()}
result = {k: vv for k, vv in result.items() if vv}  # drop keys left with no mappings
If your lists are short, this will be totally fine. If they contain many elements, the linear search through each list will become slow compared with a set lookup. In that case, you may want to preprocess dict2 to contain sets rather than lists:
dict2 = {k: set(v) for k, v in dict2.items()}
or in-place
for k, v in dict2.items():
    dict2[k] = set(v)
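Putting the set preprocessing together with an in-place filter of dict1 might look like the following sketch. It assumes Python 3 and the sample data from the question; the names lookup, empty and kept are illustrative only:

```python
dict1 = {"a": ["w", "x"], "b": ["y"], "c": ["z"]}
dict2 = {"x": ["a", "b"], "y": ["b", "d"], "z": ["d"]}

# Preprocess dict2 so each membership test is a set lookup.
lookup = {k: set(v) for k, v in dict2.items()}
empty = frozenset()

# Iterate over a snapshot of the keys so entries can be deleted from dict1.
for k in list(dict1):
    kept = [v for v in dict1[k] if k in lookup.get(v, empty)]
    if kept:
        dict1[k] = kept
    else:
        del dict1[k]

print(dict1)   # {'a': ['x'], 'b': ['y']}
```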

Comparing dictionaries on shared keys only

I have two dictionaries with some shared keys, and some different ones. (Each dictionary has some keys not present in the other). What's a nice way to compare the two dictionaries for equality, as if only the shared keys were present?
In other words, I want the simplest way to calculate the following:
commonkeys = set(dict1).intersection(dict2)
simple1 = dict((k, v) for k,v in dict1.items() if k in commonkeys)
simple2 = dict((k, v) for k,v in dict2.items() if k in commonkeys)
return simple1 == simple2
I've managed to simplify it to this:
commonkeys = set(dict1).intersection(dict2)
return all(dict1[key] == dict2[key] for key in commonkeys)
But I'm hoping for an approach that doesn't require precalculation of the common keys. (In reality I have two lists of dictionaries that I'll be comparing pairwise. All dictionaries in each list have the same set of keys, so if a computation like commonkeys above is necessary, it would only need to be done once.)
What about the following?
return all(dict2[key] == val for key, val in dict1.iteritems() if key in dict2)
Or even shorter (although it possibly involves a few more comparisons):
return all(dict2.get(key, val) == val for key, val in dict1.iteritems())
Try this
dict((k, dict1[k]) for k in dict1.keys() + dict2.keys() if dict1.get(k) == dict2.get(k))
O(m + n) comparisons.
If you want a true/false result, add a simple check on the above result: if it is not empty, return True.
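In Python 3, dictionary key views support set operations, so the shared keys can be obtained without building an explicit set first. A small sketch under that assumption; the function name equal_on_shared_keys and the sample dictionaries are hypothetical:

```python
def equal_on_shared_keys(dict1, dict2):
    """True if dict1 and dict2 agree on every key they have in common."""
    # In Python 3, dict.keys() returns a set-like view, so & gives the shared keys.
    return all(dict1[k] == dict2[k] for k in dict1.keys() & dict2.keys())


a = {"x": 1, "y": 2, "only_a": 0}
b = {"x": 1, "y": 2, "only_b": 9}
c = {"x": 1, "y": 3}
print(equal_on_shared_keys(a, b))   # True
print(equal_on_shared_keys(a, c))   # False
```

Note that two dictionaries with no keys in common compare as equal here, because all() of an empty sequence is True.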

Adding 700 dictionaries by value : too many values to unpack

I have a list of 2 lists, each with 700 dictionaries.
Each dictionary has a word count, and I want to combine them, such that values of same keys will be added.
I tried doing :
combine_dicts = collections.defaultdict(int)
for k, v in itertools.chain(x.iteritems() for x in tuple(dicts[0])):
    combine_dicts[k] += v
dicts[0] and dicts[1] are 2 lists of dictionaries.
But it throws the following error:
ValueError: too many values to unpack.
Is there any better way of doing this?
You misused chain; you wanted chain.from_iterable to chain the iterable outputs of your generator expression, not just wrap the generator expression as a no-op:
for k, v in itertools.chain.from_iterable(x.iteritems() for x in dicts[0]):
    combine_dicts[k] += v
That only gets the first list of dicts though; to get both, we need MOAR CHAINING!:
# Qualifying chain over and over is a pain
from itertools import chain
for k, v in chain.from_iterable(x.iteritems() for x in chain(*dicts)):
    combine_dicts[k] += v
Alternatively, plain nested loops do the same job:
from collections import defaultdict
combine_dicts = defaultdict(int)
for i in range(0, 2):
    for d in dicts[i]:
        for k, v in d.iteritems():
            combine_dicts[k] += v
This iterates each dictionary once so memory usage should be efficient.
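For word counts specifically, collections.Counter can do the per-key addition. The following is a sketch in Python 3, not taken from the answers above; the small dicts list is made-up sample data standing in for the two lists of 700 dictionaries:

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for the two lists of word-count dictionaries.
dicts = [
    [{"spam": 2, "eggs": 1}, {"spam": 1}],
    [{"eggs": 4}, {"ham": 3, "spam": 1}],
]

combined = Counter()
for d in chain.from_iterable(dicts):   # flatten the list of lists of dicts
    combined.update(d)                 # Counter.update adds counts per key

print(dict(combined))   # {'spam': 4, 'eggs': 5, 'ham': 3}
```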

Returning unique elements from values in a dictionary

I have a dictionary like this :
d = {'v03':["elem_A","elem_B","elem_C"],'v02':["elem_A","elem_D","elem_C"],'v01':["elem_A","elem_E"]}
How would you return a new dictionary with the elements that are not contained in the value of the highest key?
In this case :
d2 = {'v02':['elem_D'],'v01':["elem_E"]}
Thank you,
I prefer to do differences with the builtin data type designed for it: sets.
It is also preferable to write loops rather than elaborate comprehensions. One-liners are clever, but understandable code that you can return to and understand is even better.
d = {'v03':["elem_A","elem_B","elem_C"],'v02':["elem_A","elem_D","elem_C"],'v01':["elem_A","elem_E"]}
last = None
d2 = {}
for key in sorted(d.keys()):
    if last:
        if set(d[last]) - set(d[key]):
            d2[last] = sorted(set(d[last]) - set(d[key]))
    last = key
print d2
{'v01': ['elem_E'], 'v02': ['elem_D']}
from collections import defaultdict
myNewDict = defaultdict(list)
all_keys = d.keys()
all_keys.sort()
max_value = all_keys[-1]
for key in d:
    if key != max_value:
        for value in d[key]:
            if value not in d[max_value]:
                myNewDict[key].append(value)
You can get fancier with set operations by taking the set difference between the values in d[max_value] and each of the other keys but first I think you should get comfortable working with dictionaries and lists.
defaultdict(<type 'list'>, {'v01': ['elem_E'], 'v02': ['elem_D']})
One reason not to use sets is that the solution does not generalize well, because sets can only hold hashable objects. If your values are lists of lists, the members (sublists) are not hashable, so you can't use set operations.
Depending on your python version, you may be able to get this done with only one line, using dict comprehension:
>>> d2 = {k:[v for v in values if not v in d.get(max(d.keys()))] for k, values in d.items()}
>>> d2
{'v01': ['elem_E'], 'v02': ['elem_D'], 'v03': []}
This builds a copy of dict d in which every list has been stripped of all items stored at the max key. The resulting dict looks more or less like what you are going for.
If you don't want the empty list at key v03, filter the result with another dict comprehension:
>>> {k:v for k,v in d2.items() if len(v) > 0}
{'v01': ['elem_E'], 'v02': ['elem_D']}
EDIT:
In case your original dict has a very large key set (or the operation is required frequently), you might also want to replace the expression d.get(max(d.keys())) with a previously assigned list variable for performance, since the expression is otherwise re-evaluated for every element. This roughly halves the run time: the following runs 100,000 times in 1.5 seconds on my machine, whereas the unsubstituted expression takes more than 3 seconds.
>>> bl = d.get(max(d.keys()))
>>> d2 = {k:v for k,v in {k:[v for v in values if not v in bl] for k, values in d.items()}.items() if len(v) > 0}
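The same precomputation can be written as an explicit loop, using a set for the membership tests. This is a sketch in Python 3 that assumes the list elements are hashable (as the strings in the question are); the names top, exclude and kept are illustrative:

```python
d = {
    "v03": ["elem_A", "elem_B", "elem_C"],
    "v02": ["elem_A", "elem_D", "elem_C"],
    "v01": ["elem_A", "elem_E"],
}

top = max(d)                  # highest key, compared as strings
exclude = set(d[top])         # computed once; assumes hashable elements

d2 = {}
for key, values in d.items():
    if key == top:
        continue
    kept = [v for v in values if v not in exclude]
    if kept:                  # skip keys whose list would be empty
        d2[key] = kept

print(d2)   # {'v02': ['elem_D'], 'v01': ['elem_E']}
```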

Inverting Dictionaries in Python

I want to know which would be an efficient method to invert dictionaries in python. I also want to get rid of duplicate values by comparing the keys and choosing the larger over the smaller assuming they can be compared. Here is inverting a dictionary:
inverted = dict([[v,k] for k,v in d.items()])
To remove duplicates by keeping the largest key, sort the dictionary items by key. The call to dict will use the last key inserted for each value:
import operator
inverted = dict((v,k) for k,v in sorted(d.iteritems(), key=operator.itemgetter(0)))
Here is a simple and direct implementation of inverting a dictionary and keeping the larger key for any duplicate value:
inverted = {}
for k, v in d.iteritems():
    if v in inverted:
        inverted[v] = max(inverted[v], k)
    else:
        inverted[v] = k
This can be tightened up a bit with dict.get():
inverted = {}
for k, v in d.iteritems():
    inverted[v] = max(inverted.get(v, k), k)
This code makes fewer comparisons and uses less memory than an approach using sorted().
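A Python 3 version of the single-pass approach, wrapped in a function so it can be checked on sample data. The name invert_keep_largest and the dictionary below are hypothetical, and the keys are assumed to be mutually comparable:

```python
def invert_keep_largest(d):
    """Invert d; when several keys share a value, keep the largest key."""
    inverted = {}
    for k, v in d.items():
        # First sighting of v stores k; later sightings keep the larger key.
        if v not in inverted or k > inverted[v]:
            inverted[v] = k
    return inverted


d = {1: "a", 5: "a", 3: "b", 2: "b", 4: "c"}
print(invert_keep_largest(d))   # {'a': 5, 'b': 3, 'c': 4}
```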
