Python: Tool to compare pairs of dicts of varying depth?

I have a couple of pairs of rather big dicts. Within each pair the structure is exactly the same, but the values differ; the pairs differ from one another in how deeply nested they are.
To clarify:
dict_a has same structure as dict_b
dict_c has same structure as dict_d (but is different from dict_a and dict_b)
etc.
Is there a tool out there that makes it easy to implement a function to compare the values only, and/or do some basic arithmetic on them? My dicts can be quite nested, so a simple for k, v in dict_x.iteritems(): ... won't do.

Sounds like a problem for...recursive functions!
Basically, if I understand your question, you have a deep dictionary with varying levels of depths at unspecified keys. You'd like to compare the values of dict_a to dict_b but don't care much about the keys: just the differences in values. Here's an idea using a recursive function to print out each set of values that doesn't match.
def dict_compare(da, db):
    for k, v in da.iteritems():
        if isinstance(v, dict):       # if the value is another dict:
            dict_compare(v, db[k])    # enter the comparison function again!
        else:
            if v != db[k]:
                print 'values not equal at', k
Then you can just call
dict_compare(dict_a, dict_b)
The magic being that if the value of a given key is in fact another dictionary, just call your comparison function again.
Obviously, if you want to do something more complicated than just printing the key whose values don't match, modify what happens after the if v != db[k] line.
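The same recursive pattern covers the arithmetic half of the question. Here is a minimal sketch, a hypothetical dict_subtract that assumes both dicts share the same structure and have numeric leaves (.items() is used so it runs on both Python 2 and 3):
def dict_subtract(da, db):
    # Return a new dict with da's structure whose leaves are da - db.
    result = {}
    for k, v in da.items():
        if isinstance(v, dict):
            result[k] = dict_subtract(v, db[k])  # recurse into nested dicts
        else:
            result[k] = v - db[k]                # numeric leaf: subtract
    return result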

Related

Python: Create sorted list of keys moving one key to the head

Is there a more pythonic way of obtaining a sorted list of dictionary keys with one key moved to the head? So far I have this:
# create a unique list of keys headed by 'event' and followed by a sorted list.
# dfs is a dict of dataframes.
for k in dict.fromkeys(['event'] + sorted(dfs)):
    display(k, dfs[k])  # ideally this should be (k, v)
I suppose you would be able to do
for k, v in [('event', dfs['event'])] + sorted(kv for kv in dfs.items() if kv[0] != 'event'):
    display(k, v)
.items() gives the dictionary's (key, value) pairs as a dict_items view, which is why it has to be materialized (here via sorted, or with an explicit list()) before concatenating it with another list of tuples. Iterating through a list of tuples allows automatic unpacking (so you can do k, v in list instead of tup in list).
What we would really like is a lazy iterator, but that's not possible with sorted, because it must see all the keys before it knows which item comes first.
Using dict.fromkeys to create a blank dictionary that deduplicates by insertion order was pretty clever, but it relies on dicts preserving insertion order, which was an implementation detail in CPython 3.6 and only became a language guarantee in Python 3.7. I admit, it took me a while to figure out that line.
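For illustration, with a toy input (the real dfs holds DataFrames), the deduplication works like this:
keys = dict.fromkeys(['event'] + sorted(['b', 'event', 'a']))
print(list(keys))  # ['event', 'a', 'b'] -- 'event' first, duplicate dropped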
Since the code you posted is just working with the keys, I suggest you focus on that. Taking up a few more lines for readability is a good thing, especially if we can hide it in a testable function:
def display_by_keys(dfs, priority_items=None):
    if not priority_items:
        priority_items = ['event']
    featured = [k for k in priority_items if k in dfs]   # a list, so priority order is kept
    others = [k for k in dfs.keys() if k not in featured]
    for key in featured + sorted(others):
        display(key, dfs[key])
The potential downside is you must sort the keys every time. If you do this much more often than the data store changes, on a large data set, that's a potential concern.
Of course you wouldn't be displaying a really large result, but if it becomes a problem, then you'll want to store them in a collections.OrderedDict (https://stackoverflow.com/a/13062357/1766544) or find a sorteddict module.
from collections import OrderedDict

# sort once
ordered_dfs = OrderedDict.fromkeys(sorted(dfs.keys()))
ordered_dfs.move_to_end('event', last=False)
ordered_dfs.update(dfs)

# display as often as you need
for k, v in ordered_dfs.items():
    print(k, v)
If you display different fields first in different views, that's not a problem. Just sort all the fields normally, and use a function like the one above, without the sort.
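As a rough sketch of that idea (a hypothetical display_by_keys_presorted helper, assuming dfs was already built in sorted key order, e.g. via the OrderedDict approach above, so no per-call sort is needed):
def display_by_keys_presorted(dfs, priority_items=('event',)):
    featured = [k for k in priority_items if k in dfs]
    others = [k for k in dfs if k not in featured]  # insertion order is already sorted
    for key in featured + others:
        display(key, dfs[key])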

Python Dictionaries: making the key:value the key and appending new values

EDIT: My question has been getting a lot of follow up questions because on the surface, it doesn't appear to make any sense. For most people, dictionaries are an illogical way to solve this problem. I agree, and have been frustrated by my constraints (explained in the comments). In my scenario, the original KV pairs are going to be encoded as data to be read by another server using the ObjectID. This, however, must be fed into an encoding function as a dictionary. The order does not matter, but the KV pairs must be given a new unique value. The original KV pairs will end up as a new string key in this new dictionary with the ObjectID as a new unique value.
Keep in mind that I am using Python 2.7.
The Issue
Note that this is a matter of presenting a dictionary (dictA), encoded by the ObjectID values, within the constraints of what I have been given
I have a dictionary, say dictA = {'a':'10', 'b':'20', 'c':'30'}, and I have a list of ObjectIdentifier('n'), where n is a number. What is the best way to create dictB so that dictB is a new dictionary whose keys are dictA's key:value pairs and whose values are the corresponding ObjectIdentifier('n') from the list?
The new dictB should be:
{"'a':'10'":ObjectIdentifier('n'), "'b':'20'":ObjectIdentifier('n+1'), "'c':'30'":ObjectIdentifier('n+2')}
If that makes any sense.
The problem is that dictionaries aren't ordered. So you say
dictA = {'a':'10', 'b':'20', 'c':'30'}
but as far as python knows it could be
dictA = {'c':'30', 'a':'10', 'b':'20'}
Because dictionaries don't have order.
You could create your dict like this:
result = {item: ObjectIdentifier(n + pos) for pos, item in enumerate(dictA.items())}  # each item is a ('a', '10')-style tuple
But there is no way to determine which key will fall in which position, because, as I said, dictionaries don't have order.
If you want alphabetic order, just use sorted()
result = {item: ObjectIdentifier(n + pos)
          for pos, item in enumerate(sorted(dictA.items()))}
I don't know why you would want this
def ObjectIdentifier(n):
    print(n)
    return "ObjectIdentifier(" + str(n) + ")"

dictA = {'a':'10', 'b':'20', 'c':'30'}
dictB = {}
for n, key in enumerate(sorted(dictA.keys())):
    dictB[key] = {dictA[key]: ObjectIdentifier(str(n))}
Output:
{'a': {'10': 'ObjectIdentifier(0)'}, 'b': {'20': 'ObjectIdentifier(1)'}, 'c': {'30': 'ObjectIdentifier(2)'}}
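If the goal is literally the shape shown in the question, with each original pair flattened into a string key, a minimal Python 2.7 sketch (hypothetical; it assumes ObjectIdentifier and the starting index n come from the question's setup, and sorts the keys only to make the numbering deterministic):
dictB = {}
for pos, (k, v) in enumerate(sorted(dictA.items())):
    # "'a':'10'" -> ObjectIdentifier(n), "'b':'20'" -> ObjectIdentifier(n+1), ...
    dictB["'%s':'%s'" % (k, v)] = ObjectIdentifier(n + pos)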

In-place Python dictionary inverse

The task I wanted to see whether it is possible to solve is swapping the key/value pairs of a dictionary (in Python) in place, without additional data structures (only a constant number of extra variables). It seems rather impossible (in a finite world), but I'm open to hearing suggestions on solving it.
I've seen a few posts about in-place dictionary inverse in python, and I've found one common thing between all of the solutions.
The following dictionary won't be properly inverted:
d = {'b': 'a', 'a': 'c', 'c': 123}
The reason is that when swapping the first pair, we overwrite the value actually stored at 'a' (the keys are unique and the values are unique, but that doesn't mean no value coincides with an existing key).
NOTES:
1) The dictionary given as an example has hashable values.
2) The key/values can be of any data-type. Not necessarily strings.
I'd love to hear ways to solve it, I've thought of one but it only works if we have infinite memory, which obviously is not true.
EDIT:
1) My idea was to change the dictionary by adding a number of underscores ("_") to the beginning of each key entry. The number of underscores is determined by the existing keys: if some key already has X leading underscores, I'll add X+1 (max_underscores_of_key_in_prefix + 1).
To work around objects in the keys, I'll make a wrapper class for that.
I have tried my best explaining my intuition, but I am not sure this is practical.
2) @Mark Ransom's solution works perfectly, but if anyone has another algorithmic solution to the problem, I'd still love to hear it!
I mark this question as solved because it is solved, but again, other solutions are more than welcome :-)
Obviously for this to be possible, both keys and values must be hashable. This means that none of your keys or values can be a list. We can take advantage of this to know which dictionary elements have been already processed.
Since you can't iterate and modify a dictionary at the same time, we must start over every time we swap a key/value. That makes this very slow, an O(n^2) operation.
def invert_dict(d):
    # A list value marks an entry that has already been inverted:
    # [final] is finished; [pending, final] still has pending left to process.
    done = False
    while not done:
        done = True
        for key, val in d.items():
            if isinstance(val, list):
                if len(val) > 1:
                    d[key] = [val[1]]   # keep the final inverted value
                    val = val[0]        # and process the pending one below
            else:
                del d[key]              # unprocessed original entry: remove it
            if not isinstance(val, list):
                if val in d:
                    d[val] = [d[val], key]  # stash the old value for later
                else:
                    d[val] = [key]
                done = False
                break                   # d was mutated, so restart iteration
    for key, val in d.items():
        d[key] = val[0]                 # unwrap the [final] markers
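A quick usage sketch with the example dictionary from the question (hypothetical driver; the printed result is the expected inverse):
d = {'b': 'a', 'a': 'c', 'c': 123}
invert_dict(d)
print(d)  # {'a': 'b', 'c': 'a', 123: 'c'}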

Set value for existing key in nested dictionary without iterating through upper levels?

I have a very large nested dictionary of the form and example:
keyDict = {f: {t: {c_1: None, c_2: None, c_3: None, ..., c_n: None}}}
And another dictionary with keys and values:
valDict = {c_1: 13.37, c_2: -42.00, c_3: 0.00, ... c_n: -0.69}
I want to use the valDict to assign the values to the lowest level of the keyDict as fast as possible.
My current implementation is very slow, I think because I iterate through the two upper levels [f][t] of the keyDict. There must be a way to set the values at the lowest level without concern for the upper levels, because the value of [c] does not depend on the values of [f][t].
My current SLOW implementation:
for f in keyDict:
    for t in keyDict[f]:
        for c in keyDict[f][t]:
            keyDict[f][t][c] = valDict[c]
Still looking for a solution. [c] only has a few thousand keys, but [f][t] can have millions, so the way I do it, the same value assignment happens millions of times when it should be possible to go through the bottom level once and assign the value, which does NOT depend on f or t but ONLY on c.
To clarify the example, per Alexis's request: the c dictionaries don't necessarily all have the same keys, but they DO have the same value for a given key. To keep things simple, let's say there are only 3 possible keys for a c dict (c_1, c_2, c_3). One parent dictionary (e.g. f=1, t=1) may have just {c_2}, another parent dictionary (f=1, t=2) may have {c_2, c_3}, and yet another (e.g. f=999, t=999) might have all three {c_1, c_2, c_3}. Some parent dicts may have the same set of c's. What I am trying to do is assign to each c dict the value that is defined purely by the c key, not by f or t.
If the most nested dicts and valDict share exactly the same keys, it would be faster to use dict.update instead of looping over all the keys of the dict:
for dct in keyDict.values():
    for d in dct.values():
        d.update(valDict)
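Note that update also inserts any valDict keys that are missing from an inner dict. Since the question says the c dicts hold only subsets of the keys, here is a hedged variant that touches only the keys already present (mutating values while iterating is safe because the dict never changes size):
for dct in keyDict.values():
    for d in dct.values():
        for c in d:
            d[c] = valDict[c]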
Also, it is more elegant and probably faster to loop over the values of the outer dicts directly instead of iterating over the keys and then accessing each value through the current key.
So you have millions of "c" dictionaries that you need to keep synchronized. The dictionaries have different sets of keys (presumably for good reason, but I trust you realize that your update code puts the new values in all the dictionaries), but the non-None values must change in lockstep.
You haven't explained what this data structure is for, but judging from your description, you should have a single c dictionary, not millions of them.
After all, you only have one set of valid "c" values; maintaining multiple copies is not only a performance problem, it puts an incredible burden of consistency on your code. But obviously, updating a single dictionary will be hugely faster than updating millions of them.
Of course you also want to know which keys were contained in each dictionary: To do this, your tree of dictionaries should terminate with sets of keys, which you can use to look up values as necessary.
In case my description is not clear, here is how your structure would be transformed:
all_c = dict()
for f in keyDict:
    for t in keyDict[f]:
        all_c.update((k, v) for k, v in keyDict[f][t].items() if v is not None)
        keyDict[f][t] = set(keyDict[f][t].keys())
This code builds a combined dictionary all_c with the non-null values from each of your bottom-level "c" dictionaries, then replaces each of them with the set of its keys. If you later need the complete dictionary at keyDict[f][t] (rather than access to particular values), you can reconstruct it like this:
f_t_cdict = dict((k, all_c[k]) for k in keyDict[f][t])
But I'm pretty sure you can do whatever it is you are doing by working with the sets keyDict[f][t], and simply looking up values directly in the combined dictionary all_c.
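For instance, a direct lookup against the transformed structure might look like this (hypothetical snippet, assuming keyDict[f][t] is now a set of c-keys):
for c in keyDict[f][t]:
    value = all_c.get(c)  # None if that c-key never had a value assigned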

Python dict key delete if pattern match with other dict key

Delete a key from one Python dict if the key pattern-matches any key of another dict.
e.g.
a={'a.b.c.test':1, 'b.x.d.pqr':2, 'c.e.f.dummy':3, 'd.x.y.temp':4}
b={'a.b.c':1, 'b.p.q':20}
result
a={'b.x.d.pqr':2, 'c.e.f.dummy':3, 'd.x.y.temp':4}
If "pattern match with other dict key" means "starts with any key in the other dict", the most direct way to write that would be like this:
a = {k: v for (k, v) in a.items() if not any(k.startswith(k2) for k2 in b)}
If that's hard to follow at first glance, it's basically the equivalent of this:
def matches(key1, d2):
    for key2 in d2:
        if key1.startswith(key2):
            return True
    return False

c = {}
for key in a:
    if not matches(key, b):
        c[key] = a[key]
a = c
This is going to be slower than necessary. If a has N keys, and b has M keys, the time taken is O(NM). While you can check "does key k exist in dict b" in constant time, there's no way to check "does any key starting with k exist in dict b" without iterating over the whole dict. So, if b is potentially large, you probably want to search sorted(b.keys()) with a binary search, which gets the time down to O(N log M). But if this isn't a bottleneck, you may be better off sticking with the simple version, just because it's simple.
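Here is a sketch of one fast alternative (a hypothetical has_prefix_in_b helper; it assumes the keys in b always align with dot-separated segments of the keys in a, as in the example): test each dotted prefix of a key against a set of b's keys, which costs O(number of segments) per key instead of O(M):
b_keys = set(b)

def has_prefix_in_b(k):
    parts = k.split('.')
    return any('.'.join(parts[:i]) in b_keys for i in range(1, len(parts) + 1))

a = {k: v for k, v in a.items() if not has_prefix_in_b(k)}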
Note that I'm generating a new a with the matches filtered out, rather than deleting the matches. This is almost always a better solution than deleting in-place, for multiple reasons:
* It's much easier to reason about. Treating objects as immutable and doing pure operations on them means you don't need to think about how states change over time. For example, the naive way to delete in place would run into the problem that you're changing the dictionary while iterating over it, which will raise an exception. Issues like that never come up without mutable operations.
* It's easier to read, and (once you get the hang of it) even to write.
* It's almost always faster. (One reason is that it takes a lot more memory allocations and deallocations to repeatedly modify a dictionary than to build one with a comprehension.)
The one tradeoff is memory usage. The delete-in-place implementation has to make a copy of all of the keys; the build-a-new-dict implementation has to have both the filtered dict and the original dict in memory. If you're keeping 99% of the values, and the values are much larger than the keys, this could hurt you. (On the other hand, if you're keeping 10% of the values, and the values are about the same size as the keys, you'll actually save space.) That's why it's "almost always" a better solution, rather than "always".
for key in list(a.keys()):
    if any(key.startswith(k) for k in b):
        del a[key]
Replace key.startswith(k) with an appropriate condition for "matching".
c = {}  # result in dict c
for key in b.keys():
    # the string of the key in b should not be a substring of any of the keys in a
    if all(key not in z for z in a.keys()):
        c[key] = b[key]
