How can I delete keys from a Python dict when the key matches a pattern formed by another dict's keys? For example:
a={'a.b.c.test':1, 'b.x.d.pqr':2, 'c.e.f.dummy':3, 'd.x.y.temp':4}
b={'a.b.c':1, 'b.p.q':20}
result:
a={'b.x.d.pqr':2, 'c.e.f.dummy':3, 'd.x.y.temp':4}
If "pattern match with other dict key" means "starts with any key in the other dict", the most direct way to write that would be like this:
a = {k: v for (k, v) in a.items() if not any(k.startswith(k2) for k2 in b)}
If that's hard to follow at first glance, it's basically the equivalent of this:
def matches(key1, d2):
    for key2 in d2:
        if key1.startswith(key2):
            return True
    return False

c = {}
for key in a:
    if not matches(key, b):
        c[key] = a[key]
a = c
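With the dicts from the question, either version should leave a == {'b.x.d.pqr': 2, 'c.e.f.dummy': 3, 'd.x.y.temp': 4}.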
This is going to be slower than necessary. If a has N keys, and b has M keys, the time taken is O(NM). While you can check "does key k exist in dict b" in constant time, there's no way to check "does any key starting with k exist in dict b" without iterating over the whole dict. So, if b is potentially large, you probably want to sort b's keys and write a binary search against sorted(b.keys()), which gets the time down to O(N log M) (plus a one-time O(M log M) sort). But if this isn't a bottleneck, you may be better off sticking with the simple version, just because it's simple.
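For reference, here is roughly what that binary-search version could look like. It's only a sketch using the standard bisect module; the function and helper names are mine, and it assumes "match" means "starts with some key of b":

from bisect import bisect_right

def filter_prefix_matches(a, b):
    # keep only prefixes that aren't already covered by a shorter prefix,
    # so one probe against the largest prefix <= key is sufficient
    prefixes = []
    for p in sorted(b):
        if not prefixes or not p.startswith(prefixes[-1]):
            prefixes.append(p)

    def has_prefix(key):
        i = bisect_right(prefixes, key)  # prefixes[i-1] is the largest prefix <= key
        return i > 0 and key.startswith(prefixes[i - 1])

    return {k: v for k, v in a.items() if not has_prefix(k)}

a = filter_prefix_matches(a, b)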
Note that I'm generating a new a with the matches filtered out, rather than deleting the matches. This is almost always a better solution than deleting in-place, for multiple reasons:
* It's much easier to reason about. Treating objects as immutable and doing pure operations on them means you don't need to think about how states change over time. For example, the naive way to delete in place would run into the problem that you're changing the dictionary while iterating over it, which will raise an exception. Issues like that never come up without mutable operations.
* It's easier to read, and (once you get the hang of it) even to write.
* It's almost always faster. (One reason is that it takes a lot more memory allocations and deallocations to repeatedly modify a dictionary than to build one with a comprehension.)
The one tradeoff is memory usage. The delete-in-place implementation has to make a copy of all of the keys; the build-a-new-dict implementation has to have both the filtered dict and the original dict in memory. If you're keeping 99% of the values, and the values are much larger than the keys, this could hurt you. (On the other hand, if you're keeping 10% of the values, and the values are about the same size as the keys, you'll actually save space.) That's why it's "almost always" a better solution, rather than "always".
for key in list(a.keys()):
    if any(key.startswith(k) for k in b):
        del a[key]
Replace key.startswith(k) with an appropriate condition for "matching".
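For instance, if "matching" meant shell-style wildcard patterns rather than simple prefixes, the condition could use the standard fnmatch module (this is only an illustration; the '.*' suffix is my assumption about what the patterns look like):

from fnmatch import fnmatch

for key in list(a.keys()):
    if any(fnmatch(key, pattern + '.*') for pattern in b):
        del a[key]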
c = {}  # result in dict c
for key in a.keys():
    # keep a key of a only if no key of b occurs as a substring of it
    if all(key.count(k2) == 0 for k2 in b.keys()):
        c[key] = a[key]
I am looking for an efficient python method to utilise a hash table that has two keys:
E.g.:
(1,5) --> {a}
(2,3) --> {b,c}
(2,4) --> {d}
Further I need to be able to retrieve whole blocks of entries, for example all entries that have "2" at the 0-th position (here: (2,3) as well as (2,4)).
In another post it was suggested to use list comprehension, i.e.:
sum(val for key, val in dict.items() if key[0] == 'B')
I learned that dictionaries are (probably?) the most efficient way to retrieve a value from a collection of key:value pairs. However, querying with an incomplete tuple key is a bit different from querying with the whole key, where I either get a value or nothing. Can Python still return the values in time proportional to the number of key:value pairs that match? Or, alternatively, is the tuple dictionary (plus the comprehension) better than using pandas.df.groupby() (which would occupy rather a lot of memory)?
The "standard" way would be something like
from random import randint

d = {(randint(1, 10), i): "something" for i in range(200)}

def byfilter(n, d):
    return list(filter(lambda x: x[0] == n, d.keys()))

byfilter(5, d)  ## returns a list of tuples where x[0] == 5
In similar situations I've often used next() to iterate manually when I didn't need the full list.
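For example (my own illustration, not from the original answer), to grab just the first matching key:

first = next((k for k in d if k[0] == 5), None)  # None if nothing matches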
However, there are some use cases where we can optimize that. Suppose you need to do a couple of accesses (or more) by the first element of the key, and you know the dict keys are not changing in the meantime. Then you can extract the keys into a list, sort it, and make use of some itertools functions, namely dropwhile() and takewhile():
from itertools import dropwhile, takewhile

ls = [x for x in d.keys()]
ls.sort()  ## I do not know why but this seems faster than ls = sorted(d.keys())

def bysorted(n, ls):
    return list(takewhile(lambda x: x[0] == n, dropwhile(lambda x: x[0] != n, ls)))

bysorted(5, ls)  ## returns the same list as above
This can be up to 10x faster in the best case (i=1 in my example) and more or less take the same time in the worst case (i=10) because we are trimming the number of iterations needed.
Of course you can do the same for accessing keys by x[1], you just need to add a key parameter to the sort() call
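A sketch of that variant, following the same pattern (the function name is mine):

ls.sort(key=lambda x: x[1])  ## order by the second element instead

def bysorted_second(n, ls):
    return list(takewhile(lambda x: x[1] == n, dropwhile(lambda x: x[1] != n, ls)))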
Is there a more pythonic way of obtaining a sorted list of dictionary keys with one key moved to the head? So far I have this:
# create a unique list of keys headed by 'event' and followed by a sorted list.
# dfs is a dict of dataframes.
for k in (dict.fromkeys(['event']+sorted(dfs))):
    display(k, dfs[k]) # ideally this should be (k,v)
I suppose you would be able to do
for k, v in list(dfs.items()) + [('event', None)]:
.items() gives the dictionary as a sequence of tuples (technically a dict_items view, which is why I have to cast it to a list explicitly before appending), to which you can concatenate a second list. Iterating through a list of tuples allows automatic unpacking (so you can do k, v in list instead of tup in list).
What we really want is a lazy iterator, but that's not possible with sorted, because it must see all the keys before it knows what the first item should be.
Using dict.fromkeys to create a blank dictionary that remembers insertion order was pretty clever, but note that dicts preserving insertion order was only an implementation detail in CPython 3.6; it became a language guarantee in Python 3.7. I admit, it took me a while to figure out that line.
Since the code you posted is just working with the keys, I suggest you focus on that. Taking up a few more lines for readability is a good thing, especially if we can hide it in a testable function:
def display_by_keys(dfs, priority_items=None):
    if not priority_items:
        priority_items = ['event']
    featured = [k for k in priority_items if k in dfs]  # a list, to preserve priority order
    others = {k for k in dfs.keys() if k not in featured}
    for key in featured + sorted(others):
        display(key, dfs[key])
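For example (illustrative calls, assuming dfs and display as in the question):

display_by_keys(dfs)                                   # 'event' first, then the rest sorted
display_by_keys(dfs, priority_items=['event', 'run'])  # 'run' is a made-up second priority key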
The potential downside is you must sort the keys every time. If you do this much more often than the data store changes, on a large data set, that's a potential concern.
Of course you wouldn't be displaying a really large result, but if it becomes a problem, then you'll want to store them in a collections.OrderedDict (https://stackoverflow.com/a/13062357/1766544) or find a sorteddict module.
from collections import OrderedDict
# sort once
ordered_dfs = OrderedDict.fromkeys(sorted(dfs.keys()))
ordered_dfs.move_to_end('event', last=False)
ordered_dfs.update(dfs)
# display as often as you need
for k, v in ordered_dfs.items():
    print(k, v)
If you display different fields first in different views, that's not a problem. Just sort all the fields normally, and use a function like the one above, without the sort.
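For instance, the no-sort variant could look something like this (a sketch; it assumes the dict itself is already stored in the order you want to display, e.g. the OrderedDict above):

def display_in_stored_order(dfs, priority_items=None):
    priority = [k for k in (priority_items or ['event']) if k in dfs]
    for key in priority + [k for k in dfs if k not in priority]:
        display(key, dfs[key])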
The task I wanted to see if it's possible to solve is swapping the key/value pairs of a dictionary (in Python) with an in-place calculation, without additional data structures (only a constant number of extra variables). It seems rather impossible (in a finite world), but I'm open to hearing suggestions for solving it.
I've seen a few posts about in-place dictionary inverse in python, and I've found one common thing between all of the solutions.
The following dictionary won't be properly inverted:
dict = {'b':'a','a':'c','c':123}
The reason is that when swapping the first entry, we overwrite 'a''s actual value (the values are unique and the keys are unique, but that doesn't mean there isn't a value that is the same as an already existing key).
NOTES:
1) The dictionary given as an example has hashable values.
2) The key/values can be of any data-type. Not necessarily strings.
I'd love to hear ways to solve it. I've thought of one, but it only works if we have infinite memory, which obviously is not true.
EDIT:
1) My idea was to change the dictionary so that I add a number of underscores ("_") to the beginning of each key entry. The number of underscores is determined by the keys: if some key already has X underscores, I'll add X+1 underscores (max_underscores_of_key_in_prefix + 1).
To work around objects in the keys, I'll make a wrapper class for that.
I have tried my best explaining my intuition, but I am not sure this is practical.
2) @Mark Ransom's solution works perfectly, but if anyone has another algorithmic solution to the problem, I'd still love to hear it!
I mark this question as solved because it is solved, but again, other solutions are more than welcome :-)
Obviously for this to be possible, both keys and values must be hashable. This means that none of your keys or values can be a list. We can take advantage of this to know which dictionary elements have been already processed.
Since you can't iterate and modify a dictionary at the same time, we must start over every time we swap a key/value. That makes this very slow, an O(n^2) operation.
def invert_dict(d):
    done = False
    while not done:
        done = True
        for key, val in d.items():
            if isinstance(val, list):
                # a list marks an entry that has already been processed;
                # val[1] (if present) is this key's final inverted value,
                # while val[0] is an original value that still needs a swap
                if len(val) > 1:
                    d[key] = [val[1]]
                    val = val[0]
            else:
                # a raw value: drop the entry so we don't process it twice
                del d[key]
            if not isinstance(val, list):
                if val in d:
                    # keep the existing (still unswapped) value alongside the new one
                    d[val] = [d[val], key]
                else:
                    d[val] = [key]
                done = False
                break
    # finally unwrap the single-element list markers
    for key, val in d.items():
        d[key] = val[0]
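With the dictionary from the question, this should give (if I've traced the algorithm correctly):

d = {'b': 'a', 'a': 'c', 'c': 123}
invert_dict(d)
print(d)  # expected: {'a': 'b', 'c': 'a', 123: 'c'}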
I have a very large nested dictionary of the following form, for example:
keyDict = {f: {t: {c_1: None, c_2: None, c_3: None, ..., c_n: None}}}
And another dictionary with keys and values:
valDict = {c_1: 13.37, c_2: -42.00, c_3: 0.00, ... c_n: -0.69}
I want to use the valDict to assign the values to the lowest level of the keyDict as fast as possible.
My current implementation is very slow, I think because I iterate through the two upper levels [f][t] of the keyDict. There must be a way to set the values of the lowest level without concern for the upper levels, because the value of [c] does not depend on [f][t].
My current SLOW implementation:
for f in keyDict:
    for t in keyDict[f]:
        for c in keyDict[f][t]:
            keyDict[f][t][c] = valDict[c]
Still looking for a solution. [c] only has a few thousand keys, but [f][t] can have millions, so the way I do it, the value assignment happens millions of times, when it should be possible to go through the bottom level once and assign the value, which does NOT depend on f,t but ONLY on c.
To clarify the example, per Alexis's request: the c dictionaries don't necessarily all have the same keys, but c dictionaries DO have the same values for a given key. To make things simple, let's say there are only 3 possible keys for the c dict (c_1, c_2, c_3). One parent dictionary (e.g. f=1, t=1) may have just {c_2}, another parent dictionary (f=1, t=2) may have {c_2 and c_3}, and yet another (e.g. f=999, t=999) might have all three {c_1, c_2, c_3}. Some parent dicts may have the same set of c's. What I am trying to do is assign the value to the c dict, which is defined purely by the c key, not t or f.
If the most nested dicts and valDict share exactly the same keys, it would be faster to use dict.update instead of looping over all the keys of the dict:
for dct in keyDict.values():
    for d in dct.values():
        d.update(valDict)
Also, it is more elegant and probably faster to loop over the values of the outer dicts directly, instead of iterating over the keys and then accessing the value with the current key.
So you have millions of "c" dictionaries that you need to keep synchronized. The dictionaries have different sets of keys (presumably for good reason, but I trust you realize that your update code puts the new values in all the dictionaries), but the non-None values must change in lockstep.
You haven't explained what this data structure is for, but judging from your description, you should have a single c dictionary, not millions of them.
After all, you only have one set of valid "c" values; maintaining multiple copies is not only a performance problem, it puts an incredible burden of consistency on your code. But obviously, updating a single dictionary will be hugely faster than updating millions of them.
Of course you also want to know which keys were contained in each dictionary: To do this, your tree of dictionaries should terminate with sets of keys, which you can use to look up values as necessary.
In case my description is not clear, here is how your structure would be transformed:
all_c = dict()
for f in keyDict:
    for t in keyDict[f]:
        all_c.update((k, v) for k, v in keyDict[f][t].items() if v is not None)
        keyDict[f][t] = set(keyDict[f][t].keys())
This code builds a combined dictionary all_c with the non-None values from each of your bottom-level "c" dictionaries, then replaces each of those dictionaries with a set of its keys. If you later need the complete dictionary at keyDict[f][t] (rather than access to particular values), you can reconstruct it like this:
f_t_cdict = dict((k, all_c[k]) for k in keyDict[f][t])
But I'm pretty sure you can do whatever it is you are doing by working with the sets keyDict[f][t], and simply looking up values directly in the combined dictionary all_c.
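For example (illustrative lookups against the transformed structure, using the same names):

c_key = 'c_2'               # any bottom-level key (illustrative)
if c_key in keyDict[f][t]:  # did this (f, t) entry contain that key?
    value = all_c[c_key]    # its current value, shared by every (f, t)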
I have a couple of pairs of rather big dicts. The two dicts in each pair have exactly the same structure, but the values differ. The pairs differ from one another in how nested they are.
To clarify:
dict_a has same structure as dict_b
dict_c has same structure as dict_d (but is different from dict_a and dict_b)
etc.
Is there a tool out there that makes it easy to implement a function to compare the values only, and/or do some basic arithmetic on them? My dicts can be quite nested, so a simple [for k,v in dict_x.iteritems()...] won't do.
Sounds like a problem for...recursive functions!
Basically, if I understand your question, you have a deep dictionary with varying levels of depths at unspecified keys. You'd like to compare the values of dict_a to dict_b but don't care much about the keys: just the differences in values. Here's an idea using a recursive function to print out each set of values that doesn't match.
def dict_compare(da, db):
    for k, v in da.items():
        if isinstance(v, dict):  # if the value is another dict:
            dict_compare(v, db[k])  # enter into the comparison function again!
        else:
            if v != db[k]:
                print('values not equal at', k)
Then you can just call
dict_compare(dict_a, dict_b)
The magic being that if the value of a given key is in fact another dictionary, just call your comparison function again.
Obviously, if you wanted to do something more complicated than just print the simple key of the values that don't match, just modify what happens after the if v != db[k] line.
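For instance, one such modification (my own sketch, with a made-up key-path format) could collect the mismatches instead of printing them:

def dict_diff(da, db, path=''):
    diffs = []
    for k, v in da.items():
        here = f'{path}.{k}' if path else str(k)
        if isinstance(v, dict):
            diffs.extend(dict_diff(v, db[k], here))
        elif v != db[k]:
            diffs.append((here, v, db[k]))
    return diffs

diffs = dict_diff(dict_a, dict_b)  # list of (key path, value in a, value in b) tuples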