Rearranging a dictionary based on a function-condition over its items - python

(In relation to this question I posed a few days ago)
I have a dictionary whose keys are strings, and whose values are sets of integers, for example:
db = {"a":{1,2,3}, "b":{5,6,7}, "c":{2,5,4}, "d":{8,11,10,18}, "e":{0,3,2}}
I would like a procedure that joins the keys whose values satisfy a generic condition given by an external function. The new item will have as its key the union of both keys (the order is not important). The value will be determined by the condition itself.
For example: given this condition function:
def condition(kv1: tuple, kv2: tuple):
    key1, val1 = kv1
    key2, val2 = kv2
    union = val1 | val2 #just needed for the following line
    maxDif = max(union) - min(union)
    newVal = set()
    for i in range(maxDif):
        auxVal1 = {pos - i for pos in val2}
        auxVal2 = {pos + i for pos in val2}
        intersection1 = val1.intersection(auxVal1)
        intersection2 = val1.intersection(auxVal2)
        print(intersection1, intersection2)
        if (len(intersection1) >= 3):
            newVal.update(intersection1)
        if (len(intersection2) >= 3):
            newVal.update({pos - i for pos in intersection2})
    if len(newVal)==0:
        return False
    else:
        newKey = "".join(sorted(key1+key2))
        return newKey, newVal
That is, a pair of items satisfies the condition when their values contain at least 3 numbers separated by the same distance (difference). As said, if the condition is satisfied, the resulting key is the union of the two keys. For this particular example, the value is the set of matching numbers taken from the lower of the two original value sets.
How can I smartly apply a function like this to a dictionary like db? Given the aforementioned dictionary, the expected result would be:
result = {"ab":{1,2,3}, "cde":{0,3,2}, "d":{18}}

Your "condition" in this case is more than a mere condition. It is actually a merging rule that identifies which values to keep and which to drop. Whether a generalized approach is possible depends on how the patterns and merge rules vary.
Given this, each merge operation could leave values in the original keys that may be merged with some of the remaining keys. Multiple merges can also occur (e.g. key "cde"). In theory the merging process would need to cover a power set of all keys which may be impractical. Alternatively, this can be performed by successive refinements using pairings of (original and/or merged) keys.
The merge condition/function:
db = {"a":{1,2,3}, "b":{5,6,7}, "c":{2,5,4}, "d":{8,11,10,18}, "e":{0,3,2}}
from itertools import product
from collections import Counter
# Apply condition and return a keep-set and a remove-set
# the keep-set will be empty if the matching condition is not met
def merge(A,B,inverted=False):
    minMatch = 3
    distances = Counter(b-a for a,b in product(A,B) if b>=a)
    delta = [d for d,count in distances.items() if count>=minMatch]
    keep = {a for a in A if any(a+d in B for d in delta)}
    remove = {b for b in B if any(b-d in A for d in delta)}
    if len(keep)>=minMatch: return keep,remove
    return None,None
print( merge(db["a"],db["b"]) ) # ({1, 2, 3}, {5, 6, 7})
print( merge(db["e"],db["d"]) ) # ({0, 2, 3}, {8, 10, 11})
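To see why the first call above reports a match, it may help to look at the distance histogram on its own. The snippet below is only an illustrative sketch that repeats the first two steps of merge for the sets under "a" and "b"; the variable names A and B are chosen here for the example.

```python
from collections import Counter
from itertools import product

A = {1, 2, 3}   # same contents as db["a"]
B = {5, 6, 7}   # same contents as db["b"]

# Count how often each non-negative offset b - a occurs across all pairs.
distances = Counter(b - a for a, b in product(A, B) if b >= a)

# An offset shared by at least three pairs means the sets line up at that shift.
delta = [d for d, count in distances.items() if count >= 3]

print(distances[4])  # 3, from the pairs (1,5), (2,6), (3,7)
print(delta)         # [4]
```

Only the offset 4 is shared by three pairs, so every element of A has a partner in B at that shift and all of them end up in the keep-set.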
Merge Process:
# combine dictionary keys using a merging function/condition
def combine(D,mergeFunction):
    result = { k:set(v) for k,v in D.items() } # start with copy of input
    merging = True
    while merging: # keep merging until no more merges are performed
        merging = False
        for a,b in product(*2*[list(result.keys())]): # all key pairs
            if a==b: continue
            if a not in result or b not in result: continue # keys still there?
            m,n = mergeFunction(result[a],result[b]) # call merge function
            if not m : continue # if merged ...
            mergedKey = "".join(sorted(set(a+b))) # combine keys
            result[mergedKey] = m # add merged set
            if mergedKey != a: result[a] -= m; merging = True # clean/clear
            if not result[a]: del result[a] # original sets,
            if mergedKey != b: result[b] -= n; merging = True # do more merges
            if not result[b]: del result[b]
    return result


Summing keys and values in a list of dictionaries python

I have a list of dictionaries called "timebucket" :
[{0.9711533363722904: 0.008296776727415599},
{0.97163564816067838: 0.008153794130319884},
{0.99212783984967068: 0.0022392112909864364},
{0.98955473263127025: 0.0029843621053514003}]
I would like to take the two largest keys (.99 and .98) and average them, and also average their two corresponding values.
Expected output would look something like:
{ (avg. two largest keys) : (avg. values of two largest keys) }
I've tried:
import numpy as np
import heapq
[np.mean(heapq.nlargest(2, i.keys())) for i in timebucket]
but heapq doesn't work in this scenario, and I'm not sure how to keep keys and values linked.
Doing this with numpy:
In []:
a = np.array([e for i in timebucket for e in i.items()])
a[a[:, 0].argsort()][-2:].mean(axis=0)  # sort rows by key (column 0) and average the two largest
Out[]
array([ 0.99084129, 0.00261179])
Though I suspect creating a better data-structure up front would probably be a better approach.
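One possible shape for such a data structure is a flat list of (key, value) tuples, which also makes heapq usable because tuples compare by their first element first. This is a minimal sketch, not the answerer's code; the names pairs and top_two are made up for the example.

```python
import heapq

timebucket = [{0.9711533363722904: 0.008296776727415599},
              {0.97163564816067838: 0.008153794130319884},
              {0.99212783984967068: 0.0022392112909864364},
              {0.98955473263127025: 0.0029843621053514003}]

# Flatten the single-entry dicts so each key stays attached to its value.
pairs = [item for d in timebucket for item in d.items()]

# Tuples are ordered by their first element, so this picks the two largest keys.
top_two = heapq.nlargest(2, pairs)

avg_key = sum(k for k, _ in top_two) / len(top_two)
avg_val = sum(v for _, v in top_two) / len(top_two)
result = {avg_key: avg_val}
print(result)
```

Keeping the pairs together avoids a second lookup pass to find the values that belong to the selected keys.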
This gives you the average of 2 largest keys (keyave) and the average of the two corresponding values (valave).
The keys and values are put into a dictionary called newdict.
timebucket = [{0.9711533363722904: 0.008296776727415599},
{0.97163564816067838: 0.008153794130319884},
{0.99212783984967068: 0.0022392112909864364},
{0.98955473263127025: 0.0029843621053514003}]
keys = []
for time in timebucket:
    for x in time:
        keys.append(x)
result = {}
for d in timebucket:
    result.update(d)
largestkey = (sorted(keys)[-1])
ndlargestkey = (sorted(keys)[-2])
keyave = (float((largestkey)+(ndlargestkey))/2)
largestvalue = (result[(largestkey)])
ndlargestvalue = (result[(ndlargestkey)])
valave = (float((largestvalue)+(ndlargestvalue))/2)
newdict = {}
newdict[keyave] = valave
print(newdict)
#print(keyave)
#print(valave)
Output
{0.9908412862404705: 0.002611786698168918}
Here is a solution to your problem:
def dothisthing(mydict): # define the function with a dictionary as the only parameter
    keylist = [] # create an empty list
    for key in mydict: # iterate the input dictionary
        keylist.append(key) # add the key from the dictionary to a list
    keylist.sort(reverse = True) # sort the list from highest to lowest numbers
    toptwokeys = 0 # create a variable
    toptwovals = 0 # create a variable
    count = 0 # create an integer variable
    for item in keylist: # iterate the list we created above
        if count <2: # this limits the iterations to the first 2
            toptwokeys += item # add the key
            toptwovals += (mydict[item]) # add the value
        count += 1
    finaldict = {(toptwokeys/2):(toptwovals/2)} # create a dictionary where the key and val are the average of the 2 from the input dict with the greatest keys
    return finaldict # return the output dictionary
dothisthing({0.9711533363722904: 0.008296776727415599, 0.97163564816067838: 0.008153794130319884, 0.99212783984967068: 0.0022392112909864364, 0.98955473263127025: 0.0029843621053514003})
# call the function with your dictionary as the parameter
I hope it helps
You can do it in just four lines without importing numpy:
Short solution
For the average of the two largest keys:
max_keys_average=sorted([keys for item in timebucket for keys,values in item.items()])[::-1][:2]
print(sum(max_keys_average)/len(max_keys_average))
output:
0.9908412862404705
For the average of their values:
max_values_average=[values for item in max_keys_average for item_1 in timebucket for keys,values in item_1.items() if item==keys]
print(sum(max_values_average)/len(max_values_average))
output:
0.002611786698168918
If you have trouble following the list comprehensions, here is a detailed solution:
Detailed Solution
First step: collect all the keys of the dicts in one list.
Here is your timebucket list:
timebucket=[{0.9711533363722904: 0.008296776727415599},
{0.97163564816067838: 0.008153794130319884},
{0.99212783984967068: 0.0022392112909864364},
{0.98955473263127025: 0.0029843621053514003}]
Now let's store all the keys in one list:
keys_list=[]
for d in timebucket:
    for key,value in d.items():
        keys_list.append(key)
The next step is to sort this list in descending order and take the first two values:
max_keys=sorted(keys_list)[::-1][:2]
Then take the sum of this new list and divide it by its length:
print(sum(max_keys)/len(max_keys))
output:
0.9908412862404705
Now iterate over max_keys and over the keys in timebucket, and whenever they match, collect the corresponding value in a list:
max_values=[]
for item in max_keys:
    for d in timebucket:
        for key, value in d.items():
            if item==key:
                max_values.append(value)
print(max_values)
Finally, take the sum of max_values and divide it by its length:
print(sum(max_values)/len(max_values))
This gives the output:
0.002611786698168918
This is an alternative solution to the problem:
In []:
import numpy as np
import time
def AverageTB(time_bucket):
    # list(...) is needed in Python 3: dict item views are not ordered by their contents
    tuples = [list(tb.items()) for tb in time_bucket]
    largest_keys = []
    largest_keys.append(max(tuples))
    tuples.remove(max(tuples))
    largest_keys.append(max(tuples))
    keys = [i[0][0] for i in largest_keys]
    values = [i[0][1] for i in largest_keys]
    return np.average(keys), np.average(values)
time_bucket = [{0.9711533363722904: 0.008296776727415599},
{0.97163564816067838: 0.008153794130319884},
{0.99212783984967068: 0.0022392112909864364},
{0.98955473263127025: 0.0029843621053514003}]
time_exe = time.time()
print('avg. (keys, values): {}'.format(AverageTB(time_bucket)))
print('time: {}'.format(time.time() - time_exe))
Out[]:
avg. (keys, values): (0.99084128624047052, 0.0026117866981689181)
time: 0.00037789344787

consolidating list of sets

Given a list of sets (sets of strings such as setlist = [{'this','is'},{'is','a'},{'test'}]), the idea is to join pairwise -union- sets that share strings in common. The snippet below takes the literal approach of testing pairwise overlap, joining, and starting anew using an inner loop break.
I know this is the pedestrian approach, and it does take forever for lists of usable size (200K sets of between 2 and 10 strings).
Any advice on how to make this more efficient? Thanks.
j = 0
while True:
    if j == len(setlist): # both for loops are done
        break # while
    for i in range(0,len(setlist)-1):
        for j in range(i+1,len(setlist)):
            a = setlist[i]
            b = setlist[j]
            if not set(a).isdisjoint(b): # ... then join them
                newset = set.union( a , b ) # ... new set
                del setlist[j] # ... drop highest index
                del setlist[i] # ... drop lowest index
                setlist.insert(0,newset) # ... introduce consolidated set, which messes up i,j
                break # ... back to the top for fresh i,j
        else:
            continue
        break
As @user2357112 mentioned in the comments, this can be thought of as a graph problem. Every set is a vertex and every word shared between two sets is an edge. You can then iterate over the vertices and run a BFS (or DFS) from every unseen vertex to generate a connected component.
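The graph idea can be sketched roughly as follows. This is an illustrative version, not code from the comment: the function name consolidate_bfs and the word-to-index map are assumptions made here, and the edges are never built explicitly but looked up through the shared words.

```python
from collections import defaultdict, deque

def consolidate_bfs(setlist):
    # Map each word to the indices of the sets containing it (implicit edges).
    word_to_sets = defaultdict(list)
    for idx, s in enumerate(setlist):
        for word in s:
            word_to_sets[word].append(idx)

    seen = [False] * len(setlist)
    components = []
    for start in range(len(setlist)):
        if seen[start]:
            continue
        seen[start] = True
        merged = set()
        queue = deque([start])
        while queue:
            idx = queue.popleft()
            merged |= setlist[idx]
            for word in setlist[idx]:
                # pop() so that each word's index list is expanded only once
                for neighbour in word_to_sets.pop(word, ()):
                    if not seen[neighbour]:
                        seen[neighbour] = True
                        queue.append(neighbour)
        components.append(merged)
    return components

setlist = [{'this', 'is'}, {'is', 'a'}, {'test'}]
print(consolidate_bfs(setlist))
```

Each set and each word is visited a bounded number of times, so the work grows with the total number of words rather than with the number of set pairs.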
Another option is to use Union-Find. The advantage of Union-Find is that you don't need to construct a graph, and there's no degenerate case when all the sets have the same contents. Here's an example of it in action:
from collections import defaultdict
# Return ancestor of given node
def ancestor(parent, node):
    if parent[node] != node:
        # Do path compression
        parent[node] = ancestor(parent, parent[node])
    return parent[node]
def merge(parent, rank, x, y):
    # Merge sets that x & y belong to
    x = ancestor(parent, x)
    y = ancestor(parent, y)
    if x == y:
        return
    # Union by size (stored in rank): merge smaller set into larger one
    if rank[y] > rank[x]:
        x, y = y, x
    parent[y] = x
    rank[x] += rank[y]
def merge_union(setlist):
    # For every word in sets list what sets contain it
    words = defaultdict(list)
    for i, s in enumerate(setlist):
        for w in s:
            words[w].append(i)
    # Merge sets that share the word
    parent = list(range(len(setlist)))
    rank = [1] * len(setlist)
    for sets in words.values():
        it = iter(sets)
        merge_to = next(it)
        for x in it:
            merge(parent, rank, merge_to, x)
    # Construct result by union the sets within a component
    result = defaultdict(set)
    for merge_from in range(len(setlist)):
        # Look up the root explicitly: parent[] may still point at an intermediate node
        result[ancestor(parent, merge_from)] |= setlist[merge_from]
    return list(result.values())
setlist = [
    {'this', 'is'},
    {'is', 'a'},
    {'test'},
    {'foo'},
    {'foobar', 'foo'},
    {'foobar', 'bar'},
    {'alone'}
]
print(merge_union(setlist))
Output:
[{'this', 'is', 'a'}, {'test'}, {'bar', 'foobar', 'foo'}, {'alone'}]

Python Tuple Key For Dict Partial Lookup

I have a dictionary with a tuple of 5 values as a key. For example:
D[i,j,k,g,h] = value.
Now I need to process all elements that share a certain partial key pair (i1, g1): for each pair (i1, g1), I need all values whose full key has i == i1 and g == g1.
What is a Pythonic and efficient way to retrieve these, given that I need the elements for all pairs and that each full key belongs to exactly one partial key?
Is there a more appropriate data structure than dictionaries?
One reference implementation is this:
results = {}
for i1 in I:
    for g1 in G:
        results[i1, g1] = []
        for i, j, k, g, h in D:
            if i1 == i and g1 == g:
                results[i1, g1].append(D[i, j, k, g, h])
Assuming you know all the valid values for the different indices you can get all possible keys using itertools.product:
import itertools
I = [3,6,9]
J = range(10)
K = "abcde"
G = ["first","second"]
H = range(10,20)
my_dict = {}  # dictionary to fill, keyed by the full 5-tuple
for tup in itertools.product(I,J,K,G,H):
    my_dict[tup] = 0
To restrict the indices generated just put a limit on one / several of the indices that gets generated, for instance all of the keys where i = 6 would be:
itertools.product((6,), J,K,G,H)
A function to let you specify you want all the indices where i==6 and g =="first" would look like this:
def partial_indices(i_vals=I, j_vals=J, k_vals=K, g_vals = G, h_vals = H):
    return itertools.product(i_vals, j_vals, k_vals, g_vals, h_vals)
partial_indices(i_vals=(6,), g_vals=("first",))
Or assuming that not all of these are present in the dictionary you can also pass the dictionary as an argument and check for membership before generating the keys:
def items_with_partial_indices(d, i_vals=I, j_vals=J, k_vals=K, g_vals = G, h_vals = H):
    for tup in itertools.product(i_vals, j_vals, k_vals, g_vals, h_vals):
        try:
            yield tup, d[tup]
        except KeyError:
            pass
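# Alternatively, scan the dictionary and compare the key positions directly (i1, g1 is the wanted partial key):
for k, v in D.items():
    if k[0] == i1 and k[3] == g1:
        print(v)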
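Since the question says the elements are needed for all pairs and each full key belongs to exactly one partial key, a single pass that buckets values by (i, g) may be simpler than one scan per pair. The following is a sketch under that assumption; group_by_partial_key and the sample data are hypothetical.

```python
from collections import defaultdict

def group_by_partial_key(D):
    # One pass over D: bucket each value under the (i, g) part of its key.
    groups = defaultdict(list)
    for (i, j, k, g, h), value in D.items():
        groups[i, g].append(value)
    return groups

# Hypothetical sample data, only for illustration.
D = {
    (3, 0, "a", "first", 10): 1.0,
    (3, 1, "b", "first", 11): 2.0,
    (6, 0, "a", "second", 10): 3.0,
}
groups = group_by_partial_key(D)
print(groups[3, "first"])   # [1.0, 2.0]
print(groups[6, "second"])  # [3.0]
```

If partial lookups are frequent, the grouped dictionary could be kept alongside D, or D could be stored as a nested dictionary keyed first by (i, g).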

Python: Sum entries in list of tuples entries with case sensitive keys?

I have a list of tuples holding hashtags and frequencies for example:
[('#Example', 92002),
('#example', 65544)]
I want to sum entries that have the same string as the first element of the tuple (differing only in letter case), keeping the spelling from the entry with the highest count. The above would be transformed to:
[('#Example', 157546)]
I've tried this so far:
import operator
res = []  # collects one (hashtag, total) tuple per case-insensitive hashtag
for hashtag in hashtag_freq_list:
    if hashtag[0].lower() not in [res_entry[0].lower() for res_entry in res]:
        entries = [entry for entry in hashtag_freq_list if hashtag[0].lower() == entry[0].lower()]
        k = max(entries,key=operator.itemgetter(1))[0]
        v = sum([entry[1] for entry in entries])
        res.append((k,v))
I was just wondering if this could be approached in a more elegant way?
I would use a dictionary:
data = [('#example', 65544),('#Example', 92002)]
hashtable = {}
for i in data:
    # See if this thing exists regardless of casing
    if i[0].lower() not in hashtable:
        # Create a dictionary
        hashtable[i[0].lower()] = {
            'meta':'',
            'value':[]
        }
        # Copy the relevant information
        hashtable[i[0].lower()]['value'].append(i[1])
        hashtable[i[0].lower()]['meta'] = i[0]
    # If the value exists
    else:
        # Check if the number it holds is the max against
        # what was collected so far. If so, change meta
        if i[1] > max(hashtable[i[0].lower()]['value']):
            hashtable[i[0].lower()]['meta'] = i[0]
        # Append the value regardless
        hashtable[i[0].lower()]['value'].append(i[1])
# For output purposes
myList = []
# Build the tuples
for node in hashtable:
    myList.append((hashtable[node]['meta'],sum(hashtable[node]['value'])))
# Voila!
print(myList)
# [('#Example', 157546)]
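The same bookkeeping can be written more compactly by tracking a running total and the best spelling per lowercase key. This is a sketch of one way to do it, not a replacement for the answer above; the function name merge_case_variants is made up here.

```python
def merge_case_variants(pairs):
    totals = {}   # lowercase tag -> summed frequency
    best = {}     # lowercase tag -> (highest single frequency, spelling that had it)
    for tag, freq in pairs:
        key = tag.lower()
        totals[key] = totals.get(key, 0) + freq
        if key not in best or freq > best[key][0]:
            best[key] = (freq, tag)
    return [(best[key][1], total) for key, total in totals.items()]

data = [('#example', 65544), ('#Example', 92002)]
print(merge_case_variants(data))  # [('#Example', 157546)]
```

Each input tuple is touched once, which avoids rescanning the whole list for every hashtag as the loop in the question does.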

dynamically nesting a python dictionary

I want to create a function that will create dynamic levels of nesting in a python dictionary.
e.g. if I call my function nesting, I want the outputs like the following:
nesting(1) : dict = {key1:<value>}
nesting(2) : dict = {key1:{key2:<value>}}
nesting(3) : dict = {key1:{key2:{key3:<value>}}}
and so on. I have all the keys and values before calling this function, but not before I start executing the code.
I have the keys stored in a variable 'm' where m is obtained from:
m=re.match(pattern,string)
the pattern is constructed dynamically for this case.
You can iterate over the keys like this:
def nesting(level):
    ret = 'value'
    for l in range(level, 0, -1):
        ret = {'key%d' % l: ret}
    return ret
Replace the range(...) fragment with the code which yields the keys in the desired order. So, if we assume that the keys are the captured groups, you should change the function as follows:
def nesting(match): # `match' is a match object like your `m' variable
    ret = 'value'
    for key in match.groups():
        ret = {key: ret}
    return ret
Or use reversed(match.groups()) if you want to get the keys in the opposite order.
def nesting(level, l=None):
    # assuming `m` is accessible in the function
    if l is None:
        l = level
    if level == 1:
        return {m[l-level]: 'some_value'}
    return {m[l-level]: nesting(level-1, l)}
For reasonable levels, this won't exceed the recursion depth. This is also assuming that the value is always the same and that m is of the form:
['key1', 'key2', ...]
An iterative form of this function can be written as such:
def nesting(level):
    # also assuming `m` is accessible within the function
    d = 'some_value'
    while level > 0:
        d = {m[level-1]: d}  # wrap from the innermost key outwards so m[0] ends up outermost
        level -= 1
    return d
Or:
def nesting(level):
    # also assuming `m` is accessible within the function
    d = 'some_value'
    for l in range(level, 0, -1): # or xrange in Python 2
        d = {m[l-1]: d}  # m[level-1] is wrapped first, m[0] last (outermost)
    return d
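If passing the keys explicitly is acceptable instead of relying on a global m, the same wrapping can be expressed as a fold. This is a small sketch; the function name nest and its two parameters are assumptions made for the example.

```python
from functools import reduce

def nest(keys, value):
    # Fold from the innermost key outwards: the last key wraps the value first.
    return reduce(lambda inner, key: {key: inner}, reversed(keys), value)

print(nest(['key1', 'key2', 'key3'], 'some_value'))
# {'key1': {'key2': {'key3': 'some_value'}}}
```

With a match object, the keys could be supplied as m.groups(), which returns a tuple and therefore works with reversed.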
