Remove redundant tuples from dictionary based on the score - python

I wonder if there is a fast way to remove redundant tuples from a dictionary. Suppose I have a dictionary as below:
a = {
'trans': [('pickup', 1.0), ('boat', 1.0), ('plane', 1.0), ('walking', 1.0), ('foot', 1.0), ('train', 0.7455259731472191), ('trailer', 0.7227749512667475), ('car', 0.7759192750865143)],
'actor': {
'autori': [('smug', 1.0), ('pol', 1.0), ('traff', 1.0), ('local authori', 0.6894454471465952), ('driv', 0.6121365092485745), ('car', 0.6297345748705596)],
'fam': [('fa', 1.0), ('mo', 1.0), ('bro', 1.0), ('son', 0.9925431812951816), ('sis', 0.9789254869156859), ('fami', 0.8392597243422916)],
'fri': [('fri', 1.0), ('compats', 1.0), ('mo', 0.814126196299157), ('neighbor', 0.7433986938516075), ('parent', 0.32202418215134565), ('bro', 0.8496284151715676), ('fami', 0.6375584385858655), ('best fri', 0.807654599975373)]
}
}
In this dictionary for example we have tuples like: ('car', 0.7759192750865143) for key 'trans' and ('car', 0.6297345748705596) for key 'autori'. I want to remove the tuple ('car', 0.6297345748705596) because it has a lower score.
My desired output is:
new_a = {
'trans': [('pickup', 1.0), ('boat', 1.0), ('plane', 1.0), ('walking', 1.0), ('foot', 1.0), ('train', 0.7455259731472191), ('trailer', 0.7227749512667475), ('car', 0.7759192750865143)],
'actor': {
'autori': [('smug', 1.0), ('pol', 1.0), ('traff', 1.0), ('local authori', 0.6894454471465952), ('driv', 0.6121365092485745)],
'fam': [('fa', 1.0), ('mo', 1.0), ('bro', 1.0), ('son', 0.9925431812951816), ('sis', 0.9789254869156859), ('fami', 0.8392597243422916)],
'fri': [('fri', 1.0), ('compats', 1.0), ('neighbor', 0.7433986938516075), ('parent', 0.32202418215134565), ('best fri', 0.807654599975373)]
}
}
Is there a fast way to do this, or do we still need to loop through all values for each key?

Not sure it's the most efficient, but since you also mentioned "a simple solution" in a comment....
I think the simplest method would involve looping through every tuple twice: once to collect best scores, and then again to filter everything else. Something like new_a = onlyBest( a, bestRef=dict(sorted(getAllPairs(a))) ) [see function definitions below].
def getAllPairs(obj):
    if isinstance(obj, tuple) and len(obj)==2: return [obj]
    allPairs = []
    if isinstance(obj, dict): obj = obj.values()
    if hasattr(obj, '__iter__') and not isinstance(obj, str):
        for i in obj: allPairs += getAllPairs(i)
    return allPairs
def onlyBest(obj, bestRef:dict):
    if isinstance(obj, list):
        # if all(isinstance(i, tuple) and len(i)==2 for i in obj):
        return [i for i in obj if not i[1] < bestRef.get(i[0],i[1])]
    if isinstance(obj, dict):
        return {k: onlyBest(v,bestRef) for k, v in obj.items()}
    return obj

To remove the lower-scored values, you need to detect duplicates, compare their scores, keep track of the highest score per name, and remove any tuple for which a higher score exists elsewhere. Any algorithm that does this needs at least O(n) time and O(n) space.
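The two passes described above can be sketched as follows. This is a minimal illustration rather than a benchmarked solution: the helper names `iter_pairs`, `keep_best`, and `remove_lower_duplicates` are made up for this example, the sample data is a shortened version of the question's dictionary, and the code assumes that every non-dict value is a list of `(name, score)` tuples.

```python
def iter_pairs(obj):
    """Yield every (name, score) tuple from nested dicts of lists."""
    if isinstance(obj, dict):
        for value in obj.values():
            yield from iter_pairs(value)
    else:
        # Assumption: anything that is not a dict is a list of 2-tuples.
        yield from obj


def keep_best(obj, best):
    """Rebuild the structure, keeping only tuples that hold the best score."""
    if isinstance(obj, dict):
        return {key: keep_best(value, best) for key, value in obj.items()}
    return [(name, score) for name, score in obj if score >= best[name]]


def remove_lower_duplicates(data):
    # Pass 1: record the highest score seen for each name (O(n) time and space).
    best = {}
    for name, score in iter_pairs(data):
        if name not in best or score > best[name]:
            best[name] = score
    # Pass 2: filter every list against the recorded best scores.
    return keep_best(data, best)


# Shortened, rounded sample data in the same shape as the question's dictionary.
a = {
    'trans': [('boat', 1.0), ('car', 0.78)],
    'actor': {
        'autori': [('pol', 1.0), ('car', 0.63)],
        'fam': [('mo', 1.0), ('fami', 0.84)],
        'fri': [('fri', 1.0), ('mo', 0.81), ('fami', 0.64)],
    },
}
new_a = remove_lower_duplicates(a)
print(new_a)
```

Note that tuples tied for the best score are kept wherever they appear, which matches the behaviour of the `onlyBest` filter above. The input dictionary is not modified.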

Related

Extract new dictionary from a list of dictionaries

I have a list of dictionaries, I would like to create a new dictionary where the first key 'value' corresponds to the second value of the 'b' key of each dictionary in the list. The second key 'number' of the new dictionary corresponds to the third (therefore last) value of the 'b' key of each dictionary in the list.
my_list = [
{
'a': (2.6, 0.08, 47.0, 1),
'b': (5.7, 0.05, 1)
},
{
'a': (2.6, 0.08, 47.0, 2),
'b': (5.7, 0.06, 2)
}
]
Expected output:
new_dic = {'value': (0.05, 0.06), 'number': (1, 2)}
You can use generator expressions as follows:
new_dict = {}
new_dict['value'] = tuple(val['b'][1] for val in my_list)
new_dict['number'] = tuple(val['b'][2] for val in my_list)
Note that you need to call the tuple constructor, because (val['b'][2] for val in my_list) alone returns a generator object.
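As a small runnable illustration of the point about generators, the sketch below first shows that the bare parenthesised expression is a generator object, then builds the same dictionary in one step with `zip`. The `zip(*...)` variant is an alternative offered here, not part of the answer above, and it assumes `my_list` is non-empty (unpacking an empty `zip` into two names fails).

```python
my_list = [
    {'a': (2.6, 0.08, 47.0, 1), 'b': (5.7, 0.05, 1)},
    {'a': (2.6, 0.08, 47.0, 2), 'b': (5.7, 0.06, 2)},
]

# A parenthesised comprehension is lazy: nothing is computed until it is consumed.
gen = (d['b'][1] for d in my_list)
print(type(gen).__name__)

# d['b'][1:] is the (value, number) pair of each entry; zip(*...) transposes
# the sequence of pairs into one tuple of values and one tuple of numbers.
values, numbers = zip(*(d['b'][1:] for d in my_list))
new_dict = {'value': values, 'number': numbers}
print(new_dict)
```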

Build dict from list of tuples combining two multi index dfs and column index

I have two multi-index dataframes: mean and std
arrays = [['A', 'A', 'B', 'B'], ['Z', 'Y', 'X', 'W']]
mean=pd.DataFrame(data={0.0:[np.nan,2.0,3.0,4.0], 60.0: [5.0,np.nan,7.0,8.0], 120.0:[9.0,10.0,np.nan,12.0]},
index=pd.MultiIndex.from_arrays(arrays, names=('id', 'comp')))
mean.columns.name='Times'
std=pd.DataFrame(data={0.0:[10.0,10.0,10.0,10.0], 60.0: [10.0,10.0,10.0,10.0], 120.0:[10.0,10.0,10.0,10.0]},
index=pd.MultiIndex.from_arrays(arrays, names=('id', 'comp')))
std.columns.name='Times'
My task is to combine them in a dictionary with '{id:' as first level, followed by second level dictionary with '{comp:' and then for each comp a list of tuples, which combines the (time-points, mean, std). So, the result should look like that:
{'A': {
'Z': [(60.0,5.0,10.0),
(120.0,9.0,10.0)],
'Y': [(0.0,2.0,10.0),
(120.0,10.0,10.0)]
},
'B': {
'X': [(0.0,3.0,10.0),
(60.0,7.0,10.0)],
'W': [(0.0,4.0,10.0),
(60.0,8.0,10.0),
(120.0,12.0,10.0)]
}
}
Additionally, when there is a NaN in the data, the triplet is left out: here that applies to A,Z at time 0, A,Y at time 60, and B,X at time 120.
How do I get there? I constructed already a dict of dict of list of tuples for a single line:
iter=0
{mean.index[iter][0]:{mean.index[iter][1]:list(zip(mean.columns, mean.iloc[iter], std.iloc[iter]))}}
>{'A': {'Z': [(0.0, 1.0, 10.0), (60.0, 5.0, 10.0), (120.0, 9.0, 10.0)]}}
Now I need to extend this to a full dictionary by looping over each row (inner dict) and adding the ids (outer dict). I started with iterrows and a dict comprehension, but I have problems indexing with the iter value ('A','Z') that I get from iterrows(), and building the whole dict iteratively.
{mean.index[iter[1]]:list(zip(mean.columns, mean.loc[iter[1]], std.loc[iter[1]])) for (iter,row) in mean.iterrows()}
creates errors, and I would only have the inner loop
KeyError: 'the label [Z] is not in the [index]'
Thanks!
EDIT: I exchanged the numbers to float in this example, because here integers were generated before which was not consistent with my real data, and which would fail in following json dump.
Here is a solution using a defaultdict:
from collections import defaultdict
mean_as_dict = mean.to_dict(orient='index')
std_as_dict = std.to_dict(orient='index')
mean_clean_sorted = {k: sorted([(i, j) for i, j in v.items()]) for k, v in mean_as_dict.items()}
std_clean_sorted = {k: sorted([(i, j) for i, j in v.items()]) for k, v in std_as_dict.items()}
sol = {k: [j + (std_clean_sorted[k][i][1],) for i, j in enumerate(v) if not np.isnan(j[1])] for k, v in mean_clean_sorted.items()}
solution = defaultdict(dict)
for k, v in sol.items():
    solution[k[0]][k[1]] = v
The resulting object is a defaultdict, which you can easily convert to a plain dict:
solution = dict(solution)
con = pd.concat([mean, std])
primary = dict()
for i in set(con.index.values):
    if i[0] not in primary.keys():
        primary[i[0]] = dict()
    primary[i[0]][i[1]] = list()
    for x in con.columns:
        # Assumed intent: pick the (mean, std) pair in column x for this (id, comp) row.
        primary[i[0]][i[1]].append((x, tuple(con.loc[i[0]].loc[i[1]][x].values)))
Here is sample output
I found a very comprehensive way of putting up this nested dict:
mean_dict_items=mean.to_dict(orient='index').items()
{k[0]:{u[1]:list(zip(mean.columns, mean.loc[u], std.loc[u]))
for u,v in mean_dict_items if (k[0],u[1]) == u} for k,l in mean_dict_items}
creates:
{'A': {'Y': [(0.0, 2.0, 10.0), (60.0, nan, 10.0), (120.0, 10.0, 10.0)],
'Z': [(0.0, nan, 10.0), (60.0, 5.0, 10.0), (120.0, 9.0, 10.0)]},
'B': {'W': [(0.0, 4.0, 10.0), (60.0, 8.0, 10.0), (120.0, 12.0, 10.0)],
'X': [(0.0, 3.0, 10.0), (60.0, 7.0, 10.0), (120.0, nan, 10.0)]}}

Sort List of Tuples Containing Strings By First Element Float (Python)

I have a long list of tuples:
[...
(0.862, 'beehive', 'bug'),
(0.12, 'yard', 'property'),
(0.0, 'lake', 'mailbox'),
(0.37, 'maintenance', 'shears'),
(0.1666, 'summer', 'popsicle'),
(0.9, 'poplar', 'tree')
...]
and I need to sort this list in descending order by the float values. I know that Python sorts lists of tuples by the first value by default; however, even when I call sorted or explicitly specify the first element, I haven't had success.
sorted(mylist) # doesn't sort the list at all
sorted(mylist, key = x[0]) # on this sort attempt I get "'str' object is not callable"
Can anyone provide detail as to why the list is still disorganized despite these sorting attempts and what might sort by floats in descending order?
sorted(..) returns a new list. What you are looking for is .sort(..) to sort the list inplace.
Furthermore you can use the reverse parameter to sort in descending order:
data.sort(reverse=True) # sort the list inplace
This will return:
>>> data.sort(reverse=True)
>>> data
[(0.9, 'poplar', 'tree'), (0.862, 'beehive', 'bug'), (0.37, 'maintenance', 'shears'), (0.1666, 'summer', 'popsicle'), (0.12, 'yard', 'property'), (0.0, 'lake', 'mailbox')]
The default sorting of tuples will first sort on the first elements. If these are equal, it will consider the second element of each tuple and so on.
If you do not want this tie breaker, but use the original order in that case, you can use an itemgetter as key:
from operator import itemgetter
data.sort(reverse=True,key=itemgetter(0)) # sort the list inplace
You can use the same arguments with sorted(..) if you want to construct a new list that is sorted:
data_sorted = sorted(data,reverse=True) # construct a new sorted list
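To tie this back to the two failed attempts in the question, here is a minimal sketch using the question's own sample tuples (the variable names `descending` and `same_result` are just for illustration). `sorted(mylist)` on its own line computes a sorted copy and throws it away, so the result has to be assigned. `key = x[0]` evaluates `x[0]` on the spot and hands the result to `sorted`, which then tries to call it; the reported error suggests `x[0]` happened to be a string in that session. The `key` argument has to be a function of one element.

```python
from operator import itemgetter

mylist = [
    (0.862, 'beehive', 'bug'),
    (0.12, 'yard', 'property'),
    (0.0, 'lake', 'mailbox'),
    (0.37, 'maintenance', 'shears'),
    (0.1666, 'summer', 'popsicle'),
    (0.9, 'poplar', 'tree'),
]

# sorted() leaves mylist untouched, so the returned list must be captured.
descending = sorted(mylist, key=itemgetter(0), reverse=True)

# key must be a callable that receives one element; a lambda works as well.
same_result = sorted(mylist, key=lambda tup: tup[0], reverse=True)

print(descending[0])  # largest float first
print(mylist[0])      # original order is unchanged
```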
Try it this way:
data = [
(0.862, 'beehive', 'bug'),
(0.12, 'yard', 'property'),
(0.0, 'lake', 'mailbox'),
(0.37, 'maintenance', 'shears'),
(0.1666, 'summer', 'popsicle'),
(0.9, 'poplar', 'tree')
]
print(*reversed(sorted(data)))
Output:
(0.9, 'poplar', 'tree') (0.862, 'beehive', 'bug') (0.37, 'maintenance', 'shears') (0.1666, 'summer', 'popsicle') (0.12, 'yard', 'property') (0.0, 'lake', 'mailbox')
Or you can sort the list in place with a key function:
>>> data = [
... (0.862, 'beehive', 'bug'),
... (0.12, 'yard', 'property'),
... (0.0, 'lake', 'mailbox'),
... (0.37, 'maintenance', 'shears'),
... (0.1666, 'summer', 'popsicle'),
... (0.9, 'poplar', 'tree')
... ]
>>> data.sort(key=lambda tup: tup[0], reverse = True)
>>> data
[(0.9, 'poplar', 'tree'), (0.862, 'beehive', 'bug'), (0.37, 'maintenance', 'shears'), (0.1666, 'summer', 'popsicle'), (0.12, 'yard', 'property'), (0.0, 'lake', 'mailbox')]
>>>

Find values of keys in dictionary that contain a key word and return those keys and values

I'm working on a function where I need to find values in a dictionary that contain a keyword and return them (only the ones with the keyword) along with their keys. I believe my code is on the right track.
Example Dictionary
{'M':[("One",1400,30.0, 20.5,"oil paint","Austria"),("Three",1430,100.0,102.0,"watercolor","France")],
'P':[("Eight",1460, 225.0, 200.0, "fresco","Netherlands"),("Six",1465,81.0, 127.1, "tempera", "Netherlands")],
'V':[("Four",1661, 148.0, 257.0,"oil paint", "Austria"),("Two",1630, 91.0, 77.0, "oil paint","USA")],
'K':[("Five",1922,63.8,48.1,"watercolor","USA"),("Seven",1950,61.0,61.0,"acrylic paint","USA")],
'C':[("Ten",1496,365.0,389.0,"tempera","Italy")],
'U':[("Nine",1203,182.0, 957.0,"egg tempera","Italy"), ("Twelve",1200,76.2,101.6,"egg tempera","France")]
}
So if I was searching for the keyword 'watercolor' the function should return this
find_keyword(dictionary2(),'watercolor')
{'M': [('Three', 1430, 100.0, 102.0,
'watercolor', 'France')], 'K': [('Five',
1922, 63.8, 48.1, 'watercolor', 'USA')]}
As you can see the function just searched for the keyword watercolor and returned the keys and values in the order that they appeared in the dictionary. I think my current code must be close but it is currently giving me an Assertion error and returning nothing every time. Does anyone know how to fix this?
Current code:
def find_keyword(dictionary,theword):
    keyword = {}
    for key, record_list in dictionary.items():
        for record in record_list:
            value = record[1]
            if theword in record:
                if key in keyword:
                    keyword[key].append(record)
                else:
                    keyword[key] = [record]
    return keyword
Use an OrderedDict since you want to return the first match in your dict.
Your method has many unnecessary bells and whistles. Simply iterate through your dict, search each key's values for your keyword, and return each key-value pair that is a match.
from collections import OrderedDict
d = {'M':[("One",1400,30.0, 20.5,"oil paint","Austria"),("Three",1430,100.0,102.0,"watercolor","France")],
'P':[("Eight",1460, 225.0, 200.0, "fresco","Netherlands"),("Six",1465,81.0, 127.1, "tempera", "Netherlands")],
'V':[("Four",1661, 148.0, 257.0,"oil paint", "Austria"),("Two",1630, 91.0, 77.0, "oil paint","USA")],
'K':[("Five",1922,63.8,48.1,"watercolor","USA"),("Seven",1950,61.0,61.0,"acrylic paint","USA")],
'C':[("Ten",1496,365.0,389.0,"tempera","Italy")],
'U':[("Nine",1203,182.0, 957.0,"egg tempera","Italy"), ("Twelve",1200,76.2,101.6,"egg tempera","France")]}
d = OrderedDict(sorted(d.items(), key=lambda x:x[1], reverse=True))
# -------------- ########## ------------ #
def contains(seq, key):
    """
    Helper function to search for a key
    in a nested sequence of sequences.
    """
    return any(key in el for el in seq)
def find_key(d, key):
    """
    Iterate through your `dict`, search each key's values for your keyword,
    and return each key-value pair that is a match.
    """
    tmp_dict = {}
    for k, v in d.items():
        for tup in v:
            if key in tup:
                tmp_dict[k] = tup
    return tmp_dict
print(find_key(d,'watercolor'))
Output:
{'M': ('Three', 1430, 100.0, 102.0, 'watercolor', 'France'),
'K': ('Five', 1922, 63.8, 48.1, 'watercolor', 'USA')}
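The output above maps each key to a single tuple, whereas the question's expected output maps each key to a list of matching tuples. A minimal sketch of a list-returning variant is shown below; it is an alternative to the answer's `find_key`, and the name `paintings` and the shortened sample data are assumptions made for this example.

```python
def find_keyword(dictionary, theword):
    # Keep only the records that contain the keyword as one of their fields.
    filtered = {
        key: [record for record in records if theword in record]
        for key, records in dictionary.items()
    }
    # Drop keys whose list of matches is empty.
    return {key: records for key, records in filtered.items() if records}


# Shortened sample data in the same shape as the question's dictionary.
paintings = {
    'M': [("One", 1400, 30.0, 20.5, "oil paint", "Austria"),
          ("Three", 1430, 100.0, 102.0, "watercolor", "France")],
    'P': [("Eight", 1460, 225.0, 200.0, "fresco", "Netherlands")],
    'K': [("Five", 1922, 63.8, 48.1, "watercolor", "USA"),
          ("Seven", 1950, 61.0, 61.0, "acrylic paint", "USA")],
}
result = find_keyword(paintings, 'watercolor')
print(result)
```

Since Python 3.7, plain dicts preserve insertion order, so the keys come back in the order they appear in the input without an OrderedDict. Note also that `theword in record` tests for an exact field match, so searching for 'tempera' would not match a record whose field is 'egg tempera'.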

Searching for keys in dictionary that have a value between two numbers

I need to analyze a dictionary for values that include a number between two given numbers (as parameters) and return those values preceded by their key
Dictionary:
{'P':[("Eight",1460, 225.0, 200.0, "fresco","Netherlands"),("Six",1465,81.0, 127.1, "tempera", "Netherlands")],
'V':[("Four",1661, 148.0, 257.0,"oil paint", "Austria"),("Two",1630, 91.0, 77.0, "oil paint","USA")],
'K':[("Five",1922,63.8,48.1,"watercolor","USA"),("Seven",1950,61.0,61.0,"acrylic paint","USA"),("Two",1965,81.3,100.3,"oil paint","United Kingdom")],
'C':[("Ten",1496,365.0,389.0,"tempera","Italy")],
'U':[("Nine",1203,182.0, 957.0,"egg tempera","Italy"), ("Twelve",1200,76.2,101.6,"egg tempera","France")]
}
The function should only return the values where a number between the two numbers is present. So if the function was called between_two_values it should return this if searching for values between 1464 and 1496:
between_two_values(dictionary1(), 1464, 1496)
{'P': [('Six', 1465, 81.0, 127.1, 'tempera',
'Netherlands')], 'C': [('Ten', 1496, 365.0,
389.0, 'tempera', 'Italy')]}
If one of the values of a key doesn't have a number between 1464 and 1496, it shouldn't be returned; only the values that have a number in that range should be returned, preceded by their key. This is why, in the above example, the first value for 'P', which has 1460, wasn't returned: it is not between the two numbers. The first number passed to the function should always be smaller than the second; if the first number is larger, the function should just return an empty dictionary.
This is the code I have come up with I don't think it's correct but it kind of shows the logic that could solve this function. I appreciate any help I receive
def between_two_values(dictionary,start,end):
    for x in dictionary:
        if end < x < start in dictionary:
            return dictionary(x)
You're on the right track. Here's one solution to the problem posed.
I've formatted the data better for clarity. When it was condensed down I didn't immediately see that each dictionary value was wrapped in a list. Of course this is a style-oriented change, but style helps with readability.
Note that I have made a few assumptions, such as that each dictionary value will be a list, so that your edge case of a key with no values is represented as [] rather than None. I have also extrapolated what I think the desired output is from the example you gave. Finally, you may consider using collections.defaultdict to simplify how matches are stored.
Besides that, this code is nothing fancy. You certainly could condense it down more, or use classes for semantics. Speaking of semantics, I recommend that you use better variable names than I did: "data", "record", and "value" are pretty generic, but I feel they helped explain the solution without me having insight as to what this data represents.
If you're using Python 2, consider using dictionary.iteritems() instead of dictionary.items().
Data
data = {
    'P': [
        ('Eight', 1460, 225.0, 200.0, 'fresco', 'Netherlands'),
        ('Six', 1465, 81.0, 127.1, 'tempera', 'Netherlands'),
    ],
    'V': [
        ('Four', 1661, 148.0, 257.0, 'oil paint', 'Austria'),
        ('Two', 1630, 91.0, 77.0, 'oil paint', 'USA'),
    ],
    'K': [
        ('Five', 1922, 63.8, 48.1, 'watercolor', 'USA'),
        ('Seven', 1950, 61.0, 61.0, 'acrylic paint', 'USA'),
        ('Two', 1965, 81.3, 100.3, 'oil paint', 'United Kingdom'),
    ],
    'C': [
        ('Ten', 1496, 365.0, 389.0, 'tempera', 'Italy'),
    ],
    'U': [
        ('Nine', 1203, 182.0, 957.0, 'egg tempera', 'Italy'),
        ('Twelve', 1200, 76.2, 101.6, 'egg tempera', 'France'),
    ],
}
Code
def between_two_values(dictionary, start, end):
    matches = {}
    for key, record_list in dictionary.items():
        for record in record_list:
            value = record[1]
            if start < value < end:
                if key in matches:
                    matches[key].append(record)
                else:
                    matches[key] = [record]
    return matches
result = between_two_values(data, 1464, 1496)
print(result)
Output
{'P': [('Six', 1465, 81.0, 127.1, 'tempera', 'Netherlands')]}
You can use a dict comprehension to construct a result, e.g.:
>>> {k: [e for e in v if 1464 < e[1] < 1496] for k, v in dictionary.items()}
{'C': [],
'K': [],
'P': [('Six', 1465, 81.0, 127.1, 'tempera', 'Netherlands')],
'U': [],
'V': []}
Then just eliminate the empty results:
def between_two_values(dictionary, start, end):
    result = {k: [e for e in v if start < e[1] < end] for k, v in dictionary.items()}
    return {k: v for k, v in result.items() if v}
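Both answers above use strict comparisons (`start < value < end`), so the record with 1496 is excluded, while the question's expected output includes it. The sketch below is a variant that assumes the bounds are meant to be inclusive and uses collections.defaultdict, as suggested earlier, to avoid the explicit `if key in matches` branch. The sample data is a shortened copy of the question's dictionary.

```python
from collections import defaultdict


def between_two_values(dictionary, start, end):
    matches = defaultdict(list)
    for key, records in dictionary.items():
        for record in records:
            # Assumption: both bounds are inclusive, as in the question's example.
            # If start > end, no value can satisfy this, so the result is empty.
            if start <= record[1] <= end:
                matches[key].append(record)
    # Convert back to a plain dict so missing keys raise KeyError as usual.
    return dict(matches)


data = {
    'P': [('Eight', 1460, 225.0, 200.0, 'fresco', 'Netherlands'),
          ('Six', 1465, 81.0, 127.1, 'tempera', 'Netherlands')],
    'V': [('Four', 1661, 148.0, 257.0, 'oil paint', 'Austria')],
    'C': [('Ten', 1496, 365.0, 389.0, 'tempera', 'Italy')],
}
result = between_two_values(data, 1464, 1496)
print(result)
```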
