Optimizing code to get keys of matching values in two dictionaries - python

I am looking for a way to improve the performance of my code: given two dictionaries, I need to find the keys of matching value pairs. So far I am iterating over both dictionaries in nested loops, which becomes very slow when both have up to 100,000 key-value pairs.
Given:
keys of both dictionaries are always numeric and sorted ascending
keys of both dictionaries refer to features of a QGIS layer I need to work with, so I really need to keep them this way
values of both dictionaries can have any datatype but both always do have the same datatype
values of both dictionaries are randomly filled
values can contain duplicates which may not be removed
Does anyone have an idea how I could improve the performance? Note that a well-founded "no, this is absolutely not possible" is also an acceptable answer, so I can finally stop trying and searching.
dict_a = {1:'abc',2:'def',3:'abc',4:'ghj',5:'klm',6:'nop',7:'def',8:'abc',9:'xyz',10:'abc'}
dict_b = {1:'abc',2:'a',3:'b',4:'xyz',5:'abc',6:'b',7:'c',8:'def',9:'d',10:'e'}
# imagine both dictionaries have up to 100000 entries...
# example of my desired output: for each dict_a key, the dict_b key(s) holding the same value
# (note that duplicate keys in a dict literal collapse, so only the last pair per key survives)
desired_matching_dict = {1:1,1:5,2:8,3:1,3:5,7:8,8:1,8:5,9:4,10:1,10:5}
matching_dict_slow = {}
matching_dict_fast = {}
# This will be very slow when having huge dictionaries...
for key_a, value_a in dict_a.items():
    for key_b, value_b in dict_b.items():
        if value_a == value_b:
            matching_dict_slow[key_a] = key_b
# Seeking an attempt to speed this up
# But getting lost...
for key, value in dict_a.items():
    if value in dict_b.items():
        if dict_a[key] == dict_b[key]:
            matching_dict_fast[key] = dict_a[key]
print('Slow method works: ' + str(desired_matching_dict == matching_dict_slow))
print('Fast method works: ' + str(desired_matching_dict == matching_dict_fast))

From the competitive programming problems I've generally faced, this simple approach should work fine: invert dict_b into a value-to-keys lookup, then look up each value of dict_a in it.
dict_a = {1:'abc',2:'def',3:'abc',4:'ghj',5:'klm',6:'nop',7:'def',8:'abc',9:'xyz',10:'abc'}
dict_b = {1:'abc',2:'a',3:'b',4:'xyz',5:'abc',6:'b',7:'c',8:'def',9:'d',10:'e'}
dic2 = {}
for i in dict_b.keys():
    elem = dict_b[i]
    if dic2.get(elem, None):
        dic2[elem].append(i)
    else:
        dic2[elem] = [i]
matches = {}
for i in dict_a.keys():
    elem = dict_a[i]
    x = dic2.get(elem, None)
    if x:
        matches[i] = x
print(matches)  # prints {1: [1, 5], 2: [8], 3: [1, 5], 7: [8], 8: [1, 5], 9: [4], 10: [1, 5]}
You can then access your feature pairs like this:
for k, v in matches.items():
    for key_b in v:
        print('desired pair: key (dict_a feature) = ' + str(k) + ' | value (dict_b feature) = ' + str(key_b))
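For comparison, the same inverted-index idea can be written a little more compactly with collections.defaultdict. This is only a minimal sketch of the approach above (not a different algorithm), and it runs in roughly O(len(dict_a) + len(dict_b)) instead of the quadratic nested loop:

from collections import defaultdict

index_b = defaultdict(list)  # value -> list of dict_b keys holding that value
for key_b, value_b in dict_b.items():
    index_b[value_b].append(key_b)

matches = {key_a: index_b[value_a]      # keep only dict_a keys whose value occurs in dict_b
           for key_a, value_a in dict_a.items()
           if value_a in index_b}
print(matches)  # {1: [1, 5], 2: [8], 3: [1, 5], 7: [8], 8: [1, 5], 9: [4], 10: [1, 5]}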

def dict_gen(a, b):
    for i in a:
        res = []
        for j in b:
            if a[i] == b[j]:
                res.append(j)
        if res:
            yield (i, res)

d = dict(dict_gen(dict_a, dict_b))
print(d)
Output:
{1: [1, 5], 2: [8], 3: [1, 5], 7: [8], 8: [1, 5], 9: [4], 10: [1, 5]}
[Finished in 0.1s]
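Keep in mind that this generator still compares every key pair, so it is O(len(a) × len(b)) just like the original nested loop; the 0.1 s finish only looks fast because the example dictionaries are tiny. A rough benchmark sketch (the sizes and the randomly generated test data below are assumptions purely for illustration) shows the gap against an inverted index:

import random
import string
import time

n = 2000  # small enough that the nested loop still finishes in reasonable time
pool = [''.join(random.choices(string.ascii_lowercase, k=3)) for _ in range(500)]
big_a = {i: random.choice(pool) for i in range(n)}
big_b = {i: random.choice(pool) for i in range(n)}

t0 = time.perf_counter()
slow = {}
for ka, va in big_a.items():            # quadratic: every pair is compared
    for kb, vb in big_b.items():
        if va == vb:
            slow.setdefault(ka, []).append(kb)
t1 = time.perf_counter()

index_b = {}
for kb, vb in big_b.items():            # linear: one pass to build the index
    index_b.setdefault(vb, []).append(kb)
fast = {ka: index_b[va] for ka, va in big_a.items() if va in index_b}
t2 = time.perf_counter()

print('nested loop:    ', round(t1 - t0, 3), 's')
print('inverted index: ', round(t2 - t1, 3), 's')
print('same result:    ', slow == fast)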

Related

How to group list of duplicate continuous value in a list with a recursion function?

I want to group consecutive values if they are duplicates, and each value should belong to just one group; see my examples below:
Note: results contains the indices of the values in test_list
test_list = ["1","2","1","2","1","1","5325235","2","62623","1","1"]
--->results = [[[0, 1], [2, 3]],
[[4, 5], [9, 10]]]
test_list = ["1","2","1","1","2","1","5325235","2","62623","1","2","1","236","2388","626236437","1","2","1","236","2388"]
--->results = [[[9, 10, 11, 12, 13], [15, 16, 17, 18, 19]],
[[0, 1, 2], [3, 4, 5]]]
I built a recursive function:
from collections import Counter, defaultdict

def group_duplicate_continuous_value(list_label_group):
    # to know which consecutive values are duplicates, I take the next number minus the previous number
    list_flag_grouping = [str(int(j.split("_")[0]) - int(i.split("_")[0])) + f"_{j}_{i}" for i, j in zip(list_label_group, list_label_group[1:])]
    # find the duplicate values in list_flag_grouping
    counter_elements = Counter(list_flag_grouping)
    list_have_duplicate = [k for k, v in counter_elements.items() if v > 1]
    if len(list_have_duplicate) > 0:
        list_final_index = group_duplicate_continuous_value(list_flag_grouping)
        # to return the exact values, I work with indices
        for k, v in list_final_index.items():
            temp_list = [v[i] + [v[i][-1] + 1] for i in range(0, len(v))]
            list_final_index[k] = temp_list
        check_first_cursive = list_label_group[0].split("_")
        # if there are several groups of duplicate consecutive values with different lengths,
        # the function below is needed to return the exact results
        if len(check_first_cursive) > 1:
            list_temp_index = find_index_duplicate(list_label_group)
            list_duplicate_index = list_final_index.values()
            list_duplicate_index = [val for sublist in list_duplicate_index for val1 in sublist for val in val1]
            for k, v in list_temp_index.items():
                list_index_v = [val for sublist in v for val in sublist]
                if any(x in list_index_v for x in list_duplicate_index) is False:
                    list_final_index[k] = v
        return list_final_index
    else:
        if len(list_label_group) > 0:
            check_first_cursive = list_label_group[0].split("_")
            if len(check_first_cursive) > 1:
                list_final_index = find_index_duplicate(list_label_group)
                return list_final_index
        list_final_index = None
        return list_final_index
Support function:
def find_index_duplicate(list_data):
    dups = defaultdict(list)
    for i, e in enumerate(list_data):
        dups[e].append([i])
    new_dict = {key: val for key, val in dups.items() if len(val) > 1}
    return new_dict
But when I run it with test_list = [5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,1,2,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,1,2,5,5,5], it is very slow and runs out of memory (~6 GB). I know the reason is the runaway recursion in group_duplicate_continuous_value, but I don't know how to fix it.
You can create a dict of lists, where every item from the original list is a key in the dict, and every key is mapped to the list of its indices in the original list. For instance, your list ["1","3","5","5","7","1","3","5"] would result in the dict {"1": [0, 5], "3": [1, 6], "5": [2, 3, 7], "7": [4]}.
Creating a dict of lists in this way is very idiomatic in Python, and fast too: it can be done by iterating over the list just once.
def build_dict(l):
    d = {}
    for i, x in enumerate(l):
        d.setdefault(x, []).append(i)
    return d
l = ["1","3","5","5","7","1","3","5"]
d = build_dict(l)
print(d)
# {'1': [0, 5], '3': [1, 6], '5': [2, 3, 7], '7': [4]}
Then you can iterate on the dict to build two lists of indices:
def build_index_results(l):
    d = build_dict(l)
    idx1, idx2 = [], []
    for v in d.values():
        if len(v) > 1:
            idx1.append(v[0])
            idx2.append(v[1])
    return idx1, idx2
print(build_index_results(l))
# ([0, 1, 2], [5, 6, 3])
Or using zip:
from operator import itemgetter
def build_index_results(l):
    d = build_dict(l)
    return list(zip(*map(itemgetter(0, 1), (v for v in d.values() if len(v) > 1))))
print(build_index_results(l))
# [(0, 1, 2), (5, 6, 3)]
I can't resist showcasing more_itertools.map_reduce for this:
from more_itertools import map_reduce
from operator import itemgetter
def build_index_results(l):
    d = map_reduce(enumerate(l),
                   keyfunc=itemgetter(1),
                   valuefunc=itemgetter(0),
                   reducefunc=lambda v: v[:2] if len(v) > 1 else None)
    return list(zip(*filter(None, d.values())))
print(build_index_results(l))
# [(0, 1, 2), (5, 6, 3)]

Loops and Dictionary

How do I loop over a list, using a dictionary, and return the value that repeats the most, and, if several values are repeated the same number of times, return the greatest of them?
Here is some context, with unfinished code:
def most_frequent(lst):
    dict = {}
    count, itm = 0, ''
    for item in lst:
        dict[item] = dict.get(item, 0) + 1
        if dict[item] >= count:
            count, itm = dict[item], item
    return itm
#lst = ["a","b","b","c","a","c"]
lst = [2, 3, 2, 2, 1, 3, 3,1,1,1,1] #this should return 1
lst2 = [2, 3, 2, 2, 1, 3, 3] # should return 3
print(most_frequent(lst))
Here is a different way to go about it:
def most_frequent(lst):
    # Simple check to ensure lst has something.
    if not lst:
        return -1
    # Organize your data as: {number: count, ...}
    dct = {}
    for i in lst:
        dct[i] = dct[i] + 1 if i in dct else 1
    # Iterate through your data and create a list of all largest-count elements.
    large_list, large_count = [], 0
    for num, count in dct.items():
        if count > large_count:
            large_count = count
            large_list = [num]
        elif count == large_count:
            large_list.append(num)
    # Return the largest element in the large_list list.
    return max(large_list)
There are many other ways to solve this problem, including using filter and other built-ins (see the Counter-based sketch after the checklist below), but this is intended to give you a working solution so that you can start thinking about how to optimize it further.
Things to take away from this; always ask:
How can I break this problem down into smaller parts?
How can I organize my data so that it is more useful and easier to manipulate?
What shortcuts can I use along the way to make this function easier/better/faster?
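As mentioned above, the built-ins can shorten this considerably. A minimal sketch using collections.Counter (standard library), with ties broken toward the larger value by comparing (count, value) pairs; this is an illustration of the idea, not the answer's original code:

from collections import Counter

def most_frequent_counter(lst):
    if not lst:
        return -1
    counts = Counter(lst)
    # max() compares (count, value) tuples, so equal counts fall back to the larger value
    return max(counts, key=lambda x: (counts[x], x))

print(most_frequent_counter([2, 3, 2, 2, 1, 3, 3, 1, 1, 1, 1]))  # 1
print(most_frequent_counter([2, 3, 2, 2, 1, 3, 3]))              # 3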
Your code produces the result you describe in your question, i.e. 1. However, your question states that you want to handle the case where two list elements are tied for the maximum number of occurrences and return the larger one. Tracking and returning a single element as you go therefore doesn't satisfy this requirement; you need to build the full dict and then evaluate the result.
def most_frequent(lst):
    dict = {}
    for item in lst:
        dict[item] = dict.get(item, 0) + 1
    itm = sorted(dict.items(), key=lambda kv: (-kv[1], -kv[0]))
    return itm[0]
#lst = ["a","b","b","c","a","c"]
lst = [2, 3, 2, 2, 2, 2, 1, 3, 3, 1, 1, 1, 1]  # 1 and 2 both occur 5 times; the larger value (2) should win
lst2 = [2, 3, 2, 2, 1, 3, 3] # should return 3
print(most_frequent(lst))
I edited the list 'lst' so that '1' and '2' both occur 5 times. The result returned is a tuple:
(2,5)
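If only the winning value is wanted rather than the (value, count) pair, a small variant of the same sorting idea (a sketch, not part of the original answer) just indexes one level deeper into the sorted result:

def most_frequent_value(lst):
    counts = {}
    for item in lst:
        counts[item] = counts.get(item, 0) + 1
    # sort by count descending, then by value descending, and keep only the value
    return sorted(counts.items(), key=lambda kv: (-kv[1], -kv[0]))[0][0]

print(most_frequent_value([2, 3, 2, 2, 2, 2, 1, 3, 3, 1, 1, 1, 1]))  # 2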
I reused your idea, which is quite neat, and just modified your program a bit.
def get_most_frequent(lst):
    counts = dict()
    most_frequent = (None, 0)  # (item, count)
    ITEM_IDX = 0
    COUNT_IDX = 1
    for item in lst:
        counts[item] = counts.get(item, 0) + 1
        if most_frequent[ITEM_IDX] is None:
            # first loop, most_frequent is still "None"
            most_frequent = (item, counts[item])
        elif counts[item] > most_frequent[COUNT_IDX]:
            # the current item's counter is bigger than most_frequent's counter
            most_frequent = (item, counts[item])
        elif counts[item] == most_frequent[COUNT_IDX] and item > most_frequent[ITEM_IDX]:
            # the current item's counter equals most_frequent's counter, but the item is larger
            most_frequent = (item, counts[item])
        else:
            pass  # do nothing
    return most_frequent
lst1 = [2, 3, 2, 2, 1, 3, 3, 1, 1, 1, 1, 2]  # 1: 5 times
lst2 = [2, 3, 1, 3, 3, 2, 2]                 # 2 and 3: 3 times each; 3 is larger
lst3 = [1]
lst4 = []
print(get_most_frequent(lst1))  # (1, 5)
print(get_most_frequent(lst2))  # (3, 3)
print(get_most_frequent(lst3))  # (1, 1)
print(get_most_frequent(lst4))  # (None, 0)

Building a custom Counter function without using built-ins

I have this code:
L = [1, 4, 7, 5, 5, 4, 5, 1, 1, 1]

def frequency(L):
    counter = 0
    number = L[0]
    for i in L:
        amount_times = L.count(i)
        if amount_times > counter:
            counter = amount_times
            number = i
    return number

print(frequency(L))
But I don't want to use the count function. I want the code to run without any built-in functions. How can I do this?
If you really want to reinvent collections.Counter, this is possible with and without list.count. However, I see no rationale.
Using list.count, you can use a dictionary comprehension. This is inefficient, as the list is traversed once for each distinct value.
def frequency2(L):
    return {i: L.count(i) for i in set(L)}
If you do not wish to use list.count, this is possible using if / else:
def frequency3(L):
    d = {}
    for i in L:
        if i in d:
            d[i] += 1
        else:
            d[i] = 1  # the first occurrence counts as 1
    return d
Then to extract the highest count(s):
maxval = max(d.values())
res = [k for k, v in d.items() if v == maxval]
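Since the original question asks to avoid built-in functions altogether, even max() can be replaced by a plain loop. A minimal sketch under that assumption (it still uses a dict and for loops, which are language constructs rather than functions):

def most_frequent_no_builtins(L):
    counts = {}
    for x in L:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    best_item, best_count = None, 0
    for item in counts:           # manual "max" over the counts
        if counts[item] > best_count:
            best_item, best_count = item, counts[item]
    return best_item

print(most_frequent_no_builtins([1, 4, 7, 5, 5, 4, 5, 1, 1, 1]))  # 1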
You could try this one; not sure whether it is acceptable to you. It finds the most frequent item in a list without using built-ins:
L = [1, 4, 7, 5, 5, 4, 5, 1, 1, 1]

def frequency(L):
    count, item = 0, ''
    d = {i: 0 for i in L}
    for i in L[::-1]:
        d[i] = d[i] + 1
        if d[i] >= count:
            count = d[i]
            item = i
    return item

print(frequency(L))
# 1

Take the mean of values in a list if a duplicate is found

I have 2 lists which are associated with each other. E.g., here, 'John' is associated with '1', 'Bob' is associated with 4, and so on:
l1 = ['John', 'Bob', 'Stew', 'John']
l2 = [1, 4, 7, 3]
My problem is with the duplicate John. Instead of adding the duplicate John, I want to take the mean of the values associated with the Johns, i.e., 1 and 3, which is (3 + 1)/2 = 2. Therefore, I would like the lists to actually be:
l1 = ['John', 'Bob', 'Stew']
l2 = [2, 4, 7]
I have experimented with some solutions including for-loops and the "contains" function, but can't seem to piece it together. I'm not very experienced with Python, but linked lists sound like they could be used for this.
Thank you
I believe you should use a dict. :)
import numpy as np

def mean_duplicate(l1, l2):
    ret = {}
    # Iterating through both lists...
    for name, value in zip(l1, l2):
        if name not in ret:
            # If the key doesn't exist, create it.
            ret[name] = value
        else:
            # If it already does exist, add the new value to it.
            ret[name] += value
    # Then for the average you're looking for...
    for key, value in ret.items():
        ret[key] = value / l1.count(key)
    return ret

def median_between_listsElements(l1, l2):
    ret = {}
    for name, value in zip(l1, l2):
        # Creating key + list if it doesn't exist.
        if name not in ret:
            ret[name] = []
        ret[name].append(value)
    for key, value in ret.items():
        ret[key] = np.median(value)
    return ret

l1 = ['John', 'Bob', 'Stew', 'John']
l2 = [1, 4, 7, 3]
print(mean_duplicate(l1, l2))
print(median_between_listsElements(l1, l2))
# {'John': 2.0, 'Bob': 4.0, 'Stew': 7.0}
# {'John': 2.0, 'Bob': 4.0, 'Stew': 7.0}
The following might give you an idea. It uses an OrderedDict assuming that you want the items in the order of appearance from the original list:
from collections import OrderedDict

d = OrderedDict()
for x, y in zip(l1, l2):
    d.setdefault(x, []).append(y)
# OrderedDict([('John', [1, 3]), ('Bob', [4]), ('Stew', [7])])

names, values = zip(*((k, sum(v)/len(v)) for k, v in d.items()))
# ('John', 'Bob', 'Stew')
# (2.0, 4.0, 7.0)
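On Python 3.7+ a plain dict already preserves insertion order, so the same grouping works without OrderedDict; here is a minimal sketch using statistics.mean from the standard library (just an alternative illustration, not the answer above):

from statistics import mean

l1 = ['John', 'Bob', 'Stew', 'John']
l2 = [1, 4, 7, 3]

groups = {}
for name, value in zip(l1, l2):
    groups.setdefault(name, []).append(value)   # name -> list of its values

names = list(groups)
values = [mean(v) for v in groups.values()]
print(names)   # ['John', 'Bob', 'Stew']
print(values)  # [2, 4, 7]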
Here is a shorter version using a dict:
final_dict = {}
l1 = ['John', 'Bob', 'Stew', 'John']
l2 = [1, 4, 7, 3]

for i in range(len(l1)):
    if final_dict.get(l1[i]) is None:
        final_dict[l1[i]] = l2[i]
    else:
        final_dict[l1[i]] = int((final_dict[l1[i]] + l2[i]) / 2)

print(final_dict)
Something like this:
#!/usr/bin/python
l1 = ['John', 'Bob', 'Stew', 'John']
l2 = [1, 4, 7, 3]

d = {}
for i in range(0, len(l1)):
    key = l1[i]
    if key in d:
        d[key].append(l2[i])
    else:
        d[key] = [l2[i]]

r = []
for key, values in d.items():
    r.append((key, sum(values) / len(values)))
print(r)
Hope the following code helps:
l1 = ['John', 'Bob', 'Stew', 'John']
l2 = [1, 4, 7, 3]

def remove_repeating_names(names_list, numbers_list):
    new_names_list = []
    new_numbers_list = []
    for first_index, first_name in enumerate(names_list):
        amount_of_occurencies = 1
        number = numbers_list[first_index]
        for second_index, second_name in enumerate(names_list):
            # Check if names match and
            # if this name wasn't read in earlier cycles or is not the same element.
            if second_name == first_name:
                if first_index < second_index:
                    number += numbers_list[second_index]
                    amount_of_occurencies += 1
                # Break the loop if this name was read earlier.
                elif first_index > second_index:
                    amount_of_occurencies = -1
                    break
        if amount_of_occurencies != -1:
            new_names_list.append(first_name)
            new_numbers_list.append(number / amount_of_occurencies)
    return [new_names_list, new_numbers_list]

# Unmodified arrays
print(l1)
print(l2)
l1, l2 = remove_repeating_names(l1, l2)
# If you want the numbers list to contain integers, not floats, uncomment the following line:
# l2 = [int(number) for number in l2]
# Modified arrays
print(l1)
print(l2)

How to get the keys from value in python?

I'm trying to solve a question which expects the following output:
>>> frequency([13,12,11,13,14,13,7,11,13,14,12,14,14])
ANSWER: ([7], [13, 14])
Basically, it should return the lists of the LOWEST and HIGHEST frequency elements.
I'm using collections.Counter(), so I got this:
Counter({13: 4, 14: 4, 11: 2, 12: 2, 7: 1})
I extracted the keys and values, and I also got the values sorted in one list. Now I want to get the keys that have the lowest and the highest counts, so that I can generate the lists from that.
I don't know how to do that.
Not the most pythonic way, but easy to understand for the beginner.
from collections import Counter
L = [13,12,11,13,14,13,7,11,13,14,12,14,14]
answer_min = []
answer_max = []
d = Counter(L)
min_value = min(d.values())
max_value = max(d.values())
for k, v in d.items():
    if v == min_value:
        answer_min.append(k)
    if v == max_value:
        answer_max.append(k)
answer = (answer_min, answer_max)
answer
Gives us ([7], [13, 14]). It looks like you only needed to know about dictionary.items() to solve this.
You can take the minimum and maximum values first, then build the list of keys at those values with list comprehensions:
from collections import Counter

c = Counter({13: 4, 14: 4, 11: 2, 12: 2, 7: 1})
values = c.values()
mn, mx = min(values), max(values)
mins = [k for k, v in c.items() if v == mn]
maxs = [k for k, v in c.items() if v == mx]
print((mins, maxs))
# ([7], [13, 14])
You can try this:
import collections
s = [13,12,11,13,14,13,7,11,13,14,12,14,14]
count = collections.Counter(s)
mins = [a for a, b in count.items() if b == min(count.values())]
maxes = [a for a, b in count.items() if b == max(count.values())]
final_vals = [mins, maxes]
print(final_vals)
Output:
[[7], [13, 14]]
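Counter.most_common() gives another route: it returns (value, count) pairs sorted by count in descending order, so the largest and smallest counts sit at the two ends of that list. A minimal sketch along those lines (just an alternative, equivalent to the answers above):

from collections import Counter

L = [13, 12, 11, 13, 14, 13, 7, 11, 13, 14, 12, 14, 14]
ordered = Counter(L).most_common()        # [(13, 4), (14, 4), ..., (7, 1)]
max_count, min_count = ordered[0][1], ordered[-1][1]
lows = [k for k, c in ordered if c == min_count]
highs = [k for k, c in ordered if c == max_count]
print((lows, highs))  # ([7], [13, 14])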
