Optimal algorithm for comparing two dictionaries in Python 3

I have a list of dictionaries like:
Stock=[
{'ID':1,'color':'red','size':'L','material':'cotton','weight':100,'length':300,'location':'China'},
{'ID':2,'color':'green','size':'M','material':'cotton','weight':200,'length':300,'location':'China'},
{'ID':3,'color':'blue','size':'L','material':'cotton','weight':100,'length':300,'location':'China'}
]
And another list of dictionaries like:
Prices=[
{'color':'red','size':'L','material':'cotton','weight':100,'length':300,'location':'China'},
{'color':'blue','size':'S','weight':500,'length':150,'location':'USA','cost':'1$'},
{'color':'pink','size':'L','material':'cotton','location':'China','cost':'5$'},
{'cost':'5$','color':'blue','size':'L','material':'silk','weight':100,'length':300}
]
So I need to find the 'cost' for each record in Stock from Prices. But there may be a situation where I don't find a 100% coincidence of dict elements, and in that case I need the most similar element so I can take its 'cost'.
output=[{'ID':1,'cost':'1$'},{'ID':2,'cost':'5$'},...]
Please suggest an optimal solution for this task. I think it's something like a loop from highest to lowest compliance: try to find a record with the maximum coincidence, and if none is found, try a less strict matching condition.

How about this:
Stock=[
{'ID':1,'color':'red','size':'L','material':'cotton','weight':100,'length':300,'location':'China'},
{'ID':2,'color':'green','size':'M','material':'cotton','weight':200,'length':300,'location':'China'},
{'ID':3,'color':'blue','size':'L','material':'cotton','weight':100,'length':300,'location':'China'}
]
Prices=[
{'color':'red','size':'L','material':'cotton','weight':100,'length':300,'location':'China'},
{'cost':'2$','color':'blue','size':'S','weight':500,'length':150,'location':'USA'},
{'cost':'5$','color':'pink','size':'L','material':'cotton','location':'China'},
{'cost':'15$','color':'blue','size':'L','material':'silk','weight':100,'length':300}
]
Prices = [p for p in Prices if "cost" in p]  # make sure every entry has a 'cost'
result = []
for s in Stock:
    field = set(s.items())
    best_match = max(Prices, key=lambda p: len(field.intersection(p.items())), default=None)
    if best_match:
        result.append({"ID": s["ID"], "cost": best_match["cost"]})
print(result)
# [{'ID': 1, 'cost': '5$'}, {'ID': 2, 'cost': '5$'}, {'ID': 3, 'cost': '15$'}]
To find the most similar entry, I first transform each dict into a set of its items, then use max with a lambda key to pick the price entry whose items have the largest intersection with the stock entry I'm checking.
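To make the scoring concrete, here is a tiny sketch (with made-up dicts a and b) of what the items-intersection counts:
a = {'color': 'red', 'size': 'L', 'weight': 100}
b = {'color': 'red', 'size': 'M', 'weight': 100}

# dict.items() gives (key, value) pairs; intersecting the two sets
# keeps only the pairs that agree in both key and value
overlap = set(a.items()) & set(b.items())
print(len(overlap))  # 2 -- this count is the similarity score max() ranks by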

This reminds me of fuzzy or neural network solutions.
Anyway, here is a NumPy solution (originally written for Python 2; updated to Python 3 below):
Stock=[
{'ID':1,'color':'red','size':'L','material':'cotton','weight':100,'length':300,'location':'China'},
{'ID':2,'color':'green','size':'M','material':'cotton','weight':200,'length':300,'location':'China'},
{'ID':3,'color':'blue','size':'L','material':'cotton','weight':100,'length':300,'location':'China'}
]
Prices=[
{'color':'red','size':'L','material':'cotton','weight':100,'length':300,'location':'China'},
{'cost':2,'color':'blue','size':'S','weight':500,'length':150,'location':'USA'},
{'cost':5,'color':'pink','size':'L','material':'cotton','location':'China'},
{'cost':15,'color':'blue','size':'L','material':'silk','weight':100,'length':300}
]
import numpy as np
# drop records that have no 'cost'; building a new list avoids the
# bug of removing items from a list while iterating over it
Prices = [p for p in Prices if 'cost' in p]
def numerize(lst_of_dics):
    # encode each record as a fixed-length numeric vector:
    # three numbers per field, zeros when the field is unknown
    r = []
    for d in lst_of_dics:
        r1 = []
        for n in ['color', 'size', 'material', 'weight', 'length', 'location']:
            try:
                if n in ('color', 'material', 'location'):
                    # only the first 3 letters -- crude, collisions are possible
                    r1 += [ord(d[n][0]), ord(d[n][1]), ord(d[n][2])]
                elif n == 'size':
                    r1 += [ord(d[n])] * 3
                else:
                    # numeric fields: weight and length
                    r1 += [d[n]] * 3
            except (KeyError, IndexError):
                r1 += [0, 0, 0]
        r.append(r1)
    return r
St = numerize(Stock)
Pr = np.array(numerize(Prices))
output = []
for i, s in enumerate(St):
    # stage 0: tile the stock vector so it can be subtracted elementwise
    s0 = np.reshape(s * Pr.shape[0], Pr.shape)
    # absolute difference
    s1 = abs(Pr - s0)
    # only count differences where both sides actually have the field
    s2 = s1 * Pr.astype('bool') * s0.astype('bool')
    # a missing field doesn't mean a match: add a flat penalty of 25
    s21 = np.logical_not(Pr.astype('bool') * s0.astype('bool')) * 25
    s2 = s2 + s21
    # total distance per price record
    s3 = np.sum(s2, 1)
    # take the closest one
    s4 = np.where(s3 == np.min(s3))[0][0]
    c = Prices[s4]['cost']
    output.append({'ID': i + 1, 'cost': c})
print(output)
That gives the following result (with many assumptions):
[{'cost': 15, 'ID': 1}, {'cost': 5, 'ID': 2}, {'cost': 15, 'ID': 3}]
Note that this comparison is based on both the values and the kinds of the properties.
Please upvote and accept the answer if it satisfies you.

Related

Fastest way to iterate permutation with value guidelines

I have an array of dicts from which I need every combination, without duplicates, under two constraints: no repeated id value within a combination, and the ratio values must sum to 1.0.
So the results would be:
results = [
    [
        {'id': 1, 'ratio': .01},
        {'id': 2, 'ratio': .99},
    ],
    [
        {'id': 1, 'ratio': .50},
        {'id': 2, 'ratio': .50},
    ],
    [ ... ],
    [ ... ],
]
For example:
_array_list = [
    {'id': 1, 'ratio': .01},
    {'id': 1, 'ratio': .02},
    ....
    {'id': 2, 'ratio': .01},
    {'id': 3, 'ratio': .02},
    ...
]
Each id has ratios from .01 to 1.0 in steps of .01.
I then do the following to get each possible combination (there is a reason for this, but I am leaving out the stuff that has nothing to do with the issue):
from itertools import combinations

unique_list_count = 2  # this is the number of unique ids
iter_count = 0
_iter_count = 0
all_combos = []
for i in range(1, unique_list_count + 1):
    for combo in combinations(_array_list, i):
        _iter_count += 1
        ids = [c['id'] for c in combo]
        is_id_duplicate = len(ids) != len(set(ids))
        if not is_id_duplicate:
            # make sure only appending full values
            if sum(v['ratio'] for v in combo) == 1.0:
                iter_count += 1
                print(iter_count, _iter_count)
                all_combos.append(list(combo))
I'm not sure if this is a good way or if I can make it better, but it works. The issue is that when I have 5 IDs, each with 100 dictionaries, it does about 600,000,000 combinations and takes about 20 minutes.
Is there a more efficient and faster way to do this?
You could use the code below. The advantage of using it is that it won't consider cases with repeating ids:
import itertools
from math import isclose

def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return itertools.chain.from_iterable(itertools.combinations(s, r) for r in range(len(s)+1))

def combosSumToOneById(inDictArray):
    results = []
    uniqueIds = {d['id'] for d in inDictArray}
    valuesDict = {id: [d['ratio'] for d in inDictArray if d['id'] == id] for id in uniqueIds}
    for idCombo in powerset(uniqueIds):
        for valueCombo in itertools.product(*[v for k, v in valuesDict.items() if k in idCombo]):
            if isclose(sum(valueCombo), 1.0):
                results.append([{'id': xid, 'ratio': xv} for xid, xv in zip(idCombo, valueCombo)])
    return results
I tested it on the input below:
_array_list = [
    {'id': '1', 'ratio': .1},
    {'id': '1', 'ratio': .2},
    {'id': '2', 'ratio': .9},
    {'id': '2', 'ratio': .8},
    {'id': '3', 'ratio': .8}
]
combosSumToOneById(_array_list)
Returns: [[{'id': '1', 'ratio': 0.1}, {'id': '2', 'ratio': 0.9}], [{'id': '1', 'ratio': 0.2}, {'id': '2', 'ratio': 0.8}], [{'id': '1', 'ratio': 0.2}, {'id': '3', 'ratio': 0.8}]]
You should test whether the performance really exceeds the previous approach.
Please note that I modified the code to check isclose(sum, 1.0) rather than sum == 1. Since we are summing floating-point values, there will most likely be some error from the representation of the numbers, which is why this condition seems more appropriate.
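A quick demonstration of that representation error (plain Python, nothing specific to this problem):
from math import isclose

# ten ratios of 0.1 "obviously" sum to 1.0, but not in binary floating point
total = sum([0.1] * 10)
print(total)                # 0.9999999999999999
print(total == 1.0)         # False -- an exact comparison misses this combo
print(isclose(total, 1.0))  # True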
Until someone who understands the algorithm better than I do comes along, I don't think there's any way of speeding that up with the data types you have.
With the algorithm you are currently using:
can you pre-sort your data and filter some branches out that way?
is the ratio sum test more likely to fail than the duplicate test? If so, move it above.
drop the print (obviously)
avoid the cast to list from tuple when appending
And then use a multiprocessing.Pool() to use all your CPUs at once. Since this is CPU-bound, it will get you a reasonable speed-up (see the sketch after this paragraph).
But I'm sure there is a more efficient way of doing this. You haven't said how you're getting your data, but if you can represent it in an array it might be vectorisable, which would be orders of magnitude faster.
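As a rough illustration of the Pool idea, here is a minimal sketch; check_combo is a hypothetical worker that applies the duplicate-id and ratio-sum tests to a single combination, and the tolerance value is an assumption:
from itertools import combinations
from multiprocessing import Pool

def check_combo(combo):
    # hypothetical worker: return the combo if it passes both tests, else None
    ids = [c['id'] for c in combo]
    if len(ids) != len(set(ids)):
        return None
    if abs(sum(c['ratio'] for c in combo) - 1.0) > 1e-9:
        return None
    return combo

if __name__ == '__main__':
    _array_list = [{'id': i, 'ratio': r / 100} for i in (1, 2) for r in range(1, 101)]
    with Pool() as pool:
        # a large chunksize amortises the inter-process overhead of many tiny tasks
        matches = [c for c in pool.imap_unordered(
            check_combo, combinations(_array_list, 2), chunksize=1000) if c]
    print(len(matches))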
I assume the general case, where not every id has all the values in [0.01, 1.0].
There are 3 main optimisations you can make and they all aim to instantly drop branches that are guaranteed to not satisfy your conditions.
1. Split the ratios of each id in a dictionary
This way you instantly avoid pointless combinations, e.g., [{'id': 1, 'ratio': 0.01}, {'id': 1, 'ratio': 0.02}]. It also makes it easier to try combinations between ids. So, instead of having everything in a flat list of dicts, reorganise the data in the following form:
# if your ids are 0-based consecutive integer numbers, a list of lists works too
array_list = {
1: [0.01, 0.02, ...],
2: [0.01, 0.02, ...],
3: [...],
}
2. For an N-size pairing, you have N-1 degrees of freedom
If you're searching for a triplet and you already have (0.54, 0.33, _), you don't have to search all possible values for the last id. There is only one that can satisfy the condition sum(ratios) == 1.0.
3. You can further restrict the possible value range of each id based on the min/max values of the others.
Say you have 3 ids and they all have all the values in [0.01, 0.44]. It is pointless to try any combinations for (0.99, _, _), because the minimum sum for the last two ids is 0.02. Therefore, the maximum value that the first id can explore is 0.98 (well, 0.44 in this example but you get my drift). Similarly, if the maximum sum of the last two ids is 0.88, there is no reason to explore values below 0.12 for the first id. A special case of this is where the sum of the minimum value of all ids is more than 1.0 (or the max < 1.0), in which case you can instantly drop this combination and move on.
Using integers instead of floats
You are blessed in dealing only with some discrete values, so you're better off converting everything to integers. The first reason is to avoid any headaches with floating-point arithmetic. Case in point: did you know that your code misses some combinations exactly because of these inaccuracies?
And since you will be generating your own value ranges due to optimisation #3, it's much simpler to do for i in range(12, 99) than some roundabout way to generate all values in [0.12, .99) while making sure everything is properly rounded off at the second decimal digit AND THEN properly added together and checked against some tolerance value close to 1.0.
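For instance, a quick illustration of the failure mode when converting (plain Python):
ratio = 0.29
print(ratio * 100)              # 28.999999999999996
print(int(ratio * 100))         # 28 -- truncation silently corrupts the value
print(int(round(ratio * 100)))  # 29 -- round before converting, as the code below does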
Code
from collections import defaultdict
import itertools as it

def combined_sum(values):
    def _combined_sum(values, comb, partial_sum, depth, mins, maxs):
        if depth == len(values) - 1:
            required = 100 - partial_sum
            if required in values[-1]:
                yield comb + (required,)
        else:
            min_value = mins[depth+1]
            max_value = maxs[depth+1]
            start_value = max(min(values[depth]), 100 - partial_sum - max_value)
            end_value = max(1, 100 - partial_sum - min_value)
            for value in range(start_value, end_value+1):
                if value in values[depth]:
                    yield from _combined_sum(values, comb+(value,), partial_sum+value, depth+1, mins, maxs)

    # precompute all the partial min/max sums, because we will be using them a lot
    mins = [sum(min(v) for v in values[i:]) for i in range(len(values))]
    maxs = [sum(max(v) for v in values[i:]) for i in range(len(values))]
    if mins[0] > 100 or maxs[0] < 100:
        return []
    return _combined_sum(values, tuple(), 0, 0, mins, maxs)
def asset_allocation(array_list, max_size):
    # a set is preferred here because we will be checking a lot whether
    # a value is in the iterable, which is faster in a set than in a tuple/list
    collection = defaultdict(set)
    for d in array_list:
        collection[d['id']].add(int(round(d['ratio'] * 100)))
    all_combos = []
    for i in range(1, max_size+1):
        for ids in it.combinations(collection.keys(), i):
            values = [collection[ID] for ID in ids]
            for group in combined_sum(values):
                all_combos.append([{'id': ID, 'ratio': ratio/100} for ID, ratio in zip(ids, group)])
    return all_combos

array_list = [{'id': ID, 'ratio': ratio/100}
              for ID in (1, 2, 3, 4, 5)
              for ratio in range(1, 101)]
max_size = 5
result = asset_allocation(array_list, max_size)
This finishes in 14-15 seconds on my machine.
For comparison, for 3 ids this finishes in 0.007 seconds, and Gabor's solution, which effectively implements only optimisation #1, finishes in 0.18 seconds. For 4 ids it's .43 s and 18.45 s respectively. For 5 ids I stopped timing his solution after a few minutes, but it was expected to take at least 10 minutes.
If you are dealing with the case where all ids have all the values in [0.01, 1.0] and you insist on having the specific output indicated in your question, the above approach is still optimal. However, if you are okay with generating the output in a different format, you can do better.
For a specific group size, e.g., singles, pairs, triplets, etc, generate all the partitions that add up to 100 using the stars and bars approach. That way, instead of generating (1, 99), (2, 98), etc, for each pair of ids, i.e., (1, 2), (1, 3) and (2, 3), you do this only once.
I've modified the code from here to not allow for 0 in any partition.
import itertools as it

def partitions(n, k):
    # compositions of n into k positive parts (zeros excluded)
    for c in it.combinations(range(1, n), k-1):
        yield tuple(b-a for a, b in zip((0,)+c, c+(n,)))

def asset_allocation(ids, max_size):
    all_combos = []
    for k in range(1, max_size+1):
        id_comb = tuple(it.combinations(ids, k))
        p = tuple(partitions(100, k))
        all_combos.append((id_comb, p))
    return all_combos

ids = (1, 2, 3, 4, 5)
result = asset_allocation(ids, 5)
This finishes much faster, takes up less space, and also allows you to home in on the combinations for singles, pairs, etc., individually. Now, if you were to take the product of id_comb and p to generate the output in your question, you'd lose all that time saved. In fact, it'd come out a bit slower than the general method above, but at least this piece of code is still more compact.
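For completeness, a sketch of that product expansion (with the caveat just mentioned that it gives up the speed advantage):
import itertools as it

def expand(all_combos):
    # each entry pairs size-k id combinations with k-part partitions of 100,
    # so zipping ids with partition parts always lines up
    for id_comb, p in all_combos:
        for ids, parts in it.product(id_comb, p):
            yield [{'id': i, 'ratio': n / 100} for i, n in zip(ids, parts)]

flat = list(expand(asset_allocation((1, 2, 3), 3)))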

How to sort a list of dictionaries based on the value of one certain key without using sort() in Python

I am trying to sort a large JSON file of Steam games in descending order based on the value of the key positive_ratings, without using the built-in sort() function.
small_example = [
{'id':10,'game':'Counterstrike','positive_ratings':150},
{'id':20,'game':'Bioshock Infinite','positive_ratings':50},
{'id':30,'game':'Rust','positive_ratings':300},
{'id':40,'game':'Portal','positive_ratings':200}
]
The output in descending order would be the following list:
['Rust', 'Portal', 'Counterstrike', 'Bioshock Infinite']
For school we had to make a quick sort function that sorts lists like the one below. Now I would like to rewrite it so it sorts a list of dictionaries.
def quick_sort(sequence):
    length = len(sequence)
    if length <= 1:
        return sequence
    else:
        centre = sequence.pop()
        items_bigger = []
        items_smaller = []
        for item in sequence:
            if item > centre:
                items_bigger.append(item)
            else:
                items_smaller.append(item)
        return quick_sort(items_smaller) + [centre] + quick_sort(items_bigger)

print(quick_sort([1,2,5,6,2,10,34,54,23,1]))
In your code, you sort the list based on each element's value. But what you want is to sort the list based on element['positive_ratings']. You just need to alter the code a little bit:
def quick_sort(sequence):
    length = len(sequence)
    if length <= 1:
        return sequence
    else:
        centre = sequence.pop()
        items_bigger = []
        items_smaller = []
        for item in sequence:
            if item['positive_ratings'] > centre['positive_ratings']:  # I changed only this line
                items_bigger.append(item)
            else:
                items_smaller.append(item)
        return quick_sort(items_smaller) + [centre] + quick_sort(items_bigger)
The built-in sort function also works like that; you just specify the key:
some_list.sort(key= lambda x: x['positive_ratings'])
We can adjust your code to look similar to sort function:
def quick_sort(sequence, key=lambda x: x):
    length = len(sequence)
    if length <= 1:
        return sequence
    else:
        centre = sequence.pop()
        items_bigger = []
        items_smaller = []
        for item in sequence:
            if key(item) > key(centre):  # I changed only this line
                items_bigger.append(item)
            else:
                items_smaller.append(item)
        return quick_sort(items_smaller, key) + [centre] + quick_sort(items_bigger, key)
You can call it like this:
quick_sort(small_example, key = lambda x: x['positive_ratings'])
Edit: I forgot to add the key in the last line. Thanks to @DarrylG, I fixed that.
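Since the question ultimately wants the game names in descending order of positive_ratings, here is a small usage sketch with the generalized version above; negating the key reverses the order (note that quick_sort pops from its argument, so pass a copy):
ordered = quick_sort(list(small_example), key=lambda x: -x['positive_ratings'])
print([g['game'] for g in ordered])
# ['Rust', 'Portal', 'Counterstrike', 'Bioshock Infinite']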
You can sort the example by sorting the data based on the key positive_ratings, i.e., sort the positive_ratings values first and then build the output from that order:
small_example = [
{'id':10,'game':'Counterstrike','positive_ratings':150},
{'id':20,'game':'Bioshock Infinite','positive_ratings':50},
{'id':30,'game':'Rust','positive_ratings':300},
{'id':40,'game':'Portal','positive_ratings':200}
]
def func(data, key: str):
    dic = {}
    for i in data:
        if i[key] not in dic:
            dic[i[key]] = [i]
        else:
            dic[i[key]].append(i)
    dic_key = list(dic.keys())
    # sorting dic_key, i.e. the positive_ratings values; you can
    # use any sort algorithm here
    for i in range(len(dic_key)):
        for j in range(i+1, len(dic_key)):
            if dic_key[i] > dic_key[j]:
                dic_key[i], dic_key[j] = dic_key[j], dic_key[i]
    result = []
    for i in dic_key:
        result.extend(dic[i])
    return result

sol = func(small_example, 'positive_ratings')
print(sol)
output
[{'id': 20, 'game': 'Bioshock Infinite', 'positive_ratings': 50},
{'id': 10, 'game': 'Counterstrike', 'positive_ratings': 150},
{'id': 40, 'game': 'Portal', 'positive_ratings': 200},
{'id': 30, 'game': 'Rust', 'positive_ratings': 300}]

Get all min values from dictionary list

I have the following snippet:
list = [{"num":1,"test":"A"},{"num":6,"test":"B"},{"num":5,"test":"c"},{"num":1,"test":"D"}]
min = None
for x in list:
if x["num"]<min or min==None:
min=x["num"]
print(min)
print([index for index, element in enumerate(list)
if min == element["num"]])
This doesn't really output anything useful; my objective was, as said in the title, to output the dictionaries with 1 in num.
A noob question, I know, but this is my first contact with the language.
Thanks!
min() takes a key argument that lets you specify how to calculate the min. This will let you find an object with the minimum num value. You can then use that to find all of them with a list comprehension (or a similar method).
l = [{"num":1,"test":"A"},{"num":6,"test":"B"},{"num":5,"test":"c"},{"num":1,"test":"D"}]
m = min(l, key=lambda d: d['num'])
# {'num': 1, 'test': 'A'}
[item for item in l if item['num'] == m['num']]
# [{'num': 1, 'test': 'A'}, {'num': 1, 'test': 'D'}]
You need to set min to an arbitrarily large number at the beginning of the program. I set it to 500. You then have to make sure you're checking whether the "num" value is less than or equal to min, otherwise it will not grab both 1 values.
list = [{"num":1,"test":"A"},{"num":6,"test":"B"},{"num":5,"test":"c"},{"num":1,"test":"D"}]
min = 500
for x in list:
if x["num"]<=min or min==None:
min=x["num"]
print(x)
print(min)
You can try with this:
list_=[{"num":1,"test":"A"},{"num":6,"test":"B"},{"num":5,"test":"c"},{"num":1,"test":"D"}]
min_=min(list_,key=lambda x: x["num"])
min_ = min_["num"]
l=list(filter(lambda x: x["num"]==min_,list_))
print(l)
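For what it's worth, here is a single-pass alternative (a sketch, not from the answers above) that tracks the current minimum and collects ties as it goes, avoiding the second scan of the list:
data = [{"num": 1, "test": "A"}, {"num": 6, "test": "B"},
        {"num": 5, "test": "c"}, {"num": 1, "test": "D"}]

smallest, matches = None, []
for d in data:
    if smallest is None or d["num"] < smallest:
        smallest, matches = d["num"], [d]  # new minimum: start over
    elif d["num"] == smallest:
        matches.append(d)                  # tie with the minimum: keep it
print(matches)
# [{'num': 1, 'test': 'A'}, {'num': 1, 'test': 'D'}]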

Python Dictionary - find average of value from other values

I have the following list
count = 3.5, price = 2500
count = 3, price = 400
count = 2, price = 3000
count = 3.5, price = 750
count = 2, price = 500
I want to find the average price for all where the count is the same. For example:
count = 2, price = 3000
count = 2, price = 500
3000 + 500 = 3500
3500/2 = 1750
Avg for 'count 2' is 1750
Here's my code so far
avg_list = [value["average"] for value in dictionary_database_list]
counter_obj = collections.Counter(count_list)
print("AVG:")
for i in counter_obj:
    print(i, counter_obj[i])
I'll admit I'm not 100% clear on what you're looking for here, but I'll give it a shot:
A good strategy when you want to iterate over a list of "things" and accumulate some kind of information about "the same kind of thing" is to use a hash table. In Python, we usually use a dict for algorithms that require a hash table.
To collect enough information to get the average price for each item in your list, we need:
a) the total number of items with a specific "count"
b) the total price of items with a specific "count"
So let's build a data structure that maps a "count" to a dict containing "total items" and "total price" for the item with that "count".
Let's take our input in the format:
item_list = [
{'count': 3.5, 'price': 2500},
{'count': 3, 'price': 400},
{'count': 2, 'price': 3000},
{'count': 3.5, 'price': 750},
{'count': 2, 'price': 500},
]
Now let's map the info about "total items" and "total price" in a dict called items_by_count:
for item in item_list:
    count, price = item['count'], item['price']
    items_by_count[count]['total_items'] += 1
    items_by_count[count]['total_price'] += price
But wait! items_by_count[count] will throw a KeyError if count isn't already in the dict. This is a good use case for defaultdict. Let's define the default value of a count we've never seen before as 0 total price, and 0 total items:
from collections import defaultdict

items_by_count = defaultdict(lambda: {
    'total_items': 0,
    'total_price': 0
})
Now our code won't throw an exception every time we see a new value for count.
Finally, we need to actually take the average. Let's get the information we need in another dict, mapping count to average price. This is a good use case for a dict comprehension:
{count: item['total_price'] / item['total_items']
 for count, item in items_by_count.items()}
This iterates over the items_by_count dict and creates the new dict that we want.
Putting it all together:
from collections import defaultdict

def get_average_price(item_list):
    items_by_count = defaultdict(lambda: {
        'total_items': 0,
        'total_price': 0
    })
    for item in item_list:
        count, price = item['count'], item['price']
        items_by_count[count]['total_items'] += 1
        items_by_count[count]['total_price'] += price
    return {count: item['total_price'] / item['total_items']
            for count, item in items_by_count.items()}
If we pass in our example input list, this function returns:
{3.5: 1625, 2: 1750, 3: 400}
Which is hopefully the output you want! Be cautious of gotchas like float division in your particular Python version.
You need to iterate over your items.
See the documentation.
sum(dictionary.values()) / len(dictionary) is probably what you want (Python has no built-in avg).

Find last occurrence of any item of one list in another list in python

I have the following 2 lists in python:
ll = [500,500,500,501,500,502,500]
mm = [499,501,502]
I want to find the position of the last occurrence, in ll, of any item in mm. I can do this for a single element like this:
len(ll) - 1 - ll[::-1].index(502)
>> 5
Here ll[::-1].index(502) gives the position of the last occurrence of 502 counting from the end, and len(ll) - 1 converts that back into an index from the start.
How do I extend this to work for the entire list mm? I know I can write a function, but is there a more Pythonic way?
If you want all the last indices of each item in ll present in mm, then:
ll = [500,500,500,501,500,502,500]
mm = [499,501,502]
d = {v:k for k,v in enumerate(ll) if v in mm}
# {501: 3, 502: 5}
It's probably worth creating a set from mm first to make it an O(1) lookup, instead of O(N), but for three items, it's really not worth it.
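If mm were large, the set version would look like this (same dict comprehension, just an O(1) membership test):
mm_set = set(mm)
d = {v: k for k, v in enumerate(ll) if v in mm_set}
# {501: 3, 502: 5}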
Following @Apero's concerns about retaining missing indices as None, and also using a hash lookup to keep it O(1)...
# Build a key->None dict for all `mm`
d = dict.fromkeys(mm)
# Update `None` values with last index using a gen-exp instead of dict-comp
d.update((v,k) for k,v in enumerate(ll) if v in d)
# {499: None, 501: 3, 502: 5}
results = {}
reversed_ll = ll[::-1]  # renamed so it doesn't shadow the built-in reversed()
for item in mm:
    try:
        index = (len(ll) - 1) - reversed_ll.index(item)
    except ValueError:
        index = None
    finally:
        results[item] = index
print(results)
Output:
{499: None, 501: 3, 502: 5}
You can do this with a list comprehension and the max function. In particular, for each element n in ll, iterate through mm to create a list of indices where mm[i] == n, and take the max of this list.
>>> indices = [ max([-1] + [i for i, m in enumerate(mm) if m == n]) for n in ll ]
>>> indices
[-1, -1, -1, 1, -1, 2, -1]
Note that you need to add [-1] in case there are no matching indices in mm.
All this being said, I don't think it's more Pythonic than writing a function and using a list comprehension (as you alluded to in your question).
